====The Challenge of Evaluating Intelligence====
| \\ Without Evaluation... | ...there can be no comparison. \\ Without comparison there can be no indication of direction. \\ Without direction there can be no systematic scientific (or otherwise) effort to deepen understanding. |
| Without Definition... | ...there can be no evaluation. |
| The Challenge | 'Intelligence' is an ill-defined concept. |
====What Are We Trying to Evaluate?====
| Proposed Definitions | "Intelligence" as a concept must be broken into smaller parts. \\ "Adaptation" seems too broad. \\ "Behavior" is difficult to measure unless it's codified in domain-dependent methods (e.g. verbal, motor, ...). |
| \\ Alternatives | What if we could avoid definitions? Competitions (e.g. games, robo-football, specific single-goal tasks) have been proposed in their place. \\ Turing proposed the 'imitation game' ("Turing Test") as a placeholder for a definitive definition (the Turing Test is most correctly seen as a working definition). |
| \\ Shortcomings | Mostly single-goal (the physical world requires multiple simultaneous goals). \\ Mostly easily measurable goals (the PW often has ill-defined goals). \\ Mostly toy-like (no noise; the PW has lots of noise). \\ Mostly limited-count variables (the PW has an infinite number of variables). |
| Current Status | Scientists are still working out how to properly measure learning and intelligence. |
====The Turing Test====
| What it is | A test for intelligence proposed by Alan Turing in 1950. |
| Why it's relevant | Proposed as a way to get a pragmatic/working definition of the //concept of intelligence//. \\ The first proposal for how to evaluate an intelligent machine. |
| \\ Method | It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either "X is A and Y is B" or "X is B and Y is A." We now ask the question, "What will happen when a machine takes the part of A in this game?" |
| Pros | It is difficult to imagine that an honest, collaborative machine playing this game for several days or months could fool a human into thinking it was a grown human unless it really understood a great deal. |
| Cons | Targets evaluation at a single point in time. \\ Anchored in human language, social convention and dialogue. |
| \\ Implementations | The Loebner Prize competition has been running for some decades, offering a large financial prize for the first machine to "pass the Turing Test". None of the competing machines has thus far offered any significant advances in the field of AI, and most certainly not towards AGI. |
| Bottom Line | //"It's important to note that Turing never meant for his test to be the official benchmark as to whether a machine or computer program can actually think like a human"// (Mark Riedl). |
====The Piaget-MacGyver Room====
| Short Description | [W]e define a room, the Piaget-MacGyver Room (PMR), which is such that, an [information-processing] artifact can credibly be classified as general-intelligent if and only if it can succeed on any test constructed from the ingredients in this room. No advance notice is given to the engineers of the artifact in question as to what the test is going to be. |
| Why it's relevant | One of the first attempts at explicitly getting away from a specific test or test suite for testing intelligence. |
| \\ Pros | Being very open-ended, the evaluation method prevents specific targeted skills from being pre-built into the AI to be evaluated. \\ Targeting the physical world means perception must be integrated into the cognition. \\ Could also be constructed virtually. |
| \\ Cons | Perhaps too open-ended. \\ Leaves almost everything undefined. \\ Requires further gradients on the "level of difficulty" to be provided by the evaluators. |
| REF | [[http://kryten.mm.rpi.edu/Bringsjord_Licato_PAGI_071512.pdf|Bringsjord & Licato]] |
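The "no advance notice" property of the PMR can be pictured as drawing a test at random from the room's ingredients only after the artifact has been submitted. The sketch below is purely illustrative: the ingredient list, the test structure, and all names are invented here and are not part of the Bringsjord & Licato proposal.

```python
# Illustration of PMR-style test construction: a test is assembled at
# random from room ingredients only after the artifact is delivered.
# Ingredient names and test structure are invented for this sketch.

import random

INGREDIENTS = ["lever", "pulley", "ramp", "string", "magnet", "mirror"]

def construct_test(seed):
    """Build a test the artifact's engineers could not have anticipated:
    the ingredient subset and goal are fixed only by a seed chosen
    after submission."""
    rng = random.Random(seed)
    chosen = rng.sample(INGREDIENTS, k=3)
    goal = rng.choice(["lift", "move", "balance"])
    return {"ingredients": chosen, "goal": goal}

test = construct_test(seed=42)
print(test["goal"], test["ingredients"])
```

Because the seed is chosen post hoc, pre-building targeted skills into the artifact offers no advantage, which is exactly the point of the room.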
====The Toy Box Problem====
| What it is | A proposal for evaluating the intelligence of an agent. |
| Short Description | Based on a box filled with toys of various kinds that will be the subject of evaluation, either directly or in reference to new unseen objects that only bear a resemblance to them. |
| Why it's relevant | One of several novel methods proposed for this purpose; focuses on variety, novelty and exploration. |
| \\ Method | A robot is given a box of previously unseen toys. The toys vary in shape, appearance and construction materials. Some toys may be entirely unique, some toys may be identical, and yet other toys may share certain characteristics (such as shape or construction materials). The robot has an opportunity to first play and experiment with the toys, but is subsequently tested on its knowledge of the toys. It must predict the responses of new interactions with toys, and the likely behavior of previously unseen toys made from similar materials or of similar shape or appearance. Furthermore, should the toy box be emptied onto the floor, it must also be able to generate an appropriate sequence of actions to return the toys to the box without causing damage to any toys (or itself). |
| Pros | Includes perception and action explicitly. \\ Specifically designed as a stepping stone towards general intelligence; a solution to the simplest instances should not require universal or human-like intelligence. |
| Cons | Limited to a single instance in time. \\ Somewhat too limited to dexterity guided by vision, missing out on reasoning, creativity, and many other factors. |
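The prediction-by-similarity aspect of the Toy Box Problem can be sketched in a few lines. Everything below is an illustrative assumption on my part, not part of the original proposal: the class names, the attribute set, and the simple shared-characteristics similarity rule.

```python
# Minimal sketch of the Toy Box Problem's prediction-by-similarity aspect.
# Class names, attributes, and the similarity rule are illustrative only.

from dataclasses import dataclass

@dataclass(frozen=True)
class Toy:
    shape: str
    material: str
    behavior: str  # observed response when interacted with

def predict_behavior(known_toys, shape, material):
    """Predict an unseen toy's behavior from the most similar known toy.

    Similarity here is simply the count of shared characteristics
    (shape, material), echoing the problem's notion that toys may
    share shape or construction materials."""
    def similarity(t):
        return (t.shape == shape) + (t.material == material)
    best = max(known_toys, key=similarity)
    return best.behavior if similarity(best) > 0 else "unknown"

known = [
    Toy("ball", "rubber", "bounces"),
    Toy("cube", "wood", "slides"),
]

# An unseen plastic ball shares its shape with the known rubber ball:
print(predict_behavior(known, "ball", "plastic"))  # → bounces
```

A real solution would of course need perception, manipulation, and far richer generalization; the sketch only shows the shape of the prediction task.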
====Lovelace Test 2.0====
| What it is | A proposal for how to evaluate creativity. |
| Short Description | Replacing that which is to be tested - intelligence - with the related concept of creativity. |
| Why it's relevant | The only test focusing explicitly on creativity. |
| \\ Method | Artificial agent <m>a</m> is challenged as follows: \\ <m>a</m> must create an artifact <m>o</m> of type <m>t</m>; \\ <m>o</m> must conform to a set of constraints <m>C</m> where <m>c_i~in~C</m> is any criterion expressible in natural language; \\ a human evaluator <m>h</m>, having chosen <m>t</m> and <m>C</m>, is satisfied that <m>o</m> is a valid instance of <m>t</m> and meets <m>C</m>; and \\ a human referee <m>r</m> determines the combination of <m>t</m> and <m>C</m> to not be unrealistic for an average human. |
| Pros | Brings creativity to the forefront of intelligence testing. |
| Cons | Narrow focus on creativity. \\ Too restricted to human experience and knowledge (last point). |
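The four conditions above can be written down as a single pass/fail check. In this sketch the function and parameter names are my own, and the human evaluator <m>h</m> and referee <m>r</m> are modeled as boolean callables supplied by the tester; nothing here is prescribed by the Lovelace Test 2.0 itself beyond the ordering of the conditions.

```python
# Sketch of the Lovelace Test 2.0 pass condition (names are illustrative).
# The human evaluator h and referee r are modeled as boolean callables.

def lovelace_2_0_passes(artifact, artifact_type, constraints,
                        evaluator_accepts, referee_accepts):
    """Return True iff the challenge conditions hold:

    - the referee r judges the (t, C) combination not unrealistic
      for an average human, and
    - the evaluator h, who chose t and C, accepts the artifact o as
      a valid instance of t that meets every criterion c_i in C."""
    if not referee_accepts(artifact_type, constraints):
        return False  # (t, C) deemed unrealistic for an average human
    return evaluator_accepts(artifact, artifact_type, constraints)

# Toy usage: a "story" artifact judged against one natural-language
# constraint, with trivially mechanical stand-ins for h and r.
passed = lovelace_2_0_passes(
    artifact="Once upon a time...",
    artifact_type="story",
    constraints=["mentions time"],
    evaluator_accepts=lambda o, t, C: all(c.split()[-1] in o for c in C),
    referee_accepts=lambda t, C: len(C) > 0,
)
print(passed)  # → True
```

The point of the structure, not the toy judges, is what matters: the test bottoms out in human judgment calls (<m>h</m> and <m>r</m>), which is both its strength and the source of the "too restricted to human experience" criticism.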
====Requirements for Evaluation: Settings That Must Be Obtainable====
| Complexity | Environment is complex with diverse interacting objects. |
| Dynamicity | Environment is dynamic. |
| Regularity | Task-relevant regularities exist at multiple time scales. |
| Task Diversity | Tasks can be complex, diverse, and novel. |
| Interactions | Agent/environment/task interactions are complex and limited. |
| Computational limitations | Agent computational resources are limited. |
| Persistence | Agent existence is long-term and continual. |
| REF | [[http://www.atlantis-press.com/php/download_paper.php?id=1900|Laird et al.]] |
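The table above can be treated as a checklist when vetting a candidate evaluation environment. The class below is a hypothetical illustration of such a checklist; the field names are paraphrases of the table rows, not terminology from Laird et al.

```python
# Hypothetical checklist for vetting an evaluation environment against
# the requirements table above; field names paraphrase the rows.

from dataclasses import dataclass, fields

@dataclass
class EvaluationSettings:
    complex_diverse_objects: bool        # Complexity
    dynamic_environment: bool            # Dynamicity
    multi_timescale_regularities: bool   # Regularity
    complex_diverse_novel_tasks: bool    # Task Diversity
    complex_limited_interactions: bool   # Interactions
    limited_agent_resources: bool        # Computational limitations
    long_term_continual_existence: bool  # Persistence

    def missing(self):
        """Names of requirements the candidate environment fails to meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

# A typical single-episode arcade benchmark, for instance, tends to fail
# on dynamism, task diversity, and long-term persistence:
arcade = EvaluationSettings(True, False, True, False, True, True, False)
print(arcade.missing())
```

Such a checklist makes explicit why most competition-style benchmarks fall short: they usually satisfy only a strict subset of the rows.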
====Example Frameworks for Evaluating AI Systems====
| \\ \\ Merlin | A significant problem facing researchers in reinforcement and multi-objective learning is the lack of good benchmarks. Merlin (for Multi-objective Environments for Reinforcement LearnINg) is a software tool and method for creating random problem instances, including multi-objective learning problems, with specific structural properties. Merlin provides the ability to control task features in predictable ways, allowing researchers to build a more detailed understanding of which features of a problem interact with a given learning algorithm, improving or degrading its performance. | [[http://alumni.media.mit.edu/~kris/ftp/Tunable-generic-Garrett-etal-2014.pdf|Paper]] by Garrett et al. |
| AI Gym | Gym is a toolkit developed by OpenAI for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. | [[https://gym.openai.com|Link]] to website. |
| \\ SAGE | A framework that allows modular construction of simulated physical task-environments for evaluating intelligent control systems. The proto-task theory on which the framework is built aims for a deeper understanding of tasks in general, with a future goal of providing a theoretical foundation for all resource-bounded real-world tasks. Tasks constructed in the framework can be rooted in physics, to varying desired degrees, allowing their execution to analyze the performance of control systems in terms of expended time and energy. SAGE is intended for evaluating both narrow AI and AGI systems on numerous easily-constructed tasks. | \\ [[http://alumni.media.mit.edu/~kris/ftp/SAGE-EberdingEtAl-AGI-2020.pdf|Paper]] by Eberding et al. |
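Frameworks like Gym share a common agent-environment interaction protocol: the environment exposes something like reset() and step(action), and each step returns an observation, a reward, and a done flag. The sketch below mimics that shape with a trivial hand-rolled environment; CountdownEnv is a made-up stand-in, not the actual Gym API, and real Gym environments return richer step results.

```python
# Sketch of the reset/step interaction loop common to RL evaluation
# frameworks such as Gym. CountdownEnv is a made-up stand-in environment,
# not part of any real framework.

import random

class CountdownEnv:
    """Reward the agent once per step until the counter runs out."""
    def reset(self):
        self.remaining = 5
        return self.remaining  # initial observation

    def step(self, action):
        self.remaining -= 1
        obs = self.remaining
        reward = 1.0
        done = self.remaining == 0
        return obs, reward, done

env = CountdownEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])   # a random policy, as in many demos
    obs, reward, done = env.step(action)
    total_reward += reward
print(total_reward)  # → 5.0
```

Benchmarks built on this protocol make agents directly comparable, but note how few of the Laird et al. requirements the protocol by itself guarantees: persistence, dynamism, and task diversity all depend on the particular environment plugged in.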
====State of the Art====
| \\ Summary | Practically all proposals to date for evaluating intelligence leave out some important aspects of intelligence. Virtually no proposals exist for evaluating knowledge transfer, attentional capabilities, knowledge acquisition, knowledge capacity, knowledge retention, multi-goal learning, social intelligence, creativity, reasoning, cognitive growth, and meta-learning / integrated cognitive control -- all of which are quite likely vital to achieving general intelligence on par with humans. |
| What is needed | A theory of intelligence that allows us to construct adequate, thorough, and comprehensive tests of intelligence and intelligent behavior. |
| \\ What can be done | In lieu of such a theory (which is still not forthcoming after over 100 years of psychology and 60 years of AI), we could use a multi-dimensional "Lego" kit for exploring various means of measuring intelligence and intelligent performance, so as to be able to evaluate the pros and cons of various approaches, methods, scales, etc. \\ Some sort of kit meeting part or all of the requirements listed above would go a long way towards bridging the gap, and might generate ideas that could speed up theoretical development. |
\\