Center for Analysis and Design of Intelligent Agents

T-720-ATAI-2016

Lecture Notes F-8 05.02.2016

Evaluation of Intelligent Systems

Sources of Evaluation Methods

Psychology	Uses tests based on a single measure at a single point in time. Produces a single “IQ” score.
	Method	A set of candidate test items is administered to a sample pool of people of various ages, and each item is measured on its ability to distinguish the test takers from each other (diversity). The subset of items with the “largest discriminatory power” is selected, and scores are normalized for age groups.
	Pros	Well established method for human intelligence.
	Cons	Present and future AI systems are very different from human intelligence. Worse, the normalization used in standard human psychometrics isn't possible for AIs, because AIs are unlikely to come in populations of similar systems. Even if they did, these methods only provide relative measurements. Another serious problem is that they rely heavily on a subject's prior knowledge and training.
AI	Board games, robo-football, a handful of toy problems (e.g. mountain car, diving for gold).
	Method	Standard board games that humans play are used unmodified or in simplified versions to distinguish between the best AI systems capable of playing these board games.
	Pros	Simple tests with a single measure provide unequivocal scores that can be compared. Relatively easy to implement and administer.
	Cons	A single dimension is too simplistic a measure of intelligence and is subject to the same problems as IQ tests. Each system in the first 40 years of AI could play only a single board game (the General Game Playing Competition was intended to address this limitation).
AGI	Turing Test, Piaget-MacGyver Room, Lovelace Test, Toy-Box Problem
	Method	Human-like conditions extended to apply to intelligent machines.
	Pros	Better than single-measure methods in many ways.
	Cons	Measure intelligence at a single point in time. Many are difficult to implement and administer.
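
The psychometric method in the table above (select the most discriminating items, then normalize by age group) can be sketched roughly as follows. This is a minimal illustration, not the procedure of any particular IQ test: the upper-lower discrimination index, the function names, and the 100 + 15z scale are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def discrimination_index(responses, item):
    """Upper-lower index (an assumed, textbook-style measure): share of
    correct answers in the top-scoring half minus the bottom-scoring half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    top, bottom = ranked[:half], ranked[-half:]
    return mean(r[item] for r in top) - mean(r[item] for r in bottom)

def select_items(responses, k):
    """Keep the k items with the largest discriminatory power."""
    n_items = len(responses[0])
    ranked = sorted(range(n_items),
                    key=lambda i: discrimination_index(responses, i),
                    reverse=True)
    return sorted(ranked[:k])

def normalized_scores(raw_scores, ages):
    """Standardize each raw score within its own age group (100 + 15 z)."""
    groups = {}
    for score, age in zip(raw_scores, ages):
        groups.setdefault(age, []).append(score)
    out = []
    for score, age in zip(raw_scores, ages):
        mu, sd = mean(groups[age]), pstdev(groups[age])
        out.append(100.0 if sd == 0 else 100.0 + 15.0 * (score - mu) / sd)
    return out

# Rows are people, columns are items (1 = correct, 0 = incorrect).
responses = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0]]
best_items = select_items(responses, 1)
scores = normalized_scores([10, 20, 30, 40], [5, 5, 9, 9])
```

In this toy data, raw scores of 10 and 30 receive the same normalized score because each is judged only against its own age group. That is the sense in which such measurements are relative, and why they presuppose a population of comparable test takers.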

Turing Test

What it is	A test for intelligence proposed by Alan Turing in 1950.
Why it's relevant	The first proposal for how to evaluate an intelligent machine. Proposed as a way to get a pragmatic/working definition of the concept of intelligence.
Method	It is played with three people, a man (A), a woman (B), and an interrogator (C) who may be of either sex. The interrogator stays in a room apart from the other two. The object of the game for the interrogator is to determine which of the other two is the man and which is the woman. He knows them by labels X and Y, and at the end of the game he says either “X is A and Y is B” or “X is B and Y is A.” We now ask the question, “What will happen when a machine takes the part of A in this game?”
Pros	It is difficult to imagine that an honest, collaborative machine playing this game for several days or months could ever fool a human into thinking it was a grown human unless it really understood a great deal.
Cons	Targets evaluation at a single point in time. Anchored in human language, social convention and dialogue.
Implementations	The Loebner Prize competition has been running for some decades, offering a financial prize for the first machine to “pass the Turing Test”. None of the competing machines has thus far offered any significant advances in the field of AI, and most certainly not to AGI.
Bottom Line	“It's important to note that Turing never meant for his test to be the official benchmark as to whether a machine or computer program can actually think like a human” (Mark Riedl)
Paper	Loebner prize article by Charlie Moloney
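
One way to read the outcome of the imitation game described above is to compare how often the interrogator decides wrongly in ordinary man/woman games with how often that happens once a machine plays A. The sketch below assumes that reading. The function names and the `tolerance` parameter are hypothetical and not part of Turing's proposal.

```python
def wrong_rate(verdicts):
    """Fraction of games in which the interrogator's identification was wrong.
    Each verdict is True if the interrogator identified X and Y correctly."""
    return sum(1 for correct in verdicts if not correct) / len(verdicts)

def machine_does_as_well(baseline_verdicts, machine_verdicts, tolerance=0.05):
    """True if the interrogator is fooled about as often with a machine as A
    as in the baseline man/woman games (tolerance is an assumed margin)."""
    return wrong_rate(machine_verdicts) >= wrong_rate(baseline_verdicts) - tolerance

# Ten baseline games (3 wrong identifications) and ten games with a machine as A.
baseline = [True] * 7 + [False] * 3
weak_machine = [True] * 9 + [False] * 1
strong_machine = [True] * 7 + [False] * 3
```

With these made-up numbers, `weak_machine` fails the comparison and `strong_machine` meets it. Note that the result is still a single verdict at a single point in time.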

Piaget-MacGyver Room

What it is	[W]e define a room, the Piaget-MacGyver Room (PMR), which is such that, an [information-processing] artifact can credibly be classified as general-intelligent if and only if it can succeed on any test constructed from the ingredients in this room. No advance notice is given to the engineers of the artifact in question as to what the test is going to be.
Why it's relevant	One of the first attempts at explicitly getting away from a specific test or test suite for testing intelligence.
REF	Bringsjord & Licato

The Toy Box Problem

What it is	A proposal for evaluating the intelligence of an agent.
Why it's relevant	One of several novel methods proposed for this purpose; focuses on variety, novelty and exploration.
Method	A robot is given a box of previously unseen toys. The toys vary in shape, appearance and construction materials. Some toys may be entirely unique, some toys may be identical, and yet other toys may share certain characteristics (such as shape or construction materials). The robot has an opportunity to first play and experiment with the toys, but is subsequently tested on its knowledge of the toys. It must predict the responses of new interactions with toys, and the likely behavior of previously unseen toys made from similar materials or of similar shape or appearance. Furthermore, should the toy box be emptied onto the floor, it must also be able to generate an appropriate sequence of actions to return the toys to the box without causing damage to any toys (or itself).
Pros	Includes perception and action explicitly. Specifically designed as a stepping stone towards general intelligence; a solution to the simplest instances should not require universal or human-like intelligence.
Cons	Limited to a single instance in time. Focuses too narrowly on vision-guided dexterity, missing out on reasoning, creativity, and many other factors.
REF	Johnston

Lovelace Test 2.0

What it is	A proposal for how to evaluate creativity.
Why it's relevant	The only test focusing explicitly on creativity.
Method	Artificial agent <m>a</m> is challenged as follows: 1. <m>a</m> must create an artifact o of type t; 2. o must conform to a set of constraints C where <m>c_i</m> ∈ C is any criterion expressible in natural language; 3. a human evaluator h, having chosen t and C, is satisfied that o is a valid instance of t and meets C; and 4. a human referee r determines the combination of t and C to not be unrealistic for an average human.
Pros	Brings creativity to the forefront of intelligence testing.
Cons	Narrow focus on creativity. Too restricted to human experience and knowledge (the final condition judges the task against what is realistic for an average human).
REF	Riedl
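
The pass condition in the Method row above can be written down as a small predicate. This is only a sketch: the human evaluator h and human referee r are represented by caller-supplied functions, and the names `Challenge`, `passes_lovelace_2`, `evaluator_accepts` and `referee_accepts` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass(frozen=True)
class Challenge:
    artifact_type: str           # t, chosen by the evaluator
    constraints: Sequence[str]   # C, each c_i stated in natural language

def passes_lovelace_2(
    challenge: Challenge,
    artifact: str,                                       # o, produced by agent a
    evaluator_accepts: Callable[[Challenge, str], bool], # stands in for human h
    referee_accepts: Callable[[Challenge], bool],        # stands in for human r
) -> bool:
    # The referee must first judge t and C to be not unrealistic for an average human.
    if not referee_accepts(challenge):
        raise ValueError("challenge judged unrealistic for an average human")
    # The evaluator decides whether o is a valid instance of t and meets C.
    return bool(evaluator_accepts(challenge, artifact))

# Toy stand-ins for the two human judgments, for illustration only.
challenge = Challenge("story", ("cat", "moon"))
toy_evaluator = lambda c, o: all(word in o for word in c.constraints)
toy_referee = lambda c: len(c.constraints) <= 3
result = passes_lovelace_2(challenge, "a cat flew to the moon", toy_evaluator, toy_referee)
```

Both judgments are made by humans. This is why the test remains tied to human experience and knowledge.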

State of the Art

Summary	Practically all proposals to date for evaluating intelligence leave out major aspects of intelligence. Virtually no proposals exist for evaluation of knowledge transfer, attentional capabilities, knowledge acquisition, knowledge capacity, knowledge retention, multi-goal learning, social intelligence, creativity, reasoning, cognitive growth, and meta-learning / integrated cognitive control – all of which are quite likely vital to achieving general intelligence on par with humans.
What is needed	A theory of intelligence that allows us to construct adequate, thorough, and comprehensive tests of intelligence and intelligent behavior.
What can be done	In lieu of such a theory (which is still not forthcoming after over 100 years of psychology and 60 years of AI) we could use a multi-dimensional “Lego” kit for exploring various means of measuring intelligence and intelligent performance, so as to be able to evaluate the pros and cons of various approaches, methods, scales, etc.
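
As a rough illustration of what "multi-dimensional" could mean in practice, the sketch below keeps one score per dimension instead of a single number and reports which dimensions were never tested. It is not a design from these notes: the dimension names are a subset of those listed in the Summary, and the functions `profile`, `untested` and `collapse` are hypothetical.

```python
from statistics import mean

# A few of the dimensions named in the Summary above (subset chosen for brevity).
DIMENSIONS = ("knowledge transfer", "knowledge retention",
              "creativity", "reasoning", "meta-learning")

def profile(results):
    """Map each dimension to its mean score in [0, 1], or None if untested."""
    return {d: (mean(results[d]) if results.get(d) else None) for d in DIMENSIONS}

def untested(prof):
    """Dimensions for which no measurement exists."""
    return [d for d, score in prof.items() if score is None]

def collapse(prof):
    """Single-number summary over tested dimensions (what a one-measure test reports)."""
    tested = [score for score in prof.values() if score is not None]
    return mean(tested) if tested else None

agent_a = profile({"reasoning": [1.0, 1.0], "creativity": [0.0]})
agent_b = profile({"reasoning": [0.5], "creativity": [0.5]})
```

Here `collapse` gives both agents 0.5, although their profiles differ and three dimensions remain untested for each. A single score hides exactly the information a multi-dimensional kit is meant to expose.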
