Thursday, January 12, 2017

Implicit association tests aren't great tests of racism

Perhaps no new concept from the world of academic psychology has taken hold of the public imagination more quickly and profoundly in the 21st century than implicit bias — that is, forms of bias which operate beyond the conscious awareness of individuals. That’s in large part due to the blockbuster success of the so-called implicit association test, which purports to offer a quick, easy way to measure how implicitly biased individual people are. ...

Since the IAT was first introduced almost 20 years ago, its architects, as well as the countless researchers and commentators who have enthusiastically embraced it, have offered it as a way to reveal to test-takers what amounts to a deep, dark secret about who they are: They may not feel racist, but in fact, the test shows that in a variety of intergroup settings, they will act racist. ...

A pile of scholarly work, some of it published in top psychology journals and most of it ignored by the media, suggests that the IAT falls far short of the quality-control standards normally expected of psychological instruments. The IAT, this research suggests, is a noisy, unreliable measure that correlates far too weakly with any real-world outcomes to be used to predict individuals’ behavior — even the test’s creators have now admitted as much. ...

Take the concept of test-retest reliability, which measures the extent to which a given instrument will produce similar results if you take it, wait a bit, and then take it again. ... test-retest reliability is often one of the first things a psychologist will look for when deciding whether to use a given tool.

Test-retest reliability is expressed with a variable known as r, which ranges from 0 to 1. To gloss over some of the gory statistical details, r = 1 means that if a given test is administered multiple times to the same group of people, it will rank them in exactly the same order every time. Hypothetically, if the IAT had a test-retest reliability of r = 1, and you administered the test to ten people over and over and over, they’d be placed in the same order, least to most implicitly biased, every time. At the other end of the spectrum, when r = 0, that means the ranking shifts every time the test is administered, completely at random. ...
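To make the idea concrete, here's a minimal sketch of what a test-retest correlation computes. The scores below are invented for illustration — they aren't real IAT data — and Pearson's r is used as a simple stand-in for the reliability coefficients the article discusses.

```python
# Illustrative sketch only: the "IAT scores" below are made-up numbers.
import statistics

def pearson_r(x, y):
    """Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Ten hypothetical test-takers, scored in two sessions two weeks apart.
session_1 = [0.12, 0.45, 0.30, 0.80, 0.55, 0.20, 0.65, 0.40, 0.70, 0.25]
session_2 = [0.35, 0.50, 0.15, 0.60, 0.75, 0.45, 0.30, 0.55, 0.80, 0.40]

# A test perfectly consistent with itself: r = 1, identical rankings.
print(round(pearson_r(session_1, session_1), 2))
# A noisy retest: r well below the r = .8 researchers usually want.
print(round(pearson_r(session_1, session_2), 2))
```

With these particular made-up numbers the retest correlation lands around the mid-.5 range — in the same ballpark as the published race-IAT reliabilities the article goes on to cite.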

What constitutes an acceptable level of test-retest reliability? It depends a lot on context, but, generally speaking, researchers are comfortable if a given instrument hits r = .8 or so. ...

In a 2007 chapter on the IAT, for example, Kristin Lane, Banaji, Nosek, and Greenwald included a table (Table 3.2) running down the test-retest reliabilities for the race IAT that had been published to that point: r = .32 in a study consisting of four race IAT sessions conducted with two weeks between each; r = .65 in a study in which two tests were conducted 24 hours apart; and r = .39 in a study in which the two tests were conducted during the same session (but in which one used names and the other used pictures)...

...the most IAT-friendly numbers, published in a 2009 meta-analysis lead-authored by Greenwald, which found fairly unimpressive correlations (race IAT scores accounted for about 5.5 percent of the variation in discriminatory behavior in lab settings, and other intergroup IAT scores accounted for about 4 percent of the variance in discriminatory behavior in lab settings), were based on some fairly questionable methodological decisions on the part of the authors. ... the Greenwald team took a questionable approach to handling so-called ironic IAT effects, or published findings in which high IAT scores correlated with better behavior toward out-group than in-group members, the theory being that implicitly biased individuals were overcompensating. Greenwald and his team counted both ironic and standard effects as evidence of a meaningful IAT–behavior correlation, which, in effect, allowed the IAT to double-dip at the validity bowl: Unless the story being told is extremely pretzel-like, it can’t be true that high IAT scores predict both better and worse behavior toward members of minority groups. ...
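The double-dipping criticism can be caricatured with a toy calculation. This is a deliberate simplification of how a real meta-analysis aggregates effects, and the study-level correlations below are invented: the point is only that treating "ironic" (negative) correlations as if they were positive evidence inflates the pooled estimate.

```python
# Simplified caricature, not the 2009 meta-analysis's actual method.
# Signed IAT-behavior correlations from hypothetical studies; negative
# values are "ironic" effects (higher IAT score, *better* out-group behavior).
study_rs = [0.30, 0.25, -0.20, 0.15, -0.25, 0.20]

# Taken at face value, ironic effects partly cancel the standard ones.
signed_mean = sum(study_rs) / len(study_rs)            # 0.075
# Counting ironic effects as support (flipping their sign) inflates r.
absolute_mean = sum(abs(r) for r in study_rs) / len(study_rs)  # 0.225

print(f"signed:   r = {signed_mean:.3f}, variance explained = {signed_mean**2:.1%}")
print(f"absolute: r = {absolute_mean:.3f}, variance explained = {absolute_mean**2:.1%}")
```

In this toy case the signed pooling explains well under 1 percent of variance while the sign-flipped pooling explains about 5 percent — roughly the gap between the skeptical estimates and the IAT-friendly ones the article contrasts.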

One important upcoming meta-analysis, which we’ll return to later in another context, found that such scores can explain less than one percent of the variance observed in discriminatory behavior. The researchers Rickard Carlsson and Jens Agerström, in a meta-analysis of their own published in the Scandinavian Journal of Psychology last year, pinned the figure at about 2 percent — but argued that the extant research is of such low statistical quality it’s impossible to draw any meaningful conclusions from it.
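For readers more used to correlation coefficients than "percent of variance explained," the two are linked by r² — variance explained is the square of the correlation. A quick sketch converting the figures cited above back into r:

```python
# Convert "percent of variance explained" back to a correlation (r = sqrt(r^2)).
import math

for variance_pct in (5.5, 2.0, 1.0):
    r = math.sqrt(variance_pct / 100)
    print(f"{variance_pct:>4}% of variance -> r = {r:.2f}")
```

So the 5.5 percent figure corresponds to a correlation of roughly r = .23, the 2 percent figure to about r = .14, and the sub-1-percent figure to r below .10 — small effects by any conventional benchmark.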
--Jesse Singal, New York, on reasons to stop assessing yourself with IAT tests. HT: NS