Brendan just sent me a link to this fresh-off-the-press ICML paper:

- Raykar, Vikas C., Shipeng Yu, Linda H. Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. 2009. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In *ICML*.

The scientific zeitgeist says to assess inter-annotator agreement, infer gold standards, and use them to train classifiers.

Raykar et al. use EM for a binomial model of annotator sensitivity and specificity (like Dawid and Skene’s original multinomial approach from the 1970s paper and the Snow et al. EMNLP paper). My experiments showed full Bayesian models slightly outperform EM, which slightly outperforms naive voting (the effects are stronger with fewer annotators).
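For concreteness, here’s a minimal EM sketch for the two-category case, in the style of Dawid and Skene. The function name, initialization, and iteration count are my own, not taken from any of the cited papers:

```python
import numpy as np

def em_annotator_model(labels, n_iter=50, eps=1e-6):
    """EM for a two-category annotator model a la Dawid and Skene.
    labels: (items x annotators) matrix of 0/1 annotations.
    Returns P(cat=1) per item, plus per-annotator sensitivity/specificity."""
    p = labels.mean(axis=1)  # initialize with per-item vote proportions
    for _ in range(n_iter):
        pi = p.mean()  # prevalence of category 1
        # M-step: sensitivity = accuracy on positives, specificity = on negatives
        sens = np.clip((p @ labels) / p.sum(), eps, 1 - eps)
        spec = np.clip(((1 - p) @ (1 - labels)) / (1 - p).sum(), eps, 1 - eps)
        # E-step: posterior P(cat=1 | annotations) under conditional independence
        ll1 = np.log(pi) + labels @ np.log(sens) + (1 - labels) @ np.log(1 - sens)
        ll0 = np.log(1 - pi) + (1 - labels) @ np.log(spec) + labels @ np.log(1 - spec)
        p = 1.0 / (1.0 + np.exp(ll0 - ll1))
    return p, sens, spec
```

Initializing with vote proportions keeps the solution aligned with the annotators being better than chance, which is exactly the identifiability assumption discussed in the postscript below.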

The obvious thing to do is to take the output of the gold standard inference and use that to train a classifier. With EM, you can use the MAP estimate of category likelihoods (a fuzzy gold standard); with Bayesian models, you can sample from the posterior, which provides more dispersion. Smyth et al.’s 1995 NIPS paper showed EM-style training was effective for simulations.

I was just in San Francisco presenting this work to the Mechanical Turk Meetup, and Jenny Finkel opined that fuzzy training wouldn’t work well in practice. Even taking the discussion offline, I’m still not sure why she thinks that [update: see her comments below]. In some ways, if we use the fuzzy truth as the gold standard, then using it to train should perform better than quantizing the gold standard to 0/1. There’s not a problem with convexity; we just impute a big data set with Gibbs sampling and train on that. We could even train up an SVM or naive Bayes system that way.
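Concretely, here’s the kind of imputation I have in mind (function and variable names are made up): replicate the corpus, sampling each item’s label from its posterior category probability, and hand the result to any off-the-shelf trainer:

```python
import numpy as np

def impute_training_set(X, post, n_samples=100, seed=0):
    """Replicate the corpus n_samples times, drawing each item's 0/1 label
    from post, its posterior probability of being in category 1."""
    rng = np.random.default_rng(seed)
    y = (rng.random((n_samples, X.shape[0])) < post).astype(int)
    X_big = np.tile(X, (n_samples, 1))  # one copy of the features per draw
    return X_big, y.ravel()
```

An SVM or naive Bayes trainer only ever sees hard 0/1 labels, but each item shows up positive in roughly its posterior proportion of the copies.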

The interesting twist in the Raykar et al. paper is to jointly estimate a logistic regression classifier along with the gold standard. That is, throw the regression coefficients into the model and estimate them along with everything else. That’s the same linkage as I suggested above. But Raykar et al. go further — they let the trained model vote on the gold standard just like another annotator.

Even though the annotation model corrects for individual annotator bias (or in this case, the logistic regression classifier’s bias as estimated), each annotator still affects the overall model through its bias-adjusted vote (if it didn’t, you couldn’t get off the ground at all). If you evaluate the classifier on a “gold standard” which was voted upon by a committee including the classifier itself, the classifier should perform better because it’s getting a vote on the truth!

The right question is whether Raykar et al.’s jointly estimated classifiers are “better” in some sense than ones trained on the imputed gold standard. For that, I’d think we’d need some kind of held-out eval, but that begs the question on inferring the gold standard. The gold standards behind Snow et al.’s work weren’t that pure after all (I have some commentary on discrepancies in the paper cited below).

I have considered using the trained classifier as another annotator when doing active learning of the kind proposed in Sheng et al.’s 2008 KDD paper on getting another label for an existing item vs. annotating a new item. In fact, there’s no reason in principle why you can’t have more than one classifier being trained along with annotator sensitivities and specificities.

Another nice idea in the Raykar et al. paper is the use of simulation from a known gold standard to create a fuzzy gold standard. That’s still questionable, in that it’s generating fake data that are known to follow the model. But everyone should do this in every way possible for all parts of their models, so you can bet I’ll be saving this one for my bag of tricks.
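The simulation setup is simple to reproduce in outline; the annotator parameters here are my own assumptions, not values from the paper:

```python
import numpy as np

def simulate_annotations(truth, sens, spec, seed=0):
    """Simulate 0/1 annotations given true categories and per-annotator
    sensitivity (accuracy on positives) and specificity (on negatives)."""
    sens, spec = np.asarray(sens), np.asarray(spec)
    rng = np.random.default_rng(seed)
    u = rng.random((len(truth), len(sens)))
    # positives labeled 1 with prob sens[j]; negatives with prob 1 - spec[j]
    p_one = np.where(np.asarray(truth)[:, None] == 1, sens, 1 - spec)
    return (u < p_one).astype(int)
```

Feeding the simulated annotations back through gold-standard inference and comparing against the known truth is exactly the kind of model-checking loop in question; the caveat stands that the fake data follow the model by construction.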

I’m a little unclear on why the left-hand plots in figures 1 and 2 don’t report the same AUC value for the proposed algorithm. Figure 2 actually does evaluate the gold-standard estimation followed by classifier estimation. If I’m reading that figure right, then training on the imputed gold standard didn’t do measurably better than the majority-voted baseline.

[Update with comment: The right-hand plot in figure 2 is of the inferred gold standard versus the “golden gold standard”. It’s possible to plot this because the inferred gold standard is actually a point probability estimate of the item being in category 1.]

If we’re lucky, Raykar et al. will share their data. [Update 2: no luck — the data belongs to Siemens.]

P.S. All of these models assume the annotators don’t actually lie. Specifically, in order for the models to be identifiable, we need to assume the annotators are not adversarial (that is, that they don’t know the right answer and intentionally lie, which would make them perform worse than chance). There was, to reinforce the zeitgeist, also a paper at ICML about mixing in adversarial coders: Dekel and Shamir’s *Good learners for evil teachers*.

June 16, 2009 at 7:12 pm |

I still don’t think fuzzy training would work well. The reason why I think it wouldn’t work well to incorporate the likelihood of each datum having each possible label into the objective function is that I tried it and it didn’t work. Or, to be clear, it worked slightly worse than just using the most likely label. Clearly, every setup is different, and my task was slightly different (I was attempting semi-supervised learning), so YMMV. As for the Gibbs sampling approach, this just strikes me as far too noisy. (In the limit, I don’t see how this is even different from modifying the objective, but I haven’t thought that all the way through.) I feel like training on samples is something that you do when you have no other choice. Noise sucks, and this seems noisier than simple voting, or other easy-to-imagine approaches where you estimate the goodness of each annotator and then weight their vote accordingly.

I know these reasons are not that concrete; like I said, I basically tried this and failed and then never thought about it again until your talk. But I wonder if it has something to do with the shape of the sigmoid. You can try to train it to learn that something is 0.7, but the curve is relatively flat there, so slight changes at test time will result in much larger changes in the likelihood in the middle portion than at the extremities. I think if you want to specifically model that something is 70% probable (to be some particular label), then maybe you want to change the objective function entirely. And of course I am making assumptions here that you are training a log-linear model, but that seems like a reasonable assumption.

(ps – it was nice to actually meet you!)

June 16, 2009 at 8:58 pm |

I can certainly understand an empirical objection. I should just stop talking about this and do an eval. I’m thinking everyone will object to using the imputed fuzzy gold standard as the evaluation data, though; the held-out data would just be more posterior samples.

With 0/1 loss, I may be setting myself up for a fall, but for total corpus log loss, logistic regression’s objective is exactly the right metric.

The posterior samples follow the posterior distribution, so they’re only as noisy as your uncertainty. My intuition then runs the opposite way. Which frightens me, because you have a lot more experience with these kinds of models than I do.

My intuition is that training on 0/1 outcome data tries too hard to force decisions about marginal cases one way or the other. If there isn’t posterior certainty, then training with 95 sampled copies where an example is positive and 5 where it’s negative (or even 50/50) performs a kind of regularization. In the 50/50 case, the optimal parameters for that example’s features are all zero, so training pushes the coefficient vector toward zero, which amounts to extra regularization. The logistic regression error function is the right one (at the combined corpus level) for fuzzy examples; for a 70/30 item, it penalizes any prediction other than 70%. 0/1 loss, by contrast, doesn’t penalize any prediction in (0.5,1] for a positive example and applies a full penalty of 1 to predictions in [0,0.5].
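The corpus-level claim is easy to check numerically: with 70 positive and 30 negative copies of an item, total log loss is minimized exactly at a prediction of 0.7. A small sketch (names are mine):

```python
import numpy as np

def corpus_log_loss(p, n_pos=70, n_neg=30):
    """Total log loss for predicting probability p on a corpus of
    n_pos positive and n_neg negative copies of the same item."""
    return -(n_pos * np.log(p) + n_neg * np.log(1 - p))

grid = np.linspace(0.01, 0.99, 981)  # step of 0.001
best = grid[np.argmin(corpus_log_loss(grid))]
```

The grid minimizer comes out at the empirical proportion 0.7, which is just the standard fact that log loss is a proper scoring rule.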

Of course, most “gold standards” fudge this problem by censoring uncertain training data.

In the Snow et al. recreate-the-linguistic-gold-standard task, averaging over the posterior samples did help a little bit. But that’s different in that it’s averaging over annotator accuracy estimates, not over the samples themselves. With more posterior samples, I could’ve just sampled the categories and gotten the same result.

You could swap out SVMs or other models — anything that can handle non-separable training data.

June 16, 2009 at 9:32 pm |

I definitely don’t disagree with your intuition behind sampling as a substitution for modeling the uncertainty. The problem that I envision is that if you think something is 95% likely to be X and 5% likely to be Y, you are not actually going to get 95 samples of X and 5 of Y. You’ll get something close to that, but not quite. It seems far better to have one datum with weight 0.95 that is labeled X and one with weight 0.05 that is labeled Y, but I’m guessing you can’t do that, or you would. So maybe the question comes down to how many samples you are generating, and I’m just being unnecessarily pessimistic. When you first described this I was picturing you trying to train on 100 copies of your dataset, but thinking about it a bit more, you could obviously just generate a ton of samples and then use the proportions to make a weighted training set. In which case my noise objections don’t hold. But then I’m not sure it’s any different from the modified objective, which I’ve established my doubts about already. But clearly, yeah, this is just an empirical question, and I would be happy to be proven wrong.
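The weighted-datum version is straightforward to set up. Here’s a sketch (function names are mine) that keeps one positive and one negative copy of each item, weighted by the posterior, and uses those weights in the log-loss gradient:

```python
import numpy as np

def weighted_logloss_grad(w, X, y, wt):
    """Gradient of weighted log loss for logistic regression:
    each example contributes wt[i] * (sigmoid(x.w) - y[i]) * x."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (wt * (p - y))

def fuzzy_to_weighted(X, post):
    """One positive and one negative copy per item, weighted by its posterior."""
    X2 = np.vstack([X, X])
    y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    wt = np.concatenate([post, 1 - post])
    return X2, y2, wt
```

On a single item with posterior 0.7, gradient descent on this weighted objective drives the predicted probability to exactly 0.7, the same fixed point the sampling scheme approaches in the limit.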

I want to elaborate my thoughts on logistic a little bit more. If you are trying to learn that something is 0.7, then the values of the weights are sort of being pushed from both sides: you don’t want them too big or too small. Also, because of how the curve looks, the interactions between the weights have to be very precise to actually get it to be 0.7. Now, at test time, you get a new datum and it’s gonna have some slightly different set of features present. Those features really only have to be slightly different for the likelihood (0.7) to change a lot. Then, take the 0/1 case. Here the weights are only really being pushed from one side: make it true or make it false. While the interactions between the features are certainly still there, because some will appear with both positive and negative examples, my hypothesis is that the interactions between the weights don’t need to be as precise. If you have some feature that is pretty indicative of true, you can just keep pushing its weight upwards (modulo regularization). In your case you are trying to get the likelihood for the datum to a very specific place, while in my case I just keep pushing it further and further in some direction. Then, I think if the features change slightly there is a lot more wiggle room for me to still get the right classification.

June 17, 2009 at 12:48 pm |

That’s exactly right w.r.t. posterior samples. I could use more than 100 samples for more accuracy; it’s actually practical to do so with SGD. In the limit, it’s the same as the modified objective.

I can compute a smoother category posterior directly from the annotator accuracy posteriors and prevalence (p(category)) posteriors. That’s what I did for the dentistry data (and I believe I took more than 100 samples, but I’d have to go look at the code).

The likelihoods will change a lot only if a feature with a very high coefficient changes. And the idea that you can just keep pushing a feature’s weight upwards goes against my experience with regularization, which almost always helps posterior predictions (though obviously not the fit to the training data).

And you’re right that 0.7 on the logistic scale is a much more specific place than 0 or 1, which are super broad on the logistic scale (on the high side, any linear predictor above 37 or so already evaluates to exactly 1.0 in double-precision arithmetic).
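A quick check of how the sigmoid saturates in double precision (the exact cutoff depends on how you compute it, but the positive side collapses to exactly 1.0 in the high 30s):

```python
import math

def sigmoid(x):
    """Inverse logit, computed naively in double precision."""
    return 1.0 / (1.0 + math.exp(-x))

# logit(0.7) is a precise target around 0.85 on the linear scale,
# whereas the entire ray above roughly 37 collapses to 1.0 in doubles
target = math.log(0.7 / 0.3)
```

So hitting 0.7 requires landing near one specific point on the linear scale, while hitting “1” is satisfied by a whole half-line of predictors.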

But how hard it tries to get that datum there depends on its relative weight in the corpus. It’s the same as regularization, which is trying to get every feature’s coefficient to 0; you just have to weight how hard it tries. I think the 0.7 and 0.3 weightings are right.

PS: I also want to point out that I’m thinking log loss on held out data as the target, not 0/1 loss. One reason I care so much is that we’re trying to build very high recall gene mention extraction and database linkage.

June 22, 2009 at 2:30 pm |

Vikas Raykar set me straight about that pesky figure 2 (right). It’s plotting the inferred gold standard vs. the “golden gold standard”. Remember that their inferred gold standard is a point estimate of p(cat[k] = 1), so the inferred gold standard’s output looks like the output of any other classifier.

This is a much stronger result than I got with the other NLP data, which looks much more like this paper’s figure 4, where the advantage of the model tails off as you approach 100% of the given annotations.

August 15, 2009 at 12:49 am |

I have a slightly different question to experts in this field.

For natural language processing experiments, say an annotation problem, how many annotators would be enough for the experiments to be considered satisfactory? I was told that for NLP conference papers at venues such as EACL, NAACL, or ACL, two annotators would be enough. Is this right? If not, how many annotators would be needed for such labeling tasks to collect data for supervised learning methods such as CRFs and the like?

Many thanks for your timely answers.

Best regards,

Kemal.

August 16, 2009 at 6:22 pm |

Other than Snow et al.’s Amazon Mechanical Turk data and our own experiments, I don’t know of anyone who has used more than two labelers on average for NLP tasks. Sometimes corpora are adjudicated by a third annotator when the first two disagree, but more often, they’re just censored to remove examples on which the annotators disagree.

Some of the LDC corpora have had better quality control, but not much. When you adjudicate, there’s no reason to believe the third annotator (or the so-called “expert”) is any more consistent or accurate than other annotators.

The real problem is correlation between annotators, or equivalently, item-level difficulty. Even if you’re at 90% agreement on average, it’s clear from real data that the annotations are overdispersed relative to the independence assumption. Once you account for item difficulty and correlation, the expected number of errors even with three annotators remains pretty high. You see this more clearly when you get 10 annotators, or when you sit down and annotate data yourself: some cases are just very hard.
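For calibration, the optimistic independence baseline is easy to compute (this is my own back-of-the-envelope, not a number from the discussion above); correlated annotators only push the error rate higher:

```python
from math import comb

def majority_error(n, acc):
    """P(a majority of n independent annotators is wrong),
    each correct independently with probability acc."""
    need = n // 2 + 1  # wrong votes needed to carry an odd-sized majority
    return sum(comb(n, k) * (1 - acc) ** k * acc ** (n - k)
               for k in range(need, n + 1))
```

With three independent annotators at 90% accuracy, the majority vote is wrong on 2.8% of items, i.e. 28 errors per 1000 even before overdispersion is taken into account.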

My own rejected ACL submission had comments that considering 10 annotators was unrealistic (despite the fact that I’d collected the data for about $200 using Amazon Mechanical Turk).

Luckily, most of the estimation techniques, such as CRFs, logistic regression, or naive Bayes, are pretty robust with respect to noise. The effect of noise on evaluation tends to be larger than its effect on estimation.