Sunday, March 8, 2020

Artifacts in NLP

I think the work that Sebastian Ruder is doing with Tracking Progress in Natural Language Processing is very admirable. Keeping track of new frameworks, tasks, and datasets is a lot of work, so many kudos to him for the idea and the implementation of this GitHub repository.
I love it!

However, sometimes a leaderboard is not the best or most reasonable way of tracking progress. Let me explain why I say so, and then you will see what the beautiful Mycenaean vase above has to do with it. (This whole post is about the questions I got when I spoke on Natural Language Inference at SRI on March 5th. Thanks for the great conversation, guys! The slides are in "Between a Rock and a Hard Place".)

For a very long time, computational linguists have been creating datasets to evaluate their semantic systems, which tended to be as varied as their syntactic systems. In particular, the FraCaS test suite, a collection of questions and answers created in the mid-nineties, was meant to capture what semanticists thought computational systems should be able to do once they became good enough for a basic level of understanding. As the unshared task in 2016 described it, "The FraCaS test suite was created by the FraCaS Consortium as a benchmark for measuring and comparing the competence of semantic theories and semantic processing systems. It contains inference problems that collectively demonstrate basic linguistic phenomena that a semantic theory has to account for; including quantification, plurality, anaphora, ellipsis, tense, comparatives, and propositional attitudes. Each problem has the form: there is some natural language input T, then there is a natural language claim H, giving the task to determine whether H follows from T. Problems are designed to include exactly one target phenomenon, to exclude other phenomena, and to be independent of background knowledge."

In the 2000s a series of shared tasks called Recognizing Textual Entailment (RTE) was established, running from 2005 till 2013. The ACL Wiki says that RTE was proposed as a generic task that captures major semantic inference needs across many Natural Language Processing (NLP) applications, such as Question Answering, Information Retrieval, Information Extraction, and Text Summarization. This task requires systems to recognize, given two text fragments, whether the meaning of one text (H) is entailed by (can be inferred from) the other text (T).

More recently, the task has been renamed Natural Language Inference (NLI). On Ruder's benchmark site, we read that natural language inference is the task of determining whether a “hypothesis” is true (entailment), false (contradiction), or undetermined (neutral) given a “premise”. Looking at Ruder's leaderboard, you may think that things are going really well: the community has been able to produce more than four fairly large datasets to measure NLI (e.g. SNLI, MultiNLI, XNLI, SciTail), and the reported numbers are around 92%, beating human performance in some cases.
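To make the task definition concrete, here is a minimal sketch of how an NLI example and a leaderboard-style accuracy score could be represented. The names `NLIExample` and `accuracy` are hypothetical, and the three sentence pairs are invented for illustration; none of this is taken from Ruder's repository or from any real dataset.

```python
from dataclasses import dataclass

# The three labels named in the task definition above.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NLIExample:
    """One premise/hypothesis pair with its gold label (hypothetical structure)."""
    premise: str
    hypothesis: str
    label: str


def accuracy(gold, predicted):
    """Fraction of examples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(example.label == guess for example, guess in zip(gold, predicted))
    return correct / len(gold)


# Invented toy pairs, one per label.
examples = [
    NLIExample("A man plays guitar on stage.", "A man is playing an instrument.", "entailment"),
    NLIExample("A woman walks her dog.", "Nobody is outside.", "contradiction"),
    NLIExample("Two kids run on a beach.", "The kids are siblings.", "neutral"),
]

# A made-up system output that gets two of the three pairs right.
system_output = ["entailment", "contradiction", "entailment"]
score = accuracy(examples, system_output)
```

A leaderboard number is just this kind of score computed on a held-out test set, which is why it says nothing about how the system arrived at its labels.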

What this leaderboard picture does not show is that at least ten papers have appeared in the last two years showing that this performance is "fake" (note the scare quotes).

While the toothbrush holder above is apparently 4000 years old, the above-90-percent results on the NLI task seem to be the result of biases in the datasets constructed to detect inference, and these biases are called artifacts, just like the beautiful pottery above. In particular, the first paper below (Hypothesis Only Baselines in Natural Language Inference) shows that some machine-learning models can in fact decide whether to call an inference pair entailment, contradiction, or neutral given only the second half of the pair, the hypothesis, without ever seeing the premise. (Kudos to the Johns Hopkins group for this insight!) Hence they are not doing inference at all, as far as I'm concerned.
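To see how a system can label pairs without reading the premise, here is a toy sketch of a hypothesis-only baseline. It is not the model from the paper; it swaps in a simple word-count vote so the idea fits in a few lines. The training pairs are invented, the function names are hypothetical, and the planted cue (hypotheses containing "nobody" are contradictions) is an assumption chosen to mimic the kind of artifact these papers describe.

```python
from collections import Counter, defaultdict

# Invented toy data: (premise, hypothesis, label). Not from any real dataset.
TRAIN = [
    ("A man plays guitar on stage.", "A man is playing an instrument.", "entailment"),
    ("A woman walks her dog.", "Nobody is outside.", "contradiction"),
    ("Two kids run on a beach.", "The kids are siblings on vacation.", "neutral"),
    ("A chef cooks in a kitchen.", "Nobody is cooking.", "contradiction"),
    ("A dog sleeps on a couch.", "An animal is resting.", "entailment"),
    ("A girl reads a book.", "The girl is reading for a school project.", "neutral"),
]


def tokenize(text):
    """Lowercase, drop full stops, split on whitespace."""
    return text.lower().replace(".", "").split()


def train_hypothesis_only(examples):
    """Count label frequencies per hypothesis word. The premise is never read."""
    counts = defaultdict(Counter)
    for _premise, hypothesis, label in examples:
        for word in tokenize(hypothesis):
            counts[word][label] += 1
    return counts


def predict(counts, hypothesis, default="neutral"):
    """Let every known hypothesis word vote for the labels it was seen with."""
    votes = Counter()
    for word in tokenize(hypothesis):
        votes.update(counts.get(word, {}))
    if not votes:
        return default
    return votes.most_common(1)[0][0]


model = train_hypothesis_only(TRAIN)

# The premise is not even passed in, yet the cue word "nobody" decides the label.
guess = predict(model, "Nobody is sleeping.")
```

Whatever premise this hypothesis is paired with, the prediction stays the same, which is exactly why a high score from such a model cannot count as inference.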

There are more papers than the ones below, but this sample should be enough to convince people that more work is needed here, as the data we have does not seem to enable learning what humans mean by inference.
  1.  Hypothesis Only Baselines in Natural Language Inference. Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme. 2018.
  2.  Exploring Semantic Properties of Sentence Embeddings. Xunjie Zhu, Tingfeng Li, Gerard de Melo. 2018.
  3.  Annotation Artifacts in Natural Language Inference Data. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, Noah A. Smith. 2018.
  4.  Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. Max Glockner, Vered Shwartz, Yoav Goldberg. 2018.
  5.  Stress Test Evaluation for Natural Language Inference. Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, Graham Neubig. 2018.
  6.  Analyzing Compositionality-Sensitivity of NLI Models. Yixin Nie, Yicheng Wang, Mohit Bansal. 2018.
  7.  Evaluating Compositionality in Sentence Embeddings. Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, Noah D. Goodman. 2018.
  8.  Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. R. Thomas McCoy, Ellie Pavlick, Tal Linzen. 2019.
  9.  Don’t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. Yonatan Belinkov, Adam Poliak, Stuart M. Shieber, Benjamin Van Durme, Alexander M. Rush. 2019.
  10. AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples. Dongyeop Kang, Tushar Khot, Ashish Sabharwal, Eduard Hovy. 2018.

(the original Mechanical Turker from Wikipedia)
