Saturday, March 21, 2020

SICK (the dataset) in these trying times

 
SICK, the dataset, stands for Sentences Involving Compositional Knowledge, nothing more. In particular, it has nothing to do with the coronavirus apocalypse we're living through now. Actually, the 'nothing more' above is an exaggeration. There was an EU-funded project, COMPOSES (2010-2014), which produced many interesting publications. But the main references seem to be:
  • L. Bentivogli, R. Bernardi, M. Marelli, S. Menini, M. Baroni and R. Zamparelli (2016). SICK Through the SemEval Glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Language Resources and Evaluation, 50(1), 95-124.
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi and R. Zamparelli (2014). A SICK cure for the evaluation of compositional distributional semantic models. Proceedings of LREC 2014, Reykjavik (Iceland): ELRA, 216-223.

My history here begins in 2016, when Google announced Parsey McParseface (for the name, see Boaty McBoatface). I thought it was time to check SICK using SyntaxNet and other off-the-shelf tools, like WordNet and SUMO. Given the promise of near-human parsing accuracy (94%, they said) and a dataset curated by linguists to be simple and concrete (no temporal phenomena, no named entities, no multiword expressions, no coreference, they said), with a small vocabulary (2K content words), I thought "we can do this!".

The GitHub repo https://github.com/vcvpaiva/rte-sick has the processing results of this work with Alexandre and Chalub. I still want to write something, later, about what we learned from this first experiment.

But more importantly, in 2017, Katerina, Livy and I got together and did several bits of work checking the SICK corpus. A partial list is below (in roughly chronological order), and the GitHub repo is SICK-processing. So what did we do?



First, we realized that many (though we didn't know how many) of the annotations provided by the SICK turkers did not agree with our intuitions. We talked about this in Textual Inference: getting logic from humans, explaining that we believe our systems should learn from datasets that agree with human experience. In particular, we discussed the single-implication cases in SICK, where we expected to find many problems. (Anyone who has tried to teach logic has noticed how hard it is to get students to grasp the difference between single and double implications.)

Continuing our study of the SICK dataset, we showed that the contradictions were not working well, probably because different annotators understood the task differently. This time we could quantify the issue, and we wrote Correcting Contradictions, where we drove home the point that contradiction should be symmetric (if A contradicts B, then B contradicts A), but in the annotations it often wasn't.
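To make the symmetry point concrete, here is a minimal sketch of the kind of check one can run over the corpus. This is my illustration, not the paper's code, and the tab-separated layout and column names (sentence_A, sentence_B, entailment_label) are assumptions about the SICK release.

```python
# A rough sketch of a contradiction-symmetry check; the column names
# (sentence_A, sentence_B, entailment_label) are assumptions about the
# tab-separated SICK file, not guaranteed.
import csv

def asymmetric_contradictions(path):
    """Return (A, B) pairs labelled CONTRADICTION whose reversed pair (B, A)
    also occurs in the corpus but carries a different label."""
    labels = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            labels[(row["sentence_A"], row["sentence_B"])] = row["entailment_label"]

    problems = []
    for (a, b), label in labels.items():
        reverse = labels.get((b, a))
        if label == "CONTRADICTION" and reverse is not None and reverse != "CONTRADICTION":
            problems.append((a, b, reverse))
    return problems
```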

We then provided a simple method for detecting the easier inferences, the ones where the two sentences compared are only "one word apart" (for a slightly special sense of "one word apart"). This was written up in WordNet for “Easy” Textual Inferences. We then briefly described the kind of graphical representations, geared towards inference, that we wanted to produce, in Graph Knowledge Representations for SICK. We also produced, with many other friends, a Portuguese version of the SICK corpus, in SICK-BR: a Portuguese corpus for inference.
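To give a flavour of the "one word apart" idea, here is a small sketch. It is my reconstruction of the general recipe, not the paper's actual procedure, and the WordNet lookup is deliberately crude.

```python
# A rough sketch of the "one word apart" heuristic: if two sentences differ in
# exactly one token position, the WordNet relation between the two differing
# words suggests how easy the inference is. My simplification, not the paper's code.
from nltk.corpus import wordnet as wn

def one_word_apart(sent_a, sent_b):
    """Return the differing (word_a, word_b) pair if the sentences have the
    same length and differ in exactly one token position, otherwise None."""
    toks_a, toks_b = sent_a.lower().split(), sent_b.lower().split()
    if len(toks_a) != len(toks_b):
        return None
    diffs = [(a, b) for a, b in zip(toks_a, toks_b) if a != b]
    return diffs[0] if len(diffs) == 1 else None

def wordnet_relation(word_a, word_b):
    """Very rough guess at the lexical relation between the two words."""
    syns_a, syns_b = wn.synsets(word_a), wn.synsets(word_b)
    if any(sa == sb for sa in syns_a for sb in syns_b):
        return "synonyms"          # shared synset: likely a paraphrase
    hypers_a = {h for s in syns_a for h in s.closure(lambda x: x.hypernyms())}
    if any(sb in hypers_a for sb in syns_b):
        return "hypernym"          # word_b generalizes word_a: entailment-ish
    return "unknown"
```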

Then we started a collaboration with Martha Palmer and Annebeth Buis in Colorado (the picture has Katerina, Martha, Annebeth and Livy at ACL 2019), where we investigated how many of the issues we had found were really annotation problems. From this experiment we wrote Explaining Simple Natural Language Inference, where we point out that without clear guidelines experiments on inference do not produce sensible results, that explanations from annotators are really useful for improving annotation quality, and that certain linguistic phenomena seem hard for humans, let alone machines.

Meanwhile, our system is getting better all the time, and other people's systems are getting better too. I really would like to be able to incorporate the best of all the systems and see if we can complement each other. No, I don't know how to do it yet. But I would like to use this "sheltering-at-home" time to think about it. (See the references below.)
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Textual Inference: getting logic from humans. Proceedings of the 12th International Conference on Computational Semantics (IWCS), 22 September 2017. Held in Montpellier, France. [PDF]
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Correcting Contradictions. Proceedings of the Computing Natural Language Inference (CONLI) Workshop, 19 September 2017. Held in Montpellier, France. [PDF]
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. WordNet for “Easy” Textual Inferences. Proceedings of the Globalex Workshop, associated with LREC 2018, 08 May 2018. Miyazaki, Japan. [PDF]
  • Aikaterini-Lida Kalouli, Dick Crouch, Valeria de Paiva, Livy Real. Graph Knowledge Representations for SICK. Informal Proceedings of the 5th Workshop on Natural Language and Computer Science, Oxford, UK, 08 July 2018. Short paper. [PDF]
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Annotating Logic Inference Pitfalls. Poster presented at the Workshop on Data Provenance and Annotation in Computational Linguistics 2018, Prague, Czech Republic. (http://ling.uni-konstanz.de/pages/home/kalouli/assets/img/poster_WDP18.pdf)
  • Livy Real, Ana Rodrigues, Andressa Vieira e Silva, Beatriz Albiero, Bruna Thalenberg, Bruno Guide, Cindy Silva, Guilherme de Oliveira Lima, Igor C. S. Camara, Miloš Stanojević, Rodrigo Souza, Valeria de Paiva. SICK-BR: a Portuguese corpus for inference. PROPOR 2018 (International Conference on the Computational Processing of Portuguese), Canela, Brazil, 26 September 2018. [PDF]
  • Aikaterini-Lida Kalouli, Annebeth Buis, Livy Real, Martha Palmer, Valeria de Paiva. Explaining Simple Natural Language Inference. Proceedings of the 13th Linguistic Annotation Workshop (LAW 2019), 01 August 2019, ACL 2019, Florence, Italy. [PDF]

Sunday, March 8, 2020

Artifacts in NLP

I think the work that Sebastian Ruder is doing with Tracking Progress in Natural Language Processing is very admirable. Keeping track of new frameworks and tasks and datasets is a lot of work, so many kudos to him for the idea and implementation of this progress-tracking GitHub repository.
I love it!

However, sometimes a leaderboard is not the best, most reasonable way of tracking progress. Let me explain why I say so, and then you will see what the beautiful Mycenaean vase above has to do with it. (This whole post is about the questions I got when I spoke on Natural Language Inference at SRI on 5th March. Thanks for the great conversation, guys! Slides in "Between a Rock and a Hard Place".)

For a very long time, computational linguists have been creating datasets to evaluate their semantic systems, which tended to be as varied as their syntactic systems. In particular, the FraCaS test suite, a collection of questions and answers created in the early nineties, was meant to capture what semanticists thought computational systems should be able to do once they eventually became good enough for a basic level of understanding. As the 2016 unshared task described it, "The FraCaS test suite was created by the FraCaS Consortium as a benchmark for measuring and comparing the competence of semantic theories and semantic processing systems. It contains inference problems that collectively demonstrate basic linguistic phenomena that a semantic theory has to account for; including quantification, plurality, anaphora, ellipsis, tense, comparatives, and propositional attitudes. Each problem has the form: there is some natural language input T, then there is a natural language claim H, giving the task to determine whether H follows from T. Problems are designed to include exactly one target phenomenon, to exclude other phenomena, and to be independent of background knowledge."

In the 2000s, a series of shared tasks, then called Recognizing Textual Entailment (RTE), was established, running from 2006 till 2013. The ACL Wiki says that RTE was proposed as a generic task that captures major semantic inference needs across many Natural Language Processing (NLP) applications, such as Question Answering, Information Retrieval, Information Extraction, and Text Summarization. The task requires systems to recognize, given two text fragments, whether the meaning of one text (H) is entailed by (can be inferred from) the other text (T).

More recently, the task has been renamed Natural Language Inference (NLI). On Ruder's benchmark site, we read that natural language inference is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral), given a "premise". Now, looking at Ruder's leaderboard, you may think that things are going really well: the community has produced several fairly large datasets (e.g. SNLI, MultiNLI, XNLI, SciTail) to measure NLI, and the numbers are around 92%, beating human performance in some cases.
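To make the three-way label scheme concrete, here is a toy illustration; the example sentences are made up for this post, not drawn from any of the datasets above.

```python
# A toy illustration of the three-way NLI labelling scheme; the sentences are
# invented, not taken from SNLI, MultiNLI or SICK.
nli_examples = [
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "A person is playing an instrument.",
     "label": "entailment"},     # the hypothesis must be true given the premise
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is asleep.",
     "label": "contradiction"},  # the hypothesis cannot be true given the premise
    {"premise": "A man is playing a guitar on stage.",
     "hypothesis": "The man is a professional musician.",
     "label": "neutral"},        # the premise leaves the hypothesis undetermined
]
```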

What this leaderboard picture does not show is that at least ten or so papers have appeared in the last two years showing that this performance is "fake" (note the scare quotes).

While the toothbrush holder above is apparently 4000 years old, the above-90-percent results on the NLI task seem to be the result of biases in the datasets constructed to detect inference, and these biases are also called artifacts, just like the beautiful pottery above. In particular, the first paper below (Hypothesis Only Baselines in Natural Language Inference) shows that some machine-learning models decide whether to call an inference pair entailment, contradiction or neutral given only half of the pair, the hypothesis (kudos to the Johns Hopkins group for this insight!). Hence they are not doing inference at all, as far as I'm concerned.
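The idea behind that paper is easy to sketch. The code below is my simplified illustration of a hypothesis-only baseline, not the authors' implementation, and the scikit-learn classifier is just a convenient stand-in.

```python
# A minimal hypothesis-only baseline: train a classifier that never sees the
# premise. If it scores well above the majority-class baseline, the labels are
# partly predictable from the hypotheses alone, i.e. the dataset has artifacts.
# (A simplified sketch, not the implementation from Poliak et al. 2018.)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def hypothesis_only_accuracy(train_hypotheses, train_labels, test_hypotheses, test_labels):
    """Fit a bag-of-words classifier on hypotheses only and return test accuracy."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features from the hypothesis
        LogisticRegression(max_iter=1000),
    )
    model.fit(train_hypotheses, train_labels)
    return model.score(test_hypotheses, test_labels)
```

If such a classifier lands far above what guessing the majority class would give, the dataset is telling the model the answer without the premise, which is exactly the problem the papers below document.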

There are more papers than the ones below, but this sample should be enough to convince people that more work is needed here, as the data we have does not seem to let systems learn what humans mean by inference.
  1. Hypothesis Only Baselines in Natural Language Inference. Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme. 2018.
  2. Exploring Semantic Properties of Sentence Embeddings. Xunjie Zhu, Tingfeng Li, Gerard de Melo. 2018.
  3. Annotation Artifacts in Natural Language Inference Data. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman and Noah A. Smith. 2018.
  4. Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. Max Glockner, Vered Shwartz and Yoav Goldberg. 2018.
  5. Stress Test Evaluation for Natural Language Inference. Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, Graham Neubig. 2019.
  6. Analyzing Compositionality-Sensitivity of NLI Models. Yixin Nie, Yicheng Wang, Mohit Bansal. 2018.
  7. Evaluating Compositionality in Sentence Embeddings. Ishita Dasgupta, Demi Guo, Andreas Stuhlmüller, Samuel J. Gershman, Noah D. Goodman. 2018.
  8. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. R. Thomas McCoy, Ellie Pavlick and Tal Linzen. 2019.
  9. Don’t Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference. Yonatan Belinkov, Adam Poliak, Stuart M. Shieber, Benjamin Van Durme, Alexander M. Rush. 2019.
  10. AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples. Dongyeop Kang, Tushar Khot, Ashish Sabharwal, Eduard Hovy. 2018.


(The original Mechanical Turker, from Wikipedia)