Saturday, March 21, 2020

SICK (the dataset) in these trying times

SICK, the dataset, stands for Sentences Involving Compositional Knowledge, nothing more. In particular, it has nothing to do with the coronavirus apocalypse we're living through now. Actually, the 'nothing more' above is an exaggeration. There was an EU-funded project, COMPOSES (2010-2014), with many interesting publications. But the main references seem to be:
  • L. Bentivogli, R. Bernardi, M. Marelli, S. Menini, M. Baroni and R. Zamparelli (2016). SICK Through the SemEval Glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Journal of Language Resources and Evaluation, 50(1), 95-124
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi and R. Zamparelli (2014). A SICK cure for the evaluation of compositional distributional semantic models. Proceedings of LREC 2014, Reykjavik (Iceland): ELRA, 216-223.

My history here begins when, in 2016, Google announced Parsey McParseface (for the name, see Boaty McBoatface). I thought it was time to check SICK using SyntaxNet and other off-the-shelf tools, like WordNet and SUMO. With the promise of near-human parsing accuracy (94%, they said), a dataset curated by linguists to be simple and concrete (no temporal phenomena, no named entities, no multiword expressions, no coreference, they said), and a small vocabulary (2K content words), I thought "we can do this!".

The GitHub repo has the processing results of this work with Alexandre and Chalub. I still want to write something, later, about what we learned from this first experiment.

But more importantly, in 2017, Katerina, Livy and I got together and did several bits of work checking the SICK corpus. A partial list is below, in reverse chronological order, and the GitHub repo is SICK-processing. So what did we do?

First, we realized that many (but we didn't know how many) of the annotations provided by the SICK turkers did not agree with our intuitions. We talked about this in Textual Inference: getting logic from humans, explaining our belief that our systems should learn from datasets that agree with human experience. In particular, we discussed the single-implication cases in SICK, where we expected to find many problems. (Anyone who has tried to teach logic has noticed how hard it is to get across to students the difference between single implications, where A entails B but B need not entail A, and double implications, where each entails the other.)

Continuing our study of the SICK dataset, we showed that the contradiction annotations were not working well, probably because different annotators understood the task differently from one another. This time we could quantify the issue, and we wrote Correcting Contradictions, where we made the point that contradiction must be symmetric, and in the annotations it wasn't.
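The symmetry requirement is easy to state operationally. Here is a minimal sketch, not the code from the paper, of how one might flag asymmetric contradiction labels in a set of annotated sentence pairs (the function name, label strings, and toy examples are all hypothetical):

```python
# Sketch: if pair (A, B) is labeled CONTRADICTION, the reversed pair (B, A)
# should carry the same label, since "A contradicts B" is a symmetric relation.

def find_asymmetric_contradictions(annotations):
    """annotations: dict mapping (premise, hypothesis) -> label string.
    Returns triples (a, b, reverse_label) where (a, b) is a contradiction
    but the reversed pair was annotated with a different label."""
    problems = []
    for (a, b), label in annotations.items():
        if label == "CONTRADICTION":
            reverse = annotations.get((b, a))
            if reverse is not None and reverse != "CONTRADICTION":
                problems.append((a, b, reverse))
    return problems

# Hypothetical toy annotations:
labels = {
    ("A man is sleeping", "Nobody is sleeping"): "CONTRADICTION",
    ("Nobody is sleeping", "A man is sleeping"): "NEUTRAL",  # asymmetric!
}
print(find_asymmetric_contradictions(labels))
# -> [('A man is sleeping', 'Nobody is sleeping', 'NEUTRAL')]
```

A check like this only works when both directions of a pair appear in the corpus, which is exactly the situation in SICK's single- and double-implication pairs.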

We then provided a simple method for detecting easier inferences, those where the sentences compared are only "one word apart", for a slightly special sense of the phrase "one word apart". This was written up in WordNet for “Easy” Textual Inferences. We then briefly described the kind of graphical representations, geared towards inference, that we wanted to produce, in Graph Knowledge Representations for SICK. We also produced, with many other friends, a Portuguese version of the SICK corpus, in SICK-BR: a Portuguese corpus for inference.
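In its simplest form, the "one word apart" test can be sketched as below. This is an illustration only, assuming plain whitespace tokenization; the paper's actual notion is more flexible (it works over lemmas and allows some insertions and deletions):

```python
def one_word_apart(sent_a, sent_b):
    """Return the single differing word pair if the two sentences have the
    same number of tokens and differ in exactly one position, else None.
    Simplifying assumption: whitespace tokenization and lowercasing."""
    tokens_a = sent_a.lower().split()
    tokens_b = sent_b.lower().split()
    if len(tokens_a) != len(tokens_b):
        return None
    diffs = [(x, y) for x, y in zip(tokens_a, tokens_b) if x != y]
    return diffs[0] if len(diffs) == 1 else None

print(one_word_apart("A man is cutting a tomato",
                     "A man is slicing a tomato"))
# -> ('cutting', 'slicing')
```

Once a pair is reduced to a single differing word pair, a lexical resource such as WordNet can be consulted: roughly, synonyms suggest mutual entailment, hypernyms suggest one-way entailment, and antonyms suggest contradiction.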

Then we started a collaboration with Martha Palmer and Annebeth Buis in Colorado (the picture shows Katerina, Martha, Annebeth and Livy at ACL2019), where we investigated how many of the issues we had found were instances of annotation problems. From this experiment we wrote Explaining Simple Natural Language Inference, where we point out that without clear guidelines experiments on inference do not produce sensible results, that explanations from annotators are really useful for improving annotation quality, and that certain linguistic phenomena seem hard for humans, let alone machines.

Meanwhile, our system is getting better all the time, and other people's systems are getting better too. I really would like to be able to incorporate the best of all the systems and see if we can complement each other. No, I don't know how to do it yet. But I would like to take this "sheltering-at-home" time to think about it. (See the references below.)
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Textual Inference: getting logic from humans. Proceedings of the 12th International Conference on Computational Semantics (IWCS), 22 September 2017. Held in Montpellier, France. [PDF]
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Correcting Contradictions. Proceedings of the Computing Natural Language Inference (CONLI) Workshop, 19 September 2017. Held in Montpellier, France. [PDF]
  • Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. WordNet for “Easy” Textual Inferences. Proceedings of the Globalex Workshop, associated with LREC 2018, 08 May 2018. Miyazaki, Japan. [PDF]
  • Katerina Kalouli, Dick Crouch, Valeria de Paiva, Livy Real. Graph Knowledge Representations for SICK. Informal Proceedings of the 5th Workshop on Natural Language and Computer Science, Oxford, UK, 08 July 2018. Short paper. [PDF]
  • Kalouli, A.-L., L. Real, and V. de Paiva. 2018. Annotating Logic Inference Pitfalls. Poster presentation at the Workshop on Data Provenance and Annotation in Computational Linguistics 2018. Prague, Czech Republic.
  • Livy Real, Ana Rodrigues, Andressa Vieira e Silva, Beatriz Albiero, Bruna Thalenberg, Bruno Guide, Cindy Silva, Guilherme de Oliveira Lima, Igor C. S. Camara, Miloš Stanojević, Rodrigo Souza, Valeria de Paiva. SICK-BR: a Portuguese corpus for inference. PROPOR2018 (International Conference on the Computational Processing of Portuguese), Canela, Brazil, 26 September 2018. [PDF]
  • Kalouli, A.-L., Buis, A., Real, L., Palmer, M., de Paiva, V. Explaining Simple Natural Language Inference. Proceedings of the 13th Linguistic Annotation Workshop (LAW 2019), 01 August 2019. ACL 2019, Florence, Italy. [PDF]
