You may think that after some six papers (all with Katerina and Livy) discussing it, I would never want to see the SICK (Sentences Involving Compositional Knowledge) corpus of Marelli et al. ever again.
But you would be wrong, as the problem of deciding on correct and reliable annotations for a corpus of Natural Language Inferences (NLI) that seems at first sight so approachable and sensible is overwhelmingly compelling. We need to be able to solve this problem!
A previous blog post is SICK (dataset) in these trying times.
Our published papers are, in reverse chronological order:
1. Kalouli, A.-L., Buis, A., Real, L., Palmer, M., de Paiva, V. Explaining Simple Natural Language Inference. Proceedings of the 13th Linguistic Annotation Workshop (LAW 2019), 01 August 2019. ACL 2019, Florence, Italy. [PDF]
2. Katerina Kalouli, Dick Crouch, Valeria de Paiva, Livy Real. Graph Knowledge Representations for SICK. Informal Proceedings of the 5th Workshop on Natural Language and Computer Science, Oxford, UK, 08 July 2018.
3. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. WordNet for “Easy” Textual Inferences. Proceedings of the Globalex Workshop, associated with LREC 2018, 08 May 2018. Miyazaki, Japan. [PDF]
4. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Correcting Contradictions. Proceedings of the Computing Natural Language Inference (CONLI) Workshop, September 2017. [PDF]
5. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Textual Inference: getting logic from humans. Proceedings of the 12th International Conference on Computational Semantics (IWCS), September 2017. [PDF]
(There's still one paper submitted and one being written.)
So I have been thinking about this problem for some four years! And the task that seemed so easy at the beginning is still not done. Why is that? Why is this task so difficult? What would I do differently, if I was starting this project now?
The task at hand is to determine whether a sentence follows from, contradicts or is neutral with respect to another given sentence. Thus, using an example from the original Marelli et al. paper, the premise A sea turtle is hunting for fish entails the conclusion A sea turtle is hunting for food. The same premise contradicts A sea turtle is not hunting for fish, and it is neutral with respect to A fish is hunting for a turtle in the sea.
Seems simple, no?
In 2017 we wrote: "The long-term goal [of our work] is to be able to provide conceptual semantics for sentences so that entailment-contradiction-neutrality relations between sentences can be identified." Now, in 2021, this project has been completed to a large extent. Now we do have a system that produces conceptual semantics for sentences, the Graphical Knowledge Representation (GKR) system. We also have a system that decides inference relations between pairs of sentences, the hybrid natural language inference Hy-NLI system. Both systems are now fully operational, are open source and can be played with in the demos at https://cis.lmu.de/~kalouli/resources.html.
But the cleaning up and checking of our golden corpus have not been finished. Why?
There is a collection of reasons, from different sources, and these are discussed in the papers. Here we will try to analyze them a little.
First of all, the systems we produce all have some noise. The corpus we use is not as well constructed as we wanted it to be; the task is not as easily understood as we had hoped; the guidelines we provided do not cover all difficult cases; the annotators are not as clever as we expect them to be; and the systems that are used to tally results have some issues too.
The task of detecting entailment and contradiction requires minimizing the amount of common sense and world knowledge needed when constructing your corpus: we're mostly trying to measure "linguistic entailment". (To measure encyclopedic knowledge of the world and/or commonsense knowledge of the population we would use different tools, possibly factoid question answering and/or Winograd pronouns.) We do not want to measure specific knowledge of a domain such as Law or Biology or Mathematics. These have specific vocabulary, while we want to stay with the vocabulary of someone who can read the news.
The corpus SICK
Well, we needed a simple, common-sense corpus that uses concrete and well-known concepts such as cats and dogs and boys and girls and running, eating, playing and sleeping; that is, a corpus with simple sentences about concrete, everyday, common events or activities, with short and direct sentences and grammar. Since SICK was built to contain such concrete, common-sense sentences, it seemed ideal for our kind of testing.
The linguists behind the SICK corpus tried to avoid as much as possible the use of world knowledge and tried to remove any temporal and aspectual issues. They removed most of the named entities and possessive pronouns. (see below the list of linguistic phenomena that they've tried to remove from the corpus)
They also used a reduced vocabulary of only some 2K words. But the process of creating the pairs of sentences was semi-automatic, and despite claims that all the sentences had been checked by a native speaker, some rather strange sentences do show up in the corpus. And the process used to semi-automatically construct sentences leaves a lot to be desired.
Crowdsourcing techniques are very useful for this kind of annotation work. However, the quality and consistency of these annotations can be poor. When looking at the corpus to investigate what humans considered entailment and contradiction, we realized that there were many troublesome annotations. At the time of our first paper Textual Inference: getting logic from humans (2017) we thought we could simply correct all these wrong annotations.
So what were the problems and the corrections? Why hasn't it worked yet?
Let me list some issues with the task:
1. [referents] There was the original problem with referents, already hinted at in the original SICK paper A SICK cure for the evaluation of compositional distributional semantic models.
The SICK authors say:
"Not unreasonably, subjects found that, say, A woman is wearing an Egyptian headdress does not contradict A woman is wearing an Indian headdress, since one could easily imagine both sentences truthfully uttered to refer to a single scene where two different women are wearing different headdresses."
We've written our guidelines to help dispel this issue, as it seemed to us that annotators were not given enough information on what to consider as referents of expressions. So we came up with a system to try to assign referents in a way that maximizes the relationships between the two sentences of a pair.
But even if we force (via guidelines) annotators to think that we're talking about the same woman, couldn't an Indian headdress be worn over an Egyptian headdress or under it?
Also, in the example pair A= A fearful little boy is on a climbing wall. B= A fearful little boy is on the ground, again there could be two fearful little boys, one climbing the wall and one on the ground. Since there is no reason to assume that the indefinite little boys are the same, the annotators judged A and B to be neutral with respect to each other, because the two sentences can be talking about different little boys.
So when do we want to assume that we're talking about the same entities? We needed specific instructions and thought detailed guidelines would solve the issue.
2. [asymmetric contradictions] Then the first big issue that we pointed out was the problem with contradictions. Contradictions in logic are symmetric: if proposition A is contradictory to B, then B must be contradictory to A. This is not what happens with the annotations of SICK. Processing the corpus, we could see that we have many asymmetric pairs:
• 8 pairs AeBBcA, meaning that A entails B, but B contradicts A;
• 327 pairs AcBBnA, meaning that A contradicts B, but B is neutral with respect to A;
• 276 pairs AnBBcA, meaning that A is neutral with respect to B, but B contradicts A.
In total 611 pairs out of 9840 are annotated in a way that logically does not make sense. These may seem few (6% of the pairs) but since the sentences chosen ought to describe simple, common-sense situations and since these wrong annotations are not even self-consistent, this is a cause for concern.
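The consistency check behind these counts can be sketched in a few lines of Python. This is only an illustration, not the script we actually used: the input format (one label per direction, with "e", "c" and "n" for entailment, contradiction and neutral) and the toy pairs are assumptions made up for the example. The last lines simply redo the arithmetic on the counts reported above.

```python
from collections import Counter

# Labels (assumed encoding): "e" = entailment, "c" = contradiction, "n" = neutral.

def direction_pattern(a_to_b, b_to_a):
    """Build a pattern name such as 'AcBBnA' from the two directional labels."""
    return f"A{a_to_b}BB{b_to_a}A"

def is_asymmetric_contradiction(a_to_b, b_to_a):
    """True when exactly one of the two directions is labelled contradiction."""
    return (a_to_b == "c") != (b_to_a == "c")

def count_inconsistent(pairs):
    """Count inconsistent patterns in an iterable of (pair_id, a_to_b, b_to_a)."""
    counts = Counter()
    for _pair_id, a_to_b, b_to_a in pairs:
        if is_asymmetric_contradiction(a_to_b, b_to_a):
            counts[direction_pattern(a_to_b, b_to_a)] += 1
    return counts

# Toy data invented for illustration only (not actual SICK annotations).
toy_pairs = [
    (1, "c", "c"),  # consistent: contradiction in both directions
    (2, "c", "n"),  # inconsistent
    (3, "n", "c"),  # inconsistent
    (4, "e", "n"),  # consistent: one-way entailment
    (5, "e", "c"),  # inconsistent
]
toy_counts = count_inconsistent(toy_pairs)
print(dict(toy_counts))

# Redo the arithmetic on the counts reported in the text.
reported = {"AeBBcA": 8, "AcBBnA": 327, "AnBBcA": 276}
total = sum(reported.values())
share = total / 9840
print(total, f"{share:.1%}")
```

Note that a pattern like AeBBnA is not flagged: one-way entailment is perfectly logical, and only a contradiction that holds in one direction but not the other is treated as an error.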
We didn't expect non-symmetric contradictions, but we found plenty of them. In our second paper, Correcting Contradictions, we were still thinking of manually correcting the whole corpus. But we realized that without strict criteria for when contexts were close enough, we could not decide on contradictions. We follow Zaenen et al. and de Marneffe et al., who argue that the events or entities of a sentence pair need to be coreferent if the pair is to be deemed a contradiction. Thus we established a guideline: try to assume that the pairs are talking about the same event and entities, no matter what definite or indefinite determiners are involved (subject to some restrictions). To assume that the entities and the events are coreferent is not so easy, especially in cases where the two sentences are very distant from each other in meaning. But when the two sentences are closer in meaning it can also be tricky.
3. [non-grammatical] When investigating the problematic non-symmetric pairs, we discovered that we have some issues with ungrammatical sentences and with nonsense in sentences. With a sentence like The black and white dog isn’t running and there is no person standing behind, different inference relations may appear depending on how you correct it to make it grammatical. With a sentence like A motorcycle is riding standing up on the seat of the vehicle, the annotators had to use quite a lot of imagination to try to make sense of the sentence.
But there are perfectly well-formed sentences that do not make much sense either (for instance, A= A large green ball is missing a potato. B= A large green ball is hitting a potato.). One can always (with imagination and effort) construct a scenario where the sentences fit. If one reads only the first sentence, one might think of recipes and cooking (the ball of dough is missing a potato, strange, but possible?), but the second sentence seems to indicate some game, a bit like bowling with potatoes instead of bowling balls? But the point is not to show great leaps of imagination, but to actually be as boring as possible and stay as close as possible to the usual, commonsense meanings.
And thinking of commonsense, YouTube changed some of the rules. I'd expect a sentence about "hamsters singing" or "monkey practising martial arts" to be nonsense, but apparently they are more common than one might think.
4. [lexical issues] We expected different annotators to have different interpretations of words. We thought we should recommend a dictionary, and we ourselves used Princeton WordNet (PWN). PWN is large but it doesn't have everything. In particular it doesn't have neologisms like "ATV" for all-terrain vehicle or "seadoo" for a kind of jet-powered aqua scooter or even "shirtless", which is not a new word. It also does not have newer sports like "snowboard, kickbox, wakeboard" or even "jet ski", the verb for riding a seadoo. It also lacks verbs that are formed with prefixes or suffixes, like "unstitch, debone".
Other concepts of SICK cannot be found in PWN simply because of tokenization issues. WordNet lists "fistfight" instead of "fist fight", and "ping-pong" instead of "ping pong", for instance.
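One cheap way around such tokenization mismatches is to try several spellings of a multiword term before declaring it missing. The sketch below is only an illustration of that idea: the function names are made up, and the tiny toy_lexicon set is a stand-in for a real lexical resource, not the PWN interface.

```python
def spelling_variants(term):
    """Generate joined, hyphenated, underscored and spaced forms of a term."""
    parts = term.replace("-", " ").replace("_", " ").split()
    return {sep.join(parts) for sep in ("", "-", "_", " ")}

def lookup(term, lexicon):
    """Return the lexicon entries that match any spelling variant of the term."""
    return sorted(spelling_variants(term.lower()) & lexicon)

# Toy lexicon standing in for a real resource (hypothetical, for illustration).
toy_lexicon = {"fistfight", "ping-pong", "dog"}

print(lookup("fist fight", toy_lexicon))
print(lookup("ping pong", toy_lexicon))
print(lookup("jet ski", toy_lexicon))
```

With this toy lexicon, "fist fight" and "ping pong" are found under their alternative spellings, while "jet ski" is still reported as missing, which is the case that really calls for extending the resource.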
Princeton WordNet is well-known for not having as many MWEs (multiword expressions) as one might need for processing English. Verbs like "break dance" (no physical break), "rock climb" and nouns like "slip and slide" (a water toy for kids) confuse parsers and humans alike.
Out of the 435 unique compound nouns found in our processing of SICK, only 84 are included in PWN. (These numbers could certainly be improved now that there is a GitHub version of PWN: one could import all the MWEs from Wikipedia or from Wiktionary, for example.)
There are also pairs of sentences involving meronymy relations and precisely what specific nouns are made of. A representative pair has A dog is running on the beach and chasing a ball as its first sentence, while the second sentence is A dog is running on the sand and chasing a ball. Now some beaches do not have much sand, but still people do associate beach with sand.
Our paper WordNet for “Easy” Textual Inferences has many other cases of lexical similarity that are not really logical. For example, consider the pair A= A dog is licking a toddler. B= A dog is licking a baby. Toddlers are not babies, the words are not synonyms, but they are similar enough that people will use them as if they were synonyms in certain contexts. They will not, if the context is the medication to give to a baby or toddler, but when describing photos, the terms can be seen as very similar. These similarity cases are interesting, as they prompt the question of how this kind of information should be encoded in lexical resources, if at all.
But another issue is the subsumption of lexical senses: it is clear to everyone that "a river is a very big stream and a stream a very small river", but which one is the generic one that goes in the ontology? The same question can be asked of pots and pans: are pots tall pans with short handles or are pans medium-size kinds of pots with a single, long handle? Are bowls dishes (i.e. containers) that are deep? Or are dishes bowls that are shallow? Are girls women that are young? Or are women old girls? Different annotators will have different views.
5. [logical issues] We expected the logical problems that most humans have with reasoning: people tend to confuse implications with bi-implications. Hence in our first paper we investigated the entailments of the form AeBBnA. We found the expected gender biases (bikers and climbers are men only, etc.) and some of the expected confusions between biconditionals and conditionals.
Through all the work described here, we have followed principles described in WordNet for “Easy” Textual Inferences:
(i) We believe that the task of inference can and should be broken down to “easy” inferences like these ones and that therefore it is of great importance to have trustworthy, high-coverage resources that can solve big parts of them.
(ii) Lexical resources should always be expanded and then further supported by other state-of-the-art techniques such as word embeddings.
(iii) Evaluating lexical resources is a time-consuming task, mainly because we need to find appropriate test data which should on the one hand efficiently test the coverage of the resources themselves and on the other hand originate from real NLP scenarios that bring to light the whole complexity of language and thus the challenging cases.
 
Aligning predicate-argument structures to decide on referents for contradictions is not an easy task. But we thought we had enough explanations and guidelines that we would be able to extract the kind of information we wanted from linguistics students. Our work with Martha Palmer and Annebeth Buis on Explaining Simple Natural Language Inference showed otherwise.
The task NLI (Natural Language Inference), also known as Recognizing Textual Entailment (RTE) (Dagan et al., 2006), is the task of defining the semantic relation between a premise text p and a conclusion text c. Premise p can a) entail, b) contradict or c) be neutral with respect to conclusion c. The premise p is taken to entail conclusion c when a human reading p would infer that c is *most probably* true (Dagan et al., 2006).
This notion of “human reading” assumes human common sense and common background knowledge, two conditions difficult to pin down. But this means that a successful automatic NLI system is a suitable evaluation measure for real natural language understanding, as discussed by Condoravdi et al. (2003) and others.
A conclusion from the paper above concerns the annotation procedure: having an inference label is not enough; knowing why a human subject decides that an inference is an entailment or a contradiction is useful information that we should also be collecting. Thus we tried an experiment where we provided what we hoped were uncontroversial, clear guidelines and gave annotators the chance to justify their decisions.
Our goal was to evaluate the guidelines based on the resulting agreement rates and gain insights into the NLI annotation task by collecting the annotators’ comments on the annotations.
Apart from the inference relation and the justification, the annotators were also asked to give a score from 0-10 for what we would like to call “computational feasibility” (CF), i.e. their estimation of the likelihood of an automatic system getting the inference right.
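To make "agreement rates" concrete, here is a minimal sketch of one common chance-corrected agreement statistic, Cohen's kappa, for two annotators. Choosing kappa is an assumption made for this illustration; it is not meant as a description of the exact measure used in the paper, and the labels below are toy data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Toy annotations invented for illustration ("e", "c", "n" as inference labels).
annotator_1 = ["e", "c", "n", "e", "n", "c"]
annotator_2 = ["e", "c", "n", "n", "n", "e"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Raw percentage agreement would be simpler, but it rewards annotators who overuse the most frequent label, which is why a chance-corrected measure is usually preferred for tasks like this one.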
This led to some conclusions. We discovered some linguistic phenomena that are hard for humans to annotate and showed that these do not always coincide with what is assumed to be difficult for automatic systems. We need to start devising corpora based on the notion of human inference, which includes some inherent variability, and we need to find appropriate methods to train our systems on such data and measure their performance on them.
We showed that it is essential to include a justification method in similar annotation tasks as a suitable way of checking the guidelines and improving the training and evaluation processes of automatic systems towards explainable AI.
Summarizing: Over these four years and five papers we investigated shortcomings of the task and of the datasets for Natural Language Inference, through the lens of the corpus SICK. The idea of a simplified corpus, with text based on captions of everyday activities created by humans, was definitely a powerful one. Its implementation, especially the semi-automatic generation of new pairs via transformations of a kernel set of sentences, left something to be desired. It created some nonsensical sentences, which made the already difficult task of measuring agreement between humans even harder.
Clear guidelines, with examples, are necessary to annotate corpora of this style. But as we argue, these are not enough. Even a single annotator might be more strict one day, more relaxed another (what's the difference between cutting and slicing mushrooms after all?). Different annotators will have different levels of precision when using the language, depending not only on their vocabulary, but also on the aspects of information they want to convey. Different linguistic phenomena might need special ways of measuring. Justifications for annotations seem a great thing to have, but one has to deal with the logistic problem of obtaining them. To quote Slav Petrov's blog post "our work is still cut-out for us!"







