Friday, December 24, 2021

White Christmas

People don't usually remember to consider mathematics as a protective shield against bad thoughts. But it is one of its lesser-known good side-effects. My son wouldn't say so, as the problems he tackles are human problems, which I find much harder and more important. But what he said yesterday could've come straight from one of my attempts to explain the kind of mathematics I do: we seek patterns and structure, because this is the way humans have to survive the chaos that's life.

So I recommend keeping at least one or two problems 'on the go' at all times. This, to my mind, can be effective, especially in emotionally charged situations like the holidays. All this accounting of things accomplished or not, opportunities taken or missed, goals and deadlines whooshing by, can lead to cyclic feelings of inadequacy. We don't need that; we have enough work to do. Our work is cut out for us!
 

Sunday, December 12, 2021

Retrospective of 2021?

I have a bit of an issue with the Topos Institute blog: I don't like the very staid style, stodgy and academic. I think blog posts are supposed to be fun, make jokes, have pictures. Even if they're only pictures of weeds in the corner of the garden. So I have not been writing much there. But we just had our end-of-year post and this was more fun. The post is linked above and I will repeat my bit below, as I would like to add more pictures than the ones I could fit in the Topos post.

 Davide Trotta and Matteo Spadetto

2021 was a very difficult year for most of us: the pandemic, which looked like it was receding in the spring (at least it looked like that in California, when the vaccine became available), came back, showing that it hasn’t been tamed. Yet, we hope. But we mourn the thousands and thousands that were taken too soon. The economic crisis, brought about in part by the pandemic, is just starting to unfold. (Economics is always a few steps behind politics: it’s much harder to count the number of deaths it causes.) The climate-crisis catastrophes, not as bad as they were in California in 2020, are still very much with us: we have droughts, wildfires, floods, you name it…

Against this backdrop of suffering, it feels almost inconsiderate to think that I am working exactly on what I wanted to for so many years. It feels good to tell people that the Topos mission is to “shape technology for public benefit by advancing sciences of connection and integration”. And it is great to have similarly minded people to do it with! Topos is still starting, but we believe in community, diversity, equity and inclusion; and we’re working for that. 

 

The challenges are enormous: they go from curbing the arrogance of mathematicians and tech people who think they know how to do others’ work better than those others do, to convincing biologists, social scientists, and humanities researchers that we can bring something to the table, in a respectful manner. But if the challenges are considerable, the payoff is incredible. We hope not only “To invent the future” (as in a previous place), but to invent a just, sustainable and equitable future for us all. Thank you for all the joint work, friends!!

Both Davide and Matteo (far above) and Elena and Wilmer (just above) are presenting our joint work at SYCO (Symposium on Compositional Structures) tomorrow and the next day. Good luck, friends!

Now this is just one slice of the work. Below to the left, we have our little NLI annotation group: Katerina, Martha, Annebeth and Livy. To the right, Elaine talking at the meeting about 'Women in Logic in Brazil 2019'.

Katerina Kalouli, Martha Palmer, Annebeth Buis, and Livy Real
 

Now there are several other people that I should be adding pictures of in here. Amongst others, Luiz Carlos Pereira, Samuel Gomes da Silva, Jacob Collard and Eswaran Subrahmanian. But I am very tired now, so just one more group picture.

Happy Holidays!



Wednesday, November 10, 2021

Dagstuhl 2021? I wish

 


Looking at my Google Scholar Profile last week, I discovered that I had a new publication in 2021 together with Josef van Genabith, Eike Ritter and Dick Crouch. This was a big surprise, as I haven't even talked to Josef or Eike in a long while! Our Dagstuhl seminar, Linear Logic and Applications, happened in 1999. But yes, Dagstuhl must have decided to put all the older reports online, after a long while, which is nice.

I am very proud of this Dagstuhl meeting and wish I could organize another, as Dagstuhl is a great place to do science of high quality. Looking at the abstracts 22 years later, it is very interesting to see which trends turned out to be productive. In particular, David Pym and Peter O'Hearn's "Bunched Logic" made its first appearance in the seminar and it did go places!

Tuesday, November 2, 2021

History of Ideas?

(This was an FB post from 31st October 2013. I still want to solve this problem, I still like Banksy. But now I want to leave FB as soon as I can. So it makes sense to copy old posts into this blog.)

I have a problem that it would be nice to see someone help solve...
Look at this paper: Studying the History of Ideas Using Topic Models
The authors study the development of ideas in a scientific field (computational linguistics) over time (25 years) using the abstracts of the papers published in the field. Now, wouldn't it be great to see the same for category theory?
 
I'm particularly interested because I just wrote a paper with Andrei Rodin, `Elements of Categorical Logic: Fifty Years Later', celebrating the 50th anniversary of Bill Lawvere's PhD thesis.
I think a study of topics in category theory would show `n-category theory' rising sharply in importance recently.
 
There is, of course, the first issue of getting the papers. Apparently some 1,500 abstracts are enough though... From TAC we have some 800 or thereabouts. Maybe some of our friends will get interested?
It would be nice.
(31st October 2013)
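Rereading this in 2021: here is a minimal sketch of what such a study could look like, assuming one had gathered the abstracts (from TAC, say) as (year, text) pairs. Nothing below comes from the topic-models paper; it is just the standard LDA recipe via scikit-learn, with made-up placeholder abstracts.

```python
# Minimal sketch: fit topics on abstracts and track average topic weight per year.
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

abstracts = [
    (1995, "cartesian closed categories and typed lambda calculus ..."),
    (2012, "a survey of n-categories and higher-dimensional algebra ..."),
    # ... some 1,500 real abstracts would go here
]

texts = [text for _, text in abstracts]
vectorizer = CountVectorizer(stop_words="english", max_features=5000)
counts = vectorizer.fit_transform(texts)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)  # one topic distribution per abstract

# A topic whose average weight rises over the years is the kind of trend
# the history-of-ideas paper plots for computational linguistics.
per_year = defaultdict(list)
for (year, _), dist in zip(abstracts, doc_topics):
    per_year[year].append(dist)
for year in sorted(per_year):
    avg = sum(per_year[year]) / len(per_year[year])
    print(year, [round(float(w), 3) for w in avg])
```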

Sunday, October 31, 2021

Dialectica Logical Principles

 


`In the past fifty years, the dialectica interpretation proposed by Gödel has become one of the most fundamental conceptual tools in logic and the foundations of mathematics', says Thomas Strahm in his introduction to the special volume of Dialectica dedicated to 50 years of the original paper. Shame that, despite the volume being more than ten years old (2008), the articles are not open access, so you can read about them in Richard Zach's blog post, but you'll have to find the articles themselves on the authors' pages or somewhere else.

`Gödel’s functional interpretation can be seen as a possible realization of the so-called modified Hilbert program in the sense that it enables a reduction of a classical system to a quantifier-free theory of functionals of finite type, thereby reducing the consistency problem for the classical theory to the consistency of a quantifier-free system of higher-type recursion, the latter being informally more finitistic and sufficiently well understood. Gödel’s interpretation of arithmetic has been substantially extended in the past fifty years both to stronger and weaker theories. In particular, versions of the dialectica interpretation have been proposed for impredicative classical analysis, subsystems of classical arithmetic, theories of ordinals, predicative subsystems of second order arithmetic, analysis with game quantifiers, systems of feasible arithmetic and analysis, admissible and constructive set theories, as well as iterated arithmetical fixed point theories. A comprehensive survey of many of these results can be found in Avigad and Feferman (1995) as well as Troelstra (1973). In more recent years, work on functional interpretations has shifted from purely foundational purposes to applications to concrete proofs in mathematics in the sense of Kreisel’s unwinding program. In connection with this, Kohlenbach’s proof mining program provides very impressive results making use of variants of Gödel’s dialectica interpretation.'

So far, so good. The mathematicians interested in foundations ought to be happy. I am. Especially because Davide Trotta (above) spoke last Thursday in the Ottawa Category Theory Seminar about our second paper together with Matteo Spadetto. This is "Dialectica Logical Principles", which explains how our categorical modelling of Gödel’s dialectica interpretation also validates the principles of Independence of Premise (IP) and Markov's Principle (MP), just as the dialectica interpretation itself does. This is not surprising, but it underscores how faithful the modelling is, a big win for categorical logic. It follows our previous paper, The Gödel Fibration, which is nice because it makes more explicit the steps connecting the categorical modelling and the logical system modelled: see the case of quantifier-free objects discussed here.
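For readers who have not seen them, here are the two principles in one standard textbook formulation (this is just a gloss to fix ideas, not notation from our paper; the dialectica literature works with slightly more general variants, e.g. IP for universal premises):

MP (Markov's Principle): $\neg\neg\exists x\, A(x) \rightarrow \exists x\, A(x)$, for $A$ quantifier-free.

IP (Independence of Premise): $(A \rightarrow \exists y\, B(y)) \rightarrow \exists y\,(A \rightarrow B(y))$, for $A$ quantifier-free and $y$ not free in $A$.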

But 'mathematicians interested in foundations' is a very small demographic. What can we say to others? What about the mythical "person on the street"? I have been thinking a lot about it. Recently Google Scholar sent me news about Modal Functional (“Dialectica”) Interpretation (by Hernest and Trifonov). Maybe this will help.





Tuesday, October 19, 2021

Conversations with Sophie

One of the pleasures of going to Topos on Mondays is the conversations with Sophie on the car journey. Some have been quite revealing to me, as I do believe I think better when talking than when simply thinking. When talking I need to finish thoughts, complete sentences, all these boring trifles that actually help you make sense. I know, it sounds strange, but rings true to me.

So I thought I'd record some of our conversations--or perhaps a cleaned-up version of them. This first one could be called "Lies Math Teachers Tell Us", as it's a list of things that are wrong, but that many people in maths departments (or coming out of them) believe to be true. The ones I'm about to discuss are mostly about Natural Language, which contrary to widespread belief (cf. Emily Bender's Rule) is *not* a synonym for English.

The list goes as follows, for the time being:

1. Natural language is not TOO HARD to be modelled mathematically. We have plenty of models, some better than others. Natural language semantics is all about it, and many researchers have spent decades producing a body of literature that shows the difficulties, but also the progress that has been made.

Of course there's no point in all mathematicians trying to do it. As with any other application of mathematics, some people like it, some don't. But saying that it cannot be done is plainly wrong.

2. Just because I say it's possible to do it, it doesn't mean that I think it's easy.

Ambiguity is not a bug of Natural Language, it's a feature. A feature that evolution has been working on for thousands of years. This is one of the features that makes the subject difficult to model, but it's also why it's so fascinating: we can do wonderful things with words. And all so easily that we take it for granted. Like breathing, we only notice we're doing it when the mechanism somehow has problems.

Finesse and sophistication are necessary. The exercises we give in logic books about formalizing sentences won't cut it in real life. We don't communicate in sentences used to teach 5-year-old kids how to read (that is, unless this is what we're doing). We understand these sentences too, mostly, but the effort to create a formal model of the language needs to go in the direction of all kinds of sentences: from abstruse legal-contract constructions to slang on Twitter.

This has two different implications.

3. We need to deal with intensional phenomena. It's no good saying that we can ditch, say, attitude predicates and only add them as a bonus feature later on. As we argued as a collective (rdc+) in `Entailment, Intensionality and Text Understanding', intensionality, which is widespread in natural language, raises a number of meaning-detection issues that cannot be brushed aside. I will not repeat the arguments here, as I think they are clearly expounded in the paper.


4. Coverage of different types of text is essential. Anyone can build a model that deals with only their ten favorite sentences. That is not what we're talking about when we talk about a model. Models need, in principle, to be able to deal with any sentence we throw at them.

Now, a controversial one.

5. Models need to be compositional. This will not bother my kind of mathematicians, as category theorists do believe that compositionality is key to all modelling we do, but it will be controversial to some. So I will postpone this side of the conversation for a little while.

 

Wednesday, October 13, 2021

Ada Lovelace Day 2021


So this is the 10th anniversary of my Ada Lovelace Day posts, as I started in 2011 with Christine Ladd-Franklin. From algebraic logic to optics and psychology, this honoree is very fashionable nowadays.

Ok, quite a few times I was late with the goods, but this is OK, I reckon. I also varied my own criteria for choosing honorees quite a lot: I started with deceased female mathematicians, thinking that there would be more consensus on them, but then moved on to living people, as consensus is overrated (systemic sexism, anyone?) and other criteria maybe should play a role. So interdisciplinarity has been a favorite criterion. My list so far consists of

Christine Ladd-Franklin (2011), Karen Sparck-Jones (2012), Marta Bunge (2013), Helena Rasiowa (2014), Nyedja Nascimento (2015), Manuela Sobral (2016), Maryam Mirzakhani (2016), Christine Paulin-Mohring (2017), Angela Olinto (2018), Andrea Loparic (2020).

This list is very personal and idiosyncratic, of course. It has three Brazilians, five friends, two category theorists. It's not representative of anything but my own "on the spur of the moment" decisions. But it reflects more than ten years thinking about the issues that make Ada Lovelace Day necessary. As necessary now as it was ten years ago, despite big social movements like #MeToo and #BlackLivesMatter.

One thing I've written a lot about, but that the list does not reflect, is the danger of losing the memory of the fights. Forgetting the plight of the suffragists, forgetting all the sisters that could not get degrees, that worked for free to show women could do the work, the ones that could not stay in academia because the pressure to leave was too huge.


 
Now recently, reading the lovely interview with Hélène Langevin-Joliot reminded me that this fight is not so old.

She's the real granddaughter of Marie Curie and in this interview she says "It's a myth that the Curies sacrificed their lives for Science".

Hélène is 94 years old, but she is still debating, giving interviews, a living link to a past where things were indeed much worse. Like me, like many others much younger, she thought that discussing gender equality was not necessary, that it had been done before. Like us, she discovered that this fight needs to happen every day, all the days. That things will only be good when everyone is treated fairly, when inclusivity prevails. So

Hélène is my Ada Lovelace Day heroine for 2021!

Oh, if you don't read Spanish you might need to use translation software, as Hélène Langevin-Joliot's interview above is only in Spanish, so far. Thanks to Women with Science for making it available.

(I promise to put a version in English here pretty soon!)

Thursday, September 23, 2021

Pots and Pans? Dishes and bowls? NLI!

 


You may think that, after some six papers (all with Katerina and Livy) discussing it, I would never want to see the SICK (Sentences Involving Compositional Knowledge) corpus of Marelli et al. ever again.

But you would be wrong, as the problem of deciding on correct and reliable annotations for a corpus of Natural Language Inference (NLI) that seems at first sight so approachable and sensible is overwhelmingly compelling. We need to be able to solve this problem!

A previous blog post is SICK (dataset) in these trying times.

Our published papers are, in reverse chronological order:

1. Kalouli, A.-L., Buis, A., Real, L., Palmer, M., de Paiva, V. Explaining Simple Natural Language Inference. Proceedings of the 13th Linguistic Annotation Workshop (LAW 2019), 01 August 2019. ACL 2019, Florence, Italy. [PDF]

2. Katerina Kalouli, Dick Crouch, Valeria de Paiva, Livy Real. Graph Knowledge Representations for SICK. Informal Proceedings of the 5th Workshop on Natural Language and Computer Science, Oxford, UK, 08 July 2018.

3. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. WordNet for “Easy” Textual Inferences. Proceedings of the Globalex Workshop, associated with LREC 2018, 08 May 2018. Miyazaki, Japan. [PDF]

4. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Correcting Contradictions. Proceedings of the Computing Natural Language Inference (CONLI) Workshop,  September 2017. [PDF]

5. Aikaterini-Lida Kalouli, Livy Real, Valeria de Paiva. Textual Inference: getting logic from humans. Proceedings of the 12th International Conference on Computational Semantics (IWCS),  September 2017.  [PDF]

(There's still one paper submitted and one being written.)

So I have been thinking about this problem for some four years! And the task that seemed so easy at the beginning is still not done. Why is that? Why is this task so difficult? What would I do differently if I were starting this project now?

The task at hand is to determine if a sentence follows from, contradicts, or is neutral with respect to another given sentence. Thus, using an example from the original Marelli et al. paper, the sentence A sea turtle is hunting for fish entails the sentence A sea turtle is hunting for food. The same sentence contradicts A sea turtle is not hunting for fish and is neutral with respect to A fish is hunting for a turtle in the sea.

Seems simple, no?
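Just to fix the format of the task, here is that example written down as labeled pairs (a toy sketch; the label names are illustrative, not the exact values used in the SICK files):

```python
from enum import Enum

class Label(Enum):
    ENTAILMENT = "entailment"
    CONTRADICTION = "contradiction"
    NEUTRAL = "neutral"

# The Marelli et al. example above, as (premise, hypothesis, label) triples.
pairs = [
    ("A sea turtle is hunting for fish",
     "A sea turtle is hunting for food", Label.ENTAILMENT),
    ("A sea turtle is hunting for fish",
     "A sea turtle is not hunting for fish", Label.CONTRADICTION),
    ("A sea turtle is hunting for fish",
     "A fish is hunting for a turtle in the sea", Label.NEUTRAL),
]

for premise, hypothesis, label in pairs:
    print(f"{premise!r} vs {hypothesis!r} -> {label.value}")
```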

In 2017 we wrote: "The long-term goal [of our work] is to be able to provide conceptual semantics for sentences so that entailment-contradiction-neutrality relations between sentences can be identified." Now, in 2021, this project has been completed to a large extent: we have a system that produces conceptual semantics for sentences, the Graphical Knowledge Representation (GKR) system, and a system that decides inference relations between pairs of sentences, the hybrid natural language inference system Hy-NLI. Both systems are fully operational, are open source and can be played with in the demos at https://cis.lmu.de/~kalouli/resources.html.

But the cleaning up and checking of our gold-standard corpus are not finished. Why not?

There is a collection of reasons, from different sources, and these are discussed in the papers. But here we will try to analyze these reasons a little.

First of all, the systems we produce all have some noise. The corpus we use is not as well constructed as we wanted, the task is not as easily understood as we had hoped, the guidelines we provided do not cover all difficult cases, the annotators are not as clever as we expect them to be, and the systems used to tally results have some issues too.

The task of detecting entailment and contradiction requires minimizing the amount of common sense and world knowledge needed when constructing the corpus: we're mostly trying to measure "linguistic entailment". (To measure encyclopedic knowledge of the world and/or commonsense knowledge of the population we would use different tools -- possibly factoid question answering and/or Winograd pronouns.) We do not want to measure specific knowledge of a domain such as Law or Biology or Mathematics. These have specific vocabulary, while we want to stay with the vocabulary of someone who can read the news.

The corpus SICK

Well, we needed a simple, common-sense corpus that uses concrete and well-known concepts such as cats and dogs and boys and girls and running, eating, playing and sleeping; that is, a corpus with simple sentences about concrete, everyday, common events or activities, with short and direct sentences and grammar. Since SICK was built to contain such concrete, common-sense sentences, it seemed ideal for our kind of testing.

The linguists behind the SICK corpus tried to avoid as much as possible the use of world knowledge and tried to remove any temporal and aspectual issues. They removed most of the named entities and possessive pronouns. (See below for the list of linguistic phenomena that they tried to remove from the corpus.)

They also used a reduced vocabulary of only some 2K words. But the process of creating the pairs of sentences was semi-automatic and, despite claims that all the sentences had been checked by a native speaker, some rather strange sentences do show up in the corpus. The process used to semi-automatically construct sentences leaves a lot to be desired.


Crowdsourcing techniques are very useful for this kind of annotation work. However, the quality and consistency of these annotations can be poor. When looking at the corpus to investigate what humans considered entailment and contradiction, we realized that there were many troublesome annotations. At the time of our first paper, Textual Inference: getting logic from humans (2017), we thought we could simply correct all these wrong annotations.

So what were the problems and the corrections? Why hasn't it worked yet?

Let me list some issues with the task:

1. [referents] There was the original problem with referents, already hinted at in the original SICK paper, A SICK cure for the evaluation of compositional distributional semantic models.

The SICK authors say:
"Not unreasonably, subjects found that, say, A woman is wearing an Egyptian headdress does not contradict A woman is wearing an Indian headdress, since one could easily imagine both sentences truthfully uttered to refer to a single scene where two different women are wearing different headdresses."

We wrote our guidelines to help address this issue, as it seemed to us that annotators were not given enough information on what to consider as referents of expressions. So we came up with a system to try to assign referents in a way that maximizes the relationships between the two sentences of a pair.

But even if we force (via guidelines) annotators to think that we're talking about the same woman, couldn't an Indian headdress be worn over an Egyptian headdress or under it?  

Also, in the example pair A = A fearful little boy is on a climbing wall, B = A fearful little boy is on the ground, again there could be two fearful little boys, one climbing the wall and one on the ground. Since there is no reason to assume that the indefinite little boys were the same, the annotators assumed that A and B are neutral with respect to each other, because the two sentences can be talking about different little boys.

So when do we want to assume that we're talking about the same entities? We needed specific instructions and thought detailed guidelines would solve the issue.

2. [asymmetric contradictions] Then the first big issue that we pointed out was the problem with contradictions. Contradictions in logic are symmetric: if proposition A is contradictory to B, then B must be contradictory to A. This is not what happens with the annotations of SICK. Processing the corpus, we could see that we have many asymmetric pairs:

8 pairs AeBBcA, meaning that A entails B, but B contradicts A;
327 pairs AcBBnA, meaning that A contradicts B, but B is neutral with respect to A;
276 pairs AnBBcA, meaning that A is neutral with respect to B, but B contradicts A.

In total, 611 pairs out of 9840 are annotated in a way that logically does not make sense. These may seem few (about 6% of the pairs), but since the sentences chosen ought to describe simple, common-sense situations, and since these wrong annotations are not even self-consistent, this is a cause for concern.
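A consistency check like this is easy to write down. A minimal sketch, assuming the annotations have been loaded into a dict keyed by ordered (premise, hypothesis) pairs with toy labels 'e', 'c', 'n' (this encoding is ours for illustration, not the SICK file format):

```python
# label[(A, B)] is the relation annotated with A as premise and B as hypothesis.
label = {
    ("s1", "s2"): "c",
    ("s2", "s1"): "n",   # asymmetric: contradiction one way, neutral the other
    ("s3", "s4"): "e",
    ("s4", "s3"): "e",
}

def asymmetric_contradictions(label):
    """Return pairs where one direction says 'c' but the reverse direction does not."""
    bad = []
    for (a, b), rel in label.items():
        back = label.get((b, a))
        if back is None or b < a:          # skip missing reverses and duplicates
            continue
        if (rel == "c") != (back == "c"):  # contradiction should be symmetric
            bad.append((a, b, rel, back))
    return bad

print(asymmetric_contradictions(label))    # [('s1', 's2', 'c', 'n')]
```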

We didn't expect non-symmetric contradictions, but we found plenty of them. In our second paper, Correcting Contradictions, we were still thinking of manually correcting the whole corpus. But we realized that without strict criteria for when contexts were close enough, we could not decide on contradictions. We follow Zaenen et al. and de Marneffe et al., who argue that the events or entities of a sentence pair need to be coreferent if the pair is to be deemed a contradiction. Thus we established a guideline: try to assume that the pairs are talking about the same event and entities, no matter what definite or indefinite determiners are involved (subject to some restrictions). To assume that the entities and the events are coreferent is not so easy, especially in cases where the two sentences are very distant from each other in meaning. But when the two sentences are closer in meaning it can also be tricky.

3. [non-grammatical] We discovered, when investigating the problematic non-symmetric pairs, that we have some issues with ungrammatical sentences and with nonsense sentences. With a sentence like The black and white dog isn’t running and there is no person standing behind, depending on how you correct it to make it grammatical, different inference relations may appear. With a sentence like A motorcycle is riding standing up on the seat of the vehicle, the annotators had to use quite a lot of imagination to try to make sense of the sentence.

But there are perfectly formed sentences that do not make much sense either (for instance A = A large green ball is missing a potato, B = A large green ball is hitting a potato). One can always (with imagination and effort) construct a scenario where the sentences fit. If one reads only the first sentence, one might think of recipes and cooking (the ball of dough is missing a potato: strange, but possible?), but the second sentence seems to indicate some game, a bit like bowling with potatoes instead of bowling balls. But the point is not to show great leaps of imagination; it is to be as boring as possible and stay as close as possible to the usual, commonsense meanings.

And thinking of common sense, YouTube changed some of the rules. I'd expect a sentence about "hamsters singing" or "a monkey practising martial arts" to be nonsense, but apparently they are more common than one might think.

4. [lexical issues] We expected different annotators to have different interpretations of words. We thought we should recommend a dictionary, and we ourselves used Princeton WordNet. PWN is large, but it doesn't have everything. In particular it doesn't have neologisms like "ATV" for all-terrain vehicle or "seadoo" for a kind of jet-powered aqua scooter, or even "shirtless", which is not a new word. It also does not have newer sports like "snowboard, kickbox, wakeboard" or even "jet ski", the verb for riding a seadoo. And it doesn't have verbs made with prefixes or suffixes, like "unstitch" or "debone".

Other concepts of SICK cannot be found in PWN simply because of tokenization issues. WordNet lists "fistfight" instead of "fist fight", and "ping-pong" instead of "ping pong", for instance.

 Princeton WordNet is well-known for not having as many MWEs (multiword expressions) as one might need for processing English. Verbs like "break dance" (no physical break), "rock climb" and nouns like "slip and slide" (a water toy for kids) confuse parsers and humans alike. 

Out of the 435 unique compound nouns found in our processing of SICK, only 84 are included in PWN. (These numbers could certainly be improved now that there is a GitHub version of PWN: one could import all the MWEs from Wikipedia or from Wiktionary, for example.)
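For the curious, this kind of coverage check is simple to reproduce with NLTK's interface to Princeton WordNet (a rough sketch of my own, not the pipeline from the papers; it assumes the WordNet data has already been fetched with nltk.download('wordnet')):

```python
# Check whether multiword expressions have any entry in Princeton WordNet.
from nltk.corpus import wordnet as wn

compounds = ["fist fight", "ping pong", "slip and slide", "break dance", "jet ski"]

for term in compounds:
    candidates = [
        term.replace(" ", "_"),  # WordNet's usual MWE spelling, e.g. "fist_fight"
        term.replace(" ", ""),   # concatenated, e.g. "fistfight"
        term.replace(" ", "-"),  # hyphenated, e.g. "ping-pong"
    ]
    found = any(wn.synsets(c) for c in candidates)
    print(f"{term!r}: {'found in PWN' if found else 'not found in PWN'}")
```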

There are also pairs of sentences involving meronymy relations and precisely what specific nouns are made of. A representative pair has first sentence A dog is running on the beach and chasing a ball and second sentence A dog is running on the sand and chasing a ball. Now, some beaches do not have much sand, but still people do associate beach with sand.

Our paper WordNet for “Easy” Textual Inferences has many other cases of lexical similarity that are not really logical. For example, consider the pair A = A dog is licking a toddler, B = A dog is licking a baby. Toddlers are not babies, the words are not synonyms, but they are similar enough that people will use them as if they were synonyms in certain contexts. They will not if the context is the medication to give to a baby or toddler, but when describing photos the terms can be seen as very similar. These similarity cases are interesting, as they prompt the question of how this kind of information should be encoded in lexical resources, if at all.

But another issue is the subsumption of lexical senses: it is clear to everyone that "a river is a very big stream and a stream a very small river", but which one is the generic one that goes in the ontology? The same question can be asked of pots and pans: are pots tall pans with short handles, or are pans medium-size kinds of pots with a single, long handle? Are bowls dishes (i.e. containers) that are deep? Or are dishes bowls that are shallow? Are girls women that are young? Or are women old girls? Different annotators will have different views.

5. [logical issues] We expected the logical problems that most humans have with reasoning: people tend to confuse implications with bi-implications. Hence in our first paper we investigated the entailments of the form AeBBnA. We found the expected gender biases (bikers and climbers are men only, etc.) and some of the expected confusions between biconditionals and conditionals.

Through all the work described here, we have followed principles described in WordNet for “Easy” Textual Inferences:

(i) We believe that the task of inference can and should be broken down to “easy” inferences like these ones and that therefore it is of great importance to have trustworthy, high-coverage resources that can solve big parts of them.

(ii) Lexical resources  should always be expanded and then further supported by other state-of-the-art techniques such as word embeddings.

(iii) Evaluating lexical resources is a time-consuming task, mainly because we need to find appropriate test data which should on the one hand efficiently test the coverage of the resources themselves and on the other hand originate from real NLP scenarios that bring to light the whole complexity of language and thus the challenging cases. 


 

Aligning predicate-argument structures to decide on referents for contradictions is not an easy task. But we thought we had enough explanations and guidelines that we would be able to extract the kind of information we wanted from linguistics students. Our work with Martha Palmer and Annebeth Buis on Explaining Simple Natural Language Inference showed otherwise.

 

The NLI task (Natural Language Inference), also known as Recognizing Textual Entailment (RTE) (Dagan et al., 2006), is the task of defining the semantic relation between a premise text p and a conclusion text c. Premise p can a) entail, b) contradict or c) be neutral with respect to conclusion c. The premise p is taken to entail conclusion c when a human reading p would infer that c is *most probably* true (Dagan et al., 2006).

This notion of “human reading” assumes human common sense and common background knowledge, two conditions difficult to pin down. But this means that a successful automatic NLI system is a suitable evaluation measure for real natural language understanding, as discussed by Condoravdi et al. (2003) and others.

A conclusion from the paper above concerns the annotation procedure: having an inference label is not enough; knowing why a human subject decides that an inference is an entailment or a contradiction is useful information that we should also be collecting. Thus we tried an experiment where we (hope to have) provided uncontroversial, clear guidelines and gave annotators the chance to justify their decisions.

Our goal was to evaluate the guidelines based on the resulting agreement rates and gain insights into the NLI annotation task by collecting the annotators’ comments on the annotations. 

 

Apart from the inference relation and the justification, the annotators were also asked to give a score from 0 to 10 for what we would like to call “computational feasibility” (CF), i.e. their estimate of the likelihood of an automatic system getting the inference right.
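So each annotated pair carries three pieces of information besides the sentences themselves. A sketch of the record, with field names of my own choosing just for illustration (the actual annotation interface and file format differed):

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    premise: str
    hypothesis: str
    relation: str        # "entailment", "contradiction" or "neutral"
    justification: str   # free-text reason given by the annotator
    cf_score: int        # "computational feasibility": 0-10 estimate that a system gets it right

example = Annotation(
    premise="A sea turtle is hunting for fish",
    hypothesis="A sea turtle is hunting for food",
    relation="entailment",
    justification="fish is a kind of food; same turtle, same event",
    cf_score=8,
)
print(example)
```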

 

This led to some conclusions. We discovered some linguistic phenomena that are hard for humans to annotate and showed that these do not always coincide with what is assumed to be difficult for automatic systems. We need to start devising corpora based on a notion of human inference that includes some inherent variability, and we need to find appropriate methods to train our systems on such data and measure their performance on them.

We showed that it is essential to include a justification method in similar annotation tasks, as a suitable way of checking the guidelines and improving the training and evaluation processes of automatic systems towards explainable AI.

 

Summarizing: over these four years and five papers we investigated shortcomings of the task and of the datasets for Natural Language Inference, through the lens of the corpus SICK. The idea of a simplified corpus, with text based on human-created captions of everyday activities, was definitely a powerful one. Its implementation, especially the semi-automatic generation of new pairs via transformations of a kernel set of sentences, left something to be desired. It created some nonsensical sentences, which made the already difficult task of measuring agreement between humans even harder than it started out.

Clear guidelines, with examples, are necessary to annotate this style of corpus. But as we argue, they are not enough. Even a single annotator might be more strict one day, more relaxed another (what's the difference between cutting and slicing mushrooms, after all?). Different annotators will have different levels of precision when using the language, depending not only on their vocabulary, but also on the aspects of information they want to convey. Different linguistic phenomena might need special ways of measuring. Justifications for annotations seem a great thing to have, but one has to deal with the logistic problem of obtaining them. To quote Slav Petrov's blog post, "our work is still cut out for us!"