Sunday, August 9, 2020

Understanding Portuguese

(Illustration by Jana Walczyk)

I am told that it might be possible to find students and programmers to develop an old dream of mine: a project about producing logic from texts in Portuguese. I have been giving talks about this project since 2010 (paper from 2011). So I want to explain, to possible volunteers, what this project entails, what work we should be doing, and why.

Explaining why we should be doing this work is very easy. 

The amount of information published in scientific articles, preprints, news, blog posts, and fiction, as well as unstructured data, has increased many-fold in the last few years. A major bottleneck in the discovery of relevant information, for businesses and researchers alike, arises when connecting new results with the previously established state of the art. A potential solution to this problem is to transform the unstructured raw text of the new information into structured database entries, which would allow us to reason with it in the same way that one already organizes and reasons with the previous content, using Knowledge Graphs. This would allow programmatic querying of the content, checking it for contradictions, checking for new changes, as well as all manner of analytics. The fact that one can do most of this processing in English, but not in Portuguese (or, for that matter, in many other languages), should be a reason for concern. Brazilian science, as well as Brazilian industry, cannot progress as well as others if our native language is not processed as well as others.
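To make "structured database entries" a little more concrete, here is a minimal sketch of a Knowledge Graph stored as subject-predicate-object triples, with a pattern query and a very naive contradiction check. Everything in it is an assumption made for illustration: the class `TinyKnowledgeGraph`, the predicate names `capital` and `língua_oficial`, and the idea that a contradiction is simply two different values for a predicate declared single-valued. A real system would use a proper graph store and a much richer notion of inconsistency.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TinyKnowledgeGraph:
    """A toy in-memory triple store; all names here are illustrative."""

    def __init__(self, functional_predicates: Iterable[str] = ()) -> None:
        self.triples: Set[Triple] = set()
        # Predicates assumed to allow only one object per subject.
        self.functional: Set[str] = set(functional_predicates)

    def add(self, subj: str, pred: str, obj: str) -> None:
        self.triples.add((subj, pred, obj))

    def query(self, subj: Optional[str] = None, pred: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return all triples matching the pattern; None is a wildcard."""
        pattern = (subj, pred, obj)
        return sorted(
            t for t in self.triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )

    def contradictions(self) -> List[Tuple[str, str, List[str]]]:
        """Report subjects with several values for a functional predicate."""
        values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
        for subj, pred, obj in self.triples:
            if pred in self.functional:
                values[(subj, pred)].add(obj)
        return [
            (subj, pred, sorted(objs))
            for (subj, pred), objs in sorted(values.items())
            if len(objs) > 1
        ]


kg = TinyKnowledgeGraph(functional_predicates={"capital"})
# Previously established content.
kg.add("Brasil", "capital", "Brasília")
kg.add("Brasil", "língua_oficial", "português")
# A fact hypothetically extracted from a new (outdated) text.
kg.add("Brasil", "capital", "Rio de Janeiro")

print(kg.query(subj="Brasil", pred="língua_oficial"))
print(kg.contradictions())
```

The hard part of the project is not this storage layer, of course, but producing trustworthy triples from Portuguese text in the first place.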

Semantic Parsing Portuguese

Now, explaining exactly what the work on a semantic parser for Portuguese amounts to is somewhat harder. The project of transforming unstructured text into knowledge is very hard: language is far too ambiguous and difficult to deal with. While many open-source tools and resources for processing English texts exist, very few can be used for Portuguese. So we describe in parallel what we do have for English and what we need to build for Portuguese.

The project of extracting semantic information from English sentences is very hard. Our best shot can be seen at the moment in the preliminary demo. This prototype, developed by Katerina Kalouli and Dick Crouch, revisits ideas developed when I worked with Crouch at Xerox PARC, but re-implements them from scratch, using new technologies in place of all the software that is proprietary technology of either Xerox PARC or Microsoft. (There is a paper explaining the system and a version showing how this can be hybridized with machine learning systems.)

This new semantic parser project has a pipeline that depends on several other open-source projects: we discuss these several "steps" below. 

Steps for Semantic Parser in Portuguese

Semantic parsers for English abound, but we are following a specific line of work that starts with Daniel Bobrow and Ronald Kaplan at PARC.

1. Grammatical parsing is improving every year. A recent development is the new Stanford system called "Stanza". Stanza is multilingual and includes Portuguese; it is written in Python and has a better (less restrictive) license than the previous Stanford CoreNLP systems. We need to fine-tune it for our experiments.

2. The semantic parser we have in English depends on the grammatical parsing of sentences using the Stanford-Google based project "Universal Dependencies". Actually, it uses "enriched dependencies"; we need to check how they behave for Portuguese.

The Universal Dependencies project has been going on since 2016. It already has a Portuguese branch, with which I am associated through my work with Alexandre Rademaker and Livy Real, but the corpus we have in Portuguese is small and there are still many issues with the Portuguese Universal Dependencies. The corpus needs expanding, possibly with some annotation effort to increase its size.

3. The semantic parser also depends essentially on Princeton WordNet. Building up the Portuguese version of the WordNet thesaurus and dictionary has been a much harder task than we had anticipated, but our system (for browsing and downloading) has been in operation since 2012; here's the original description. It is still being constructed, or is "in progress", but it is getting close to the end of its first (translation only) phase.

4. The semantic parser also depends on a tool for disambiguation. We have been using JIGSAW (available from GitHub), but it has not been updated since 2012, and it will not work for Portuguese. We need a disambiguation tool for Portuguese that can be plugged into this pipeline.

5. The system also depends on a generic upper ontology, for which we are using SUMO in English. But an upper ontology is not enough to provide the world knowledge necessary for our applications. Expanding SUMO into an appropriate ontology for Brazilian culture, a Knowledge Graph for Brazil and its different facets (be they history, culture, geology, tourism, etc.), is another major undertaking.

6. Finally, we need a reasoner on top of the representations that the semantic parser produces. This could be an off-the-shelf system like Lean or Isabelle, or it could be an NLI (Natural Language Inference) system like the ones produced via neural nets and/or hybrid methods described in this SEMEVAL meeting special issue proceedings.
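As a small illustration of what steps 1 and 2 hand over to the rest of the pipeline, here is a sketch that reads one sentence in CoNLL-U, the tab-separated format in which Universal Dependencies treebanks are distributed, and turns it into head-relation-dependent triples. Several things here are my assumptions and not part of the actual system: the names `Token`, `parse_conllu_sentence` and `dependency_triples` are made up, the sample annotation of "Maria leu o livro." was written by hand and not taken from any treebank, and I am assuming that the "enriched dependencies" mentioned in step 2 would live in the DEPS column, which is where CoNLL-U keeps its enhanced dependency graph.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The ten CoNLL-U columns, in the order used by Universal Dependencies files.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str
    enhanced: List[Tuple[str, str]]  # (head id, relation) pairs from DEPS


def parse_conllu_sentence(block: str) -> List[Token]:
    """Parse one sentence in CoNLL-U format into a list of tokens."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = dict(zip(FIELDS, line.split("\t")))
        if not cols["id"].isdigit():
            continue  # skip multiword ranges ("1-2") and empty nodes ("1.1")
        enhanced = []
        if cols["deps"] != "_":
            for item in cols["deps"].split("|"):
                head, rel = item.split(":", 1)
                enhanced.append((head, rel))
        tokens.append(Token(int(cols["id"]), cols["form"], cols["lemma"],
                            cols["upos"], int(cols["head"]), cols["deprel"],
                            enhanced))
    return tokens


def dependency_triples(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Return (head lemma, relation, dependent lemma) for every token."""
    lemma_of = {tok.id: tok.lemma for tok in tokens}
    lemma_of[0] = "ROOT"  # head 0 marks the root of the sentence
    return [(lemma_of[tok.head], tok.deprel, tok.lemma) for tok in tokens]


# Hand-written, illustrative annotation (spaces are converted to tabs below).
_ROWS = """
# text = Maria leu o livro.
1 Maria Maria PROPN _ _ 2 nsubj 2:nsubj _
2 leu ler VERB _ _ 0 root 0:root _
3 o o DET _ _ 4 det 4:det _
4 livro livro NOUN _ _ 2 obj 2:obj SpaceAfter=No
5 . . PUNCT _ _ 2 punct 2:punct _
"""
SAMPLE = "\n".join("\t".join(row.split()) for row in _ROWS.strip().splitlines())

tokens = parse_conllu_sentence(SAMPLE)
for triple in dependency_triples(tokens):
    print(triple)
```

Skipping the multiword ranges matters for Portuguese in particular, since contractions such as "do" (de + o) are annotated that way. The later steps (WordNet lookup, disambiguation, ontology mapping) would then attach senses and concepts to the lemmas in these triples.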

I need to emphasize that these steps can be done in any scientific or commercial field that one is interested in. We could do it for History, Chemistry, or Mathematics, for example. We could do it to help integrate IoT (Internet of Things) appliances or to help design automated customer service systems. Of course, an application to dialogue will require a further module, a dialogue manager, which orchestrates the possible conversations and actions of the automated system. The different domains should correspond to different Knowledge Graphs.

However, each one of these steps is a considerable amount of work, possibly worth a master's thesis, or maybe even a PhD. Putting them all together should also be a major engineering feat. I hope we will find people willing to take up this challenge.
