Monday, August 31, 2020

What did you do in your Summer vacation?



This is kind of silly. It would make more sense to have this list hanging off the homepage of OWN-PT itself, and maybe it will get there soon, but for the time being I need a list of our publications about OpenWordnet-PT and so this is it. Mostly copied from Alexandre's list of publications (thanks for being so organized, Alexandre!!!).

  1. de Paiva, Valeria, and Alexandre Rademaker. 2012. “Revisiting a Brazilian WordNet.” In Proceedings of Global Wordnet Conference. Matsue: Global Wordnet Association.
  2. de Paiva, Valeria, Alexandre Rademaker, and Gerard de Melo. 2012. “OpenWordNet-PT: An Open Brazilian Wordnet for Reasoning.” In Proceedings of COLING 2012: Demonstration Papers, 353–60. Mumbai, India: The COLING 2012 Organizing Committee. http://www.aclweb.org/anthology/C12-3044.
  3. Rademaker, Alexandre, Valeria de Paiva, Gerard de Melo, Livy Real, and Maira Gatti. 2014. “OpenWordNet-PT: A Project Report.” In Proceedings of the 7th Global WordNet Conference, edited by Heili Orav, Christiane Fellbaum, and Piek Vossen. Tartu, Estonia. http://globalwordnet.org/global-wordnet-conferences-2/
  4. Real, Livy, Alexandre Rademaker, Valeria de Paiva, and Gerard de Melo. 2014. “Embedding NomLex-BR Nominalizations into OpenWordnet-PT.” In Proceedings of the 7th Global WordNet Conference, edited by Heili Orav, Christiane Fellbaum, and Piek Vossen, 378–82. Tartu, Estonia. http://globalwordnet.org/global-wordnet-conferences-2/
  5. de Paiva, Valeria, Livy Real, Alexandre Rademaker, and Gerard de Melo. 2014. “NomLex-PT: A Lexicon of Portuguese Nominalizations.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), edited by Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis. Reykjavik, Iceland: European Language Resources Association (ELRA).
  6. Freitas, Cláudia, Valeria de Paiva, Alexandre Rademaker, Gerard de Melo, Livy Real, and Anne de Araujo Correia da Silva. 2014. “Extending a Lexicon of Portuguese Nominalizations with Data from Corpora.” In Computational Processing of the Portuguese Language, 11th International Conference, PROPOR 2014, edited by Jorge Baptista, Nuno Mamede, Sara Candeias, Ivandré Paraboni, Thiago A. S. Pardo, and Maria das Graças Volpe Nunes. São Carlos, Brazil: Springer.
  7. de Paiva, Valeria, Cláudia Freitas, Livy Real, and Alexandre Rademaker. 2014. “Improving the Verb Lexicon of OpenWordnet-PT.” In Proceedings of Workshop on Tools and Resources for Automatically Processing Portuguese and Spanish (ToRPorEsp), edited by Laura Alonso Alemany, Muntsa Padró, Alexandre Rademaker, and Aline Villavicencio. São Carlos, Brazil: Biblioteca Digital Brasileira de Computação, UFMG, Brazil. http://www.lbd.dcc.ufmg.br/bdbcomp/servlet/Evento?id=755.
  8. de Paiva, Valeria, Dário Oliveira, Suemi Higuchi, Alexandre Rademaker, and Gerard de Melo. 2014. “Exploratory Information Extraction from a Historical Dictionary.” In IEEE 10th International Conference on e-Science (e-Science), 2:11–18. IEEE. https://doi.org/10.1109/eScience.2014.50.
  9. Oliveira, Hugo Gonçalo, Valeria de Paiva, Cláudia Freitas, Alexandre Rademaker, Livy Real, and Alberto Simões. 2015. “As Wordnets Do Português.” Oslo Studies in Language 7 (1): 397–424
  10. Rademaker, Alexandre, Dário Augusto Borges Oliveira, Valeria de Paiva, Suemi Higuchi, Asla Medeiros e Sá, and Moacyr Alvim. 2015. “A Linked Open Data Architecture for the Historical Archives of the Getulio Vargas Foundation.” International Journal on Digital Libraries 15 (2-4): 153–67. https://doi.org/10.1007/s00799-015-0147-1
  11. Real, Livy, Fabricio Chalub, Valeria de Paiva, Claudia Freitas, and Alexandre Rademaker. 2015. “Seeing Is Correcting: Curating Lexical Resources Using Social Interfaces.” In Proceedings of 53rd Annual Meeting of The Association for Computational Linguistics and The 7th International Joint Conference on Natural Language Processing of Asian Federation of Natural Language Processing - Fourth Workshop on Linked Data in Linguistics: Resources and Applications (LDL 2015). Beijing, China.
  12. de Paiva, Valeria, Livy Real, Hugo Gonçalo Oliveira, Alexandre Rademaker, Cláudia Freitas, and Alberto Simões. 2016. “An Overview of Portuguese WordNets.” In Global Wordnet Conference 2016. Bucharest, Romania.
  13. Real, Livy, Valeria de Paiva, Fabricio Chalub, and Alexandre Rademaker. 2016. “Gentle with Gentilics.” In Joint Second Workshop on Language and Ontologies (LangOnto2) and Terminology and Knowledge Structures (TermiKS) (Co-Located with LREC 2016). Slovenia.
  14. Chalub, Fabricio, Livy Real, Alexandre Rademaker, and Valeria de Paiva. 2016. “Semantic Links for Portuguese.” In 10th Edition of the Language Resources and Evaluation Conference (LREC). Portoroz, Slovenia.
  15. de Paiva, Valeria, Fabricio Chalub, Livy Real, and Alexandre Rademaker. 2016. “Making Virtue of Necessity: a Verb Lexicon.” In PROPOR – International Conference on the Computational Processing of Portuguese. Tomar, Portugal.
  16. Rademaker, Alexandre, Valeria de Paiva, Fabricio Chalub, Livy Real, and Claudia Freitas. 2016. “Introducing OpenWordnet-PT: an Open Portuguese Wordnet for Reasoning.” In International FrameNet Workshop Part of 9th International Conference on Construction Grammar (ICCG9), edited by Tiago Timponi Torrent. Juiz de Fora, Brazil: Universidade Federal de Juiz de Fora - UFJF.
  17. Rademaker, Alexandre, Fabricio Chalub, Livy Real, Cláudia Freitas, Eckhard Bick, and Valeria de Paiva. 2017. “Universal Dependencies for Portuguese.” In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling), 197–206. Pisa, Italy.
  18. Muniz, Henrique, Fabricio Chalub, Alexandre Rademaker, and Valeria de Paiva. 2018. “Extending Wordnet to Geological Times.” In Global Wordnet Conference 2018. Singapore.
  19. Real, Livy, Alexandre Rademaker, Fabricio Chalub, and Valeria de Paiva. 2018. “Towards Temporal Reasoning in Portuguese.” In Proceedings of 6th Workshop on Linked Data in Linguistics. Miyazaki, Japan. http://lrec-conf.org/workshops/lrec2018/W23/summaries/8_W23.html.
  20. de Paiva, Valeria, Alexandre Rademaker, Livy Real, Fabricio Chalub, and Gerard de Melo. 2018. “OpenWordNet-PT: Taking Stock.” Proceedings of Fifth Workshop on Natural Language and Computer Science (Affiliated with Federated Logic Conference 2018). Oxford, UK. https://doi.org/10.29007/tvgw
  21. Cid, Alessandra, Alexandre Rademaker, Bruno Cuconato, and Valeria de Paiva. 2018. “Linguistic Legal Concept Extraction in Portuguese.” In Legal Knowledge and Information Systems, edited by Monica Palmirani. Vol. 313. Frontiers in Artificial Intelligence and Applications. http://ebooks.iospress.nl/volumearticle/50848
  22. de Paiva, Valeria, and Alexandre Rademaker. 2019. “Portuguese Manners of Speaking.” In Proceedings of the 10th Global Wordnet Conference. Global Wordnet Association.


Sunday, August 30, 2020

Wildfires near us

This has been a difficult week: heatwave, pandemic, possible power cuts and, to cap it all, wildfires! The possible need to evacuate, the need to prepare go-bags, to find documents and photos. Our bags are still sitting by the porch. The wildfires are only 40% contained as I type this. Very unsettling! And we're the lucky ones: the ones who did not have to evacuate, who only had to consider it.

It took quite a lot of determination to keep making jokes about it. Like the one I made on Facebook, stealing from someone I don't know on Twitter.

"It’s raining ash in California, forcing us to wear a different kind of mask than we wear for the pandemic when we go buy the generator we need for either rolling blackouts or preemptive outages so we can work from home if we haven’t been evacuated, if we have work or our house hasn't burned down. well, I don't need a generator, I just need some more wine to keep it going!"


But all is fine, thank you! The heatwave has gone away, the air is clear again, there's wine in the house and I've managed to chair sessions at AiML and WBL, and watch lots of interesting talks. I even managed to produce slides and talk for some 20 min on Friday. So we're back to the old worries about all the deadlines missed and all the work not yet delivered. But hey, this is normal! Wildfires no, they're not normal.

And this picture is from 2016, not now, but somehow it is even more frightening that I had no idea it was happening!


Sunday, August 9, 2020

Understanding Portuguese

(Illustration by Jana Walczyk)

It might be possible to find students and programmers to develop an old dream of mine, I am told. This dream is a project about producing logic from texts in Portuguese. I have been giving talks about this project since 2010 (paper from 2011). Thus I want to explain (to possible volunteers) what this project entails, what work we should be doing, and why.

Explaining why we should be doing this work is very easy. 

The amount of information published in scientific articles, preprints, news, blog posts, fiction, as well as unstructured data has increased many-fold in the last few years. A major bottleneck in the discovery of relevant information for businesses and researchers alike arises when connecting new results with the previously established state of the art. A potential solution to this problem is to transform the unstructured raw text of the new information into structured database entries, which would allow us to reason with this new information in the same way that one already organizes and reasons with the previous content, using Knowledge Graphs. This would allow programmatic querying of the content, checking it for contradictions, checking for new changes, as well as all manner of analytics of this content. The fact that one can do most of this processing in English, but not in Portuguese (or, for that matter, in many other languages) should be a reason for concern. Brazilian science, as well as its industry, cannot progress as well as others if our native language is not processed as well as others.
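To make "programmatic querying" and "checking for contradictions" a little more concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption: the triples are made-up examples of what a parser might extract, and the names `query`, `contradictions` and `UNIQUE_SUBJECT` are hypothetical, not part of any existing tool.

```python
from collections import defaultdict

# Hypothetical facts, as a parser might extract them from different texts.
triples = {
    ("Brasília", "capital_of", "Brazil"),
    ("Brazil", "located_in", "South America"),
    ("Rio de Janeiro", "capital_of", "Brazil"),  # e.g. from an outdated text
}


def query(triples, subject=None, predicate=None, obj=None):
    """Return all triples matching the given fields (None is a wildcard)."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Predicates assumed to allow only one subject per object (illustrative choice).
UNIQUE_SUBJECT = {"capital_of"}


def contradictions(triples):
    """Group subjects that compete for the same (predicate, object) slot."""
    seen = defaultdict(set)
    for s, p, o in triples:
        if p in UNIQUE_SUBJECT:
            seen[(p, o)].add(s)
    return {slot: sorted(subjects) for slot, subjects in seen.items() if len(subjects) > 1}


print(query(triples, predicate="located_in"))
print(contradictions(triples))
```

A real Knowledge Graph would of course use a proper store and a richer notion of contradiction; the point is only that once the text has become structured entries, such checks are a few lines of code.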

Semantic Parsing Portuguese

Now, explaining exactly what the work on a semantic parser for Portuguese amounts to is somewhat harder. The project of transforming unstructured text into knowledge is very hard: language is way too ambiguous and difficult to deal with. While many open-source tools and resources for processing English texts exist, very few can be used for Portuguese. So we describe in parallel what we do have for English and what we need to build for Portuguese.

The project of extracting semantic information from English sentences is very hard. Our best shot can be seen at the moment in the preliminary demo. This prototype, developed by Katerina Kalouli and Dick Crouch, goes over ideas developed when I worked with Crouch at Xerox PARC, but re-implements these ideas from scratch, replacing all software that is proprietary to either Xerox PARC or Microsoft with new technologies. (There is a paper explaining the system and a version showing how this can be hybridized with machine learning systems.)

This new semantic parser project has a pipeline that depends on several other open-source projects: we discuss these several "steps" below. 

Steps for Semantic Parser in Portuguese

Semantic parsers for English abound, but we are following a specific line of work that starts with Daniel Bobrow and Ronald Kaplan at PARC.

1. Grammatical parsing is improving every year. A recent development is the new Stanford system called "Stanza". Stanza is multilingual and includes Portuguese; it is written in Python and has a better (less restrictive) license than the previous Stanford CoreNLP systems. We need to fine-tune it for our experiments.

2. The semantic parser we have in English depends on the grammatical parsing of sentences using the Stanford-Google based project "Universal Dependencies". Actually, it uses "enriched dependencies", and we need to check how they behave for Portuguese.

The Universal Dependencies project has been going on since 2016. It already has a branch in Portuguese, with which I am associated through my work with Alexandre Rademaker and Livy Real, but the corpus we have in Portuguese is small and there are still many issues with the Portuguese Universal Dependencies. These need expanding, and possibly some annotation effort to increase the size of the corpus.

3. The semantic parser also depends essentially on Princeton WordNet. Building up the Portuguese version of the WordNet thesaurus and dictionary has been a much harder task than we had anticipated, but our system (for browsing and downloading) has been in operation since 2012 (here's the original description). It is still under construction, or "in progress", but it is getting close to the end of its first (translation only) phase.

4. The semantic parser also depends on some tool for disambiguation. We have been using JIGSAW (available from GitHub), but it has not been updated since 2012 and it will not work for Portuguese. We need a tool for the disambiguation of Portuguese that can be plugged into this pipeline.

5. The system also depends on a generic upper ontology, for which we are using SUMO in English. But an upper ontology is not enough to provide the world knowledge necessary for our applications. The project of expanding SUMO into an appropriate ontology for Brazilian culture, a Knowledge Graph for Brazil and its different facets (be they history, culture, geology, tourism, etc.), is another major undertaking.

6. Finally, we need a reasoner on top of the representations that the semantic parser produces. This could be an off-the-shelf system like Lean or Isabelle, or it could be an NLI (Natural Language Inference) system like the ones produced via neural nets and/or hybrid methods described in this SEMEVAL meeting special issue proceedings.
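The six steps above can be read as a pipeline in which each stage consumes the layers produced by the earlier ones. Here is a minimal sketch of that shape in Python. It assumes nothing about the real tools: every stage is a hypothetical stub that only records a placeholder, and the names (`Analysis`, `make_stub`, `PIPELINE`, `run`, and the layer names) are invented for illustration. The idea is just that each stub could later be swapped for a real component (Stanza, a WordNet lookup, a disambiguation tool, and so on) without touching the others.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence


@dataclass
class Analysis:
    """A sentence plus the annotation layers added so far."""
    text: str
    layers: Dict[str, str] = field(default_factory=dict)


Stage = Callable[[Analysis], Analysis]


def make_stub(name: str, requires: Sequence[str]) -> Stage:
    """Build a placeholder stage; a real tool would replace its body."""
    def stage(analysis: Analysis) -> Analysis:
        missing = [layer for layer in requires if layer not in analysis.layers]
        if missing:
            raise ValueError(f"{name} needs missing layer(s): {missing}")
        analysis.layers[name] = f"<{name} for: {analysis.text}>"
        return analysis
    return stage


# Hypothetical stage names, one per step in the list above.
PIPELINE: List[Stage] = [
    make_stub("dependencies", []),                       # steps 1-2: parsing
    make_stub("wordnet_senses", ["dependencies"]),       # step 3: WordNet lookup
    make_stub("disambiguation", ["wordnet_senses"]),     # step 4: pick a sense
    make_stub("ontology_concepts", ["disambiguation"]),  # step 5: map to ontology
    make_stub("inference", ["ontology_concepts"]),       # step 6: reasoning
]


def run(text: str, stages: Sequence[Stage] = PIPELINE) -> Analysis:
    """Apply the stages in order, threading one Analysis through them."""
    analysis = Analysis(text)
    for stage in stages:
        analysis = stage(analysis)
    return analysis


result = run("O gato dorme.")
print(list(result.layers))
```

Skipping a stage (say, `run("O gato dorme.", PIPELINE[1:])`) raises a `ValueError`, which is a cheap way of making the dependencies between the steps explicit.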

I need to emphasize that these steps can be done in any scientific or commercial field that one is interested in. We could do it for History, Chemistry, or Mathematics, for example. We could do it to help integrate IoT (Internet of Things) appliances or to help design customer service automated systems. Of course, an application to dialogue will require a further module, a dialogue manager, which orchestrates the possible conversations and actions of the automated system. The different domains should correspond to different Knowledge Graphs.

However, each one of these steps is a considerable amount of work, possibly worth a master's thesis, or maybe even a PhD. Putting them all together should also be a major engineering feat. I hope we will find people willing to take up this challenge.