(Reposted from the Topos Institute blog, by Valeria de Paiva and Jacob Collard [originally posted Wednesday, 16 Nov 2022])

Humans are very good at recognizing the words they do not know, the concepts they haven’t met yet. To help with these they invented dictionaries, glossaries, encyclopedias, wikipedias, crib sheets, etc. Initially, they did it via laborious manual work, more recently through automated construction techniques. Natural language processing (NLP) tools have improved impressively in the last decade. Most of this incredible improvement happened to newspaper text and hence named entities of the kind found in news text can be detected and classified (usually into people, organizations, places and dates) with much greater accuracy than in the recent past.

Dr Johnson, the author of the 1755 A Dictionary of the English Language.

However, for domain specific text, like the academic literature in different sciences (e.g. Medicine, Biology, Chemistry), or in the Humanities (History, Sociology, Philosophy, etc.) things are more complicated. Thus extracting technical terms/concepts from papers is important, interesting and usually difficult.

How do you detect mathematical concepts in text? As a human mathematician (or not), you probably recognize that the ovals in the snippet in this picture correspond to mathematical concepts. But there is some uncertainty in this choice, and several metrics might be used to decide whether the automated system used is getting it right or not.

We wanted to see how well ‘automatic term extractors’ (ATE) systems work for mathematical text, especially for Category Theory, if we simply use them out-the-box. It seems reasonable to establish baselines using available, open source systems and this is what we do in our paper “Extracting Mathematical Concepts from Text”. The paper was presented at the 8th Workshop on Noisy User-generated Text (W-NUT) at Coling 2022 and you can find a short, technical video describing the work below.

This work is part of a larger project on Networked Mathematics that we started describing in “Introducing the MathFoldr Project” and broadened the discussion in “The many facets of Networked Mathematics”. We hope to provide more information on the project soon.

Logic ForAll

Saturday, December 24, 2022

Extracting Mathematical Concepts

(Reposted from the Topos Institute blog, by Valeria de Paiva and Jacob Collard [originally posted Wednesday, 16 Nov 2022])

No comments:

Post a Comment