Content gets old and is sometimes no longer accessible. Hence, I've decided to duplicate here the blog posts I wrote for the Topos blog. At Topos, I have to follow a style that sometimes clashes with mine, while here I can do things my own way. And this is a huge plus.
Below is the blog post we wrote about our MathFoldr project 17 months ago. Much can happen in 17 months, but much can also fail to happen. Despite a very enthusiastic response from our ACT community, we were not able to convince the funding agencies that this is an important project. Thus, the work carried on, but at a slow pace. I thought I'd reproduce the original post first and then add a few comments to update it.
Introducing the MathFoldr Project
Sunday, 11 Jul 2021
Categories: [research]
Tags: [MathFoldr], [AI], [NLP]
At Topos we believe knowledge empowers people, and that our community’s expertise should be available to all who seek it out. This is why, for example, we broadcast most of our seminars live on YouTube, and actively support numerous open publishing projects, such as the journal Compositionality, the nLab community wiki, or simply making our books freely available online.
But simple availability is not enough. True access is more than an open door: it’s clear, legible street signs, elevators, and gently sloping on-ramps. And with modern AI and natural language processing tools, we believe it’s beyond time to build these accessibility tools for science and mathematics.
This blog post provides an overview of our nascent MathFoldr project, sharing our dreams and our approach so far—and, at the end, a way for you to help just by doing a concrete, 5 minute activity!
1 How do we organise mathematics?
A cornerstone of accessibility is search, and math is not easy to search. A striking, recent example comes from Quanta Magazine, November 2019. A group of physicists discovered a useful identity relating eigenvectors and eigenvalues, and did not know if it was novel. To check, they emailed a number of mathematicians, including Fields Medallist Terence Tao. Despite believing the result was “so short and simple—it should have been in textbooks already”, Tao had not previously heard of it. This led to a paper submitted for publication and, soon after, the article in Quanta. In the weeks after the story emerged, more than three dozen previously published instances of the result were reported, dating back to 1934. How can it be that even eminent mathematicians cannot find a widely published, basic result within their field of expertise?
The simple answer is that the mathematical literature has grown far too vast for even an expert to keep track of it all. A recent analysis finds over 120,000 math papers published in 2017 alone, with this rate growing exponentially at 3% a year.
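To get a feel for what 3% compound growth means, here is a small back-of-the-envelope calculation. It is only a sketch: the function names are made up for illustration, and it assumes the quoted 3% rate holds constant from the 2017 baseline, which the analysis cited above does not guarantee.

```python
import math

PAPERS_2017 = 120_000  # figure quoted in the post
GROWTH_RATE = 0.03     # 3% per year, as quoted


def projected_papers(year, base_year=2017, base=PAPERS_2017, rate=GROWTH_RATE):
    """Papers per year under constant compound growth (an assumption)."""
    return base * (1 + rate) ** (year - base_year)


def doubling_time(rate=GROWTH_RATE):
    """Years needed for the yearly output to double at a constant rate."""
    return math.log(2) / math.log(1 + rate)


if __name__ == "__main__":
    print(f"Doubling time: {doubling_time():.1f} years")
    for year in (2017, 2021, 2030):
        print(year, round(projected_papers(year)))
```

Under that assumption, the yearly output doubles roughly every 23 years, so the backlog a reader would need to search keeps growing faster than any individual can read.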
Our infrastructure for organizing and communicating these results has not kept up. The ramifications are significant: wasted search time, duplication of research, and missed connections between fields.
2 The MathFoldr vision and path so far
We seek to address this with our MathFoldr project, part of our Networked Mathematics theme. MathFoldr will provide search and literature curation tools that will make mathematics more accessible, with the ultimate goal of transforming the way mathematics is created and navigated.
Today mathematics is rather artisanal: mathematicians craft pdfs of new knowledge, and share these via posting on websites and advertising them in talks. Many recent technologies, from Google, to GitHub, to materials discovered via NLP over materials science literature, show the potential for something much more efficient and effective.
Our strategy for improving this begins with improving the organization and dissemination of math via NLP-powered tools, such as search engines, knowledge graphs, and glossaries, as an entry way to shift publication practices towards ever more formal representations. So the first task is to build ontologies, and to get the community excited and involved with good UI/visualizations.
Right now, we’re doing pilot studies with two corpora, both available on our GitHub:
- a comprehensive but messy one of nLab (community wiki) articles: [ToposInstitute/nlab-corpus]
- a smaller, clean one of TAC (journal) abstracts: [ToposInstitute/tac-corpus]
These pilot studies aim to create a synthesis of machine- and community-driven methods of extracting and curating an ontology of categorical concepts, which will then be maintained via WikiData. From this ontology, we will then build tools to do concept recognition and other semantic processing tasks. A prototype tool, led by Antonin Delpeuch, is nLab.OpenTapioca.
To extract ontologies, we’ve been collaborating with Jacob Collard and Eswaran Subrahmanian at NIST (the US National Institute of Standards and Technology), who have built an exciting pipeline that preprocesses the text with spaCy, and then uses a root- and rule-based linguistics method (R&R) to extract concepts. You can navigate the results with their Parmesan interface.
3 What’s next?
At present, we’re thinking about how to further clean these corpora, and refine the R&R methods to extract more accurate and precise concepts.
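To make "rule-based concept extraction" a little more concrete, here is a deliberately tiny toy sketch. It is not the R&R method and it does not use spaCy: it is a plain-Python stand-in that treats runs of consecutive non-stopwords as candidate terms. The stopword list, the function name `candidate_terms`, and the sample sentence are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stopword list. A real pipeline would rely
# on a proper tokenizer and part-of-speech tags instead.
STOPWORDS = {
    "a", "an", "the", "of", "in", "on", "is", "are", "and", "or", "to",
    "we", "that", "every", "for", "with", "this", "show", "by", "as", "it",
}


def candidate_terms(text, max_words=3):
    """Return runs of consecutive non-stopwords, at most max_words long."""
    terms = []
    # Split at punctuation so a term never spans a comma or a full stop.
    for clause in re.split(r"[.,;:()]", text.lower()):
        run = []
        for word in re.findall(r"[a-z][a-z\-]*", clause):
            if word in STOPWORDS:
                if run:
                    terms.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            terms.append(" ".join(run))
    return [t for t in terms if len(t.split()) <= max_words]


if __name__ == "__main__":
    sample = (
        "We show that every monoidal category is equivalent to a strict "
        "monoidal category. The proof uses the Yoneda lemma."
    )
    for term, count in Counter(candidate_terms(sample)).most_common():
        print(count, term)
```

Even on this one sentence, the toy rule returns a mix of genuine concepts and noise (any run of ordinary words that happens not to be in the stopword list), which is exactly the kind of error that cleaner corpora and more refined linguistic rules are meant to reduce.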
Simultaneously, we’re also thinking about the word embedding models that are both used by these toolkits, and that we could use separately to refine search and other methods. A central problem is that mathematical text is quite different from standard newswire English, and so, as with any domain-specific text, we’re seeing a number of processing errors. Can we tune these better for mathematical text?
But ultimately, as with any data-driven enterprise, the quality of the output depends crucially on the quality of the input. And so throughout this all, we’re working to improve our corpora, to more accurately capture the expertise and intuitions of mathematicians. Here, we have a request of you: please contribute your expertise and intuitions.
More precisely, we’d love some help identifying concepts in abstracts from Theory and Applications of Categories, to produce what’s known as an “annotated corpus”, which we will share openly for NLP experiments to benefit the scientific community. To contribute, just choose your favorite TAC abstract, and click the button below!
========================
Well, I would now say
>This is why, for example, we broadcast and record most of our seminars live on YouTube.
because having the recordings available after the event seems to me more important than broadcasting the talks live: they form our permanently available library of exciting talks.