build a paragraph-level index for arXiv

There is plenty of ready-to-use technology for author-topic modeling, in part because so many of the machine-learning community's successes have been in text processing and classification. Some of this technology has been run on the arXiv at the level of abstracts, titles, and metadata, but rarely at the full-text level. If we ran it there, we could build an author-topic model over every paragraph in the full corpus of the arXiv. This would tell us what every paragraph is about (and, amusingly, who wrote every paragraph), with (if we do it right) probabilistic output. That is, it would give a probability distribution over topics for every paragraph.
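To make "a probability distribution over topics for every paragraph" concrete, here is a minimal sketch. It is not a full author-topic model: it swaps in plain latent Dirichlet allocation (no author component), fit by collapsed Gibbs sampling in pure Python. The function name `fit_lda`, the hyperparameters `alpha` and `beta`, the whitespace tokenizer, and the toy paragraphs are all assumptions made for illustration; a real run over the arXiv would need a proper tokenizer and a scalable implementation.

```python
import random
from collections import Counter


def fit_lda(paragraphs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Plain LDA by collapsed Gibbs sampling (a stand-in for an author-topic model).

    Returns one topic distribution (a list of n_topics probabilities)
    per input paragraph.
    """
    rng = random.Random(seed)
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    docs = [p.lower().split() for p in paragraphs]
    vocab_size = len({w for words in docs for w in words})

    # Count tables used by the sampler.
    doc_topic = [[0] * n_topics for _ in docs]          # paragraph x topic
    topic_word = [Counter() for _ in range(n_topics)]   # topic x word
    topic_total = [0] * n_topics                        # words per topic
    assign = []                                         # topic of every token

    # Random initial topic assignment for every token.
    for d, words in enumerate(docs):
        zs = []
        for w in words:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    # Gibbs sweeps: resample each token's topic given all the others.
    for _ in range(n_iter):
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed, normalized topic distribution for each paragraph.
    theta = []
    for d, words in enumerate(docs):
        denom = len(words) + n_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta


if __name__ == "__main__":
    # Toy paragraphs, invented for illustration only.
    toy = [
        "galaxy redshift survey photometry galaxy",
        "neural network training gradient network",
        "redshift survey galaxy spectra",
    ]
    for text, dist in zip(toy, fit_lda(toy)):
        print([round(p, 3) for p in dist], text)
```

Each row of the output is a distribution over topics for one paragraph. An author-topic model would add a second set of distributions, over topics for every author, which this sketch leaves out.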

If done correctly, and with a little hand-labeling of classes (which should be easy, because each class would have characteristic words, phrases, and authors), this could lead to a complete and exceedingly useful paragraph-level index into the arXiv. But even without the hand-labeling it would be incredibly useful: It could be used to find paragraphs in the literature that are relevant to each paragraph you have written in one of your own papers, thus locating related work you might not know about. It could drive search services that find paragraphs relevant to your search terms even when they don't, themselves, contain those terms. And so on!
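The "find relevant paragraphs" use can be sketched as a nearest-neighbor lookup in topic space, assuming every paragraph already has a topic distribution. The Hellinger distance is my choice for this illustration (other divergences would also work), and the function names, paragraph identifiers, and numbers below are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def nearest_paragraphs(query, index, k=3):
    """Return the k paragraphs whose topic distributions are closest to `query`.

    `index` maps a paragraph identifier to its topic distribution.
    """
    scored = [(pid, hellinger(query, dist)) for pid, dist in index.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical identifiers and made-up topic distributions.
    index = {
        "paper-A/para-12": [0.90, 0.10],
        "paper-B/para-3": [0.10, 0.90],
        "paper-C/para-7": [0.80, 0.20],
    }
    my_paragraph = [0.85, 0.15]
    for pid, dist in nearest_paragraphs(my_paragraph, index, k=2):
        print(pid, round(dist, 4))
```

Because the comparison happens in topic space rather than word space, a paragraph can match a query without sharing any of its terms, which is the point of the search use case above. A brute-force scan like this would not scale to the full arXiv; an approximate nearest-neighbor index would be needed there.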

Dan Foreman-Mackey (NYU) first made it clear to me that this would be possible, and he also took some steps towards making it happen. David Blei (Princeton) suggested to me that even from a machine-learning perspective the outcomes could be very interesting. The plagiarism paper from the arXiv people suggests working at the PDF level rather than the LaTeX-source level; I am not sure myself which to do.

1 comment:

  1. Nature magazine recently did something like this, using the 30 papers on the ENCODE (Encyclopedia of DNA Elements) project. Obviously on a much smaller scale than the full arXiv, but perhaps a step in the right direction.