A digital library and cyberinfrastructure facilitating the discovery and utilization of data & knowledge in published documents

1,600,000 documents

30,000 ingested this month

1,432 ingested this week

123 ingested in the last 24 hours

Enabling Text and Data Mining (TDM)

In collaboration with our team members on the UW Library staff, GeoDeepDive negotiates agreements with publishers that allow programmatic downloading and mining of published content.

All documents are securely stored on an access-controlled server at the heart of our digital library infrastructure (GeoDeepDive team members and our collaborators do not have access to original content via our infrastructure). UW-Madison's Center for High Throughput Computing supplies the computational power for processing documents with NLP, OCR, and other software tools useful for TDM tasks. The same resources allow new tools to be deployed quickly against all existing documents.

Our app-template allows collaborators to quickly bootstrap TDM applications that use the NLP and OCR output and to easily identify potentially relevant documents. Development is done with samples of documents, but applications operating on the full document set can be run on the GeoDeepDive infrastructure.

End-user Workflow

Have an idea

Start with a question that can be answered by mining the scientific literature. 1.5 million documents from 8 publishers are currently available.

Fork the application template

Find the application template on GitHub.

Update config file

Use words of interest to identify relevant documents. A subset of the literature that contains these words is then generated for testing purposes.
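The term-based selection described above can be sketched roughly as follows. This is a minimal illustration only: the `config` dictionary, the `matching_docs` function, and the sample documents are hypothetical and do not reflect the actual format of the application template's config file.

```python
import re

# Hypothetical config: the real template's config file may look different.
config = {"terms": ["stromatolite", "trilobite"]}


def matching_docs(docs, terms):
    """Return ids of documents whose text contains any term.

    Matching is case-insensitive and anchored at the start of a word,
    so "stromatolite" also matches "Stromatolites".
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")",
        re.IGNORECASE,
    )
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]


# Made-up sample documents, for illustration only.
docs = {
    "doc1": "Stromatolites occur in the lower member.",
    "doc2": "The sandstone is cross-bedded.",
    "doc3": "A trilobite fauna was recovered.",
}

test_subset = matching_docs(docs, config["terms"])
print(test_subset)
```

Anchoring the pattern only at the start of a word is a design choice that keeps plural forms in the subset, at the cost of a few extra false positives.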

Write an application

Identify an output schema that can be used to answer the original question, and write an application to parse the input into the desired structure. Python, Postgres, R, and associated modules are currently supported.
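As a rough sketch of what such an application might look like in Python, the example below defines a small output schema and fills it from tokenized sentences. The input layout (document id, sentence id, list of words), the `Mention` schema, and the `extract_mentions` function are all assumptions made for illustration; the actual NLP output format is not specified here.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Mention(NamedTuple):
    """Hypothetical output schema: one row per term occurrence."""
    docid: str
    sentid: int
    term: str
    sentence: str


def extract_mentions(
    sentences: Iterable[Tuple[str, int, List[str]]],
    terms: Iterable[str],
) -> List[Mention]:
    """Collect every occurrence of a term of interest.

    Each input row is assumed to be (docid, sentid, words), where
    words is the tokenized sentence.
    """
    wanted = {t.lower() for t in terms}
    mentions = []
    for docid, sentid, words in sentences:
        for word in words:
            if word.lower() in wanted:
                mentions.append(
                    Mention(docid, sentid, word.lower(), " ".join(words))
                )
    return mentions


# Made-up input rows, for illustration only.
sample = [
    ("doc1", 1, ["The", "Trilobite", "bed"]),
    ("doc1", 2, ["No", "fossils"]),
]
rows = extract_mentions(sample, ["trilobite"])
print(rows)
```

Rows in a fixed schema like this can be written to a table or CSV file and then analyzed with whatever tools suit the original question.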

Run on full dataset

Commit your application to a GeoDeepDive infrastructure repository on GitHub to run it on our infrastructure and generate results.

Analyze results

Download and analyze the results, and identify their strengths and weaknesses. We will provide bibliographic information about all relevant documents.


Troubleshoot your application, resubmit it, and generate new results. We will continue to grow the dataset as more matching documents are fetched.

Get in touch

Whether you're a publisher interested in contributing your content to our infrastructure, a scientist interested in collaboration, or just curious to know more, let us know!

The Team

The GeoDeepDive team is based at the University of Wisconsin-Madison and is made up of domain experts in both the Geosciences and Computer Sciences, librarians, infrastructure developers, and undergraduate, graduate, and postdoctoral researchers.

Wisconsin Infrastructure Team

Shanan Peters

Project Lead


Miron Livny

Project Lead

Computer Science

Ian Ross

Lead Developer

Computer Science

Tim Thiesen

Infrastructure Coordinator

Computer Science

John Czaplewski


Aimee Glassel

Academic Librarian


Jon Husson

Postdoctoral Researcher


Andrew Zaffos

Postdoctoral Researcher


Valerie Syverson

Postdoctoral Researcher


Chao Liu

Postdoctoral App Builder

Carnegie Institute

Julia Wilcots

Undergrad App Builder


Erika Ito

Undergrad App Builder

UW Madison

Stanford DeepDive Team

Chris Ré's team is focused on the DeepDive platform for knowledge base creation and on ensuring that the datasets produced by the UW-Madison infrastructure team are DeepDive-ready.

Christopher Ré

Project Lead

Computer Science

Ce Zhang

Postdoctoral Researcher

Computer Science