The Coleridge Initiative is pleased to announce a “Rich Context” text competition, and we hope you will participate. The goal of the competition is to automate the discovery of research datasets, and of the associated research methods and fields, in social science research publications. Participants should use any combination of machine learning and data analysis methods to identify the datasets used in a corpus of social science publications and to infer both the scientific methods used in the analysis and the research fields.

Problem Description

Researchers and analysts who want to use data for evidence and policy cannot easily find out who else worked with the data, on what topics and with what results. As a result, good research is underused, great data go undiscovered and are undervalued, and time and resources are wasted redoing empirical work.

We want you to help us develop and identify the best text analysis and machine learning techniques to discover relationships between datasets, researchers, publications, research methods, and fields. We will use the results to create a rich context for empirical research and to build new metrics that describe data use.


The competition has two phases. In the first phase, you will be provided with labeled data consisting of a corpus of 2,500 publications matched to the datasets cited within them. You can use this data to train and tune your algorithms. In the second phase, you will be provided with a large test corpus of unlabeled documents and asked to identify the datasets used in those documents, as well as the associated methods and research fields. You will be scored on the accuracy of your techniques, the quality of your documentation and code, and the efficiency of your algorithm, and also on your ability to find methods and research fields in the associated passage retrieval.
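As an illustration of the first-phase task described above, the following is a minimal baseline sketch, not the competition's reference method or official scoring. It assumes each labeled dataset comes with a list of known names, matches those names verbatim in a publication's text, and compares the result with the labeled datasets using precision, recall, and F1. The dataset IDs, the example names, the function names, and the choice of metric are all hypothetical; the announcement itself only states that accuracy will be scored.

```python
import re
from typing import Dict, List, Set, Tuple


def find_dataset_mentions(text: str, dataset_names: Dict[str, List[str]]) -> Set[str]:
    """Return the IDs of datasets whose known names appear verbatim in the text."""
    # Collapse line breaks and repeated spaces so names split across lines still match.
    normalized = re.sub(r"\s+", " ", text).lower()
    found = set()
    for dataset_id, names in dataset_names.items():
        for name in names:
            # Word boundaries avoid matching a short name inside a longer word.
            pattern = r"\b" + re.escape(name.lower()) + r"\b"
            if re.search(pattern, normalized):
                found.add(dataset_id)
                break
    return found


def precision_recall_f1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Compare predicted dataset IDs with labeled ones (assumed metric, not the official one)."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example data: IDs and name lists are illustrative only.
DATASET_NAMES = {
    "ds-1": ["National Health Interview Survey", "NHIS"],
    "ds-2": ["Current Population Survey"],
}

publication_text = "We use data from the National Health\nInterview Survey to estimate coverage."
labeled_datasets = {"ds-1", "ds-2"}

predicted = find_dataset_mentions(publication_text, DATASET_NAMES)
scores = precision_recall_f1(predicted, labeled_datasets)
```

Exact name matching of this kind misses paraphrased or abbreviated mentions, which is the gap that trained models built on the labeled corpus would be expected to close.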


  • September 30, 2018: Participants submit a letter of intent (see How to Participate)
  • October 15, 2018: Participants notified and the first phase data is provided (see First Phase Participation)
  • December 1, 2018: Final first phase algorithms submitted (see Program Requirements)
  • December 10, 2018: 5 finalists selected (see First Phase Evaluation)
  • December 15, 2018 – January 15, 2019: Finalists refine algorithms (see Second Phase Participation)
  • January 15, 2019: Refined second phase algorithms submitted (see Second Phase Evaluation)
  • January 15, 2019 – February 14, 2019: Second phase algorithms applied to second phase corpus and evaluated by competition panels
  • February 15, 2019: Workshop is held in New York, NY for final presentation and selection of winning algorithms (see Second Phase Evaluation)


Finalists will be awarded a prize of $2,000 each. A stipend of $20,000 will be awarded to the winning team; the winning team will work with the sponsors in the subsequent implementation of the algorithm.

Additional Information

Please do help us spread the word to computer science scholars all across the world. And do not hesitate to get in touch if you have any questions. Questions and letters of intent can be sent to

