I work for a nonprofit whose mission is to develop cutting-edge data sources and harness the latest analytical tools to help people in poverty around the world achieve their true potential. We are leveraging the fruits of the LLM revolution to mobilize historical datasets of great social significance, which previously would have been prohibitively expensive to structure into an analysis-ready format.
In India, persistent social division is rooted in caste inequality: dependence on caste networks hinders migration, caste differences act as barriers to trade, and caste identity influences lender-borrower relationships and public service delivery. India’s more than 5,000 largely endogamous social groups have vastly different norms, customs, values, and beliefs, but these have been impossible to study due to a paucity of reliable data.
An ethnographic survey conducted by the Anthropological Survey of India from 1985 to 1992, the People of India (POI) project, constructed a brief descriptive anthropological profile of every community in India. It took 500 scholars, 26,000 field days, and 25,000 interviews (5 per community, 5,000 of them with women), with each community studied in multiple places wherever possible. These data are machine-readable, but until now they could not be compared with earlier records, a comparison that would allow us to understand how cultural norms are changing over time. A previous round of the People of India project was run by the British colonial administration in 1880–1910 and resulted in 30 long-form volumes of ethnographic narratives. Its methodology was similar to that of the 1980s POI, but the text is locked in image-based PDFs.
We have deployed GPT-4 to digitize the early 1900s volumes into norm × group datasets, matching the structure of the 1980s data. This presentation will describe the construction of these data and briefly summarize how modern generative AI and data pipeline tooling could revolutionize social research in the nonprofit and international development sector.
Tobias is the Co-founder and Chief Data Scientist of Development Data Lab. He works with an array of tools and techniques to extract meaning from large, messy, complex datasets, delivering insights and open-sourcing data and software to fight poverty around the world. Inspired by altruism, resilient automation, machine and human learning, and wonder, Tobias is interested in finding ways to leave the world better than we found it, as he believes most people are.