Working with data collected by public bodies is crucial to public policy research. Open data, and data that is easily accessible and reusable, is a fast-growing and important part of that work. However, most public data is published as PDFs, which is an inconvenient format for any further analysis.
As a step towards making such data more usable, we began with the idea of ‘liberating’ these PDFs and ‘freeing’ the data extracted from them. To do so, CITAPP, along with Fields of View and Datameet, organised a day-long PDF Liberation Hackathon on the 21st of March 2015. The hackathon opened with an introduction to why open data is important, followed by a session on extracting data from PDFs.
The event was attended by 42 students from IIITB, from the MTech, MS/PhD and iMTech programs. The participants divided themselves into ten teams, and each team chose a PDF document to work with. The idea was to have participants convert tables of data from PDFs into more accessible formats, including CSV and spreadsheets. The PDFs which the participants worked on can be found here.
The day started off with an introduction at 10.30 AM by Nisha from Datameet, followed by a presentation of the PDFs and a presentation on the tools the participants could use. The hackathon began right after lunch and ran until 5.30 PM, with seven of the ten teams submitting ‘freed’ data from PDFs.
The extracted data has been uploaded to a public Google Drive folder, and can be found here.
(This post is cross-posted from the event report published on CITAPP’s website here.)