GovWild: the wealth of freely available, structured information on the Web

Guest post by Markus Freitag of the GovWILD team. GovWILD was started in October 2009 as a bachelor’s project by seven students at the Hasso Plattner Institute in Potsdam. The purpose of the project is to reveal new information and connections in government data by linking existing information, and to provide a new level of transparency. Large amounts of data from the US and the EU are connected with Open Data from various sources.

The goal is to create a new unified data set that comprises enriched information and connections between objects which appear in the original heterogeneous sources. To that end, we built an integration process consisting of extraction, cleansing, merging and fusion steps. Our project team implemented the process in a scalable Hadoop environment with JAQL as the query language. All data sources used for this project provide official data published by the corresponding government, except for the Freebase data. The following table shows all acquired data sources with information about size, given format, number of tuples (entries/rows) and the number of attributes we used from each source (e.g. names, addresses, amounts).

US-Spending lists the expenses of the US federal government. It provides data about the contracted companies, the responsible agencies and the purpose of the spending. The listed sponsors are the responsible agencies.

US-Earmarks includes data on earmarks granted by the US government. Earmarks are long-term funding for programs or projects. This source contains information about the responsible politicians and the recipients, which are mostly companies.

US-Congress provides information on all current and former members of the US Congress. It also includes biographical information.

EU-Financial Transparency System (FTS) lists all funds spent by the EU. It contains the amount of the funds, the name and address of the recipient, and the funding agency.

EU-Parliament is the equivalent of US-Congress. It contains information about all members of the European Parliament.

DE-Party Donations is the German equivalent of the FEC (Federal Election Commission) data from the US. This source includes data on the donations German parties received from persons or companies.

DE-Agricultural Subsidies lists all German subsidies granted for agricultural purposes. Recipients are either persons or companies.

Freebase provides a huge amount of user-generated data. In this project, only data about persons, especially politicians, and companies is used.
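To make the integration steps mentioned earlier (extracting, cleansing, merging, fusing) more concrete, here is a minimal sketch in Python. It is not our JAQL/Hadoop implementation. The sample records, the field names, the name-based matching rule and the "first non-empty value wins" fusion rule are all assumptions chosen purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical records standing in for rows extracted from two sources.
RAW = [
    {"source": "US-Spending", "name": "ACME Corp. ", "city": "Boston", "amount": "1200.50"},
    {"source": "Freebase", "name": "Acme Corp", "city": None, "founded": "1952"},
    {"source": "US-Spending", "name": "Globex, Inc.", "city": "Denver", "amount": "300"},
]


def cleanse(record):
    """Trim the name, derive a normalized matching key, parse numeric fields."""
    cleaned = dict(record)
    name = record["name"].strip()
    cleaned["name"] = name
    # Lowercase, drop punctuation and collapse whitespace for matching.
    key = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned["key"] = re.sub(r"\s+", " ", key).strip()
    if "amount" in cleaned:
        cleaned["amount"] = float(cleaned["amount"])
    return cleaned


def merge(records):
    """Group records that share a key (a deliberately naive duplicate detection)."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["key"]].append(rec)
    return list(clusters.values())


def fuse(cluster):
    """Combine one cluster into a single entity; first non-empty value wins."""
    entity = {"sources": sorted({rec["source"] for rec in cluster})}
    for rec in cluster:
        for attr, value in rec.items():
            if attr in ("source", "key"):
                continue
            if value is not None and attr not in entity:
                entity[attr] = value
    return entity


def integrate(raw_records):
    """Run cleansing, merging and fusion over already extracted records."""
    cleaned = [cleanse(r) for r in raw_records]
    return [fuse(c) for c in merge(cleaned)]


entities = integrate(RAW)
for entity in entities:
    print(entity)
```

In this toy example the two "Acme" records end up as one entity that carries the city and amount from US-Spending and the founding year from Freebase. Matching real government data needs far more robust similarity measures than an exact key comparison.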

The article here (German) points out various difficulties of the project. In addition to the web application, we provide the data as a SQL database dump, RDF triples and JSON files for download. Future work will focus on acquiring more sources, especially from Germany, and on providing a better interface and visualization for the data. As a bachelor’s project the work is finished, but three of us are going to continue it. The HPI will support our work until the end of 2010. After that, the future of the project depends on sponsors and external financial support, which we are optimistic about finding :).

GovWild was launched on July 12 during a live interview on RBB Radio 1.

