Wikipedia:The Fourth Tilde/24/Appendix: Turning Wikipedia into a database

Appendix to the article: 'Turning Wikipedia into a database'
By Yaron Koren

To show how the proposed system would work, let's take an example of data: the population of countries. Populations change all the time, and pages for countries often show different populations in different years, so this page, which we could call "Country populations", would need a separate row for each combination of country and year for which we have data. Here's what the first few lines of the page might look like:

The first row is a header that shows the name of each field. After that, the data would keep going, in alphabetical order (in English), with an unlimited number of rows for each country. You could then have a "wrapper" API for the entire data set; a call to get all the rows of Afghanistan's population data might look like:

http://data.wikipedia.org/data_api?Page=Country_populations&Country=Afghanistan
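As a rough illustration of how a client might assemble such a call, here is a minimal Python sketch. The base URL and the `Page` and `Country` parameters come from the example above; the helper name `build_query_url` is hypothetical, and the service itself is only a proposal, so nothing here talks to a real endpoint.

```python
from urllib.parse import urlencode

# Base URL copied from the example above; the service is a proposal, not a live endpoint.
BASE_URL = "http://data.wikipedia.org/data_api"

def build_query_url(page, **filters):
    """Return a query URL for one data page, with optional field=value filters."""
    params = {"Page": page, **filters}
    # urlencode escapes characters that are not URL-safe (a space becomes "+").
    return BASE_URL + "?" + urlencode(params)

afghanistan_url = build_query_url("Country_populations", Country="Afghanistan")
print(afghanistan_url)
```

Any additional field could be passed the same way, as another keyword argument.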

And if you only wanted to get a single number - for instance, the most recent population estimate - you might have a call that looked like:

http://data.wikipedia.org/data_api?Page=Country_populations&Country=Afghanistan&Date=_latest_&_fields_=Population

Here, the underscores around a string indicate that it's a standard term, to distinguish it from actual data.
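To make the two calls above more concrete, here is a sketch in Python of how the wrapper API might evaluate them against the text of a data page. Everything in it is an assumption made for illustration: the field names in the sample page, the fictional countries and placeholder numbers, the function names, and the idea that `_fields_` could take a comma-separated list. It also assumes that dates sort correctly as plain strings, as four-digit years do.

```python
import csv
import io
from urllib.parse import urlparse, parse_qs

# Hypothetical page contents: fictional countries, placeholder numbers,
# and assumed field names. None of this is real data.
SAMPLE_PAGE = """Country,Date,Population,Source
Freedonia,2005,1000,Example census
Freedonia,2008,1200,Example estimate
Ruritania,2008,5000,Example census
"""

def is_standard_term(text):
    """Standard terms are wrapped in underscores, e.g. _latest_ or _fields_."""
    return len(text) > 2 and text.startswith("_") and text.endswith("_")

def run_query(url, page_text):
    """Apply the filters in a query URL to the CSV text of a data page."""
    params = {key: values[0] for key, values in parse_qs(urlparse(url).query).items()}
    params.pop("Page", None)  # the caller has already chosen the page
    fields = params.pop("_fields_", None)
    want_latest = params.get("Date") == "_latest_"
    # Plain field=value pairs become filters; standard terms are handled separately.
    filters = {key: value for key, value in params.items()
               if not is_standard_term(key) and not is_standard_term(value)}

    rows = [row for row in csv.DictReader(io.StringIO(page_text))
            if all(row.get(key) == value for key, value in filters.items())]

    if want_latest and rows:
        # Assumes Date values compare correctly as strings.
        rows = [max(rows, key=lambda row: row["Date"])]
    if fields:
        wanted = fields.split(",")
        rows = [{name: row[name] for name in wanted} for row in rows]
    return rows

url = ("http://data.wikipedia.org/data_api?Page=Country_populations"
       "&Country=Freedonia&Date=_latest_&_fields_=Population")
print(run_query(url, SAMPLE_PAGE))  # [{'Population': '1200'}]
```

Because the standard terms are wrapped in underscores, the code can tell them apart from field names and values without a reserved-word list.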

Note that the data, data structure (the names of the page and fields) and API are all in English - which is contrary to the initial goal of making a universal data set that anyone can edit and anyone can read from. There are ways to get the data, once it's accessed, to be translated into different languages; but I think having a single language in which the data is stored is necessary to get it to work - and since there's no universal language for data (Esperanto doesn't count), English is the best option. English is already a de facto universal language in the technical world, and Wikipedia already has a repository that's based in English - Wikimedia Commons, which holds images and other media files - though it should be noted that Commons also holds translations into other languages when possible.

So, I think the data should be in English. But what about viewing it in another language? Let's take our previous example, but for display in a non-English language like, say, Japanese. The API call for that might look like this:

http://data.wikipedia.org/data_api?Page=Country_populations&Country=Afghanistan&_language_=ja

What would the API do at this point? Some things are easy to put into another language, like numbers - note that the original data example doesn't contain commas in the numbers, since some languages use periods instead of commas to group digits. The API can store the plain digits and add whichever separator the target language uses.
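A small Python sketch of that idea follows. The separator table is illustrative and far from complete, and the function name is hypothetical; a real implementation would more likely rely on an existing localization library than on a hand-written table.

```python
# Illustrative, incomplete table: digit-grouping separator by language code.
GROUP_SEPARATORS = {"en": ",", "ja": ",", "de": "."}

def format_number(raw, language):
    """Format a plain digit string from the data page for a given language."""
    grouped = f"{int(raw):,}"  # group digits in threes, using commas
    # Fall back to a comma when the language is not in the table.
    separator = GROUP_SEPARATORS.get(language, ",")
    return grouped.replace(",", separator)

print(format_number("1234567", "en"))  # 1,234,567
print(format_number("1234567", "de"))  # 1.234.567
```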

For text, there's a possibility that I think is worth exploring. Besides data contained in infoboxes, another extremely valuable set of data that Wikipedia holds, and that has been underused so far, is a set of translations for the name of each page, based on interwiki links. If I go to the page called "Giraffe", I can find out the names for it in over 50 different languages, and, more importantly, so can a machine.

There are various ways in which this capability can be used. For the example before, the first population figure came from the United Nations Department of Economic and Social Affairs Population Division. There is a page in the English-language Wikipedia for "United Nations Department of Economic and Social Affairs", so perhaps the text in the CSV page could instead look something like "[[United Nations Department of Economic and Social Affairs]] Population Division", with the link brackets indicating to the system that it should look for a corresponding page for that term in whatever language it is exporting to, or use English if no such page exists.

For the surrounding phrase, "Population Division", I think there are two real options: it could perhaps be translated using an outside service like Google's translation API, or there could be a general effort within Wikipedia to provide interwiki links for redirect pages as well. There's currently no page in the English Wikipedia for the entire phrase, "United Nations Department of Economic and Social Affairs Population Division", and if there were one, it would probably just be a redirect to "United Nations Department of Economic and Social Affairs". But if that redirect page itself had interwiki links to similar redirect pages in other languages, and if that process were repeated for many relevant pages, that could, I think, solve the problem by itself.
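The lookup with an English fallback could be sketched in Python as below. This assumes that translatable terms in the data are marked with double square brackets, like wikitext links. The `INTERWIKI` table is a hypothetical stand-in for real interwiki links, with a placeholder in place of an actual translated title, and the function name is invented for the example.

```python
import re

# Hypothetical stand-in for interwiki data: English title -> {language code: title}.
# The translated title is a placeholder, not a real page name.
INTERWIKI = {
    "United Nations Department of Economic and Social Affairs": {
        "ja": "(Japanese title)",
    },
}

# Matches [[Title]]; assumes simple links with no pipe or nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)\]\]")

def localize(text, language, interwiki=INTERWIKI):
    """Replace each [[Title]] with its title in `language`, or the English title if none exists."""
    def swap(match):
        title = match.group(1)
        return interwiki.get(title, {}).get(language, title)
    return LINK.sub(swap, text)

SOURCE = "[[United Nations Department of Economic and Social Affairs]] Population Division"
print(localize(SOURCE, "ja"))  # (Japanese title) Population Division
print(localize(SOURCE, "fr"))  # no entry for "fr", so the English title is kept
```

Text outside the brackets, here "Population Division", passes through unchanged. That is exactly the part that would need either an outside translation service or interwiki links on redirect pages.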

How about speed? Ensuring that the API is fast is a challenge as well, although there are a number of possible solutions to this, like storing all the data from the CSV pages in a real database (like the MySQL database that Wikipedia already uses). The database could be refreshed every time one of the pages is saved.
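One way to picture the refresh-on-save approach is the Python sketch below. It uses SQLite from the standard library purely as a stand-in for MySQL, and the table layout (one database row per page, row number, field and value) is an assumption chosen so that pages with different fields can share a single table. The function names and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3

def refresh_page(conn, page_name, page_text):
    """Replace the cached rows for one data page; meant to run whenever the page is saved."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data_rows "
        "(page TEXT, row_num INTEGER, field TEXT, value TEXT)"
    )
    with conn:  # commit the delete and the inserts together, or roll both back
        conn.execute("DELETE FROM data_rows WHERE page = ?", (page_name,))
        for row_num, row in enumerate(csv.DictReader(io.StringIO(page_text))):
            conn.executemany(
                "INSERT INTO data_rows VALUES (?, ?, ?, ?)",
                [(page_name, row_num, field, value) for field, value in row.items()],
            )

def lookup(conn, page_name, field, value):
    """Return the row numbers on a page where `field` equals `value`."""
    cursor = conn.execute(
        "SELECT row_num FROM data_rows "
        "WHERE page = ? AND field = ? AND value = ? ORDER BY row_num",
        (page_name, field, value),
    )
    return [record[0] for record in cursor]

# Fictional sample page with placeholder numbers.
conn = sqlite3.connect(":memory:")
refresh_page(conn, "Country_populations",
             "Country,Date,Population\nFreedonia,2005,1000\nRuritania,2008,5000\n")
print(lookup(conn, "Country_populations", "Country", "Ruritania"))  # [1]
```

Deleting and re-inserting a whole page on every save is simple but coarse; for very large pages, updating only the changed rows might be preferable.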

Assuming that the API is in place, how do we then retrieve the data on a wiki page - say, within an infobox? That can be done in a variety of ways - one way which already exists is the External Data extension, which I'm the main author of, and which is mentioned in the Technology Review article. A call to get the current population of Afghanistan from that API, using External Data, would look like this:

{{#get_external_data:http://data.wikipedia.org/data_api?Page=Country_populations&Country=Afghanistan&Date=_latest_&_fields_=Population|csv with header|population=Population}} {{#external_value:population}}

It's alright if this doesn't make much sense - the point is that a wiki-text call is possible that will go to the API and get back the necessary data. It's definitely possible to create a function based specifically around the API, which would be simpler, and would possibly combine the two calls into one.