I’ve created a small PHP library to read from, and iterate through, Wikidata/Wikibase JSON dumps.
Wikidata is the free knowledge base that anyone can edit, and serves as a central data repository for Wikipedia and associated projects. Wikibase is the free, open source software that powers Wikidata. You can get one of the Wikidata dumps, which contain the content of Wikidata, for your own purposes, such as doing analysis on the data or setting up your own copy. These dumps are available in JSON, XML and RDF formats, with JSON generally being the recommended one. At the time of writing, the JSON dumps are 55 GiB uncompressed, or 3.7 GiB when compressed with bzip2.
What this library does is provide the utilities needed to consume such dumps in your PHP code. This starts with the simplest thing you could possibly want, reading a single entity serialization from the dump, and includes higher level access such as iterating through the dump as if it were an array of in-memory EntityDocument PHP objects. (These objects are defined by the Wikibase DataModel library, and deserialization is done via the Wikibase DataModel Serialization library.)
Reading some lines from a bz2 dump
    $dumpReader = $factory->newBz2DumpReader( '/tmp/wd-dump.json.bz2' );
    echo 'First line: ' . $dumpReader->nextJsonLine();
    echo 'Second line: ' . $dumpReader->nextJsonLine();
Iterating through the JSON
    $dumpReader = $factory->newBz2DumpReader( '/tmp/wd-dump.json.bz2' );
    $dumpIterator = $factory->newStringDumpIterator( $dumpReader );

    foreach ( $dumpIterator as $jsonLine ) {
        echo 'You can haz JSON: ' . $jsonLine;
    }
Creating an EntityDocument iterator
    $dumpReader = $factory->newBz2DumpReader( '/tmp/wd-dump.json.bz2' );
    $dumpIterator = $factory->newEntityDumpIterator( $dumpReader );

    foreach ( $dumpIterator as $entityDocument ) {
        echo 'At entity ' . $entityDocument->getId()->getSerialization();
    }
All services are constructed via the JsonDumpFactory class, which you can construct as follows:
    use Wikibase\JsonDumpReader\JsonDumpFactory;
    $factory = new JsonDumpFactory();
There are two types of services provided by this library: those implementing DumpReader and those implementing Iterator. The former allow you to ask for the next line of the dump. They are the lowest-level services, with the different implementations supporting different dump file formats (such as .json and .json.bz2). The iterators all depend on a DumpReader, and allow you to easily iterate over all entities in the dump. They differ in how much additional processing they do, from nothing (returning the JSON strings) to fully deserializing the entities into EntityDocument objects.
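To make the layering concrete, here is a minimal sketch of the idea rather than the library's actual code. InMemoryLineReader and iterateDecoded() are hypothetical names made up for this illustration, and returning null at the end of the input is an assumption of the sketch. The reader only knows how to hand out the next line, while the iterator on top of it decides how much processing to do (here: decoding each line into an array).

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a dump reader (not a class from the library):
// it hands out one line per call and null once the input is exhausted.
final class InMemoryLineReader {

	/** @var string[] */
	private array $lines;

	private int $position = 0;

	/** @param string[] $lines */
	public function __construct( array $lines ) {
		$this->lines = array_values( $lines );
	}

	public function nextJsonLine(): ?string {
		return $this->lines[$this->position++] ?? null;
	}

}

// Hypothetical iterator layered on top of the reader. Being a generator,
// it is lazy: a line is only read and decoded when the loop asks for it.
function iterateDecoded( InMemoryLineReader $reader ): Generator {
	while ( ( $line = $reader->nextJsonLine() ) !== null ) {
		yield json_decode( $line, true, 512, JSON_THROW_ON_ERROR );
	}
}

$reader = new InMemoryLineReader( [
	'{"id":"Q1","type":"item"}',
	'{"id":"P31","type":"property"}',
] );

$ids = [];
foreach ( iterateDecoded( $reader ) as $entity ) {
	$ids[] = $entity['id'];
}

echo implode( ', ', $ids ), PHP_EOL;
```

Swapping the reader for one that reads a different file format would leave the iterator untouched, which is the point of keeping the two kinds of services separate.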
The iterators are lazy and can easily be combined with iterator tools provided by PHP, such as LimitIterator and CallbackFilterIterator. And of course, Iterators are awesome by default (even though they are kinda broken in PHP). While creating the latest version of this library I had a fun side endeavour: creating another library, one dedicated to rewindable Generators (which are Iterators).
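As a rough sketch of such a combination, the snippet below filters a stream of JSON lines down to items and then stops after the first two. To keep it runnable without a dump file, fakeJsonLines() is a hypothetical generator standing in for one of the library's iterators; with the real thing you would pass the iterator created by the factory instead. CallbackFilterIterator and LimitIterator are the standard SPL classes.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a lazy dump iterator that yields JSON lines.
function fakeJsonLines(): Generator {
	yield '{"id":"Q1","type":"item"}';
	yield '{"id":"P31","type":"property"}';
	yield '{"id":"Q2","type":"item"}';
	yield '{"id":"Q3","type":"item"}';
}

// Keep only the lines that describe items...
$itemsOnly = new CallbackFilterIterator(
	fakeJsonLines(),
	static function ( string $jsonLine ): bool {
		return json_decode( $jsonLine, true )['type'] === 'item';
	}
);

// ...and stop after the first two of them.
$firstTwoItems = new LimitIterator( $itemsOnly, 0, 2 );

$ids = [];
foreach ( $firstTwoItems as $jsonLine ) {
	$ids[] = json_decode( $jsonLine, true )['id'];
}

echo implode( ', ', $ids ), PHP_EOL;
```

Because every layer is lazy, nothing is decoded until the foreach loop asks for the next element, which matters when the underlying dump is tens of gigabytes. Note that a plain Generator can only be iterated once, so this chain cannot be looped over a second time.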
You can find JsonDumpReader on GitHub and it is also available as jeroen/json-dump-reader on Packagist.
Is that really all this library does?
It just allows me to loop through a dump? You need a whole library for that? Undoubtedly some people are asking that question, which by now I’m taking as a sign of having made a library/class/function that is properly dedicated to doing one single task 🙂
Do you want to set up your own Wikibase or import data from Wikidata and need help? Contact Professional.Wiki, my wiki services company.