
Ruben Verborgh

600,000 queryable datasets—and counting

The LOD Laundromat and Triple Pattern Fragments offer scalable Web querying.

What good is a Web full of Linked Data if we cannot reliably query it? Whether we like to admit it or not, queryable data is currently the Semantic Web’s Achilles’ heel. The Linked Data cloud contains several high-quality datasets with a total of billions of triples, yet most of that data is only available in downloadable form. Frankly, this doesn’t make any sense on the Web. After all, would you first download Wikipedia in its entirety just to read a single article? Probably not! We combined the power of the LOD Laundromat, a large-scale data cleansing apparatus, with the low-cost Triple Pattern Fragments interface so you can once and for all query the Web.

The SPARQL language lets you express powerful queries over a set of Linked Data. We might thus be tempted to think that, if we want to publish Linked Data, we should offer it through a full SPARQL interface. After all, this gives the most flexibility to our consumers. But the question is then: who will pay for this interface? You already give away your data for free; are you expected to pay for people’s queries as well? SPARQL is so powerful that many queries are expensive, so expensive that the majority of public SPARQL endpoints are down for more than 1.5 days each month.

And most datasets are not even available through SPARQL endpoints, presumably because this interface is so expensive to host with high availability. Instead, datasets are offered as data dumps: if you want to query them, you download the entire dataset, ingest it into a local triple store, and execute your queries there. Needless to say, such an approach has hardly anything to do with Web querying: when was the last time you downloaded an entire website just to read one page? We urgently need an affordable way of publishing data that can still be queried live.

Combining the LOD Laundromat with a low-cost interface delivers queryable Linked Data. ©Megan

Equipping the LOD Laundromat with queryable access

The LOD Laundromat crawls the Linked Data cloud for data dumps. It downloads them and fixes various quality issues such as syntax errors, duplicates, and blank nodes. All cleaned-up datasets and their metadata can then be downloaded. While this is undoubtedly a great way to get your hands on clean triples, it still doesn’t let you query them online… until recently!

Together with the Laundromat team from the VU Amsterdam, namely Laurens Rietveld and Wouter Beek, we made all datasets queryable through Triple Pattern Fragments. More than 600,000 datasets from all over the Web, which used to be download-only, can now be queried live. Every dataset in the LOD Laundromat Wardrobe has Browse and Query buttons. Here are some cool examples you might never have heard of before (because they weren’t queryable ;-)

You might even search for your dataset to check whether it’s in there. And if it isn’t, you can simply add it. In a matter of moments, your data becomes queryable!

The Laundromat does this by converting every downloaded file to the HDT compressed triple format. HDT lets you look up triple patterns really fast, which is all a Triple Pattern Fragments server needs. By the way, all HDT files are available, too!
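To make "looking up a triple pattern" concrete, here is a minimal Python sketch. It is not HDT, which is a compressed binary format with its own indexes; it only illustrates the single operation a Triple Pattern Fragments server asks of its backend: given a pattern in which subject, predicate, and object are each either fixed or a wildcard, return the matching triples. The function name `match_pattern` and the sample data are made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]


def match_pattern(triples: Iterable[Triple],
                  subject: Optional[str] = None,
                  predicate: Optional[str] = None,
                  obj: Optional[str] = None) -> Iterator[Triple]:
    """Yield every triple matching the pattern; None acts as a wildcard."""
    for s, p, o in triples:
        if subject is not None and s != subject:
            continue
        if predicate is not None and p != predicate:
            continue
        if obj is not None and o != obj:
            continue
        yield (s, p, o)


# Hypothetical sample data, using abbreviated IRIs for readability.
DATA = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:bob", "foaf:knows", "ex:carol"),
]

# Pattern "?s foaf:knows ?o": only the predicate is fixed.
knows = list(match_pattern(DATA, predicate="foaf:knows"))
```

This linear scan is of course far slower than an indexed format; the point is only that the interface is small enough for a server to answer cheaply.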

This approach is different, because you can do it, too

But wait a minute… there have already been initiatives that tried this. Until recently, Sindice crawled the Semantic Web, and we still have the LOD Cache. They have tried (or are still trying) to provide a SPARQL endpoint for the entire Linked Data cloud. So what is the difference?

In the model followed by Sindice and the LOD Cache, all data is ingested into one endpoint for the entire Web. That’s a serious case of centralization, which is not webby. Not to mention of course that you and I don’t have the infrastructure to make it happen. It’s just too expensive.

We take a totally different approach: we embrace the fact that the Web is distributed. The LOD Laundromat does not strive to be the one interface to Linked Data. Actually, it’s not one interface, but 600,000 self-describing Web APIs. The fact that those APIs happen to be hosted on one machine is a coincidence, not a necessity. If you want, you can also host an API. Or two. Or 600,000. It doesn’t matter where the APIs are located: this model scales because the interface is inexpensive.
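Each of those APIs answers one kind of request: give me the triples matching this pattern. As a rough sketch of what a client sends, the Python snippet below builds such a request URL. The base URL is a placeholder, and the query parameter names `subject`, `predicate`, and `object` are an assumption based on common server setups; because the APIs are self-describing, a real client should read the URL template from the hypermedia controls in each response instead of hard-coding it.

```python
from typing import Optional
from urllib.parse import urlencode


def fragment_url(base_url: str,
                 subject: Optional[str] = None,
                 predicate: Optional[str] = None,
                 obj: Optional[str] = None) -> str:
    """Build the URL of a fragment; omitted components are wildcards.

    Parameter names are assumed, not read from the server's controls.
    """
    params = {}
    if subject is not None:
        params["subject"] = subject
    if predicate is not None:
        params["predicate"] = predicate
    if obj is not None:
        params["object"] = obj
    return f"{base_url}?{urlencode(params)}" if params else base_url


# Hypothetical dataset URL; asks for all triples with predicate foaf:name.
url = fragment_url("http://example.org/dataset",
                   predicate="http://xmlns.com/foaf/0.1/name")
```

Since every request is just a plain URL, responses can be cached like any other Web resource, which is part of what keeps the interface inexpensive to host.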

Thanks to a bit of engineering, public Triple Pattern Fragments APIs outnumber public SPARQL endpoints by a few orders of magnitude. And though numbers don’t say everything, they do show a clear cost and scalability advantage of simple APIs. The LOD Laundromat and the Triple Pattern Fragments server are open-source, so feel free to set up your own installation and help build the distributed queryable Web of Linked Data.

Next on the agenda: a federated Triple Pattern Fragments client which can execute SPARQL queries across multiple interfaces—regardless of whether they are located on the same server or not. I can say we have something nice working already… but more will be revealed in a timely manner ;-)
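To give a feel for what "across multiple interfaces" could mean, here is a toy Python sketch. It is not the client hinted at above, only one naive way to picture the idea: the client evaluates a query one triple pattern at a time, asks every source for matches, and carries the variable bindings forward into the next pattern. Sources are plain in-memory lists standing in for remote interfaces, and all names (`federated_query`, `match`, the `ex:` data) are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]


def is_var(term: str) -> bool:
    """Variables are written SPARQL-style, with a leading question mark."""
    return term.startswith("?")


def substitute(pattern: Triple, bindings: Bindings) -> Triple:
    """Fill in the variables of a pattern that already have a value."""
    s, p, o = (bindings.get(t, t) if is_var(t) else t for t in pattern)
    return (s, p, o)


def match(source: List[Triple], pattern: Triple) -> Iterator[Bindings]:
    """Match one pattern against one source, yielding variable bindings."""
    for triple in source:
        bindings: Bindings = {}
        for term, value in zip(pattern, triple):
            if is_var(term):
                # A variable used twice must take the same value both times.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings


def federated_query(sources: List[List[Triple]],
                    patterns: List[Triple],
                    bindings: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Evaluate patterns left to right, asking every source for each one.

    Results found in several sources are not de-duplicated in this sketch.
    """
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    first, rest = substitute(patterns[0], bindings), patterns[1:]
    for source in sources:
        for new in match(source, first):
            yield from federated_query(sources, rest, {**bindings, **new})


# Two hypothetical sources, each holding half of the answer.
SOURCE_A = [("ex:alice", "foaf:knows", "ex:bob")]
SOURCE_B = [("ex:bob", "foaf:name", '"Bob"')]

results = list(federated_query(
    [SOURCE_A, SOURCE_B],
    [("ex:alice", "foaf:knows", "?friend"), ("?friend", "foaf:name", "?name")],
))
```

Neither source can answer the query on its own; the answer only appears because the client combines triple pattern results from both.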

