Ruben Verborgh

The devil is in the details,
but the demons are in the semantics.

Towards Web-scale Web querying

The quest for intelligent clients starts with simple servers that scale.

Most public SPARQL endpoints are down for more than a day per month. This makes it impossible to query public datasets reliably, let alone build applications on top of them. It’s not a performance issue, but an inherent architectural problem: any server offering resources with an unbounded computation time poses a severe scalability threat. The current Semantic Web solution to querying simply doesn’t scale. Over the past few months, we’ve been working on a different model of query solving on the Web. Instead of trying to solve everything at the server side—which we can never do reliably—we should build our servers in such a way that clients can solve queries efficiently.

The Web of Data is filled with an immense amount of information, but what good is that if we cannot efficiently access those bits of information we need?

SPARQL endpoints aim to fulfill the promise of querying on the Web, but their notoriously low availability rates make that impossible. In particular, if you want high availability for your SPARQL endpoint, you have to compromise one of these:

  • offering public access,
  • allowing unrestricted queries,
  • serving many users.

Any SPARQL endpoint that tries to fulfill all of those inevitably has low availability.
Low availability means unreliable query access to datasets.
Unreliable access means we cannot build applications on top of public datasets.

Sure, you could just download a data dump and host your own endpoint, but then you move from Web querying to local querying, and that problem was solved ages ago. Besides, it doesn’t give you access to up-to-date information, and who has enough storage to download a dump of the entire Web?

The whole “endpoint” concept will never work on a Web scale, because servers are subject to arbitrarily complex requests by arbitrarily many clients.

Can we approach Semantic Web querying in a Web way?

Rethinking Web querying

In the current Web landscape, it seems as if there are two querying choices:

  • public SPARQL endpoints, which live on the Web but have low availability;
  • local databases built from data dumps, which are not “Web” at all, but at least you control them.

However, there’s actually a whole spectrum between “data dumps” and “endpoints” that we haven’t explored yet. But we have to think differently.

Servers should only answer those questions they can answer rapidly, as those will not compromise a server’s availability. Just think about it: no regular HTTP server would agree to offer resources that can bring it down. Why does a SPARQL endpoint?

Before you say it’s unfair to compare a SPARQL endpoint to an HTTP server because endpoints do more work, let me tell you this: it is unfair. It’s precisely because SPARQL endpoints do so much work that they go down all the time.

So what kinds of questions should a server answer? Well, we have to design questions in such a way that their answers enable clients to solve queries.

Querying with Linked Data Fragments

We introduce Linked Data Fragments as an abstraction of all ways to serve parts of a linked dataset. SPARQL results are (expensive) Linked Data Fragments. Data dumps are (large) Linked Data Fragments. In between them is a whole range of other fragment types. Can we identify a type of fragments that

  • is easy to generate for the server?
  • allows the client to solve SPARQL queries efficiently?

We can: a Triple Pattern Fragment corresponds to a triple pattern.
So all a server has to do is to answer easy questions such as:

  • Do you have triples with predicate foaf:name?
  • What ?x a :Artist triples do you have?
  • How many ?y :appointer :Barack_Obama triples do you have?
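The core of such a server can be sketched as a simple pattern match over its triples, returning both the matching triples and their total count. The sketch below uses a tiny made-up in-memory dataset with prefixed names as plain strings; it is an illustration of the idea, not the actual server implementation.

```python
# Minimal sketch of a triple-pattern lookup: any combination of
# subject, predicate, and object may be left unspecified (None).
# The dataset below is made up for illustration.

TRIPLES = [
    ("dbpedia:Dustin_Gee", "rdf:type", "dbpedia-owl:Artist"),
    ("dbpedia:Dustin_Gee", "dbpedia-owl:birthPlace", "dbpedia:York"),
    ("dbpedia:York", "foaf:name", '"York"@en'),
]

def fragment(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern, plus the total count."""
    def matches(triple):
        s, p, o = triple
        return ((subject is None or s == subject) and
                (predicate is None or p == predicate) and
                (obj is None or o == obj))
    data = [t for t in TRIPLES if matches(t)]
    return {"triples": data, "count": len(data)}

# "What ?x rdf:type dbpedia-owl:Artist triples do you have?"
print(fragment(predicate="rdf:type", obj="dbpedia-owl:Artist"))
```

Because each request is a bounded lookup rather than an arbitrary query, the server’s cost per request stays small and predictable.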

Since some of those fragments can have many results, they are paged. However, metadata in the fragments indicates the total number of results, which clients need to efficiently plan queries. Hypermedia controls inside the fragment allow clients to discover other fragments of this and other datasets.

For example, suppose a client is asked to answer the question: “what artists were born in a city named ‘York’?”. In the SPARQL query language, this becomes:

SELECT ?person ?birthPlace WHERE {
  ?person a dbpedia-owl:Artist.
  ?person dbpedia-owl:birthPlace ?birthPlace.
  ?birthPlace foaf:name "York"@en.
}

Here is how we solve this using only Triple Pattern Fragments:

  • Get fragments for those 3 triple patterns and check how many triples match.
  • The pattern ?x foaf:name "York"@en has the least matches: 12.
  • Each match yields a possible ?birthPlace. Make new queries from them:
SELECT ?person WHERE {
  ?person a dbpedia-owl:Artist.
  ?person dbpedia-owl:birthPlace dbpedia:York.
}
  • Get fragments for those 2 triple patterns and check how many triples match.
  • ?x dbpedia-owl:birthPlace dbpedia:York has the least matches: 169.
  • Each match yields a possible ?person. Make new queries from them:
ASK WHERE {
  dbpedia:Dustin_Gee a dbpedia-owl:Artist.
}
  • Get the fragment for this triple pattern to check whether it exists.
  • In this case, there is a match for Dustin Gee.
    Others, such as Margaret Clitherow, do not match.
  • Collect all matching results.

Then, the results of all matching subpaths are combined into an answer to the initial SPARQL query.
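The strategy above—always expand the triple pattern whose fragment has the fewest matches—can be sketched as a greedy join over triple-pattern lookups. The following is an illustrative reimplementation against a tiny made-up in-memory dataset (dbpedia:Banksy is invented here to give the dataset a second artist), not the actual Linked Data Fragments client:

```python
# Greedy triple-pattern join, sketched over a made-up dataset.
# Patterns are (subject, predicate, object) tuples; "?var" marks a variable.

TRIPLES = [
    ("dbpedia:Dustin_Gee", "rdf:type", "dbpedia-owl:Artist"),
    ("dbpedia:Dustin_Gee", "dbpedia-owl:birthPlace", "dbpedia:York"),
    ("dbpedia:Margaret_Clitherow", "dbpedia-owl:birthPlace", "dbpedia:York"),
    ("dbpedia:Banksy", "rdf:type", "dbpedia-owl:Artist"),  # invented example
    ("dbpedia:York", "foaf:name", '"York"@en'),
]

def is_var(term):
    return term.startswith("?")

def matches(pattern):
    """All triples matching a pattern; variables match anything."""
    return [t for t in TRIPLES
            if all(is_var(p) or p == v for p, v in zip(pattern, t))]

def substitute(pattern, binding):
    """Replace bound variables in a pattern by their values."""
    return tuple(binding.get(term, term) for term in pattern)

def solve(patterns, binding=None):
    """Recursively solve a set of patterns, smallest fragment first."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    # Check each pattern's number of matches and pick the smallest fragment.
    grounded = [substitute(p, binding) for p in patterns]
    best = min(range(len(grounded)), key=lambda i: len(matches(grounded[i])))
    rest = patterns[:best] + patterns[best + 1:]
    # Each match extends the binding; recurse on the remaining patterns.
    for triple in matches(grounded[best]):
        new_binding = dict(binding)
        for term, value in zip(grounded[best], triple):
            if is_var(term):
                new_binding[term] = value
        yield from solve(rest, new_binding)

query = [
    ("?person", "rdf:type", "dbpedia-owl:Artist"),
    ("?person", "dbpedia-owl:birthPlace", "?birthPlace"),
    ("?birthPlace", "foaf:name", '"York"@en'),
]
for solution in solve(query):
    print(solution["?person"])  # dbpedia:Dustin_Gee
```

As in the walkthrough, the name pattern has the fewest matches and is expanded first; Margaret Clitherow and Banksy are pruned because they fail one of the remaining patterns.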

Try fragment-based query execution in your browser using the Linked Data Fragments Web client. You can also download the client and its source code.

On the Linked Data Fragments website, you’ll find much more on this approach and how it works, including our paper for the upcoming LDOW2014 workshop.

Linked Data Fragments
Linked Data Fragments make the Web of Data Web-scale by allowing clients to query datasets.

The way forward

The online client is a first step, a demo to show what’s already possible today. Many improvements are necessary. For instance, results should be streamed, paging needs to be supported to obtain the full result set, …
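One of those improvements, client-side paging, can be sketched as a generator that lazily walks through a fragment’s pages. The page size, the simulated `get_page` fetch, and the data are all made up for illustration:

```python
# Sketch of streaming a paged fragment: the server returns fixed-size
# pages; the client lazily requests the next page as results are consumed.
# Page size and matches are made up for illustration.

PAGE_SIZE = 2
MATCHES = ["triple-%d" % i for i in range(5)]  # pretend fragment matches

def get_page(number):
    """Simulate fetching one page of a fragment (pages are 1-indexed)."""
    start = (number - 1) * PAGE_SIZE
    return MATCHES[start:start + PAGE_SIZE]

def stream_fragment():
    """Yield all matches of a fragment, page by page, until a page is empty."""
    number = 1
    while True:
        page = get_page(number)
        if not page:
            return
        yield from page
        number += 1

print(list(stream_fragment()))  # all 5 matches, fetched in 3 pages
```

In a real client, `get_page` would follow the fragment’s hypermedia controls to the next page instead of slicing a list.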

But those are only the implementation challenges. Far more interesting questions are ahead of us. How can we further improve queryable access to public datasets?

Triple Pattern Fragments are just the beginning of the search for a sweet spot that balances the availability needs of servers and clients’ desire to find an answer to complex questions.

Want to join us on that quest? Mail me or follow @LDFragments.
Visit the Linked Data Fragments website for more. And try the client, you’ll like it!