Can I SPARQL your endpoint?

Perhaps the model in which a server does all the work, doesn’t work.

SPARQL, the query language of the Semantic Web, allows clients to retrieve answers to complex questions. It’s a core technology in the Semantic Web Stack, as it enables flexible querying of Linked Data. If the Google search box is the entry to the human Web, a SPARQL query field is the entry to the machine Web. There’s only one slight problem: nobody seems able to keep a SPARQL endpoint up. Maybe the issue is so fundamental that more processing power cannot solve it.

30 September 2013

SPARQL endpoints allow clients to ask for specific data. For instance, the following query searches for the names of artists who were born in a city named “Ghent”.

PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>

SELECT ?name WHERE {
  ?person dbpprop:name ?name.
  ?person a dbpedia-owl:Artist.
  ?person dbpedia-owl:birthPlace ?city.
  ?city dbpprop:name "Ghent"@en.
}

Each part of the query is a triple, a small sentence of exactly three parts that machines can interpret. Imagine doing this manually with Google… You’d need a list of all artists, which you would then have to filter for those born in Ghent. Either that, or you’d find a list of all people born in Ghent and check which of them are artists. Even if you’re allowed to give only your top-10 results, it will still take you some time. With SPARQL endpoints, you can obtain results much more easily, but it comes at a cost.
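To make the client side concrete, here is a minimal sketch of how such a query could be sent over HTTP, following the SPARQL protocol's convention of passing the query text in a `query` parameter. The endpoint URL (DBpedia's public endpoint) and the helper name `build_request` are assumptions for illustration. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: DBpedia's public endpoint; any SPARQL endpoint URL would do.
ENDPOINT = "http://dbpedia.org/sparql"

QUERY = """
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT ?name WHERE {
  ?person dbpprop:name ?name.
  ?person a dbpedia-owl:Artist.
  ?person dbpedia-owl:birthPlace ?city.
  ?city dbpprop:name "Ghent"@en.
}
"""


def build_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request carrying the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_request(ENDPOINT, QUERY)
# Sending it would take one more call: urllib.request.urlopen(request)
```

For the client, the whole question fits in a single URL. Everything that follows happens on the server.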

Executing arbitrary SPARQL queries proves a heavy task for servers. Clients can make any possible combination of triples, and servers should respond as fast as they can. But responding takes time: even though each individual step is easier if all data is available locally, the server still needs to perform all steps you would do when trying it manually. The question is not whether it can, but why it would. What’s the incentive of the server to satisfy the client’s curiosity?
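To get a feeling for those steps, here is a toy sketch of what answering the example query involves: matching each triple pattern against the data and joining the partial results. It is a deliberately naive nested-loop evaluator, not how any real SPARQL engine is implemented. The data, the `ex:` identifiers, and the function names are all hypothetical.

```python
# Hypothetical toy data: (subject, predicate, object) triples, not real DBpedia content.
TRIPLES = [
    ("ex:alice", "name", "Alice"),
    ("ex:alice", "type", "Artist"),
    ("ex:alice", "birthPlace", "ex:ghent"),
    ("ex:bob", "name", "Bob"),
    ("ex:bob", "type", "Artist"),
    ("ex:bob", "birthPlace", "ex:paris"),
    ("ex:carol", "name", "Carol"),
    ("ex:carol", "birthPlace", "ex:ghent"),
    ("ex:ghent", "name", "Ghent"),
    ("ex:paris", "name", "Paris"),
]


def is_variable(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            if result.setdefault(term, value) != value:
                return None  # variable is already bound to a different value
        elif term != value:
            return None  # constant in the pattern does not match
    return result


def evaluate(patterns, triples):
    """Naive join: scan all triples once per pattern for every partial solution."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return solutions


# The four triple patterns of the example query, with simplified predicate names.
PATTERNS = [
    ("?person", "name", "?name"),
    ("?person", "type", "Artist"),
    ("?person", "birthPlace", "?city"),
    ("?city", "name", "Ghent"),
]

names = [solution["?name"] for solution in evaluate(PATTERNS, TRIPLES)]
print(names)  # ['Alice']
```

Even in this tiny example, every pattern triggers another pass over the data for each partial solution. Real engines use indexes and query planning to do far better, but the work still lands on the server, for every client and every query.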

Servers can only do significantly more work than clients if they have incentives. ©Joe Cashin

A fundamental inequality

Being a distributed hypermedia system, the Web’s interaction model is based on servers sending documents to clients. You could argue that there is always some inequality: servers do a lot of work, and clients just profit from that. However, this is not entirely true. First, from the technical perspective, a server indeed generates a representation, but the client is responsible for the interpretation. For instance, the server sends HTML, the browser renders it graphically, and the user understands the text and acts upon it. Second, the owner of the server almost always gains something: you read pages which invite you to buy stuff or services. Third, even if the server gains little or nothing, only a limited cost is involved in serving HTML pages. Many providers let you host your website for free.

With SPARQL endpoints, the situation is very different. First, the server suddenly becomes responsible for a lot more interpretation. Essentially, you ask the server to solve your specific question, instead of asking for more generic information that everybody uses and drawing a conclusion yourself. Second, what does the server gain by answering a complex question? Or more precisely, will its hard work turn you into a better customer somehow? The server invests a lot in you, but you don’t invest back. Oh, and there are thousands of clients like you, but only one server, so that’s a lot of one-sided investment. And finally, yes, the cost is high. Answering queries costs time. It’s like going to a library and asking the librarian to do your homework, instead of asking her for a book and reading it yourself to finish your assignment.

The result of all this is that people are afraid to set up SPARQL endpoints. And those who do cannot keep their SPARQL endpoints up. In contrast, setting up Web pages is not a problem: many people do it daily. I’m by no means picking on proud SPARQL endpoint owners, just saying I think they’re really brave. But the main reason we don’t see more SPARQL endpoints is that it’s virtually impossible to be available for any query clients might ask. It’s just too much effort for no return.

Suppose I gave you a brand new server, and let you choose between hosting a 10.000.000-triple SPARQL endpoint under peak loads of 1.000 requests per second, or 15GB worth of static HTML pages under peak loads of 10.000 requests per second. Your challenge is to keep it up for one month with an availability of 99.9 percent. If you succeed, you get $1.000; if you lose, you pay me that same sum (and give me back my server). Which one would you rather choose?

In other words: do you dare to let me SPARQL your endpoint?

Ruben Verborgh
