The Pragmatic Web

The Semantic Web tried to solve too much—simple things should work first.

As in any technological or scientific community, optimism ran high in the early years of the Semantic Web. Artificial intelligence researchers in the 1960s believed it would be a matter of years before machines would become better at chess than humans, and that machines would seamlessly translate texts from one language into another. Semantic Web researchers strongly believed in the intelligent agents vision, but along the way, things turned out to be more difficult. Yet people still seem to focus on trying to solve the complex problems, instead of tackling simple ones first. Can we be more pragmatic about the Semantic Web? As an example, this post zooms in on the SemWeb’s default answer to querying and explains why starting with simple queries might just be a better idea.

29 May 2014

Think about the Web, simply the Web as we humans use it. How does it work?
The ways in which you can access resources are determined by how the server structurally exposes its data. You somehow receive the URL of a resource you are interested in; for example, you click a link on the page you are reading. Then your browser requests it through HTTP, and the server sends back an HTML response.

Think about the Web, but now about how applications use it. How does that work?
The application again has a URL of a specific resource, gets it through HTTP, and retrieves a JSON response (because machines don’t like HTML).

Now think about the Web, the Web for intelligent agents. How should that work?
Well, the agent sends a URL of a resource, gets it through HTTP, and retrieves RDF, because those clients can interpret RDF and do cool and smart things with it.
Nice. Except that the Web for intelligent agents doesn’t work that way.

If you think about it, such an HTTP interface that simply returns RDF instead of HTML or JSON would have been a nice step. Clients would have what they needed to make sense of responses. But that’s not what happened: instead, the SPARQL protocol was designed to allow clients to ask much more complicated questions.
They would not be limited anymore by how servers had decided to offer resources.
They could access all of the server’s data the way they wanted—at least in theory.

Unfortunately, directly solving this more complex problem first has proven counterproductive. We now have a protocol to select any part of a given dataset, but it turns out to be virtually impossible for servers to do this reliably. We tried to solve all problems at once, but that doesn’t work on the Web in practice. So what do we do now?

Why turn our eyes to complex problems, when simple ones are yet unsolved? ©Michael Heiss

Simple servers, smart clients

The idea behind Linked Data Fragments is to find more ways to access specific parts of a dataset besides SPARQL. I believe that the structural limits imposed by an HTTP server’s resource interface are there for a reason: to ensure that the server can answer all requests in a reliable and timely manner. The SPARQL protocol seems to consider this an unnecessary restriction, a limit for the client that needs to be removed. But let’s be honest, if you hosted a Web application, would you offer (even read-only) direct SQL access to your database? Of course you wouldn’t; this would pose a serious threat to the stability of your server. And, it’s not needed: you design your HTTP interface such that all data can be easily accessed—but you decide how!

We’ve been thinking of such HTTP interfaces that are handy to query Linked Data datasets. So the server still decides how clients access data—just like on the Web for humans or applications—but this time in RDF. We designed one such interface that consists of basic Linked Data Fragments, which offer triple-pattern-based access to a dataset. Servers can easily generate such fragments, and clients can use them to solve more complex queries themselves. So simple servers, smart clients.
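To make “triple-pattern-based access” a bit more concrete, here is a minimal sketch in Python. It models the server as a single function that takes a triple pattern, in which any position may be left open, and returns one page of matching triples together with the total number of matches. The in-memory `DATASET`, the `fragment` function, its `page` and `page_size` parameters, and the `ex:` identifiers are all assumptions made for this illustration; this is not the API of an actual Linked Data Fragments server, which would expose the same idea over HTTP.

```python
from typing import List, NamedTuple, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical toy dataset; the ex: identifiers are made up for illustration.
DATASET: List[Triple] = [
    ("ex:Alice", "rdf:type", "dbpedia-owl:Artist"),
    ("ex:Alice", "dbpedia-owl:birthPlace", "ex:York"),
    ("ex:Bob", "rdf:type", "dbpedia-owl:Politician"),
    ("ex:Bob", "dbpedia-owl:birthPlace", "ex:York"),
    ("ex:York", "foaf:name", '"York"@en'),
]


class Fragment(NamedTuple):
    triples: List[Triple]  # one page of matching triples
    total_count: int       # number of matches across all pages


def fragment(subject: Optional[str] = None,
             predicate: Optional[str] = None,
             obj: Optional[str] = None,
             page: int = 1,
             page_size: int = 2) -> Fragment:
    """Return one page of triples matching the pattern; None is a wildcard."""
    pattern = (subject, predicate, obj)
    matches = [t for t in DATASET
               if all(p is None or p == v for p, v in zip(pattern, t))]
    start = (page - 1) * page_size
    return Fragment(matches[start:start + page_size], len(matches))


# Usage: who was born in ex:York? Only the first page is returned.
born_in_york = fragment(predicate="dbpedia-owl:birthPlace", obj="ex:York")
print(born_in_york.total_count, born_in_york.triples)
```

The point of the sketch is that the server’s work per request stays small and predictable: one filter and one page. Anything more complex, such as joining several patterns, is left to the client.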

Because, really, since when was the Semantic Web about intelligent servers? ;-)

We cannot solve everything

A criticism I often get is: “your Web client/server mechanism cannot solve every query”.
My reply is: “a public SPARQL endpoint can… but doesn’t—and never will”.
But yes, we can find examples similar to this one:

# Find artists that are also politicians
SELECT ?person WHERE {
  ?person a dbpedia-owl:Artist.
  ?person a dbpedia-owl:Politician.
}

Indeed, such a query cannot be solved rapidly over DBpedia if the server only offers a triple-pattern interface. There are many artists and many politicians, so we need to loop over all politicians and check for each whether they are an artist. That’s slow.
Triple-pattern-based querying only works efficiently if at least one of the patterns has a small number of matches. For instance, the following query can be solved with basic Linked Data Fragments in a matter of seconds:

# Find artists born in cities named “York”
SELECT ?person ?birthPlace WHERE {
  ?person a dbpedia-owl:Artist.
  ?person dbpedia-owl:birthPlace ?birthPlace.
  ?birthPlace foaf:name "York"@en.
}

There are only a few Yorks, and then only a few people born there. This makes the execution of this query efficient. A SPARQL endpoint could do this faster, but only if it is available—and that’s often not the case. Basic Linked Data Fragment clients can do this in 3 seconds 99.99% of the time, because simple servers have high availability.
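The reasoning above, starting from the pattern with the fewest matches and filling in its results before looking at the others, can be sketched as a small client-side algorithm. This is a simplified illustration and not the actual Linked Data Fragments client: `DATASET`, `lookup`, `substitute`, and `solve` are hypothetical names, the `ex:` data is made up, and `lookup` stands in for an HTTP request to a triple-pattern server. A real client would presumably read match counts from response metadata instead of fetching all matches just to count them.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]

# Hypothetical toy dataset; the ex: identifiers are made up for illustration.
DATASET: List[Triple] = [
    ("ex:Alice", "rdf:type", "dbpedia-owl:Artist"),
    ("ex:Alice", "dbpedia-owl:birthPlace", "ex:York"),
    ("ex:Bob", "rdf:type", "dbpedia-owl:Politician"),
    ("ex:Bob", "dbpedia-owl:birthPlace", "ex:York"),
    ("ex:Carol", "rdf:type", "dbpedia-owl:Artist"),
    ("ex:Carol", "dbpedia-owl:birthPlace", "ex:Ghent"),
    ("ex:York", "foaf:name", '"York"@en'),
    ("ex:Ghent", "foaf:name", '"Ghent"@en'),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def lookup(pattern: Triple) -> List[Triple]:
    """Stand-in for one server request; variables act as wildcards."""
    return [t for t in DATASET
            if all(is_var(p) or p == v for p, v in zip(pattern, t))]


def substitute(pattern: Triple, bindings: Bindings) -> Triple:
    """Replace variables that already have a value."""
    s, p, o = (bindings.get(term, term) for term in pattern)
    return (s, p, o)


def solve(patterns: List[Triple],
          bindings: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Yield variable bindings that satisfy all patterns.

    Assumes each variable occurs at most once within a single pattern.
    """
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    # Fill in known values, then pick the pattern with the fewest matches.
    options = []
    for index, pattern in enumerate(patterns):
        bound = substitute(pattern, bindings)
        options.append((lookup(bound), bound, index))
    matches, chosen, index = min(options, key=lambda option: len(option[0]))
    rest = patterns[:index] + patterns[index + 1:]
    for triple in matches:
        extended = dict(bindings)
        for term, value in zip(chosen, triple):
            if is_var(term):
                extended[term] = value
        yield from solve(rest, extended)


# The “York” query from above, written as triple patterns.
YORK_QUERY: List[Triple] = [
    ("?person", "rdf:type", "dbpedia-owl:Artist"),
    ("?person", "dbpedia-owl:birthPlace", "?birthPlace"),
    ("?birthPlace", "foaf:name", '"York"@en'),
]
answers = list(solve(YORK_QUERY))
print(answers)
```

On this toy data, the name pattern has a single match, so the client starts there and only ever asks about people born in that one city. With the artist/politician query, no pattern is small, so the same algorithm would have to issue one lookup per politician (or per artist), which is exactly why it is slow over a large dataset.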

The simple problems

But why would it be acceptable that some questions can be solved efficiently, while others cannot? Well, let’s actually look at it from a different perspective: why would we expect that all questions are easy to answer?

Take the second query: finding artists born in cities named “York”.
That’s something that a human could do easily on Wikipedia:

1. Look for cities named “York”.
2. For each city, find its people.
3. Check if those inhabitants are artists.

I mean, it would take some repetitive work, but you could find the answers.
The first query, on the other hand, would take you quite some time to complete.

Now take a step back… Isn’t that already quite a lot?
The fact that we can easily solve queries that humans can also solve easily?
Plus, we have the additional benefit that machine clients don’t get bored, so they have no problem finding answers.

The issue with SPARQL is that we tried to solve everything, and as a result, are able to do nothing reliably. We need to be much more pragmatic about the Semantic Web: let’s first enable those things that humans could do easily. Only then does it make sense to look at more sophisticated problems. We need to stop solving problems in theory, and look at what we can do in reality. We need reliable systems to build on.

And this is not only true for querying, but for all facets of the Semantic Web.
Why are we focused on making everything work in theory, but so little in practice?
The Semantic Web needs do-ers. Developers. Simple things that work.

Because if the simple things don’t work, neither will the complex things.
It’s time we show the world something they can build on.

Ruben Verborgh