Federated SPARQL queries in your browser
Querying multiple sources is the Linked Data dream—and it’s happening now.
Querying multiple sources reveals the full potential of Linked Data by combining data from heterogeneous origins into a consistent result. However, I have to admit that I had never executed a federated query before. Executing regular SPARQL queries is relatively easy: if the endpoint is up, you can just post your query there. But where do I post my query if there are multiple endpoints, and will they communicate to evaluate that query? Or do I have to use a command-line tool? We wanted federated queries to be as accessible as anything else on the Web, so our federated Triple Pattern Fragments engine runs in your browser. At last, multiple Linked Data sources can be queried at once, at very low server-side cost.
Linked Data excels at providing a uniform view over different datasources. One of the main reasons for the existence of RDF is its simple uniform model, in which we can easily express any kind of knowledge, regardless of its origins. This happens with the world’s largest encyclopedia Wikipedia, which exists in triplified form as DBpedia, and that’s just one of thousands of examples. Domains such as the digital humanities also strongly invest in Linked Data versions of their data, as evidenced by initiatives such as the Virtual International Authority File (VIAF) and metadata from libraries all over the world such as those of Harvard and Ghent University.
So since everybody has been putting data online, it’s our turn as technologists to act upon this. Can we query it? More specifically, can we execute federated queries on this data, queries that combine data from multiple sources? Great research on federated SPARQL querying has been carried out by pioneers in the Semantic Web community, who build highly optimized engines that deliver results fast. This work is vital to transition from single-source to multi-source SPARQL queries. However, most of this work remains intangible for developers and end users. Sure, there might be a couple of public SPARQL endpoints that can query data from others using the SERVICE keyword. And several federated query engines are available as command-line applications. But this is not as easy as querying regular SPARQL endpoints, where we can just post the query we need.
Furthermore, SPARQL endpoints suffer from a two-sided availability problem. On the one hand, many datasets are not hosted as a public SPARQL endpoint, at least partly because of the high costs associated with operating such an interface. On the other hand, several SPARQL endpoints that are on the public Web suffer from frequent downtime, which makes building applications on top of them unrealistic. Especially in the context of federated querying, we get an unavailability cascade: if the average endpoint has 95% uptime, then two average endpoints are both available only about 90% of the time (0.95 × 0.95 ≈ 0.90, assuming they fail independently), and this worsens progressively when we want to query more endpoints.
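The arithmetic behind this cascade can be sketched in a few lines of Python. This is only an illustration: it assumes that endpoints fail independently of each other, which real endpoints may not do, and the function name combined_uptime is made up for this example.

```python
def combined_uptime(uptime: float, endpoints: int) -> float:
    """Chance that all endpoints are up at the same moment.

    Assumption (not from the article): failures are independent,
    so the individual probabilities simply multiply.
    """
    return uptime ** endpoints


# Print how the combined availability shrinks as more endpoints are needed.
for n in (1, 2, 5, 10):
    print(f"{n:2d} endpoint(s): {combined_uptime(0.95, n):.1%}")
```

Under that assumption, a query that needs ten endpoints with 95% uptime each would find all of them available less than 60% of the time.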
Federated SPARQL queries with Triple Pattern Fragments
The Triple Pattern Fragments interface is a light-weight alternative to SPARQL endpoints. Instead of allowing clients to ask any SPARQL query, clients of Triple Pattern Fragments can only ask ?s ?p ?o questions. In other words, they can ask the server to filter all of its triples by specific subjects, predicates, objects, or combinations of these. Such a low-cost interface is easy to host with high availability, which makes querying of Linked Data much more realistic. Complex SPARQL queries are evaluated by the client, which breaks down a query into multiple Triple Pattern Fragments. We built a client that runs in your browser, so you can run live SPARQL queries.
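To make the ?s ?p ?o idea concrete, here is a minimal Python sketch of the only kind of question such a server has to answer: filtering triples by whichever positions are fixed. It works on an in-memory list and is not the actual server software; the names Triple, match_pattern and DATASET, as well as the toy triples, are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str


def match_pattern(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    object_: Optional[str] = None,
) -> List[Triple]:
    """Keep the triples that agree with every fixed position.

    None acts as a wildcard, like a variable in a ?s ?p ?o pattern.
    """
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (object_ is None or t.object == object_)
    ]


# Toy data for illustration only, not an excerpt of a real dataset.
DATASET = [
    Triple("dbpedia:Pablo_Picasso", "dbpedia-owl:movement", "dbpedia:Cubism"),
    Triple("dbpedia:Pablo_Picasso", "foaf:name", "Pablo Picasso"),
    Triple("dbpedia:Georges_Braque", "dbpedia-owl:movement", "dbpedia:Cubism"),
]

# "?s dbpedia-owl:movement dbpedia:Cubism": only the subject is left open.
cubists = match_pattern(
    DATASET, predicate="dbpedia-owl:movement", object_="dbpedia:Cubism"
)
print([t.subject for t in cubists])
```

Because the server never does more than this kind of filtering, all joining and further filtering is left to the client.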
Interestingly, this resembles how federated engines work: they also break down a federated SPARQL query into multiple non-federated queries. In that sense, federation is native to Triple Pattern Fragments, where clients always decompose queries into parts. In order to make the engine federated, we simply need to send those parts to multiple interfaces instead of just one. Add a little optimization to that, and you have a federated engine to evaluate SPARQL queries against any number of datasources. My colleague Miel Vander Sande implemented this on top of our JavaScript query engine for Triple Pattern Fragments.
This example SPARQL query finds artists of the Cubism movement and their works:
SELECT ?name ?work ?title {
  ?artist dbpedia-owl:movement [ rdfs:label "Cubism"@en ];
          foaf:name ?name.
  ?work schema:author [ schema:sameAs ?artist ];
        schema:name ?title.
  FILTER (!REGEX(?name, ","))
}
No single datasource contains all information required to solve this query. DBpedia contains some parts of the answer (which artists are Cubists), and VIAF contains other parts (works each artist has authored). Only by evaluating this query over both datasources can we obtain correct answers. And thanks to the federated SPARQL engine, your browser can compute these answers for you. Try it now!
In the answer you receive, you’ll see that the ?artist values have DBpedia URLs, whereas the ?work values have VIAF URLs. The connection between DBpedia and VIAF is made in line 4 of the query: we look for ?work values whose author is schema:sameAs the ?artist from DBpedia. For instance, dbpedia:Pablo_Picasso is the same as viaf:15873. The FILTER at the end drops names that contain a comma, which ensures we only get a single label per artist.
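The way a client can split such a query into triple patterns and send each pattern to every datasource can be sketched as follows. This is a simplified Python illustration under several assumptions, not the implementation of the actual JavaScript engine: the sources are in-memory lists instead of Web interfaces, the blank nodes of the query are replaced by plain variables and a fixed dbpedia:Cubism value, the FILTER is left out, and no optimization is applied. The work identifier example:work1 and its title are invented placeholders, not real VIAF data.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Bindings = Dict[str, str]


def is_variable(term: str) -> bool:
    return term.startswith("?")


def match(pattern: Triple, triple: Triple, bindings: Bindings) -> Optional[Bindings]:
    """Extend bindings so that pattern equals triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if is_variable(term):
            # Bind the variable, or check it against its earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def evaluate(
    patterns: List[Triple],
    sources: List[List[Triple]],
    bindings: Optional[Bindings] = None,
) -> Iterator[Bindings]:
    """Nested-loop join in which every pattern is asked to every source."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for source in sources:  # the federation step: try all sources
        for triple in source:
            extended = match(first, triple, bindings)
            if extended is not None:
                yield from evaluate(rest, sources, extended)


# Toy sources; only the Picasso/viaf:15873 link is taken from the text above.
DBPEDIA = [
    ("dbpedia:Pablo_Picasso", "dbpedia-owl:movement", "dbpedia:Cubism"),
    ("dbpedia:Pablo_Picasso", "foaf:name", "Pablo Picasso"),
]
VIAF = [
    ("viaf:15873", "schema:sameAs", "dbpedia:Pablo_Picasso"),
    ("example:work1", "schema:author", "viaf:15873"),  # hypothetical work
    ("example:work1", "schema:name", "Guernica"),      # hypothetical title
]

QUERY = [
    ("?artist", "dbpedia-owl:movement", "dbpedia:Cubism"),
    ("?artist", "foaf:name", "?name"),
    ("?author", "schema:sameAs", "?artist"),
    ("?work", "schema:author", "?author"),
    ("?work", "schema:name", "?title"),
]

results = list(evaluate(QUERY, [DBPEDIA, VIAF]))
for row in results:
    print(row["?name"], row["?work"], row["?title"])
```

In this sketch, running the same patterns against only one of the two lists yields no results, which mirrors the observation that no single datasource can answer the query.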
Try your own federated SPARQL queries
Together with Patrick Hochstenbach, I created some federated queries in the digital libraries domain for the ELAG 2015 conference (blog post). Play around with them:
- Authors of works in the Harvard Library and their VIAF ID
  This query connects authors from the Harvard Library, which don’t have a persistent ID, to VIAF, which has a unique identifier for each author. The match is based on name and birth date to avoid false positives.
- Writers born in Stockholm and their works
  This query looks up people who are writers in DBpedia, filtering them based on birth place (also from DBpedia), and then fetches their works from VIAF.
- Works in Harvard Library by Swedish Nobel Prize winners
  This query finds all Swedish Nobel Prize winners from DBpedia, and selects their name in “last, first” format from VIAF. Based on this name, we look up books by these authors in the Harvard Library.
You can use these queries as the basis for your own. For instance, in the first query, you could use any other Triple Pattern Fragments interface to see whether you find matching people. In the second and third queries, you can change the criteria to find, for instance, authors from your own city or country. The only limit is your creativity. Try it now!
Release the power of Linked Data for your datasets
Executing federated SPARQL queries in your browser really showcases the power of Linked Data. It’s only when querying multiple sources that we do justice to RDF’s integration capabilities. The promise to truly query the Web is now within reach, especially considering the low-cost RDF interface we use on the server side. Publishing Triple Pattern Fragments requires few resources, but directly enables federated queries on top of your data.
Even though more than 600,000 Triple Pattern Fragments interfaces already exist, we strongly encourage more publishers to offer their data in this way. Users and applications querying your data together with DBpedia or VIAF? It is not only possible, but also realistic thanks to the low interface cost and the direct browser-based access. Furthermore, open-source software for publishing and querying is implemented in several languages.
Hurdles left to take
While the browser-based interface brings federated querying closer to users, there are still some steps left to take. For instance, the current implementation is not an end-user application. While it can be used by data specialists and software developers, the idea is actually to integrate the query engine as a library in Web applications that only need SPARQL behind the scenes. Just like end users don’t write SQL, we don’t expect them to write SPARQL either.
Writing federated SPARQL queries is indeed a complex matter. Just like with regular SPARQL, you have to know what ontologies the datasource uses, but this time for all datasources combined. Furthermore, you have to find points of correspondence that allow you to connect one datasource to another. And it will not always be as easy as in the VIAF case, where there are explicit links to DBpedia. But even in that case, one could argue that the query is quite complex with the need for schema:sameAs. And how did we know we needed this property in the first place? Clearly, we need solutions that help users explore datasets and build queries, preferably in their browser as well.
And finally, you might have noticed that the federation is implicit in the query for now. The query itself does not contain any information about the datasources; instead, the user chooses these in a separate input field. But what if we want to direct one part of the query to a specific datasource, and another part to other datasources? We could use the SERVICE keyword for that, but this would again make queries more complex. And we shouldn’t forget that the main reason we’re doing this is to bring federated queries over Linked Data closer to the people ;-)
Enjoy federated SPARQL querying, and consider publishing your data as Triple Pattern Fragments, so other people can enjoy it as well.