Lightning-fast RDF in JavaScript
Writing a spec-compatible Turtle parser has been an interesting journey.
Node.js has spawned a new, asynchronous generation of tools. Asynchronous thinking is different from traditional stream processing: instead of actively waiting for data in program routines, you write logic that acts when data arrives. JavaScript is an ideal language for that, because callback functions are lightweight. I have written a parser for Turtle, an RDF serialisation format, that uses asynchrony for maximal performance.
Turtle is a straightforward syntax for RDF, the Semantic Web language. Currently a Candidate Recommendation at W3C, it will likely become the de facto standard for representing RDF. Although several implementations exist, my goal was to create a high-performance parser compatible with the latest spec. Furthermore, I wanted this to be a streaming parser, avoiding the need to load the entire file into memory (and thus allowing files larger than physical memory to be parsed). The resulting parser is available on npm and GitHub, and this post highlights some of the technical challenges I faced.
Representing triples
The first important choice is the representation of triples (or quads, as I want to support graphs as well). I did not want an “everything is an object” approach, because that would make developers’ code unnecessarily complex and slow. All entities are represented as strings, and triples are plain JavaScript objects:
{ subject:   'http://example.org/cartoons#Tom',
  predicate: 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
  object:    'http://example.org/cartoons#Cat',
  context:   'n3/contexts#default' }
This is very fast, because modern JavaScript engines automatically create a single hidden runtime class for objects with the same shape. Now you might wonder how I differentiate between URIs and literals. For that, I use a simple convention: literals are enclosed in double quotes:
var uri = 'http://example.org/cartoons#Tom';
var literal = '"Tom Cat"';
Note the extra pair of quotes inside the second string, which indicates that it’s a literal. No escaping is required within the string though.
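This makes the distinction a constant-time check on the first character. As a minimal sketch (isLiteral is an illustrative helper here, not necessarily the library’s API):

// Sketch: a literal is recognized by its leading double quote
// (illustrative helper, not necessarily the library's actual API)
function isLiteral(entity) {
  return entity[0] === '"';
}
isLiteral('http://example.org/cartoons#Tom'); // false
isLiteral('"Tom Cat"');                       // true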
Tokenizing a Turtle stream
The main challenge with tokenizing streams is that you cannot look ahead. Yet, you typically want to decide as soon as possible what you can do with the input you already have, but not sooner. For instance, consider the following received chunk:
c:Tom c:name ""
You might be tempted to conclude that Tom has an empty name and to generate that triple right away. Unfortunately, that’s not always true:
c:Tom c:name """Tom""".
Here, we would have decided too soon, since the two double quotes were actually the start of a triple-quoted literal. As a result, many regular expressions in the tokenizer end with a lookahead pattern that checks for whitespace or punctuation.
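For instance, a lookahead can ensure that an empty literal is only matched when it is definitely complete (the pattern below is an illustrative sketch, much simpler than the tokenizer’s actual expressions):

// Sketch: match an empty literal only when followed by whitespace
// or punctuation (illustrative, simpler than the real tokenizer)
var emptyLiteral = /^""(?=[\s.;,\])])/;
emptyLiteral.test('"" .');      // true:  the literal is complete
emptyLiteral.test('""');        // false: more characters might follow
emptyLiteral.test('"""Tom"""'); // false: start of a long literal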
Additionally, much work went into constructing the regular expressions for URIs. This was all the more complex since JavaScript regular expressions cannot directly match characters outside the Basic Multilingual Plane: strings are sequences of UTF-16 code units, so such characters must be matched as surrogate pairs. Unescaping URIs and strings is also quite complex because of the many different possible cases. Unfortunately, this also makes decoding quite slow, which is why I’ve included a separate regular expression for strings without escape codes, thereby making this common case a lot faster.
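The fast path boils down to this idea (names and patterns below are illustrative, not the parser’s actual code):

// Sketch of the fast path: strings containing no backslash skip
// the expensive unescaping step entirely (illustrative code)
var escapeSequence = /\\(?:u([0-9a-fA-F]{4})|(.))/g;
var escapedChars = { 't': '\t', 'n': '\n', 'r': '\r', '"': '"', '\\': '\\' };
function unescapeString(value) {
  if (value.indexOf('\\') < 0) return value; // common case: nothing to do
  return value.replace(escapeSequence, function (match, hex, chr) {
    return hex ? String.fromCharCode(parseInt(hex, 16))
               : (escapedChars[chr] || chr);
  });
}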
Parsing a Turtle stream
The tokenizer converts the stream of characters into a stream of Turtle tokens, which similarly need to be parsed as soon as possible. I used a callback mechanism in which each callback acts on an incoming token and returns the callback that should handle the next one. This is similar to how a finite-state machine works.
For each possible state, there is a callback: one for reading a subject, one for reading a predicate, one for reading punctuation, and so on, as sketched below. However, some features such as lists and blank nodes can be nested, so they additionally require a stack for proper parsing, since a finite-state machine alone cannot handle arbitrary nesting.
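A minimal sketch of the mechanism (the state names, token shape, and constructor are illustrative, not the parser’s actual internals):

// Sketch: each state is a method that consumes one token and
// returns the next state (illustrative, not the parser's actual code)
function Parser(onTriple) {
  this._onTriple = onTriple;
  this._state = this._readSubject;
}
Parser.prototype.addToken = function (token) {
  this._state = this._state(token);
};
Parser.prototype._readSubject = function (token) {
  this._subject = token.value;
  return this._readPredicate;
};
Parser.prototype._readPredicate = function (token) {
  this._predicate = token.value;
  return this._readObject;
};
Parser.prototype._readObject = function (token) {
  this._onTriple({ subject: this._subject, predicate: this._predicate,
                   object: token.value });
  return this._readPunctuation;
};
Parser.prototype._readPunctuation = function (token) {
  return this._readSubject; // a real parser also handles ';' and ','
};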
The callback mechanism ensures code is only executed when the parser can actually do something. If the stream temporarily pauses (due to disk or network latency), no code remains on the call stack; instead, the appropriate callback is executed when the next data chunk arrives.
Try it out
It took quite a few changes to upgrade the library from passing the informal Turtle specification to passing the current Candidate Recommendation. Eventually, test after test succeeded, and the library now passes all 287 tests, meaning full Turtle compatibility. Yeah!
You can install the library with npm or browse the source code on GitHub. Oh, and don’t forget to run the benchmarks…
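For a quick taste (assuming the npm module name n3; check the README for the exact API), parsing looks roughly like this:

// Rough usage sketch (module name and callback signature assumed;
// the README documents the actual API)
var n3 = require('n3');
new n3.Parser().parse(
  '@prefix c: <http://example.org/cartoons#>.\n' +
  'c:Tom a c:Cat.',
  function (error, triple) {
    if (triple)
      console.log(triple.subject, triple.predicate, triple.object);
  });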
DBpedia is currently parsed at a few hundred thousand triples per second!