
Ruben Verborgh

Bringing fast triples to Node.js with HDT

The compressed HDT triple format allows fast access to large datasets—now also in JavaScript.

Reading a selection of triples from a large dataset is an important use case for the Semantic Web. But files in textual formats such as Turtle become too slow as soon as they contain a few thousand triples, and triple stores are often too demanding, since they also need to support write operations. The HDT (Header, Dictionary, Triples) binary RDF format offers fast, read-only access to triples in large datasets. Until recently, this power was only available in Java and C++, so I decided it was high time to port it to Node.js as well ;-)

I first became interested in HDT when we started working on Linked Data Fragments. Our idea there is to make the Semantic Web scale by relying on simpler servers and on more intelligent clients. At the same time, we invest a lot in JavaScript, because it is the language of Web clients—and, increasingly, also of servers. Node.js of course plays a major role in this, by providing stable, event-driven servers in JavaScript. For fast triple pattern fragments servers, combining the strengths of JavaScript and HDT thus seemed a natural choice. Sadly, since no JavaScript library for HDT was available, we had to run our Java back-end server alongside our JavaScript server, which handles all non-HDT data sources as well as the front-facing Web interface.

There were two options to bring HDT to Node.js:

a pure JavaScript solution
This option would be the most flexible, since it could potentially run in any JavaScript environment, including browsers. However, it could be too slow, since JavaScript was not designed for binary data (though typed arrays are improving this). Most importantly, HDT files can contain hundreds of millions of triples and can be several gigabytes in size. Such files would not fit in the available memory of a browser tab; even Node.js has a memory limit of 1.7GB.
bindings to a native library
A Node.js-specific module, compiled to native code from C sources, can work around this memory limit. The HDT C++ source code uses memory-mapped files for fast and transparent access to large files on disk, assisted by the operating system. The only drawback is that this approach is non-portable across JavaScript engines.
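
For the native approach, memory-mapped files do the heavy lifting. As an illustration only—this is a generic sketch, not the actual HDT code, and the function name mapFile is made up—memory-mapping a large read-only file in C++ looks roughly like this:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Map a file read-only into virtual memory; the operating system pages the
// contents in and out on demand, so the whole file never needs to fit in RAM.
const char *mapFile(const char *path, size_t &size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;

  struct stat info;
  if (fstat(fd, &info) < 0) { close(fd); return nullptr; }
  size = static_cast<size_t>(info.st_size);

  void *data = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd); // the mapping remains valid after the descriptor is closed
  return data == MAP_FAILED ? nullptr : static_cast<const char *>(data);
}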

The fact that most of the C code was already written also slightly helped the decision ;-)

HDT in Node.js is the highway for fast access to triple-based datasets. ©Rakib Hasan Sumon

Implementing C bindings

Since my C experience was rather limited, I thought it would be an excellent challenge to tackle. The main HDT code was already written and tested; the only thing left was to bind it to Node. This meant:

  • finding a suitable way to expose HDT constructs in JavaScript
  • connecting that JavaScript code to its C counterparts

JavaScript code in Node.js runs in a single, event-driven thread. This means we should perform as little work as possible on that thread, which we can achieve by making the interface to HDT fully asynchronous.

This already starts with opening an HDT file:

// Possible synchronous interface
var hdtDocument = new HdtDocument('dataset.hdt');
hdtDocument.close();

// Actually implemented asynchronous interface
hdt.fromFile('dataset.hdt', function (error, hdtDocument) {
  hdtDocument.close();
});

In the synchronous case, the JavaScript thread would be blocked until the file and its index have been found, read, and loaded. In the asynchronous case, the JavaScript thread can freely continue while the hard work happens on a C thread; only when that work is done do we switch back to JavaScript through the callback.

Implementing this one asynchronous JavaScript function requires three C functions:

CreateAsync
runs in the JavaScript thread and stores the JavaScript function arguments.
Create
runs in a separate thread and performs the actual work.
CreateDone
runs again in the JavaScript thread and sends the result to the JavaScript code.
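
To make this concrete, here is a simplified sketch of how such a trio can be wired together with libuv's uv_queue_work. The names (OpenRequest and friends) are invented for illustration, and the marshalling of V8 handles and error reporting is omitted; the actual code in the repository differs:

#include <uv.h>
#include <string>

// Request object that travels between the three phases
struct OpenRequest {
  uv_work_t request;          // libuv bookkeeping; request.data points back here
  std::string filename;       // argument captured by CreateAsync
  void *document = nullptr;   // result produced by Create
  // ...a persistent handle to the JavaScript callback would also live here...
};

// Create: runs on a worker thread and may block on disk I/O
static void Create(uv_work_t *req) {
  OpenRequest *open = static_cast<OpenRequest *>(req->data);
  // ...open the HDT file named open->filename and load or build its index...
  open->document = nullptr;   // placeholder for the loaded document
}

// CreateDone: runs back on the JavaScript thread once the work has finished
static void CreateDone(uv_work_t *req, int /*status*/) {
  OpenRequest *open = static_cast<OpenRequest *>(req->data);
  // ...wrap open->document and invoke the stored JavaScript callback...
  delete open;
}

// CreateAsync: runs on the JavaScript thread; it only stores the arguments
// and queues the actual work on libuv's thread pool
static void CreateAsync(uv_loop_t *loop, const std::string &filename) {
  OpenRequest *open = new OpenRequest();
  open->filename = filename;
  open->request.data = open;
  uv_queue_work(loop, &open->request, Create, CreateDone);
}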

The interface to search for triples is also implemented asynchronously:

hdt.fromFile('dataset.hdt', function (error, hdtDocument) {
  hdtDocument.search(subject, predicate, object,
    function (error, triples) {
      console.log(triples);
      hdtDocument.close();
    });
});

This way, the JavaScript thread never has to wait while the HDT code fetches the triples.

Try it yourself

The HDT-Node library is available as an npm package.
Explore the source code on GitHub, where I’ve documented the current API.

Try it on some of the example HDT datasets and tell me what you think!
Have a little patience the first time you load a dataset, as an index needs to be generated. From the second time onwards, you’ll have blazingly fast access to millions of triples from within JavaScript.
