Shaping Linked Data apps

Decentralized apps will need a lot of flexibility, which they will gain through data shapes.

Ever since Ed Sheeran’s 2017 hit, I just can’t stop thinking about shapes. It’s more than the earworm though: 2017 is the year in which I got deeply involved with Solid, and also when the SHACL recommendation for shapes was published. The problem is a very fundamental one: Solid promises the separation of data and apps, so we can choose our apps independently of where we store our data. The apps you choose will likely be different from mine, yet we want to be able to interact with each other’s data. Building such decentralized Linked Data apps necessitates a high level of interoperability, where data written by one app needs to be picked up by another. Rather than relying on the heavier Semantic Web machinery of ontologies, I believe that shapes are the right way forward, without throwing the added value of links and semantics out of the window. In this post, I will expand on the thinking that emerged from working with Tim Berners-Lee on the Design Issue on Linked Data shapes, and sketch the vast potential of shapes for tackling crucial problems in flexible ways.

17 June 2019

Solid brings a new way of building apps: rather than storing data inside apps that request it, people store their data in their own personal data pod. This puts an end to the vendor lock-in of current apps, where people use a certain app not because they like it, but because their data or that of their friends or colleagues is on there. In essence, Solid aims to do for the Web what we have been doing on our desktops for ages: you create a JPEG image in one app, view it with another, and send it to your friends who pick their own apps. The difference is that Solid aims to do this not just with documents, but also for elementary data attributes inside of those documents. So applications that you give permission will be able to reuse data fields such as your name, list of friends, favorite songs, calendar appointments, etc.

In order to achieve such unprecedented granular interoperability, Solid uses Linked Data from the Semantic Web family of technologies. This focus on interoperability is what sets Solid apart from other projects. By using Linked Data, we ensure that:

Anyone can say anything about anything, by reusing existing URLs from the Web.
No central agreement on data models is required, because vocabularies can be linked afterwards.
Data can easily be remodeled later, because data and semantics are intertwined.

Unfortunately, there is a significant difference between theory and practice. Linked Data allows many degrees of freedom, so the mere fact that two apps use an RDF format does not guarantee interoperability in and of itself. In theory, data modeled with one vocabulary can be accessed seamlessly using another through mechanisms such as reasoning. In practice, reasoning is seldom available on the client or server, so data access patterns would need to match storage patterns exactly. But then we’d quickly lose two of the three above benefits of Linked Data, and might as well have agreed on a shared rigid JSON structure from the get-go.

Decentralized Linked Data applications as envisioned within the Solid ecosystem are too flexible to rely on hard-coded data access patterns. Instead, such apps should be coded against shapes, which can be published on the Web as Linked Data so other apps can reuse them. Whereas vocabularies provide a list of possible attributes, shapes mandate a specific structure for data, combining attributes from vocabularies in a certain way. In the short term, we expect apps to reuse common shapes for classes such as people, photos, songs, comments, etc. In the longer term, we should strive to reshape data on the fly, such that apps can work with data in different shapes.

[photograph of a cookie cutter on dough] — Shapes allow applications to see data through specific lenses, regardless of what the underlying structure of the data might be. ©2017 Marco Verch

However, existing shape languages such as SHACL or ShEx are only part of what we need. Apps let people edit or create data with a specific shape, so we need a portable way of doing so. Furthermore, different data pods might store data in different places, so we need to model rules about storage locations instead of leaving that up to individual apps. To address these needs, the Design Issue on this topic introduces three concepts:

Shapes: A shape defines the fields and structure that clients and apps can expect to find in a view over a piece of data. Technologies: SHACL or ShEx
Forms: A form is a part of a user interface that allows people to easily view, edit, and create data in a given shape. Technologies: UI ontology
Footprints: A footprint explains to an app where to store new data corresponding to a shape, and how it should be wired up within existing data. Technologies: footprints ontology

A shape can have associated forms so people can easily view and edit data, and footprints for determining how new data should be stored.

This blog post explains those three technologies in detail, and discusses where they might take us in the future.

Shapes for machine-readable and -writable data

The necessity of shapes for decentralized applications

Nearly all Web applications make assumptions about the underlying structure of data they retrieve from their backend. In many cases, those assumptions are unwarranted, strictly speaking: if a server only indicates that a document is JSON, then clients should not assume anything beyond a syntactically valid JSON document. In practice, Web apps rely on (often implicit) contracts that certain fields with certain structures will be there. This is probably acceptable when you are the only client relying on such assumptions, but quickly becomes problematic when multiple apps access the same data, as is the case with decentralized applications.

One way to achieve interoperability among multiple parties is by mandating specific data models through a centralized process. This route is followed by initiatives such as the Data Transfer Project and IndieWeb, and comes with clear advantages: once the data model has been decided upon, apps can be coded using its agreed assumptions. However, this presumes that prior central agreement is a viable option, and that data models do not evolve (or at least not in a backward-incompatible way). While these conditions might be met when the number of stakeholders is small, this becomes problematic in the presence of many parties, especially if they are inventing in parallel or disagreeing. In such one-size-fits-all environments, the ability to introduce new models or extend existing ones is limited, since either multiple parties will create different extensions for the same purpose, or another centralized iteration will be needed.

For instance, we could settle on the following structure:

{
  "fullName": "Ruben Verborgh",
  "email": "ruben@verborgh.org"
}

But what happens when we need to extend it with a birthday? The chance that we each will come up with compatible alternatives is very slim.

As argued above, Linked Data introduces much more flexibility, since data can have arbitrary shapes and extensions that can carry a universal meaning, without requiring centralized agreement. Some would call this unnecessary complexity, but we should not ignore that Solid deals with complicated real-world data and problems, wherein it is simply unfeasible to assume that any centralized authority could determine all data models that will ever be needed. So we put the power in individual developers’ hands to decide on the data models they need.

Hence, we need a uniform way of expressing the assumptions about a piece of data with regard to what fields and what structure apps can expect. These assumptions are captured by a data shape. Such shapes can be published decentrally as Linked Data at public locations, so developers can reuse, compose, adapt, or extend existing shapes when building apps that create and/or consume data.

Crucially, data can be stored or created in one shape but retrieved in another shape. So unlike some other technologies, shapes primarily matter dynamically during interactions and exchanges, instead of acting as permanent upfront contracts.

Different apps can interact with the same resources through different data shapes.

For instance, suppose a pod internally stores data as:

<#me> a schema:Person;
      schema:name "Ruben Verborgh"@en;
      foaf:mbox <mailto:ruben@verborgh.org>;
      dbo:birthDate "1987-02-28"^^xsd:date.

Then an app should be able to request that data as this shape:

<#me> vcard:fn "Ruben Verborgh"@en;
      vcard:hasEmail [
        vcard:value <mailto:ruben@verborgh.org>
      ];
      vcard:bday "1987"^^xsd:gYear.

Note how the structure and field names are different, while the information contained in the second is a subset of the first.
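The reshaping between the two snippets above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how a pod or any Solid library does it: triples are plain tuples, prefixed names are kept as strings, language tags are omitted, and the names STORED and reshape_to_vcard are hypothetical. A real implementation would ideally derive such a mapping from ontologies rather than hard-code it.

```python
# Triples are (subject, predicate, object) tuples; typed literals are
# (lexical value, datatype) pairs. All names here are illustrative.
STORED = [
    ("#me", "rdf:type", "schema:Person"),
    ("#me", "schema:name", "Ruben Verborgh"),
    ("#me", "foaf:mbox", "mailto:ruben@verborgh.org"),
    ("#me", "dbo:birthDate", ("1987-02-28", "xsd:date")),
]


def reshape_to_vcard(triples):
    """Return the triples of the vCard-based shape derived from stored data."""
    result = []
    for subject, predicate, obj in triples:
        if predicate == "schema:name":
            result.append((subject, "vcard:fn", obj))
        elif predicate == "foaf:mbox":
            # The target shape nests the address in an intermediate node.
            node = f"_:email{len(result)}"
            result.append((subject, "vcard:hasEmail", node))
            result.append((node, "vcard:value", obj))
        elif predicate == "dbo:birthDate":
            # The target shape only exposes the year of birth.
            lexical, _datatype = obj
            result.append((subject, "vcard:bday", (lexical[:4], "xsd:gYear")))
        # Anything else (such as rdf:type) is not part of the target shape.
    return result


vcard = reshape_to_vcard(STORED)
```

The hard-coded branches are exactly the point-to-point knowledge that shapes combined with semantics are meant to make unnecessary; the sketch only shows what the input and output of such a step look like.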

Existing shape technologies

Two main technologies for capturing data shapes exist: SHACL and ShEx. Both are discussed at length in a book called Validating RDF Data. Personally, I consider validation to be one of the least interesting uses for shapes; I see so much more potential in shapes as points of convergence between apps and decentralized data sources. I envision apps stating what shape they expect, and pods declaring what shapes they have or can make available.

The choice of SHACL versus ShEx is a topic in itself that I will not tackle here. Both have their merits: SHACL is W3C-standardized, and ShEx has many use cases and good tooling available. Fortunately, they share a large common functional subset wherein one can be translated into the other. The most important thing is ensuring that Linked Data apps are coded using shapes, whatever the format. So what should not happen is that a Linked Data app presents users with a hard-coded UI in which certain fields are assumed to be there. Such closed object-oriented assumptions do not match well with Linked Data’s flexibility. Instead, apps should be built by pointing to shapes. When the shape changes, the app should co-evolve to the extent possible. For example, when editing a user profile, a shape will tell what fields should be there (assisted by a form).

In that sense, shapes are views over a chunk of data, not unlike how decentralized apps act as views over data pods. Shapes are views with a specific structure that can be relied upon, abstracting away the flexibility of the RDF model at a moment when it is easier without, while still maintaining the semantics and links inside of the data. The latter is an important difference from rigid JSON trees or even XML data conforming to a strict XML Schema: the data remains more than only structure. It is still Linked Data that can be processed further. A shape is not an end point, but a connection point.

Future shape technologies

We will need ways to easily reshape data from one structure into another. While this might sound like XSLT transformations on XML documents, Linked Data has the important advantage that each single piece of data is intertwined with its semantics. With XSLT, we need transformation pathways from every possible source shape to every possible destination shape. Shapes instead only define those states at each end. When semantics are present, transformations for reshaping can be (at least partially) autogenerated through ontological knowledge and reasoning. Those semantics are also (at least partially) preserved after transforming, so data flows can be reshaped along the way as needed.

Clients should be able to request data in different shapes. For my own public Linked Data, I have done this by applying reasoning, so my data can be queried using multiple vocabularies and even combinations thereof, despite being authored using a much smaller number of vocabularies. However, this results in more data than strictly needed, and perhaps not all possible shapes are compatible with each other. Through profile-based content negotiation (on which I’m currently working), clients could indicate their preference for specific vocabularies, shapes, or even JSON-LD frames such that apps know exactly what structure to expect.

Finally, the traditional usage of shapes for validation also has its merits. For instance, servers could restrict write access to resources by requiring conformance to a shape. On a more granular level, a server could validate whether an individual patch to a resource conforms to a shape. For instance, current public chats typically allow append-only access, such that people can only add messages but not modify or delete old ones. However, appending still enables people to annotate old messages with conflicting information, or to post new messages with an old timestamp or with a different WebID as author. Rather than relying on honest apps, the authorizations document could specify that the author field must be the posting user’s WebID, and that the timestamp should be sufficiently close to the current time.
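As a rough sketch of the chat example, the two constraints could be checked on the server as follows. The message structure, the function name is_acceptable_message, and the five-minute tolerance are assumptions made for illustration; in practice such rules would be declared in a shape rather than written in code.

```python
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(minutes=5)  # assumed tolerance, not from any spec


def is_acceptable_message(message, posting_webid, now=None):
    """Check an appended chat message against two shape-like constraints.

    The "created" value must be a timezone-aware datetime.
    """
    now = now or datetime.now(timezone.utc)
    # Constraint 1: the author must be the WebID of the posting user.
    author_matches = message.get("author") == posting_webid
    # Constraint 2: the timestamp must be close to the current time.
    created = message.get("created")
    timestamp_is_recent = (
        isinstance(created, datetime)
        and abs(now - created) <= MAX_CLOCK_SKEW
    )
    return author_matches and timestamp_is_recent


# Hypothetical WebID and messages for illustration.
now = datetime(2019, 6, 17, 12, 0, tzinfo=timezone.utc)
alice = "https://alice.example/profile/card#me"
honest = {"author": alice, "created": now - timedelta(seconds=30)}
backdated = {"author": alice, "created": now - timedelta(days=365)}
spoofed = {"author": "https://bob.example/profile/card#me", "created": now}
```

Here, only the first message would be accepted when Alice is the posting user: the second carries an old timestamp, and the third claims a different author.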

Forms for viewing and editing by people

Clearly, shapes are for data exchanges between machines, but a lot of human–machine interactions will need to happen as well. A shape that is useful for an app is not necessarily the most user-friendly shape for viewing or editing. So while shapes are for machines, forms are for people. In practice, these terms are often used less strictly, and people (including myself) have written about hypermedia forms for automated clients. For clarity, “forms” in this article will always refer to “UI forms”.

It might not be obvious at first that there is a distinction between the UI and the underlying shape, but the difference becomes clearer when we compare some properties of both. First of all, shapes are inherently unordered: they constrain the structure of data, but not the processing order. Forms are ordered: they display information such that it is easy for a person to read or change. They can also be sectioned or otherwise organized in ways that facilitate human interaction, but which are not reflected in the underlying data. Forms suggest UI elements that make it easy to edit a given field: depending on the type of data, a text field, dropdown list, or autocompletion control might provide the best user experience.

Crucially, a one-to-many relationship exists between shapes and forms: the same shape can be manipulated through multiple forms. Not every view of a user profile needs to include all of its shape’s fields, and making a quick edit such as changing a profile picture shouldn’t involve a complex form. A form itself always corresponds to one shape, but users can be given a choice of forms to interact with data in a given shape.

People can use different forms to read or write data in a certain shape.

Like shapes, forms are composable: smaller subforms can serve as building blocks for larger ones. For instance, a small form for entering a postal address can subsequently be used for home addresses and business addresses alike. Forms, like shapes, can even be recursive: a person can have contacts, who can have contacts of their own. And, as is usual with Linked Data, it’s turtles all the way down: we can imagine a form to edit forms and/or shapes.

While shapes and forms are thus clearly distinct concepts, there nonetheless exists a strong relationship between them, which can be exploited for (partially) automated generation of one from the other. Starting from a shape, we could automatically generate a form from its structure, which then is further refined by a designer by adding a logical order and picking the right UI elements. Conversely, starting from a form, we could try to determine an appropriate RDF shape. Since no exact match exists, both cases will likely require human input.
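To make the first direction concrete, here is a minimal sketch of deriving a first-draft form from a shape. The shape is reduced to a hypothetical list of field descriptions (not actual SHACL or ShEx), and the widget names and the datatype-to-widget table are assumptions; a designer would still reorder and refine the result.

```python
# A shape reduced to a list of field descriptions (hypothetical structure).
PERSON_SHAPE = [
    {"path": "vcard:fn", "datatype": "xsd:string", "max_count": 1},
    {"path": "vcard:bday", "datatype": "xsd:date", "max_count": 1},
    {"path": "vcard:hasEmail", "datatype": "xsd:anyURI", "max_count": None},
]

# Assumed default widget per datatype; a designer would refine these.
DEFAULT_WIDGETS = {
    "xsd:string": "text",
    "xsd:date": "date-picker",
    "xsd:anyURI": "url",
}


def draft_form(shape):
    """Derive a first-draft form: one control per shape field."""
    return [
        {
            "path": field["path"],
            # Use the local name of the property as a placeholder label.
            "label": field["path"].split(":", 1)[1],
            "widget": DEFAULT_WIDGETS.get(field["datatype"], "text"),
            # Fields without a maximum of one value get a repeatable control.
            "repeatable": field["max_count"] != 1,
        }
        for field in shape
    ]


form = draft_form(PERSON_SHAPE)
```

The generated labels ("fn", "bday") illustrate why human input remains necessary: the shape knows the structure, but not what reads well to a person.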

Current technologies for representing forms include the UI ontology, which is a work in progress. Despite their terminology, initiatives such as RDF forms and the Hydra core vocabulary actually more closely resemble shapes than forms, as they target machine clients rather than people.

Footprints for creating and wiring up data

What you have read so far provides answers to the question of how to read and edit existing data. Indeed, if you are only reading or editing, then you do not need anything in addition to shapes and forms. However, (only) when creating new data, clients need something that tells them where to store what you create. This includes the parent folder of the new document, as well as the desired URL, but also other documents such as indexes that will need to point to the newly created document.

A footprint complements a specific shape with answers to questions such as:

In what document should new data for the shape be stored?
What updates to other documents are required?
What notifications should be sent to where?

As with forms, a one-to-many relationship exists between shapes and footprints: different footprints can be used for the same shape. The way you structure documents might be different from mine, even though our data might conform to the same shape. This re-emphasizes the importance of not using footprints for reading data: read operations solely depend on links between different parts of the data, regardless of the actual folder or document structure through which the data happens to be exposed. When writing data, footprints explain the desired structure.

New data corresponding to a specific shape can be written to different documents, depending on the footprint that is used.

A folder might mandate a certain footprint. If no footprint is present, the shape or form might point to a suggestion. For example, I might want to store my contacts as follows:

contacts/
    family/
      {firstname}_{lastname}.ttl
    friends/
    colleagues/
    professional/
        {company}_{lastname}.ttl
    lastnames.ttl
    events.ttl

In my address book, contacts are thus grouped based on their role, and URLs are based on other attributes. The document lastnames.ttl serves as a searchable index of all contacts based on their last name. The document events.ttl indexes my contacts based on event dates, such as their birthday or anniversary, such that they can easily be displayed in a calendar. Your contacts footprint might be simpler:

people/
    {lastname}_{firstname}.ttl

Yet our shapes for representing people can be exactly the same. In both cases, footprints are used only to create new data. Reading happens by navigating to the parent folder (contacts or people) and following links. In your folder, only one level of links needs to be followed; my folder requires two levels of link-following, unless indexes are considered. In no circumstance should the footprint be used for looking up contacts, since the footprints and used attributes might have changed since creation time.
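The two layouts above can be sketched as URL templates that are expanded only at creation time. This is an illustration, not the draft footprints ontology: the dictionaries, the function names, and the lowercasing performed by slug are all assumptions.

```python
import re

# Hypothetical footprints: a URL template per contact group.
MY_FOOTPRINT = {
    "family": "contacts/family/{firstname}_{lastname}.ttl",
    "professional": "contacts/professional/{company}_{lastname}.ttl",
}
# A footprint without groups uses None as its only key.
YOUR_FOOTPRINT = {None: "people/{lastname}_{firstname}.ttl"}


def slug(value):
    """Lowercase a value and replace runs of other characters by a dash."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def target_document(footprint, contact, group=None):
    """Expand the footprint's template with attributes of the new contact."""
    template = footprint.get(group, footprint.get(None))
    if template is None:
        raise KeyError(f"footprint has no template for group {group!r}")
    return template.format(**{key: slug(val) for key, val in contact.items()})


# A made-up contact whose data has the same shape in both pods.
contact = {"firstname": "Alice", "lastname": "Smith", "company": "Acme Corp"}
mine = target_document(MY_FOOTPRINT, contact, "professional")
yours = target_document(YOUR_FOOTPRINT, contact)
```

The same contact ends up at a different URL in each pod, which is why such a function should only ever run when writing: recomputing a URL later from changed attributes would point to the wrong document.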

We could argue that certain types of edits might also require footprints. For instance, if birthday information becomes available for a contact, or if a friend gets married, then events.ttl needs updating. More complicated changes involve a colleague becoming a friend or moving to a different company. Those could mean minting new URLs, or maybe not. (If that were to happen, a redirect and an owl:sameAs link would be necessary!) These mechanisms still require some careful thought though.

Footprints are still in an early stage of development. A draft footprints ontology exists, but still requires some more iterations. Mechanisms to declaratively describe transformations, such as perhaps the Function ontology, will be needed to capture how attributes of a shape translate into URLs.

Shaping the future

The triad of shapes, forms, and footprints will help developers contribute to an ecosystem of decentralized Linked Data apps, as they offer answers for dealing with the flexibility of the RDF data model. The permissionless innovation promised by the Solid ecosystem can only truly happen when app developers do not depend on slow centralized consensus processes, so the ability to handle that flexibility is crucial.

Beyond these three components, I see a lot of potential for shape-based thinking. Below, I’m suggesting some new directions about which I’m particularly curious, on how shapes can affect Linked Data apps and their underlying query mechanisms.

Shape-aware apps

An essential change for apps is that knowledge about data structure or storage should not be expressed in program code anymore. Current apps often contain JavaScript code that expects specific RDF properties, or writes new documents to specific locations. Just like well-behaved desktop apps will ask you where to store files instead of assuming, Linked Data apps shouldn’t make assumptions about where you want your documents to reside. There’s no such thing as the data pod layout that works for everyone. Instead, apps should be built around declarative shapes as provided by SHACL or ShEx, augmented with footprints. By doing so, you can safely make assumptions about RDF properties as described in the shape—at least presuming the data is available in the shape you want. The plan is that reshaping will lift this constraint, allowing for truly independent evolution of storage and apps without requiring central agreement. Initial incarnations of shape-based thinking are already present in client-side Solid toolkits built by Inrupt, such as their Solid React Components.

I’ve previously explained how building Linked Data apps using queries eliminates the dependency on concrete HTTP requests. With shapes, the same query-based architecture can make apps independent of the shapes provided by the server. For instance, the below three queries all semantically express the same idea, namely to obtain the names of my friends:

# Query 1: SPARQL query expressed with FOAF
SELECT * WHERE {
  ?person foaf:name "Ruben Verborgh"@en;
          foaf:knows ?friend.
  ?friend foaf:name ?name.
}

# Query 2: SPARQL query expressed with Schema.org
CONSTRUCT WHERE {
  ?person schema:name "Ruben Verborgh"@en;
          schema:knows ?friend.
  ?friend schema:name ?name.
}

# Query 3: GraphQL query (with an appropriate context)
{
    name(_: "Ruben Verborgh")
    friends {
      name
    }
}

In order to populate the result set, the query engine evaluating the query could either ask the server to provide the data in the needed shape, or the reshaping could be done on the fly by the client. The important thing is that the app itself doesn’t have to care: it can assume the presence of either foaf:name, schema:name, or friends.name, depending on the query form that was chosen. The relationship between queries and shapes is an interesting one, since queries indeed convey an expectation of what the data is shaped like. Reshaping allows us to unconditionally make that assumption.

Higher-level tools such as LDflex will also need to be shape-aware. In fact, as soon as data is more complex than a single triple relationship, shapes come into play. For example, social interactions such as likes are expressed by multiple triples, so unsurprisingly, the current LDflex code uses a preliminary kind of shapes for reading and writing. In a future version, such shapes should be expressed in SHACL or ShEx and perhaps even be external to the code. Wiring up a shape should be as easy as wiring up a single property using JSON-LD.

Shape-aware queries

Solid rightly grants a great amount of freedom to individual data pod owners to decide how they will structure documents in their pods. Currently, traversing pods and folders to list all photos requires following literally every link. This is because we have no clue about the structure of the data: I could store my photos in albums per year and location, you could store them per project or per person. So photo apps might need to follow year, location, project, or person links, in addition to anything that anyone else might use for organizing their photo library. Solid is thereby similar to a (virtual) hard drive: people can place their photos anywhere on their device or online storage. In such systems, it is typically the user’s responsibility to tell apps what file to open. However, since Solid works on a much more granular data level, it would quickly become a nuisance to always have to tell apps where exactly they can find what—especially given the promise that Solid allows for easy app switching.

If folders explicitly indicate their conformance to a certain shape, we can rely on that structure to traverse and query data more efficiently. Of course, we would also need the means to validate such conformance, which can be performed by existing SHACL or ShEx validators (although the data might be spread across different documents). Although footprints should not be used for reading directly, and in particular not for guessing the URL of documents, conformance to them might also be checked, as it can guarantee that certain indexes are present and up-to-date.

For instance, if I describe my photo folder’s shape as a structure of years and locations, then an app looking for photos will only have to follow year links from the root, and location links within those years. This saves a lot of time and bandwidth, since months, media, messages, contacts, and other folders and their descendants do not need to be checked. Similarly, in your pod, projects or people links could be the only ones that need following.

In addition to traversing, querying can also be simplified considerably. Solid folders and networks can be searched through link-traversal-based query evaluation, wherein a query engine executes a SPARQL query by following links. An issue is that such engines cannot know which links to follow, and hence have to follow every single link (within a scope) to exhaustively search for anything they can reach. Given knowledge about the shape of a pod and/or network, guided link-traversal-based query evaluation becomes possible, wherein query engines only follow relevant links that can lead to an answer.

Certain shapes might even allow for highly specific optimizations in some queries. Consider the following SPARQL query for photos from 2018:

SELECT ?photo ?date {
  ?photo a schema:Photograph;
         schema:dateCreated ?date.
  FILTER(YEAR(?date) = 2018)
}

If indeed photos in my pod are linked in folders per year, then the engine can decide not to follow links in the 2017/ or 2019/ folders. From a shape perspective, this requires a triple expressing that the folder for the year 2018 is indeed 2018/ (as opposed to 18/ or last_year/). If we want to rely on the knowledge that the full year is always used as the folder name, we need footprint conformance as well.
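The pruning described here can be sketched with a toy in-memory pod. Everything in this example is hypothetical (the POD dictionary, the URLs, and find_photos); a real engine would dereference documents over HTTP and read the year from a triple rather than from the folder name, unless footprint conformance guarantees the naming.

```python
# A toy pod: each folder URL maps to the URLs it links to.
POD = {
    "photos/": ["photos/2017/", "photos/2018/", "photos/2019/"],
    "photos/2017/": ["photos/2017/ghent.jpg"],
    "photos/2018/": ["photos/2018/paris.jpg", "photos/2018/rome.jpg"],
    "photos/2019/": ["photos/2019/boston.jpg"],
}


def find_photos(root, year=None):
    """Traverse folders from root; prune year folders that cannot match.

    Pruning assumes the folder conforms to a footprint in which the last
    path segment of a year folder is the full four-digit year.
    """
    photos, visited, queue = [], [], [root]
    while queue:
        url = queue.pop(0)
        if not url.endswith("/"):
            photos.append(url)  # a document rather than a folder
            continue
        segment = url.rstrip("/").rsplit("/", 1)[-1]
        if year is not None and segment.isdigit() and int(segment) != year:
            continue  # guided traversal: this folder cannot contain answers
        visited.append(url)
        queue.extend(POD.get(url, []))
    return photos, visited


guided_photos, guided_visits = find_photos("photos/", year=2018)
all_photos, all_visits = find_photos("photos/")
```

With the year constraint, only the root and the 2018/ folder are opened; without it, every folder has to be visited to be sure nothing is missed.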

This reaffirms how all query performance is strongly affected by the underlying storage structure, and traversal-based querying is no exception to that. For instance, the above query will be much slower when your folder is organized by project (unless the project shape has an associated date range). At the same time, finding photos for a specific project would be super fast. This opens possibilities for multiple linking structures, where data is connected in many more ways than just the typical hierarchical folder structure. A pod could have virtual folders for projects, or index documents for specific types, dates, or people—all maintained through footprints.

Shapes of trust

Since Linked Data is connected across the Web and thus across data pods, a question arises about the scope of queries. Do we want only data with respect to one pod, or do we want to follow external links, and if so, how many levels deep? This is not only a question of performance, but also about trust: are results found outside of a pod equally trustworthy? For instance, people have asked why the Solid LDflex expression data.user.friends.firstName will by default only give first names of friends stored in the user’s profile document. Often, those names are not there, only the WebIDs of those friends, so few or no results for names will be returned. We could easily extend the result set by following the WebIDs of those friends and reading their first name from their own profiles. This latter behavior occurs only when you explicitly iterate over data.user.friends (path query from the user’s WebID document) and then ask for friend.firstName (path query from the friend’s WebID document).

People can express trust boundaries in shapes such that apps and query engines receive boundaries of which sources to trust for what information. By default, this might just be the user’s pod. Others’ pods might be consulted for their personal information, such as name, location, birth date, but not for things such as preferences, annotations, etc. Such limits will affect both performance and trust positively, at the cost of perhaps missing some results that might still be trustworthy, but were not part of the shape. In any case, documenting the provenance of query results and their individual components remains important in decentralized networks.
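One way to picture such a boundary is as a filter over query results that carry their provenance. The policy table, the result structure, and the function below are assumptions for illustration only; how trust would actually be expressed in a shape is still an open question.

```python
# Hypothetical trust policy: which properties may come from which source.
TRUSTED_PROPERTIES = {
    "own_pod": None,  # None means: trust every property
    "subject_pod": {"name", "location", "birthDate"},
}


def trusted_results(results, user_pod):
    """Keep only results whose source is trusted for the given property."""
    kept = []
    for result in results:
        if result["source"].startswith(user_pod):
            allowed = TRUSTED_PROPERTIES["own_pod"]
        elif result["subject"].startswith(result["source"]):
            # A pod describing a person whose WebID lives in that same pod.
            allowed = TRUSTED_PROPERTIES["subject_pod"]
        else:
            allowed = set()  # any other source is not trusted at all
        if allowed is None or result["property"] in allowed:
            kept.append(result)
    return kept


# Made-up pods and results; "source" records where a value was found.
alice_pod, bob_pod = "https://alice.example/", "https://bob.example/"
bob = bob_pod + "profile/card#me"
results = [
    {"subject": bob, "property": "name", "value": "Bob", "source": bob_pod},
    {"subject": bob, "property": "preference", "value": "x", "source": bob_pod},
    {"subject": bob, "property": "nickname", "value": "Bobby", "source": alice_pod},
]
kept = trusted_results(results, user_pod=alice_pod)
```

Because every result keeps its source, an app could still show the dropped values separately, marked as coming from outside the trust boundary.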

Shape-based Web APIs

We so far have assumed the absence of more specific query interfaces, so we had to resort to traversal-based querying. While this technique will always work with Solid pods and Linked Data networks in general, certain types of queries will be slow.

Of course, more advanced query interfaces for Solid have been discussed, including a suggestion for a SPARQL-like interface, and a more lightweight Triple Pattern Fragments interface. From earlier research, we know that certain Web APIs can become prohibitively expensive, and that queries over federations can actually be cheaper with lighter interfaces. An important direction is thus to explore what optional Linked Data APIs should be exposed by Solid pods to optimize querying—in other words, what types of Linked Data Fragments are useful for pods to publish.

Up until now, Solid pods have always exposed generic Web APIs instead of the domain-specific APIs exposed by most other projects (such as APIs for people, photos, comments, etc.). Solid exposes Linked Data within RDF documents, which people and apps can structure as they see fit. With the advent of shapes and footprints, such structural decisions are made explicit, so apps and clients can find their way. However, from an optimization perspective, shapes could define custom, shape-specific Web APIs that are based on domain concepts.

For example, a shape for a person could give rise to an API where you can look up people based on last name. Or photos based on year or location. Those APIs could be driven by a physical document structure (as suggested for traversal), or by an internal query endpoint. Smart query engines such as Comunica would then determine that a certain SPARQL query directly translates to a domain-specific API request, or that it can be decomposed into multiple such requests. Such fragment-based query strategies have always been my long-term plan for Linked Data Fragments, and I believe Solid creates the right ecosystem for this.

However, I cannot emphasize enough that such optimizations are optional. Solid clients can only assume the Linked Data Platform interface and additional structures made explicit by shapes. By writing all data accesses as queries, apps can transparently rely on the query engine using more specific interfaces when present, yet continue working when they’re not—albeit with different performance characteristics.
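The fallback behavior can be summarized in a small sketch. The interface labels ("shape-api", "tpf", "ldp") and the function are hypothetical and do not correspond to the API of Comunica or any other engine; they only illustrate that the most specific advertised interface is preferred and that plain documents remain the baseline.

```python
def choose_interface(advertised, preferences=("shape-api", "tpf", "ldp")):
    """Pick the most specific interface a pod advertises, else plain LDP."""
    for name in preferences:
        if name in advertised:
            return name
    # Documents are the only interface a client may always assume.
    return "ldp"


# Three made-up pods advertising different sets of interfaces.
optimized = choose_interface({"ldp", "shape-api"})
lightweight = choose_interface({"ldp", "tpf"})
baseline = choose_interface(set())
```

The app issuing the query never sees this choice; it only notices that some pods answer faster than others.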

Shapes are everywhere

Since I started thinking about shapes in RDF, I’m seeing many opportunities for them in both old and new problems I’ve come across. True to the statement “When all you have is a hammer, every problem starts to look like a nail”, they’ve become a new lens on reality. With shapes, the question becomes: can we reshape the problem into a nail? Or more accurately, can we transform it into a shape-shaped problem?

One of their attractive properties is that not only programmers, but also power users can create and tweak shapes. We can even use shapes to validate shapes, and forms to edit shapes or forms.

One recent example is someone describing the problem of historians collecting data about specific domains during a certain era. Think lists of people, buildings, books, etc. Different historians enter these independently, and afterwards, they aim for that data to become connected. Today, many customized systems to support such processes exist, and tech-savvy people have created their own databases. The most advanced of such systems even support Linked Data. But the biggest challenge is how to make data entry easy. This is a problem screaming for shapes and forms, where Linked Data actually makes it easier to enter data if done right.

Shapes should be Linked Data’s answer to the developer-friendly demand for predictable structures, while at the same time keeping flexibility and connectedness. Shapes should not become the new XSLT, where every interoperability problem is solved by crafting yet another point-to-point solution. Instead, those points themselves are shapes, and conversions should happen nearly automatically, because of the semantics contained in Linked Data documents and the ontologies they point to. When used right, shapes amplify instead of reduce the power of Linked Data, by exposing predictability without pretending reality to be shaped that way.

Ruben Verborgh