The Web’s data triad
Querying decentralized data is a solved problem, but why do we prefer some solutions over others?
There’s no single optimal way to query permissioned data on the Web. Data publication and consumption processes are subject to multiple constraints, and improvements in one dimension often lead to compromises in others. So if we cannot successfully complete the quest for the perfect data interface, what is a meaningful way to make progress? Over the years, I’ve been ripening a conceptual framework to cast light upon such decisions and their interactions within an ecosystem. In this blog post, I describe the Policy–Interface–Query triad, or PIQ triad for short. It’s a tool to explore design trade-offs in specifications or abstractions for data interfaces within decentralized ecosystems such as Solid. Despite its simplicity, the triad has helped me frame many of these discussions, and I hope it can do the same for you.
In the beginning, there was knowledge. There were documents and there were people. The creation of the World Wide Web culminated in REST. It truly was a paradise, an open world where the highest goal was sharing our digital goods with the world.
With the Web’s origins coming from a spirit of exchange, it’s no surprise that the movement for data on the Web was driven by that same mindset. “Raw data now!” echoed across the Terrace Theater in 2009, when Tim Berners-Lee rallied the TED audience behind open data.
Throughout the years, we’ve slowly been realizing how ways that work well for document publication do not always generalize to data publication. Furthermore, we’re seeing how lessons learned from publishing open data do not necessarily transfer to private data. The presumed orthogonality of publication and authorization protocols comes under pressure as we learn that we cannot simply implement private data management by transparently inserting an access control layer on top of existing open data interfaces. And where writing documents on the Web already suffers from a standardization gap compared to reading them, the read/write story for data is even less mature.
This raises the main question of how to design Web-based data publication systems with desirable properties, acknowledging that the trade-offs involved require different answers in different use cases. Managing these trade-offs requires us to distinguish feasibility problems from optimization problems, by separating the non-negotiable needs from the improvable wants. For example, we typically cannot afford to leak some private data (non-negotiable) just to make querying 200% faster (improvable). We strive to get close to what we’d ideally want, without ever sacrificing what we absolutely need.
A secondary question is which lessons from open document publication can and can’t be transferred to access-controlled data publication. This question is itself an optimization problem: while there is no inherent requirement for data publication to adopt techniques from document publication, such partial reuse of familiar technologies can be cost-saving in terms of experience, infrastructure, and adoption. Yet such reuse must not be an upfront constraint, for risk of hindering us in our search to satisfy data publication needs as well as we possibly can.
Permissioned data publication: needs and wants
Quick definitions
Let’s separate the essentials from the (very-)nice-to-have. This blog post uses the definitions below to unambiguously describe our resulting needs and wants.
(Feel free to skip ahead to those if you prefer.)
- data
- Anything that can be serialized as bits and bytes is data. They might represent photos, recordings, texts, structured values…
- a fact
- A fact is the smallest unit of data that any actor might need to access independently. For a photo collection, this could be an individual photo. For structured data, we typically consider the level of individual facts appropriate: a single value in a spreadsheet or a single triple in an RDF document.
- an interface
- An interface is a specific instance of an automated communication channel through which a certain collection of data can be accessed at a predetermined granularity using a request–response paradigm via an agreed-upon protocol. For example, this interface exposes my data through the Triple Pattern Fragments protocol.
- a query
- A query is a declarative expression that uses an agreed-upon query language to describe constraints about data one aims to access. A query processor then uses interfaces to obtain data that matches the constraints of a given query.
- a policy
- A policy is a machine-processable description of how data should be treated, which can be attached to a selection of data. For example, a license is a policy for intellectual property such as music or source code. A usage policy stipulates rules actors must adhere to when processing certain data. Policy enforcement can happen at different points by different actors. I’ll loosely refer to data that is subject to policies as permissioned data (despite this not fully embodying all kinds of policies).
These concepts can be related to each other in various ways. For example, one way of characterizing interfaces is by listing what queries correspond to available requests. Similarly, the granularity at which you can attach policies to data within a given system could be expressed as a query. Some interfaces might refuse to respond when specific actors make certain requests, because they are enforcing access control policies attached to the underlying permissioned data.
List of non-negotiable needs
I consider the constraints below to be non-negotiable for any Web-based system in which permissioned data is published by servers and consumed by clients. Any solution either fulfills all of these in their entirety, or cannot be considered a valid solution at all:
- Servers must support attaching policies to individual data facts; policies may additionally be attachable at other levels of granularity.
- Servers must contain a policy resolution mechanism that consistently and unambiguously answers yes or no to the question of whether a given client is allowed to access a requested selection of data at a specific point in time.
- Servers must deny a client’s request unless the response only consists of data the client is permitted to read, and must allow the request if the client is provably permitted.
- Servers must allow each client to retrieve, via a finite number of requests, the complete subset of all data that this client is allowed to read.
The above needs are boolean in nature: they’re either on or off. For example, the first need cannot be partially satisfied with “some policies can be attached at the individual fact level”, just like the other three cannot be gradually met by software that does the right thing 99% of the time but behaves unpredictably in other cases.
Because needs are on/off switches, a system only qualifies as a solution when every single switch is flipped on.
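To make those switches concrete, here is a minimal sketch of a policy resolution mechanism satisfying the first three needs. Everything in it is an assumption for illustration; real systems might build on specifications such as ODRL, WAC, or ACP instead.

```typescript
// Hypothetical sketch of the needs above; not any concrete specification.
type Fact = { subject: string; predicate: string; object: string };

interface Policy {
  effect: 'allow' | 'deny';
  agents: string[];                  // e.g., WebIDs this policy applies to (assumption)
  matches: (fact: Fact) => boolean;  // need 1: attachable at individual fact granularity
  activeAt: (time: Date) => boolean; // policies may be time-dependent
}

// Need 2: an unambiguous yes/no answer for a client, a data selection, and a time.
// Need 3 follows: deny unless every requested fact is provably permitted.
function isAllowed(client: string, requested: Fact[], time: Date, policies: Policy[]): boolean {
  return requested.every(fact => {
    const applicable = policies.filter(p =>
      p.agents.includes(client) && p.matches(fact) && p.activeAt(time));
    return applicable.some(p => p.effect === 'allow')
        && !applicable.some(p => p.effect === 'deny');
  });
}
```

Note how the function never answers “maybe”: anything not provably permitted resolves to a denial, which is exactly the boolean behavior the needs demand.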
List of improvable wants
Separately from our needs, we also have wants: gradual qualities of which we’d typically like more rather than less. Some important wants include:
- Query processing time. We want queries to be fast. Ideally instantaneous, but that’s a theoretical optimum we’ll never reach. So it becomes a question of “how fast?”.
- Extent of decentralization. Across how many nodes should data be spread? And how many different nodes should a query processor be able to handle simultaneously?
- Server and client load. How much server load is each request allowed to cause on average? What is the acceptable client load for battery-hungry mobile clients?
- Policy complexity. How complex are the policies that guard the data? Does high flexibility hinder usability when it’s harder for people to edit their policies?
- Query complexity. What kinds of questions are clients allowed to ask? How far does the server’s obligation stretch in answering them? Which constraints should be left to clients—or not be possible selection criteria at all?
- Bandwidth and latency. How much data can the server send that might not directly change the query result? Can the server just dump all data the client is allowed to access (high bandwidth), or should it refine its responses (potentially higher latency)?
- Recency. Do we require the most up-to-date version of data, or is it sometimes acceptable—or preferable—to reuse results from a specific point in the past?
- Monetary budget. How much budget is available for hardware, software, and maintenance? Is there a budget for intermediate caches or aggregators as well?
Each of these wants is clearly on a scale. They’re sliders. We cannot possibly add “fast query” to our list of non-negotiable needs, because “fast” means different things in different contexts, and involves different efforts. Some sliders don’t even have a unanimously positive direction: with policy complexity, sometimes less is more.
Unlike the needs, the wants are in conflict with each other. Maximizing all of them is impossible, necessitating prioritization—and prioritization criteria differ across cases. While many might agree on the needs as ground rules for Web-based data systems, the wants are highly subjective. The realistic answer to the needs is turning all switches on, but no viable answer to the wants sets the sliders to everyone’s satisfaction.
So we must choose slider positions in a way that acknowledges their trade-offs, yet never at the expense of the needs. For example, it’s easy to have both fast queries and low server load if we sacrifice result completeness. Yet we agreed completeness cannot be compromised without irreparably changing our needs and thus the problem at hand. As such, query time and CPU cost unavoidably remain at odds, as do most other wants.
Re-solving solved problems
People sometimes inquire about my opinions on the Web data publication landscape. A recurring question is: how can we enable query within the Solid ecosystem, where every person stores their data in their own pod?
The “enable” bit highlights that many perceive it as a feasibility problem, as something not yet possible today, that we need to make possible soon. My answer is, and has been for years: query is a solved problem. In fact, there are many solutions.
You want to query across 20 pods? I got you!
You want to query across 20 million pods? We can do that too.
It’s easy. It’s a solved problem after all.
People think there’s a catch when I say silly things like that. That I’m obstinate or in denial, or just like to sound mysterious.
I’m actually just trying to pry more information out of you about what you really mean and want. Because it’s likely you’ll consider my proposed solutions unacceptable, even though all of them provably address the stated problem. Sometimes it’s easier to specify what you want when you can point at something you don’t. Shooting down the wrong thing helps us aim more precisely at what we’re actually looking for.
So let’s try it. Each of the solutions below addresses all of the needs. Shoot!
Solution 1: Traversal-based query processing
A common misconception is that “we can’t yet query a Solid pod”, so by extension, it seems unlikely that we’d already be able to query multiple pods. And yet we are. The solution to query? Well… just query. We’ve been doing it for years.
Because Solid servers are able to support the 4 needs, we can query pods using existing link traversal-based query processors. This example query shows that Comunica is able to retrieve all movies in any pod, without prior knowledge of where things are stored. Any query you can ask over any pod can already be answered today.
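For a flavor of what this looks like in practice, here’s a minimal sketch using Comunica’s link-traversal engine. The pod URL, the WebID entry point, and the vocabulary are hypothetical; the engine discovers documents by following links at query time.

```typescript
import { QueryEngine } from '@comunica/query-sparql-link-traversal';

const engine = new QueryEngine();

// Starting from a single entry point (a hypothetical WebID document),
// the engine follows links between documents while evaluating the query.
const bindingsStream = await engine.queryBindings(`
  SELECT ?movie WHERE {
    ?watchlist <https://schema.org/itemListElement> ?movie.
  }`, {
  sources: ['https://alice.pod.example/profile/card#me'],
});

bindingsStream.on('data', bindings => console.log(bindings.get('movie')?.value));
```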
This same query technique works on any number of pods. Therefore, given sufficient entry points into each pod, Comunica always provides valid and complete answers to any query over all data a client is allowed to read from those pods. We can massively speed up queries when pods advertise their structure upfront, so the engine can skip irrelevant data. Many more performance optimization avenues exist.
But maybe you feel like traversal is just too slow. Or maybe you don’t like its associated CPU and bandwidth cost for clients. That’s okay. I just showed you that it can be done. So you tell me that your use case prefers fast. Yes, we can do fast.
Solution 2: Querying an aggregator
Fast query across Solid pods? Coming right up.
This solution comes with a preparation step: first, obtain the data from each pod you aim to query. This combined data is ingested into a trusted aggregator that dedicates its processing power to any sufficiently authenticated client. This proven technique has been practiced for decades and scales to billions of facts. Simply query the aggregator and you’ll obtain valid and complete results.
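From the client’s perspective, this could look as simple as the sketch below, assuming a hypothetical aggregator that exposes a SPARQL endpoint over the ingested pod data and checks the client’s authentication.

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();

// The heavy lifting happened upfront: the aggregator already ingested the pods.
// The client sends one query, and the aggregator's endpoint does all the work.
const bindingsStream = await engine.queryBindings(`
  SELECT ?movie WHERE { ?watchlist <https://schema.org/itemListElement> ?movie. }`, {
  sources: [{ type: 'sparql', value: 'https://aggregator.example/sparql' }],
});
```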
But perhaps you’re concerned about all of us needing to trust a single aggregator. That’s okay, we could have multiple. Or perhaps you’re worried about the operating costs of such aggregators, or the extra bandwidth required to keep them up to date. In that case, let’s look at a third option.
Solution 3: Query interfaces at the data source
Another way to make query fast without resorting to an aggregator is to equip each data source with a powerful query endpoint. It’s fairly easy to do, really.
Let’s just install a SPARQL endpoint on every single pod. Unfortunately, existing SPARQL endpoints typically do not support all kinds of policies. That’s okay: we can choose to only expose this interface to power users of the pod, who already have permission to freely set policies anyway. Being the common case, they get to query fast. Less frequent users default to a slower experience, which impacts them less.
But perhaps, like me, you’re concerned about the CPU cost to the server. Whether that’s a price your use case is willing to pay, well, it depends. Also, the speed-up does not necessarily apply to cross-pod queries; to make those fast, you’re still better off with an aggregator, if that’s a price you’re willing to pay. It all depends.
Why we need the wants
The above solutions show why it’s important to keep our needs and wants separated. We’re all too often asking feasibility questions, when we should really be asking optimization questions. Query is a solved problem for decentralized data publication on the Web. But it’s likely to not be solved in a way that works for you. The problem space is relatively new, and we’re still discovering its multiple dimensions.
Asking relevant optimization questions impacts the systems we build. Such questions help us express our priorities, knowing that we can never fulfill all of our wants because they, unlike the needs, conflict with one another. Deciding on the appropriate optimization criteria means moving the sliders without touching the switches. Let’s use this renewed understanding of the problem space to explore its effects on the solution space.
By moving sliders in the problem space, we constrain the sliders of the solution space. Which are the solution space’s sliders, and where should they go to meet our wants?
Interface
The single interface to rule them all
Let’s continue our discovery along the lines of the Web’s history. In the beginning, there was the interface, also known as Web API. It uses HTTP to expose data documents, just like we had been exposing hypertext documents for years before that.
We could model that tiny slider space like this:
Conceptually, it’s quite simple. Our freedom in designing the interface is choosing what kinds of HTTP requests it offers, and to what selection granularity within the data those requests correspond. Some interfaces will offer one big dump with all of the data, while others will allow for much more precise selections.
As a result, there are essentially two kinds of cases when a client needs to query data: either the server does all of the work, or the client needs to perform some of the work.
The first case is when the interface appears to have “predicted” the client’s exact needs, such that one of its data documents happens to perfectly match the query answer. The client simply makes one HTTP request to this particular document, and we’re done. For example, if we’re querying for a source’s data about a specific person, then this document contains one server’s exact answer. That doesn’t mean we’re stuck relying on sheer coincidence: some interfaces let clients assist with such “predictions” by sending highly expressive requests, like this example. Regardless of specifics, the server still performs all of the work in both examples.
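A SPARQL endpoint is the extreme form of such expressive requests: the whole query travels inside a single HTTP request, so the response can be the exact answer. A minimal sketch, with hypothetical URLs:

```typescript
// The client encodes its full query in the request;
// the server does all the work of computing the exact answer.
const endpoint = 'https://example.org/sparql';
const query = 'SELECT ?p ?o WHERE { <https://example.org/people/alice#me> ?p ?o }';

const response = await fetch(`${endpoint}?query=${encodeURIComponent(query)}`, {
  headers: { accept: 'application/sparql-results+json' },
});
const results = await response.json();
```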
The other case is when the interface doesn’t offer an exact document match for the client’s actual query, so the client itself has to partially construct the result. Maybe it just needs to make one HTTP request and then simply filter out unnecessary data from the response. Or maybe it needs to make multiple requests, and combine the data of several documents to obtain the correct result, like with link traversal.
So far, the REST architectural style, which helped us make sense of open documents, is still working well: the server publishes data as documents via an interface, and the client can answer its query by accessing documents. Adding policies seems easy: they can be attached to individual documents, and the interface can regulate access control.
Multiple interfaces to rule the rest
Some of our assumptions do not hold at scale. What if multiple kinds of clients emerge, such that no single interface is the best option for all of them? As an extreme example, let’s assume two groups regularly query a 100,000-fact source: the first group only ever accesses all facts at once, whereas clients in the second group read one fact at a time.
We could aim to perfectly satisfy one of the groups and see if the other group can cope. Let’s please the data-hungry group via an interface with only one document containing all 100,000 facts in a tightly compressed format. They clearly find this highly efficient. This same granularity proves highly inefficient for the second group, who waste CPU time and bandwidth downloading a huge document and filtering it down to a single fact.
We can perfectly satisfy this second group of fact-checkers via an interface with one document per fact, so each client only uses one tiny HTTP request. Unfortunately, the data-hungry group now needs tens of thousands of HTTP requests to complete their job.
A compromise is an interface with some intermediate number of facts per document. For example, a third option is an interface with 100 documents of 1,000 facts each. Then the data-hungry clients need 100 HTTP requests to obtain all data, while the fact-checking clients download 1,000 facts to extract a single one. Neither group is served perfectly, but both can cope.
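The numbers behind this trade-off are easy to tabulate; here’s a small sketch using the figures from the example above (ignoring real-world factors such as headers, compression, and caching):

```typescript
const totalFacts = 100_000;

// For each candidate granularity: requests needed by a dump-everything client,
// and facts downloaded by a client that needs exactly one fact.
for (const factsPerDocument of [100_000, 1, 1_000]) {
  const documents = totalFacts / factsPerDocument;
  console.log({
    factsPerDocument,
    requestsForFullDump: documents,            // data-hungry group
    factsFetchedForOneFact: factsPerDocument,  // fact-checking group
  });
}
// { factsPerDocument: 100000, requestsForFullDump: 1,      factsFetchedForOneFact: 100000 }
// { factsPerDocument: 1,      requestsForFullDump: 100000, factsFetchedForOneFact: 1 }
// { factsPerDocument: 1000,   requestsForFullDump: 100,    factsFetchedForOneFact: 1000 }
```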
The server isn’t happy though, because it still needs to process many more requests than in the optimal case. And all of this so far assumes two groups of clients. What if the demand is more nuanced and unpredictable? Especially with domain-agnostic interfaces like with Solid, it’s impossible to anticipate all usage patterns.
It sounds like our previous single-interface model is too simple. We need to make room for exposing the same data through multiple interfaces:
With the previous model, our main choice was defining the data granularity of documents for a single interface. This updated model extends our choices as follows:
- What is the data granularity of each interface?
- Which combination of interfaces best serves actual clients?
- Which clients get access to which interfaces?
This last consideration reflects that some interfaces might incur a significantly higher server cost than others; as such, they might be reserved for privileged clients. I previously discussed the inherent need for multiple interfaces as a necessary way to harness high degrees of variation in client access patterns. So for the remainder of this blog post, let’s assume that whenever we’re talking about the interface, we’re always assuming that multiple interfaces might and will be in use. Clients access more than one interface anyway when data is decentralized across multiple servers.
The first dents in REST suddenly begin to appear: needing multiple interfaces breaks the implicit symmetry assumed in read/write interactions, since the documents clients read no longer necessarily coincide with the units they write.
Query
All interfaces answer queries
An important element is missing from the two simple models we’ve drafted so far. After all, the appropriateness of the interface heavily depends on the kinds of queries clients aim to perform. So how can we give query its rightful place in our model?
Intuitively, it might seem like some clients are just “fetching data”, whereas other clients need more specific “query results”. However, no such difference exists. All clients have data needs, and a query is simply a declarative way to express a data need. We strive to use sufficiently expressive query languages, so clients can capture their need in one or a few queries rather than many. Clients that translate a singular need into many queries reduce the options to fulfill that need more efficiently, as those clients are essentially taking on the role of a query engine (and usually an inefficient one).
Similarly, no distinction between “data interfaces” and “query interfaces” exists, because there aren’t any non-query interfaces. With any interface, the server always performs some kind of work, by virtue of sending over some data. That data always corresponds to some query, whether or not anyone ever wrote that query down explicitly.
So from a data perspective, there’s no such thing as a bad interface. The performance of any Web API is a function of how well an interface’s data layout happens to match that of a given query load. Seemingly “bad” interfaces, with “weird” distributions of data across documents, can prove extraordinarily useful if some of the frequent queries happen to exactly match one of those documents. Therefore, the usefulness of any interface always needs to be evaluated in the context of clients’ actual demand.
Let’s visualize this relative connection between interface and query in our model:
This is the point where many client-side developer experiences are currently stuck. It’s the implicit model used by resource-oriented interfaces, but equally by graph-oriented interfaces such as SPARQL and GraphQL. Their SDKs tend to provide abstractions for individual HTTP requests, thereby prematurely tying one interface’s granularity to the possible data needs of all clients. Several problems arise from such coupling, including the high costs of powerful interfaces for both open and permissioned data.
When the expressivity of an interface’s query language increases, we depart even further from the REST architectural style’s implicit dependency on symmetry, as very specific resources for reading do not always correspond to meaningful units for writing.
Query is an abstraction
This latest draft of our model still portrays an overly naive relationship between interfaces and queries, as if our goal were making interfaces correspond as closely as possible to specific client queries.
Our deeper goal is the exact opposite: the main purpose of query is abstraction. Writing application code using queries allows abstracting away concrete HTTP requests such that clients can fully benefit from the evolving presence of multiple interfaces. When a data need is expressed as hard-coded HTTP requests against one specific interface, we can’t leverage runtime efficiency gains from other interfaces that might be present. In contrast, capturing the same need as a declarative query allows for more efficient translations into HTTP requests based on what data and interfaces a server offers.
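As a sketch of that abstraction with Comunica (all URLs hypothetical), the very same query can be evaluated over radically different interfaces, with the engine deciding at runtime which HTTP requests each one warrants:

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();
const query = `SELECT ?name WHERE { ?person <http://xmlns.com/foaf/0.1/name> ?name. }`;

// One declarative data need, answered over very different interfaces;
// the application code stays identical across all of them.
for (const source of [
  'https://example.org/people.ttl',                        // a plain data document
  'https://fragments.example.org/people',                  // a Triple Pattern Fragments interface
  { type: 'sparql', value: 'https://example.org/sparql' }, // a SPARQL endpoint
]) {
  const bindingsStream = await engine.queryBindings(query, { sources: [source] });
  console.log(await bindingsStream.toArray());
}
```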
Moving forward, our model should depict the tension between interface and query, acknowledging that there exists a perpetual trade-off between the simplicity and sustainability of an interface versus the actual queries clients have:
The above model presents the core idea of the Linked Data Fragments axis, which encourages experimentation into what (combinations of) interfaces are best suited to handle certain diverse query loads. The inevitable tension implies that “the optimal interface” is not just an unsolved problem, but in fact an unsolvable one. So while I called query a solved problem with regard to our needs, the general problem is unsolvable with regard to our wants. Fortunately, it is solvable for specific wants, which confirms the importance of adequate prioritization for each use case.
Either way, this model makes the REST architectural style really start to crumble. Determining what constitutes a resource is not just a function of the data available to the server, but equally of client-side demand. Not only do we break read/write symmetry, but the very notion of a resource is no longer determined by the server alone.
Policy
Policy enforcement cannot rely on goodwill
So far, we’re still managing. While the interface/query tension forces difficult trade-offs, we can still find workable positions for the sliders.
Everything changes when we take policies into account. According to our needs, each fact within a data source can be subject to different policies. What if an interface groups data into documents in a way that boosts performance for a certain use case’s queries, but clashes with how some policy is grouping facts for permissioning? When a policy permits a client to read only some of the facts in one of the interface’s documents, denying access might prevent it from obtaining that data altogether. Even worse, allowing access exposes data it should never see. Either outcome endangers our needs.
In other words, the desired interface granularity from a query performance perspective is likely to conflict with the required granularity of applicable policies. Examples are not far-fetched: a natural interface for publishing contact data is one document per person, which represents many typical app queries. Yet this interface breaks the enforcement of a simple policy that prevents my contacts from seeing their colleagues’ birthdates.
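One way to make the conflict tangible: serving whole documents forces all-or-nothing decisions, whereas the policies operate per fact. A sketch reusing the hypothetical `Fact`, `Policy`, and `isAllowed` from earlier shows what honoring fact-level policies would require of an interface:

```typescript
// Instead of allowing or denying the whole contacts document,
// the server keeps only the facts this client is permitted to read,
// e.g., dropping colleagues' birthdates for my contacts.
function permittedView(document: Fact[], client: string, policies: Policy[], time: Date): Fact[] {
  return document.filter(fact => isAllowed(client, [fact], time, policies));
}
```

Such per-client filtering is one possible reconciliation, but it trades away exactly the cacheability and simplicity that made the document-per-person interface attractive in the first place.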
A naive way to deal with this constraint is asking clients to restrict their queries to what is permitted by the active policies:
This model is the equivalent of a “don’t steal my stuff” sign on a wide-open front door. To be fair, it does make sense in some cases. For instance, open data or source code often carry licenses, which an interface’s responses can describe. It could be okay for clients to merely access some intellectual property and then discard it if their intended usage wouldn’t meet the license’s conditions. When it comes to personal data rights, however, the server or client might already be committing legal or ethical violations merely through the possibility of accessing such data, rather than its actual usage. Relying on clients’ goodwill can be a viable strategy for some queries, but not with sensitive data.
Interface-controlled policies negate use case query needs
Many servers instead handle potential conflicts between interfaces and policies by resorting to fixed policy granularities that only cover what they consider to be common use cases. For example, current social media Web APIs don’t let you grant access to just the dates and locations of your posts without also disclosing your posts’ full contents.
The exclusive determination of policy granularity by the interface, rather than the queries required for a use case, is reflected in this model:
The benefit of the above model is that it simplifies server-side policy enforcement. By (fully or partially) enforcing policies at the earliest moment in the data flow, it provides easy protection against unauthorized use. However, it does so by excluding many access patterns, which is easier to imagine in domain-specific interfaces, but harder for domain-agnostic data interfaces such as the ones we’re aiming to build.
The Policy–Interface–Query triad
In general, our stated needs cannot be met by settling for fixed policy granularities, since one of our needs captures the ability to attach policies to any individual fact. Although such highly specific cases might seem exceptional, enabling them is a precondition to offering all possible kinds of data protection. After all, any data fact might be subject to multiple partially overlapping policies, and the only way to cover such situations is to allow for policy selections up to the most elementary level.
So clearly, policy is not a constraint we can apply directly to query, because then interfaces cannot play an enforcing role. Conversely, applying policy as a constraint to interfaces significantly reduces the available flexibility for query. Therefore, the final diagram models policy as a constraint on the relation between interface and query.
The resulting Policy–Interface–Query triad (PIQ triad) is the mental model I’ve been using for several years to reason about the sliders of the solution space. It shows how the tension between interfaces and queries is itself subject to a tension coming from policies:
In addition to revealing the three core sliders of the data publication solution space as Policy, Interface, and Query, the PIQ triad highlights that maximizing all three sliders simultaneously is impossible. A high demand for query flexibility conflicts with either policy complexity or interface performance. Similarly, a high demand for interface performance restricts query and/or policy flexibility.
Those constraints also mark the end of REST for designing permissioned data systems. The architectural style has outlived its relevance as a helpful framework for evaluating client–server interfaces within this realm, because it offers inadequate guidance for interfaces under the query and policy constraints at the heart of the PIQ triad. None of this constitutes a value judgement about REST itself, nor an endorsement of any other style, but rather an unmistakable demarcation of the boundaries of REST for data.
The PIQ triad in practice
Designing Web-based data publication systems
So how does the PIQ triad answer our main question on designing better data systems? And what challenges do we need to be tackling right now to build such systems?
From a feasibility perspective, cross-source query is a solved problem, to which multiple solutions already exist. From an optimization perspective, cross-source query is not one single solvable problem, because the optimization criteria vary between use cases. That means our objective should be to design adjustable systems that can span across various ranges of the solution space. By definition, no single system can span all of it. The resulting diversity is highly desirable, yet simultaneously emphasizes the importance of interoperability between diverse systems.
From a problem perspective, we must know the relevant wants and how they interact and conflict, while resisting the temptation to deviate from our needs. From a solution perspective, the PIQ model helps us understand at a high level where the wiggle room is. Furthermore, we should iterate between problem and solution, because tracing back from their consequences can help us sharpen and prioritize our wants.
Reusing from open document publication
As a secondary question, we were wondering which parts from open document publication systems can be carried over to access-controlled data publication.
More than ever, orthogonality stands out as a key attribute of technology specifications. This interoperability aspect enables selective reuse through careful mixing and matching of relevant technologies. It’s a precondition to discussing reuse of open document technology in the first place. Another precondition, which I believe requires more attention, is standardizing the scalable discovery of what interfaces and features are available to a given authenticated client accessing data from a specific server.
Interoperably designed protocols such as HTTP allow for request-based authorization that is independent from concrete identity and policy mechanisms. Therefore, many of their lessons still stand, and it’s no coincidence that emerging specifications such as the Solid Protocol heavily rely on both the principle of orthogonality as well as existing specifications that were designed with this orthogonality principle already in mind.
Of bigger concern is that the time-tested REST architectural style has undoubtedly lost its relevance as a conceptual framework to guide the design of permissioned data systems on top of the Web, while no alternative model has taken its place yet.
Reuse of REST- or HTTP-based specifications should hence be considered with caution, and with a careful eye on maintaining orthogonality. Under such circumstances, the transparent reuse of some orthogonal specifications, let’s say temporal versioning, carries the potential for compatibility. However, and definitely in the absence of an alternative model to REST, we should steer clear of premature assumptions about the portability of document technology to data technology. Recall that the orthogonality of interfaces and authorization is precisely one of the assumptions that permissioned data puts under pressure.
Solving unsolved problems
By proposing a simple way to think about the permissioned data publication landscape, this blog post aims not to push any particular solutions, but rather to provide a useful mental model for discussing all potential solutions. I hope those discussions will shift our questions from the answered feasibility angle (“How do we make query work?”) to the open-ended optimization angle: “How do you want query to work?”
While I criticized REST and interfaces, I’m purposely not writing about solutions here. Some of my other blog posts deliberately explore query, interfaces, and policies, which acquire additional colors through the prism of the PIQ triad that underpinned my thinking when I wrote them.
Triads for modeling trade-offs are not uncommon. I recently learned how in information security, the Confidentiality–Integrity–Availability triad explores the solution space of secure data systems. In a nutshell, it similarly reveals that privacy is a solved problem: simply pour all of your personal data into a block of concrete, which you then dump into the middle of the ocean. Perfect privacy. And that’s when we find out that people don’t just care about confidentiality, but that availability and integrity matter too.
I similarly claimed query to be a solved problem, yet the PIQ triad shows that many areas of the solution space still require consideration and exploration. That’s exactly why it’s so crucial to talk about the right problems: so we can find the right solutions.
By the way, we’ll also need the write solutions, as I only wrote about read. But I trust you’ll agree the read problem space leaves plenty of room for investigation.
Dedicated with great gratitude to my dear colleague Pieter Colpaert, who has helped shape my thinking about reading and writing on the Web. None of the thoughts herein would’ve been worth anything if left unchallenged by you.