
Ruben Verborgh

The Web’s data triad

Querying decentralized data is a solved problem, but why do we prefer some solutions over others?

There’s no single optimal way to query permissioned data on the Web. Data publication and consumption processes are subject to multiple constraints, and improvements in one dimension often lead to compromises in others. So if we cannot successfully complete the quest for the perfect data interface, what is a meaningful way to make progress? Over the years, I’ve been ripening a conceptual framework to cast light upon such decisions and their interactions within an ecosystem. In this blog post, I describe the Policy–Interface–Query triad, or PIQ triad for short. It’s a tool to explore design trade-offs in specifications or abstractions for data interfaces within decentralized ecosystems such as Solid. Despite—or thanks to—its simplicity, the PIQ triad aims to support us in our conversations to evolve decentralized data access technologies for the Web.

In the beginning, there was knowledge. There were documents and there were people. The creation of the World Wide Web culminated in REST. It truly was a paradise, an open world where the highest goal was sharing our digital goods—and sharing we did. Seduced by our Apple devices, we let the Cambridge Analytica serpent delude us and siphon away our data one byte at a time. Now separated from our precious assets, we got kicked out of our paradise to pay for their digital sins, as we were forced to descend into a post-truth world that would never be the same again.

With the Web’s origins coming from a spirit of exchange, it’s no surprise that the movement for data on the Web was driven by that same mindset. Raw data now! echoed across the Terrace Theater—and in many more rooms after—when Tim Berners-Lee took the stage to spread the idea of Linked Data to the masses. More specifically, his seminal talk evangelized Linked Open Data to the world.

Over the years, we have slowly been realizing that approaches that work well for document publication do not always generalize to data publication. Furthermore, we’re seeing how lessons learned from publishing open data do not necessarily transfer to private data. The presumed orthogonality of publication and authorization protocols comes under pressure as we learn that we cannot simply implement private data management by transparently inserting an access control layer on top of existing open data interfaces. And where writing documents on the Web already suffers from a standardization gap compared to reading them, the read/write gap for data is demonstrably much wider.

This raises the main question of how to design Web-based data publication systems with desirable properties, acknowledging that the trade-offs involved require different answers in different use cases. Managing these trade-offs requires us to distinguish feasibility problems from optimization problems, by separating the non-negotiable needs from the improvable wants. For example, we typically cannot afford to leak some private data (non-negotiable) to make querying 200% faster (improvable). We strive to get close to what we’d ideally want, without ever sacrificing what we absolutely need.

A secondary question is which lessons from open document publication can and can’t be transferred to access-controlled data publication. This question is itself an optimization problem: while there is no inherent requirement for data publication to adopt techniques from document publication, such partial reuse of familiar technologies can be cost-saving in terms of experience, infrastructure, and adoption. Yet such reuse must not be an upfront constraint, for risk of hindering us in our search to satisfy data publication needs—or worse, forcing us to prematurely compromise. In particular, we should investigate whether the REST architectural style can still provide similar guidance for our use cases as it has for open document systems.

[photograph of a wall reading “Jesus says ‘Come to me and I will give you rest’”]
All you need is REST?

Permissioned data publication: needs and wants

Quick definitions

Let’s separate the essentials from the (very-)nice-to-have. This blog post uses the definitions below to unambiguously describe our resulting needs and wants.
(Feel free to skip ahead to those if you prefer.)

data
Anything that can be serialized as bits and bytes is data. They might represent photos, recordings, texts, structured values…
a fact
A fact is the smallest unit of data that any actor might need to access independently. For a photo collection, this could be an individual photo. For structured data, we typically consider the level of individual facts appropriate: a single value in a spreadsheet or a single triple in an RDF document.
an interface
An interface is a specific instance of an automated communication channel through which a certain collection of data can be accessed at a predetermined granularity using a request–response paradigm via an agreed-upon protocol. For example, this interface exposes my data through the Triple Pattern Fragments protocol.
a query
A query is a declarative expression that uses an agreed-upon query language to describe constraints about data one aims to access. A query processor then uses interfaces to obtain data that matches the constraints of a given query.
a policy
A policy is a machine-processable description of how data should be treated, which can be attached to a selection of data. For example, a license is a policy for intellectual property such as music or source code. A usage policy stipulates rules actors must adhere to when processing certain data. Policy enforcement can happen at different points by different actors. I’ll loosely refer to data that is subject to policies as permissioned data (despite this not fully embodying all kinds of policies).

These concepts can be related to each other in various ways. For example, one way of characterizing interfaces is by listing what queries correspond to available requests. Similarly, the granularity at which you can attach policies to data within a given system could be expressed as a query. Some interfaces might refuse to respond when specific actors make certain requests, because they are enforcing access control policies attached to the underlying permissioned data.
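
To make these relationships a bit more tangible, here is a minimal, purely illustrative sketch in TypeScript (all names and URLs are made up, and the policy structure is hypothetical rather than taken from any existing policy language):

```typescript
// A fact: the smallest independently accessible unit of data (here, one RDF triple).
const fact = '<https://alice.example/profile#me> <http://schema.org/name> "Alice" .';

// An interface: a concrete request through which a collection of data can be accessed
// at a predetermined granularity (here, one document at a time over HTTP).
const request = new Request('https://alice.example/profile', {
  headers: { accept: 'text/turtle' },
});

// A query: a declarative description of a data need (here expressed in SPARQL).
const query =
  'SELECT ?name WHERE { <https://alice.example/profile#me> <http://schema.org/name> ?name }';

// A policy attached to a selection of data, where the selection is itself expressed as a query.
const policy = {
  target: 'SELECT ?s ?p ?o WHERE { <https://alice.example/profile#me> ?p ?o }',
  permission: { agentClass: 'https://alice.example/groups#friends', mode: 'read' },
};
```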

List of non-negotiable needs

I consider the constraints below to be non-negotiable for any Web-based system in which permissioned data is published by servers and consumed by clients. Any solution either fulfills all of these in their entirety, or cannot be considered a valid solution at all:

  1. Servers must support attaching policies to individual data facts; policies may additionally be attachable at other levels of granularity.

  2. Servers must contain a policy resolution mechanism that consistently and unambiguously answers yes or no to questions of whether a given client is allowed to access a requested selection of data at a specific point in time.

  3. Servers must deny a client’s request unless the response only consists of data it is permitted to read, and must allow the request if the client is provably permitted.

  4. Servers must allow each client to retrieve the complete subset of all data that client is allowed to read via a finite number of requests.

The above needs are boolean in nature: they’re either on or off. For example, the first need cannot be partially satisfied with “some policies can be attached at the individual fact level”, just like the other three cannot be gradually met by software that does the right thing 99% of the time but has unpredictable behaviors in other cases.

Because needs are on/off switches, we can only accept a given system as a solution to permissioned data publication if it keeps all switches on at all times. Furthermore, any system that has all switches on is by definition a valid solution to our problem—but just because it does what we need, doesn’t mean it’s necessarily exactly what we want.
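
As an aside, the second and third needs essentially describe a boolean decision procedure. Here is a minimal sketch of what such a policy resolution mechanism could look like (the names and types below are hypothetical; real mechanisms will differ):

```typescript
// Hypothetical types sketching a policy resolution mechanism (needs 2 and 3).
interface DataSelection {
  // The requested selection of data, expressed as a query (for instance in SPARQL).
  query: string;
}

interface PolicyResolver {
  // Must answer consistently and unambiguously: is this client allowed
  // to access this selection of data at this point in time?
  isAllowed(client: string, selection: DataSelection, at: Date): Promise<boolean>;
}

// A server only responds when the decision is a provable "yes";
// any other outcome results in denial of the request.
async function respond(resolver: PolicyResolver, client: string, selection: DataSelection) {
  const allowed = await resolver.isAllowed(client, selection, new Date());
  return allowed ? { status: 200 } : { status: 403 };
}
```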

List of improvable wants

Separately from our needs—on which we cannot compromise even if we wanted to—desirable properties exist that make some solutions more attractive for our use case. The key difference is that those properties are not switches, but sliders: they can be satisfied to varying extents, and we aim to express what level we find desirable or required to meet our case’s objectives. Even though we might not be willing to negotiate on those objectives (for example, we can’t afford hour-long queries for urgent medical decision support), the nature of their underlying parameters is that they are scalable (seconds to minutes to hours). Our strong dislike of some parts of their scale doesn’t negate their fundamental nature as sliders, even if our preference is so strong that we might choose to treat them as switches until they reach our targeted threshold. In contrast, none of the switches from the list of needs can ever be treated as sliders.

Some important wants include:

  1. Query processing time. We want queries to be fast. Ideally, instantaneous, but that’s a theoretical optimum we’ll never reach. So it becomes a question of: how fast?

  2. Extent of decentralization. Across how many nodes should data be spread? And how many different nodes should a query processor be able to handle simultaneously?

  3. Server and client load. How much server load is each request allowed to cause on average? What is the acceptable client load for battery-hungry mobile clients?

  4. Policy complexity. How complex are the policies that guard the data? Does high flexibility hinder usability when it’s harder for people to edit their policies?

  5. Query complexity. What kinds of questions are clients allowed to ask? How far does the server’s obligation stretch in answering them? Which constraints should be left to clients—or not be possible selection criteria at all?

  6. Bandwidth and latency. How much data can the server send that might not directly change the query result? Can the server just dump all data the client is allowed to access (high bandwidth), or should it refine its responses (potentially higher latency)?

  7. Recency. Do we require the most up-to-date version of data, or is it sometimes acceptable—or preferable—to reuse results from a specific point in the past?

  8. Monetary budget. How much budget is available for hardware, software, and maintenance? Is there a budget for intermediate caches or aggregators as well?

Each of these wants is clearly on a scale. They’re sliders. We cannot possibly add “fast query” to our list of non-negotiable needs, because “fast” means different things in different contexts, and involves different efforts. Some sliders don’t even have a unanimously positive direction: with policy complexity, sometimes less is more.

Unlike the needs, the wants are in conflict with each other. Maximizing all of them is impossible, necessitating prioritization—and prioritization criteria differ across cases. While many might agree on the needs as ground rules for Web-based data systems, the wants are highly subjective. The realistic answer to the needs is turning all switches on, but no viable answer to the wants sets the sliders to everyone’s satisfaction.

So we must choose slider positions in a way that acknowledges their trade-offs, yet never at the expense of the needs. For example, it’s easy to have both fast queries and low server load if we sacrifice result completeness. Yet we agreed completeness cannot be compromised without irreparably changing our needs and thus the problem at hand. As such, query time and CPU cost unavoidably remain at odds, as do most other wants.

Re-solving solved problems

[photograph of a kid looking confused at a blackboard filled with question marks]
Identifying the right problems is a prerequisite to building the right solutions. ©2014 iStock.com / baona

People sometimes inquire about my opinions on the Web data publication landscape. A recurring question is: how can we enable query within the Solid ecosystem, where every person stores their data in their own pod?

The enable bit highlights that many perceive it as a feasibility problem, as something not yet possible today, that we need to make possible soon. My answer is, and has been for years: query is a solved problem. In fact, there are many solutions.

You want to query across 20 pods? I got you!
You want to query across 20 million pods? We can do that too.
It’s easy. It’s a solved problem after all.

People think there’s a catch when I say silly things like that. That I’m obstinate or in denial, or just like to sound mysterious.

I’m actually just trying to pry more information out of you about what you really mean and want. Because it’s likely you’ll consider my proposed solutions unacceptable, although all of them provably address the stated problem. Sometimes it’s easier to specify what you want when you can point at something you don’t want. Shooting at the wrong thing helps us more precisely hit the opposite we’re actually looking for.

So let’s try it. Each of the solutions below addresses all of the needs. Shoot!

Solution 1: Traversal-based query processing

A common misconception is that we can’t yet query a Solid pod, so by extension, it seems unlikely that we’d already be able to query multiple pods. And yet we are. The solution to query? Well… just query. We’ve been doing it for years.

Because Solid servers are able to support the four needs, we can query pods using existing link traversal-based query processors. This example query shows that Comunica is able to retrieve all movies in any pod, without prior knowledge of where things are stored. Any query you can ask over any pod can already be answered today.

This same query technique works on any number of pods. Therefore, given sufficient entry points into each pod, Comunica always provides valid and complete answers to any query over all data a client is allowed to read from those pods. We can massively speed up queries when pods advertise their structure upfront, so the engine can skip irrelevant data. Many more performance optimization avenues exist.
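
As a concrete illustration, here is a hedged sketch of such a traversal-based query using Comunica’s link-traversal engine (the pod entry point and query are made up, and the exact package name and context options may differ between versions):

```typescript
import { QueryEngine } from '@comunica/query-sparql-link-traversal';

const engine = new QueryEngine();

// Start from an entry point into the pod; the engine discovers and follows links from there.
const bindingsStream = await engine.queryBindings(`
  PREFIX schema: <http://schema.org/>
  SELECT ?movie ?title WHERE {
    ?movie a schema:Movie;
           schema:name ?title.
  }`, {
  sources: ['https://alice.example/profile/card#me'], // hypothetical pod entry point
  lenient: true, // keep traversing even if some documents fail to load
});

bindingsStream.on('data', (bindings) => console.log(bindings.get('title')?.value));
```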

But maybe you feel like traversal is just too slow. Or maybe you don’t like its associated CPU and bandwidth cost for clients. That’s okay. I just showed you that it can be done. So you tell me that your use case prefers fast. Yes, we can do fast.

Solution 2: Querying an aggregator

Fast query across Solid pods—or any multi-source data landscape—is easy, because it’s a solved problem. We’re going to use an aggregator.

This solution comes with a preparation step: first, obtain the data from each pod you aim to query. This combined data is ingested into a trusted aggregator that dedicates its processing power to any sufficiently authenticated client. This proven technique has been practiced for decades and scales to billions of facts. Simply query the aggregator and you’ll obtain valid and complete results.
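
To give an idea of that last step, here is a hedged sketch assuming the aggregator exposes a standard SPARQL endpoint over HTTP and accepts some form of bearer-token authentication (the endpoint URL and token are hypothetical):

```typescript
// Query a trusted aggregator that has ingested data from many pods.
const endpoint = 'https://aggregator.example/sparql'; // hypothetical aggregator endpoint
const query = `
  PREFIX schema: <http://schema.org/>
  SELECT ?person ?name WHERE {
    ?person a schema:Person;
            schema:name ?name.
  } LIMIT 100`;

const response = await fetch(endpoint, {
  method: 'POST',
  headers: {
    'content-type': 'application/sparql-query',
    accept: 'application/sparql-results+json',
    authorization: 'Bearer <token>', // the aggregator only serves sufficiently authenticated clients
  },
  body: query,
});
const results = await response.json();
console.log(results.results.bindings);
```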

But perhaps you’re concerned about all of us needing to trust a single aggregator. That’s okay, we could have multiple. Or perhaps you’re worrying about the operating costs of such aggregators, or the extra bandwidth required to keep them up to date. And how crucial is recency for your use case?

Solution 3: Query interfaces at the data source

Another way to make query fast without resorting to an aggregator is to equip each data source with a powerful query endpoint. It’s fairly easy to do, really.

Let’s just install a SPARQL endpoint on every single pod. Unfortunately, existing SPARQL endpoints typically do not have support for all kinds of policies. That’s okay, we can choose to only expose this interface to power users of the pod, who already have permission to freely set policies anyway. As the common case, they get to query fast. More infrequent users default to a slower experience, which impacts them less.
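
Such per-pod endpoints can also be combined for cross-pod questions by a client-side engine that decomposes a query over the individual endpoints. A sketch under those assumptions (the pod endpoint URLs are made up):

```typescript
import { QueryEngine } from '@comunica/query-sparql';

const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(`
  PREFIX schema: <http://schema.org/>
  SELECT DISTINCT ?title WHERE {
    ?movie a schema:Movie;
           schema:name ?title.
  }`, {
  // Each pod exposes its own SPARQL endpoint (hypothetical URLs);
  // the engine pushes subqueries down to every endpoint and joins the results.
  sources: [
    { type: 'sparql', value: 'https://alice.example/sparql' },
    { type: 'sparql', value: 'https://bob.example/sparql' },
  ],
});
bindingsStream.on('data', (bindings) => console.log(bindings.get('title')?.value));
```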

But perhaps, like me, you’re concerned about the CPU cost to the server. It all depends on whether this is a price your use case is willing to pay. Also, the speed-up may not necessarily apply to cross-pod queries. To make those fast, you’re better off with an aggregator, if that’s a price you’re willing to pay. It all depends.

Why we need the wants

The above solutions show why it’s important to keep our needs and wants separated. We’re all too often asking feasibility questions, when we should really be asking optimization questions. Query is a solved problem for decentralized data publication on the Web. But it’s likely to not be solved in a way that works for you. The problem space is relatively new, and we’re still discovering its multiple dimensions.

Asking relevant optimization questions impacts the systems we build. Such questions help us express our priorities, knowing that we can never have all of our wants because they, unlike the needs, conflict with one another. Deciding on the appropriate optimization criteria means moving the sliders without touching the switches. Let’s use this renewed understanding of the problem space to explore its effects on the solution space.

By moving sliders in the problem space, we constrain the sliders of the solution space. Which are the solution space’s sliders, and where should they go to meet our wants?

Interface

The single interface to rule them all

Let’s continue our discovery along the lines of the Web’s history. In the beginning, there was the interface, also known as Web API. It uses HTTP to expose data documents, just like we had been exposing hypertext documents for years before that.

We could model that tiny slider space like this:

[a diagram with a single node labeled Interface]

Data is partitioned as a set of documents through a single interface.
Our primary design space involves determining the granularity of all such documents.

Conceptually, it’s quite simple. Our freedom in designing the interface is choosing what kinds of HTTP requests it offers, and to what selection granularity within the data those requests correspond. Some interfaces will offer one big dump with all of the data, while others will allow for much more precise selections.

As a result, there are essentially two kinds of cases when a client needs to query data: either the server does all of the work, or the client needs to perform some of the work.

The first case is when the interface appears to have “predicted” the client’s exact needs, such that one of its data documents happens to perfectly match the query answer. The client simply makes one HTTP request to this particular document, and we’re done. For example, if we’re querying for a source’s data about a specific person, then this document contains that server’s exact answer. That doesn’t mean we’re stuck with relying on sheer coincidence. Some interfaces allow clients to offer varying kinds of assistance with such “predictions” via highly expressive requests, like this example. Regardless of specifics, the server still performs all of the work in both examples.

The other case is when the interface doesn’t offer an exact document match for the client’s actual query, so the client itself has to partially construct the result. Maybe it just needs to make one HTTP request and then simply filter out unnecessary data from the response. Or maybe it needs to make multiple requests, and combine the data of several documents to obtain the correct result, like with link traversal.
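
To make the second case concrete, here is a hedged sketch of a client that fetches one document and locally filters out the data its query doesn’t need (the document URL is made up, and a real client would typically delegate this to a query engine rather than filtering by hand):

```typescript
import { Parser } from 'n3';

// Fetch one data document from the interface...
const response = await fetch('https://alice.example/contacts', {
  headers: { accept: 'text/turtle' },
});
const quads = new Parser().parse(await response.text());

// ...then perform the remaining work client-side: keep only the facts the query asks for.
const emailFacts = quads.filter(
  (quad) => quad.predicate.value === 'http://schema.org/email',
);
console.log(emailFacts.map((quad) => `${quad.subject.value} ${quad.object.value}`));
```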

So far, the REST architectural style, which helped us make sense of open documents, is still working well: the server publishes data as documents via an interface, and the client can answer its query by accessing documents. Adding policies seems easy: they can be attached to individual documents, and the interface can regulate access control.

Multiple interfaces to rule the rest

Some of our assumptions do not hold at scale. What if multiple kinds of clients emerge, such that no single interface is the best option for all of them? As an extreme example, let’s assume two groups regularly query a 100,000-fact source: the first group only ever accesses all facts at once, whereas clients in the second group read one fact at a time.

We could aim to perfectly satisfy one of the groups and see if the other group can cope. Let’s please the data-hungry group via an interface with only one document containing all 100,000 facts in a tightly compressed format. They clearly find this highly efficient. This same granularity proves highly inefficient for the second group, who waste CPU time and bandwidth downloading a huge document and filtering it down to a single fact.

We can perfectly satisfy this second group of fact-checkers via an interface with one document per fact, so each client only uses one tiny HTTP request. Unfortunately, the data-hungry group now needs tens of thousands of HTTP requests to complete their job.

A compromise is an interface with some intermediate number of facts per document. For example, a third option is an interface with 100 documents of 1,000 facts each. Then the data-hungry clients need 100 HTTP requests to obtain all data—quite a step down from 100,000 with their worst option. The fact-checking clients only need to download 999 unnecessary facts—down from 99,999 in their worst option. Neither group of clients is perfectly happy, but still way happier than they’d be in their nightmare case.
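
The arithmetic behind those numbers is easy to replay for any document size, which makes the trade-off between the two groups explicit:

```typescript
// Trade-off between the data-hungry group (wants all facts)
// and the fact-checking group (wants a single fact at a time).
const totalFacts = 100_000;

for (const factsPerDocument of [1, 1_000, 100_000]) {
  const requestsForAllData = Math.ceil(totalFacts / factsPerDocument);
  const unnecessaryFactsPerLookup = factsPerDocument - 1;
  console.log(
    `${factsPerDocument} facts per document: ` +
    `${requestsForAllData} requests for the data-hungry group, ` +
    `${unnecessaryFactsPerLookup} unnecessary facts per lookup for the fact-checkers`,
  );
}
```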

The server isn’t happy though, because it still needs to process many more requests than in the optimal case. And all of this so far assumes two groups of clients. What if the demand is more nuanced and unpredictable? Especially with domain-agnostic interfaces like with Solid, it’s impossible to anticipate all usage patterns.

It sounds like our previous single-interface model is too simple. We need to make room for exposing the same data through multiple interfaces:

[a diagram with multiple cascading nodes labeled Interfaces]

Different interfaces allow partitioning the same data in multiple ways, such that clients can pick the interface that best matches their current access patterns. Our design space now additionally involves maintaining appropriate interface variation.

With the previous model, our main choice was defining the data granularity of documents for a single interface. This updated model extends our choices as follows:

  1. What is the data granularity of each interface?
  2. Which combination of interfaces best serves actual clients?
  3. Which clients get access to which interfaces?

This last consideration reflects that some interfaces might incur a significantly higher server cost than others; as such, they might be reserved for privileged clients. I previously discussed the inherent need for multiple interfaces as a necessary way to harness high degrees of variation in client access patterns. So for the remainder of this blog post, whenever we’re talking about the interface, let’s assume that multiple interfaces might, and typically will, be in use. Clients access more than one interface anyway when data is decentralized across multiple servers.

The first dents in REST suddenly begin to appear: needing multiple interfaces breaks the implicit symmetry assumed in read/write manipulation of resources. If the same fact appears in multiple documents, writing no longer has a clear default location—and even if it had, cache invalidation partly depends on read/write operation symmetry.

Query

All interfaces answer queries

An important element is missing from the two simple models we’ve drafted so far. After all, the appropriateness of the interface heavily depends on the kinds of queries clients aim to perform. So how can we give query its rightful place in our model?

Intuitively, it might seem like some clients are just fetching data, whereas other clients need more specific query results. However, no such difference exists. All clients have data needs, and a query is simply a declarative way to express a data need. We strive to use sufficiently expressive query languages, so clients can capture their need in one or few queries rather than many. Clients that translate a singular need into many queries reduce the options to fulfill their need more efficiently, as those clients are essentially taking on the role of a query engine (and usually inefficiently so).
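
For example, a single expressive query can capture a need that would otherwise require the client to issue and join many smaller requests itself (the vocabulary and URLs are illustrative):

```typescript
// One declarative query captures the whole need;
// a query engine remains free to decide how to fulfill it.
const dataNeed = `
  PREFIX schema: <http://schema.org/>
  SELECT ?friend ?email WHERE {
    <https://alice.example/profile#me> schema:knows ?friend.
    ?friend schema:email ?email.
  }`;

// A client that instead first lists all friends and then requests each friend's
// email address separately has hard-coded one particular execution plan,
// taking on the role of a query engine (and usually an inefficient one).
```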

Similarly, no distinction between data interfaces and query interfaces exists, because there aren’t any non-query interfaces. With any interface, the server always performs some kind of work, by virtue of sending over some data. That data always corresponds to some query—but as we’ve discussed above, the interface’s kind of query may or may not correspond well to a client’s data needs.

So from a data perspective, there’s no such thing as a bad interface. The performance of any Web API is a function of how well an interface’s data layout happens to match that of a given query load. Seemingly “bad” interfaces, with “weird” distributions of data across documents, can prove extraordinarily useful if some of the frequent queries happen to exactly match one of those documents. Therefore, the usefulness of any interface always needs to be evaluated in the context of clients’ actual demand.

Let’s visualize this relative connection between interface and query in our model:

[a diagram with two intertwined nodes labeled Interface & Query]

Each interface supports certain kinds of query. The better those match actual demand, the more efficiently client data needs can be fulfilled.

This is the point where many client-side developer experiences are currently stuck. It’s the implicit model used by resource-oriented interfaces, but equally by graph-oriented interfaces such as SPARQL and GraphQL. Their SDKs tend to provide abstractions for individual HTTP requests, thereby prematurely tying one interface’s granularity to the possible data needs of all clients. Several problems arise from such coupling, including the high costs of powerful interfaces for both open and permissioned data.

When the expressivity of an interface’s query language increases, we depart even further from the REST architectural style’s implicit dependency on symmetry, as very specific resources for reading do not always correspond to meaningful units for writing.

Query is an abstraction

This latest draft of our model portrays an overly naive relationship between interfaces and queries. Our goal isn’t making interfaces correspond as closely as possible to specific client queries, but—especially from the multi-interface perspective—finding efficient ways to satisfy all clients’ needs, optimizing for their wants.

Our deeper goal is the exact opposite: the main purpose of query is abstraction. Writing application code using queries allows abstracting away concrete HTTP requests such that clients can fully benefit from the evolving presence of multiple interfaces. When a data need is expressed as hard-coded HTTP requests against one specific interface, we can’t leverage runtime efficiency gains from other interfaces that might be present. In contrast, capturing the same need as a declarative query allows for more efficient translations into HTTP requests based on what data and interfaces a server offers.
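
Put differently, the contrast is between welding a data need to one specific interface and handing that need to an engine that can exploit whichever interfaces happen to be available (a sketch; the URLs are made up and the engine choice is merely illustrative):

```typescript
import { QueryEngine } from '@comunica/query-sparql';

// Hard-coded variant: the data need is tied to one document of one specific interface.
const hardCoded = await fetch('https://alice.example/contacts/page3.ttl'); // response handling omitted

// Declarative variant: the same need as a query, which an engine can translate
// into whatever requests are most efficient for the interfaces it discovers at runtime.
const engine = new QueryEngine();
const bindingsStream = await engine.queryBindings(`
  PREFIX schema: <http://schema.org/>
  SELECT ?name WHERE { ?contact schema:name ?name }`, {
  sources: ['https://alice.example/contacts/'],
});
bindingsStream.on('data', (bindings) => console.log(bindings.get('name')?.value));
```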

Moving forward, our model should depict the tension between interface and query, acknowledging that there exists a perpetual trade-off between the simplicity and sustainability of an interface versus the actual queries clients have:

[a diagram of a tension field between two nodes labeled Interface and Query, connected by a bidirectional arrow]

In any decentralized data landscape, server-side interfaces and client-side data demand affect each other. Changes on either end inevitably cause changes for the opposite end.

The above model presents the core idea of the Linked Data Fragments axis, which encourages experimentation into what (combinations of) interfaces are best suited to handle certain diverse query loads. The inevitable tension doesn’t only imply that the optimal interface is an unsolved problem, but in fact an unsolvable problem. So while I called query a solved problem with regard to our needs, the general problem is unsolvable with regard to our wants. Fortunately, it is solvable for specific wants, which confirms the importance of adequate prioritization for each use case.

Either way, this model makes the REST architectural style really start to crumble. Determining what constitutes a resource is not just a function of the data available to the server, but equally of client-side demand. Not only do we break read/write symmetry, we invalidate REST’s key abstraction of resources as the top-level concept shared by servers and clients, instead suggesting query as a main building block for the publication and consumption of data on the Web. This starkly contrasts with the more naive approach of the SPARQL and GraphQL protocols, whose clients cannot leverage multiple interfaces because they equate interface with query (like our earlier model).

Policy

Policy enforcement cannot rely on goodwill

So far, we’re still managing. While the interface/query tension feels inconvenient, acceptable solutions exist as generic workarounds—definitely for open data.

Everything changes when we take policies into account. According to our needs, each fact within a data source can be subject to different policies. What if an interface groups data into documents in a way that boosts performance for a certain use case’s queries, but clashes with how some policy is grouping facts for permissioning? When a policy permits a client to read only some of the facts in one of the interface’s documents, denying access might prevent it from obtaining that data altogether. Even worse, allowing access exposes data it should never see. Either outcome endangers our needs.

In other words, the desired interface granularity from a query performance perspective is likely to conflict with the required granularity of applicable policies. Examples are not far-fetched: a natural interface for publishing contact data is one document per person, which represents many typical app queries. Yet this interface breaks the enforcement of a simple policy that prevents my contacts from seeing their colleagues’ birthdates.
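
That mismatch can be sketched as follows, with the policy’s selection of facts expressed as a query (the structure and vocabulary are made up for illustration, not taken from an existing policy language):

```typescript
// The interface groups contact data into one document per person,
// which matches many typical application queries...
const personDocument = 'https://alice.example/contacts/carol'; // hypothetical

// ...but the policy is scoped to individual birthdate facts across all contacts,
// a selection that cuts straight through those per-person documents.
const birthdatePolicy = {
  target: 'SELECT ?s ?o WHERE { ?s <http://schema.org/birthDate> ?o }',
  deny: { agentGroup: 'https://alice.example/groups#contacts', mode: 'read' },
};

// Serving the whole document to a contact leaks a colleague's birthdate;
// refusing the whole document withholds data that contact is allowed to see.
```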

A naive way to deal with this constraint is asking clients to restrict their queries to what is permitted by the active policies:

[a diagram of a tension field between a node labeled Interface and two intertwined nodes labeled Policy & Query, connected by a bidirectional arrow]

Naively constraining queries to what is permitted by policies provides insufficient protection for sensitive data in the interface.

This model is the equivalent of a don’t steal my stuff sign on a wide open front door. To be fair, it does make sense in some cases. For instance, open data or source code often carry licenses, which an interface’s responses can describe. It could be okay for clients to merely access some intellectual property and then discard it if their intended usage wouldn’t meet the license’s conditions. When it comes to personal data rights, the server or client might already be committing legal or ethical violations merely through the possibility of accessing such data, rather than actual usage. Relying on clients’ goodwill can be a viable strategy for some queries, but not with sensitive data.

Interface-controlled policies negate use case query needs

Many servers instead handle potential conflicts between interfaces and policies by resorting to fixed policy granularities that only cover what they consider to be common use cases. For example, current social media Web APIs don’t let you grant access to just the dates and locations of your posts without also disclosing your posts’ full contents.

The exclusive determination of policy granularity by the interface, rather than the queries required for a use case, is reflected in this model:

[a diagram of a tension field between a node labeled Query and two intertwined nodes labeled Interface & Policy, connected by a bidirectional arrow]

Tying the granularity of policies to those of specific interfaces allows for simple server-side enforcement at the cost of a reduced query offering.

The benefit of the above model is that it simplifies server-side policy enforcement. By (fully or partially) enforcing policies at the earliest moment in the data flow, it provides easy protection against unauthorized use. However, it does so by excluding many access patterns, which is easier to imagine in domain-specific interfaces, but harder for domain-agnostic data interfaces such as the ones we’re aiming to build.

The Policy–Interface–Query triad

In general, our stated needs cannot be met by settling for fixed policy granularities, since one of our needs captures the ability to attach policies to any individual fact. Although such highly specific cases might seem exceptional, enabling them is a precondition to offering all possible kinds of data protection. After all, any data fact might be subject to multiple partially overlapping policies, and the only way to cover such situations is to allow for policy selections up to the most elementary level.

So clearly, policy is not a constraint we can apply directly to query, because then interfaces cannot play an enforcing role. Conversely, applying policy as a constraint to interfaces significantly reduces the available flexibility for query. Therefore, the final diagram models policy as a constraint on the relation between interface and query.

The resulting Policy–Interface–Query triad (PIQ triad) is the mental model I’ve been using for several years to reason about the sliders of the solution space. It shows how the tension itself between interfaces and queries experiences a tension from policies:

[a diagram of a tension field between a node labeled Interface and a node labeled Query connected by a bidirectional arrow; this connecting arrow bends down as it experiences pressure from a node labeled Policy weighing on it from above, as indicated by an arrow from Policy towards the middle of the Interface–Query arrow]

The PIQ triad reveals that the choice of the appropriate interfaces for a given query load is always constrained by the policies that can apply to the data within a given use case.

In addition to revealing the three core sliders of the data publication solution space as Policy, Interface, and Query, the PIQ triad highlights that maximizing all of the three sliders simultaneously is impossible. A high demand for query flexibility conflicts with either policy complexity or interface performance. Similarly, a high demand for interface performance restricts query and/or policy complexity. The prioritization of our actual wants will always be constrained by this triangular solution space.

Those constraints also mark the end of REST for designing permissioned data systems. The architectural style has outlived its relevance as a helpful framework for evaluating client–server interfaces within this realm, because it offers inadequate guidance for interfaces under the query and policy constraints at the heart of the PIQ triad. None of this constitutes a value judgement about REST itself, nor an endorsement of any other style, but rather an unmistakable demarcation of the boundaries of REST for data.

The PIQ triad in practice

Designing Web-based data publication systems

So how does the PIQ triad answer our main question on designing better data systems? And what challenges do we need to be tackling right now to build such systems?

From a feasibility perspective, cross-source query is a solved problem, to which multiple solutions already exist. From an optimization perspective, cross-source query is not one single solvable problem, because the optimization criteria vary between use cases. That means our objective should be to design adjustable systems that can span across various ranges of the solution space. By definition, no single system can span all of it. The resulting diversity is highly desirable, yet simultaneously emphasizes the importance of interoperability between diverse systems.

From a problem perspective, we must know the relevant wants and how they interact and conflict, while resisting the temptation to deviate from our needs. From a solution perspective, the PIQ model helps us understand at a high level where the wiggle room is. Furthermore, we should iterate between problem and solution, because tracing back from their consequences can help us sharpen and prioritize our wants.

Reusing from open document publication

As a secondary question, we were wondering which parts from open document publication systems—if any—are good candidates for reuse within publication systems for permissioned data, acknowledging that the actual match might be less natural than we all had anticipated when we embarked on this journey many years ago.

More than ever, orthogonality stands out as a key attribute of technology specifications. This interoperability aspect enables selective reuse through careful mixing and matching of relevant technologies. It’s a precondition to discussing reuse of open document technology in the first place. Another precondition, which I believe requires more attention, is standardizing the scalable discovery of what interfaces and features are available to a given authenticated client accessing data from a specific server.

Interoperably designed protocols such as HTTP allow for request-based authorization that is independent from concrete identity and policy mechanisms. Therefore, many of their lessons still stand, and it’s no coincidence that emerging specifications such as the Solid Protocol heavily rely on both the principle of orthogonality as well as existing specifications that were designed with this orthogonality principle already in mind.

Of bigger concern is that the time-tested REST architectural style has undoubtedly lost its relevance as a conceptual framework to guide the design of permissioned data systems on top of the Web—and this affects how we’ll want to use HTTP. Even though REST is not in strong contradiction with servers adjusting their multi-interface offering to different groups of clients, it crucially lacks any guidance for doing so. The breaking point is replacing REST’s symmetric resource abstraction by query as a first-class abstraction over multiple read interfaces, implying read/write asymmetry. This directly contradicts the core assumption of symmetry, which is baked so deeply into REST that even its main description leaves it implicit. The HTTP specifications—fortunately in a more non-committal way—similarly reflect this symmetry and resource-orientation.

Reuse of REST- or HTTP-based specifications should hence be considered with caution, and with a careful eye on maintaining orthogonality. Under such circumstances, the transparent reuse of some orthogonal specifications, let’s say temporal versioning, carries the potential for compatibility. However, and definitely in the absence of an alternative model to REST, we should steer clear of premature assumptions about the portability of document technology to data technology. Recall that the orthogonality of interfaces and authorization—which works so well for documents—is easily broken by query in a permissioned data ecosystem. Maybe temporal document access could also turn out to possess vastly different characteristics than temporal data access. The PIQ triad can help us reason about when interface-level approaches are sufficient, or when a combined interface–query (or even policy–interface–query) strategy is required.

Solving unsolved problems

By proposing a simple way to think about the permissioned data publication landscape, this blog post aims not to push any particular solution, but rather to provide a useful mental model for discussing all potential solutions. I hope those discussions will shift our questions from the answered feasibility angle (How do we make query work?) to the open-ended optimization and desires angle: How do you want query to work?

While I criticized REST and interfaces, I’m purposely not writing about solutions here. Some of my other blog posts deliberately explore query, interfaces, and policies, which acquire additional colors through the prism of the PIQ triad that was underpinning my thinking when I wrote them.

Triads for modeling trade-offs are not uncommon. I recently learned how in information security, the Confidentiality–Integrity–Availability triad explores the solution space of secure data systems. In a nutshell, it similarly reveals that privacy is a solved problem: simply pour all of your personal data into a block of concrete, which you then dump into the middle of the ocean. Perfect privacy. And that’s when we find out that people don’t just care about confidentiality, but that availability and integrity matter too.

I similarly claimed query to be a solved problem, yet the PIQ triad shows that many areas of the solution space still require consideration and exploration. That’s exactly why it’s so crucial to talk about the right problems: so we can find the right solutions.

By the way, we’ll also need the write solutions, as I only wrote about read. But I trust you’ll agree the read problem space leaves plenty of room for investigation.

Let’s get our prisms out.

Ruben Verborgh

Dedicated with great gratitude to my dear colleague Pieter Colpaert, who has helped shape my thinking about reading and writing on the Web. None of the thoughts herein would’ve been worth anything if left unchallenged by you.
