CRDTs and collaborative playgrounds

Published by Sam Lock on December 17, 2024

At Cerbos, we specialize in simplifying complex authorization logic to empower developers with the tools to implement secure, scalable, and maintainable access control systems. Our mission is to streamline the development of robust access policies, making it easier for teams to define who can do what in their applications.

One of the tools we offer is a collaborative IDE and testing environment we nicknamed the "Playground" (because access control should be a joy, not a chore). The Playground is a fully integrated collaborative IDE with built-in testing that provides feedback in real-time, and easily integrates into your GitOps workflow! We're very proud of it, but rather than spending too much time bragging, I'm going to deep-dive into a very particular word from the spiel before: "collaborative".

We saw real value in building the environment with collaboration in mind—both for efficiency in authoring policies and also as a tool for sharing knowledge or educating others (pair-programming FTW, right?).

But how does one make an app collaborative? Well, there are several approaches, but we chose one that stands out for its elegance and efficiency. To enable this seamless collaboration, we've leveraged the power of Conflict-Free Replicated Data Types (CRDTs).

CRDTs: What and why

CRDTs are a class of data structures that automatically resolve conflicts in distributed systems, allowing for seamless data synchronization across multiple points without centralized coordination. They're designed for environments where network partitions or latency make constant communication impractical but have since found more generalised use due to their simplicity and elegance.

They're incredibly useful when it comes to developing robust, distributed applications that require real-time collaboration. They enable multiple users to work concurrently on the same dataset, with guarantees of eventual consistency, eliminating the need for complex conflict resolution logic. Does your application need offline support? Good news: you get that for free, too!

The concept was formalised in 2011 when a group of very smart researchers came together and presented a paper on the topic; initially motivated by collaborative editing and mobile computing, but its adoption has spread to numerous other applications in the years that followed.

OK, sold. How do I get started?

The answer, surprisingly, is "very easily". Given its meteoric adoption rate in recent years, some excellent, battle-tested projects have appeared and taken strong hold in the community. Let's take a look at a couple:

Yjs

Yjs is a bit of a household name, it's probably fair to say that this project underpins a large proportion of CRDT-driven applications in general use today. Born originally in the JS space by the prolific Kevin Jahns, it started as a collection of importable familiar "types" with a sync-able central document:

import * as Y from 'yjs';


const doc = new Y.Doc();
const yarray = doc.getArray('my-array')

yarray.observe(event => {
  console.log('yarray was modified')
})

// every time a local or remote client modifies yarray, the observer is called
yarray.insert(0, ['val']) // => "yarray was modified"

Over the years that followed, Yjs grew into an extensive library of adapters for different communication protocols, editor integrations and storage drivers, each with an almost "drag and drop" integration pattern. More recently, the maintainers built a cross-language/cross-platform Rust port which itself then came with language bindings to several other languages.

On top of its massive ecosystem and reputation for ease of use, it's also an extremely performant implementation. Kevin was the first to introduce a rather clever optimization in the underlying data model which has since become a de facto industry standard: representing a more complex tree-type structure in a single flat list. If you want to read more about this, I recommend you dive into CRDTs go brrr; this pivotal piece goes into some intricate detail of the inner workings of CRDTs, and explains in reasonably accessible terms just how this optimization was achieved.

In short, if you want to dabble in the world of CRDTs, Yjs is a good place to start.

Automerge

Automerge is the brainchild of Martin Kleppman and his team as part of their research at the University of Cambridge. (You may remember Martin as the author of Designing Data-Intensive Applications and other notable works.)

After its foundational release back in 2021, the more recent Automerge 2.0 introduced a plethora of optimizations, features and support. Similarly to Yjs, it also offers a library of data structures as well as a central document.

Automerge is production-ready, available in Javascript and Rust, and includes TypeScript types and C bindings for use in other ecosystems, and is definitely worth a good look if you're considering diving in. Be sure to check out the nifty "5-minute quickstart" to see how it all works!

Do-it-yourself

This isn't a project name—it's a suggestion. If you fancy a challenge, it might not be as tricky as you'd think to build a basic, functioning CRDT—even one that you could tailor to your own specific needs!

How?

Ultimately, what it boils down to is the requirement of a resolvable "unit" within your data structure. What this means is that you can compare that unit to another of the same type, and one will deterministically always be "less than" or "greater than" the other.

Let's take a simple example: a CRDT "string" (e.g. a line in a text document). Our "string" CRDT will be represented by a tree-like structure of the following Unit type instances:

type Unit struct {
    ClientID uint64  // the unique ID of the client/process within the distributed system
    Data     string  // this could simply be a character or perhaps a contiguous set of characters (note: I like this concept and I shall call it a "word")
    Clock    uint64  // I'll come to this below...
    Children []*Unit // ... and this ...
}

type Sentence []*Unit

The `Clock`

Time is… a bit complicated, but very important. In a distributed system, we need to be able to compare the exact datetime of two moments in time. You might be asking, "can't we just rely on physical clocks read directly from the origin computers?" No, because there's no guarantee that both computers have exactly the same time—and the clock skew of even a millisecond could result in havoc in a distributed system.

What we need is the concept of a logical clock. A simple but effective implementation is called a "Lamport timestamp". These clever things provide a method to order events in a distributed system without relying on synchronized physical clocks. Here's a summary explanation as brief as I can muster:

Each process in the system maintains a counter. When an event occurs, the counter is increased. If a process sends a message, it includes its counter with the message. Upon receiving a message, a process sets its counter to be greater than both its current counter and the received counter value. This way, Lamport timestamps ensure a partial ordering of events based on the "happens-before" relation, allowing the system to determine the sequence of events across different processes.

Here's a nice diagram I drew to further demonstrate:

Lamport Timestamp Diagram

Each line represents a separate process in the distributed system. The black nodes are local origin events, the arrows denote the event being passed to another process, and the pink nodes represent that event being received and thus the local clock incremented.

Beyond this, if you want to determine causality, e.g. whether events are "causally related" (happened before or after each other) or are "concurrent" (entirely independent of), you can look at Vector Clocks—I won't go down that rabbit-hole here, though.

Insertion

Let's say we have a line (a tree of Units). Let's say its current state is:

s := &Unit{ClientID: 1, Data: "S", Clock: 1, Children: []*Unit{
    &Unit{ClientID: 1, Data: "h", Clock: 2, Children: []*Unit{
        &Unit{ClientID: 1, Data: "i", Clock: 3},
    }},
}}

This assumes that each Unit stores a single character in Data and that the characters were typed in order, start to finish (e.g. no cursoring back to a middle position). We could write a method to build the representative string from this structure:

func (s *Sentence) build() string {
    if len(*s) == 0 {
        return ""
    }

    b := new(strings.Builder)
    for _, u := range *s {
        traverse(u, b)
    }

    return b.String()
}

func traverse(u *Unit, b *strings.Builder) {
    b.WriteString(u.Data)

    for _, child := range u.Children {
        traverse(child, b)
    }
}

Which, when called, would return:

"Shi"

Nice. But now I want to be able to insert characters into my string. Let's write another slightly more involved function:

func (s *Sentence) insert(unit, parent *Unit) {
    // Resolve within children of parent
    if parent != nil {
        if len(parent.Children) == 0 {
            parent.Children = []*Unit{unit}
        } else {
            for i, c := range parent.Children {
                if unit.Clock > c.Clock || (unit.Clock == c.Clock && unit.ClientID > c.ClientID) {
                    // Insert newer item before older
                    parent.Children = append(parent.Children, &Unit{})
                    copy(parent.Children[i+1:], parent.Children[i:])
                    parent.Children[i] = unit
                    return
                }
            }
            parent.Children = append(parent.Children, unit)
        }
    } else {
        // Otherwise, prepend?
        *s = append([]*Unit{unit}, *s...)
    }
}

When I insert, I increment my local counter, specify a parent Unit, and hand it off to this resolution logic to figure out where it should go. We can run it like this:

func main() {
    u0 := &Unit{ClientID: 1, Data: "S", Clock: 1}
    u1 := &Unit{ClientID: 1, Data: "h", Clock: 2}
    u2 := &Unit{ClientID: 1, Data: "i", Clock: 3}

    u0.Children = []*Unit{u1}
    u1.Children = []*Unit{u2}

    sentence := Sentence{u0}

    sentence.insert(&Unit{ClientID: 1, Data: "p", Clock: 4}, u2)

    print(sentence.build())
}

And we receive:

"Ship"

Even nicer. But what happens if two users attempt to update the same word at the same time? Let's see:

    sentence.insert(&Unit{ClientID: 1, Data: "t", Clock: 4}, u2)
    sentence.insert(&Unit{ClientID: 2, Data: "p", Clock: 4}, u2)

Note the differing ClientIDs. If we build this string, it gives us:

"Shipt"

This scenario made it into the inner loop of the resolution. Because these updates were concurrent (e.g. two separate offline operations), the resolution logic falls back to the tie-breaker, which in this case is a comparison of the ClientID. Thus, we still have our deterministic output!

The case against DIY

It's worth acknowledging the overhead involved with building your own CRDT—particularly if you intend on building a product on top of that CRDT. I learnt this first-hand with one of my own projects: fuzzynote.

Fuzzynote is a list-based note-taking tool, backed by a tree-based CRDT which supports sensical partial ordering and the ability to move items (plus their dependents) to other locations in the list. The reason for the novel design was very specific to the use case: it allowed users to share and collaborate on individual items (or selections thereof) with other users whilst still maintaining a sane ordering of the overall list.

The original app was terminal-based, but I found myself wanting to access my notes on the go. So I abstracted away the core functionality, compiled it to Wasm, and built a web app around it, too. I then built a lambda-based infrastructure to serve the web app. I then figured I might be able to sell this to other people, so I doubled down and set about productizing it…

As the months rolled on, between the web app, the infrastructure, and the underlying engine and CRDT—even though the first two were hefty undertakings in themselves—I still found that the vast majority of my time was spent digging around the dark corners of the engine, adding new minor features, or fixing obscure (terrifying) bugs. Unsurprisingly, I ran out of steam and shelved it. 🏳️

Lesson learnt: if I ever fancied having another crack at building a marketable product, I would more than likely just use Yjs. That said, while it might not have been the best product-building approach, I still had a lot of fun.

How Cerbos uses CRDTs

A brief word on our architecture

We adopt a backend-for-frontend (BFF) architecture. The Node.js backend is responsible for communicating with the various Go services that power Cerbos Hub and does all of the heavy lifting required to retrieve, aggregate and transform data used in the front end. This keeps our front end nice and light—and maintainable.

Now, for the playground clients to talk to each other, we need a central point of communication. Conveniently, our Node.js backend offers us this very functionality. Our Go backend has no interest in the CRDT data that's fired around—in fact, it's better to consider the users' browsers themselves as the primary sources of truth—so we don't have to bother with the extra network jump! This also gives us the power to deploy the UI backend close to our users and keep that collaborative latency down to a minimum. Very cool!

Given its basic proxying needs, CRDT data transfer is also protocol agnostic. Providing the state ends up where it needs to be, you can use any method to get it there. We opted for Websockets, given its simplicity, prevalence, and performance.

Websocket diagram for Cerbos Hub

Notice how at this point, there is no persistence in the backend, but there is IndexedDB persistence in each user's browser. Providing no user deletes their IdDB storage and all users maintain an internet connection, we now have fully functioning collaboration in our Playground with only a single proxy layer in our UI backend! When a user logs in, they load up the state from their browser IdDB instance and send off a synchronisation event to any other listening clients. This clever "handshake" mechanism results in missing events being shared between all who need them! And all the resolution occurs locally within the user's browsers.

You might have a couple of questions, though:

What about if everyone loses the local state?
What about if someone made loads of edits and then closed their laptop and went on holiday?

These are fair questions. The scenario above describes what will likely occur in the vast majority of cases, however, we need to cover these, too.

Persisting state in the backend with Yjs

We use Yjs as our weapon of choice, given its hefty ecosystem and brilliant performance. Because our UI backend runs in Node.js, we're able to represent the same Playground state there as well; in essence, the backend is just another read-only user.

Backend CRDT diagram for Cerbos Hub

This almost solves our persistence problem itself, but we can't rely on a single process being around forever holding the CRDT state in memory. So, what we do is serialize it into binary format and periodically ship it off to a Go service which stores it, as is, in our Postgres DB. When someone loads up the given playground, they make a request to the UI backend, which calls this Go service to retrieve the binary blob, deserializes it into Yjs's CRDT state, and then syncs that back down with the user. Success!

My colleague lives at the South Pole.

The eagle-eyed reader may have spotted a potential flaw with a comment I made previously on "deploying the UI backend near to users". This is all well and good if collaborators all share the same backend server, but what if they don't? If two users are attempting to collaborate on a single playground via two different backends, how do the events reach each other? This issue also rings true in more localised setups; what if I have two or more instances of the UI backend running in the same region, due to demand?

OK. So we need another approach for this:

Redis on the CRDT backend diagram

In the massive library of useful Yjs tooling, there is a Redis adapter available which allows us to represent our CRDT state in Redis' KV format (scoped to each Yjs document or Playground). If we extend it further with a fairly simple distributed mutex mechanism, we can now persist and share state across any service which can access the Redis instance!

And with that, we have a fully functioning, globally distributed, persisting collaboration infrastructure. 🎉

In conclusion

Cerbos leverages the power of CRDTs to facilitate real-time collaboration in our Playground, ensuring a seamless and efficient development experience. By adopting a combination of advanced data structures and innovative architecture, we offer developers a robust platform for building and testing access control systems collaboratively. This approach not only simplifies the complexity of authorization logic but also enhances productivity and knowledge sharing among teams.

Whether you're a seasoned developer or just starting out, Cerbos provides the tools and support needed to implement secure, scalable, and maintainable access policies with ease.

As always, if you have any questions or comments on the above, please head over to our Slack community—we look forward to seeing you there.

Thanks for reading!

Engineering

Book a free Policy Workshop to discuss your requirements and get your first policy written by the Cerbos team