System Design goal: efficient property reads from multiple sources

A problem of ownership

I set out to build the props library to standardize retrieving (key,value) pairs from multiple sources while also allowing a simple mechanism to establish key precedence when more than one source is aware of the same key.

Let’s consider the following scenario, where multiple sources define property X:

Source (Layer)	Value
Prop X is defined by source 1	1
Prop X is defined by source 3	2
Prop X is defined by source 4	3
Effective value	3

Sources can be property files, system properties, or they can be coming from a key-value store or database.

Since each (key,value) pair can change independently on each source, we need to establish a key’s precedence, basically answering the “Who provides the value for this key?” question.

We do so by introducing the concept of layers.

Each added layer can retrieve data from a source. As we add more layers, each takes precedence over the previous layer, for any keys it defines.

We call this concept ownership.

Let’s look at an example:

flowchart LR
    subgraph Source 1
        system[System Properties]
        X1[X: 1]
        Y1[Y: 1]
    end

    subgraph Source 2
        env[Environment Variables]
        Y2[Y: 2]
    end

    subgraph Source 3
        file[Filename]
        X2[X: 2]
    end

    subgraph Source 4
        cp[Classpath]
        X3[X: 3]
    end

    registry[Properties Registry]
    system --> registry
    env --> registry
    file --> registry
    cp --> registry

    client[Client]
    registry -- The value of X is: 3 --> client
    registry -- The value of Y is: 2 --> client
    X3 -- effective value --> registry
    Y2 -- effective value --> registry

In the diagram above we see that:

Source 4 provides the effective value for X
Source 2 provides the effective value for Y
(regardless of what other sources have values for X and Y)

We can also say that source 4 owns X and source 2 owns Y.

Sources vs. layers

Let’s think for a moment about sources and layers.

A naïve approach would be to model a 1:1 relationship between Sources and Layers, basically creating a single entity to represent both, then passing a list of these entities to a registry.

Consider the following configuration:

classDiagram
    class SourceLayer1{
        source: /path/to/file.properties
        -load()
        +reload(interval)
    }
    class Registry1 {
    }
    Registry1 <|-- SourceLayer1

    class SourceLayer2{
        source: /path/to/file.properties
        -load()
        +reload(interval)
    }
    class Registry2 {
    }
    Registry2 <|-- SourceLayer2

We see that both sources point to the same file on disk. However, we have two objects that independently read the same data and create a potential for contention and consistency problems (e.g., SourceLayer 1 reads data, the file is updated on disk, SourceLayer2 reads different data).

Consistency is not always possible and cannot always be a property of a system, but in this case, we can ensure that a source has the same view of its data in a running JVM process.

Rather than have sources and layers be the same (1:1), let’s instead have a single source representing a dataset and multiple layers (1:N) which receive updates when the dataset changes. An added benefit of this design is that our system only needs to do one read operation per dataset instead of one per layer. Since I/O is generally more “expensive” than in-memory operations, this is a win!

Here are the components of our system:

Source: loads the data from its origin (e.g., filesystem), reloads it using user-defined logic (e.g., an interval), and notifies registered layers of any reload events
Layer: holds copies of the data for each registry
Registry: queries data from each of its layers

classDiagram
    class Source{
        source: /path/to/file.properties
        -load()
        +register(layer)
        +reload(interval)
    }
    class Layer1{
        onReload(data)
    }
    class Registry1 {
    }
    Layer1 <|-- Source
    Registry1 <|-- Layer1

    class Layer2{
        onReload(data)
    }
    class Registry2 {
    }
    Layer2 <|-- Source
    Registry2 <|-- Layer2

Retrieving the effective value

But what if a layer adds a new key or removes an existing one!?

sequenceDiagram
    participant L1 as Layer 1
    participant L2 as Layer 2
    participant L3 as Layer 3
    participant R as Registry
    participant C as Client
    
    activate L1
    L1->>L1: Set X: 1
    L1->>L2: Set X: 2
    deactivate L1
    activate L2
    L2->>L3: Set X: 3
    deactivate L2
    activate L3
    C->>R: Get X
    R->>L3: Retrieve from current owner
    R->>C: X:3
    L3->>L2: Unset X
    activate L2
    deactivate L3
    C->>R: Get X
    R->>L2: Retrieve from current owner
    R->>C: X:2 (now owned by L2)
    deactivate L2

Between two reads of X, the owner changes from Layer 3 to Layer 2.

A simple algorithm to solve this problem is always to query all layers, e.g.:

ask Layer 1-3: Do you define X?
Layers 2 and 3 respond yes
Layer 3 has precedence and owns the key
X is unset from Layer 3
we repeat the first three steps and establish Layer 2 as the owner of X

This choice is not ideal. First, we’d have to repeat this on every read, which would result in O(K) time-complexity (where K is the number of layers in the registry) for every single read property. This doesn’t sound performant!

A side objective here is to deal with a high frequency of ownership changes; for example, a source dataset may repeatedly set and unset one or more keys in a short interval.

Type casting values

Before we talk about ownership, let’s quickly segue into data types. The values will always be serialized as strings as we’re reading them from property files, system properties, and the environment. (The library will support loading values from other data stores, e.g., databases, but the problem of casting strings from property files remains)

Here’s an example of a standard Java property file:

string.prop=some value
int.prop=42
float.prop=1.23

Even though we have integer and float values, the relevant Java APIs will still return a Map<String,String>, leaving the job of casting these values to the appropriate types to the developer.

In practice, while developers “come and go” in projects, this results in many ways to, not-so-optimally, load strings and repeatedly convert them to the desired type.

When I set out to build the props library, I wanted to give developers “one” way of loading the values corresponding to each key without worrying about types, simple validations, or caching. My goal is to implement a no-hassle, efficient library for dealing with application properties!

How do we establish key ownership?

Ok, back to ownership!

The registry needs to keep track of a large enough number of keys (let’s say in the 10,000s range) and retrieve values from the appropriate layer.

Developers will be able to rely on a registry to retrieve values cast to their desired type.

classDiagram
    class Registry{
        layers
        +get(key, type)
    }

Let’s examine the internals of a get(key, type) call:

flowchart LR
    subgraph Props
        subgraph Layer 1
            l1[Layer 1]
            x11[str.prop=value1]
            x12[int.prop=1]
        end
        subgraph Layer 2
            l2[Layer 2]
            x22[int.prop=2]
        end
        registry[Registry]
    end

    client[Client]
    client -. 1: get int.prop, INT .-> registry
    registry -- 2: find key --> l1 & l2
    l1 -.- x12 -- 3: ACK --> registry
    l2 -.- x22 -- 3: ACK ==> registry
    registry -- 4: cast and return int.prop=2 --> client

A client queries the registry (1), which in turn checks all its layers (2), awaiting responses (3), then determining the owner and returning the effective value to the client (4).

In practical applications, we can assume a registry will have a few layers (K).

If our application handles N properties (keys), then reading all the properties at least once would be an O(N*K) operation. Given that ultimately a key is owned by a single layer, the best we can hope for is O(N) for N keys, or O(1), constant time, per key.

This time-complexity is not only more efficient but also totally achievable!

Internally, the registry needs to keep track of which layers own each key.

Since layers can be updated concurrently, we need a thread-safe data structure.

Since we need to “map” (pun intended) keys to layers, we need to use a Map (duh!)

The Java SDK provides a very efficient, thread-safe map: ConcurrentHashMap.

Let’s see how this could look in code:

public class Registry {
    ConcurrentHashMap<String, Layer> owners;

    // retrieves the value for the specified key
    public <T> T get(String key, Class<T> clz) {
        // finds the owner
        Layer owner = owners.get(key);
        // retrieves the value
        String effectiveValue = owner.get(key);
        // casts it
        return clz.cast(effectiveValue);
    }
}

public interface Layer{
    // retrieves the value for the specified key, from this layer alone
    String get(String key);
}

The last piece of the puzzle is updating ownership. Below, we examine the new key case.

public class Source{
    List<Layer> layers;

    // reloads data periodically, at the specified interval
    void reload(Duration interval) {
        // the trigger will be periodically scheduled at an interval
        var trigger = () -> {
            // data is reloaded from source
            Map<String, String> data = ...;

            // and then sent to all registered layers for this source
            for (Layer layer : layers) {
                layer.onReload(data);
            }

            // we may also want a sync stage here
            // to make the new dataset available to all sources at the same time
            // but it's beyond the scope, for the moment
        }
    }
}

public class Layer{
    // processes the reloaded data
    public void onReload(Map<String, String> data) {
        // for each key in the dataset
        for (var entry : data.entrySet()) {
            // updates the value
            updateValue(entry.getKey(), entry.getValue());
            // and notifies the registry that a new owner might be available
            registry().bindKey(entry.getKey(), this);
        }
    }

    // decides the priority of this layer, in the current registry
    int priority();

    // reference to the registry that owns this layer
    abstract Registry registry();
    
    // stores the (key,value) pair
    abstract void updateValue(String key, String value);
}

public class Registry {
    ConcurrentHashMap<String, Layer> owners;

    // registers ownership of a layer over a key
    void bindKey(String key, Layer layer) {
        // finds the current owner
        Layer owner = owners.get(key);
        
        // determines if ownership change is required
        if (owner.priority() < layer.priority()) {
            // change the current owner
            owners.put(key, layer);

            // from this point, all "get"s of "key" will be routed to "layer"
        }
    }
}

The scenario above details how the system will handle new keys appearing in a layer.

A full implementation also needs to consider deleted keys and updated keys. For those cases, we’d want to avoid calling registry().bindKey(...) since it would result in a no-op anyway.

Overall, I am willing to bet that ownership changes will be less frequent than value changes in real-world scenarios. So, the upfront cost of adding and deleting keys, represented by registering and unregistering keys from a layer, will be worth it to achieve constant time (O(1)) reads per key.

Conclusion

At this point, I am genuinely excited to implement this code!

I’m sure I will learn more as I go along and may change my mind on some of these choices, but this is the plan for now!

Update Sep 21, 2021: I realized a fatal flaw in this design! Check out the next article for an in-depth explanation!

Until next time…