
ActivityPub normalization part I, object methods

This post will deal with preparing ActivityPub for storage using an object library. Most programmers will be using a JSON library to parse the received JSON (or equivalent) into some native object or struct idiomatic for their programming language. The alternative is parsing a buffer, which I'll try to present tolerably in part II

Once you can upload and serve ActivityPub naively, the next step will be to establish a basis for the program to process it. For this, it's useful to establish a consistent starting point. ActivityPub can, and will, contain nested Objects, including other Activities. If we unwind those and replace any quoted Objects or links with their references, then we can avoid certain problems when processing

To be clear, however, we will not be normalizing ActivityPub objects in the same manner as would be done with a relational database. What we really want here is key-mapped storage like a key-value store or a simple file system

The normalize function

We're going to see ActivityPub Objects and Links as JSON objects. The root object in JSON will be an object with key:value pairs. The keys should always be strings representing ActivityPub properties and the values may be objects, strings, or arrays. The spec says that we should pass through any unknown properties, so our storage needs to allow for them. That means we cannot hard code the properties from the spec into a table format. Instead, we need to save a copy of the object that we can look up by id, and that copy should reference any other objects that were received instead of embedding them

The normalize function is going to iterate over the properties of the received object and return a copy of that object that has replaced any embedded Objects with an ID

The two required properties in ActivityPub are id and type. When receiving ActivityPub from a client, the id may be omitted and we need to rewrite the id if one is included, so type is the property to check to establish that the JSON object is ActivityPub. I may change this later, but for now I think the safest thing to do with something that looks like a JSON object but doesn't have a recognizable ActivityPub type will be to treat it as an Object with an unknown type, which is to say that we'll consider it a leaf node and return an unmodified copy

So the first step is to check the ActivityPub type. If the type is unknown, a Link, or an Object (like Note or Article) that is not a Collection type or an Activity, then that's a leaf node. The normalize function will return that object verbatim
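
The leaf check described above might look something like the following. This is only a sketch: the names is_leaf and types_of are made up for illustration, the type lists are copied from the Activity Streams 2.0 vocabulary, and the handling of an array-valued type is an assumption rather than something this post specifies.

```python
# Activity and Collection types from the Activity Streams 2.0 vocabulary.
# Anything not listed here is treated as a leaf node.
ACTIVITY_TYPES = {
    "Activity", "IntransitiveActivity",
    "Accept", "Add", "Announce", "Arrive", "Block", "Create", "Delete",
    "Dislike", "Flag", "Follow", "Ignore", "Invite", "Join", "Leave",
    "Like", "Listen", "Move", "Offer", "Question", "Reject", "Read",
    "Remove", "TentativeAccept", "TentativeReject", "Travel", "Undo",
    "Update", "View",
}
COLLECTION_TYPES = {
    "Collection", "OrderedCollection",
    "CollectionPage", "OrderedCollectionPage",
}


def types_of(obj):
    """Return the object's type(s) as a list of strings (possibly empty)."""
    value = obj.get("type")
    if isinstance(value, str):
        return [value]
    if isinstance(value, list):
        # Assumption: an array of types is accepted, non-strings ignored.
        return [t for t in value if isinstance(t, str)]
    return []


def is_leaf(obj):
    """True for Links, plain Objects, and anything with an unknown type."""
    return not any(
        t in ACTIVITY_TYPES or t in COLLECTION_TYPES for t in types_of(obj)
    )
```

With this, a Note, a Link, or a JSON object with no type at all is a leaf, while a Create or an OrderedCollection is not.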

If the ActivityPub type is not a leaf node, then we'll iterate over its properties. When the value of a property is a string, that branch needs no further processing. When a value is an array, the values in the array need to be iterated over and any objects normalized. This is recursive

The normalize function will call itself for any objects that it encounters and call a function to save the normalized object to storage on return. The save function should take an object as an argument, persist that object, and return the id. Normalize then munges its copy of the current object, replacing the embedded object with the id returned from saving the normalized embedded object
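
Here is a rough sketch of that recursion, not a finished implementation. The save argument is whatever function the caller supplies. The non-leaf type list is abbreviated, the in-memory store is a stand-in for real storage, and passing @context through untouched is an assumption of this sketch, since it is JSON-LD metadata rather than an ActivityPub object.

```python
import copy

# Abbreviated, assumed list; a real one would cover every Activity and
# Collection type in the vocabulary.
NON_LEAF_TYPES = {"Create", "Update", "Announce",
                  "Collection", "OrderedCollection"}


def is_leaf(obj):
    value = obj.get("type")
    types = value if isinstance(value, list) else [value]
    return not any(isinstance(t, str) and t in NON_LEAF_TYPES for t in types)


def normalize(obj, save):
    """Return a copy of obj with embedded objects replaced by save()'s result."""
    if is_leaf(obj):
        return copy.deepcopy(obj)  # leaf node: return a verbatim copy
    result = {}
    for key, value in obj.items():
        if key == "@context":
            # Assumption of this sketch: leave JSON-LD context alone.
            result[key] = copy.deepcopy(value)
        elif isinstance(value, dict):
            # Recurse first, then persist, then keep only the reference.
            result[key] = save(normalize(value, save))
        elif isinstance(value, list):
            result[key] = [
                save(normalize(item, save)) if isinstance(item, dict)
                else copy.deepcopy(item)
                for item in value
            ]
        else:
            result[key] = value  # strings and other scalars need no work
    return result


# Stand-in storage for the demonstration only. A real save function
# would not use the externally supplied id directly as its key.
store = {}


def demo_save(obj):
    store[obj["id"]] = obj
    return obj["id"]


activity = {
    "type": "Create",
    "id": "https://example.com/a/1",
    "actor": "https://example.com/u/alice",
    "to": ["https://www.w3.org/ns/activitystreams#Public"],
    "object": {
        "type": "Note",
        "id": "https://example.com/n/1",
        "content": "hi",
    },
}
flat = normalize(activity, demo_save)
```

After this runs, flat["object"] holds the Note's id rather than the Note itself, and the Note is in store under that id. The received activity is left unmodified.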

If you know what you're doing, there's room for optimization. If you don't, this will work

The save function

The save function needs to determine the id of the object, map it to our storage method, then return a key to look it up. The implementer has a lot of options in this space, but generally speaking the key should be something fast to compute, easy to look up in the target storage system and be resistant to injection attacks. (Don't use externally supplied identifiers as keys to access internal storage)

Case 1: The id is an explicit null

When the id is null, and not simply absent, then the object isn't intended to be persisted. Return the original contents of the argument without modification and without modifying storage
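
One thing to watch for when telling Case 1 apart from a missing id: in many JSON libraries the convenient lookup returns the same value for both. In Python, for example, obj.get("id") gives None either way. A small sketch of one way around that, using a sentinel (the name id_state and its return strings are made up for illustration):

```python
import json

_MISSING = object()  # sentinel that can never come out of a JSON parser


def id_state(obj):
    """Classify the id as 'absent', 'null', or 'present'."""
    value = obj.get("id", _MISSING)
    if value is _MISSING:
        return "absent"   # key not there at all: Case 2
    if value is None:
        return "null"     # explicit JSON null: Case 1, do not persist
    return "present"      # something to inspect further


transient = json.loads('{"type": "Note", "id": null}')
unnamed = json.loads('{"type": "Note"}')
```

Here id_state(transient) is "null" and id_state(unnamed) is "absent", even though .get("id") would return None for both.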

Case 2: There is no id

If there is no id, we're going to assume that this is client protocol and create one. Either the client didn't provide one or we've advanced our program to the point where we've scrubbed any client-provided id in uploaded protocol packets. The method for creating an id is up to the implementer. A Snowflake ID is a guaranteed unique 64-bit identifier to use for the local part of a globally unique identifier, but it requires a 64-bit bitops library. If this isn't convenient, a similar combination of seconds-since-epoch and salted-collision-resistant-hash should suffice
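
As a rough illustration of the second option, the sketch below combines seconds-since-epoch with a salted SHA-256 hash of the object. The function name new_local_id, the base URL, the per-call random salt, and the choice to truncate the digest to 32 hex characters are all assumptions made for this example.

```python
import hashlib
import json
import secrets
import time


def new_local_id(obj, base="https://example.com/objects/"):
    """Build an id from the current time and a salted hash of the object."""
    seconds = int(time.time())
    # A fresh random salt per call, so two identical objects received in
    # the same second still get different ids.
    salt = secrets.token_bytes(16)
    payload = json.dumps(obj, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(salt + payload.encode("utf-8")).hexdigest()
    # Hex timestamp plus the first 128 bits of the digest as the local part.
    return f"{base}{seconds:x}-{digest[:32]}"


first = new_local_id({"type": "Note", "content": "hi"})
second = new_local_id({"type": "Note", "content": "hi"})
```

Because the salt is random, first and second differ even though the input is the same. Unlike a Snowflake ID, uniqueness here is overwhelmingly likely rather than guaranteed.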

Case 3: The id is an IRI string

When an IRI is serving as an "id" value, it must not be mapped to a URI. We'll revisit this later to verify that it can be dereferenced, but for now we just want to get it into storage

The real problem is that this is a string from an untrusted source. We could escape this string or quote it using an algorithm appropriate for whatever storage method is being used, or we could create a local id
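
One way to create such a local id is to hash the IRI and use the hex digest as the key, so that nothing the remote server wrote ever reaches the storage layer directly. This is a sketch under assumptions: storage_key and storage_path are made-up names, SHA-256 is one choice among several, and the two-level directory layout is just an example for file system storage.

```python
import hashlib


def storage_key(iri):
    """Derive a fixed-length, hex-only key from an untrusted IRI string."""
    # The IRI is hashed exactly as received, with no mapping to a URI.
    # "surrogatepass" keeps the encoder from raising on lone surrogates,
    # which a JSON parser can produce from escaped input.
    data = iri.encode("utf-8", "surrogatepass")
    return hashlib.sha256(data).hexdigest()


def storage_path(iri, root="objects"):
    """Map an IRI to a relative file path built only from hex digits."""
    key = storage_key(iri)
    return f"{root}/{key[:2]}/{key[2:]}.json"


hostile = "https://example.com/../../etc/passwd?x='; DROP TABLE objects;--"
path = storage_path(hostile)
```

The resulting path contains only the root, hex digits, slashes placed by our own code, and the .json suffix, whatever characters the id contained. The original IRI still needs to be stored inside the saved object, since the hash can't be reversed.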

An IRI is the internationalized version of a URI or URL. An implementation may, if so inclined, choose different storage methods by mapping the URL scheme and domain, for example using different storage engines for different schemes or domains

Case 4: The id is an array or an object

This is a hypothetical situation that won't be presented to our server by any implementation as of this writing. It is possible, however, that future AP services may offer multiple IRIs for retrieving an item and leave it to remotes to choose the best one according to local needs. If we use a string representation of the id to generate a hash as part of the local storage id, then we need to be certain that we don't end up with multiple copies of the object saved with variations where the elements in the array or object are present in a different order. Ideally, we'd be able to track any item as an entity even if migrated across schemes and domains. Since this is still a hypothetical situation and http/https will be universal for some time, using the https or http IRI as the basis for key generation is probably best
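
Since no implementation sends ids like this, the shape assumed below (an array of IRI strings, or an object whose values are IRI strings) is purely a guess. With that caveat, here is a sketch of picking one IRI in a way that doesn't depend on element order, preferring https and then http. The function names are made up for illustration.

```python
def candidate_iris(id_value):
    """Collect the IRI strings from a string, array, or object id."""
    if isinstance(id_value, str):
        return [id_value]
    if isinstance(id_value, list):
        items = id_value
    elif isinstance(id_value, dict):
        items = id_value.values()
    else:
        return []
    return [item for item in items if isinstance(item, str)]


def canonical_iri(id_value):
    """Pick one IRI as the basis for key generation, or None if there is none."""
    # Sorting removes any dependence on the order the sender used.
    iris = sorted(set(candidate_iris(id_value)))
    for scheme in ("https:", "http:"):
        for iri in iris:
            if iri.lower().startswith(scheme):
                return iri
    return iris[0] if iris else None


forward = canonical_iri(["ipfs://example-cid", "https://a.example/notes/1"])
backward = canonical_iri(["https://a.example/notes/1", "ipfs://example-cid"])
```

Both calls return "https://a.example/notes/1", so the same object maps to the same key regardless of how the sender ordered the alternatives.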

Summary

It shouldn't matter whether there's one save function with a type sieve to determine how to identify the object locally, multiple save functions with a type sieve to call the correct one, or a single save function that calls an id function with argument overloading

Contact and feedback

The purpose of this project is to provide positive guidance and a way forward for those considering building an ActivityPub implementation that allows for consistent progress while emphasizing freedom and expressiveness. If you'd like to ask questions or have a more detailed discussion, please look for discussions with the #swaps or #swaps000003 tag on the fediverse, or start your own discussion and tag @yaaps@bananadog