DValue Serialisation Format

The DValue serialisation format is a method for converting between DValue instances and pairs of type strings and byte sequences.

There are separate little and big-endian serialisation formats. The formats are exactly the same except for the byte order of integers and floating point numbers appearing within them.

The serialisation operation takes a DValue instance and produces a byte sequence and a type string.

The deserialisation operation requires a byte sequence and corresponding type string, from which it produces a DValue instance.

A byte sequence encoding for a DValue is only meaningful if the byte sequence has a known length. A pointer alone, for example, is insufficient. Given a pointer to its first byte, it is impossible to determine the length of a DValue encoding (and, in fact, there may be multiple valid lengths that each result in different values). If being able to determine the end point of a byte sequence is desired for a medium where you only know the start (consider a network socket) then merely send the size before sending the serialisation.

The type string of the data in question must also be known. A byte sequence can have two different meanings if deserialised with two different type strings.

If a serialisation format is desired where type strings are not required for deserialisation, a good option is to first box the value into a variant and then serialise that. Deserialisation is then always performed with the type string "v" and the value is unboxed from the variant.

Properties

The serialisation format was designed to have the following properties:

  • compactness
  • direct-accessibility by the user
  • uniqueness
  • O(1) extraction of children from containers

Compactness

Where possible, the serialisation format uses as few bytes as possible.

Direct-Accessibility

Where reasonable, the user is able to gain direct access (via a pointer) to subsequences of the serialised format that are easy to handle in C. This includes being able to obtain a pointer to the contents of a string value or an array of integers.

There are separate little and big endian formats so that the values peeked by the programmer are in native endianness.

Native alignment on all sane platforms is met (or exceeded) for all multi-byte data types. Doubles and 64bit ints are both 8-aligned, for example.

Uniqueness

As with all non-lossy serialisation relations, serialisation is injective and total. There is an additional property of determinism, though -- there is only one valid serialisation for each value. Another way to state this is that deserialisation is injective (in addition to the normal determinism and surjectivity).

This allows very efficient hashing and comparison for equality of serialised values of a given endianness -- merely hash/compare the byte sequence.

One important rule to follow to ensure this property is that all padding for alignment must be done with zeros.

O(1) Extraction of Children

This means two things.

Firstly, the serialised form of a value is always available as a subsequence of the serialised byte sequence of any container that contains it. This means that 'copying' the value out of the container actually only involves making a pointer to that subsequence.

Secondly, the serialised form contains offsets where required to enable O(1) lookup of children in arrays and structures.

Alignment

All types have an alignment. The alignment must be one of 1, 2, 4 or 8.

Alignment is a property of a type. All values of a given type will have the same alignment.

Each container type has an alignment equal to the largest alignment among its possible child types. This prevents the amount of padding contained within a container type from changing depending on the offset at which the container started. This is required for the guarantee that a given value will only ever have one serialised form and is important so that the offset at which a serialised value appears does not affect the interpretation of the value (so long as the alignment requirement is met or exceeded).

Fixed Size

Some types have a fixed size. This means that all possible values of this type have the same size.

Fixed-size is a property of a type.

In general, the obviously fixed-sized types have fixed size (ints, floats, bools). Additionally, any structure or dictionary entry consisting entirely of fixed-size types is also fixed type. This includes the possibility of nested structures.

Offsets

Containers containing non-fixed-size children make use of offset values to specify the size of those children. An offset always points to the end of a child (ie: the byte after the last byte).

The size of the offset is dynamically chosen depending (only) on the total size of the container being considered.

Offsets are not aligned. In order to align the offset, the entire container would itself need to be aligned to at least the alignment of the offset. Since the size of the offset isn't known from just the signature of the container, this would violate the rule that alignment can be determined from type. If the implementing architecture doesn't implement unaligned accesses then any unaligned accesses must be performed manually.

There is a small paradox in that the offset size depends on the total size of the container including the sizes of offsets themselves. This results in situations in which there is more than one workable choice for offset size. The smallest possible size is chosen. When serialising, the choice of offset size has to be made with this consideration. When deserialising, the total size of the container is already known and so the offset size is obvious.

The offset size is chosen so that an offset may uniquely identify any prefix of the contents of the container. For example, a container with 3 bytes ('123') has 4 possible prefixes: , '1', '12', '123'.

Refer to the following table:

Total Container Size

Offset Size

From

To

0

0

0

1

255

1

256

65535

2

65536

4294967295

4

4294967296

18446744073709551615

8

18446744073709551616

340282366920938463463374607431768211455

16

etc...

Note that zero bits of information are required to choose a prefix of -- there is only one choice. If we suppose that we use an offset size of zero for a structure with zero bytes of content then regardless of the number of offsets added the total size of the structure is still zero (thus supporting our choice of zero for offset size).

Note also that there is no theoretical upper limit to the size of a DValue and therefore no upper limit to the offset size. Of course, on most systems offsets will be limited to 4 or 8 bytes in size.

Fixed Numeric Types

Fixed numeric types include the following:

  • booleans
  • bytes
  • 16, 32, 64 bit signed and unsigned integers
  • floating point numbers

All of these types have a fixed-size.

The fixed size of each of these types is the number of bytes required to represent it.

The alignment of each of these types is exactly equal to the fixed size.

The serialised format for a fixed numeric type is simply the encoding of that number in the appropriate endianness. For example, a 32 bit integer 42 would be represented by the following little-endian serialisation:

   2a 00 00 00

Booleans are a single byte restricted to only contain the numbers 0 and 1.

String Types

String types include strings, object paths and type strings.

All string types have an alignment of one and do not have a fixed size.

The serialised format for a string type is the string in utf-8 format followed by a zero byte. It is an error for a zero byte to appear anywhere else in the string.

For example, "foo" would be represented by the following serialisation (in both endians):

  'f 'o 'o \0

Object paths and type strings must be valid.

The Arbitrary Precision Integer Type

(not yet implemented or decided)

  • Probably something like an array of uint32, uint64 or even bytes.
  • Turns out that if we use machine endianness for the order of the array then uint32 or uint64 would result in the exact same encoding except
    • different alignment requirement
    • different granularity (ie: the number "1" would occupy 4 vs. 8 bytes)
  • Maybe we should look at how gmp does this since users of these types might well want to use them with gmp.
    • gmp is probably different on each platform, though :(

Structure Types

The alignment of structure types is equal to the largest among the alignments of their child types.

Structures are serialised in the most efficient way possible while maintaining the order of the items, the alignment of each item and the requirement that lookups can be performed in O(1) time.

All non-fixed-size items (except for the last) have an offset specifying their end. For example, a structure of type "(ys)" will have zero offsets, a structure of type "(sy)" will have one offset, and a structure of type "(siss)" will have two. If the final item in the structure is variable-width then its end offset need not be recorded because it will be exactly equal to the end of the structure content.

Any required indexes are stored, in reverse order, at the end of the structure.

If a structure contains only fixed-sze items then the structure itself is said to be fixed-size. In this case, the size of the structure is rounded up to the alignment of the structure. This allows structures to be packed together in an array without additional padding and reflects the way that C structures work.

As an example, the serialisation of a value of the type "(x(in)yq)" would be as follows:

  xx xx xx xx-xx xx xx xx -(ii ii ii ii-nn nn 00 00)
  yy 00 qq qq-00 00 00 00

Where the letters {xx,ii,nn,yy,qq} correspond to bytes used by the representation for the corresponding integer type and the '00' corresponds to padding bytes. The brackets indicate the subsequence of the serialisation that corresponds to the nested structure.

The type "(x(in)yq)" has fixed size 24 and alignment 8.

As another example, a value of type "(ny)" would serialise as follows:

  nn nn yy 00

and a value of type "(yyy)" as follows:

  yy yy yy

"(ny)" has alignment 2 and fixed-size 4. "(yyy)" has alignment 1 and fixed size 3.

As an example of a structure with an offset, take the type "(xsni)". If the string is "string" then the serialised form of this structure would be:

  xx xx xx xx xx xx xx xx - 's 't 'r 'i 'n 'g \0 00
  nn nn 00 00 ii ii ii ii - 0f

The 0x0f refers to the fact that 0x0f is the byte after the last byte of the string.

The alignment of the type "(xsni)" is 8.

Another example: a structure of type "(ys)" containing the string "foo":

  yy 'f 'o 'o \0

Since the string is the last item in the structure its length is obvious since we know the length of the structure; no index is required.

The alignment of the type "(ys)" is 1.

Finally, consider the example "(siss)" where the strings are, respectively, "x", "y" and "z".

  'x \0 .. .. ii ii ii ii - 'y \0 'z \0 0a 02

The indexes are 0x02 and 0x0a. Note that the index points to the end of the string (and not the start of the integer following it). Note also he reversal of the indexes and that the final string does not have an index. Its ending is determined by subtracting the number of indexes (known from the signature to be 2) from the end of the string.

Dictionary Entry Types

Dictionary entries are treated exactly the same as structures with two items (key before value).

Array Types

Array types come in two forms -- fixed and unfixed.

An array type that has an element type with a fixed size greater than zero is called a "fixed array". An array type that has an element type without a fixed size (or an element type with a fixed size of zero) is called a "non-fixed array".

No array type itself has a fixed size, but the encoding of array values varies depending on if the element type is fixed-size or not.

In both cases, an array containing zero items will have a serialised size of zero.

Fixed Array Types

Fixed arrays are handled in the obvious low-overhead way -- the fixed-sized items are simply stored in an array. Since fixed-size items are always a multiple of their alignment size and the array is of the same starting alignment as the element type, no additional padding is required.

The length of a fixed array is obvious -- it's simply the fixed element size multiplied by the number of items in the array.

As a (very simple example) take an array of int16s (type "an") that contains the integers 1 2 and 3. The serialisation would be:

  01 00 02 00 03 00

with an alignment of 2.

For "a(ny)", containing the integers 1, 2, 3 and the characters 'a' 'b' and 'c':

  (01 00 'a --)(02 00 'b --)(03 00 'c --)

where '--' are '\0' padding bytes contained within the structures.

Non-fixed Array Types

Similar to structures, these include a set of indexes at the end.

For an array with n elements, there will be n indexes. The first offset (located at the very end of the serialised array) is used to record the ending point of the last item in the array. Because of this, we can determine the size of the offsets in the array (ie: the difference between the end of the array and the given offset). From this, and the fact that we can determine the offset size, we can determine the number of offsets and therefore the number of items in the array.

The offsets all correspond to the ending of the items in the array -- in order. The starting of an item in the array is determined by taking the ending offset of the previous element and rounding up for alignment. In the case of the first item, the starting offset is simply 0.

As an example, consider an array of strings (type "as") that contains the strings "foo" "bar" and "baz". The serialisation would be:

  'f 'o 'o \0 'b 'a 'r \0 - 'b 'a 'z \0 04 08 0c

with an alignment of 1.

Consider an array of pairs of booleans and strings (type "a(bs)") where there are 2 elements which are both equal to (true, ""). The little-endian serialisation of each of the structures would be:

  01 \0

with alignment 1. Note that because the only variable-width item in the structure is at the end, no offsets are written.

The little-endian serialisation of the array would be

  01 \0 01 \0 02 04

with alignment 1.

The Variant Type

Since the variant type may contain any possible type of value, it has an alignment of 8. It does not have a fixed size.

The serialised form of a variant consists of the body, followed by a '\0' byte, followed by the signature. Since the variant is 8-aligned, the body will always be sufficiently aligned.

Consider the example of a variant containing the string "foo". The serialised format is:

  'f 'o 'o \0 \0 's

Consider the example of a variant containing the array of int16s "1, 2, 3":

  01 00 02 00 03 00 \0 'a 'n

Both of these examples have an alignment of 8.

Tagged Union Types

(not actually implemented or decided yet)

Ideas:

  • For a tagged union that contains any non-fixed-size element, the most reasonable thing to do is use the alignment of the largest possible child as the alignment for the union and then simply follow the serialisation of the child with a single tag byte.
  • Possibly, make all tagged unions non-fixed size and simply have the data, followed by a single 'tag' byte at the end.
    • This slightly breaks the idea of fixed size, though, since all possible values of the union type "[bi]" would, in fact, have a size of 5.
  • Tagged unions of the form [xiy] where each item has a different fixed size could be decided based solely upon their size (no need for a tag byte).
  • Tagged unions of the form [bxiy] where each item has a fixed size could be encoded fixed-size as the size of the largest one, plus one extra "alignment size" to store the tag. This might allow more efficient packing into arrays or structures (allowing them to remain fixed-size).

Maybe Types

Maybe types have the alignment of their child type and are not fixed size.

Nothing is serialised as a zero-length sequence.

If the child type has a fixed size larger than zero then Just x is serialised exactly the same as x.

If the child type is variable-width or has a fixed size of zero, then ust x is serialised as the serialised form of x followed by a zero byte.

As an example, consider the ridicuous type of "mmmn".

Nothing serialises as the zero-length byte sequence

Just Nothing serialises as

  00

Just Just Nothing serialises as

  00 00

Just Just Just 257 serialises as

  01 01 00 00 00

But, for the type "mn", Just 257 serialises as

 01 01

In all cases, the alignment is 2.

Projects/GLib/GVariant/Serialisation2 (last edited 2013-12-03 17:44:06 by WilliamJonMcCann)