Serialisation Format

The GVariant serialisation format is a method for converting between GVariant instances and pairs of type strings and byte sequences.

There are separate little and big-endian serialisation formats. The formats are exactly the same except for the byte order of integers and floating point numbers appearing within them.

The serialisation operation takes a GVariant instance and produces a byte sequence and a type string.

The deserialisation operation requires a byte sequence and corresponding type string, from which it produces a GVariant instance.

A byte sequence encoding for a GVariant is only meaningful if the byte sequence has a known length. A pointer alone, for example, is insufficient. Given a pointer to its first byte, it is impossible to determine the length of a GVariant encoding (and, in fact, there may be multiple valid lengths that each result in different values). If being able to determine the end point of a byte sequence is desired for a medium where you only know the start (consider a network socket) then merely send the size before sending the serialisation.

The type string of the data in question must also be known. A byte sequence can have two different meanings if deserialised with two different type strings.

If a serialisation format is desired where type strings are not required for deserialisation, a good option is to first box the value into a variant and then serialise that. Deserialisation is then always performed with the type string "v" and the value is unboxed from the variant.

Properties

The serialisation format was designed to have the following properties:

  • compactness
  • direct-accessibility by the user
  • uniqueness
  • O(1) extraction of children from containers

Compactness

Where possible, the serialisation format uses as few bytes as possible.

Direct-Accessibility

Where reasonable, the user is able to gain direct access (via a pointer) to subsequences of the serialised format that are easy to handle in C. This includes being able to obtain a pointer to the contents of a string value or an array of integers.

There are separate little and big endian formats so that the values peeked by the programmer are in native endianness.

Native alignment on all sane platforms is met (or exceeded) for all multi-byte data types.

Uniqueness

As with all non-lossy serialisation relations, serialisation is injective and total. There is an additional property of determinism, though -- there is only one valid serialisation for each value. Another way to state this is that deserialisation is injective (in addition to the normal determinism and surjectivity).

This allows very efficient hashing and comparison for equality of serialised values of a given endianness -- merely hash/compare the byte sequence.

One important rule to follow to ensure this property is that all padding for alignment must be done with zeros.

O(1) Extraction of Children

This means two things.

Firstly, the serialised form of a value is always available as a subsequence of the serialised byte sequence of any container that contains it. This means that 'copying' the value out of the container actually only involves making a pointer to that subsequence.

Secondly, the serialised form contains offsets where required to enable O(1) lookup of children in arrays and structures.

Alignment

All types have an alignment. The alignment must be one of 1, 2, 4 or 8.

Alignment is a property of a type. All values of a given type will have the same alignment.

Each container type has an alignment greater than or equal to the largest alignment among its possible child types. This prevents the amount of padding contained within a container type from changing depending on the offset at which the container started. This is required for the guarantee that a given value will only ever have one serialised form.

Fixed Size

Some types have a fixed size. This means that all possible values of this type have the same size.

Fixed-size is a property of a type.

If a type has a fixed size, the fixed size must be an integer multiple of the alignment of the type. The serialisation of a value of a non-fixed type need not have a length that is a multiple of the alignment of that type.

Fixed Numeric Types

Fixed numeric types include the following:

  • booleans
  • bytes
  • 16, 32, 64 bit signed and unsigned integers
  • floating point numbers

All of these types have a fixed-size.

The fixed size of each of these types is the number of bytes required to represent it.

The alignment of each of these types is exactly equal to the fixed size.

The serialised format for a fixed numeric type is simply the encoding of that number in the appropriate endianness. For example, a 32 bit integer 42 would be represented by the following little-endian serialisation:

   2a 00 00 00

Booleans are a single byte restricted to only contain the numbers 0 and 1. If another value is found in a this byte during deserialisation then the deserialisation fails.

String Types

String types include strings, object paths and type strings.

All string types have an alignment of one and do not have a fixed size.

The serialised format for a string type is the string in utf-8 format followed by a zero byte. It is an error for a zero byte to appear anywhere else in the string.

For example, "foo" would be represented by the following serialisation (in both endians):

  'f 'o 'o \0

Object paths and type strings must be valid. If an invalid object path or type string is detected during deserialisation then the deserialisation fails.

The Arbitrary Precision Integer Type

(not yet implemented or decided)

  • Probably something like an array of uint32, uint64 or even bytes.
  • Turns out that if we use machine endianness for the order of the array then uint32 or uint64 would result in the exact same encoding except
    • different alignment requirement
    • different granularity (ie: the number "1" would occupy 4 vs. 8 bytes)
  • Maybe we should look at how gmp does this since users of these types might well want to use them with gmp.
    • gmp is probably different on each platform, though :(

Structure Types

There are two different classes of structure types.

Structure types that contain only fixed-size item types are, themselves, fixed-size.

Structure types that contain one or more non-fixed-size item type are non-fixed-size.

Fixed-size Structure Types

The alignment of fixed-size structure types is equal to the largest among the alignments of their child types.

Fixed-size structures are serialised in the most efficient way possible while maintaining the order of the items, the alignment of each item and the requirement that all fixed-size types have a fixed size that is a multiple of their alignment. Padding must be done with zero bytes.

As an example, the serialisation of a value of the type "(x(in)yq)" would be as follows:

  xx xx xx xx-xx xx xx xx -(ii ii ii ii-nn nn 00 00)
  yy 00 qq qq-00 00 00 00

Where the letters {xx,ii,nn,yy,qq} correspond to bytes used by the representation for the corresponding integer type and the '00' corresponds to padding bytes. The brackets indicate the subsequence of the serialisation that corresponds to the nested structure.

The type "(x(in)yq)" has fixed size 24 and alignment 8.

As another example, a value of type "(ny)" would serialise as follows:

  nn nn yy 00

and a value of type "(yyy)" as follows:

  yy yy yy

"(ny)" has alignment 2 and fixed-size 4. "(yyy)" has alignment 1 and fixed size 3.

Non-fixed-size Structure Types

Since non-fixed structures contain child items of unknown length and since a serialisation byte sequence is useless without length information this information must be stored in the serialised form of the structure.

For a structure containing n items, there will be n - 1 32bit offsets stored in the header of the structure serialisation. The ith offset points to the end of the ith item in the structure (the index of the byte after the last byte, relative to the beginning of the structure). The end of the last item in the structure is implicitly known because it is also the end of the structure.

The starting offset of any item in the structure can be quickly determined by taking the end offset of the previous item and rounding up to the appropriate alignment.

As such, the alignment of non-fixed-size structure and dictionary entry types is equal to the greater of the largest among the alignments of their child types, or 4 (since the offsets need to be aligned).

As an example, take a structure of type "(xsni)". If the string is "string" then the serialised form of this structure would be:

  18 00 00 00 1f 00 00 00 - 22 00 00 00 -- -- -- --
  xx xx xx xx xx xx xx xx - 's 't 'r 'i 'n 'g \0 --
  nn nn -- -- ii ii ii ii

where "00" is a zero that is part of the encoding of an item and "--" is a zero for padding.

The alignment of the type "(xsni)" is 8.

Another example: a structure of type "(sy)" containing the string "foo":

  08 00 00 00 'f 'o 'o \0 - yy

The alignment of the type "(sy)" is 4 even though both children have an alignment of 1.

Note that non-fixed-size structures don't have to have sizes that are multiples of their alignment. Consider packing the above structure into another structure of type "((sy)y)". The serialisation would be:

  0d 00 00 00(08 00 00 00 - 'f 'o 'o \0 yy)yy

Note that even though the internal struct has an alignment of 4 and a size that is not a multiple of 4, the final byte in the outside struct can immediately follow it (aligned only to byte alignment).

Possible Optimisation for Structure Encoding

There are a couple of nice optimisations possible there that are not yet implemented (and may never be):

  • Have a deterministic algorithm for mapping between "user order" of a structure and "wire order" of a structure such that the least amount of space is wasted on padding
  • For non-fixed-size structures, place all of the fixed-size items at the front and then only give "end" offsets for the non-fixed ones. For structures that contain only a single non-fixed-size item this means that no offsets would be required (and alignment requirements may even be reduced).

The above optimisations are probably too much effort for too little gain.

Dictionary Entry Types

Dictionary entries are treated exactly the same as structures with two items (key before value).

Array Types

Similar to structure types, array types come in two forms -- fixed and unfixed.

An array type that has an element type with a fixed size is called a "fixed array". An array type that has an element type without a fixed size is called a "non-fixed array".

No array type itself has a fixed size, but the encoding of array values varies depending on if the element type is fixed-size or not.

Fixed Array Types

Fixed arrays are handled in the obvious low-overhead way -- the fixed-sized items are simply stored in an array. Because of the requirement that the size of a fixed-sized element is always a multiple of its alignment, fixed arrays never contain padding bytes of their own.

The length of a fixed array is obvious -- simply take the length of the encoding in bytes and divide by the fixed element size.

As a (very simple example) take an array of int16s (type "an") that contains the integers 1 2 and 3. The serialisation would be:

  01 00 02 00 03 00

with an alignment of 2.

Non-fixed Array Types

Similar to non-fixed-size arrays, these include an offset table at the start. As such, the alignment of a non-fixed array type is equal to the greater of the alignment of its child type, or 4.

For an array with n elements, there will be n offsets. The first offset (offset #0) is used to record the ending point of the offset table and therefore, when divided by 4, serves a double purpose as a counter of the number of elements in the array. The following offsets give the end of the elements just as for structures. Offset i (for i > 0) gives the end of element i - 1. The end of the last element in the array is implicitly known because it is also the end of the array.

The starting offset of any element in the array can be quickly determined by taking the end offset of the previous element and rounding up to the appropriate alignment.

As an example, consider an array of strings (type "as") that contains the strings "foo" "bar" and "baz". The serialisation would be:

  0c 00 00 00 10 00 00 00 - 14 00 00 00 'f 'o 'o \0
  'b 'a 'r \0 'b 'a 'z \0

with an alignment of 4.

Consider an array of pairs of booleans and strings (type "a(bs)") where there are 2 elements which are both equal to (true, ""). The little-endian serialisation of each of the structures would be:

  08 00 00 00 01 00 00 00 - \0

with alignment 4. Note that there are no padding characters.

The little-endian serialisation of the array would be

  08 00 00 00 11 00 00 00 -(08 00 00 00 01 00 00 00
  \0)-- -- --(08 00 00 00 - 01 00 00 00 \0)

with alignment 4.

The Variant Type

Since the variant type may contain any possible type of value, it has an alignment of 8. It does not have a fixed size.

The serialised form of a variant consists of a single byte to indicate the length of the type string of the value contained in the variant, followed by the type string, followed by the aligned serialised form of that value. The length of the serial form of the value is known because its end is the same as the variant's end.

Consider the example of a variant containing the string "foo". The serialised format is:

  01 's 'f 'o 'o \0

Consider the example of a variant containing the array of int16s "1, 2, 3":

  02 'a 'n -- 01 00 02 00 - 03 00

Both of these examples have an alignment of 8.

Tagged Union Types

(not actually implemented or decided yet)

Ideas:

  • For a tagged union that contains any non-fixed-size element, the most reasonable thing to do is use the alignment of the largest possible child as the alignment for the union and then simply follow the serialisation of the child with a single tag byte.
  • Possibly, make all tagged unions non-fixed size and simply have the data, followed by a single 'tag' byte at the end.
    • This slightly breaks the idea of fixed size, though, since all possible values of the union type "[bi]" would, in fact, have a size of 5.
  • Tagged unions of the form [xiy] where each item has a different fixed size could be decided based solely upon their size (no need for a tag byte).
  • Tagged unions of the form [bxiy] where each item has a fixed size could be encoded fixed-size as the size of the largest one, plus one extra "alignment size" to store the tag. This might allow more efficient packing into arrays or structures (allowing them to remain fixed-size).

Maybe Types

Maybe types have the alignment of their child type and are not fixed size.

Nothing is serialised as a zero-length sequence.

Just x is serialised as the serialised form of x followed by a zero byte.

As an example, consider the ridicuous type of "mmmn".

Nothing serialises as the zero-length byte sequence

Just Nothing serialises as

  00

Just Just Nothing serialises as

  00 00

Just Just Just 257 serialises as

  01 01 00 00 00

In all cases, the alignment is 2.

Projects/GLib/GVariant/Serialisation (last edited 2013-12-03 17:45:45 by WilliamJonMcCann)