Log API Revamp (OBSOLETE)

Zeitgeist 0.3.0 had a major API overhaul, obsoleting this blueprint. The concept of private logs is still not addressed, though.

Contents

Log API Revamp (OBSOLETE)
Zeitgeist Internals and Implementation Details
1. Rename Content and Source
2. Log Database

Launchpad blueprint: Not registered

This document proposes a new DBus API for the Zeitgeist engine which aims to accomplish a few things that the API used hitherto does not allow for:

G1. Have more than one conceptual log and allow applications to create private logs that don't interfere with normal user event logging
G2. Make it possible to seamlessly use items from an external repository such as Tracker or CouchDB

Important Concepts

Log - All logged events go to some specific Log object. Zeitgeist has some default Log objects, but applications can register custom logs for application-specific logging needs.
Repository - Zeitgeist doesn't store metadata about actual objects; this is the job of a Repository. Zeitgeist has an internal default repository, but can also use external repositories such as Tracker or DesktopCouch.

Why Private Logs

There are a few reasons for allowing dynamic creation and deletion of new Log objects:

1. Some applications may be sending a lot of events. A frequently updated DB will have worse performance. This is mitigated by keeping those events in another SQLite instance.
2. Some apps will require specific client-side reasoning to understand and use the data they log (e.g. social grouping, permissions, and relations are stored outside the DB)
3. Fast and efficient deletion of entire datasets. Say an app wants to suddenly delete all data it ever logged. If it has logged 50,000 events this will be a costly operation if it is interspersed with 2M other log entries. If it has a private log all Zeitgeist needs to do is to close the SQLite connection and delete the DB file.
4. Less code duplication. Many apps may want some kind of semantic logging that may not fit well into the user-centric approach Zeitgeist has. If these apps are to invent their own APIs and protocols then it will lead to a lot of code duplication, both for clients and servers.
5. Opening up disjoint private data silos. Evolution, Liferea, and Firefox are just three prominent examples of apps that have changed to use private SQLite instances for data storage. These all have the public label "don't use these DBs, the schemas are unstable and not part of the public API". The way out of the "disjoint data silo" problem is standardized DBus APIs. Coming up with a powerful and nice API is not an easy task. We've learned that the hard way! Zeitgeist alone is not the solution to this problem in itself, but it is a part of the solution.
6. Resource usage
- Memory. If each app needs its own cache we have memory overhead.
- IO optimization. If one app hogs all IO bandwidth it will do so no matter if it is logging events into Zeitgeist or its own DB structure. The difference is that Zeitgeist has as its main goal to be efficient for this. Custom app code is generally not as optimized.
- Battery life. Centralized scheduling of wakeups instead of each app managing IO itself.

Warning on JOINs

Splitting the Repository from the Log will change the way the Zeitgeist API is used. Queries like:

Give me the most recent documents tagged "foo"

are not possible directly against the Zeitgeist engine. One would first need to resolve the URI of the Tag with label "foo" against the Repository, and then ask Zeitgeist:

Give me the most recent docs with tag tag://mjfd673bfw82bf3

DBus Addresses

The Zeitgeist LogManager object runs as:

      Address : org.gnome.zeitgeist.LogManager
  Object path : /org/gnome/zeitgeist/log
    Interface : None

Logs controlled by the log manager run as:

      Address : org.gnome.zeitgeist.LogManager
  Object path : /org/gnome/zeitgeist/log/[a-z_]+   (the 'activity' log is there by default)
    Interface : org.gnome.zeitgeist.Log

The Zeitgeist Repository object runs as:

      Address : org.gnome.zeitgeist.Repository
  Object path : /org/gnome/zeitgeist/repository
    Interface : org.gnome.zeitgeist.Repository

DBus Protocol

DBus Wire Representations of Events

We use the signature T as a shorthand for event templates

T   =   (asaas)

We use the signature E as a shorthand for full event metadata

E   =   (asaasay)

The first array of strings (as) contains the event data fields enumerated below. The following array of arrays of strings (aas) contains one array of metadata for each subject; consult the subject offsets below for the details. The trailing array of bytes (ay), which is not present in the event template, is the event payload. The event payload is a free-form binary blob completely controlled by the logging application (i.e. the client).

Event data member offsets:

seqnum              = 0
timestamp           = 1
interpretation      = 2
manifestation       = 3
app                 = 4
origin              = 5

Subject metadata member offsets:

subj_uri            = 0            
subj_interpretation = 1
subj_manifestation  = 2
subj_mimetype       = 3
subj_origin         = 4
subj_text           = 5
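
As a rough illustration of the offsets above, the following Python sketch models an event E = (asaasay) as a plain tuple and derives a template T = (asaas) from it. The helper names make_event and to_template, and all example values, are hypothetical and not part of the proposed API; real clients would marshal these structures through a DBus binding.

```python
# Offsets into the event data string array, as listed above.
SEQNUM, TIMESTAMP, INTERPRETATION, MANIFESTATION, APP, ORIGIN = range(6)

# Offsets into each subject's metadata string array, as listed above.
(SUBJ_URI, SUBJ_INTERPRETATION, SUBJ_MANIFESTATION,
 SUBJ_MIMETYPE, SUBJ_ORIGIN, SUBJ_TEXT) = range(6)


def make_event(event_data, subjects, payload=b""):
    """Build a full event E = (asaasay) as a plain Python tuple (hypothetical helper)."""
    if len(event_data) != 6:
        raise ValueError("event data needs exactly 6 fields")
    for subject in subjects:
        if len(subject) != 6:
            raise ValueError("each subject needs exactly 6 fields")
    return (list(event_data), [list(s) for s in subjects], bytes(payload))


def to_template(event):
    """Drop the payload to obtain an event template T = (asaas)."""
    return (event[0], event[1])


# Made-up example values; seqnum and timestamp are left empty.
event = make_event(
    ["", "", "opened", "user action", "app://gedit", ""],
    [["file:///tmp/notes.txt", "document", "file",
      "text/plain", "file:///tmp", "notes.txt"]],
)
```

Fields are then read by offset, for example event[0][INTERPRETATION] for the event interpretation or event[1][0][SUBJ_URI] for the URI of the first subject.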

org.gnome.zeitgeist.LogManager

No interfaces besides the standard DBus Introspectable interface. DBus introspection is used to enumerate all Log objects. Log objects live under the DBus object path:

/org/gnome/zeitgeist/log/[a-z_]+

Eg. the default activity log is at

/org/gnome/zeitgeist/log/activity

Sending DBus messages to another path matching the pattern above will create the corresponding log dynamically. I.e. to create a new log called mylog, a client simply starts sending messages to the org.gnome.zeitgeist.Log interface on:

/org/gnome/zeitgeist/log/mylog
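
Since log names must match the pattern [a-z_]+, a client may want to validate a name before using it. The following Python sketch shows one way to do that; the function name log_object_path is an assumption made for illustration only.

```python
import re

LOG_PATH_PREFIX = "/org/gnome/zeitgeist/log/"

# Log names are restricted to lowercase ASCII letters and underscores.
_LOG_NAME = re.compile(r"[a-z_]+")


def log_object_path(name):
    """Return the DBus object path for the log called `name` (hypothetical helper)."""
    if not _LOG_NAME.fullmatch(name):
        raise ValueError("invalid log name: %r" % (name,))
    return LOG_PATH_PREFIX + name
```

With this helper, log_object_path("activity") yields the path of the default log, while a name such as "My-Log" is rejected before any message is sent.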

org.gnome.zeitgeist.Log

Represents a Log for some specific collection of data. Most events should go to the Log object at /org/gnome/zeitgeist/log/activity unless the client has a specific need for private logging.

GetEvents (in au event_seqnums, out aE events)
- Retrieve the events identified by event_seqnums, in the same order as the given sequence numbers
InsertEvents (in aE events, out as event_ids)
- Insert a collection of events and return their assigned sequence numbers. The timestamp field may optionally be left as an empty string, in which case Zeitgeist will assign an event timestamp.
DeleteEvents (in aT event_templates)
- Delete all events in the log matching any of event_templates. Event fields containing the empty string are treated as wildcards.
FindEventIds (in (ii) time_range,
                in aT event_templates,
                in u storage_state,
                in u num_events,
                in u order,
                out (au) event_ids)
- Find all events matching any of event_templates where the empty string in any field denotes a wildcard. An array of event ids will be returned. The actual event metadata can be obtained by calling GetEvents(event_ids).
  If storage_state is 1 then Zeitgeist will attempt to only return events for which the user has access to the subject. Storage state 0 indicates an unavailable storage medium.
  The num_events parameter specifies how many events to return. It also specifies the page size of the result set if a result set is requested.
  The order parameter specifies how the results are to be sorted:
  - 1 : By event timestamp
  - 2 : By most recently used subject
  - 3 : By most popular subject
DeleteLog ()
- Delete the entire log and all its content.
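
The wildcard semantics used by DeleteEvents and FindEventIds can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the proposal does not say how multiple subjects in a template are matched, so the sketch assumes every subject in the template must match at least one subject of the event.

```python
def fields_match(template_fields, actual_fields):
    """An empty string in the template is a wildcard; other fields must be equal."""
    return all(t == "" or t == a for t, a in zip(template_fields, actual_fields))


def matches_template(event, template):
    """Check one event (E or T shaped tuple) against one template T = (asaas)."""
    template_data, template_subjects = template
    event_data, event_subjects = event[0], event[1]
    if not fields_match(template_data, event_data):
        return False
    # Assumption: each subject template must match some subject of the event.
    return all(
        any(fields_match(ts, es) for es in event_subjects)
        for ts in template_subjects
    )


def matches_any(event, templates):
    """An event is selected if it matches any of the given templates."""
    return any(matches_template(event, t) for t in templates)
```

A template whose fields are all empty strings and whose subject list is empty therefore matches every event.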

org.gnome.zeitgeist.Repository

This is a very simple interface for accessing general item metadata. It is designed to be simple enough to be implemented on top of most modern desktop repository systems such as Nepomuk/Soprano, Tracker, CouchDB, etc. Zeitgeist ships with a simple default implementation backed by an SQLite database.

FIXME

Zeitgeist Internals and Implementation Details

This section contains a rough draft on how the above API could be implemented.

Rename Content and Source

The concepts of "Content" and "Source" types have evidently confused a lot of people. In this proposal they have been renamed to "Interpretation" and "Manifestation" respectively. So:

Interpretation (was Content): What kind of object is this from a conceptual point of view? What does the user interpret this object as? In event terminology, interpretations can be things such as "opened", "closed", "saved", etc. In Nepomuk terms the interpretation is nie:InformationElement.
Manifestation (was Source): How is this object stored? Is it a file, an email, an attachment to an email? In event terminology the manifestation is "how did this event happen?", which normally is either "user action" or "system notification". In Nepomuk terms the manifestation is nie:DataObject.

Log Database

There are a number of value tables that store (integer, string) pairs, with the integer part being the primary key. These are used to remove duplicate strings from the DB and shrink the DB file size to a minimum. The core event table links to these value tables via the integer primary key.

Value Tables

There are 6 value tables:

uri
interpretation
manifestation
mimetype
actor
text

Each value table is constructed like the uri table below:

CREATE TABLE IF NOT EXISTS uri (id INTEGER PRIMARY KEY, value VARCHAR UNIQUE);

CREATE UNIQUE INDEX IF NOT EXISTS uri_value ON uri(value);
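
To show how a value table removes duplicate strings, here is a minimal Python sketch using the standard sqlite3 module. The function names create_value_tables and intern_value are assumptions made for this example; the draft does not prescribe how ids are looked up or assigned.

```python
import sqlite3

VALUE_TABLES = ("uri", "interpretation", "manifestation",
                "mimetype", "actor", "text")


def create_value_tables(conn):
    """Create all value tables using the layout of the uri table above."""
    for table in VALUE_TABLES:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS %s "
            "(id INTEGER PRIMARY KEY, value VARCHAR UNIQUE)" % table
        )


def intern_value(conn, table, value):
    """Return the integer id for `value`, inserting it only if it is new."""
    # Table names cannot be bound as SQL parameters, so restrict them
    # to the known value tables before building the statement.
    if table not in VALUE_TABLES:
        raise ValueError("unknown value table: %r" % (table,))
    conn.execute("INSERT OR IGNORE INTO %s (value) VALUES (?)" % table, (value,))
    row = conn.execute(
        "SELECT id FROM %s WHERE value = ?" % table, (value,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
create_value_tables(conn)
first = intern_value(conn, "uri", "file:///tmp/notes.txt")
second = intern_value(conn, "uri", "file:///tmp/notes.txt")
```

Both calls return the same id, so the string is stored only once no matter how many events refer to it.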

Payload Table

Any event can have a payload assigned, which is simply a binary blob that can contain any kind of data. The payloads are stored in a table similar to a value table, but with a BLOB instead of a VARCHAR:

CREATE TABLE IF NOT EXISTS payload (id INTEGER PRIMARY KEY, value BLOB);

Storage/Availability Table

The storage table contains an entry for each storage medium or resource the user has data on. Storage media can be UUIDs of hard drives or USB sticks, or well known names such as "online" for data that requires online access. If the user is not online then the row with the "online" storage medium is marked as not available. The idea with this table is for the log to be able to return only events about items that are currently available.

CREATE TABLE IF NOT EXISTS storage
                        (id INTEGER PRIMARY KEY,
                         value VARCHAR UNIQUE,
                         state INTEGER);

CREATE UNIQUE INDEX IF NOT EXISTS storage_value ON storage(value);

Storage states:

  0 : Not available
  1 : Available

Event Table

The event table does not contain any data itself, only relational ids pointing to values in the value tables.

Notice that event.id is not the primary key. This is because we can have several subjects per event, and each subject is stored in its own row.

CREATE TABLE IF NOT EXISTS event
                    (id INTEGER,                  -- uri.id
                     timestamp INTEGER,           -- timestamp in system millis
                     interpretation INTEGER,      -- interpretation.id
                     manifestation INTEGER,       -- manifestation.id
                     actor INTEGER,               -- actor.id
                     origin INTEGER,              -- uri.id
                     payload INTEGER,             -- payload.id
                     subj_id INTEGER,             -- uri.id
                     subj_interpretation INTEGER, -- interpretation.id
                     subj_manifestation INTEGER,  -- manifestation.id
                     subj_mimetype INTEGER,       -- mimetype.id
                     subj_origin INTEGER,         -- uri.id
                     subj_text INTEGER,           -- text.id
                     subj_storage INTEGER         -- storage.id
                     );
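
The following Python sketch shows how the event, uri, and storage tables might be joined to return only events whose subjects are on available storage. It is an illustration only: the function name available_subject_uris, the sample rows, and the sort order are assumptions, and only the columns needed for the query are filled in.

```python
import sqlite3

# Reduced copy of the schema above: only the tables this query touches.
SCHEMA = """
CREATE TABLE uri (id INTEGER PRIMARY KEY, value VARCHAR UNIQUE);
CREATE TABLE storage (id INTEGER PRIMARY KEY, value VARCHAR UNIQUE, state INTEGER);
CREATE TABLE event (
    id INTEGER, timestamp INTEGER, interpretation INTEGER, manifestation INTEGER,
    actor INTEGER, origin INTEGER, payload INTEGER,
    subj_id INTEGER, subj_interpretation INTEGER, subj_manifestation INTEGER,
    subj_mimetype INTEGER, subj_origin INTEGER, subj_text INTEGER,
    subj_storage INTEGER
);
"""


def available_subject_uris(conn):
    """Return (event id, subject URI) rows on available storage, newest first."""
    return conn.execute(
        """
        SELECT event.id, uri.value
        FROM event
        JOIN uri ON uri.id = event.subj_id
        JOIN storage ON storage.id = event.subj_storage
        WHERE storage.state = 1          -- 1 : Available
        ORDER BY event.timestamp DESC, event.id DESC
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample data: one local file and one item that needs network access.
conn.executemany("INSERT INTO uri (id, value) VALUES (?, ?)",
                 [(1, "file:///home/user/a.txt"), (2, "http://example.org/b")])
conn.executemany("INSERT INTO storage (id, value, state) VALUES (?, ?, ?)",
                 [(1, "local-disk-uuid", 1), (2, "online", 0)])
conn.executemany(
    "INSERT INTO event (id, timestamp, subj_id, subj_storage) VALUES (?, ?, ?, ?)",
    [(1, 1000, 1, 1), (2, 2000, 2, 2)])

offline_rows = available_subject_uris(conn)
```

While the "online" storage row has state 0, only the local file is returned. Setting that row's state to 1 makes the second event visible again without touching the event table.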