Metadata on removable devices
This is a Draft
Authors
Core author
Philip Van Hoof <philip@codeminded.be>
Contributions by
Jürg Billeter <j@bitron.ch>
Evgeny Egorochkin <phreedom.stdin@gmail.com>
Ivan Frade <ivan.frade@nokia.com>
Language and spelling improvements
Tinne Hannes <tinne.hannes@gmail.com>
- Spellchecks are a bit outdated as sections got added. Feel free to contribute spell fixes (this is a wiki).
Introduction
Metadata scanners want to store the metadata they collect about the content of a removable device on that device itself. This way, different scanners can quickly access and share the metadata related to the device.
Dependencies
Turtle tools
Issues to solve
- Writing out the entire file is going to cause too much I/O and take too long to finish before the unmount has completed.
- Rewriting the entire file for each change is going to cause too much I/O, and on older flash devices it is going to cause excessive wear.
- Our users take USB sticks and MMC cards forcefully out of their sockets. The format therefore needs to be append-only unless the developer can do an atomic rename().
- The full format of Turtle might be too slow to parse for relatively simple records.
- We don't want to burden developers with a complex format or with a format that has no existing parsers.
Reasoning behind some decisions
This format is append-only. This solves most I/O problems while writing. It's also the best strategy for avoiding data corruption whenever a user forcefully removes the removable device from its socket.
- XML is not appendable. Therefore it's not a good format for large amounts of records nor is it a usable format when data corruption is a realistic possibility.
- Google protocol buffers is an interesting component, but the format prepends instead of appends. That makes sense for network I/O, but not for disk I/O nor for avoiding data corruption.
- cLucene sounded like too large and too complex a dependency, although we have found reasons to believe that its format is made to be robust against data corruption too.
- SQLite has transaction support, sure, but try pulling a USB stick out of its socket while SQLite is writing to the .db file. Usually the database will be corrupted the next time you open it. We also don't expect this from SQLite: its purpose is not to protect you against this use-case.
Since this format is compatible with Turtle, existing parsers such as Raptor can read it. This makes things easier for implementers.
Sure, we have heard about XMP, but writing a sidecar .xmp file of a few hundred bytes next to each file on a FAT32 partition is going to waste space, especially given that an allocation unit (cluster) on FAT32 is usually between 8 KB and 32 KB in size. This format doesn't mean that you can't embed your XMP data into files that can embed it. You can still do this, of course. XMP sidecar files also create a filename conflict when you need to store metadata in a sidecar file for, for example, "My Work.txt" and "My Work.doc".
Filename
The filename is .cache/metadata/metadata.ttl, stored relative to the removable device's mount point. For example, for a device mounted at /media/USBStick the metadata is located at /media/USBStick/.cache/metadata/metadata.ttl.
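As a small illustration, the location can be derived from the mount point as follows. This is only a sketch: the helper name metadata_path is ours and not part of this specification.

```python
from pathlib import PurePosixPath

# Relative location of the metadata file, as given in the Filename section.
METADATA_RELATIVE_PATH = PurePosixPath(".cache/metadata/metadata.ttl")


def metadata_path(mount_point: str) -> PurePosixPath:
    """Return the metadata file location for a device mounted at mount_point.

    metadata_path is a hypothetical helper name used for illustration only.
    """
    return PurePosixPath(mount_point) / METADATA_RELATIVE_PATH


print(metadata_path("/media/USBStick"))
```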
Namespaces
Any namespace is allowed, whether it is declared with @prefix or only appears in a resource predicate in front of a colon (:). When a namespace only occurs in resource predicates, it's unnecessary, but allowed, to declare a @prefix for it.
Ontology
rdf:type | The RDF type of the resource in your triple. The format of the value is defined by the ontology you use. |
resource:Modified | The modification timestamp of the triple, in seconds since the UNIX epoch (when the record got added to the TTL file). |
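To make the timestamp format concrete, here is a minimal sketch that converts a resource:Modified value to a readable date and produces a value for "now". The function names are hypothetical, and interpreting the value in UTC is our assumption (the UNIX epoch itself is defined in UTC).

```python
from datetime import datetime, timezone


def modified_to_iso(modified: str) -> str:
    """Render a resource:Modified value (epoch seconds) as an ISO 8601 date in UTC."""
    moment = datetime.fromtimestamp(int(modified), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


def now_as_modified() -> str:
    """Return the current time as a resource:Modified value (hypothetical helper)."""
    return str(int(datetime.now(tz=timezone.utc).timestamp()))


print(modified_to_iso("1200268800"))  # 2008-01-14T00:00:00
```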
Variables
Using @base is not allowed. When using libraptor you must do this with your serializer:
raptor_serializer *serializer = raptor_new_serializer ("turtle");
raptor_serializer_set_feature (serializer, RAPTOR_FEATURE_WRITE_BASE_URI, 0);
Relative URIs
It's recommended to use relative URIs for all URIs in the file. An absolute URI might not be useful, especially not if it contains the removable device's mount point, because that mount point can differ from one host to the next.
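A scanner that walks the mounted device usually has absolute paths in hand. The sketch below turns such a path into a URI relative to the mount point. The function name is hypothetical, and percent-encoding characters such as spaces is our assumption; this specification does not prescribe an escaping rule.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def relative_uri(mount_point: str, absolute_path: str) -> str:
    """Return absolute_path as a URI relative to mount_point.

    Raises ValueError when the path is not located on the device.
    Percent-encoding is an assumption made for this sketch.
    """
    relative = PurePosixPath(absolute_path).relative_to(PurePosixPath(mount_point))
    return quote(relative.as_posix())


print(relative_uri("/media/USBStick", "/media/USBStick/Files/file1.txt"))
print(relative_uri("/media/USBStick", "/media/USBStick/Files/My Work.txt"))
```

A path outside the mount point raises ValueError, which is a convenient place to decide that the subject does not belong in this device's metadata file.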
Format of the Turtle file
When all triples have been processed, this file forms an accurate representation of all collected metadata on the removable device.
The subject is always the URI without @base (it's a relative URI), the predicate is the property that you want to store and the object is the value of the property.
The @base is defined by the system that consumes the metadata. Therefore, we don't put the @base directive in the Turtle file.
There are four kinds of triples: additions, moves, removals and updates.
Additions and updates
The required predicates for addition and updates are <resource:Modified> and <rdf:type>. Additions and updates share the same triple format.
Each update/addition-record can contain extra predicates. It's up to the consumer of the metadata to decide what it will do with this metadata content.
<Files/file1.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" .
<Files/file2.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" .
<Files/file3.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" .
<Files/file4.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" .
If chronological processing of earlier triples has already produced an addition triple for the same subject, the later triple updates it for every predicate that it mentions.
<Files/file4.txt> <resource:Modified> "1200441600" ; <rdf:type> "File" .
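A writer could produce such records with a few lines of code. The following is a minimal sketch, not a full Turtle serializer: the function names are ours, it only emits string literals (the examples in this document also use bare numbers), and it assumes the subject is already a valid relative URI.

```python
def turtle_literal(value: str) -> str:
    """Quote a string as a Turtle literal, escaping backslash, quote and newline."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def format_record(subject, modified, rdf_type, extra=()):
    """Build one addition/update record in the layout used by the examples.

    extra is a sequence of (predicate, string value) pairs; the two required
    predicates always come first. Hypothetical helper for illustration only.
    """
    pairs = [("resource:Modified", str(modified)), ("rdf:type", rdf_type), *extra]
    body = " ; ".join(f"<{predicate}> {turtle_literal(value)}" for predicate, value in pairs)
    return f"<{subject}> {body} .\n"


print(format_record("Files/file4.txt", 1200441600, "File"), end="")
```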
Moves
The required predicate for moving a subject is <> (the blank value written as a resource). The required value for <> is <to-URI>, written as a resource (a URI) without @base. This is only correct if to-URI points to a location on the same removable device that this metadata cache is about. If this is not the case, then you must use a removal, which is explained below.
<Files/file1.txt> <> <Files/file2.txt> .
For a move it is also allowed to append a removal (see below) followed by an addition triple. The addition triple for the new to-URI (a URI without @base) must then contain all the predicates that are still valid for the subject that is being removed.
<Files/file1.txt> <> <> .
<Files/file2.txt> <resource:Modified> "1200268800" ; <rdf:type> "File" ; <File:Name> "file2.txt" ; <File:Etcetera> "etcetera ..." ; <Unknown:OntologyX> "for me" ; <Unknown:OntologyY> "for me too" .
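On the consuming side, a move boils down to re-keying whatever is known about the subject. The sketch below assumes the consumer keeps its metadata in a plain dictionary that maps a subject URI to its predicates; that storage model and the function name are ours, not something this specification requires.

```python
def apply_move(store, source, target):
    """Apply '<source> <> <target> .' to an in-memory store.

    store maps subject URI -> {predicate: [values]} (an assumed layout).
    Everything known about source is carried over to target, including
    predicates this consumer does not understand.
    """
    if source in store:
        store[target] = store.pop(source)


store = {
    "Files/file1.txt": {
        "rdf:type": ["File"],
        "Unknown:OntologyX": ["for me"],
    }
}
apply_move(store, "Files/file1.txt", "Files/file2.txt")
print(store)
```

The removal-plus-addition form reaches the same end state, but only because the writer repeats every predicate that is still valid.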
Removals
Removing a subject
The required predicate for the removal of a subject is <> (blank value as a resource). The required value for <> is <> (blank value as a resource). You can read this construction as: move to nothing.
<Files/file1.txt> <> <> .
The removal action for a subject is only advisory. A subject is also to be considered removed if your system's storage (the removable device itself) knows about it but the most recent Turtle file doesn't mention it at all.
Removing a predicate from a subject
To remove a predicate from a subject you simply assign it the blank value. Note that, unlike other objects or values, you must write this as a <resource>, not as a literal. Blank written as a <resource> looks like this: <>.
For example, this removes the title of an image:
<Images/IMG00001.JPEG> <resource:Modified> "1200268800" ; <rdf:type> "Image" ; <Image:Title> <> .
Clearing a group value
Same as removing a predicate from a subject.
<Images/IMG00001.JPEG> <resource:Modified> "1200268800" ; <rdf:type> "Image" ; <User:Keywords> <> .
Group value predicates must respect the order of events in the file. Examples:
Setting keywords from scratch
# The clear of the group value
<Images/IMG00001.JPEG> <resource:Modified> "1200268801" ; <rdf:type> "Image" ; <User:Keywords> <> .
<Images/IMG00001.JPEG> <resource:Modified> "1200268802" ; <rdf:type> "Image" ; <User:Keywords> "set1" .
<Images/IMG00001.JPEG> <resource:Modified> "1200268803" ; <rdf:type> "Image" ; <User:Keywords> "set2" .
# Unrelated subject in the middle. This can be allowed because we
# have resource:Modified and rdf:type repeated each time
<Something.else> <resource:Modified> "1200268804" ; <rdf:type> "Something" .
<Images/IMG00001.JPEG> <resource:Modified> "1200268805" ; <rdf:type> "Image" ; <User:Keywords> "set3" .
<Images/IMG00001.JPEG> <resource:Modified> "1200268806" ; <rdf:type> "Image" ; <User:Keywords> "set4" ; <User:Keywords> "set5" .
Setting these keywords in one statement (equivalent of above):
<Something.else> <resource:Modified> "1200268804" ; <rdf:type> "Something" .
<Images/IMG00001.JPEG> <resource:Modified> "1200268806" ; <rdf:type> "Image" ; <User:Keywords> <> ; <User:Keywords> "set1" ; <User:Keywords> "set2" ; <User:Keywords> "set3" ; <User:Keywords> "set4" ; <User:Keywords> "set5" .
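The equivalence of the two forms can be checked with a small replay. This sketch assumes the statements have already been parsed and reduced to the values of one group predicate per statement; the CLEAR marker and the function name are ours.

```python
CLEAR = object()  # stands for the blank resource <> in this sketch


def replay_group_values(statements):
    """Replay the values of one group predicate in file order.

    statements is a list of statements, each a list of values for the same
    predicate of the same subject. CLEAR resets the group; any other value
    is added to it.
    """
    current = []
    for values in statements:
        for value in values:
            if value is CLEAR:
                current = []
            else:
                current.append(value)
    return current


from_scratch = [[CLEAR], ["set1"], ["set2"], ["set3"], ["set4", "set5"]]
one_statement = [[CLEAR, "set1", "set2", "set3", "set4", "set5"]]

print(replay_group_values(from_scratch))
print(replay_group_values(from_scratch) == replay_group_values(one_statement))
```

Because the outcome depends on the order of the records, a consumer has to process the file from top to bottom.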
Extra assumptions on the turtle format
A Turtle file generated following this specification obeys some extra assumptions. This means that the file will be a valid Turtle file, but not every Turtle file can be used in its place.
When a predicate has multiple values, each one is a different triple, but all of them are consecutive in the file.
This is fine:
<Files/file1.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" ; <User:Keywords> "tasks" ; <User:Keywords> "boooring" ; <User:UseTime> 100 .
This is not fine:
<Files/file1.txt> <resource:Modified> "1200355200" ; <rdf:type> "File" ; <User:Keywords> "tasks" ; <User:UseTime> 100 ; <User:Keywords> "boooring" .
All addition triples for a subject must be grouped together if <resource:Modified> and <rdf:type> are not repeated.
This is fine:
<Files/file1.txt> <resource:Modified> "1200355201" ; <rdf:type> "File" ; <User:UseTime> 100 ; <User:Keywords> "tasks" .
<Files/file1.txt> <User:Keywords> "boooring" .
<Files/file2.txt> <resource:Modified> "1200355202" ; <rdf:type> "File" ; <User:UseTime> 100 .
This is fine:
<Files/file1.txt> <resource:Modified> "1200355201" ; <rdf:type> "File" ; <User:UseTime> 100 ; <User:Keywords> "tasks" .
<Files/file1.txt> <resource:Modified> "1200355202" ; <rdf:type> "File" ; <User:UseTime> 100 ; <User:Keywords> "boooring" .
<Files/file2.txt> <resource:Modified> "1200355203" ; <rdf:type> "File" ; <User:UseTime> 100 .
This is not fine (note the file1.txt vs. file2.txt position and the last triple not repeating <resource:Modified> and <rdf:type>):
<Files/file1.txt> <resource:Modified> "1200355201" ; <rdf:type> "File" ; <User:Keywords> "tasks" .
<Files/file2.txt> <resource:Modified> "1200355202" ; <rdf:type> "File" ; <User:UseTime> 100 .
<Files/file1.txt> <User:Keywords> "boooring" .
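Both assumptions are easy to check mechanically once the statements are parsed. The sketch below works on addition and update statements only (it does not know about moves or removals) and assumes each statement is given as a subject plus an ordered list of (predicate, value) pairs; that representation and the function names are ours.

```python
REQUIRED = {"resource:Modified", "rdf:type"}


def values_are_consecutive(pairs):
    """True when all values of the same predicate sit next to each other."""
    seen = set()
    previous = None
    for predicate, _value in pairs:
        if predicate != previous and predicate in seen:
            return False
        seen.add(predicate)
        previous = predicate
    return True


def check(statements):
    """Return (index, message) for every statement that breaks an assumption."""
    problems = []
    previous_subject = None
    for index, (subject, pairs) in enumerate(statements):
        predicates = {predicate for predicate, _value in pairs}
        if not values_are_consecutive(pairs):
            problems.append((index, "values of one predicate are split up"))
        # A statement without the required predicates is a continuation and
        # must directly follow a statement about the same subject.
        if not REQUIRED <= predicates and subject != previous_subject:
            problems.append((index, "continuation is not grouped with its subject"))
        previous_subject = subject
    return problems


head = [("resource:Modified", "1200355201"), ("rdf:type", "File")]
not_fine = [
    ("Files/file1.txt", head + [("User:Keywords", "tasks")]),
    ("Files/file2.txt", head + [("User:UseTime", "100")]),
    ("Files/file1.txt", [("User:Keywords", "boooring")]),
]
print(check(not_fine))
```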
Conflict resolution
If a record already exists in your local database, you use the <resource:Modified> predicate to determine whether your system will update its own store or append an update triple to the Turtle document instead.
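One way to read this rule is "the most recent <resource:Modified> wins". That reading is our assumption; this specification only says that the predicate is used to decide. A minimal sketch with hypothetical names:

```python
from typing import Optional


def resolve(local_modified: Optional[int], file_modified: int) -> str:
    """Decide what to do with a record found in the Turtle file.

    local_modified is the resource:Modified value in the local database, or
    None when the subject is unknown locally. The newest timestamp wins,
    which is an assumption of this sketch.
    """
    if local_modified is None or file_modified > local_modified:
        return "update-local-store"
    if local_modified > file_modified:
        return "append-update-to-file"
    return "in-sync"


print(resolve(1200355200, 1200441600))
print(resolve(1200441600, 1200355200))
```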
Rewriting
Rewriting advisory
If removals plus duplicate additions make up more than 30% of all triples, the advice is to rewrite the entire Turtle file into a file that contains only unique additions. For each subject, use as <resource:Modified> the value of <resource:Modified> in the last update/addition triple that was related to that subject.
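The threshold could be evaluated as sketched below. The text does not define precisely what counts as a duplicate addition, so this sketch makes an assumption: an addition is a duplicate when a later addition for the same subject exists. Records are counted per statement, and the names are hypothetical.

```python
def should_rewrite(records, threshold=0.30):
    """Return True when removals plus superseded additions exceed threshold.

    records is a list of (subject, kind) in file order, where kind is
    "addition" (also used for updates), "removal" or "move".
    """
    if not records:
        return False
    last_addition = {}
    for index, (subject, kind) in enumerate(records):
        if kind == "addition":
            last_addition[subject] = index
    stale = 0
    for index, (subject, kind) in enumerate(records):
        if kind == "removal":
            stale += 1
        elif kind == "addition" and last_addition[subject] != index:
            stale += 1  # a later addition for this subject supersedes it
    return stale / len(records) > threshold


records = [("a", "addition"), ("a", "addition"), ("b", "addition")]
print(should_rewrite(records))  # one stale record out of three
```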
Rewriting rules
The rewrite must be implemented atomically. On most operating systems and filesystems the only atomic operation is rename(). Therefore, you can, for example, prepare the new Turtle file in .cache/metadata/metadata.ttl.tmp and then rename() that to .cache/metadata/metadata.ttl.
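In Python the same idea looks roughly like this. It is a sketch under assumptions: os.replace() maps to rename() on POSIX systems, and whether the rename is truly atomic also depends on the filesystem of the removable device. The function name is ours.

```python
import os


def rewrite_atomically(path: str, new_content: str) -> None:
    """Replace the Turtle file at path with new_content via a temporary file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as tmp:
        tmp.write(new_content)
        tmp.flush()
        os.fsync(tmp.fileno())  # make sure the data is on the device first
    # Readers see either the old file or the complete new one.
    os.replace(tmp_path, path)
```

If the device is pulled out before the rename, the old metadata.ttl is still intact and only the .tmp file is lost.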
When rewriting the file it's not allowed to remove predicates from the addition triples that are unknown to your system. Note that this is very important.
If a second update/addition triple added a predicate that was unseen in previous triples about the subject, then your merged addition triple must append it.
When rewriting, don't include addition triples for a subject whose last triple in the original file is a removal. Any removal triple means that previously seen predicates must be reset in case a later addition triple for the same subject occurs.
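Taken together, these rules describe a merge over the statements in file order. The following is a minimal sketch of such a merge on already parsed statements. The data layout, the EMPTY marker and the group_predicates parameter are our assumptions; in particular, the file itself does not say which predicates hold a group of values, so the caller has to supply that.

```python
EMPTY = object()  # stands for the blank resource <> in this sketch


def compact(statements, group_predicates=frozenset()):
    """Merge statements into one set of predicates per surviving subject.

    statements is a list of (subject, [(predicate, value), ...]) in file
    order. Unknown predicates are carried along untouched.
    """
    merged = {}
    for subject, pairs in statements:
        if len(pairs) == 1 and pairs[0][0] is EMPTY:
            target = pairs[0][1]
            old = merged.pop(subject, None)  # removal: forget the subject
            if target is not EMPTY and old is not None:
                merged[target] = old  # move: carry the predicates over
            continue
        props = merged.setdefault(subject, {})
        replaced = set()
        for predicate, value in pairs:
            if value is EMPTY:
                props.pop(predicate, None)  # predicate removed or group cleared
            elif predicate in group_predicates or predicate in replaced:
                props.setdefault(predicate, []).append(value)
            else:
                props[predicate] = [value]  # a later value replaces an earlier one
                replaced.add(predicate)
    return merged


statements = [
    ("Images/IMG00001.JPEG", [("resource:Modified", "1200268800"), ("rdf:type", "Image"),
                              ("Image:Title", "IMG000001 Jan. 14 2008")]),
    ("Images/IMG00001.JPEG", [("resource:Modified", "1200355200"), ("rdf:type", "Image"),
                              ("Image:Title", "Mom with a cake"), ("Image:Keywords", EMPTY),
                              ("Image:Keywords", "mommy"), ("Image:Keywords", "birthdays")]),
    ("Images/IMG00001.JPEG", [(EMPTY, "Images/MomWithCake.JPEG")]),
    ("Images/MomWithCake.JPEG", [("resource:Modified", "1200614400"), ("rdf:type", "Image"),
                                 ("Image:Keywords", "cute")]),
]
print(compact(statements, group_predicates={"Image:Keywords"}))
```

The result keeps the last <resource:Modified> per subject, which is also what the rewriting advisory asks for.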
Day to day use advisory
Day to day use is recommended. This means that while you are not rewriting the Turtle file as described in the previous section, you append removal, addition and update triples to the file as soon as possible.
You should try to make sure that the file is properly closed before the removable device is unmounted.
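A minimal sketch of such an append follows. The function name is hypothetical, and calling fsync after every batch is our choice to shorten the window in which a forceful removal can lose data; the specification itself only asks for appending as soon as possible and closing the file properly.

```python
import os


def append_records(path, records):
    """Append already formatted Turtle records to the metadata file."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Mode "a" never touches existing content, it only adds to the end.
    with open(path, "a", encoding="utf-8") as handle:
        for record in records:
            handle.write(record if record.endswith("\n") else record + "\n")
        handle.flush()
        os.fsync(handle.fileno())
    # Leaving the with-block closes the file before the device is unmounted.
```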
Corrupted file handling
When a user forcefully pulls a USB stick out of the USB socket while software is writing to the metadata.ttl file, it can indeed happen that the kernel of the host computer leaves the file corrupted. We don't support this as a valid use-case, but this specification nonetheless attempts to be as robust and pragmatic about it as possible. The ideological people who want to punish users who forcefully remove removable media from the host computer, by making this event frequently break software that wants to use this metadata file, are being ignored on purpose. By that I mean that this format disagrees by specification with these ideological people: they should not implement their software using this format; they should instead invent their own format.
Therefore
In case the file can't be parsed with a Turtle parser (like libraptor's Turtle parser), you can assume that the file has become corrupted. If this happens, your software is allowed to move the file to a file called metadata.ttl.NNN, where each N is a digit. Software that tries to reconstruct such backup files is not defined within this spec, except that it is certainly allowed to try to merge them back into a proper metadata.ttl. Remember to use the rules as defined in the Rewriting section of this document when writing such a piece of software.
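Setting a corrupted file aside could look like the sketch below. Picking the first unused three-digit suffix is our assumption; the text above only says that the N are numbers. The function name is hypothetical, and the check-then-rename sequence is not safe against two programs doing this at the same time.

```python
import os


def set_aside_corrupted(path: str) -> str:
    """Move a corrupted metadata.ttl to the first free metadata.ttl.NNN name."""
    for number in range(1000):
        backup = f"{path}.{number:03d}"
        if not os.path.exists(backup):
            os.replace(path, backup)
            return backup
    raise RuntimeError("no free backup name left for " + path)
```

After the move, a fresh metadata.ttl can be started with normal appends.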
Examples
Images
Tinne takes photograph of mom with a cake
The resulting file: Just before the compress.ttl
Tinne makes photo
Tinne made a photo, stored it on a Flash disk, and plugged the device into an N810. The N810 will find this in the .ttl file, provided the original host of the Flash disk wrote one.
You'll see <Image:Date> being added here. It won't have the same format as <resource:Modified>. That's of course because you define the format for your predicate-value pairs yourself. But you don't define the format for <resource:Modified> (see the Ontology section above).
File gets created: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <resource:Modified> "1200268800" ; <rdf:type> "Image" ; <Image:Title> "IMG000001 Jan. 14 2008" ; <Image:Height> 1024 ; <Image:Width> 1024 ; <Image:Date> "2008-01-14T00:00:00" .
Changes Title
Tinne changed the Title of the photo on her N810. She also added a few keywords.
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <resource:Modified> "1200355200" ; <rdf:type> "Image" ; <Image:Title> "Mom with a cake" ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" .
Categorizes
Tinne took the Flash disk out of her N810, and inserted it into her laptop. On her laptop she resized the image to 1500x1500 and she put the photo in the Family album. The software for categorizing photos has both automatically set the Album properties and added a keyword for this.
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <resource:Modified> "1200441600" ; <rdf:type> "Image" ; <Image:Height> 1500 ; <Image:Width> 1500 ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Album> "Family" .
Renames
Tinne starts editing the photo and her photo software renamed the photo to Images/MomWithCake.JPEG.
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <> <> .
<Images/MomWithCake.JPEG> <resource:Modified> "1200528000" ; <rdf:type> "Image" ; <Image:Title> "Mom with a cake" ; <Image:Date> "2008-01-14T00:00:00" ; <Image:Height> 1500 ; <Image:Width> 1500 ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Album> "Family" .
or
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <> <Images/MomWithCake.JPEG> .
Adding keywords
Tinne reinserts the Flash disk into her N810, and on her N810 she decides to add the keyword "cute".
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/MomWithCake.JPEG> <resource:Modified> "1200614400" ; <rdf:type> "Image" ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Keywords> "cute" .
Unsets the album
Tinne unsets the album of the image on her N810
Data gets appended to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/MomWithCake.JPEG> <resource:Modified> "1200614400" ; <rdf:type> "Image" ; <Image:Album> <> .
Idle, software compresses the file
The N810 sits idle for some time, and decides to 'compress' the Turtle file
The resulting file: Just after the compress.ttl (multiple possibilities)
File gets truncated to: /media/USBStick/.cache/metadata/metadata.ttl
<Images/IMG00001.JPEG> <> <> .
<Images/MomWithCake.JPEG> <resource:Modified> "1200614400" ; <rdf:type> "Image" ; <Image:Title> "Mom with a cake" ; <Image:Date> "2008-01-14T00:00:00" ; <Image:Height> 1500 ; <Image:Width> 1500 ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Keywords> "cute" .
Or gets truncated to:
<Images/IMG00001.JPEG> <resource:Modified> "1200528000" ; <rdf:type> "Image" ; <Image:Title> "Mom with a cake" ; <Image:Date> "2008-01-14T00:00:00" ; <Image:Height> 1500 ; <Image:Width> 1500 ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Album> "Family" .
<Images/IMG00001.JPEG> <> <Images/MomWithCake.JPEG> .
<Images/MomWithCake.JPEG> <resource:Modified> "1200614400" ; <rdf:type> "Image" ; <Image:Keywords> <> ; <Image:Keywords> "mommy" ; <Image:Keywords> "birthdays" ; <Image:Keywords> "family" ; <Image:Keywords> "cute" ; <Image:Album> <> .