Data Comprehension

I believe that the number one thing computers need to do is understand my data.

Right now, the state of the art is that, except for a few standardized formats, data is handled only by applications designed for it.

I believe that should change: data understanding should move into core framework libraries. Evolution has made a start on this with evolution-data-server, and there are various image-processing and XML libraries. Search applications like Beagle and Tracker try to understand file formats in order to pull metadata out for their search databases.

Let's extend this to all data formats. Instead of placing data parse/load/save/convert logic into applications, put it into libraries with standardized interfaces. The interface should be a kind of layered Document Object Model.
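
To make the idea of a layered Document Object Model more concrete, here is a minimal Python sketch. Nothing in it is an existing data-server API: the names Node, LayeredDocument, register and layer are hypothetical, and the "all capitals means heading" rule is only a toy assumption. It shows a text layer, a semantic layer built on top of it, and layers that are generated only when first requested.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """One DOM node: a kind (e.g. 'paragraph'), its text, and child nodes."""
    kind: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


class LayeredDocument:
    """Hypothetical document whose layers are built only on first request."""

    def __init__(self, raw: str):
        self.raw = raw
        self._builders: Dict[str, Callable[["LayeredDocument"], Node]] = {}
        self._cache: Dict[str, Node] = {}

    def register(self, name: str, builder: Callable[["LayeredDocument"], Node]) -> None:
        self._builders[name] = builder

    def layer(self, name: str) -> Node:
        # On-demand generation: build the layer once, then reuse it.
        if name not in self._cache:
            self._cache[name] = self._builders[name](self)
        return self._cache[name]


def build_text(doc: LayeredDocument) -> Node:
    # Text layer: paragraphs separated by blank lines, words inside them.
    paragraphs = [p.strip() for p in doc.raw.split("\n\n") if p.strip()]
    return Node("document", children=[
        Node("paragraph", p, [Node("word", w) for w in p.split()])
        for p in paragraphs
    ])


def build_semantic(doc: LayeredDocument) -> Node:
    # Semantic layer sits above the text layer.
    # Toy rule (assumption): a one-line, all-capitals paragraph is a heading.
    text = doc.layer("text")
    return Node("document", children=[
        Node("heading" if p.text.isupper() and "\n" not in p.text else "body", p.text)
        for p in text.children
    ])


doc = LayeredDocument("SHOPPING\n\nmilk and eggs\n\nbread")
doc.register("text", build_text)
doc.register("semantic", build_semantic)
kinds = [n.kind for n in doc.layer("semantic").children]
print(kinds)
```

Asking for the semantic layer pulls in the text layer underneath it, and neither is built until something asks for it.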

Features

  • Extract generic metadata about the document and provide it as a series of DOM nodes.
  • Extract higher level document data and metadata in a series of layers and provide each layer as a set of DOM nodes. For example:
    • Text document is a data layer with nodes for character, word, paragraph.
      • Above text layer is the semantic layer. Semantic provides information such as heading, definition, index item, etc.
        • Semantic can convert up to a higher layer like HTML, XML, Word/OOWriter with fixed styling, spreadsheet, database, etc.
      • Also above the text layer is a presentation layer. The presentation layer applies to several different lower layers including images as well as text. Presentation provides information for special positioning of each object in the document.
    • Image document is a data layer including color space, image dimensions, bitmaps.
      • Above that generic image layer is JPEG which provides access to JPEG specific information, enabling loss-less rotations, etc.
  • On demand DOM generation to save memory and processing time.
  • Prepare the best possible difference/patch between two documents of the same format, suitable for change conflict resolution and for viewing document revisions.
    • Produce semantic patches as well as text change diffs. In a word processing document, for example, style changes would be their own diff, and the patch generator should attempt to reduce find/replace changes to a single diff entry instead of 100 repetitions of word1 > word2.
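
The find/replace reduction can be sketched with Python's standard difflib module. This is only one possible approach, assuming a word-level diff is good enough; the function name collapsed_word_diff and the shape of the returned entries are made up for illustration.

```python
import difflib
from collections import Counter


def collapsed_word_diff(old_text, new_text):
    """Word-level diff in which identical changes are merged into one entry."""
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Key each change by what it does, not by where it happens.
            counts[(tag, " ".join(old[i1:i2]), " ".join(new[j1:j2]))] += 1
    return [
        {"op": tag, "old": a, "new": b, "count": n}
        for (tag, a, b), n in counts.items()
    ]


before = "the colour red and the colour blue and the colour green"
after = "the color red and the color blue and the color green"
entries = collapsed_word_diff(before, after)
print(entries)
```

Here the three colour > color replacements come out as a single entry with a count of 3. A real patch plugin would also have to record positions so that the patch can be applied.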

  • Convert the document to any possible destination format, including application in-memory formats for save/load. If necessary, go through intermediate conversions to the final result. These conversion formats include the above DOM node formats.
  • Configuration via GConf
    • The data-server and plugins would obey GConf keys to determine default conversions where options exist. For example, if asked to convert text to Word, is that Word 2007 or Word 95? A GConf key decides.
  • Synchronize documents
    • Save a copy of the document revision at sync completion.
    • Search for common ancestor documents using saved copies.
    • Prepare patches with timestamps.
    • Resolve conflicting changes if possible.
  • Periodic Document Snapshots
    • Able to save a document revision of all known documents via cron job.
    • Maintain 24 * hourly, 7 * daily, 5 * weekly, 12 * monthly revisions, deleting unneeded intermediates.
    • Actual implementation could be LVM, ZFS, BTRFS, NetApp snapshots or hidden dotfile prefix or tilde-version suffix. It would depend on the available features, configuration and plugins.
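
The 24 hourly, 7 daily, 5 weekly, 12 monthly retention rule could be sketched as below. This is an assumption about how the pruning might work, not a description of any existing tool: each snapshot is assigned to an hour, day, ISO week and month bucket, and the newest snapshot in each of the most recent N buckets is kept. POLICY and snapshots_to_keep are hypothetical names.

```python
from datetime import datetime, timedelta

# (label, number of buckets to keep, function mapping a timestamp to its bucket)
POLICY = [
    ("hourly", 24, lambda t: (t.year, t.month, t.day, t.hour)),
    ("daily", 7, lambda t: (t.year, t.month, t.day)),
    ("weekly", 5, lambda t: tuple(t.isocalendar())[:2]),  # (ISO year, ISO week)
    ("monthly", 12, lambda t: (t.year, t.month)),
]


def snapshots_to_keep(timestamps):
    """Return the set of snapshot timestamps the retention policy retains."""
    keep = set()
    newest_first = sorted(timestamps, reverse=True)
    for _label, limit, bucket_of in POLICY:
        seen = []
        for ts in newest_first:
            key = bucket_of(ts)
            if key in seen:
                continue  # an older snapshot in a bucket we already covered
            if len(seen) == limit:
                break  # enough buckets for this granularity
            seen.append(key)
            keep.add(ts)  # newest snapshot in this bucket
    return keep


# Three days of hourly snapshots, all inside one ISO week and one month.
snapshots = [datetime(2024, 1, 1) + timedelta(hours=h) for h in range(72)]
keep = snapshots_to_keep(snapshots)
unneeded = sorted(set(snapshots) - keep)
print(len(keep), "kept,", len(unneeded), "unneeded intermediates")
```

Whatever is not in the returned set is an unneeded intermediate that the cron job could delete, whether the snapshots live in LVM, ZFS or tilde-suffixed files.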

  • Virtual Documents
    • Email Store would be a virtual document containing all the emails in the user's configured locations, including network IMAP accounts.
    • Documents would contain every document known to data-server. The DOM (as every DOM would) contains search functions to return filtered lists of document nodes.
      • The Documents virtual document plus search functions actually subsumes the Beagle/Tracker search daemon, or perhaps this virtual document is provided by a plugin linking to the search daemon.
  • Plugin Design
    • The data-server would not be a monolithic chunk of code, but a collection of shared objects sharing a defined interface.
    • Plugins should be robust. Place the well-debugged core plugins in a collection that runs in-process, but run less-trusted plugins in a child process, perhaps communicating using D-Bus or another IPC API.
    • The option to use IPC would also enable plugins written in Python, C#, Perl or maybe even shell script.
    • Plugins for data formats (best if these could do loss-less conversion to and from DOM representation).
    • Plugins for patch generation.
    • Plugins for DOM layers.
    • Plugin versioning support. Plugins can provide and require strings and versions similar to RPM and APT. This is so many Word document plugins can be installed and the best one chosen.
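
As a rough illustration of provide/require versioning, the sketch below picks the best plugin for a capability. Everything here is hypothetical: the Plugin class, the capability strings such as format/word, and the plugin names. Versions are plain integer tuples, which is much simpler than real RPM or APT version comparison.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, ...]


@dataclass
class Plugin:
    name: str
    provides: Dict[str, Version]                              # capability -> version offered
    requires: Dict[str, Version] = field(default_factory=dict)  # capability -> minimum version


def best_provider(plugins: List[Plugin], capability: str,
                  minimum: Version = (0,)) -> Optional[Plugin]:
    """Pick the highest-version plugin whose own requirements are satisfied."""
    # Highest version of each capability offered by any installed plugin.
    available: Dict[str, Version] = {}
    for plugin in plugins:
        for cap, ver in plugin.provides.items():
            available[cap] = max(available.get(cap, ver), ver)

    candidates = [
        p for p in plugins
        if capability in p.provides
        and p.provides[capability] >= minimum
        and all(cap in available and available[cap] >= ver
                for cap, ver in p.requires.items())
    ]
    return max(candidates, key=lambda p: p.provides[capability], default=None)


plugins = [
    Plugin("word-basic", {"format/word": (1, 0)}),
    Plugin("word-2007", {"format/word": (2, 1)}, {"layer/text": (1, 0)}),
    Plugin("word-experimental", {"format/word": (3, 0)}, {"layer/semantic": (2, 0)}),
    Plugin("text-layer", {"layer/text": (1, 2)}),
]
best = best_provider(plugins, "format/word")
print(best.name if best else None)
```

In this example word-experimental offers the highest version but needs a semantic layer that nobody provides, so word-2007 is chosen.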

Usage Examples

  • The official Gnome image viewer, instead of reading files itself, passes the data to the data-server library or process, asking for a bitmap result. No matter what format the document is in, if it can be rendered as a bitmap it is. JPEG, GIF, XPM and even Word, Keynote, PowerPoint and PDF documents are rendered.

  • The Gnome search engine passes the document to the data-server, asking for metadata and text data. JPEG note data and timestamps are pulled. Word documents spit out their contents.
  • The official Gnome media player is given a Word document. It asks data-server for a GStreamer output, which is provided via text to speech conversion of the Word document's text document DOM (or the semantic DOM).
  • Microsoft releases a new Word with document format changes. OOWriter is helpless until a new release. User downloads a new Gnome document format plugin for data-server and AbiWord loads the new format just by asking for document-source to abiword-memory conversion.

  • User owns three laptops, one with Mac OS X, one with Linux and one with Windows Vista, and user owns a smartphone. The Gnome Linux synchronization program using data-server is configured to push/pull documents to OS X in Pages format, Word 2007 format on Vista and compressed Mobipocket Reader or TXT on the smartphone. He removes items from his grocery list on his smartphone while his wife adds items on one of the laptops. On the next synchronization, all removed and added items are resolved.
  • User wants to take every email message received from Bob and add the first paragraph of each to an OOWriter document. So User writes a Python script using data-server to open the virtual Email Store document, uses a find() filter to get a list of DOM email nodes, takes each email body as a text document DOM and copies the first paragraph node to another text document DOM, then writes that DOM as OOWriter.
  • Programmer wants an email client with a spiffy new look. Instead of rewriting all email handling, he manipulates the virtual Email Store document and only has to write the GUI, all in C#.
  • User thinks .Mac is spiffy but too expensive and no privacy. User gets a cheap virtual host and installs data-server plus a PHP script (or an Apache module), and sets synchronization of all his data-server instances to that virtual host where all his systems can reach it.
  • User tells data-server to read all available revisions of grocery-list.txt and output as a Word 2007 document with change tracking.
  • User takes 5 revisions of an OOWriter document that was emailed around the world for comments and attempts to synchronize all 5 to one final Word document with change tracking. This probably requires assistance from an application to display the documents plus patches and allow editing to a final version, but data-server is used to produce the best possible patches between the parent document, the revised document and the final merge document.
  • Take all revisions of a text document, convert to gedit-memory. The revisions appear as Undo history.
  • (a bit crazy) Feed a binary object code file to data-server and ask for C source. The data-server converts to ASM, then attempts to decompile to C (probably very bad C).
  • Feed a C source file to data-server and ask for HTML. Source is run through Doxygen and output.
  • Feed a C source file and ask for a 32-bit ELF Executable. Source is compiled through GCC.
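
Several of these examples depend on chaining conversions through intermediate formats. One way to find such a chain, sketched here under the assumption that every plugin registers a (source, destination) pair, is a breadth-first search over the registered converters. The format names and the CONVERTERS table are hypothetical labels, not real plugin identifiers.

```python
from collections import deque

# Hypothetical registry: (source format, destination format) -> converter name.
CONVERTERS = {
    ("elf-object", "asm"): "disassembler",
    ("asm", "c-source"): "decompiler",
    ("c-source", "html"): "doxygen",
    ("c-source", "elf-executable"): "gcc",
    ("word", "text-dom"): "word-plugin",
    ("text-dom", "speech-audio"): "text-to-speech",
}


def conversion_path(source, target, converters=CONVERTERS):
    """Return the shortest list of converters from source to target, or None."""
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, steps = queue.popleft()
        if fmt == target:
            return steps
        for (src, dst), tool in converters.items():
            if src == fmt and dst not in visited:
                visited.add(dst)
                queue.append((dst, steps + [tool]))
    return None  # no chain of converters reaches the target


print(conversion_path("elf-object", "html"))
print(conversion_path("word", "speech-audio"))
print(conversion_path("html", "word"))
```

Breadth-first search returns the chain with the fewest steps. A real data-server would probably also weigh how lossy each step is, which this sketch ignores.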

Similar Projects

  • Storage (using a DB makes it too heavy I think)
  • Unison (doesn't seem to understand any file contents so just glorified rsync)
  • The many Perl, Python and Ruby modules for handling XML, HTML, OOWriter ODF, PDF, etc.
  • Beagle, Tracker (many metadata extractors)
  • Git, SVN, and others for common ancestor patch and conflict resolution.

Comments

AndersFeder: Did you look at SemanticDesktop (and AndersFeder/SemanticSpace) yet? It has many things in common with what you describe here, I think, only instead of DOM nodes, metadata would be in RDF, easily queried using SPARQL. In my opinion, RDF is really the only right option for metadata today and asking for a file in a different format could be a matter of asking for a resource whose :sourceFile property is "<source-filename>" and :fileFormat is "<destination-format>". Also, there will be no need for the "user" (what kind of user writes Python) to write a script to search his e-mail. An application performing standard SPARQL queries can do that at least as well.

Attic/ScratchPad/DataEverything (last edited 2013-12-03 19:46:27 by WilliamJonMcCann)