This is a scratch pad of the components involved in a global, content-aware searching solution for GNOME. This was discussed on #gnome-hackers on Friday, April 06 2007.
We hope to use this page to outline the major functional units involved, and identify the various constraints, requirements and dependencies of each. This will help us to move forward on a unified approach to search, however that may be embodied. For now, we should try to avoid detailed technical discussion of formats, protocols, transport mechanisms, processes and libraries.
Lead engineers from both existing projects (Beagle and Tracker) have pledged to work together to find an acceptable compromise.
Metadata Storage Engine
This component is responsible for storing and retrieving rich metadata related to an arbitrary URL. The URL need not "exist" or have a physical location, either locally or on the network.
The metadata stored will consist of many kinds of information, from strings to binary blobs to lists, though the representation of that data is internal to the engine.
The metadata query interface exposed to users of the engine is naive. Users can:
- Enumerate all metadata records for a given URL
- Enumerate all known URLs with a certain metadata field set
- Filter these results using basic atomic matching rules (==, !=, <, >, contains)
The metadata storage interface exposed to users of the engine is naive. Users can:
- Set/Add a metadata record value for a URL
- Delete a specific metadata record for a URL
- Delete all metadata records for a URL
The metadata stored may duplicate the inline metadata contained elsewhere, such as in an on-disk file format. The metadata engine is not responsible for synchronizing data to or from such file representations. It simply responds to requests to store data in, or retrieve data from, its backing store.
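As a rough illustration of how small the query and storage interfaces described above could be, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the class name MetadataStore, the method names, the field names and the example URLs are all hypothetical and are not taken from Beagle, Tracker, or any agreed design. It also says nothing about the real backing store, which stays internal to the engine.

```python
import operator
from collections import defaultdict

# Hypothetical sketch only: none of these names come from Beagle or Tracker.

# The basic atomic matching rules, mapped to Python predicates.
# No type checking is done; comparing mismatched types will raise TypeError.
MATCHERS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "contains": lambda stored, wanted: wanted in stored,
}


class MetadataStore:
    """In-memory stand-in for the metadata storage engine."""

    def __init__(self):
        # url -> field name -> list of values (a field may hold several values)
        self._records = defaultdict(dict)

    # --- storage interface ---

    def set_value(self, url, field, value):
        """Set/add a metadata record value for a URL."""
        self._records[url].setdefault(field, []).append(value)

    def delete_field(self, url, field):
        """Delete a specific metadata record for a URL."""
        self._records.get(url, {}).pop(field, None)

    def delete_all(self, url):
        """Delete all metadata records for a URL."""
        self._records.pop(url, None)

    # --- query interface ---

    def records_for(self, url):
        """Enumerate all metadata records for a given URL (as a copy)."""
        return {f: list(v) for f, v in self._records.get(url, {}).items()}

    def urls_with(self, field, op=None, value=None):
        """Enumerate URLs that have `field` set, optionally filtered by a rule."""
        matches = []
        for url, fields in self._records.items():
            if field not in fields:
                continue
            # With no rule given, any URL carrying the field matches.
            if op is None or any(MATCHERS[op](v, value) for v in fields[field]):
                matches.append(url)
        return sorted(matches)


# Example usage with made-up URLs; the second URL has no file behind it.
store = MetadataStore()
store.set_value("file:///home/joe/notes.txt", "title", "Search notes")
store.set_value("file:///home/joe/notes.txt", "size", 2048)
store.set_value("note://example/42", "title", "Shopping list")

titled = store.urls_with("title")
about_notes = store.urls_with("title", "contains", "notes")
large = store.urls_with("size", ">", 1024)
```

Note that the store never looks at the files themselves: it only records what callers hand to it, which matches the point above that synchronizing with inline file metadata is someone else's job.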
- Robust and scalable
- Must support multiple on-disk storage locations (Joe)
- Why? -- AlexGraveley
- Store on removable media
- Read-only data stores shared by all users
- Must be able to enumerate all known on-disk storage locations
- Must store data in relational DB, like sqlite (Jamie)
- Why? Higher levels don't care how it's implemented.
- Robust and scalable
- Crawling can be initiated via multiple channels
- Direct kernel invocation
- Cron script
- Filesystem changes via inotify
- Device mount / media insertion