1. Fixing Search
This is a scratch pad of the components involved in a global, content-aware searching solution for GNOME. This was discussed on #gnome-hackers on Friday, April 06 2007.
We hope to use this page to outline the major functional units involved, and identify the various constraints, requirements and dependencies of each. This will help us to move forward on a unified approach to search, however that may be embodied. For now, we should try to avoid detailed technical discussion of formats, protocols, transport mechanisms, processes and libraries.
Lead engineers from both existing projects (Beagle and Tracker) have pledged to work together to find an acceptable compromise.
2. Functional Components
2.1. Metadata Storage Engine
This component is responsible for storing and retrieving rich metadata related to an arbitrary URL. The URL may not necessarily "exist" or have a physical location, either locally or on the network.
The metadata stored will consist of many kinds of information, from strings to binary blobs to lists, though the representation of that data is internal to the engine.
The metadata query interface exposed to users of the engine is naive. Users can:
- Enumerate all metadata records for a given URL
- Enumerate all known URLs with a certain metadata field set
- Filter these results using basic atomic matching rules (==, !=, <, >, contains)
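As a rough illustration of the atomic matching rules listed above, here is a minimal Python sketch. The names `MATCH_RULES` and `filter_records`, the field names, and the example URLs are all hypothetical; nothing here reflects an agreed interface or any existing Beagle or Tracker API.

```python
import operator

# Hypothetical rule table: one predicate per atomic matching rule.
MATCH_RULES = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "contains": lambda value, operand: operand in value,
}


def filter_records(records, field, rule, operand):
    """Return the URLs whose `field` value satisfies `rule` against `operand`.

    URLs that do not have the field set at all are skipped.
    """
    match = MATCH_RULES[rule]
    return [
        url
        for url, fields in records.items()
        if field in fields and match(fields[field], operand)
    ]


# Example data (made up). The last URL has no physical location.
records = {
    "file:///home/joe/notes.txt": {"title": "Search notes", "size": 2048},
    "file:///home/joe/photo.jpg": {"title": "Holiday", "size": 900000},
    "note://tomboy/42": {"title": "Search ideas"},
}

large = filter_records(records, "size", ">", 4096)
about_search = filter_records(records, "title", "contains", "Search")
```

Each rule compares a single field against a single operand, which keeps the query interface naive; anything richer (boolean combinations, ranking) would belong to a higher-level component.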
The metadata storage interface exposed to users of the engine is naive. Users can:
- Set/Add a metadata record value for a URL
- Delete a specific metadata record for a URL
- Delete all metadata records for a URL
The metadata stored may duplicate inline metadata contained elsewhere, such as in an on-disk file format. The metadata engine is not responsible for synchronizing data to or from such file representations. It simply responds to requests to store or retrieve data from its backing store.
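To make the shape of the storage interface concrete, here is a toy in-memory sketch in Python. The class name `MetadataEngine` and its method names are assumptions made for illustration only. A real engine would persist to one or more on-disk stores, and how it does so is deliberately left open here.

```python
class MetadataEngine:
    """Toy in-memory stand-in for the storage engine (hypothetical API)."""

    def __init__(self):
        # Maps url -> {field: value}. Values may be strings, bytes, lists, ...
        self._store = {}

    def set(self, url, field, value):
        """Set (or add) a metadata record value for a URL."""
        self._store.setdefault(url, {})[field] = value

    def delete(self, url, field):
        """Delete one metadata record for a URL, if it is present."""
        fields = self._store.get(url, {})
        fields.pop(field, None)
        if not fields:
            # Forget URLs that no longer carry any metadata.
            self._store.pop(url, None)

    def delete_all(self, url):
        """Delete every metadata record for a URL."""
        self._store.pop(url, None)

    def records_for(self, url):
        """Enumerate all metadata records for a given URL."""
        return dict(self._store.get(url, {}))

    def urls_with(self, field):
        """Enumerate all known URLs that have `field` set."""
        return [url for url, fields in self._store.items() if field in fields]


engine = MetadataEngine()
engine.set("file:///home/joe/song.ogg", "artist", "Example Artist")
engine.set("file:///home/joe/song.ogg", "tags", ["demo", "live"])
engine.set("note://tomboy/42", "title", "Search ideas")  # no physical location

engine.delete("file:///home/joe/song.ogg", "tags")
engine.delete_all("note://tomboy/42")
```

Note that the sketch never touches the files behind the URLs: it only answers store and retrieve requests, matching the point above that synchronizing with inline file metadata is someone else's job.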
2.1.1. Constraints
- Robust and scalable
- Must support multiple on-disk storage locations (Joe)
Why? -- AlexGraveley
- Store on removable media
- Read-only data stores shared by all users
- Must be able to enumerate all known on-disk storage locations
- Must store data in relational DB, like sqlite (Jamie)
- Why? Higher levels don't care how it's implemented.
2.1.2. Requirements
2.1.3. Dependencies
None.
2.1.4. Open Questions
2.2. Full-Text Indexer
2.2.1. Constraints
- Robust and scalable
2.2.2. Requirements
2.2.3. Dependencies
None.
2.2.4. Open Questions
2.3. Crawling Agent
2.3.1. Constraints
2.3.2. Requirements
- Crawling can be initiated via multiple channels
- Direct kernel invocation
Why? -- AlexGraveley
Faster start up time - no need to do a stat dance to work out all directories to be watched - JamieMcCracken
- Cron script
- Filesystem changes via inotify
Jamie says inotify fails for many files -- AlexGraveley
- Then we need to fix inotify, or else it's completely useless
Inotify fails for remote mounts (NFS, SAMBA etc) JamieMcCracken
- Device mount / media insertion
2.3.3. Dependencies
2.4. Query Agent
2.4.1. Constraints
2.4.2. Requirements