1. Fixing Search

This page is a scratch pad listing the components involved in a global, content-aware search solution for GNOME, as discussed on #gnome-hackers on Friday, April 6, 2007.

We hope to use this page to outline the major functional units involved, and to identify the constraints, requirements and dependencies of each. This will help us move forward on a unified approach to search, however that may be embodied. For now, we should avoid detailed technical discussion of formats, protocols, transport mechanisms, processes and libraries.

Lead engineers from both existing projects (Beagle and Tracker) have pledged to work together to find an acceptable compromise.

2. Functional Components

2.1. Metadata Storage Engine

This component is responsible for storing and retrieving rich metadata related to an arbitrary URL. The URL need not "exist" or have a physical location, either locally or on the network.

The metadata stored will consist of many kinds of information, from strings to binary blobs to lists, though the representation of that data is internal to the engine.

The metadata query interface exposed to users of the engine is naive. Users can:

Enumerate all metadata records for a given URL
Enumerate all known URLs with a certain metadata field set
- Filter these results using basic atomic matching rules (==, !=, <, >, contains)

The metadata storage interface exposed to users of the engine is naive. Users can:

Set/Add a metadata record value for a URL
Delete a specific metadata record for a URL
Delete all metadata records for a URL
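
The query and storage operations listed above can be sketched as a small in-memory class. This is only an illustration of the shape of the interface, not a proposed design: the class name `MetadataStore`, its method names, and the example URLs are all hypothetical, and a real engine would persist its data rather than hold it in a dictionary.

```python
import operator
from collections import defaultdict

# The basic atomic matching rules named in the query list above.
# The mapping itself is an assumption made for this sketch.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "contains": lambda value, needle: needle in value,
}


class MetadataStore:
    """Hypothetical in-memory stand-in for the metadata storage engine."""

    def __init__(self):
        # url -> {field name -> value}
        self._records = defaultdict(dict)

    # --- storage interface ---

    def set(self, url, field, value):
        """Set (or add) a metadata record value for a URL."""
        self._records[url][field] = value

    def delete(self, url, field):
        """Delete one specific metadata record for a URL."""
        self._records.get(url, {}).pop(field, None)

    def delete_all(self, url):
        """Delete all metadata records for a URL."""
        self._records.pop(url, None)

    # --- query interface ---

    def records(self, url):
        """Enumerate all metadata records for a given URL."""
        return dict(self._records.get(url, {}))

    def urls_with(self, field, op=None, operand=None):
        """Enumerate URLs that have `field` set, optionally filtered."""
        matches = []
        for url, fields in self._records.items():
            if field not in fields:
                continue
            if op is None or _OPS[op](fields[field], operand):
                matches.append(url)
        return sorted(matches)
```

For example, `store.urls_with("title", "contains", "Budget")` would return only the URLs whose `title` record contains that string. Note that nothing here checks whether a URL refers to anything that actually exists, in keeping with the description above.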

The metadata stored may duplicate inline metadata contained elsewhere, such as in an on-disk file format. The metadata engine is not responsible for synchronizing data to or from such file representations. It simply responds to requests to store or retrieve data from its backing store.

2.1.1. Constraints

Robust and scalable
Must support multiple on-disk storage locations (Joe)
- Why? -- AlexGraveley
  - Store on removable media
  - Read-only data stores shared by all users
Must be able to enumerate all known on-disk storage locations
Must store data in relational DB, like sqlite (Jamie)
- Why? Higher levels don't care how it's implemented.

2.1.2. Requirements

2.1.3. Dependencies

None.

2.1.4. Open Questions

2.2. Full-Text Indexer

2.2.1. Constraints

Robust and scalable

2.2.2. Requirements

2.2.3. Dependencies

None.

2.2.4. Open Questions

2.3. Crawling Agent

2.3.1. Constraints

2.3.2. Requirements

Crawling can be initiated via multiple channels
- Direct kernel invocation
  - Why? -- AlexGraveley
  - Faster start-up time: no need to do a stat dance to work out all directories to be watched -- JamieMcCracken
- Cron script
- Filesystem changes via inotify
  - Jamie says inotify fails for many files -- AlexGraveley
    - Then we need to fix inotify, or else it's completely useless
    - Inotify fails for remote mounts (NFS, Samba, etc.) -- JamieMcCracken
- Device mount / media insertion
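
One way to picture "multiple channels" is as several triggers feeding a single queue of crawl requests. The sketch below is an assumption made for illustration only: the names `Trigger` and `CrawlQueue` are hypothetical, and no real kernel, cron, inotify, or mount integration is shown. It simply records which channel asked for a path and ignores a request for a path that is already waiting.

```python
from collections import deque
from enum import Enum


class Trigger(Enum):
    """The initiation channels listed above (names are hypothetical)."""
    KERNEL = "direct kernel invocation"
    CRON = "cron script"
    INOTIFY = "filesystem change via inotify"
    MOUNT = "device mount / media insertion"


class CrawlQueue:
    """Hypothetical single entry point shared by all channels."""

    def __init__(self):
        self._pending = deque()  # (path, trigger) in arrival order
        self._queued = set()     # paths currently waiting to be crawled

    def request(self, path, trigger):
        """Queue a crawl of `path`; return False if it is already waiting."""
        if path in self._queued:
            return False
        self._queued.add(path)
        self._pending.append((path, trigger))
        return True

    def next(self):
        """Return the oldest pending (path, trigger), or None if idle."""
        if not self._pending:
            return None
        path, trigger = self._pending.popleft()
        self._queued.discard(path)
        return path, trigger
```

With this shape, a channel that is unreliable in some situation (for example inotify on remote mounts, as noted above) can be backed up by another channel without the crawler itself needing to know which one fired.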