Camel Searching

Searching is one of the core features provided by Camel folders. A very similar mechanism is used to implement filtering, although there are some subtle differences. Searching is also the basis of the vFolder implementation.

Searches are based on s-expressions, which are Lisp-like bracketed expressions; these are easy to parse and easy to extend. They are implemented using Evolution/EDS.ESexp.

Camel.FolderSearch

CamelFolderSearch is a class which is used to implement searching. All of the basic search functions are implemented using virtual methods, which allows them to be overridden by implementations.

All of the functions work over the message-set currently being searched. This will be a subset of a given folder. They return either a UID of the matching message, or an array of UIDs.

Note that they must all return a vector result; for most functions this means wrapping them in a match-all block. The body-contains function, however, may well be more efficient to implement as a vector result directly, and the same applies to the header-matching functions.

Control functions

'(' 'match-all' expression ')'

match-all will convert a boolean expression into an array, and will evaluate expression against every message in the current message-set.
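
As a rough illustration of that behaviour (not Camel's actual code), the following Python sketch models match-all as a function that applies a boolean predicate to every message in the current message-set and collects the UIDs of those that match. The Message class, its fields and the match_all name are hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Message:
    """Hypothetical stand-in for the summary data held for one message."""
    uid: str
    flags: Set[str] = field(default_factory=set)


def match_all(message_set: Iterable[Message],
              predicate: Callable[[Message], bool]) -> List[str]:
    """Evaluate a boolean predicate per message and return the matching UIDs."""
    return [msg.uid for msg in message_set if predicate(msg)]


# Usage: the predicate plays the role of the boolean expression.
messages = [Message("1", {"seen"}), Message("2"), Message("3", {"deleted"})]
unread = match_all(messages, lambda m: "seen" not in m.flags)
```

The point of the sketch is only the shape of the result: a per-message boolean goes in, an array of UIDs comes out.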

'(' 'match-threads' '"' thread-type '"' expression ')'

 thread-type = 'none' | 'all' | 'replies' | 'replies_parents'

match-threads evaluates the expression, which must return an array. Then, for every message in the message-set, it will include any messages that are also part of the same thread, based on the thread-type option.

Matching functions

'(' 'body-contains' string * ')'

Will match messages that contain any of the supplied strings in textual content. The strings are matched using a sub-string match. The implementation will search text/plain parts and will also perform a simple tag-strip on text/html parts. Note that encrypted or non-text parts will not be searched, although a different implementation may do whatever it wants. For example, IMAP will intercept this function and use the IMAP SEARCH command instead.

If an index is supplied, then it will be used to perform this search. See #Index matching for some details on how this works.

Then come the header-matching functions:

'(' 'header-contains' search-header string * ')'
'(' 'header-matches' search-header string * ')'
'(' 'header-starts-with' search-header string * ')'
'(' 'header-ends-with' search-header string * ')'

 search-header = '"' search-header-options '"'
 search-header-options = 'subject' | 'date' | 'to' | 'cc' | 'from' | 'x-camel-mlist'

These will look up the header identified by the first string, and then perform the particular match against its value. A non-existent header will be treated as an empty string, so that a search against "" will always match everything.
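
The empty-string rule can be sketched in Python as follows. This is a minimal model under stated assumptions, not the Camel implementation: the header_match function and the dictionary of header values are hypothetical, case handling is ignored, and header-matches is left out because the text above does not spell out its comparison.

```python
from typing import Mapping


def header_match(headers: Mapping[str, str], name: str,
                 needle: str, kind: str) -> bool:
    """Match needle against one header; a missing header counts as ''."""
    value = headers.get(name, "")
    if kind == "contains":
        return needle in value
    if kind == "starts-with":
        return value.startswith(needle)
    if kind == "ends-with":
        return value.endswith(needle)
    raise ValueError(f"unsupported match kind: {kind}")


# Usage: this message has a subject but no cc header.
info = {"subject": "Build report for Tuesday"}
has_report = header_match(info, "subject", "report", "contains")
empty_cc = header_match(info, "cc", "", "contains")
```

Because the missing cc header is treated as "", the empty search string matches it, while any non-empty string does not.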

Note that for searching (and vFolders), the raw message data is not used. This is so that searching can run as fast as possible. Instead, only the data from the CamelMessageInfo is available. This has two effects: first, only a subset of headers is available to match against; second, character set information has already been processed, so invalidly encoded headers cannot be reinterpreted under a different locale.

The headers which can be searched against are limited to these:

  • subject: The Subject value. It is converted to UTF-8, and defolded.
  • date: The GMT-normalised sent-date.
  • from: The from address, taken from either the From or the Sender header. The address string is first un-parsed into a structured form, and then matched as in Evolution/address matching.
  • to, cc: The to addresses, from all of the To or Cc headers.
  • x-camel-mlist: A special pseudo-header which matches against the mlist token.

'(' 'header-exists' expression ')'

Don't use this: it may well crash, because whoever implemented it did it wrong! It relies on the raw message, which search does not normally have access to; it was simply copied from the filter matching code.

'(' 'uid' string * ')'

Will match any messages with the UIDs supplied.

Query functions

There is a set of functions which can query information about the current message (inside a match-all).

'(' 'user-tag' string ')'
'(' 'user-flag' string ')'
'(' 'get-sent-date' ')'
'(' 'get-received-date' ')'
'(' 'get-current-date' ')'
'(' 'get-size' ')'

These will retrieve the various fields from the CamelMessageInfo structure.

'(' 'system-flag' '"' system-flag '"' ')'

 system-flag = 'answered' | 'deleted' | 'draft' | 'flagged' | 'seen'
               | 'attachments' | 'junk' | 'secure'

This will look up a system flag by name in the CamelMessageInfo structure and return a boolean indicating whether the flag is set.

Camel.FilterSearch

This is not a class, but just a function which matches against a single message. Since this has access to the message itself, additional matches can be made. For example, all of the header searches run against the actual headers.

It is used by Evolution/Camel.FilterDriver to implement filter matching, but could be used for any general purpose message matching, where searching is inadequate.

Control functions

'(' 'match-all' expression ')'

match-all is basically a no-op; it exists so that search expressions can be re-used for filtering. It just executes expression.

Matching functions

'(' 'body-contains' string * ')'

Will match messages that contain any of the supplied strings in textual content. The strings are matched using a sub-string match. The implementation will search text/plain parts and will also perform a simple tag-strip on text/html parts. Note that encrypted or non-text parts will not be searched, and an index is not used.

Then come the header-matching functions. These match against the physical headers, although the special x-camel-mlist pseudo-header is also supported as a header name.

'(' 'header-contains' string string * ')'
'(' 'header-matches' string string * ')'
'(' 'header-starts-with' string string * ')'
'(' 'header-ends-with' string string * ')'
'(' 'header-soundex' string string * ')'

These will look up the header identified by the first string, and then perform the particular match against its value. A non-existent header will be treated as an empty string, so that a search against "" will always match everything. A multi-valued header will be matched against each header present, and the results OR'd together. Address headers are matched using #Address matching; all of the other headers are matched using Evolution/String matching, and the data is interpreted to be in an rfc2047 encoded form with comments.

'(' 'header-regex' string string ')'

This will find the first occurrence of the header identified by the first string, and apply the regular expression supplied in the second string against it. Only the RAW content of the header is matched against. The regular expression is case insensitive.

'(' 'header-full-regex' string ')'

Here the string is the pattern to apply to the full raw headers. The headers will be in RAW format, and probably in un-folded form, but otherwise the same as a normal RFC-822 header list. REG_NEWLINE is also passed to regcomp (it is a compile-time flag), so that it behaves like perl or egrep expressions.
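
To give a feel for what REG_NEWLINE means for a pattern, here is a small Python analogue. It is only an approximation of the POSIX behaviour: re.MULTILINE makes ^ and $ match at every line of the header block, and Python's "." already refuses to cross a newline. The header_full_regex function name and the sample headers are made up for the example.

```python
import re

# Sample un-folded header block, one header per line (illustrative data).
RAW_HEADERS = (
    "From: Alice <alice@example.com>\n"
    "To: notzed@somecompany.com\n"
    "Subject: Build report\n"
)


def header_full_regex(raw_headers: str, pattern: str) -> bool:
    """Apply a case-insensitive, line-anchored regex to the whole header block."""
    flags = re.MULTILINE | re.IGNORECASE
    return re.search(pattern, raw_headers, flags) is not None


# Usage: ^ and $ anchor to a single header line, not to the whole block.
to_company = header_full_regex(RAW_HEADERS, r"^to: .*somecompany\.com$")
spans_lines = header_full_regex(RAW_HEADERS, r"^From: .*Subject")
```

The second pattern fails because ".*" cannot run past the end of the From line, which is the practical consequence of newline-sensitive matching.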

'(' 'header-exists' string ')'

Will return true if the given header exists.

'(' 'header-source' string ')'

This will attempt to match the source-account of the message against the string. The string is the URI of the Evolution/Camel.Service from which the message came, determined using Evolution/Camel.DataWrapper#Camel.MimeMessage.get_source(). This function is pretty badly named, since it isn't really implemented using a header.

Note that the CamelService URI is not the same as the Evolution account URI, which is backend-independent.

Query functions

There is a set of functions which can query information about the current message (inside a match-all).

'(' 'user-tag' string ')'
'(' 'user-flag' string ')'
'(' 'get-sent-date' ')'
'(' 'get-received-date' ')'
'(' 'get-current-date' ')'
'(' 'get-size' ')'
'(' 'system-flag' string ')'

These are the same as the CamelFolderSearch functions.

Processing functions

'(' 'junk-test' ')'

This will run the message against the junk processing filter if one is set on the Evolution/Camel.Session. It will return a boolean to indicate if the message was detected as junk.

'(' 'pipe-message' string string * ')'

This will send the message to an external command, and return its exit status as an integer. The command is executed directly using execvp given the arguments supplied; normally you will want to run the command inside a shell. The command is given the raw RFC822 message in its entirety.
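
The same idea can be sketched in Python with the standard subprocess module. This is an analogue rather than the Camel code: pipe_message here is a hypothetical helper, and it assumes the message is delivered on the command's standard input, which is what the grep example further down implies.

```python
import subprocess
import sys
from typing import Sequence


def pipe_message(raw_message: bytes, argv: Sequence[str]) -> int:
    """Run argv directly (no implicit shell), feed it the raw message on
    stdin, and return the command's exit status."""
    completed = subprocess.run(
        list(argv),
        input=raw_message,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# Usage: a tiny checker that exits 0 when the message mentions "java".
checker = [
    sys.executable, "-c",
    "import sys; sys.exit(0 if b'java' in sys.stdin.buffer.read() else 1)",
]
status = pipe_message(b"Subject: hi\n\nI like java\n", checker)
```

As with execvp, the argument list is passed through unchanged, so shell features such as pipes only work if the command itself is a shell.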

Matching types

The matching performed depends a little on the types.

String matching

Strings are matched using a simple algorithm. If the strings are all lower-case, then a case-insensitive match (based on the current locale) is attempted. If any of the characters are upper-case, then a simple case-sensitive match is performed.
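
A minimal Python sketch of that rule, assuming a sub-string match and ignoring locale-specific case folding (str.lower() is used instead). The string_contains name is hypothetical.

```python
def string_contains(haystack: str, needle: str) -> bool:
    """Sub-string match whose case sensitivity depends on the search string."""
    if any(ch.isupper() for ch in needle):
        # At least one upper-case character: plain case-sensitive match.
        return needle in haystack
    # All lower-case search string: fall back to a case-insensitive match.
    return needle in haystack.lower()


# Usage
loose = string_contains("Hello Java World", "java")
strict = string_contains("hello java world", "Java")
```

The practical effect is that typing a search in lower case matches broadly, while adding a capital letter narrows the search to an exact-case match.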

Address matching

Addresses are matched against each part of the address separately, i.e. the real-name and addr-spec parts of the address are treated separately and the results are OR'd together.

If the header being matched against was pre-parsed, then it will be converted back to a structured format first.
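
The per-part matching can be illustrated with Python's standard email.utils parser. This is a sketch under assumptions, not Camel's address matcher: address_contains is a hypothetical name, and the comparison is simplified to a case-insensitive sub-string test.

```python
from email.utils import getaddresses


def address_contains(header_value: str, needle: str) -> bool:
    """Match needle against the real-name and addr-spec of every address
    in the header separately, OR-ing the results together."""
    needle = needle.lower()
    for real_name, addr_spec in getaddresses([header_value]):
        if needle in real_name.lower() or needle in addr_spec.lower():
            return True
    return False


# Usage
to_header = "Not Zed <notzed@somecompany.com>, Bob <bob@example.org>"
by_name = address_contains(to_header, "not zed")
by_addr = address_contains(to_header, "bob@example")
across_parts = address_contains(to_header, "zed <notzed")
```

The last call returns False: a search string that straddles the name and the address never matches, because the two parts are never compared as one string.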

Regex matching

Regular expressions are matched case insensitively.

Index matching

When an index is used to match, a more complex algorithm is used to achieve a substring match. This is because the index itself only contains stripped down words, which have had spaces and punctuation removed.

Likewise, when a string is to be matched in the index, it undergoes transformation into tokens which will exist in the index itself.

  1. Punctuation is stripped, and used as a word separator.
  2. Words are separated out.
  3. Each word is looked up in the index separately.
  4. The resultant sets are then ANDed together.

If there was more than one subword present, or trailing or leading punctuation was present, then the resultant set is scanned message by message to perform a manual sub-string match.

In practice this works very effectively at quickly limiting the search space, so that even a simple sub-string index can greatly improve complex substring searching.
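
The whole procedure can be sketched in a few lines of Python. Everything here is illustrative: the dictionary-based index, the function names, and the assumption that a query word matches any indexed word containing it are choices made for this example, not a description of the Camel index format.

```python
import re
from typing import Dict, List, Set


def tokenise(text: str) -> List[str]:
    """Lower-case the text and split it on punctuation and whitespace."""
    return [w for w in re.split(r"\W+", text.lower()) if w]


def build_index(bodies: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each stripped-down word to the set of UIDs containing it."""
    index: Dict[str, Set[str]] = {}
    for uid, body in bodies.items():
        for word in tokenise(body):
            index.setdefault(word, set()).add(uid)
    return index


def index_search(bodies: Dict[str, str], index: Dict[str, Set[str]],
                 needle: str) -> Set[str]:
    """Sub-string search that uses the index to narrow the candidates."""
    words = tokenise(needle)
    if not words:
        return set()
    candidates = set(bodies)
    for word in words:
        # Assumption: a query word hits every indexed word that contains it.
        hits: Set[str] = set()
        for indexed_word, uids in index.items():
            if word in indexed_word:
                hits |= uids
        candidates &= hits  # AND the per-word result sets together
    # Several subwords, or leading/trailing punctuation: verify by hand.
    if len(words) > 1 or needle.lower() != words[0]:
        candidates = {uid for uid in candidates
                      if needle.lower() in bodies[uid].lower()}
    return candidates


# Usage
bodies = {
    "1": "I like Java, and coffee.",
    "2": "javascript only",
    "3": "coffee and java",
}
index = build_index(bodies)
single = index_search(bodies, index, "java")
phrase = index_search(bodies, index, "java, and")
```

For the phrase search the index narrows the candidates to messages 1 and 3, and only the final scan discovers that message 3 has the words in the wrong order.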

Examples

A few contrived examples to show how the expressions are put together.

Example: Checking some mail against an external program

This is a filter search example which checks for mail sent to notzed@somecompany.com which additionally contains the word java in lower case, checked using grep.

 (and
  (or
    (header-contains "to" "notzed@somecompany.com")
    (header-contains "cc" "notzed@somecompany.com"))
  (=
    0
    (pipe-message "/bin/sh" "-c" "grep java")))

Note that this is a poor example of pipe-message; normally the command will do something more complex, e.g. matching against address lists.

Example: Mail sent to a user and not others

This is a search match which matches mail sent to notzed@somecompany.com, but not mail also sent to zed@somecompany.com.

 (match-all
  (and
   (header-contains "to" "notzed@somecompany.com")
   (not
    (or
     (header-contains "to" "zed@somecompany.com")
     (header-contains "cc" "zed@somecompany.com")))))

Note that since header-contains will return a vector result when not inside a match-all, the surrounding match-all is not strictly required in this example.

Example: Unread, undeleted mail

This is a search expression which matches all unread and undeleted mail.

 (match-all
  (not
   (or
    (system-flag "deleted")
    (system-flag "seen"))))

Notes

There are two basic search interfaces because filtering has access to more information, which may slow down a normal search; however, it might make sense to merge them into a single object.

Evolution abstracts the E-Sexp details from the user and application by using the filter code (see evolution/filter). This system basically implements a macro language using XML syntax. This lets the s-expressions that are actually evaluated be changed at any time, without affecting saved details or the application. For this reason, much of the above is internal information; normally the details should not be exposed to the user.

The match-all stuff is a bit weird. If it is always there it should just be implicit. It may be possible to then calculate an optimised execution path for vectorised functions automatically - which is what this is trying to do manually.

Apps/Evolution/Camel.Search (last edited 2013-08-08 22:50:04 by WilliamJonMcCann)