Why On-Disk Summaries?

Mail is a real hog - but that depends on how much mail you have. If you only have ~10K messages it is fine, once you get 100k+ messages and big folders it really grows. 2.2 has made vFolders quite a bit cheaper - but even then there are still scalability issues. It is hard to fix, because they still need to run quickly. And people want mail access from e-d-s, which isn't going to be easy, and adds other problems. The main problem is that it keeps a 'summary' of all messages in memory for a given folder (and generally, for all folders) - and not only does it do this, most of the api's reflect and rely on this, so it isn't simply a matter of changing the backend.

So, splitting the mail problem up a bit and assuming we wanted to move mail into eds too:

vFolders

These are just searches. The problem is now that searches are synchronous and return all results, and rely on in-memory data for performance. If the search was made incremental, it could take advantage of 'on-disk summaries' to only load bits of the summary into memory at a time, rather than loading all of the summary into memory. This would slow them down somewhat though and just convert possible paging (if you have insufficien memory) into manual i/o - normal speed/performance tradeoff - it's alll very well saying your application uses less 'core memory' but if it is just paging anyway it could be just as slow if not worse (since global system i/o scheduling has less of a chance to knead the system). Also, many searches are on data which cannot change- headers, content, rather than message state - these could be optimised/cached further so the search doesn't have to be re-done every startup, so there is room for possible improvement here which could offset the incremental search. But certain things like threading are currently calculated each time - and it is difficult to cache any results since each new message has to be potentially matched against all other messages in the folder.

This is difficult to test - nontrivial changes are required to the search api and implementation (one api lets you search a few messages at a time, but this would be much more expensive than a proper cursor-based implementation).

The folder view

The folder view is also basically just a search, of the current folder. It currently always has to get all messages into memory. With a properly written custom treeview model, it may be possible to make this use much less memory - but it would be very difficult, and impact on performance significantly. Many operations on the tree (e.g. setting it up, and sorting) require access to all rows of the tree (sometimes many times), adding a disk-fetch for such operations is going to blow out the already slow performance here. I don't see this as being practical.

If we moved mail into eds, then we'd also have to fetch the whole list of messages from eds too. At least if we had an incremental search we wouldn't have to load the whole lot into eds, just pipe it from eds to the application. So it wouldn't really be any worse than it is currently, apart from being slower from the CORBA overheads.

The other alternative is have a new custom tree widget, which has a 'remote' model. i..e the folder is managed inside eds, the sorting, threading and other view related things are also managed there (conceivably, you could even do server-side sorting etc with imap). Then it only pipes data required for the currently displayed messages (i.e. 20-80 or so, rather than thousands) based on the scrollbar position. This would obviously require the most work, but would let eds do things like store temporary results on disk to reduce its memory use/etc, and reduce the 'startup time' since the view would only need to retrieve a tiny fraction of the folders list (done right, it would probably be faster than the current mechanism). A custom widget would let us make problems like 'the selection performance bottleneck' vanish - but we're probably talking 1 man year to write it. It would require a lot of changes to camel api's too (although this could end up with simpler api's ...).

NB: If we didn't have to calculate the 'unread' counts of vFolders, we could just postpone the vFolder search until the user actually viewed the folder. This would save the need to keep vFolders 'open' all the time. vFolders probably need to become persistent objects rather than dynamic queries, so they can be updated incrementally without needing to keep everything in memory.

API Changes to help

The CamelFolder object is the core of all these operations, so what specific api changes might help the above:

Remove these calls:

 guint32                   camel_folder_get_message_flags      (CamelFolder *folder, const char *uid);
 gboolean          camel_folder_set_message_flags      (CamelFolder *folder, const char *uid, guint32 flags, guint32 set);
 gboolean          camel_folder_get_message_user_flag  (CamelFolder *folder, const char *uid, const char *name);
 void              camel_folder_set_message_user_flag  (CamelFolder *folder, const char *uid, const char *name, gboolean value);
 const char *      camel_folder_get_message_user_tag  (CamelFolder *folder, const char *uid, const char *name);
 void              camel_folder_set_message_user_tag  (CamelFolder *folder, const char *uid, const char *name, const char *value);

They are already replaced by just talking to the 'message info' structure (this isn't directly related, but its a good time to mention them).

Remove these:

 GPtrArray *        camel_folder_get_summary           (CamelFolder *folder);
 void               camel_folder_free_summary          (CamelFolder *folder, GPtrArray *array);
 GPtrArray *        camel_folder_get_uids              (CamelFolder *folder);
 void               camel_folder_free_uids             (CamelFolder *folder, GPtrArray *array);

These are just two different ways to get a list of all messages, but they both only work on full message sets, which we must avoid.

Make these more capable and incremental:

 gboolean           camel_folder_has_search_capability (CamelFolder *folder);
 GPtrArray *       camel_folder_search_by_expression  (CamelFolder *folder, const char *expr, CamelException *ex);
 GPtrArray *       camel_folder_search_by_uids        (CamelFolder *folder, const char *expr, GPtrArray *uids, CamelException *ex);
 void              camel_folder_search_free           (CamelFolder *folder, GPtrArray*);

i.e., run a 'query' and get back a cursor, which you then iterate over all values. A shortcut, like no query == just get all items.

Each iteration of the cursor gets you a refcounted 'messageinfo' on which you can do whatever you want, and contains all the information you need.

It would also be nice to remove the s-exp from this level, and pass in an expression tree, but not necessary.

Now, if we added the ability to do sorting and threading on the query at the same time, and search ranges of values/and or do reverse iteration over the cursor - that is enough to get a basic 'remote view' working.

e.g. something like

 CFS = camel_folder_search_new(CamelFolder, flags, expression)
  • get a search object/handle, flags could specify global options like threading

 camel_folder_search_add_sort(CFS, columnid, direction)
  • add sort field in order from high to low priority. This is only really necessary if you wanted a remote view

 camel_folder_search_add_uids(CFS, uids)
  • add a limiting set of uid's to match against, rather than all. This is required for vFolder efficiency.

 mi = camel_folder_search_next(cfs, CamelException)
 mi = camel_folder_search_prev(cfs, CamelException)
  • iterate through the search results. Prev is only really required if you wanted a remote view.

and others perhaps ... like set_range() to select specific 'rows' of the search, etc.

(one question would be whether the search auto-updates as folders change, or if it locks itself based on the time it was run).

Once there is an iterative api for listing messages (with a search), then a whole host of lighter-weight things become possible, although for things like vFolders, you're probably still going to have to store the results somewhere - and then you run into possible consistency problems.

Oh, and for all of the above to have any beneificial effect, the camelfoldersummary object would basically need to be re-done, so it could store the 'summaries' on disk. Even if it went to the real message header to build the messageinfo, and only stored offsets, that might not be too slow (actually it probably would be). But it might just be easier/more reliable to just use a libdb database (that might be slow too, but the ability to perform transactions and let libdbhandle all the tricky data-consistency issues is a big plus).

EDS wrappers to camel

This should be simple, so it isn't too much overhead to implement. Probably just a 'store' concept and a 'folder' concept, but you just get a search/iterator of message-info's (probably blocks rather than one at a time), and an api to get an rfc2822 message 'stream' which is decoded/handled by local camel-mime-message/camel-data-wrapper processing (and same for putting messages). So no need to fully wrap all camel api directly. The eds api should be synchronous+threaded, not the really messy calendar/addressbook approach using an asynchronous listener, which is just too difficult to program. The big problem would be with lifecycle issues and crashing or badly behaved clients, for any of the stateful calls.

There is probably also something to be said for changing the camel_folder_get_message() api to just return a raw message stream anyway, not just for eds. This would make certain operations more efficient, and also make it obvious that the message is a read-only object (it would however, kill the whole 'imap4.1 ideal' of only downloading the bits of a message you are looking at - but that has proven to be far more effort than it was ever worth - we could still do range fetches anyway, which would probably work as well or nicer in the ui).

Apps/Evolution/On-disk_summaries (last edited 2013-08-08 22:50:00 by WilliamJonMcCann)