Camel.MimeParser

CamelMimeParser is an object which parses either a raw message (e.g. a Maildir entry) or a Berkeley Mailbox file from a stream or a file descriptor. It is a state machine which handles all the failure cases (truncated parts, missing multipart bits, I/O errors, etc.): you are always guaranteed to get a valid state sequence, and you can skip bits you are not interested in if you just want to scan the headers.

In addition to being a push, stack-based parser, it can also automatically re-format data as it is parsed, skip sections efficiently, and perform a few other little tricks. This is the absolute core object used for all mail parsing, and is robust and fast. It has one minor bug due to an early design change, but is otherwise very reliable.

Parser States

There are a number of states defined for the parser.

 typedef enum _camel_mime_parser_state_t {
        CAMEL_MIME_PARSER_STATE_INITIAL,
        CAMEL_MIME_PARSER_STATE_PRE_FROM,       /* data before a 'From' line */
        CAMEL_MIME_PARSER_STATE_FROM,           /* got 'From' line */
        CAMEL_MIME_PARSER_STATE_HEADER,         /* toplevel header */
        CAMEL_MIME_PARSER_STATE_BODY,           /* scanning body of message */
        CAMEL_MIME_PARSER_STATE_MULTIPART,      /* got multipart header */
        CAMEL_MIME_PARSER_STATE_MESSAGE,        /* rfc822 message */
        CAMEL_MIME_PARSER_STATE_PART,           /* part of a multipart */
        
        CAMEL_MIME_PARSER_STATE_END = 8,        /* bit mask for 'end' flags */
        
        CAMEL_MIME_PARSER_STATE_EOF = 8,        /* end of file */
        CAMEL_MIME_PARSER_STATE_PRE_FROM_END,   /* pre from end */
        CAMEL_MIME_PARSER_STATE_FROM_END,       /* end of whole from bracket */
        CAMEL_MIME_PARSER_STATE_HEADER_END,     /* dummy value */
        CAMEL_MIME_PARSER_STATE_BODY_END,       /* end of message */
        CAMEL_MIME_PARSER_STATE_MULTIPART_END,  /* end of multipart  */
        CAMEL_MIME_PARSER_STATE_MESSAGE_END,    /* end of message */
 } camel_mime_parser_state_t;

The sequence of these states is well defined. For each start state you get a matching END state, although various states are optional. You can use the CAMEL_MIME_PARSER_STATE_END bit to determine whether a state is an end state, although in practice you won't care.
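As a rough illustration of the END bit, the sketch below copies the enum from the listing above (so that it compiles without the Camel headers) and adds two helpers. The names is_end_state and start_state_of are hypothetical and not part of the Camel API; the sketch relies only on the values in the listing, where each end state is its start state with the END bit set.

```c
#include <assert.h>

/* Copied from the listing above so the sketch compiles without Camel headers. */
typedef enum _camel_mime_parser_state_t {
        CAMEL_MIME_PARSER_STATE_INITIAL,
        CAMEL_MIME_PARSER_STATE_PRE_FROM,
        CAMEL_MIME_PARSER_STATE_FROM,
        CAMEL_MIME_PARSER_STATE_HEADER,
        CAMEL_MIME_PARSER_STATE_BODY,
        CAMEL_MIME_PARSER_STATE_MULTIPART,
        CAMEL_MIME_PARSER_STATE_MESSAGE,
        CAMEL_MIME_PARSER_STATE_PART,

        CAMEL_MIME_PARSER_STATE_END = 8,

        CAMEL_MIME_PARSER_STATE_EOF = 8,
        CAMEL_MIME_PARSER_STATE_PRE_FROM_END,
        CAMEL_MIME_PARSER_STATE_FROM_END,
        CAMEL_MIME_PARSER_STATE_HEADER_END,
        CAMEL_MIME_PARSER_STATE_BODY_END,
        CAMEL_MIME_PARSER_STATE_MULTIPART_END,
        CAMEL_MIME_PARSER_STATE_MESSAGE_END
} camel_mime_parser_state_t;

/* Hypothetical helper: non-zero if 'state' has the END bit set
 * (EOF shares that bit, so it counts as an end state too). */
int is_end_state(camel_mime_parser_state_t state)
{
        return (state & CAMEL_MIME_PARSER_STATE_END) != 0;
}

/* Hypothetical helper: map an end state back to its start state by
 * clearing the END bit. EOF maps back to INITIAL. */
camel_mime_parser_state_t start_state_of(camel_mime_parser_state_t state)
{
        assert(is_end_state(state));
        return (camel_mime_parser_state_t)(state & ~CAMEL_MIME_PARSER_STATE_END);
}
```

With these values, FROM_END maps back to FROM and MESSAGE_END back to MESSAGE.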

The following diagram shows the state sequence you will get parsing with From enabled; without it, the From state is simply removed. The two states labelled <> are not states in themselves but transition points: the parser actually moves on to one of the other states that the solid arrows point to. * means the state can occur 0 or more times, and 1* means 1 or more times.

[Diagram: CamelMimeParser state sequence]

This was created from the Dia file in camel/devel-docs/camel_parser_states.dia in the source distribution.

The PRE_FROM state is missing from that diagram, but would sit between INITIAL and FROM.

The interface

This is some extended documentation on the parser API entry points. The API documentation (or the source) for this file should also be consulted, as it covers the details too.

First, you need to create and initialise the parser. The parser content can come either from an Evolution/Camel.Stream or from a Unix file descriptor. The latter exists because CamelMimeParser does its own buffering, and this is more efficient. The parser will steal (take ownership of) the file descriptor.

Note that CamelMimeParser is a final object and cannot be subclassed; the virtual methods defined in its class are not implemented.

 CamelMimeParser *camel_mime_parser_new(void);
 int camel_mime_parser_init_with_fd(CamelMimeParser *parser, int fd);
 int camel_mime_parser_init_with_stream(CamelMimeParser *parser, CamelStream *stream);

Then there are a couple of functions to get the stream and file descriptor back. These may be used so that the stream or file descriptor does not need to be tracked separately.

 CamelStream *camel_mime_parser_stream(CamelMimeParser *parser);
 int camel_mime_parser_fd(CamelMimeParser *parser);

To greatly simplify the use of the parser states, it never returns an error. Any error in the input will cause the parser to terminate each outstanding state in turn, and return no data content. During this process, or afterwards, the parser can be queried to see if an I/O error occurred. Note that the parser itself never signals a data format error; all data encountered will go somewhere, so it is not lost, although it may be reformatted slightly. This is so that the user of the parser doesn't have to abort if it gets invalid data; it can fail gracefully.

 int camel_mime_parser_errno(CamelMimeParser *parser);
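The unwinding behaviour described above can be modelled with a toy stack of open states. This is only a sketch of the idea, not Camel's implementation: toy_parser, toy_open, toy_fail, toy_step_after_error and the TOY_ constants are all invented for illustration, and the error is injected by hand rather than coming from real I/O.

```c
#include <assert.h>

/* Toy state numbers, chosen to mirror the start/END pairing of the enum. */
enum { TOY_FROM = 2, TOY_MULTIPART = 5, TOY_MESSAGE = 6, TOY_END = 8, TOY_EOF = 8 };

#define TOY_MAX_DEPTH 8

struct toy_parser {
        int open[TOY_MAX_DEPTH]; /* start states still waiting for their END */
        int depth;               /* number of entries in 'open' */
        int ioerrno;             /* 0, or the error code of the failed read */
};

/* Record that a start state has been entered. */
void toy_open(struct toy_parser *p, int state)
{
        assert(p->depth < TOY_MAX_DEPTH);
        p->open[p->depth++] = state;
}

/* Inject an I/O error; a caller can inspect p->ioerrno later. */
void toy_fail(struct toy_parser *p, int err)
{
        p->ioerrno = err;
}

/* One step after the input has failed: close the innermost open state,
 * or report EOF once nothing is left open. No data is handed back. */
int toy_step_after_error(struct toy_parser *p)
{
        if (p->depth > 0)
                return p->open[--p->depth] | TOY_END;
        return TOY_EOF;
}
```

A caller looping until EOF therefore still sees a balanced sequence of END states, and only needs to check the error code once the loop has finished.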

There are a couple of control functions, which modify the parser behaviour. If the from state is enabled, then Berkeley Mailbox "From " lines will be checked for in the input, and the From state will constitute the base active state.

Similarly, if pre_from is enabled, an additional state will be added for any data found before the first (and only the first) "From " line discovered in the input. This can be used to skip over invalid content without losing it.

 void camel_mime_parser_scan_from(CamelMimeParser *parser, gboolean scan_from);
 void camel_mime_parser_scan_pre_from(CamelMimeParser *parser, gboolean scan_pre_from);

Oops, this was removed, don't use this one:

 int camel_mime_parser_set_header_regex(CamelMimeParser *parser, char *matchstr);

Now we get to the functions used to invoke the parser itself. The optional buf and buflen arguments receive a pointer into the parser's internal buffer, and the length of the data there, for data detected in certain states. For example, the CAMEL_MIME_PARSER_STATE_BODY and CAMEL_MIME_PARSER_STATE_PRE_FROM states will pass any data present using this mechanism. This pointer is only valid until the next call to camel_mime_parser_step. A pointer to the internal buffer is used for efficiency; in many cases data doesn't need to be copied to be used.

 camel_mime_parser_state_t camel_mime_parser_step(CamelMimeParser *parser, char **buf, size_t *buflen);
 camel_mime_parser_state_t camel_mime_parser_state(CamelMimeParser *parser);
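Because the returned pointer is only valid until the next call to camel_mime_parser_step, any data that must outlive that call has to be copied. A minimal sketch of such a copy, in plain C with no Camel calls; chunk_buf and chunk_buf_append are hypothetical names used only for this illustration:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Growable byte buffer that owns its own copy of the data. */
struct chunk_buf {
        char *data;
        size_t len;
};

/* Append 'len' bytes from 'chunk' (for example the buf/buflen pair handed
 * back for one BODY state). Returns 0 on success, or -1 if memory could
 * not be allocated, in which case the buffer is left unchanged. */
int chunk_buf_append(struct chunk_buf *b, const char *chunk, size_t len)
{
        char *grown;

        assert(b != NULL);
        if (len == 0)
                return 0;
        grown = realloc(b->data, b->len + len);
        if (grown == NULL)
                return -1;
        memcpy(grown + b->len, chunk, len);
        b->data = grown;
        b->len += len;
        return 0;
}
```

The buffer starts out as { NULL, 0 } and is released with free(b.data) once the caller is done with it.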

Unstep can be used to go back one state transition. In fact it merely means the existing state is repeated as many times as unstep was called. This is useful if you've gone past the states you understand, but your callee might be expecting them.

 void camel_mime_parser_unstep(CamelMimeParser *parser);

Drop step is a little different: it causes the parser to drop one level from its parser stack. This means, for example, that if you have just received a HEADER state, it will drop the following HEADER_END state, and skip that data until the next appropriate boundary.

 void camel_mime_parser_drop_step(CamelMimeParser *parser);

Push state can be used to pre-load the parser with a particular state. This won't let you control the entire parser process, but needs to be used in a few specific cases to ensure parsing continues properly. It is often used with the seek function.

 void camel_mime_parser_push_state(CamelMimeParser *mp, camel_mime_parser_state_t newstate, const char *boundary);

As well as parsing, you can force a non-parsed read through the parser from the current parse position. Like the other functions, a pointer to the internal buffer is returned, as well as the length of valid data available, which will be at most len bytes, but may be less.

 int camel_mime_parser_read(CamelMimeParser *parser, const char **databuffer, int len);

Then there are a bunch of location functions used to jump to locations in a file, and find the physical location of various pieces of information. These are used for loading specific messages from mailbox files, and working out where they are in the first place.

 off_t camel_mime_parser_tell(CamelMimeParser *parser);
 off_t camel_mime_parser_seek(CamelMimeParser *parser, off_t offset, int whence);
 
 off_t camel_mime_parser_tell_start_headers(CamelMimeParser *parser);
 off_t camel_mime_parser_tell_start_from(CamelMimeParser *parser);
 off_t camel_mime_parser_tell_start_boundary(CamelMimeParser *parser);
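To give a feel for the kind of value camel_mime_parser_tell_start_from reports, the sketch below finds the byte offsets of lines beginning with "From " in an in-memory buffer. It is a deliberately naive stand-in written in plain C, not the parser's own logic (a real mbox reader has more rules about what counts as a separator), and find_from_offsets is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Store in 'out' (capacity 'max') the offsets of lines starting with
 * "From ". Returns the number of offsets stored. */
size_t find_from_offsets(const char *data, size_t len, size_t *out, size_t max)
{
        size_t pos = 0, count = 0;

        assert(data != NULL && out != NULL);
        while (pos < len && count < max) {
                const char *nl;

                if (len - pos >= 5 && memcmp(data + pos, "From ", 5) == 0)
                        out[count++] = pos;
                /* Move to the start of the next line, if there is one. */
                nl = memchr(data + pos, '\n', len - pos);
                if (nl == NULL)
                        break;
                pos = (size_t)(nl - data) + 1;
        }
        return count;
}
```

An offset recorded this way is the sort of value that would later be handed to camel_mime_parser_seek to jump straight to a message.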

Once you are in one of the header-related states, you can query the parser for information about the current level. Once this state has been passed, much of this information is no longer available, so it needs to be used at the right time - see the API comments for more information.

You can get everything from the part's Content-Type in a pre-parsed form, to any specific header in raw form (including the character offset in the file for the start of the header), all headers, pre/post text in a multipart type, or the physical "From " line present in a mailbox file.

 CamelContentType *camel_mime_parser_content_type(CamelMimeParser *parser);
 const char *camel_mime_parser_header(CamelMimeParser *parser, const char *name, int *offset);
 struct _camel_header_raw *camel_mime_parser_headers_raw(CamelMimeParser *parser);
 
 const char *camel_mime_parser_preface(CamelMimeParser *parser);
 const char *camel_mime_parser_postface(CamelMimeParser *parser);
 
 const char *camel_mime_parser_from_line(CamelMimeParser *parser);

And last but not least, there are some utility functions that let you add Evolution/Camel.MimeFilter filters to pre-process content before it leaves the parser. This is a convenient way to get data in the right form efficiently without having to process it afterwards (this was the original use for the CamelMimeFilter interface, hence its name). Filters are active until they are removed, so filters can be added and removed as desired as the processing takes place, which is ideal, for example, for scanning an entire mailbox and performing some work on it.

 int camel_mime_parser_filter_add(CamelMimeParser *parser, CamelMimeFilter *filter);
 void camel_mime_parser_filter_remove(CamelMimeParser *parser, int id);

Using the parser

There are many ways to use the parser, from a very bare-bones approach to using it as a source for building structured Evolution/Camel.DataWrapper#Camel.MimeMessage objects.

Example: Reading an individual or Maildir message

Reading a standalone file (or a Maildir entry) is particularly simple:

        stream = camel_stream_fs_new_with_name("/path/to/file", O_RDONLY, 0);
        msg = camel_mime_message_new();
        if (camel_data_wrapper_construct_from_stream((CamelDataWrapper *)msg, stream) != -1) {
                /* msg has accessors you can get subject/to/from/etc from directly */
                /* you can also get the raw headers by name, etc */
        }

Example: Reading A Berkeley Mailbox

Reading a Berkeley Mailbox is a bit more complex, but not much:

        mp = camel_mime_parser_new();
        /* the bit which turns it into an mbox parser */
        camel_mime_parser_scan_from(mp, TRUE);
        stream = camel_stream_fs_new(...);
        if (camel_mime_parser_init_with_stream(mp, stream) == -1)
                /* error */;
        while (camel_mime_parser_step(mp, NULL, NULL) == CAMEL_MIME_PARSER_STATE_FROM) {
                msg = camel_mime_message_new();
                if (camel_mime_part_construct_from_parser((CamelMimePart *)msg, mp) == -1)
                        /* error */;
                /* use msg */
                /* skip over 'FROM_END' state */
                camel_mime_parser_step(mp, NULL, NULL);
        }

Example: Listing all messages in a Berkeley Mailbox

Often a client needs to scan a whole mailbox, extract some information from the top-level headers, but not parse the entire content. This is quite easy to achieve. We can even find out where the message was in the mailbox, etc.

        mp = camel_mime_parser_new();
        /* the bit which turns it into an mbox parser */
        camel_mime_parser_scan_from(mp, TRUE);
        stream = camel_stream_fs_new(...);
        if (camel_mime_parser_init_with_stream(mp, stream) == -1)
                /* error */;
        while (camel_mime_parser_step(mp, NULL, NULL) == CAMEL_MIME_PARSER_STATE_FROM) {
                frompos = camel_mime_parser_tell_start_from(mp);
                camel_mime_parser_step(mp, NULL, NULL);
                subject = camel_mime_parser_header(mp, "subject", NULL);
                printf("Message @ %ld, Subject: %s\n", (long)frompos, subject);
                camel_mime_parser_drop_step(mp);
        }
        camel_object_unref(mp);

NB: this example is untested.

Example: Searching a message

Now for something a bit more complicated: searching a message. Using simple file I/O will not work, since content may be in any encoding format, and may be in the wrong character set. You could convert a whole message at a time, but that isn't particularly efficient.

This example also demonstrates the use of Evolution/Camel.MimeFilter filters as additional processing elements in the data pipeline.

For this code fragment, we'll assume the parser has been initialised appropriately.

        CamelMimeFilter *encfilter = NULL, *charfilter = NULL;
        int encid = -1, charid = -1;

        while (!found && (state = camel_mime_parser_step(mp, &data, &len)) != CAMEL_MIME_PARSER_STATE_EOF) {
                switch (state) {
                case CAMEL_MIME_PARSER_STATE_HEADER:
                        ct = camel_mime_parser_content_type(mp);
                        if (camel_content_type_is(ct, "text", "*")) {
                                /* add a filter to undo the transfer encoding, if any */
                                encoding = camel_mime_parser_header(mp, "Content-Transfer-Encoding", NULL);
                                enc = camel_content_transfer_encoding_decode(encoding);
                                switch (enc) {
                                case CAMEL_TRANSFER_ENCODING_BASE64:
                                        encfilter = (CamelMimeFilter *)camel_mime_filter_basic_new_type(CAMEL_MIME_FILTER_BASIC_BASE64_DEC);
                                        encid = camel_mime_parser_filter_add(mp, encfilter);
                                        break;
                                case CAMEL_TRANSFER_ENCODING_QUOTEDPRINTABLE:
                                        encfilter = (CamelMimeFilter *)camel_mime_filter_basic_new_type(CAMEL_MIME_FILTER_BASIC_QP_DEC);
                                        encid = camel_mime_parser_filter_add(mp, encfilter);
                                        break;
                                default:
                                        break;
                                }
                                /* convert anything that is not already utf-8 or us-ascii */
                                charset = camel_content_type_param(ct, "charset");
                                if (charset && g_ascii_strcasecmp(charset, "utf-8") != 0 && g_ascii_strcasecmp(charset, "us-ascii") != 0) {
                                        charfilter = (CamelMimeFilter *)camel_mime_filter_charset_new(charset, "utf-8");
                                        charid = camel_mime_parser_filter_add(mp, charfilter);
                                }
                        } else {
                                /* not text: skip the content of this part */
                                camel_mime_parser_drop_step(mp);
                        }
                        break;
                case CAMEL_MIME_PARSER_STATE_BODY_END:
                        /* end of the part (HEADER_END is only a dummy value in the enum):
                           remove the filters added for it, using the ids from filter_add */
                        if (charfilter) {
                                camel_mime_parser_filter_remove(mp, charid);
                                camel_object_unref(charfilter);
                                charfilter = NULL;
                        }
                        if (encfilter) {
                                camel_mime_parser_filter_remove(mp, encid);
                                camel_object_unref(encfilter);
                                encfilter = NULL;
                        }
                        break;
                case CAMEL_MIME_PARSER_STATE_BODY:
                        /* Search in data, for len bytes, looking for a match of the text */
                        break;
                }
        }

NB: This example is also untested.

Note that because we are guaranteed a specific state sequence, we don't have to worry about the edge cases. By dropping the step at uninteresting places we won't even bother parsing or converting binary parts. And because we're only interested in text parts, we don't care about the structure of the message, so we don't need to track the state transitions either.
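The search step left as a comment in the BODY case of the example above works on a pointer and a length, so it should be bounded by len rather than rely on a terminating NUL. A minimal sketch of such a length-bounded search in plain C; find_bytes is a hypothetical helper, and it does not handle a match that straddles two successive BODY chunks:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the first occurrence of the 'nlen' bytes at 'needle'
 * within the first 'len' bytes of 'data', or NULL if there is none. */
const char *find_bytes(const char *data, size_t len, const char *needle, size_t nlen)
{
        size_t i;

        assert(data != NULL && needle != NULL);
        if (nlen == 0)
                return data;
        if (nlen > len)
                return NULL;
        for (i = 0; i + nlen <= len; i++) {
                /* cheap first-byte check before the full comparison */
                if (data[i] == needle[0] && memcmp(data + i, needle, nlen) == 0)
                        return data + i;
        }
        return NULL;
}
```

In the loop above, the BODY case could then set found when find_bytes(data, len, text, strlen(text)) returns non-NULL, where text is whatever string is being searched for.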

To perform the same task on a Berkeley Mailbox file should be a simple matter of initialising the parser to scan From lines, and pointing it at an appropriate file. The From states will just be ignored.

Unfortunately, real messages are likely to break the RFCs, particularly those from someone's little shareware bunky-mail-library-of-the-week or, worse, from LookOut. In real life things therefore get a lot more complicated, and you can't really do fully streamed processing. But for anything created by Camel, or in controlled environments, it can lead to massive memory and CPU savings.

Apps/Evolution/Camel.MimeParser (last edited 2013-08-08 22:50:12 by WilliamJonMcCann)