1. E-mail metadata RDF storage
1.1. Storage
In accordance with this proposal, this document describes how to store E-mail metadata in Tracker's RDF store. You can find the Nepomuk Message Ontology here.
To get a good scheme for forming URLs for IMAP, read RFC 5092.
The description of the FETCH command in the IMAP RFC explains how to use sections (page 55). For a URL scheme for POP, read RFC 2384. Avoid inventing your own URL scheme. Use these URLs as the value of nie:url in RDF. You can also use them as the subjects of your resources, which is what I will do in the examples that follow.
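As a rough illustration of the URL layout used throughout this page, here is a minimal sketch that assembles such a URL from its pieces. The function name imap_url and its parameters are hypothetical, and the lower-case ;uid= and ;section= spelling simply follows the examples in this document; check RFC 5092 before relying on the details.

```python
from urllib.parse import quote


def imap_url(user, host, mailbox, uid=None, section=None):
    """Build an IMAP URL for a mailbox, a message, or one MIME part of it."""
    # Percent-encode the user and the mailbox name; "/" stays literal in the
    # mailbox because it is commonly used as the hierarchy separator.
    url = "imap://%s@%s/%s" % (quote(user, safe=""), host, quote(mailbox, safe="/"))
    if uid is not None:
        url += "/;uid=%d" % uid
        if section is not None:
            # A section such as "1.2" or "TEXT" passes through unchanged.
            url += "/;section=%s" % quote(section, safe="")
    return url


print(imap_url("joe", "example.com", "INBOX", 21, "1.2"))
```

The same string can then serve both as the resource's subject and as its nie:url value.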
1.1.1. A complete example
This complete example illustrates how a simple E-mail with a text/plain and a text/html body and one file attachment looks in RDF. I will start with how you'll receive it over IMAP as BODYSTRUCTURE and ENVELOPE. Then I'll turn the complete E-mail's metadata into proper RDF for the Nepomuk Message Ontology.
1.1.1.1. Section 1.1
Before you can know about these sections you have to analyze BODYSTRUCTURE or BODY first. But I'll explain that a bit later. I'll also explain later that it's not always the case that these specific parts are called 1.1 or 1.2, sometimes they are called part 1 and part 2, or even differently. BODYSTRUCTURE will tell you, but you'll get that explained in a minute.
A UID FETCH 1 BODY.PEEK[1.1]
* 1 FETCH (UID 1 BODY[1.1] {170}
test
--
My signature Blabla ...
)
A OK Completed
1.1.1.2. Section 1.2
A UID FETCH 1 BODY.PEEK[1.2]
* 1 FETCH (UID 1 BODY[1.2] {1321}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
...
</HTML>
)
A OK Completed
1.1.1.3. Section 2
A FETCH 1 BODY.PEEK[2]
* 1 FETCH (BODY[2] {9541}
diff --git a/src/libtracker-data/tracker-data-query.c b/src/libtracker-data/tracker-data-query.c
index 20f0140..c9db8ec 100644
--- a/src/libtracker-data/tracker-data-query.c
+++ b/src/libtracker-data/tracker-data-query.c
@@ -210, ...
1.1.1.4. BODYSTRUCTURE
I broke this down; an IMAP server won't return it indented. Note that in Tinymail you can find an LGPL BODYSTRUCTURE parser in plain C here. Also check out this document about metadata collecting for E-mail.
To get the text/plain and the text/html out of this structure, you will need to call them 1.1 and 1.2. That's because they sit inside a multipart/alternative, which is itself part 1 of the top-level multipart/mixed. You can recognize it as the list ending in "ALTERNATIVE" followed by a "BOUNDARY" parameter. Nearly everything, if not everything, you see in the body structure reply from IMAP is useful metadata that we are going to use.
A01 UID FETCH 21 BODYSTRUCTURE
* 1 FETCH (
  UID 21 BODYSTRUCTURE (
    (
      ("TEXT" "PLAIN" NIL NIL NIL "7BIT" 170 7 NIL NIL NIL)
      ("TEXT" "HTML" ("CHARSET" "utf-8") NIL NIL "7BIT" 1321 24 NIL NIL NIL)
      "ALTERNATIVE" ("BOUNDARY" "=-GgGWuVS+goa+7OHIJWr0") NIL NIL
    )
    ("TEXT" "X-PATCH" ("NAME" "fix_class_signals.diff" "CHARSET" "UTF-8") NIL NIL "7BIT" 9541 266 NIL ("ATTACHMENT" ("FILENAME" "fix_class_signals.diff")) NIL)
    "MIXED" ("BOUNDARY" "=-RBJ0QoWwq+KaBoV5H8JN") NIL NIL
  )
)
A01 OK Completed
Here is a BODYSTRUCTURE of a simple E-mail:
a UID FETCH 129 BODYSTRUCTURE
* 72 FETCH (
  UID 129 BODYSTRUCTURE (
    "TEXT" "PLAIN" ("CHARSET" "us-ascii") NIL NIL "7BIT" 724 26 NIL NIL NIL NIL
  )
)
a OK Completed (0.000 sec)
Whereas the more complicated one has the typical 1.1 and 1.2 part names for the text/plain and text/html parts, the simple one's text/plain part is called 1: there's no multipart container above it.
Let's take a look at an even more interesting one:
a UID FETCH 111 BODYSTRUCTURE
* 54 FETCH (
  UID 111 BODYSTRUCTURE (
    ("TEXT" "PLAIN" ("CHARSET" "us-ascii") NIL NIL "7BIT" 1226 25 NIL NIL NIL NIL)
    ("TEXT" "HTML" ("CHARSET" "us-ascii") NIL NIL "7BIT" 5934 157 NIL NIL NIL NIL)
    "ALTERNATIVE" ("BOUNDARY" "----=_NextPart_000_0001_01C3E681.3A67D9C0") NIL NIL NIL
  )
)
a OK Completed (0.000 sec)
This one has two immediate text/* parts, called 1 and 2. That must mean the E-mail's top Content-Type is multipart/alternative. Let's check if that's the case.
a UID FETCH 111 BODY.PEEK[HEADER.FIELDS (CONTENT-TYPE)]
* 54 FETCH (UID 111 BODY[HEADER.FIELDS (CONTENT-TYPE)] {93}
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0001_01C3E681.3A67D9C0"
)
a OK Completed (0.000 sec)
Right.
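The numbering rule illustrated by these three replies can be sketched in a few lines. This is only an illustration under assumptions: the BODYSTRUCTURE is taken to be already parsed into nested Python lists with NIL as None (that representation and the name number_parts are hypothetical), and embedded message/rfc822 parts are not handled.

```python
def number_parts(body, prefix=""):
    """Yield (section, mime type) for every non-multipart part."""
    if isinstance(body[0], list):
        # Multipart: the child parts come first; the first string that
        # follows them is the multipart subtype ("MIXED", "ALTERNATIVE").
        index = 0
        for child in body:
            if not isinstance(child, list):
                break
            index += 1
            section = "%s.%d" % (prefix, index) if prefix else str(index)
            yield from number_parts(child, section)
    else:
        # A part that is not inside any multipart is simply called 1.
        yield (prefix or "1", "%s/%s" % (body[0].lower(), body[1].lower()))


# The multipart/mixed example from above, as nested lists.
mixed = [
    [
        ["TEXT", "PLAIN", None, None, None, "7BIT", 170, 7, None, None, None],
        ["TEXT", "HTML", ["CHARSET", "utf-8"], None, None, "7BIT", 1321, 24, None, None, None],
        "ALTERNATIVE", ["BOUNDARY", "=-GgGWuVS+goa+7OHIJWr0"], None, None,
    ],
    ["TEXT", "X-PATCH", ["NAME", "fix_class_signals.diff", "CHARSET", "UTF-8"], None, None,
     "7BIT", 9541, 266, None, ["ATTACHMENT", ["FILENAME", "fix_class_signals.diff"]], None],
    "MIXED", ["BOUNDARY", "=-RBJ0QoWwq+KaBoV5H8JN"], None, None,
]

for section, mime_type in number_parts(mixed):
    print(section, mime_type)
```

For the simple text/plain E-mail the same function yields a single part called 1, and for the top-level multipart/alternative it yields parts 1 and 2.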
1.1.1.5. ENVELOPE
I broke this down; an IMAP server won't return it indented. Note that in Tinymail you can find an LGPL ENVELOPE parser in plain C here. Also check out this document about metadata collecting for E-mail.
See the IMAP RFC for the description of ENVELOPE; some more information is available at page 56.
a UID FETCH 21 ENVELOPE
* 1 FETCH (UID 21 ENVELOPE (
  "Mon, 06 Apr 2009 17:02:16 +0200"
  "Test subject"
  ((NIL NIL "from.some" "body.com"))
  ((NIL NIL "from.some" "body.com"))
  ((NIL NIL "from.some" "body.com"))
  ((NIL NIL "to.some" "body.com"))
  NIL
  NIL
  NIL
  "<9999999999.9999.99.yoursoftware@hostname>"
))
a OK Completed
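To show how the positional ENVELOPE fields line up with what ends up in RDF, here is a small sketch. It assumes the reply has already been parsed into a Python list with NIL as None; the names ENVELOPE_FIELDS, addresses and envelope_to_dict are hypothetical, and group syntax in address lists is not handled.

```python
# Field order of an IMAP ENVELOPE, as given in the IMAP RFC.
ENVELOPE_FIELDS = ("date", "subject", "from", "sender", "reply_to",
                   "to", "cc", "bcc", "in_reply_to", "message_id")


def addresses(address_list):
    """Turn (name, adl, mailbox, host) structures into mailto: URIs."""
    # A NIL address list (None) becomes an empty list.
    return ["mailto:%s@%s" % (mailbox, host)
            for (_name, _adl, mailbox, host) in (address_list or [])]


def envelope_to_dict(envelope):
    """Name the positional ENVELOPE fields and flatten the address lists."""
    fields = dict(zip(ENVELOPE_FIELDS, envelope))
    for key in ("from", "sender", "reply_to", "to", "cc", "bcc"):
        fields[key] = addresses(fields[key])
    return fields


envelope = [
    "Mon, 06 Apr 2009 17:02:16 +0200",
    "Test subject",
    [[None, None, "from.some", "body.com"]],
    [[None, None, "from.some", "body.com"]],
    [[None, None, "from.some", "body.com"]],
    [[None, None, "to.some", "body.com"]],
    None, None, None,
    "<9999999999.9999.99.yoursoftware@hostname>",
]
fields = envelope_to_dict(envelope)
print(fields["subject"], fields["from"], fields["to"])
```

The mailto: URIs are the values used for nco:hasEmailAddress in the Turtle further down.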
1.1.1.6. Headers of individual parts
To fetch the MIME headers of an individual part, append .MIME to its part specifier. For example, use 1.MIME for the headers of the alternative MIME part container:
a UID FETCH 21 BODY.PEEK[1.MIME]
* 1 FETCH (UID 21 BODY[1.MIME] {95}
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0001_01C5FD97.D696E730"
)
a OK Completed (0.000 sec)
And to get the headers of the Plain Text mime part within the alternative mime part container:
a UID FETCH 21 BODY.PEEK[1.1.MIME]
* 1 FETCH (UID 21 BODY[1.1.MIME] {83}
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
)
a OK Completed (0.000 sec)
Or to get the headers of the HTML one:
a UID FETCH 21 BODY.PEEK[1.2.MIME]
* 1 FETCH (UID 21 BODY[1.2.MIME] {94}
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
)
a OK Completed (0.000 sec)
1.1.1.7. Be careful while coding!
The examples above use an E-mail whose parts are called 1.1, 1.2, and so on, but that is not always the case. If the top-level Content-Type of the E-mail is, for example, multipart/alternative, then you can expect the Plain Text part to be 1 and the HTML part to be 2. To get the headers of that multipart/alternative you should ask for BODY.PEEK[HEADER], not something wrong like BODY.PEEK[0.MIME]. A similar thing happens for the simpler E-mails that you often get on mailing lists (many mailing list subscribers don't like HTML E-mails, so mailing list E-mails are usually simpler ones).
For such simple E-mails the top Content-Type is often text/plain itself. This means that part 1 is the text/plain part and that there are no parts 2 and 3. There is also no part 1.1 or 1.2: that kind of structure would require a multipart/alternative container nested inside another multipart, which a simple E-mail doesn't have.
It's all kinda logical but you have to put the pieces together right in your mind. And, of course, some people don't find this logical. That was to be expected.
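The rule about which FETCH item returns the headers can be written down as a tiny helper. This is a sketch, not part of any library: the name header_fetch_item is hypothetical, and None stands for "the message itself" rather than one of its parts.

```python
def header_fetch_item(section=None):
    """Return the FETCH item that retrieves the headers for a section."""
    if section is None:
        # The top-level Content-Type lives in the message header itself;
        # there is no part 0 to ask for.
        return "BODY.PEEK[HEADER]"
    # Any numbered part has its own MIME header block.
    return "BODY.PEEK[%s.MIME]" % section


print(header_fetch_item())       # BODY.PEEK[HEADER]
print(header_fetch_item("1.2"))  # BODY.PEEK[1.2.MIME]
```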
1.1.1.8. In RDF
This is, for example, the URL for MIME part 1.2 of the E-mail with UID 21 in mailbox INBOX on joe's account on example.com's IMAP server:
<imap://joe@example.com/INBOX/;uid=21/;section=1.2> | Text body part of the top E-mail in HTML format
Overview of the complete E-mail:
<imap://joe@example.com/INBOX/;uid=21> | The complete E-mail
<imap://joe@example.com/INBOX/;uid=21/;section=TEXT> | The content of the complete E-mail
<imap://joe@example.com/INBOX/;uid=21/;section=1> | Container MIME part (the ALTERNATIVE) for the Plain Text and HTML body parts of the E-mail
<imap://joe@example.com/INBOX/;uid=21/;section=1.1> | Text body part of the top E-mail in Plain Text format
<imap://joe@example.com/INBOX/;uid=21/;section=1.2> | Text body part of the top E-mail in HTML format
<imap://joe@example.com/INBOX/;uid=21/;section=2> | The .diff file attachment
Turtle:
@prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .
@prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .
@prefix nmo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#> .
@prefix nrl: <http://www.semanticdesktop.org/ontologies/2007/08/15/nrl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nmm: <http://www.tracker-project.org/temp/nmm#> .

<imap://joe@example.com/INBOX/;uid=21/;section=2> a nmo:Attachment ;
    nie:mimeType "text/x-patch" ;
    nmo:contentTransferEncoding "7BIT" ;
    nfo:lineCount 266 ;
    nmo:contentDisposition "ATTACHMENT" ;
    nfo:fileName "fix_class_signals.diff" ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 9541 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=2' .

<imap://joe@example.com/INBOX/;uid=21/;section=1.1> a nmo:MimePart, nfo:PlainTextDocument ;
    nie:mimeType "text/plain" ;
    nmo:contentTransferEncoding "7BIT" ;
    nfo:lineCount 7 ;
    nie:byteSize 170 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1.1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1.1' .

<imap://joe@example.com/INBOX/;uid=21/;section=1.2> a nmo:MimePart, nfo:HtmlDocument ;
    nie:mimeType "text/html" ;
    nmo:contentTransferEncoding "7BIT" ;
    nfo:lineCount 24 ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 1321 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1.2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1.2' .

<imap://joe@example.com/INBOX/;uid=21/;section=1> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1.1> ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1.2> ;
    nmo:contentDisposition "ALTERNATIVE" ;
    nmo:partBoundary "=-GgGWuVS+goa+7OHIJWr0" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1' .

<imap://joe@example.com/INBOX/;uid=21/;section=TEXT> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1> ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=2> ;
    nmo:partBoundary "=-RBJ0QoWwq+KaBoV5H8JN" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=TEXT> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=TEXT' .

<imap://joe@example.com/INBOX/;uid=21> a nmo:Email, nmo:MailboxDataObject ;
    nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:to.some@body.com> ] ;
    nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:from.some@body.com> ] ;
    nmo:messageSubject "Test subject" ;
    nmo:sentDate "Mon, 06 Apr 2009 17:02:16 +0200" ;
    nmo:hasContent <imap://joe@example.com/INBOX/;uid=21/;section=TEXT> ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=21' .
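To connect the BODYSTRUCTURE reply to the Turtle above, here is a minimal sketch that renders one non-multipart part as a resource. It assumes the part has already been parsed into a Python list with NIL as None; the name part_to_turtle is hypothetical, only a subset of the properties shown above is emitted, and values are interpolated without any Turtle escaping, so it is not suitable for untrusted input as written.

```python
def part_to_turtle(url, leaf):
    """Render one non-multipart BODYSTRUCTURE part as a Turtle resource."""
    # Positions follow the BODYSTRUCTURE layout of a basic part:
    # type, subtype, parameters, id, description, encoding, size.
    mime_type = "%s/%s" % (leaf[0].lower(), leaf[1].lower())
    lines = [
        "<%s> a nmo:MimePart ;" % url,
        '    nie:mimeType "%s" ;' % mime_type,
        '    nmo:contentTransferEncoding "%s" ;' % leaf[5],
        "    nie:byteSize %d ;" % leaf[6],
    ]
    if leaf[0].upper() == "TEXT":
        # Text parts carry their line count right after the size.
        lines.append("    nfo:lineCount %d ;" % leaf[7])
    lines.append("    nie:isStoredAs <%s> ;" % url)
    lines.append("    nie:url '%s' ." % url)
    return "\n".join(lines)


plain = ["TEXT", "PLAIN", None, None, None, "7BIT", 170, 7, None, None, None]
print(part_to_turtle("imap://joe@example.com/INBOX/;uid=21/;section=1.1", plain))
```

The prefix declarations still have to be written once at the top of the file before resources like this one.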
1.1.2. E-mails that contain forwards
You can, for example, copy and paste this into a file such as import.ttl and load it with org.freedesktop.Tracker.Resources.Load, or you can use Resources.SparqlUpdate to insert individual triples.
Overview of the complete E-mail:
<imap://joe@example.com/INBOX/;uid=20> | The complete E-mail
<imap://joe@example.com/INBOX/;uid=20/;section=TEXT> | The content of the complete E-mail
<imap://joe@example.com/INBOX/;uid=20/;section=1> | Container MIME part (the ALTERNATIVE) for the Plain Text and HTML body parts of the top E-mail
<imap://joe@example.com/INBOX/;uid=20/;section=1.1> | Text body part of the top E-mail in Plain Text format
<imap://joe@example.com/INBOX/;uid=20/;section=1.2> | Text body part of the top E-mail in HTML format
<imap://joe@example.com/INBOX/;uid=20/;section=2> | RFC822 E-mail MIME part containing the forward
<imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> | The content of the complete forwarded E-mail
<imap://joe@example.com/INBOX/;uid=20/;section=2.1> | Container MIME part for the plain and html body parts of the forwarded E-mail
<imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> | Text body part of the forwarded E-mail in Plain Text format
<imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> | Text body part of the forwarded E-mail in HTML format
<imap://joe@example.com/INBOX/;uid=20/;section=2.2> | Video attachment in the forwarded E-mail
Turtle:
@prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .
@prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .
@prefix nmo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#> .
@prefix nrl: <http://www.semanticdesktop.org/ontologies/2007/08/15/nrl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nmm: <http://www.tracker-project.org/temp/nmm#> .

<imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> a nmo:MimePart, nfo:TextDocument ;
    nfo:lineCount 123 ;
    nie:mimeType "text/plain" ;
    nmo:contentTransferEncoding "7BIT" ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 1321 ;
    nfo:wordCount 89 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> a nmo:MimePart, nfo:HtmlDocument ;
    nfo:lineCount 160 ;
    nie:mimeType "text/html" ;
    nmo:contentTransferEncoding "7BIT" ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 1452 ;
    nfo:wordCount 89 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.2> a nmo:MimePart, nmm:Video ;
    nie:title "Some movie" ;
    nfo:encoding "BASE64" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.1> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> ,
                <imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> ;
    nmo:partBoundary "--------------------fasdfs-----" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=2.1> ,
                <imap://joe@example.com/INBOX/;uid=20/;section=2.2> ;
    nmo:partBoundary "--------------------fa22fs-----" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT' .

<imap://joe@example.com/INBOX/;uid=20/;section=2> a nmo:Email ;
    nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:forward@ed.from.com> ] ;
    nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:forward@ed.to.com> ] ;
    nmo:messageSubject "Forward me" ;
    nmo:hasContent <imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2' .

<imap://joe@example.com/INBOX/;uid=20/;section=1.1> a nmo:MimePart, nfo:TextDocument ;
    nfo:lineCount 66 ;
    nie:mimeType "text/plain" ;
    nmo:contentTransferEncoding "7BIT" ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 456 ;
    nfo:wordCount 40 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1.1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=1.2> a nmo:MimePart, nfo:HtmlDocument ;
    nfo:lineCount 60 ;
    nie:mimeType "text/html" ;
    nmo:contentTransferEncoding "7BIT" ;
    nmo:charSet "UTF-8" ;
    nie:byteSize 556 ;
    nfo:wordCount 40 ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1.2> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=1> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=1.1> ,
                <imap://joe@example.com/INBOX/;uid=20/;section=1.2> ;
    nmo:partBoundary "--------------------hheer-----" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1' .

<imap://joe@example.com/INBOX/;uid=20/;section=TEXT> a nmo:MimePart, nmo:Multipart ;
    nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=1> ,
                <imap://joe@example.com/INBOX/;uid=20/;section=2> ;
    nmo:partBoundary "----------------ssdsd---------" ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=TEXT> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=TEXT' .

<imap://joe@example.com/INBOX/;uid=20> a nmo:Email, nmo:MailboxDataObject ;
    nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:to.some@body.com> ] ;
    nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:from.some@body.com> ] ;
    nmo:messageSubject "FWD: Forward me" ;
    nmo:hasContent <imap://joe@example.com/INBOX/;uid=20/;section=TEXT> ;
    nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20> ;
    nie:url 'imap://joe@example.com/INBOX/;uid=20' .
1.2. Querying
1.2.1. Example queries on attachments
Getting the filename and attachment out of the first E-mail
SELECT nie:url (?attachment) nfo:fileName (?attachment)
WHERE {
  ?s nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21> ;
     nmo:hasContent ?content .
  ?content nie:hasPart ?attachment .
  ?attachment a nmo:Attachment .
}

imap://joe@example.com/INBOX/;uid=21/;section=2, fix_class_signals.diff
Getting the attachment and title of the forward in the second E-mail
SELECT nie:url (?forward_attachment) nie:title (?forward_attachment)
WHERE {
  ?s nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20> ;
     nmo:hasContent ?content .
  ?content nie:hasPart ?forward .
  ?forward nmo:hasContent ?forward_content .
  ?forward_content nie:hasPart ?forward_attachment .
}

imap://joe@example.com/INBOX/;uid=20/;section=2.2, Some movie
1.2.2. Example query using subject and recipient
Getting the recipient of the messages that have message subject "Forward me"
SELECT ?s WHERE { ?o nmo:to ?s ; nmo:messageSubject "Forward me" . }
1.2.3. Example query using subject
Getting the message that has subject "FWD: Forward me"
SELECT nie:url (?s) WHERE { ?s nmo:messageSubject "FWD: Forward me" }
1.2.4. Example query using subject
Getting the URL of the messages that have message subject "Forward me"
SELECT nie:url (?s) WHERE { ?s nmo:messageSubject "Forward me" }
Note that the result differs from that of the previous query. This is why you want to assign a proper URL to each MIME part: the forwarded message is not the same as the container message. If you store E-mail metadata properly, the right query gives you access to both of them.
1.2.5. More complex example queries
Getting the parent E-mail when only knowing the leaf forwarded E-mail's message subject.
SELECT nie:url (?email) WHERE { ?email nmo:hasContent ?content . ?content nie:hasPart ?forward . ?forward nmo:messageSubject "Forward me" . }
Getting the boundary of the container MIME part of the (forwarded) message that has message subject "Forward me".
SELECT ?boundary WHERE { ?x nie:hasPart ?y ; nmo:partBoundary ?boundary . ?y nmo:messageSubject "Forward me" . }