E-mail metadata RDF storage

Storage

In accordance with this proposal, this document describes how to store E-mail metadata in Tracker's RDF store. You can find the Nepomuk Message Ontology here.

To get a good scheme for forming URLs for IMAP, read RFC 5092.

The description of the FETCH command in the IMAP specification explains how to use sections (page 55). For a URL scheme for POP, read RFC 2384. Avoid inventing your own URL scheme. Use these URLs as the value of nie:url in RDF. You can also use them as the subjects of your resources, which is what I will do in the examples that follow.
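
As an illustration of forming such URLs, here is a minimal Python sketch. The helper name imap_url and its parameters are hypothetical, not part of Tracker or of any library. It follows the lowercase ;uid= and ;section= spelling used in the examples on this page and percent-encodes the user, mailbox and section with the standard library.

```python
from urllib.parse import quote


def imap_url(user, host, mailbox, uid=None, section=None):
    """Build an IMAP URL for a mailbox, a message, or one part of a message."""
    # Keep "/" unescaped in the mailbox name so folder hierarchies stay readable.
    url = f"imap://{quote(user)}@{host}/{quote(mailbox, safe='/')}"
    if uid is not None:
        url += f"/;uid={uid}"
        if section is not None:
            # A section only makes sense for a specific message.
            url += f"/;section={quote(section, safe='')}"
    return url


print(imap_url("joe", "example.com", "INBOX", 21, "1.2"))
```

The same string can then serve both as the resource subject and as the nie:url value.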

A complete example

This complete example illustrates how a simple E-mail with a text/plain and a text/html body and one file attachment will look in RDF. I will start with how you'll receive it over IMAP as BODYSTRUCTURE and ENVELOPE. Then I'll express the complete E-mail's metadata as RDF using the Nepomuk Message Ontology.

Section 1.1

Before you can know about these sections you have to analyze BODYSTRUCTURE or BODY first; I'll explain that a bit later. These specific parts are also not always called 1.1 or 1.2: sometimes they are called part 1 and part 2, or something else again. BODYSTRUCTURE will tell you, as explained further down.

A UID FETCH 1 BODY.PEEK[1.1]
* 1 FETCH (UID 1 BODY[1.1] {170}
test
-- 
My signature
Blabla
...
)
A OK Completed

Section 1.2

A UID FETCH 1 BODY.PEEK[1.2]
* 1 FETCH (UID 1 BODY[1.2] {1321}
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 TRANSITIONAL//EN">
<HTML>
...
</HTML>
)
A OK Completed

Section 2

A FETCH 1 BODY.PEEK[2]
* 1 FETCH (BODY[2] {9541}
diff --git a/src/libtracker-data/tracker-data-query.c b/src/libtracker-data/tracker-data-query.c
index 20f0140..c9db8ec 100644
--- a/src/libtracker-data/tracker-data-query.c
+++ b/src/libtracker-data/tracker-data-query.c
@@ -210, ...

BODYSTRUCTURE

I broke this down for readability; an IMAP server won't return it indented. Note that in Tinymail you can find an LGPL BODYSTRUCTURE parser in plain C here. Also check out this document about metadata collecting for E-mail.

IMAP RFC about BODYSTRUCTURE

To get the text/plain and the text/html parts out of this structure, you address them as 1.1 and 1.2. That's because there's a multipart/alternative above them: it is the list that ends in "ALTERNATIVE" with a "BOUNDARY" parameter. Nearly everything, if not everything, you see in the body structure reply from IMAP is useful metadata that we are going to use.

A01 UID FETCH 21 BODYSTRUCTURE
* 1 FETCH (
        UID 21 BODYSTRUCTURE (
                (
                        ("TEXT" "PLAIN" NIL NIL NIL "7BIT" 170 7 NIL NIL NIL)
                        ("TEXT" "HTML" ("CHARSET" "utf-8") NIL NIL "7BIT" 1321 24 NIL NIL NIL)
                        "ALTERNATIVE" ("BOUNDARY" "=-GgGWuVS+goa+7OHIJWr0") NIL NIL
                )

                ("TEXT" "X-PATCH" ("NAME" "fix_class_signals.diff" "CHARSET" "UTF-8") NIL NIL "7BIT" 9541 266 NIL
                        ("ATTACHMENT" ("FILENAME" "fix_class_signals.diff")) NIL)

                "MIXED" ("BOUNDARY" "=-RBJ0QoWwq+KaBoV5H8JN") NIL NIL
        )
)
A01 OK Completed

Here is a BODYSTRUCTURE of a simple E-mail:

a UID FETCH 129 BODYSTRUCTURE
* 72 FETCH (
        UID 129 BODYSTRUCTURE (
                "TEXT" "PLAIN" ("CHARSET" "us-ascii") NIL NIL "7BIT" 724 26 NIL NIL NIL NIL
        )
)
a OK Completed (0.000 sec)

Whereas the more complicated one has the typical 1.1 and 1.2 part names for the text/plain and text/html parts, the simple one's text/plain part is called 1: there's no multipart/alternative above it.

Let's take a look at an even more interesting one:

a UID FETCH 111 BODYSTRUCTURE
* 54 FETCH (
        UID 111 BODYSTRUCTURE (
                ("TEXT" "PLAIN" ("CHARSET" "us-ascii") NIL NIL "7BIT" 1226 25 NIL NIL NIL NIL)
                ("TEXT" "HTML" ("CHARSET" "us-ascii") NIL NIL "7BIT" 5934 157 NIL NIL NIL NIL) 
                "ALTERNATIVE" ("BOUNDARY" "----=_NextPart_000_0001_01C3E681.3A67D9C0") NIL NIL NIL
        )
)
a OK Completed (0.000 sec)

This one means that we have two immediate text/* parts called 1 and 2. This must mean the E-mail's top Content-Type is multipart/alternative. Let's check if that's the case.

a UID FETCH 111 BODY.PEEK[HEADER.FIELDS (CONTENT-TYPE)]
* 54 FETCH (UID 111 BODY[HEADER.FIELDS (CONTENT-TYPE)] {93}
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0001_01C3E681.3A67D9C0"

)
a OK Completed (0.000 sec)

Right.
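
To make the numbering rule concrete, here is a minimal Python sketch that parses a BODYSTRUCTURE list and names each part. The function names parse and sections are hypothetical. The sketch assumes the response contains only quoted strings, atoms and numbers (no IMAP literals), and it does not descend into message/rfc822 parts, whose numbering needs extra care.

```python
import re

# One token: "(", ")", a quoted string, or an atom such as NIL or 170.
TOKEN = re.compile(r'\(|\)|"((?:[^"\\]|\\.)*)"|[^\s()"]+')


def parse(text):
    """Parse the parenthesized list that follows BODYSTRUCTURE into nested lists."""
    stack = [[]]
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token == "(":
            stack.append([])
        elif token == ")":
            finished = stack.pop()
            stack[-1].append(finished)
        elif token.startswith('"'):
            # Undo backslash escapes inside the quoted string.
            stack[-1].append(re.sub(r"\\(.)", r"\1", match.group(1)))
        elif token.upper() == "NIL":
            stack[-1].append(None)
        elif token.isdigit():
            stack[-1].append(int(token))
        else:
            stack[-1].append(token)
    return stack[0][0]


def sections(body, number=""):
    """Yield (section number, MIME type) for every numbered part."""
    if isinstance(body[0], list):
        # Multipart: the leading lists are the children, then comes the subtype.
        count = 0
        while isinstance(body[count], list):
            count += 1
        if number:
            # The top-level multipart has no number of its own.
            yield number, "multipart/" + body[count].lower()
        for index, child in enumerate(body[:count], start=1):
            yield from sections(child, f"{number}.{index}" if number else str(index))
    else:
        # A single part at the top level is always part 1.
        yield number or "1", f"{body[0]}/{body[1]}".lower()


SIMPLE = '("TEXT" "PLAIN" ("CHARSET" "us-ascii") NIL NIL "7BIT" 724 26 NIL NIL NIL NIL)'
print(list(sections(parse(SIMPLE))))  # [('1', 'text/plain')]
```

Run on the first BODYSTRUCTURE of this section, the same function yields parts 1 (multipart/alternative), 1.1, 1.2 and 2.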

ENVELOPE

I broke this down for readability; an IMAP server won't return it indented. Note that in Tinymail you can find an LGPL ENVELOPE parser in plain C here. Also check out this document about metadata collecting for E-mail.

IMAP RFC about ENVELOPE. Some more info available at page 56.

a UID FETCH 21 ENVELOPE
* 1 FETCH (UID 21 ENVELOPE (
        "Mon, 06 Apr 2009 17:02:16 +0200" "Test subject" 
        ((NIL NIL "from.some" "body.com")) 
        ((NIL NIL "from.some" "body.com")) 
        ((NIL NIL "from.some" "body.com")) 
        ((NIL NIL "to.some" "body.com")) 
        NIL NIL NIL "<9999999999.9999.99.yoursoftware@hostname>"))
a OK Completed
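
Two ENVELOPE fields need a small conversion before they fit into RDF: the date string and the address tuples. The following Python sketch shows one way to do that with the standard library. The function names are hypothetical, and it assumes that nmo:sentDate is typed as xsd:dateTime (as in the NMO ontology) and that addresses are stored as mailto: resources, as in the Turtle on this page.

```python
from email.utils import parsedate_to_datetime


def envelope_date_to_xsd(date_header):
    """Convert an RFC 2822 date string from ENVELOPE to an xsd:dateTime string."""
    return parsedate_to_datetime(date_header).isoformat()


def envelope_address_to_mailto(address):
    """Convert one ENVELOPE address (name, route, mailbox, host) to a mailto: URI."""
    _name, _route, mailbox, host = address
    return f"mailto:{mailbox}@{host}"


print(envelope_date_to_xsd("Mon, 06 Apr 2009 17:02:16 +0200"))
print(envelope_address_to_mailto((None, None, "to.some", "body.com")))
```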

Headers of individual parts

You use the part specifier followed by .MIME. For example, 1.MIME gives the header of the alternative MIME part container:

a UID FETCH 21 BODY.PEEK[1.MIME]
* 1 FETCH (UID 21 BODY[1.MIME] {95}
Content-Type: multipart/alternative;
        boundary="----=_NextPart_000_0001_01C5FD97.D696E730"

)
a OK Completed (0.000 sec)

And to get the headers of the Plain Text mime part within the alternative mime part container:

a UID FETCH 21 BODY.PEEK[1.1.MIME]
* 1 FETCH (UID 21 BODY[1.1.MIME] {83}
Content-Type: text/plain;
        charset="us-ascii"
Content-Transfer-Encoding: 7bit

)
a OK Completed (0.000 sec)

Or to get the headers of the HTML one:

a UID FETCH 21 BODY.PEEK[1.2.MIME]
* 1 FETCH (UID 21 BODY[1.2.MIME] {94}
Content-Type: text/html;
        charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

)
a OK Completed (0.000 sec)

Be careful while coding!

Just because my examples use an E-mail whose parts are called 1.1, 1.2, and so on doesn't mean this is always the case. If, for example, the top-level Content-Type of the E-mail is multipart/alternative, then you can expect the Plain Text part to be 1 and the HTML part to be 2. To get the headers of the multipart/alternative you should then ask for BODY.PEEK[HEADER] instead of something wrong like BODY.PEEK[0.MIME]. A similar thing happens for the simpler E-mails that you often get on mailing lists (many mailing list subscribers don't like HTML E-mails, so mailing list E-mails are usually simpler ones).

For such simple E-mails the E-mail's top Content-Type is often text/plain directly. This means that part 1 is the text/plain part itself and that there are no parts 2 and 3. There is also no part 1.1 or 1.2: that kind of structure would require a multipart/alternative, which a simple E-mail doesn't have.

It's all fairly logical, but you have to put the pieces together correctly in your mind. And, of course, some people don't find this logical. That was to be expected.
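
The header rule can be captured in a few lines. This is only a Python sketch with a hypothetical function name: the top-level message has no part number, so its headers come from HEADER, while any numbered part uses its own number followed by .MIME.

```python
def header_fetch_item(section=None):
    """Return the FETCH data item for the headers of a part.

    Pass None (or an empty string) for the top-level E-mail itself.
    """
    if not section:
        # There is no part 0; the message's own headers are HEADER.
        return "BODY.PEEK[HEADER]"
    return f"BODY.PEEK[{section}.MIME]"


print(header_fetch_item())       # BODY.PEEK[HEADER]
print(header_fetch_item("1.1"))  # BODY.PEEK[1.1.MIME]
```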

In RDF

This is, for example, the URL for MIME part 1.2 in the E-mail with UID 21 in mailbox INBOX on joe's account on example.com's IMAP server:

<imap://joe@example.com/INBOX/;uid=21/;section=1.2>

Text body part of the top E-mail in HTML format

Overview of the complete E-mail:

<imap://joe@example.com/INBOX/;uid=21>

The complete E-mail

<imap://joe@example.com/INBOX/;uid=21/;section=TEXT>

The content of the complete E-mail

<imap://joe@example.com/INBOX/;uid=21/;section=1>

Container MIME part (the ALTERNATIVE) for the Plain Text and HTML body parts of the E-mail

<imap://joe@example.com/INBOX/;uid=21/;section=1.1>

Text body part of the top E-mail in Plain Text format

<imap://joe@example.com/INBOX/;uid=21/;section=1.2>

Text body part of the top E-mail in HTML format

<imap://joe@example.com/INBOX/;uid=21/;section=2>

The .diff file attachment

Turtle:

@prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .
@prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .
@prefix nmo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#> .
@prefix nrl: <http://www.semanticdesktop.org/ontologies/2007/08/15/nrl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nmm: <http://www.tracker-project.org/temp/nmm#> .

<imap://joe@example.com/INBOX/;uid=21/;section=2> a nmo:Attachment ;
  nie:mimeType "text/x-patch" ;
  nmo:contentTransferEncoding "7BIT" ;
  nfo:lineCount 266 ;
  nmo:contentDisposition "ATTACHMENT" ;
  nfo:fileName "fix_class_signals.diff" ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 9541 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=2' .

<imap://joe@example.com/INBOX/;uid=21/;section=1.1> a nmo:MimePart, nfo:PlainTextDocument ;
  nie:mimeType "text/plain" ;
  nmo:contentTransferEncoding "7BIT" ;
  nfo:lineCount 7 ;
  nie:byteSize 170 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1.1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1.1' .

<imap://joe@example.com/INBOX/;uid=21/;section=1.2> a nmo:MimePart, nfo:HtmlDocument ;
  nie:mimeType "text/html" ;
  nmo:contentTransferEncoding "7BIT" ;
  nfo:lineCount 24 ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 1321 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1.2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1.2' .

<imap://joe@example.com/INBOX/;uid=21/;section=1> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1.1> ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1.2> ;
  nmo:contentDisposition "ALTERNATIVE" ;
  nmo:partBoundary "=-GgGWuVS+goa+7OHIJWr0" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=1' .

<imap://joe@example.com/INBOX/;uid=21/;section=TEXT> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=1> ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=21/;section=2> ;
  nmo:partBoundary "=-RBJ0QoWwq+KaBoV5H8JN" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21/;section=TEXT> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21/;section=TEXT' .

<imap://joe@example.com/INBOX/;uid=21> a nmo:Email, nmo:MailboxDataObject ;
  nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:to.some@body.com> ] ;
  nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:from.some@body.com> ] ;
  nmo:messageSubject "Test subject" ;
  nmo:sentDate "2009-04-06T17:02:16+02:00" ;
  nmo:hasContent <imap://joe@example.com/INBOX/;uid=21/;section=TEXT> ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=21' .
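
The Turtle for a leaf part can be generated mechanically from its BODYSTRUCTURE entry. The following Python sketch shows the idea. The function name leaf_part_turtle is hypothetical; the field positions follow the single-part layout seen in the BODYSTRUCTURE replies (type, subtype, parameters, id, description, encoding, size and, for text parts, the line count), and the property names are copied from the example above. It performs no string escaping, so it assumes values without quotes or backslashes.

```python
def leaf_part_turtle(url, part):
    """Render one non-multipart BODYSTRUCTURE entry as a Turtle resource."""
    mime_type = f"{part[0]}/{part[1]}".lower()
    lines = [
        f"<{url}> a nmo:MimePart ;",
        f'  nie:mimeType "{mime_type}" ;',
        f'  nmo:contentTransferEncoding "{part[5]}" ;',
    ]
    if part[0].upper() == "TEXT":
        # Only text parts carry a line count after the size.
        lines.append(f"  nfo:lineCount {part[7]} ;")
    lines += [
        f"  nie:byteSize {part[6]} ;",
        f"  nie:isStoredAs <{url}> ;",
        f"  nie:url '{url}' .",
    ]
    return "\n".join(lines)


PLAIN = ["TEXT", "PLAIN", None, None, None, "7BIT", 170, 7, None, None, None]
print(leaf_part_turtle("imap://joe@example.com/INBOX/;uid=21/;section=1.1", PLAIN))
```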

E-mails that contain forwards

You can save a Turtle document like the ones on this page into a file such as import.ttl and load it with org.freedesktop.Tracker.Resources.Load, or you can use Resources.SparqlUpdate to insert individual triples.

Overview of the complete E-mail:

<imap://joe@example.com/INBOX/;uid=20>

The complete E-mail

<imap://joe@example.com/INBOX/;uid=20/;section=TEXT>

The content of the complete E-mail

<imap://joe@example.com/INBOX/;uid=20/;section=1>

Container MIME part (the ALTERNATIVE) for the Plain Text and HTML body parts of the top E-mail

<imap://joe@example.com/INBOX/;uid=20/;section=1.1>

Text body part of the top E-mail in Plain Text format

<imap://joe@example.com/INBOX/;uid=20/;section=1.2>

Text body part of the top E-mail in HTML format

<imap://joe@example.com/INBOX/;uid=20/;section=2>

RFC822 E-mail MIME part containing the forward

<imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT>

The content of the complete forwarded E-mail

<imap://joe@example.com/INBOX/;uid=20/;section=2.1>

Container MIME part for the plain and html body parts of the forwarded E-mail

<imap://joe@example.com/INBOX/;uid=20/;section=2.1.1>

Text body part of the forwarded E-mail in Plain Text format

<imap://joe@example.com/INBOX/;uid=20/;section=2.1.2>

Text body part of the forwarded E-mail in HTML format

<imap://joe@example.com/INBOX/;uid=20/;section=2.2>

Video attachment in the forwarded E-mail

Turtle:

@prefix nco: <http://www.semanticdesktop.org/ontologies/2007/03/22/nco#> .
@prefix nfo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#> .
@prefix nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> .
@prefix nmo: <http://www.semanticdesktop.org/ontologies/2007/03/22/nmo#> .
@prefix nrl: <http://www.semanticdesktop.org/ontologies/2007/08/15/nrl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tracker: <http://www.tracker-project.org/ontologies/tracker#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix nmm: <http://www.tracker-project.org/temp/nmm#> .


<imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> a nmo:MimePart, nfo:TextDocument ;
  nfo:lineCount 123 ;
  nie:mimeType "text/plain" ;
  nmo:contentTransferEncoding "7BIT" ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 1321 ;
  nfo:wordCount 89 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> a nmo:MimePart, nfo:HtmlDocument ;
  nfo:lineCount 160 ;
  nie:mimeType "text/html" ;
  nmo:contentTransferEncoding "7BIT" ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 1452 ;
  nfo:wordCount 89 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.2> a nmo:MimePart, nmm:Video ;
  nie:title "Some movie" ;
  nfo:encoding "BASE64" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.1> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=2.1.1> ,
              <imap://joe@example.com/INBOX/;uid=20/;section=2.1.2> ;
  nmo:partBoundary "--------------------fasdfs-----" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=2.1> ,
              <imap://joe@example.com/INBOX/;uid=20/;section=2.2> ;
  nmo:partBoundary "--------------------fa22fs-----" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT' .

<imap://joe@example.com/INBOX/;uid=20/;section=2> a nmo:Email ;
  nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:forward@ed.from.com> ] ;
  nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:forward@ed.to.com> ] ;
  nmo:messageSubject "Forward me" ;
  nmo:hasContent <imap://joe@example.com/INBOX/;uid=20/;section=2.TEXT> ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=2' .

<imap://joe@example.com/INBOX/;uid=20/;section=1.1> a nmo:MimePart, nfo:TextDocument ;
  nfo:lineCount 66 ;
  nie:mimeType "text/plain" ;
  nmo:contentTransferEncoding "7BIT" ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 456 ;
  nfo:wordCount 40 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1.1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1.1' .

<imap://joe@example.com/INBOX/;uid=20/;section=1.2> a nmo:MimePart, nfo:HtmlDocument ;
  nfo:lineCount 60 ;
  nie:mimeType "text/html" ;
  nmo:contentTransferEncoding "7BIT" ;
  nmo:charSet "UTF-8" ;
  nie:byteSize 556 ;
  nfo:wordCount 40 ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1.2> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1.2' .

<imap://joe@example.com/INBOX/;uid=20/;section=1> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=1.1> ,
              <imap://joe@example.com/INBOX/;uid=20/;section=1.2> ; 
  nmo:partBoundary "--------------------hheer-----" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=1> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=1' .

<imap://joe@example.com/INBOX/;uid=20/;section=TEXT> a nmo:MimePart, nmo:Multipart ;
  nie:hasPart <imap://joe@example.com/INBOX/;uid=20/;section=1> ,
              <imap://joe@example.com/INBOX/;uid=20/;section=2> ;
  nmo:partBoundary "----------------ssdsd---------" ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20/;section=TEXT> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20/;section=TEXT' .

<imap://joe@example.com/INBOX/;uid=20> a nmo:Email, nmo:MailboxDataObject ;
  nmo:to [ a nco:Contact ; nco:hasEmailAddress <mailto:to.some@body.com> ] ;
  nmo:from [ a nco:Contact ; nco:hasEmailAddress <mailto:from.some@body.com> ] ;
  nmo:messageSubject "FWD: Forward me" ;
  nmo:hasContent <imap://joe@example.com/INBOX/;uid=20/;section=TEXT> ;
  nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20> ;
  nie:url 'imap://joe@example.com/INBOX/;uid=20' .

Querying

Example queries on attachments

Getting the filename and attachment out of the first E-mail

SELECT nie:url (?attachment) nfo:fileName (?attachment)
WHERE { ?s nie:isStoredAs <imap://joe@example.com/INBOX/;uid=21> ;
           nmo:hasContent ?content .
        ?content nie:hasPart ?attachment .
        ?attachment a nmo:Attachment .
}

  imap://joe@example.com/INBOX/;uid=21/;section=2, fix_class_signals.diff

Getting the attachment and title of the forward in the second E-mail

SELECT nie:url (?forward_attachment) nie:title (?forward_attachment)
WHERE { ?s nie:isStoredAs <imap://joe@example.com/INBOX/;uid=20> ;
           nmo:hasContent ?content .
        ?content nie:hasPart ?forward .
        ?forward nmo:hasContent ?forward_content .
        ?forward_content nie:hasPart ?forward_attachment .
}

  imap://joe@example.com/INBOX/;uid=20/;section=2.2, Some movie

Example query using subject and recipient

Getting the recipient of the messages that have message subject "Forward me"

SELECT ?s  
WHERE {
   ?o nmo:to ?s ;
      nmo:messageSubject "Forward me" .  
}

Example query using subject

Getting the message that has subject "FWD: Forward me"

SELECT nie:url (?s)
WHERE {
 ?s nmo:messageSubject "FWD: Forward me"
}

Example query using subject

Getting the URL of the messages that have message subject "Forward me"

SELECT nie:url (?s) 
WHERE { 
  ?s nmo:messageSubject "Forward me" 
}

Note that the result differs from that of the previous query. This is why you want to properly assign a URL to each MIME part. The forwarded message is not the same as the container message. If you store E-mail metadata properly, then with the right query you can get access to both of them.

More complex example queries

Getting the parent E-mail when only knowing the leaf forwarded E-mail's message subject.

SELECT nie:url (?parent)
WHERE {
   ?parent nmo:hasContent ?content .
   ?content nie:hasPart ?forward .
   ?forward nmo:messageSubject "Forward me" .
}
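
To see which resources such a parent lookup has to walk through, here is a Python sketch that follows the links by hand over a handful of triples copied from the forwarding E-mail's Turtle. It uses no RDF library; the helper names subjects and parent_emails are hypothetical, and the triple list is deliberately incomplete.

```python
BASE = "imap://joe@example.com/INBOX/;uid=20"

# A small subset of the triples from the Turtle of the forwarding E-mail.
TRIPLES = [
    (BASE, "nmo:messageSubject", "FWD: Forward me"),
    (BASE, "nmo:hasContent", BASE + "/;section=TEXT"),
    (BASE + "/;section=TEXT", "nie:hasPart", BASE + "/;section=1"),
    (BASE + "/;section=TEXT", "nie:hasPart", BASE + "/;section=2"),
    (BASE + "/;section=2", "nmo:messageSubject", "Forward me"),
    (BASE + "/;section=2", "nmo:hasContent", BASE + "/;section=2.TEXT"),
]


def subjects(predicate, obj):
    """Return every subject that has the given predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def parent_emails(subject_text):
    """Find E-mails whose content contains a message with this subject."""
    parents = []
    for forward in subjects("nmo:messageSubject", subject_text):
        for content in subjects("nie:hasPart", forward):
            parents.extend(subjects("nmo:hasContent", content))
    return parents


print(parent_emails("Forward me"))
```

The walk goes from the forwarded message (section 2) to the content part that holds it (section TEXT) and from there to the E-mail that owns that content.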

Getting the boundary of the container MIME part of the (forwarded) message that has message subject "Forward me".

SELECT ?boundary 
WHERE { 
  ?x nie:hasPart ?y ;  
     nmo:partBoundary ?boundary . 
  ?y nmo:messageSubject "Forward me" . 
}

Attic/Tracker/Documentation/Examples/SPARQL/Email (last edited 2023-08-14 12:49:59 by CarlosGarnacho)