This is the sixth post in a series on developing a POP3 client in C#.
Two possibilities
There are actually two options when writing a MIME parser. The first is to download the whole message and then parse its content with regular expressions; the second is to parse the message line by line while downloading. I'm not sure whether either approach has a major advantage. I could imagine that the line-by-line method performs better, but I decided to take the regex approach because I think it makes faulty MIME messages easier to handle.
Parsing the message
First, a brief outline of how the message parsing works. A MIME message consists of MIME entities (as mentioned in my last post), each of which can contain further entities separated by a "boundary". The boundary string is defined in the header section of each multipart entity. So we first parse the boundary string, then split the message at the boundary to get the individual parts. We repeat this step for every part whose content type starts with "multipart".
Parsing Headers
MIME message headers are always formatted in the following way:
    HeaderName: Header Value

Some headers hold more than one value, for example the "Content-Type" header, which then has the following format:

    Content-Type: multipart/mixed;
        boundary="----=_NextPart_000_001A_034CE23.234EB34"

As mentioned in my previous post, there are some standard headers, but mail clients or spam filters can append their own headers as well. This means there are headers whose name we already know and whose value we simply want to look up, and headers we know nothing about in advance. So there are two methods:
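    // parsing a "known" header; returns "" if the header is not found
    // Requires: using System.Text.RegularExpressions;
    public string ParseHeader(string mimeMessage, string headerName)
    {
        if (!String.IsNullOrEmpty(mimeMessage) && !String.IsNullOrEmpty(headerName))
        {
            // Escape the name so it is matched literally and anchor it to the start of a line.
            Regex r = new Regex("^" + Regex.Escape(headerName) + @":\s+(?<HeaderValue>.*)\n",
                RegexOptions.IgnoreCase | RegexOptions.Multiline);
            Match m = r.Match(mimeMessage);
            // Groups[...] is never null, so check whether the match succeeded instead.
            if (m.Success)
                return m.Groups["HeaderValue"].Value.Trim();
        }
        return "";
    }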
    // parsing all headers
    // the regex matches single line and double line headers
    // Requires: using System.Collections.Generic; using System.Text.RegularExpressions;
    public Dictionary<string, string> ParseHeaders(string mimeMessage)
    {
        Dictionary<string, string> headers = new Dictionary<string, string>();
        if (!String.IsNullOrEmpty(mimeMessage))
        {
            Regex r = new Regex(@"(?<HeaderName>[^\r\n:]+):\s+(?<HeaderValue>(.+[\r\n][\t\x20]+.+)|(.+))");
            MatchCollection m = r.Matches(mimeMessage);
            foreach (Match match in m)
            {
                // The indexer overwrites repeated headers such as "Received";
                // Add() would throw an ArgumentException on a duplicate name.
                headers[match.Groups["HeaderName"].Value.Trim()] = match.Groups["HeaderValue"].Value.Trim();
            }
        }
        return headers;
    }
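The regex in ParseHeaders covers headers that span one or two lines, but a header may be folded over more lines than that. One way to deal with this, sketched below, is to cut off the header section at the first blank line and "unfold" it before parsing, so that every header ends up on a single line. The class and method names (HeaderUnfolder, GetHeaderBlock, Unfold) are made up for this illustration and are not part of the client code.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original client code.
public static class HeaderUnfolder
{
    // Returns everything before the first blank line (the header section).
    // If there is no blank line, the whole input is treated as headers.
    public static string GetHeaderBlock(string rawData)
    {
        if (String.IsNullOrEmpty(rawData))
            return "";
        Match blankLine = Regex.Match(rawData, @"\r?\n\r?\n");
        return blankLine.Success ? rawData.Substring(0, blankLine.Index) : rawData;
    }

    // Joins folded lines: a line break followed by spaces or tabs continues
    // the previous header, so it is replaced by a single space.
    public static string Unfold(string headerBlock)
    {
        if (String.IsNullOrEmpty(headerBlock))
            return "";
        return Regex.Replace(headerBlock, @"\r?\n[\t\x20]+", " ");
    }
}
```

The unfolded text could then be handed to ParseHeaders, for example ParseHeaders(HeaderUnfolder.Unfold(HeaderUnfolder.GetHeaderBlock(rawData))). A side effect of cutting at the blank line is that lines in the message body that happen to look like "Name: value" are no longer picked up as headers.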
Parsing the MIME entities
To show how this basically works, here is a simplified "MimePart" class. The code I'll provide for download in my next post has some more properties and methods, but for this example this version works fine:
    public class MimePart
    {
        public string ContentType { get; set; }
        public string Content { get; set; }
        public Dictionary<string, string> Headers { get; set; }
        public List<MimePart> Parts { get; set; }
    }
Let's assume we get the raw MIME message data from the POP3 client. After receiving it, we iterate recursively through the MIME entities:
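    string rawMimeData = Pop3.GetMessage(32);
    MimePart message = GetMimePart(rawMimeData);

    // ParseContentType and GetBoundary are helper methods that are not shown in this post.
    private MimePart GetMimePart(string rawData)
    {
        MimePart part = new MimePart();
        part.Parts = new List<MimePart>();
        part.ContentType = ParseContentType(rawData);
        if (part.ContentType.StartsWith("multipart", StringComparison.OrdinalIgnoreCase))
        {
            // split into parts; the boundary is escaped because it may contain
            // regex metacharacters, and "(?:--)?" also matches the closing delimiter.
            // The groups are non-capturing so Regex.Split does not return the delimiters.
            string boundary = GetBoundary(rawData);
            Regex r = new Regex(@"[\s]*--" + Regex.Escape(boundary) + @"(?:--)?[\t\x20]*(?:\r?\n|$)");
            string[] rawParts = r.Split(rawData);
            // rawParts[0] holds the headers of the multipart entity itself
            part.Headers = ParseHeaders(rawParts[0]);
            for (int i = 1; i < rawParts.Length; i++)
            {
                if (rawParts[i].Trim() != "")
                    part.Parts.Add(GetMimePart(rawParts[i]));
            }
        }
        else
        {
            part.Headers = ParseHeaders(rawData);
            // the body starts after the first blank line; skip the separator itself
            int bodyStart = rawData.IndexOf("\r\n\r\n");
            part.Content = bodyStart >= 0 ? rawData.Substring(bodyStart + 4) : "";
        }
        return part;
    }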
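GetMimePart relies on two helpers, ParseContentType and GetBoundary, whose code is not shown in this post. The following is only a sketch of how they might look, not the code from the download: the class name MimeHelpers and the private HeaderBlock method are made up for this example, and the fallback to "text/plain" follows the MIME default for entities without a Content-Type header.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical versions of the helpers used by GetMimePart.
public static class MimeHelpers
{
    // Only the header section (everything before the first blank line) is searched,
    // so a Content-Type header of a nested part is not picked up by accident.
    private static string HeaderBlock(string rawData)
    {
        if (String.IsNullOrEmpty(rawData))
            return "";
        Match blankLine = Regex.Match(rawData, @"\r?\n\r?\n");
        return blankLine.Success ? rawData.Substring(0, blankLine.Index) : rawData;
    }

    // Returns the media type in lower case, e.g. "multipart/mixed",
    // or "text/plain" when there is no Content-Type header.
    public static string ParseContentType(string rawData)
    {
        Match m = Regex.Match(HeaderBlock(rawData),
            @"^Content-Type:\s*(?<Type>[^;\s]+)",
            RegexOptions.IgnoreCase | RegexOptions.Multiline);
        return m.Success ? m.Groups["Type"].Value.ToLowerInvariant() : "text/plain";
    }

    // Returns the value of the boundary parameter (with or without quotes),
    // or "" when the headers do not define one.
    public static string GetBoundary(string rawData)
    {
        Match m = Regex.Match(HeaderBlock(rawData),
            @"boundary\s*=\s*""?(?<B>[^"";\r\n]+)""?",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups["B"].Value.Trim() : "";
    }
}
```

For the Content-Type example from the header section above, ParseContentType would return "multipart/mixed" and GetBoundary would return the boundary string without the surrounding quotes.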
That's basically all we have to do. Sure, this MimePart is not very comfortable to use: encoded parts such as attachments still need to be decoded, and there is no direct way to get the plain text or HTML version of the body. Basic data like the sender's email address or the subject of the message are not easily accessible either. But at least the MIME message now has a structure we can work with.
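To give an idea of what working with this structure could look like, here is a small sketch that searches the tree for the first part of a given content type and decodes base64 content. It is an illustration only: MimePartTools, FindFirst and DecodeBase64Text are hypothetical names, the simplified MimePart class is repeated so the example compiles on its own, and the text is assumed to be UTF-8 (a real message declares its charset in the Content-Type header). Quoted-printable content is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Simplified MimePart from above, repeated so the sketch compiles on its own.
public class MimePart
{
    public string ContentType { get; set; }
    public string Content { get; set; }
    public Dictionary<string, string> Headers { get; set; }
    public List<MimePart> Parts { get; set; }
}

// Hypothetical helpers, not part of the original client code.
public static class MimePartTools
{
    // Depth-first search for the first part whose content type starts with
    // the given prefix, e.g. "text/plain". Returns null if there is none.
    public static MimePart FindFirst(MimePart part, string contentTypePrefix)
    {
        if (part == null)
            return null;
        if (part.ContentType != null &&
            part.ContentType.StartsWith(contentTypePrefix, StringComparison.OrdinalIgnoreCase))
            return part;
        if (part.Parts != null)
        {
            foreach (MimePart child in part.Parts)
            {
                MimePart found = FindFirst(child, contentTypePrefix);
                if (found != null)
                    return found;
            }
        }
        return null;
    }

    // Decodes base64 content to text, assuming UTF-8 (an assumption of this sketch).
    // Line breaks inside the base64 data are ignored by Convert.FromBase64String.
    public static string DecodeBase64Text(string content)
    {
        byte[] bytes = Convert.FromBase64String(content ?? "");
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With a parsed message in hand, MimePartTools.FindFirst(message, "text/plain") would return the plain text body part if there is one. Whether its Content needs decoding depends on the part's Content-Transfer-Encoding header, so DecodeBase64Text should only be called when that header says "base64".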