Parsing mbox Files with the Mailbox Library

At my day job, I was recently tasked with pulling an export of more than 2,200 emails and organizing them so they would be easy to work with. Our email archive exports emails in the .mbox format, which is essentially one very long text file that stores all of the email metadata, content, images, and attachments. The mbox file for the 2,200 emails I pulled was over 7 million lines long.

I was not able to find many viable options for opening an mbox file and working with its contents. A few older blog posts suggested Mozilla Thunderbird or various Windows utilities, but I wasn't able to get any of them working. Ultimately, the emails needed to be separated into individual messages so that someone could view and print them as part of the request.

Knowing the mbox archive was a text file, I opened it in a text editor to see what it looked like and whether I could write something in Python to parse it. From reading up on the mbox format, I knew each message began with the same 5 characters: "From ". With that, I could start to pick out individual messages in the text file. Most messages were stored in multiple formats, including text-only, html, and mixed text and html, and any images or attachments in the messages were encoded as base64 text. I started to write a parser, but there was no way I was going to finish it in the time I had.
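
As a rough illustration of the "From " separator idea, the sketch below splits raw mbox text wherever a line starts with those five characters. It is not the parser from this project: the function name split_mbox_text and the sample text are made up for the example, and it ignores details a real parser has to handle, such as body lines escaped as ">From " and MIME decoding.

```python
def split_mbox_text(text):
    """Split raw mbox text into messages at lines starting with 'From '."""
    messages = []
    current = []
    for line in text.splitlines(keepends=True):
        # a separator line closes the message collected so far
        if line.startswith("From ") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages


# hypothetical two-message sample; the last body line is escaped with ">"
sample = (
    "From alice@example.com Mon Jan  1 00:00:00 2024\n"
    "Subject: first\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Tue Jan  2 00:00:00 2024\n"
    "Subject: second\n"
    "\n"
    ">From here on, this body line is escaped\n"
)

parts = split_mbox_text(sample)
print(len(parts))
```

Even this small version shows why hand-rolling a parser gets slow: splitting is the easy part, and the multipart bodies and base64 attachments inside each message still need decoding.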

I figured someone else had already solved this problem, so I turned to Google. While I couldn't find anything that did exactly what I wanted, I did find that the mailbox module in the Python standard library can parse an mbox file. With it, I could store each message in its own directory with a text and html version, and save any attachments for the message in that directory as well.
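
To give a feel for the standard library's mailbox module before the full script, here is a minimal sketch. It builds a tiny throwaway mbox file first so the example is self-contained; the file name, addresses, and subjects are invented for illustration.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# build a tiny throwaway mbox so the example does not need a real export
tmp_dir = tempfile.mkdtemp()
mbox_path = os.path.join(tmp_dir, "example.mbox")

box = mailbox.mbox(mbox_path)
for n in (1, 2):
    msg = EmailMessage()
    msg["From"] = f"sender{n}@example.com"
    msg["To"] = "me@example.com"
    msg["Subject"] = f"Test message {n}"
    msg.set_content(f"Body of message {n}")
    box.add(msg)
box.flush()
box.close()

# iterating over an mbox yields one message object at a time,
# and headers are read with dictionary-style access
subjects = [message["subject"] for message in mailbox.mbox(mbox_path)]
print(subjects)
```

The header lookups are case-insensitive, which is why the script can use message['from'] and message['subject'] directly.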

Setting up the script

The script is designed to be run from the command line in the directory that contains the mbox file. It takes two command line arguments: the name of the file to process and the name of the directory where all of the messages should be stored. The first part of the script, below, uses argparse to read the two arguments and stores them in variables for the file name and directory name. The script also creates a csv file with summary information about all of the messages, which goes in the same directory. The headers for the csv file and the list that will hold one dictionary per message are defined here, too.

mbox_process.py

import mailbox
import os
import csv
import argparse

###########################
### PARSE CLI ARGUMENTS ###
###########################

# Create the parser
my_parser = argparse.ArgumentParser(description='Split an mbox file into one folder per message')

# Add the arguments
my_parser.add_argument('file_name',
                       action='store',
                       help='Name of file to be processed.')

my_parser.add_argument('dir_name',
                       action='store',
                       help='Name of directory to store messages.')

# Execute the parse_args() method
args = my_parser.parse_args()

###################################
### PREPARE FOR FILE PROCESSING ###
###################################

# get filename from user
file_name = args.file_name

# get directory to store messages in from user
dir_name = args.dir_name

# check if the output directory exists
if not os.path.isdir(dir_name):
    # make the directory that will hold all of the message folders
    os.mkdir(dir_name)

# create a list to store information on each message to be written to csv
csv_headers = ['Message', 'From', 'To', 'Date', 'Subject', 'Attachment', 'PNG', 'JPG']
email_list = []
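
For trying out the argument handling without a terminal, parse_args also accepts a list of strings. The sketch below is an illustration rather than part of the script: build_parser is a hypothetical helper name, and the use of os.makedirs with exist_ok=True is a suggested alternative to the isdir check above.

```python
import argparse
import os
import tempfile


def build_parser():
    """Return a parser with the same two positional arguments as the script."""
    parser = argparse.ArgumentParser(
        description="Split an mbox file into one folder per message")
    parser.add_argument("file_name", help="Name of file to be processed.")
    parser.add_argument("dir_name", help="Name of directory to store messages.")
    return parser


# the list stands in for: python mbox_process.py example.mbox example
args = build_parser().parse_args(["example.mbox", "example"])
print(args.file_name, args.dir_name)

# exist_ok=True makes the call safe to repeat and also creates parent folders
out_dir = os.path.join(tempfile.mkdtemp(), args.dir_name)
os.makedirs(out_dir, exist_ok=True)
os.makedirs(out_dir, exist_ok=True)
```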

Processing the mbox file

Next, the script processes the mbox file. It uses the enumerate function to get an index that tracks which message is being processed. For each message, it does the following:

Creates a directory to store the message and any images or attachments
Adds the summary information from the message to the email summary list as a dictionary
Creates a header from the message information to print above the body of each message
Checks whether the message is a multipart message:
If yes:
- Saves the plain text version of the message as a .txt file
- Saves the html version of the message as a .html file
- Saves the mixed version of the message as a .html file
- Saves any images in the message as .png or .jpg files and updates the csv summary dictionary to indicate their presence
- Saves any attachments under the file name given in the mbox file and updates the csv summary dictionary
If no:
- Saves the message as a .txt file if it is plain text only
- Saves the message as a .html file if it is html
- Prints "MESSAGE SKIPPED" if the content isn't recognized (for debugging)

#########################
### PROCESS MBOX FILE ###
#########################

# iterate over messages
for idx, message in enumerate(mailbox.mbox(file_name)):
    # create a folder for the message
    folder_name = f"{dir_name}/msg-{idx + 1}"
    prev_folder = f"{dir_name}/msg-{idx}"
    if not os.path.isdir(folder_name):
        # make a folder for this email (named after its message number)
        os.mkdir(folder_name)

    # add message to summary list for csv
    msg_dict_temp = {
        'Message': idx + 1,
        'From': message['from'],
        'To': message['to'],
        'Date': message['date'],
        'Subject': message['subject'],
        'Attachment': "N",
        'PNG': "N",
        'JPG': "N"
    }


    # add header info to full message
    full_message = f'''### ### ### Start Message {idx + 1} from {file_name} ### ### ### \n
TO: {message['to']}
FROM: {message['from']}
DATE: {message['date']}
SUBJECT: {message['subject']}

CONTENT:
    '''

    # multipart messages are handled one part at a time
    if message.is_multipart():
        # iterate over the message parts
        for part in message.walk():
            # get email content
            content_type = part.get_content_type()
            content_disposition = str(part.get("Content-Disposition"))
            # print(f"Msg: {idx +1}")
            # print(content_type)
            # print(content_disposition)
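            try:
                # get the email body
                # note: container parts such as multipart/mixed have no payload
                # of their own, so get_payload(decode=True) returns None, the
                # decode fails, and body keeps the value from an earlier part
                body = part.get_payload(decode=True).decode()
            except Exception:
                pass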

            # save plain text of email
            if content_type == "text/plain":
                # add the body of the message
                full_message_text = full_message + f"{body}"

                # write the message to a file
                # name the file
                filename = f"msg-{idx + 1}.txt"
                filepath = os.path.join(folder_name, filename)
                # check if file exists to append
                if os.path.exists(filepath):
                    # append additional text to existing file
                    open(filepath, "a").write(body)
                else:
                    # write the file
                    open(filepath, "w").write(full_message_text)

            # save html text of email
            elif content_type == "text/html":
                # add the body of the message
                full_message_html = full_message + f"CONTENT: \n{body}"

                # write the message to a file
                # name the file
                filename = f"msg-{idx + 1}.html"
                filepath = os.path.join(folder_name, filename)
                # write the file
                open(filepath, "a+").write(full_message_html)

            # multipart/mixed messages
            # BUG writes to wrong folder and wrong file name (+1)
            elif content_type == "multipart/mixed":
                # add the body of the message
                full_message_mixed = full_message + f"CONTENT: \n{body}"

                # write the message to a file
                # name the file
                filename = f"msg-{idx}-mxd.html"
                filepath = os.path.join(prev_folder, filename)
                # write the file
                open(filepath, "w").write(full_message_mixed)

            # save png attachments
            elif content_type == "image/png":
                # update email summary list
                msg_dict_temp['PNG'] = "Y"
                # download attachment
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # download attachment and save it
                    open(filepath, "wb").write(part.get_payload(decode=True))

            # save jpg attachments
            elif content_type == "image/jpeg":
                # update email summary list
                msg_dict_temp['JPG'] = "Y"
                # download attachment
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # download attachment and save it
                    open(filepath, "wb").write(part.get_payload(decode=True))

            # save email attachment
            elif "attachment" in content_disposition:
                # update email summary list
                msg_dict_temp['Attachment'] = "Y"
                # download attachment
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # download attachment and save it
                    open(filepath, "wb").write(part.get_payload(decode=True))

        # append final message dict to email summary list
        email_list.append(msg_dict_temp)

    else:
        # single-part messages belong in the csv summary as well
        email_list.append(msg_dict_temp)
        if message.get_content_type() == "text/plain":
            for part in message.walk():
                # add the body of the message
                try:
                    # get the email body
                    body = part.get_payload(decode=True).decode()
                except:
                    pass
                full_message_text = full_message + f"{body}"

                # write the message to a file
                # name the file
                filename = f"msg-{idx + 1}.txt"
                filepath = os.path.join(folder_name, filename)
                # write the file
                open(filepath, "a+").write(full_message_text)
        elif message.get_content_type() == "text/html":
            for part in message.walk():
                # add the body of the message
                try:
                    # get the email body
                    body = part.get_payload(decode=True).decode()
                except:
                    pass
                full_message_text = full_message + f"{body}"

                # write the message to a file
                # name the file
                filename = f"msg-{idx + 1}.html"
                filepath = os.path.join(folder_name, filename)
                # write the file
                open(filepath, "a+").write(full_message_text)
        else:
            print(f"MESSAGE SKIPPED {idx + 1}")
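
The three attachment branches above repeat the same save logic and trust the file name stored in the message. A possible refactor is sketched below under two assumptions that are not in the original script: a hypothetical helper named save_attachment, and a rule that only the last path component of the attachment name is used, so a name like "../report.pdf" cannot land outside the message folder. The demo message is made up.

```python
import os
import tempfile
from email.message import EmailMessage


def save_attachment(part, folder_name):
    """Write one attachment part into folder_name and return the path used."""
    attachment_name = part.get_filename()
    if not attachment_name:
        return None
    # keep only the final path component of the stored name
    safe_name = os.path.basename(attachment_name.replace("\\", "/"))
    if not safe_name:
        return None
    filepath = os.path.join(folder_name, safe_name)
    # the with block closes the file even if the write fails
    with open(filepath, "wb") as handle:
        handle.write(part.get_payload(decode=True))
    return filepath


# demo message with one attachment whose name tries to leave the folder
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("see attached")
msg.add_attachment(b"%PDF-demo", maintype="application", subtype="pdf",
                   filename="../report.pdf")

folder = tempfile.mkdtemp()
saved = [save_attachment(part, folder) for part in msg.iter_attachments()]
print(saved)
```

In the script, each of the png, jpg, and attachment branches could then set its summary flag and call the helper with part and folder_name.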

Writing the summary csv

Finally, the script writes the email summary to a csv file.

####################################
###  WRITE EMAIL SUMMARY TO CSV  ###
####################################

# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open(f'{dir_name}/{dir_name}.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_headers)
    writer.writeheader()
    writer.writerows(email_list)
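
As a small usage sketch, the summary can be read back with csv.DictReader, for example to list which messages have attachments. The rows, the temporary directory, and the file name summary.csv below are invented so the example runs on its own; in the script the data comes from email_list.

```python
import csv
import os
import tempfile

csv_headers = ['Message', 'From', 'To', 'Date', 'Subject', 'Attachment', 'PNG', 'JPG']

# made-up rows standing in for email_list
email_list = [
    {'Message': 1, 'From': 'a@example.com', 'To': 'me@example.com',
     'Date': 'Mon, 1 Jan 2024 00:00:00 +0000', 'Subject': 'Hello, world',
     'Attachment': 'N', 'PNG': 'N', 'JPG': 'N'},
    {'Message': 2, 'From': 'b@example.com', 'To': 'me@example.com',
     'Date': 'Tue, 2 Jan 2024 00:00:00 +0000', 'Subject': 'Report',
     'Attachment': 'Y', 'PNG': 'Y', 'JPG': 'N'},
]

csv_path = os.path.join(tempfile.mkdtemp(), "summary.csv")

# write the summary; newline='' leaves line endings to the csv module
with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_headers)
    writer.writeheader()
    writer.writerows(email_list)

# read it back; every value comes back as a string
with open(csv_path, newline='', encoding='utf-8') as csvfile:
    rows = list(csv.DictReader(csvfile))

with_attachments = [row['Message'] for row in rows if row['Attachment'] == 'Y']
print(with_attachments)
```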

When processing is complete, the output directory contains the summary csv and one subdirectory per message, each holding that message's text, html, and attachments. You end up with a directory structure like this:

Directory
|
|---example.mbox
|---example
    |---example.csv
    |---msg-1
    |   |---msg-1.txt
    |   |---msg-1.html
    |---msg-2
    |   |---msg-2.txt
    |---msg-3
    |   |---msg-3.html
    |---msg-4
    |   |---msg-4.txt
    |   |---msg-4.html
    |   |---image.png
    |---msg-5
        |---msg-5.txt
        |---msg-5.html
        |---attachment.pdf
        |---screenshot.jpg

Notes on above example
The summary csv has the same name as the directory that stores all the messages
msg-1 is a multipart message with both text/plain and text/html content types
msg-2 is a message with only a text/plain content type
msg-3 is a message with only a text/html content type
msg-4 is a multipart message with both text/plain and text/html content types and a png image named image.png
msg-5 is a multipart message with both text/plain and text/html content types, a jpg image named screenshot.jpg, and an attachment called attachment.pdf

Overall, it took about 8 hours of research, writing, and testing to complete the script. There were a number of challenges, mainly in figuring out how to handle the multipart messages. There is still a bug in the script for mixed-content messages: their content always seemed to end up in the wrong folder, as if the index were off by one, so the script writes those files to the previous message's directory. A likely cause is that a multipart/mixed part is only a container with no decodable payload of its own, so the body variable still holds text from an earlier part when that branch runs. I'm happy with the workaround for now but need to refine the script further. I could probably remove the mixed-content branch altogether, because I am not sure it contains anything that is not already in the text or html version of the files. I also want to write a separate header for the html messages so they display a little better when opened in a browser. The final script can be found on my GitHub page.
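
One possible direction for the mixed-content bug is sketched below. It assumes the stray text comes from container parts, which have no payload of their own, and it is an untested idea rather than the fix used in the script. The helper name collect_bodies and the demo message are hypothetical. The sketch skips containers, decodes only leaf text parts, and uses the charset each part declares instead of always assuming UTF-8.

```python
from email.message import EmailMessage


def collect_bodies(message):
    """Return the text/plain and text/html bodies found in a message's leaf parts."""
    bodies = {"text/plain": "", "text/html": ""}
    for part in message.walk():
        # containers such as multipart/mixed hold other parts, not text
        if part.is_multipart():
            continue
        content_type = part.get_content_type()
        if content_type not in bodies:
            continue
        # leave text files sent as attachments out of the body
        if "attachment" in str(part.get("Content-Disposition")):
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # use the charset the part declares; fall back to UTF-8
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # the declared charset name is unknown to Python
            text = payload.decode("utf-8", errors="replace")
        bodies[content_type] += text
    return bodies


# demo: multipart/mixed wrapping a plain/html alternative plus a binary attachment
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")
msg.add_attachment(b"\x00\x01", maintype="application",
                   subtype="octet-stream", filename="data.bin")

bodies = collect_bodies(msg)
print(bodies)
```

Because every value is rebuilt from scratch for each message, nothing can carry over from the previous message, which would make the prev_folder workaround unnecessary.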

Resources
Python Mailbox Documentation
How to Read Emails with Python from PythonCode