Parsing mbox Files with the Mailbox Library
In my day job I was recently given the task of pulling an export of over 2,200 emails and organizing them in a way that would be easy to work with. Our email archive exports emails in the .mbox format. The mbox format is essentially one very long text file that stores all of the email metadata, content, images, and attachments. The mbox file for the 2,200 emails I pulled was over 7 million lines long.
I was not able to find many viable options for opening an mbox file to work with its contents. Some older blog posts suggested using Mozilla's Thunderbird or various Windows utilities, but I wasn't able to get any of them working. Ultimately, the emails needed to be separated into individual messages so that someone could view and print them as part of the request.
Knowing the mbox archive was a text file, I opened it in a text editor to see what it looked like and whether I could write something in Python to parse it. From reading up on the mbox format, I knew each message began with the same 5 characters: "From ". Knowing that, I could start to discern individual messages in the text file. Most messages were stored in multiple formats, including text-only, html, and mixed text and html, and any images or attachments in the messages were encoded as base64 text. I was able to start writing a parser, but there was no way I was going to get it done in the time I had.
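To make the "From " idea concrete, here is a minimal sketch of counting messages by looking for that separator. The sample text and the count_messages helper are made up for illustration, and the check is deliberately naive: a body line such as ">From ..." is how mbox writers typically escape text that would otherwise look like a separator, and this sketch only relies on that convention.

```python
import io

# Hypothetical two-message sample; a real archive is far larger.
SAMPLE = (
    "From alice@example.com Mon Jan  1 10:00:00 2024\n"
    "Subject: first\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Mon Jan  1 11:00:00 2024\n"
    "Subject: second\n"
    "\n"
    ">From here on, this line is body text, not a separator\n"
)


def count_messages(lines):
    """Count lines that start with the mbox separator 'From '."""
    return sum(1 for line in lines if line.startswith("From "))


message_count = count_messages(io.StringIO(SAMPLE))
print(message_count)
```

Finding the separators is the easy part. Decoding the multiple body formats and the base64 attachments inside each message is where a hand-written parser gets expensive.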
I figured someone else had already solved this problem, so I turned to Google. While I couldn't find anything that did exactly what I wanted, I did find out that the mailbox library in the Python standard library can parse an mbox file. With it, each message could be stored in its own directory with a text and html version, along with any attachments for the message.
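As a rough illustration of what the mailbox library provides, the sketch below writes a tiny, made-up mbox file to a temporary directory and lists the subject of each message. The sample addresses and file name are assumptions for the demo, not part of the real export.

```python
import mailbox
import os
import tempfile

# Hypothetical sample archive, written to a temporary file for the demo.
sample = (
    "From alice@example.com Mon Jan  1 10:00:00 2024\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: first\n"
    "\n"
    "Hello\n"
    "\n"
    "From bob@example.com Mon Jan  1 11:00:00 2024\n"
    "From: bob@example.com\n"
    "To: alice@example.com\n"
    "Subject: second\n"
    "\n"
    "Hi back\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    with open(path, "w", newline="\n") as f:
        f.write(sample)

    # mailbox.mbox yields one message object per email in the file
    box = mailbox.mbox(path)
    subjects = [message["subject"] for message in box]
    box.close()

print(subjects)
```

Each item is a message object, so headers are available with dictionary-style lookups such as message["subject"].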
Setting up the script
The script is designed to be run from the command line in the directory with the mbox file. It takes two command line arguments: the name of the file to process, and the name of the directory where all of the messages should be stored. Below, the first part of the script uses argparse to take the two command line arguments and store them as the variables used for the file name and directory name. I also create a csv with some summary information about all the messages, which goes in the directory as well. The headers for the csv file and the list of dictionaries used to create it are set up here, too.
mbox_process.py
import mailbox
import os
import csv
import argparse
###########################
### PARSE CLI ARGUMENTS ###
###########################
# Create the parser
my_parser = argparse.ArgumentParser(description='Split an mbox file into one directory per message')
# Add the arguments
my_parser.add_argument('file_name',
                       action='store',
                       help='Name of file to be processed.')
my_parser.add_argument('dir_name',
                       action='store',
                       help='Name of directory to store messages.')
# Execute the parse_args() method
args = my_parser.parse_args()
###################################
### PREPARE FOR FILE PROCESSING ###
###################################
# get filename from user
file_name = args.file_name
# get directory to store messages in from user
dir_name = args.dir_name
# check if the directory exists
if not os.path.isdir(dir_name):
    # make the directory that will hold all of the messages
    os.mkdir(dir_name)
# create a list to store information on each message to be written to csv
csv_headers = ['Message', 'From', 'To', 'Date', 'Subject', 'Attachment', 'PNG', 'JPG']
email_list = []
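One way to try the argument handling without touching the command line is to pass an explicit list to parse_args(). The sketch below assumes a hypothetical build_parser helper that is not part of the original script. It only shows how the two positional arguments map to attributes.

```python
import argparse


def build_parser():
    """Build a parser with the same two positional arguments as the script."""
    parser = argparse.ArgumentParser(
        description="Split an mbox file into one directory per message"
    )
    parser.add_argument("file_name", help="Name of file to be processed.")
    parser.add_argument("dir_name", help="Name of directory to store messages.")
    return parser


# Equivalent to running: python mbox_process.py example.mbox example
args = build_parser().parse_args(["example.mbox", "example"])
print(args.file_name, args.dir_name)
```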
Processing the mbox file
Next, the script processes the mbox file. First, it uses the enumerate function to get an index that I could use to track which message I was working with. Then it processes each message, doing the following:
- Create a directory to store the message and any images or attachments
- Add the summary information from the message to the email summary list as a dictionary
- Create a header based on the message information to print above the body of each message
- Check if the message is a multipart message:
  - If yes:
    - Save the plain text version of the message as a .txt file
    - Save the html text version of the message as a .html file
    - Save the mixed text version of the message as a .html file
    - Save any images in the message as .png or .jpg files and update the csv summary dictionary to indicate their presence
    - Save any attachments in the format indicated in the mbox file and update the csv summary dictionary
  - If no:
    - Check if the message is plain text only and save it as a .txt file
    - Check if the message is html and save it as a .html file
    - Print "MESSAGE SKIPPED" if the content isn't recognized (for debugging)
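The branching above can be sketched on a small, made-up message built with the standard library's EmailMessage class. The classify helper and its labels are hypothetical. They mirror the order of the checks in the list (text, html, image, then attachment) and are not the script itself.

```python
from email.message import EmailMessage

# Build a hypothetical message: plain text, an html alternative, and a PDF attachment.
msg = EmailMessage()
msg["Subject"] = "demo"
msg.set_content("plain body")
msg.add_alternative("<p>html body</p>", subtype="html")
msg.add_attachment(b"%PDF-1.4 fake", maintype="application",
                   subtype="pdf", filename="attachment.pdf")


def classify(part):
    """Return a rough label for one part, checked in the same order as the list."""
    if part.is_multipart():
        return "container"  # holds other parts, has no body of its own
    content_type = part.get_content_type()
    if content_type == "text/plain":
        return "text"
    if content_type == "text/html":
        return "html"
    if content_type in ("image/png", "image/jpeg"):
        return "image"
    if part.get_content_disposition() == "attachment":
        return "attachment"
    return "other"


kinds = [(part.get_content_type(), classify(part)) for part in msg.walk()]
for content_type, kind in kinds:
    print(f"{content_type:25} {kind}")
```

Note that walk() also yields the multipart containers themselves, not only the leaf parts that carry content.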
#########################
### PROCESS MBOX FILE ###
#########################
# iterate over messages
for idx, message in enumerate(mailbox.mbox(file_name)):
    # create a folder for the message
    folder_name = f"{dir_name}/msg-{idx + 1}"
    # folder of the previous message (used by the multipart/mixed workaround)
    prev_folder = f"{dir_name}/msg-{idx}"
    if not os.path.isdir(folder_name):
        # make a folder for this email (named after the message number)
        os.mkdir(folder_name)
    # add message to summary list for csv
    msg_dict_temp = {
        'Message': idx + 1,
        'From': message['from'],
        'To': message['to'],
        'Date': message['date'],
        'Subject': message['subject'],
        'Attachment': "N",
        'PNG': "N",
        'JPG': "N"
    }
    # add header info to full message
    full_message = f'''### ### ### Start Message {idx + 1} from {file_name} ### ### ### \n
TO: {message['to']}
FROM: {message['from']}
DATE: {message['date']}
SUBJECT: {message['subject']}
CONTENT:
'''
    # check whether the message has multiple parts
    if message.is_multipart():
        # iterate over the message parts
        for part in message.walk():
            # get email content
            content_type = part.get_content_type()
            content_disposition = str(part.get("Content-Disposition"))
            # print(f"Msg: {idx +1}")
            # print(content_type)
            # print(content_disposition)
            try:
                # get the email body
                body = part.get_payload(decode=True).decode()
            except Exception:
                # parts that cannot be decoded leave body unchanged
                pass
            # save plain text of email
            if content_type == "text/plain":
                # add the body of the message
                full_message_text = full_message + f"{body}"
                # name the file
                filename = f"msg-{idx + 1}.txt"
                filepath = os.path.join(folder_name, filename)
                # check if file exists to append
                if os.path.exists(filepath):
                    # append additional text to existing file
                    with open(filepath, "a") as f:
                        f.write(body)
                else:
                    # write the file
                    with open(filepath, "w") as f:
                        f.write(full_message_text)
            # save html text of email
            elif content_type == "text/html":
                # add the body of the message
                full_message_html = full_message + f"CONTENT: \n{body}"
                # name the file
                filename = f"msg-{idx + 1}.html"
                filepath = os.path.join(folder_name, filename)
                # write the file
                with open(filepath, "a+") as f:
                    f.write(full_message_html)
            # multipart/mixed messages
            # BUG writes to wrong folder and wrong file name (+1)
            elif content_type == "multipart/mixed":
                # add the body of the message
                full_message_mixed = full_message + f"CONTENT: \n{body}"
                # name the file
                filename = f"msg-{idx}-mxd.html"
                filepath = os.path.join(prev_folder, filename)
                # write the file
                with open(filepath, "w") as f:
                    f.write(full_message_mixed)
            # save png attachments
            elif content_type == "image/png":
                # update email summary list
                msg_dict_temp['PNG'] = "Y"
                # get the attachment name
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # decode the attachment and save it
                    with open(filepath, "wb") as f:
                        f.write(part.get_payload(decode=True))
            # save jpg attachments
            elif content_type == "image/jpeg":
                # update email summary list
                msg_dict_temp['JPG'] = "Y"
                # get the attachment name
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # decode the attachment and save it
                    with open(filepath, "wb") as f:
                        f.write(part.get_payload(decode=True))
            # save email attachment
            elif "attachment" in content_disposition:
                # update email summary list
                msg_dict_temp['Attachment'] = "Y"
                # get the attachment name
                attachment_name = part.get_filename()
                if attachment_name:
                    filepath = os.path.join(folder_name, attachment_name)
                    # decode the attachment and save it
                    with open(filepath, "wb") as f:
                        f.write(part.get_payload(decode=True))
    else:
        if message.get_content_type() == "text/plain":
            for part in message.walk():
                # add the body of the message
                try:
                    # get the email body
                    body = part.get_payload(decode=True).decode()
                except Exception:
                    pass
                full_message_text = full_message + f"{body}"
                # name the file
                filename = f"msg-{idx + 1}.txt"
                filepath = os.path.join(folder_name, filename)
                # write the file
                with open(filepath, "a+") as f:
                    f.write(full_message_text)
        elif message.get_content_type() == "text/html":
            for part in message.walk():
                # add the body of the message
                try:
                    # get the email body
                    body = part.get_payload(decode=True).decode()
                except Exception:
                    pass
                full_message_text = full_message + f"{body}"
                # name the file
                filename = f"msg-{idx + 1}.html"
                filepath = os.path.join(folder_name, filename)
                # write the file
                with open(filepath, "a+") as f:
                    f.write(full_message_text)
        else:
            print(f"MESSAGE SKIPPED {idx + 1}")
    # append final message dict to email summary list (once per message,
    # so single-part messages appear in the csv too)
    email_list.append(msg_dict_temp)
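The try/except around the body decoding could be replaced by a small helper that uses the charset declared on the part. The decode_part function below is a hypothetical sketch, not part of the original script. It assumes that falling back to UTF-8 with replacement characters is acceptable for messages that are only going to be read and printed.

```python
from email.message import EmailMessage


def decode_part(part, fallback="utf-8"):
    """Return the decoded text of a part, or '' when it has no payload of its own."""
    payload = part.get_payload(decode=True)
    if payload is None:
        # multipart containers return None instead of bytes
        return ""
    charset = part.get_content_charset() or fallback
    try:
        return payload.decode(charset, errors="replace")
    except LookupError:
        # the message declared a charset Python does not know
        return payload.decode(fallback, errors="replace")


# Hypothetical messages for the demo
simple = EmailMessage()
simple.set_content("café")

container = EmailMessage()
container.set_content("plain")
container.add_alternative("<p>html</p>", subtype="html")

simple_text = decode_part(simple)
container_text = decode_part(container)
print(repr(simple_text), repr(container_text))
```

Returning an empty string for containers means a later branch never sees text left over from an earlier part.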
Writing the summary csv
Then finally, it writes the summary of emails to a csv file.
##################################
### WRITE EMAIL SUMMARY TO CSV ###
##################################
# newline='' lets the csv module control line endings
with open(f'{dir_name}/{dir_name}.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=csv_headers)
    writer.writeheader()
    writer.writerows(email_list)
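Once the summary exists, it can be read back with csv.DictReader to answer quick questions such as which messages have attachments. The sketch below uses an in-memory buffer and made-up rows with a reduced set of columns so it runs without the real export.

```python
import csv
import io

# Hypothetical rows with a reduced set of columns
csv_headers = ["Message", "Subject", "Attachment"]
rows = [
    {"Message": 1, "Subject": "first", "Attachment": "N"},
    {"Message": 2, "Subject": "second, with a comma", "Attachment": "Y"},
]

# Write the summary the same way the script does, but into memory
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=csv_headers)
writer.writeheader()
writer.writerows(rows)

# Read it back; every value comes back as a string
buffer.seek(0)
with_attachments = [
    row["Message"] for row in csv.DictReader(buffer) if row["Attachment"] == "Y"
]
print(with_attachments)
```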
When processing is complete, each message and its attachments are stored in their own directory, and you end up with a directory structure like this:
Directory
|
|---example.mbox
|---example
    |---example.csv
    |---msg-1
    |   |---msg-1.txt
    |   |---msg-1.html
    |---msg-2
    |   |---msg-2.txt
    |---msg-3
    |   |---msg-3.html
    |---msg-4
    |   |---msg-4.txt
    |   |---msg-4.html
    |   |---image.png
    |---msg-5
        |---msg-5.txt
        |---msg-5.html
        |---attachment.pdf
        |---screenshot.jpg
- Notes on above example
  - The summary csv has the same name as the directory that stores all the messages
  - msg-1 is a multipart message with both content_type of text/plain and text/html
  - msg-2 is a message with only a content_type of text/plain
  - msg-3 is a message with only a content_type of text/html
  - msg-4 is a multipart message with both content_type of text/plain and text/html and a png image named image.png
  - msg-5 is a multipart message with both content_type of text/plain and text/html and a jpg image named screenshot.jpg and an attachment called attachment.pdf
Overall, it took about 8 hours of research, writing, and testing to complete the script. There were a number of challenges, mainly in figuring out how to handle the multipart messages. There is still a bug in the script for mixed content messages: the index appeared to be off by one, so the script writes that content to the previous message's directory, because otherwise it always seemed to end up in the wrong folder. I'm happy with the workaround but need to refine the script further. I could probably remove the mixed content message type, because I am not sure it contains any content that is not already in the text or html version of the files. I also want to write a separate header for the html messages so they display a little better when opened in a browser. The final script can be found on my GitHub page.
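One possible explanation for the mixed content behavior, which has not been checked against the original archive, is that walk() yields the multipart/mixed container first and that container has no payload to decode. When the decode fails silently, the body variable still holds text from an earlier part or message. The sketch below reproduces that effect with a made-up message and a placeholder for the leftover text.

```python
from email.message import EmailMessage

# Hypothetical message with a text body and one attachment (multipart/mixed)
msg = EmailMessage()
msg.set_content("plain body")
msg.add_attachment(b"data", maintype="application",
                   subtype="octet-stream", filename="file.bin")

# Placeholder for whatever an earlier loop iteration left behind
body = "body of the PREVIOUS message"

bodies_seen = []
for part in msg.walk():
    try:
        body = part.get_payload(decode=True).decode()
    except AttributeError:
        # containers return None, so .decode() fails and body keeps its old value
        pass
    bodies_seen.append((part.get_content_type(), body))

for content_type, text in bodies_seen:
    print(f"{content_type:28} {text!r}")
```

If this is what happens, skipping any part where part.is_multipart() is true would avoid the stale text, and the multipart/mixed branch could be dropped without losing content.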
Resources
Python Mailbox Documentation
How to Read Emails with Python from PythonCode