Predicting Email Leaks

Written by Mike Rede on March 4, 2009

There has been more news lately about people sending emails with offensive content. One example is Los Alamitos Mayor Dean Grose, who emailed a picture to a group of recipients and offended at least one of them.

And we all remember the recent email leaks at Yahoo!, where the new CEO, Carol Bartz, offered a $1,000 bounty for the head of the employee responsible for leaking her internal emails to the media.

So I started doing a little research into email leaks and unintended email recipients, and I came across a Carnegie Mellon University study entitled “Preventing Information Leaks in Email,” written by Vitor R. Carvalho of the Language Technologies Institute and William W. Cohen of the Machine Learning Department.

http://www.cs.cmu.edu/~wcohen/postscript/sdm-2007-leak.pdf

Most of us have received emails and then wondered why we received them. At the company I work for, I get emails intended for someone whose last name differs from mine by a single character and whose first name starts with the same initial as my own. Other times I’m included on an unintentional “reply all” email. Just by looking at the contents of the reply I know that I was not meant to see it: either it didn’t affect me or, worse, the sender meant it for one specific person and not for everyone on the reply list. Auto-completion is usually the cause of the first type of unintended email, and being in a rush is usually the cause of the second.

In their study, Carvalho and Cohen modeled different methods for predicting email leaks and applied leak detection to stop leaks before they happen. One method, the aptly named “Cosine” method, identified potential “leak recipients” based on recipient-message pairs. Another method took a classification-based approach: textual and social network features were extracted from the messages and then used to predict an email leak. What is surprising is that their research showed they could identify potential “leak recipients” in almost 82% of the messages.

Basically, their Cosine method looks at the current message being sent to a set of recipients and computes a vector representation of its textual contents. It then looks at all previous messages the sender has sent to each recipient and computes a corresponding vector for each recipient. Now the previously sent messages for each recipient can be compared mathematically against the current message. The recipient with the smallest similarity value is predicted to be the leak recipient of the email.
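
To make that idea concrete, here is a minimal Python sketch of the comparison described above. It is not the authors' implementation: the function names, the plain word-count vectors, the way each recipient's past messages are summed into one profile, and the sample addresses and messages are all my own assumptions for illustration. The paper's actual weighting scheme may differ.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words count vector (an assumed simplification)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message or empty history has no similarity
    return dot / (norm_a * norm_b)


def predict_leak_recipient(message, history):
    """Return the least similar recipient and the score for every recipient.

    `history` maps each recipient to the list of messages the sender
    previously sent to that recipient (hypothetical structure).
    """
    current = vectorize(message)
    scores = {}
    for recipient, past_messages in history.items():
        profile = Counter()
        for past in past_messages:
            profile.update(vectorize(past))  # sum word counts into one profile
        scores[recipient] = cosine(current, profile)
    # The smallest similarity marks the most likely leak recipient.
    return min(scores, key=scores.get), scores


# Made-up example data, for illustration only.
history = {
    "alice@example.com": [
        "quarterly budget forecast attached",
        "budget review meeting on friday",
    ],
    "bob@example.com": [
        "lunch on friday?",
        "fantasy football picks for the weekend",
    ],
}
message = "updated budget forecast for the review meeting"
leak, scores = predict_leak_recipient(message, history)
print(leak, scores)
```

In this made-up example the sender has only ever talked budgets with Alice, so Bob scores lower and gets flagged as the recipient to double-check before sending.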

You, as an administrator, don’t have the time or the luxury to mathematically model email leaks, but you can present this information to your CIO or users as a reminder to always be careful when using auto-complete and to think twice before hitting the reply-all button.
