I made a client-side spam filter for Tutanota email

By admin
I created a spam filter that runs in the browser for my personal email.

Introduction

When Google Workspace started charging money, I switched to another email provider: Tutanota. They are great in many ways, but the spam filter is not very good. I got dozens of spam messages each day, and often the same spam message multiple times. These showed up in my inbox even after I had reported an email as spam. I was getting tired of this and decided to do something about it: I made my own client-side spam filter.

Design


Tutanota doesn’t offer IMAP or anything else that would make it possible to retrieve and filter messages in a standard way. I mainly use the Tutanota web app from the browser, and decided to implement the spam filter there, using JavaScript. A browser extension can inject my spam filtering script into the Tutanota web app.

The filtering logic doesn’t have to be very complex. I can often determine whether a message is spam just by skimming the subject and the sender, and I figure an automated filter could do the same.

Reverse engineering the API

My JavaScript spam filter needs access to emails, so I wanted to hook into the Tutanota web app. I tried some things in the browser console and read the source code of the web app, and after some time I figured out how to retrieve emails. There’s a global tutao variable that offers an entry into the Tutanota internals.

Many web applications expose such a global variable, but this is not guaranteed. It is entirely possible to implement an application without using global variables.

The code to retrieve the inbox from the current mailbox is quite straightforward:

tutao.locator.mailModel.getMailboxDetails().then( details => inbox = details[0].folders.getSystemFolderByType("1") );

However, it is less straightforward to retrieve the mails in this folder. The inbox does have a mails property, but it is an identifier, not a list of emails. Instead, we have to pass this identifier to the EntityClient to load a list of mails. Emails, contacts, and calendar items are all forms of entities, and these are all loaded in the same way.

The EntityClient has functionality to load emails (or other entities) starting from a certain email, again indicated with an identifier. This is a common pattern used for pagination. It makes it easy to only retrieve and learn new emails. We remember the identifier from the last email we saw, and request only newer emails.
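The pagination pattern can be sketched without any Tutanota code. The snippet below is only an illustration, not the actual EntityClient API: loadNewerThan, fetchNewMails, and the mail objects are hypothetical stand-ins, and an in-memory array plays the role of the server. It is synchronous for brevity, whereas the real calls return promises.

```javascript
// Hypothetical stand-in for the server: mails ordered from oldest to newest.
const serverMails = [
  { id: "a1", subject: "Schedule week 38" },
  { id: "a2", subject: "Buy Vicodin Pills Today" },
  { id: "a3", subject: "Meeting notes" },
];

// Hypothetical loader: returns up to `count` mails that come after `startId`.
// A startId of null means "start from the beginning".
function loadNewerThan(startId, count) {
  const start = startId === null
    ? 0
    : serverMails.findIndex((m) => m.id === startId) + 1;
  return serverMails.slice(start, start + count);
}

// Fetch everything newer than the last mail we saw, one page at a time.
function fetchNewMails(lastSeenId, pageSize = 2) {
  const fresh = [];
  let cursor = lastSeenId;
  for (;;) {
    const page = loadNewerThan(cursor, pageSize);
    if (page.length === 0) break;
    fresh.push(...page);
    cursor = page[page.length - 1].id; // remember where we stopped
  }
  return { mails: fresh, lastSeenId: cursor };
}
```

Storing the returned lastSeenId between runs means the next call only returns mails that arrived in the meantime.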

Many functions within Tutanota and the classifier are asynchronous. Asynchronous functions return a Promise, and I handled that explicitly by calling then() on the promise and creating new promises to return. I suspect that using the async and await keywords would make this cleaner, but I don’t yet fully understand how they work.
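As a rough comparison of the two styles, here is the same logic written once with then() and once with async and await. The functions loadSubject and categorize are hypothetical stand-ins that return promises; they are not the Tutanota or classifier code.

```javascript
// Hypothetical asynchronous functions standing in for the mail loader
// and the classifier; both return a Promise.
function loadSubject(id) {
  return Promise.resolve(`subject of ${id}`);
}
function categorize(text) {
  return Promise.resolve(text.includes("spam") ? "spam" : "ham");
}

// Style 1: explicit then() chaining.
function classifyWithThen(id) {
  return loadSubject(id).then((subject) => categorize(subject));
}

// Style 2: the same logic with async/await. An async function always
// returns a Promise; await pauses this function until the awaited
// Promise settles, without blocking the rest of the page.
async function classifyWithAwait(id) {
  const subject = await loadSubject(id);
  return categorize(subject);
}
```

Both functions return a Promise, so callers can mix the styles freely. The async version mainly helps when several steps depend on each other, because the code reads top to bottom instead of nesting callbacks.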

The code for the Tutanota web app is open source. However, it is written in TypeScript and then compiled and packaged into a JavaScript bundle. This step removes some useful metadata. That’s why I call getSystemFolderByType("1") and not getSystemFolderByType(MailFolderType.INBOX): the values from MailFolderType are inlined when compiling to JavaScript, and the type itself is no longer available.
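One way to keep the script readable despite the inlining is to redeclare the constant locally. This is only a sketch: the object below is a hand-written stand-in, "1" for the inbox is the only value taken from the call above, and getInbox is a hypothetical helper. Other folder types would have to be looked up in the Tutanota source.

```javascript
// Hand-written replacement for the enum that the compiler inlined.
// Only INBOX is known from the call above; other members are omitted.
const MailFolderType = Object.freeze({ INBOX: "1" });

// Hypothetical helper: picks the inbox from a folders object that offers
// getSystemFolderByType, as in the snippet earlier in this section.
function getInbox(folders) {
  return folders.getSystemFolderByType(MailFolderType.INBOX);
}
```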

Functions within TypeScript that are defined as public survive the translation to JavaScript and keep the same name. Some private functions also keep their name, but most are mangled. I currently use some of the private functions that kept their name, but this may be fragile. A recompilation of the app may inline the private function or name it differently.

Running script on the page

I created a browser extension with a content script that runs on the Tutanota mail app. A content script has access to the DOM, but still runs within its own context. Since I want to hook into the Tutanota JavaScript, I need access to the global variables in the page. This is not possible from the content script, but it is possible to add a script to the page itself. So my content script does nothing other than add another script to the page. That script can access the global variables that contain Tutanota functionality:

const script = document.createElement('script');
script.setAttribute('src', chrome.runtime.getURL('page.js'));
document.body.appendChild(script);

The extension adds a button to the UI, which selects all spam. This button needs to be added just after the Tutanota app has finished loading. I tried to hook into the Tutanota app to get a notification when this is the case, but in the end I decided to keep polling for a <div> with the correct name. Less elegant, but it works.
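Polling like this can be wrapped in a small helper. The sketch below is an assumption about how it could look, not the code from my extension: pollFor, the interval, and the retry limit are all made up for illustration.

```javascript
// Call `find` every `intervalMs` milliseconds until it returns something
// truthy, then resolve with that value. Gives up after `maxTries` attempts.
function pollFor(find, intervalMs = 250, maxTries = 120) {
  return new Promise((resolve, reject) => {
    let tries = 0;
    const timer = setInterval(() => {
      tries += 1;
      const found = find();
      if (found) {
        clearInterval(timer);
        resolve(found);
      } else if (tries >= maxTries) {
        clearInterval(timer);
        reject(new Error("element did not appear in time"));
      }
    }, intervalMs);
  });
}
```

In the page, find would typically be something like () => document.querySelector(selector), with a selector that matches the element the button should be attached to.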

Naive Bayesian classifier


Thomas Bayes formulated a rule of probability, Bayes’ theorem, which today helps us to filter spam. Specifically, when a word occurs more often in spam than in normal messages, the presence of that word gives us some information on whether a message is spam. A Bayesian classifier can automatically learn which words fit which categories, and then assign a probability that a message belongs to a certain category. It is naive in the sense that it treats the words as independent of each other: it calculates the probability for “vicodin” and “pills” separately, even though these may in practice only occur together.
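To make the idea concrete, here is a minimal naive Bayes sketch. It is not the library I used: train, score, and classify are illustrative names, and the add-one (Laplace) smoothing is one common choice for handling words that were never seen in a category.

```javascript
// Minimal naive Bayes sketch with add-one (Laplace) smoothing.
function train(examples) {
  const model = { docs: {}, words: {}, totals: {}, vocab: new Set() };
  for (const [text, category] of examples) {
    model.docs[category] = (model.docs[category] || 0) + 1;
    model.words[category] = model.words[category] || {};
    for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      model.words[category][word] = (model.words[category][word] || 0) + 1;
      model.totals[category] = (model.totals[category] || 0) + 1;
      model.vocab.add(word);
    }
  }
  return model;
}

// Log-score of a category: log P(category) + sum of log P(word | category).
// Adding the per-word terms is where the independence assumption enters.
function score(model, text, category) {
  const docCount = Object.values(model.docs).reduce((a, b) => a + b, 0);
  let logProb = Math.log(model.docs[category] / docCount);
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    const count = model.words[category][word] || 0;
    logProb += Math.log((count + 1) / (model.totals[category] + model.vocab.size));
  }
  return logProb;
}

// Return the category with the highest score.
function classify(model, text) {
  return Object.keys(model.docs)
    .reduce((best, c) => (score(model, text, c) > score(model, text, best) ? c : best));
}
```

Working with logarithms avoids multiplying many small probabilities, which would otherwise underflow for long messages.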

NPM has at least 50 implementations of Bayesian classifiers in JavaScript. I chose one, and just copy-pasted the code into my script, instead of messing with a build environment. It has a simple API:

classifier.learn('Buy Vicodin Pills Today', 'spam');
classifier.learn('Schedule week 38', 'ham');

classifier.categorize('Cheap Vicodin Pills'); // => 'spam'

The categorize function calculates the probability of each category, and returns the most likely category. This may give unexpected results if it is not very sure about the category. If it is 1% certain it’s ham and 2% certain it’s spam, the message will be marked as spam, even though the classifier basically doesn’t know. This gave unexpected results while developing, but it doesn’t really cause problems in practice.
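If this ever did become a problem, one option would be to refuse to decide when even the best category scores low. The sketch below assumes the per-category probabilities are available as a plain object, which the library’s categorize function may not expose. The decide function and its minConfidence parameter are made up for illustration.

```javascript
// Pick the most likely category, but answer "unsure" when even the best
// category has a probability below `minConfidence`.
function decide(probabilities, minConfidence = 0.5) {
  let best = null;
  for (const [category, p] of Object.entries(probabilities)) {
    if (best === null || p > best.p) best = { category, p };
  }
  if (best === null || best.p < minConfidence) return "unsure";
  return best.category;
}
```

With the numbers from the example above, decide({ ham: 0.01, spam: 0.02 }) would answer "unsure" rather than "spam".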

In the example above, we pass a sentence to the API, but the classifier learns from words. The input is split into words by a tokenizer. I began by passing the subject to the default tokenizer, but later on I changed this in a couple of ways.

First, a special token is added when the subject is empty. An empty subject of course does not match any words and would be hard to classify, but the fact that it’s empty is a good sign that it’s spam. To inform the classifier of this, I add a __emptySubject token when the subject is empty.

Second, I don’t split the recipient address into separate words. The fact that something is sent to [email protected] is useful for the classifier, and shouldn’t be confused with a subject that contains “Sjoerd” and “GitHub”.

Third, I made the tokenizer case-insensitive.
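Put together, the three changes could look roughly like this. It is a sketch under assumptions rather than my exact tokenizer: the function signature, the word-splitting regular expression, and the "to:" prefix on the recipient token are all illustrative choices.

```javascript
// Build the token list for one mail from its subject and recipient.
function tokenize(subject, recipient) {
  const tokens = [];
  // Lowercase first so that "GitHub" and "github" become the same token.
  const words = subject.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  if (words.length === 0) {
    tokens.push("__emptySubject"); // marker token for a missing subject
  } else {
    tokens.push(...words);
  }
  if (recipient) {
    // Keep the address as one token, prefixed so it cannot collide
    // with an ordinary word from the subject.
    tokens.push("to:" + recipient.toLowerCase());
  }
  return tokens;
}
```

For example, a mail with an empty subject sent to the hypothetical address user@example.com yields the tokens "__emptySubject" and "to:user@example.com".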

After changing the tokenizer, the classifier no longer works correctly and needs to be retrained. Before I made the tokenizer case-insensitive, it learned about “GitHub” and “VICODIN”, and now it only sees mails about “github” and “vicodin”. I forgot this step a couple of times, and then it looked like the change to the tokenizer had made the classifier a lot worse.

I store the classifier in the browser’s localStorage, so that it persists and I can use and update it every time I open the mail app.
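A minimal sketch of that persistence step follows. The key name and the shape of the state are assumptions, and how a classifier is converted to and from a plain object depends on the library, so that part is left out here.

```javascript
const STORAGE_KEY = "spamClassifier"; // hypothetical key name

// Save the classifier state as JSON. `storage` is localStorage in the
// browser; any object with getItem/setItem works the same way.
function saveState(storage, state) {
  storage.setItem(STORAGE_KEY, JSON.stringify(state));
}

// Load the saved state, or return `fallback` when nothing usable is stored.
function loadState(storage, fallback) {
  const raw = storage.getItem(STORAGE_KEY);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw);
  } catch (err) {
    return fallback; // corrupted entry: start from scratch
  }
}
```

Because localStorage only stores strings, the state has to survive a round trip through JSON. Passing the storage object as a parameter also makes the functions easy to try outside the browser.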

Conclusion

The spam filter works well. The Bayesian filtering was easier than I expected. Integrating with the Tutanota app was the biggest challenge. Because the Tutanota app does not have a well-defined API, my spam filter could stop working whenever they release an update of their app. Until then, I have an easy way to handle my spam problem.
