Bloomgrey

The Postfix source includes an example greylisting policy server in examples/smtpd-policy/greylist.pl, and Postgrey is a production-quality greylister based on it. Both use a Berkeley DB for storing triples and timestamps, which means that they are fairly disk-intensive, and on high-traffic systems the database can grow to huge proportions.

Bloomgrey avoids these problems by using a Bloom filter to record triples which have successfully passed greylisting, eliminating the need for disk-based storage (with a few caveats). The eventual goal is for Bloomgrey to run in less memory than Postgrey occupies just with the db handle open.

Status

At present Bloomgrey exists only as a prototype/proof-of-concept Python script, bloomgrey.py (with accompanying unittests in bloomgrey_test.py).¹ And it seems I have already succeeded in my major goal, that the Bloomgrey process would use less virtual memory than Postgrey — on my development system, Postgrey consumes >9 MB of virtual on starting up (with an empty database), while Bloomgrey occupies only 4624 KB (with an empty greylist). I plan to eventually rewrite Bloomgrey in a compiled language to achieve the smallest possible CPU and memory usage.

I have successfully deployed the above script on a very low-traffic site running Postfix 2.2.5. I will be testing it at a busier site in the near future, to make sure that it performs as well as it should.

If you would like to use Bloomgrey on your own systems, please be aware that the code is supplied with no warranty. Any suggestions or feedback are greatly appreciated.

1. Note that this needs Python 2.4, only because I use the new decorator syntax which isn't available in earlier versions. You could replace the decorators with the older-fashion syntax and you should have no problems in Python 2.2 or later. In fact probably the script could be made fully backwards compatible with a bit more effort, but I am too lazy for that.

License

Implementation details

I will assume that the reader is familiar with the basic concept of greylisting, as well as the various issues that arise when it is employed. Evan Harris' whitepaper provides excellent coverage.

A Bloom filter is a data structure for efficiently storing arbitrary sets of (hashable) objects. It supports insertion and membership tests, but not removal. It is also impossible to extract a member of the set when its original value is not known. And most significantly, membership tests have a probability of returning a falsely positive result (although false negatives are not possible). For details, this page is very informative.

All these properties are perfectly acceptable for a whitelist used in greylisting. Note that here I use the term "whitelist" to mean a list of triples which have successfully "passed" greylisting. A false positive in this context simply means a triple which is allowed, when it should have been deferred. Note also that the probability of a false positive is tunable (lower probabilty equals more memory used); Bloomgrey defaults to <2.4%.

Bloomgrey keeps a "greylist" of triples and timestamps, onto which new triples are placed when they are first seen. If a triple is seen more than a small delay after it was first seen, it is "passed" and added to the Bloom filter whitelist. Greylist entries expire after a reasonable period.

Because entries cannot be removed from the whitelist after insertion, the whitelist maintains a count of the number of insertions and after a certain threshold resets itself to be empty again. This threshold is the single most important tunable of Bloomgrey (max_white_entries in the Python prototype) — it should be set to the number of unique triples your Postfix server sees in a 36-day period (see Harris for why). Memory consumed by Bloomgrey is directly proportional to this value.

I feel like this section is inadequate, and I will improve it later. In the meantime, just check out the code. ;-)

Origin and motivation

As far as I know, the idea of using a Bloom filter for greylisting has not been suggested before. The idea came to me one afternoon when I was simultaneously trying out Postgrey, and reading about LOAF, which uses Bloom filters. I was horrified at the virtual size of the Postgrey process on my home server² and wondered if there was any better way to go about greylisting. The idea of using Bloom filters occurred to me, and the more I thought about it the more it seemed like a good idea — it would use less memory, not require disk access (very slow on my home server), and not depend on weighty libraries like BerkDB.

2. I use a Linksys NSLU2 as my home server, with OpenSlug hacked to run Gentoo. It has an excellent 266 MHz ARM (XScale) CPU but an abysmal 32 MB of memory. Naturally I was horrified when Postgrey started up and filled 9 MB of virtual memory — as much as all of my Postfix processes combined!

Contact

My homepage is at http://www.djc.id.au/. You can send e-mail to me at djc@djc.id.au. If you find Bloomgrey useful or otherwise interesting, I would love to hear from you.