FAQ:
What is the SRBL?
The Statistical Realtime Blackhole List is similar to other RBLs in that it publishes a zone file of IP addresses from which email should not be accepted; it's a blacklist of known spam sources. Where it differs from a conventional blacklist is that the process is completely automated, and seeks to be unbiased by human prejudices about what constitutes spam. Different people think different things constitute spam. What I hope to implement here is a sort of "voting" system, where a central server can tally, based on the results of highly accurate statistical filters throughout the world, the decisions of many different users as to which emails are spam - and whether the IPs that they originate from constitute spam sources.
In a nutshell, every user in the network will automatically submit (through a script) the IP addresses found in the header of each email, along with the verdict of their trained Bayesian filter (dspam, crm114, etc.) on whether that particular email is spam. The central server will then use that data to decide whether to blacklist a given IP, using a reasonably fair and simple algorithm.
What motivates the SRBL?
Spam is a problem. We all know this. But most of the existing RBLs suck; they're entirely too manual for many people's tastes. Stories abound of IP addresses being blacklisted for various nontechnical or overly aggressive reasons. Blacklists as a rule are a lovely way of putting pressure on ISPs to become more aggressive themselves in enforcing their user policies and working to reduce the spam coming from their network blocks; however, when a blacklist is created and a lot of people subscribe to it, the administrator of that blacklist gains a lot of control, and can enforce his opinion of what constitutes spam on the end user. I propose to avoid that by using an automatic means to create short or long-term IP address blocks. I do not intend to create a BL that puts social pressure on ISPs, but rather a blacklist that stops spam runs in something approaching real time, by watching for a flood of spam reports from a particular IP address and then blacklisting that IP.
How do you decide which IPs get listed?
The algorithm has yet to be decided; I am collecting data at this time. Probably the simplest approach is a plain vote count: if the number of guilty reports for an IP address exceeds the number of innocent reports by a 100-to-1 margin, that seems to me a reasonable threshold for blacklisting. Other, more sophisticated algorithms may come up later. A minor refinement would allow each user only one vote per IP address. The algorithm would probably be time-based; that is, blacklists would be calculated using only the most recent reports. (A more realistic approach might weight reports by age, with more recent reports weighted more heavily. Another approach might weight reports by how trusted the user is as well as by age, and so on. I suspect that I may publish multiple lookup zones, each corresponding to a different algorithm, and see which works best.)
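Since the algorithm is still undecided, the following is only a rough sketch of the 100-to-1 rule with the one-vote-per-user refinement and a window of recent reports. The Report record, the function name, the 24-hour default window and the min_guilty parameter are all illustrative assumptions, not part of the SRBL itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """One user's verdict on one IP (hypothetical record layout)."""
    user: str
    ip: str
    is_spam: bool
    timestamp: float  # seconds since the epoch


def should_blacklist(reports, ip, now, window=24 * 3600, margin=100, min_guilty=1):
    """Return True if recent guilty votes exceed innocent votes by margin-to-1.

    Only reports no older than `window` seconds are considered, and each
    user gets a single vote: their most recent report for this IP.
    `min_guilty` is an assumed safeguard, not something the FAQ specifies.
    """
    latest = {}
    for r in reports:
        if r.ip != ip or now - r.timestamp > window:
            continue  # wrong IP, or too old to count
        prev = latest.get(r.user)
        if prev is None or r.timestamp > prev.timestamp:
            latest[r.user] = r  # keep only the newest vote per user

    guilty = sum(1 for r in latest.values() if r.is_spam)
    innocent = len(latest) - guilty
    return guilty >= min_guilty and guilty > margin * innocent
```

Note that with no innocent votes at all, this rule lists an IP as soon as min_guilty guilty votes arrive, so a real deployment would probably want a higher minimum than the default of 1.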
What vulnerabilities does this approach have?
Something like the SRBL will have a major weak spot in zombie networks: if a large number of users simultaneously report a proportionally small number of innocent messages, they can fairly easily "outvote" a spam decision by the central server. With the proposed 100-to-1 algorithm, for every 100 spam reports against an IP, all the spammer would have to do is submit 1 non-spam report. I hope to avoid this and similar kinds of abuse by implementing a user "trust" system. The algorithm I have in mind is that new users will start with a "probationary" period in which their IP address reports are compared against other people's reports of the same IP (this requires a fairly mature database). If they strongly agree, they'll gain a trust point. If they strongly disagree, they'll lose a trust point. If neither, their score won't change. Once they have built up enough points, their reports will be trusted. There will probably be automatic mechanisms by which trust can be lost (say, if they consistently submit reports that strongly disagree with the consensus of other users), but I haven't committed to anything yet. That should put significant barriers in the way of people trying to fool the network for spam runs.
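The trust-point idea could be sketched roughly as follows. Nothing here is committed to: the function names and all three constants (what counts as "strong" consensus, how many other reports are needed before comparing, and how many points make a user trusted) are placeholder assumptions chosen only to make the example concrete.

```python
STRONG = 0.9          # hypothetical: share of other reports that counts as strong consensus
MIN_OTHERS = 10       # hypothetical: fewest other reports needed before comparing
TRUST_THRESHOLD = 20  # hypothetical: points needed before a user's reports are trusted


def trust_delta(user_says_spam, others_spam, others_ham):
    """Compare one probationary report with other users' reports of the same IP.

    Returns +1 for strong agreement, -1 for strong disagreement, and 0 when
    the other reports are too few or too divided to show a consensus.
    """
    total = others_spam + others_ham
    if total < MIN_OTHERS:
        return 0  # database not mature enough for this IP
    spam_share = others_spam / total
    if spam_share >= STRONG:        # others strongly say "spam"
        return 1 if user_says_spam else -1
    if spam_share <= 1 - STRONG:    # others strongly say "not spam"
        return -1 if user_says_spam else 1
    return 0                        # no strong consensus either way


def is_trusted(points):
    """A user's reports count once enough trust points have accumulated."""
    return points >= TRUST_THRESHOLD
```

A server would add trust_delta(...) to the user's running total after each probationary report and consult is_trusted(...) before counting that user's votes.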
What measures do you take to avoid false positives (FPs)?
I hope to avoid FPs by carefully choosing the data that is allowed into the network. First, the "trust" mechanism will serve to reduce error by excluding users whose filters strongly disagree with trusted users' filters (which are assumed to be highly accurate). Especially at first, the users submitting data to the database must have a proven accuracy record with their Bayesian filters - 99.5% or higher on a large corpus of data - which will establish a trusted body of data and a trusted pool of users, reducing the impact that new users have on the overall database. Secondly, IP blocks will be dynamic: one algorithm for blocking might consider only reports from the last 24 hours, so a block would remain in place only while the reports inside that window still meet the listing condition, and would lapse once they cycle out of the window.
How can I report IPs?
Currently there's only one way. I have a script for my own setup, http://novylen.net/~rbos/srbl-client.text, that will scan an email, pull out all the IP addresses in the header, and report them to a PHP script that I've set up to insert them into the database. Usernames are implemented, but there is no trust mechanism as yet (the threshold is set to zero and does not yet increment). You should adapt this script to your setup. If you can't, then you probably don't want to be doing this yet anyway. As for the username and password - just pick an arbitrary pair at this point; the script will add it if it doesn't already exist. For the "whitelisting" variable, stick in any mail server that you know is going to be a source of spam, but that you don't necessarily want to be reported. A secondary MX is a good example of this. This should really not be necessary at all, since the algorithms should account for it, but there are situations in which it would be useful. The script is fairly primitive, I admit.
On a more general level, the script expects four parameters in a GET request: ip=, spam=, username=, and password=. The only one that should need explaining is "spam", which is set to 1 if the IP address is being reported for a spam message, and 0 otherwise.
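For anyone adapting the client, here is a minimal sketch of the two steps involved: pulling IPv4 addresses out of a message's headers (skipping whitelisted servers) and building the GET request with the four parameters. It is not the actual client script. The function names are made up for illustration, the base URL passed in is a placeholder (the real address of the PHP script is not given here), and the IPv4 pattern is deliberately naive and may match things like dotted version numbers.

```python
import re
from email import message_from_string
from urllib.parse import urlencode

# Naive dotted-quad pattern; octet range is checked separately below.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def header_ips(raw_email, whitelist=()):
    """Return the distinct IPv4 addresses found in the headers, in order seen.

    Addresses in `whitelist` (e.g. your own secondary MX) are skipped.
    The message body is ignored.
    """
    msg = message_from_string(raw_email)
    found = []
    for _name, value in msg.items():
        for candidate in IPV4.findall(value):
            if any(int(octet) > 255 for octet in candidate.split(".")):
                continue  # not a valid IPv4 address
            if candidate not in whitelist and candidate not in found:
                found.append(candidate)
    return found


def build_report_url(base_url, ip, is_spam, username, password):
    """Build the GET request URL carrying ip=, spam=, username= and password=."""
    query = urlencode({
        "ip": ip,
        "spam": 1 if is_spam else 0,
        "username": username,
        "password": password,
    })
    return f"{base_url}?{query}"
```

Each address returned by header_ips would get its own request, with "spam" set according to the local filter's verdict on that message.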
I have been falsely listed. How can I fix this?
The SRBL maintains an internal whitelist. If you've been mistakenly added, email.
To see the current state of the database, see http://novylen.net/~srbl/dump_contents.php