Blocking Referer Spam

If you run an Internet-facing website these days, you've almost certainly been hit with referer spam. Dealing with spam is always annoying, but this page will hopefully simplify the process and help you block most of it. I will introduce an idea and implementation using mod_security that I have not yet seen anywhere else.

Since this site focuses on Linux, this guide will be geared towards Apache. Similar implementations may be available on other platforms as well, though — for example, mod_security can be deployed as a platform-independent reverse proxy, and mod_rewrite has an equivalent on IIS called ISAPI_Rewrite.

What is referer spam?
Simple ways of dealing with referer spam
1. mod_rewrite
2. mod_setenvif
Rolling in the big guns: mod_security
Final remarks

What is referer spam?

This page is not intended as an introduction to referer spam, so I have kept this section as brief as possible. For more detail, I suggest Googling referer spam. There are tons of great articles out there describing what referer spam is. If you don't even know what a referer is, I suggest its Wikipedia article.

Basically, referer spamming is the act of making a request to a website (such as the one you're reading) and including a faked HTTP Referer field that contains the URL of a spam site. Usually this is done dozens or hundreds of times. In my experience, the onslaughts come in waves, and can be as heavy as several requests per second, enough to cause a denial-of-service on a low-bandwidth pipe.

The HTTP referer (and thus, any referer spam) is almost always saved in the webserver logs to help in generating statistics and such. Too often, sites will post statistics, including a list of top referers for that month, in a publicly accessible area of the site. (For example, search "usage statistics for" "total referrers" webalizer on Google.) It is generally believed that the primary motivation of referer spammers is to get their PageRanks up by spamming innocent (but poorly-configured) sites and causing them to link to the spammers.

An interesting bit: "referer" is a misspelling from the HTTP/1.0 RFC. The misspelling was caught by the time of the HTTP/1.1 RFC, but was left unchanged.

Simple ways of dealing with referer spam

The two simplest ways of dealing with referer spam are mentioned in various blogs and postings around the web, and both take less than five minutes to set up on an existing Apache installation. Both modules are usually included with the Apache distribution and enabled by default. They are equally flexible and powerful for the purposes of blocking referer spam, though mod_rewrite may be a bit more difficult to pick up at first.

mod_rewrite

With mod_rewrite, you can match requests against a set of conditions and then rewrite, redirect, or reject the ones that match. The following conditions perform regex matching on the referer of each incoming request:

RewriteEngine on
RewriteCond %{HTTP_REFERER} "^http://(www\.)?cheapmortgage\.com/" [NC,OR]
RewriteCond %{HTTP_REFERER} "viagra|cialis|phentermine" [NC]

You can tell mod_rewrite to return a 403 Forbidden to matching requests by putting this rule after the conditions above:

RewriteRule .* - [F,L,E=spam:refspam]

Quick reference: NC = ignore case, OR = match this condition OR the next one (you will want to have OR in every RewriteCond except the last one), F = forbidden, L = last rule, E = set environment variable. See the mod_rewrite documentation for more details.

You could also return a page explaining why the request was blocked (in case legitimate users trigger a match) by using this RewriteRule instead:

RewriteRule .* /refspam.html [L,E=spam:refspam]

Or, for fun, you could redirect them to the site they're trying to advertise, sucking up the bandwidth of the spammer's own server:

RewriteRule .* %{HTTP_REFERER} [L,E=spam:refspam]

One of the annoying things about referer spam is that it clutters up your server's log files with bogus requests. Luckily, Apache lets you separate out your log files based on environment variables. Since we set the spam environment variable for requests that match the above rules, we can log requests that look like spam into spam_log and everything else into access_log:

CustomLog "logs/spam_log" combined env=spam
CustomLog "logs/access_log" combined env=!spam
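
Putting the pieces of this section together, a minimal sketch of a complete configuration might look like the following. It assumes the rules live in the main server config or a VirtualHost block, and that the stock "combined" LogFormat from the default httpd.conf is defined. The patterns are the same placeholders used above, not a real blocklist.

```apache
# Sketch only: place in the main server config or a <VirtualHost> block.
RewriteEngine on

# Any one matching condition is enough (note the OR on all but the last).
RewriteCond %{HTTP_REFERER} "^http://(www\.)?cheapmortgage\.com/" [NC,OR]
RewriteCond %{HTTP_REFERER} "viagra|cialis|phentermine" [NC]

# Answer with 403 Forbidden and tag the request for logging.
RewriteRule .* - [F,L,E=spam:refspam]

# Tagged requests go to spam_log, everything else to access_log.
CustomLog "logs/spam_log" combined env=spam
CustomLog "logs/access_log" combined env=!spam
```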

mod_setenvif

The idea here is to check to see if certain conditions match, and if so, set an environment variable which will allow you to handle requests based on the variable. First, here is how to block requests that look like referer spam:

# SetEnvIfNoCase is case-insensitive but otherwise identical to SetEnvIf
SetEnvIfNoCase Referer "^http://(www\.)?cheapmortgage\.com/" spam=refspam
SetEnvIfNoCase Referer "viagra|cialis|phentermine" spam=refspam
# Deny tests whether the variable is set, using the env=NAME form
Deny from env=spam

The basic idea is that the list of rules will execute, and if any of them match, the spam variable will be set. The Deny rule will then return a 403 to any matching requests.
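
For context, here is a sketch of how the access-control part might sit inside a directory block. The path /var/www/html is an assumption, so substitute your own document root. The first form uses the Apache 2.2 Order/Allow/Deny directives, where the environment test is written env=spam. The commented-out second form is the Apache 2.4 equivalent using mod_authz_core. Use only the one that matches your Apache version.

```apache
# Apache 2.2 style: deny when the "spam" variable is set.
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=spam
</Directory>

# Apache 2.4 style (mod_authz_core): same effect. Do not combine with the
# block above; pick whichever matches your server version.
#<Directory "/var/www/html">
#    <RequireAll>
#        Require all granted
#        Require not env spam
#    </RequireAll>
#</Directory>
```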

You can also use mod_rewrite in conjunction with mod_setenvif. For example, you can redirect based on whether the spam environment variable is set to refspam (this does the same thing as the last part of the mod_rewrite section):

RewriteEngine on
RewriteCond %{ENV:spam} =refspam
RewriteRule .* %{HTTP_REFERER} [L]

Just like in the previous section, we can separate the log files:

CustomLog "logs/spam_log" combined env=spam
CustomLog "logs/access_log" combined env=!spam

Rolling in the big guns: mod_security

The solutions described above are easy to set up and can be very effective, but they have a major weakness: they inspect each request individually, without the context of other recent requests. There are tons of Google results for using mod_security to block referer spam, but 90% of the dozens of guides, blog posts, etc. that I read use mod_security in much the same way as mod_rewrite or mod_setenvif as described above, that is, blocking individual requests using simple substring/regex matching. The other 10% are slightly more helpful, like this one, which suggests scanning POST payloads for comment spam, and even individual arguments in HTTP requests. (Be advised that the previous link, as well as almost all of the top Google search results like the ones I've linked to, show syntax for mod_security versions before 2 and will not work with current versions.)

I will show you a new way of dealing with referer spam using mod_security.

First, you need to install it. In CentOS and Fedora, you can get it as an Apache module just by doing # yum install mod_security and restarting Apache. For other distros, consult Google or the documentation. This guide is for mod_security version 2 or higher.

Dynamically blocking IPs

If one IP sends a bunch of spammy requests within a short period of time, then it may be a good idea to block that IP. This is surprisingly effective, at least for me: over a 14-day period during which I collected data, I received 8216 spam hits, but they came from only 249 IP addresses, and of those, 80.0% of the hits were accounted for by just 20 IPs. I have also found that blog comment spammers often have the same IP addresses as referer spammers. I guess zombie machines often do double duty.

Anyways, the following example code (for httpd.conf) shows how to block IPs dynamically:

SecAction phase:1,initcol:IP=%{REMOTE_ADDR},deprecatevar:IP.spam=3/86400,nolog
SecRule IP:spam "@gt 15" phase:1,setvar:IP.spam=+1,drop,setenv:spam=spam
SecDefaultAction phase:1,setvar:IP.spam=+3,setenv:spam=refspam,deny
SecRule REQUEST_HEADERS:Referer "^http://(www\.)?cheapmortgage\.com/"
SecRule REQUEST_HEADERS:Referer "viagra|cialis|phentermine"

The first line tells mod_security to initialize a collection (a collection maps things, in this case IP addresses, to nonnegative numbers that are initialized to zero), using the client's IP address as the key for this request. The spam variable of this collection is to be decremented by 3 every day (86400 seconds) but, by definition, cannot drop below 0. Also, do not log this SecAction since it runs on every HTTP request and would be uninteresting log material.
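
One prerequisite worth checking: persistent collections such as the IP collection need somewhere to live on disk, which mod_security 2.x takes from the SecDataDir directive, and the rule engine has to be switched on. Here is a minimal sketch, assuming a hypothetical directory /var/lib/mod_security that is writable by the Apache user. Packaged installs often set both of these already, so look before adding them.

```apache
# Enable rule processing (packaged configs may already do this).
SecRuleEngine On

# Directory for persistent collection data; it must be writable by the
# Apache user. The path is a placeholder, so use whatever suits your distro.
SecDataDir /var/lib/mod_security
```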

The second line sets a rule that if the spam counter for this IP is greater than 15, then add 1 to the counter, drop the connection (this breaks the TCP connection and doesn't even bother to return an HTTP 403), and set the spam environment variable to spam (rather than refspam, since the request was blocked by IP and its referer was never checked).

The third line sets the default action that any rules after this point are to take. Notice that lines 4-5 have a condition (a regular expression match) but no defined action. This is because they will execute the default action. In this example, the default action is to increase the spam counter for this IP by 3, set the spam environment variable, and deny the request with a 403.

A summary of the above rules is: Every request that matches the regular expressions causes that client's IP to get 3 points. When the IP has >15 points, any further requests are automatically denied without checking the regexes, and the IP gets 1 more point per request. The accumulated points drop by 3 per day.
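
One difference from the earlier sections is worth noting: the mod_rewrite and mod_setenvif examples match case-insensitively (NC and SetEnvIfNoCase), while mod_security's regex operator is case-sensitive unless told otherwise. Here is a sketch of two ways to get the same behavior, assuming the SecDefaultAction from the example above is still in effect so that these rules inherit its phase and actions:

```apache
# Option 1: lowercase the header before matching (t:lowercase transformation).
SecRule REQUEST_HEADERS:Referer "viagra|cialis|phentermine" "t:lowercase"

# Option 2: use the PCRE inline flag for case-insensitive matching.
SecRule REQUEST_HEADERS:Referer "(?i)^http://(www\.)?cheapmortgage\.com/"
```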

Because mod_security can read and write Apache environment variables, it can communicate with other Apache modules, like the logger. Also, given the rules above, you can set certain regexes to be worth more points. For example, if a certain referer regex can't possibly result in a false positive, you may want to assign, say, 50 points to any IP who sends it. Here is a more complete set of rules that does something along those lines. You can put this in httpd.conf itself or drop it in the conf.d directory:

# Note: you can use these rules, but make sure you tune the regexes as they 
# have been simplified here for clarity. You're welcome to contact me if you 
# want the rules I actually use. If you redistribute my stuff, please give 
# credit to this site.

# Clear the default action from whatever may have been set in a previous
# part of httpd.conf
SecDefaultAction phase:1,pass

# Initialize collection and deprecate by 3 points per day (86400 seconds)
SecAction phase:1,initcol:IP=%{REMOTE_ADDR},deprecatevar:IP.spam=3/86400,nolog

# If there are already >15 spam points for this IP, then drop 
# the connection and add 1 point (instead of 3, as below).
SecRule IP:spam "@gt 15" phase:1,setvar:IP.spam=+1,drop,setenv:spam=spam

# There are no ASP files on my site, so nobody can legitimately be 
# referred by one. This is spam without a doubt, though not referer spam --
# these requests are usually made by blog comment spammers.
# Give this IP +50 points for being a definitely faked request.
SecRule REQUEST_HEADERS:Referer "^http://(www\.)?example\.com/.*\.asp" \
	phase:1,setvar:IP.spam=+50,setenv:spam=self,deny

# Let's say I have manually inspected this site and it is definitely spam,
# then I can ban anybody who claims to be referred by it:
SecRule REQUEST_HEADERS:Referer "^http://texasholdem\.example\.com/" \
	phase:1,setvar:IP.spam=+50,setenv:spam=refspam,deny

# Set the default action for the list of rules after this.
SecDefaultAction phase:1,setvar:IP.spam=+3,setenv:spam=refspam,deny

# I have included only two simple examples here. You should use more regexes, 
# both in quality and in quantity. This section is where the bulk of the 
# rules go.
SecRule REQUEST_HEADERS:Referer \
	"google\.(com|de)/group/[^./]*(poker|insur|payday)[^./]*/web/"
SecRule REQUEST_HEADERS:Referer \
	"(poker|insur|payday)[a-z0-9]*\.com/(item)?[0-9]+(\.html|\.php|/)$"

# Clear the default action for any mod_security rules later in httpd.conf.
SecDefaultAction phase:1,pass

Separating log files

Since the rules above set the spam environment variable for spammy requests, we can separate out the Apache logs as described in the mod_setenvif section:

CustomLog "logs/spam_log" combined env=spam
CustomLog "logs/access_log" combined env=!spam

Caveats

There are some things to be careful of when using mod_security:

First, mod_security comes with a set of default rules that, when you first install it, will probably break parts of your site. For example, the default installation on CentOS and Fedora blocks Apache from sending a directory listing (which Apache tries to do by default in directories without an index.html). The configuration files on CentOS/Fedora are in /etc/httpd/modsecurity.d/. You may have to manually comment out some rules.
Next, know that SecRule and SecAction will perform the action you explicitly put on that line and whatever actions were defined in the last SecDefaultAction. For example, consider this code:
```
SecDefaultAction phase:1,deny
SecRule REQUEST_HEADERS:Referer \
	"google\.(com|de)/group/[^./]*(poker|insur|payday)[^./]*/web/"
SecRule REQUEST_HEADERS:Referer \
	"(poker|insur|payday)[a-z0-9]*\.com/(item)?[0-9]+(\.html|\.php|/)$"
# ...more SecRules...
# set environment variable for localhost, maybe to log hits separately
SecRule REMOTE_ADDR "^127\.0\.0\.1$" setenv:localuser=yes
```
This is incorrect: it will set the environment variable when localhost visits, but will also deny the request. You must clear the SecDefaultAction, for example by setting SecDefaultAction phase:1,pass.
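
Besides resetting SecDefaultAction, another way to keep a single rule from inheriting a disruptive default is to give that rule its own disruptive action. Here is a sketch using pass, which lets the request continue after the rule's other actions have run:

```apache
# Explicit "pass" replaces the inherited "deny" for this rule only;
# "nolog" keeps these routine hits out of the logs.
SecRule REMOTE_ADDR "^127\.0\.0\.1$" "phase:1,pass,nolog,setenv:localuser=yes"
```

Non-disruptive default actions (such as a setvar) may still be inherited by such a rule, so resetting the default remains the safer choice when the default carries side effects.
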
Remember that httpd.conf does not execute linearly. For example, consider the following code:
```
SecRule REQUEST_HEADERS:Referer "viagra" setenv:flag=yes
SetEnvIf flag "yes" spam=refspam
```
It seems that if the referer contains "viagra" then the variable flag gets set, and the next line checks whether flag is set and sets the spam variable if so. (This example is hypothetical, but similar situations have come up in my actual configuration.) However, what actually happens is that all the mod_setenvif directives are run before any of the mod_security directives, so the spam variable never gets set (because when SetEnvIf runs, flag is not set yet). Something to watch out for.
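
One way to sidestep this ordering problem is to avoid handing values from mod_security to mod_setenvif at all, and have the mod_security rule set the final variable itself. Here is a sketch using the same hypothetical pattern:

```apache
# Set the final variable directly instead of relaying it through SetEnvIf.
SecRule REQUEST_HEADERS:Referer "viagra" "phase:1,pass,nolog,setenv:spam=refspam"

# CustomLog evaluates env= at logging time, after mod_security has run,
# so the variable is visible here.
CustomLog "logs/spam_log" combined env=spam
```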

The documentation is pretty crappy and the syntax makes obfuscated C look like a work of art. It's definitely a powerful tool and I highly recommend it, but caveat emptor. I still have no idea what the phase:1 in some of my actions does...

Final remarks

I hope I've been helpful. Referer spam won't die out until it is no longer economically feasible, but unfortunately, with so many servers publishing referer stats (often without their operators even realizing it), this is unlikely in the near future. In the meantime, good luck!

If you have any questions or comments, don't hesitate to contact me.