A Simple Matter of Programming…
January 18th, 2006
I have been scouring the internet for code that extracts URLs from e-mail for the purposes of spam filtering. There’s not much out there that isn’t proprietary, written in Perl, or defeated by relatively simple obfuscations. I’ve been thinking of writing my own, but one look at what I’m up against is almost enough to make me scream, run away, and “just hit delete” for a while…
A simple scanner can be built around a regexp, or around strstr(foo, "http://"), gathering up all valid hostname characters after each occurrence. This will take care of plain-text e-mail and HTML with minimal obfuscation.
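Here is a rough sketch of that first-pass scanner, in Python rather than C for brevity. The function name scan_hosts and the choice of "valid hostname characters" (letters, digits, hyphen, dot) are my own assumptions, and it only looks for http://, just as described above.

```python
import re

# Assumed set of hostname characters: letters, digits, hyphen, and dot.
URL_HOST = re.compile(r"http://([A-Za-z0-9.-]+)", re.IGNORECASE)


def scan_hosts(text):
    """Return the hostname that follows each "http://" in text."""
    hosts = []
    for match in URL_HOST.finditer(text):
        # Trailing dots or hyphens are usually sentence punctuation, not hostname.
        host = match.group(1).rstrip(".-").lower()
        if host:
            hosts.append(host)
    return hosts
```

Calling scan_hosts("Order at http://www.spamdomain.biz/buy now") picks up www.spamdomain.biz and stops at the first character that can't be part of a hostname, which is exactly the weakness the HTML tricks exploit.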
Add in some header parsing logic and Base64 and quoted-printable decoders, and you can get past another pair of simple obfuscations. Now you’ve pretty much handled all plain-text cases where a URL will be rendered “clickable.” Further obfuscations, like “type www DOT spamdomain.biz into your web browser to order” or including the URL as an inline image, are a bit harder to deal with, but they also make the advertising less effective.
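The header parsing and decoding step might look something like this, leaning on Python's standard email package instead of hand-written decoders. The function name decoded_text_parts is made up for illustration, and the fallback to us-ascii when no charset is declared is my assumption.

```python
import email


def decoded_text_parts(raw_message):
    """Yield (transfer_encoding, text) for each text/* part of a raw message."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        encoding = (part.get("Content-Transfer-Encoding") or "7bit").strip().lower()
        # decode=True undoes Base64 or quoted-printable according to the header.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        yield encoding, text
```

Returning the transfer encoding alongside the text costs nothing and keeps a record of which decoder was needed for each part.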
Now we can turn our attention to the wonderful world of HTML URL obfuscation. Putting the URL into an <a href=> tag hides it from the human, and feeds it to a rather tolerant HTML parser. If the URL scanner isn’t as tolerant as the HTML parser, links like this:
<a href="http://skvalwekvasde> lkaaewlkvaweses.spamdomain.info">
will get past. (For good measure, this one landed in my inbox with the domain name broken up by a quoted-printable newline.) Throw in some %XX character encodings and you can see what a URL scanner is up against. HTML mail is evil.
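One way to be about as tolerant as the HTML parser is to let a real HTML parser pull out the attribute value and only then look for hostnames. This sketch assumes the link sits in an ordinary href attribute; the names HrefCollector and href_host_candidates are hypothetical. It makes no attempt to model how any particular mail client resolves a mangled link: it just reports every dotted, hostname-like token it finds in the attribute after %XX decoding.

```python
import re
from html.parser import HTMLParser
from urllib.parse import unquote

# A run of hostname characters containing at least one dot.
DOTTED_NAME = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


class HrefCollector(HTMLParser):
    """Collect the raw href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def href_host_candidates(html):
    """Return hostname-like tokens found in each href, after %XX decoding."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    found = []
    for href in collector.hrefs:
        decoded = unquote(href)
        found.extend(name.lower() for name in DOTTED_NAME.findall(decoded))
    return found
```

Fed the link above, this reports lkaaewlkvaweses.spamdomain.info, where a scanner that stops at the first non-hostname character after http:// would see only skvalwekvasde.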
Now that I’ve laid it out this way, it doesn’t seem quite as bad as it did at first… I may have to give this a shot.
One thing I would like the code to do is keep track of any obfuscations it undoes and report them along with the URL. It’s nice to know that the message contains a URL in the spammer.biz domain, but the fact that it was rendered as %53pammer%2Ebiz in a Base64-encoded us-ascii message might also be worth knowing. I haven’t seen any code that does this yet; pretty much everything I see is just there for the URL or the URL’s domain…
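A minimal sketch of what that report could look like: each decoding step that actually changes something appends a label to a list carried along with the hostname. The Finding record, the decode_host function, and the label strings are all invented for illustration, and only the %XX layer is handled here; the caller passes in any outer layers (such as Base64) it already removed.

```python
from dataclasses import dataclass, field
from urllib.parse import unquote


@dataclass
class Finding:
    """A hostname plus the obfuscation layers removed to reach it."""
    host: str
    obfuscations: list = field(default_factory=list)


def decode_host(raw_host, outer_layers=()):
    """Percent-decode raw_host and record the layer if it changed anything."""
    layers = list(outer_layers)
    decoded = unquote(raw_host)
    if decoded != raw_host:
        layers.append("percent-encoding")
    return Finding(host=decoded.lower(), obfuscations=layers)
```

For the example above, decode_host("%53pammer%2Ebiz", outer_layers=["base64"]) gives a Finding with host spammer.biz and obfuscations ["base64", "percent-encoding"], so a filter can score the tricks as well as the domain.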