The importance of obscuring email addresses on websites

mailto:email@address.com links in sites are pretty commonplace, but having the address sitting in the source code of the site means that email is going to get so much extra spam.

This isn't a new topic btw, it's been around since the dawn of time (internet time, that is).

"Why will this lead to spam?" Robots, that's why.

Robots, as we call them, or crawlers, or harvesters, or just scripts, can be coded to search around the internet and gather email addresses from sites.

Sometimes it's just for the sake of gathering random emails, while other times they can be looking for sites in specific industries etc.

You might be thinking this doesn't happen much, but it's likely happening right now on your site... and again... and again. Many people are running these automated crawlers. Why? Because it's easy to do and they can profit from it.

"How easy is it?"

Here's a quick bash script I wrote in about 5 minutes, and it's far from perfect. It searches google.co.uk for "web designers" using curl, asking for the first 100 results. Then it uses a combination of grep and awk to find all the sites returned, and xargs to run curl again against each site, with grep looking for any email addresses. At the end it prints out all the email addresses it found.

Go ahead, try this in your terminal now (tested on Mac and Linux/Ubuntu):

# Search, pull the result URLs out of the page, fetch each site, grep for addresses.
curl -s "https://www.google.co.uk/search?q=web+designers&num=100" --user-agent "Chrome" \
    | grep -oE '<a href="/url\?q=https?://[^/"&]+/' \
    | awk -F'=|&' '{print $3}' \
    | xargs -L 1 bash -c '\
        curl -s "$0" --user-agent "Chrome" \
        | grep -oE "\b[A-Za-z0-9._%+-]+(@|\[at\])[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" ' \
    | sort -u

The second-to-last line, the regex that matches the email addresses, comes from Shell Hacks btw.
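To see what that pattern picks up without requesting anyone's site, here's a minimal sketch that runs the same grep over a made-up snippet of HTML. The sample markup, the addresses and the find_emails helper name are all placeholders of mine, not part of the original script.

```shell
# Hypothetical sample page; the addresses are made up for illustration.
sample_html='<p>Contact <a href="mailto:hello@example.com">hello@example.com</a>
or sales[at]example.org for a quote.</p>'

# Same pattern as the harvesting script above: matches "@" and "[at]" forms.
email_regex='\b[A-Za-z0-9._%+-]+(@|\[at\])[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b'

find_emails() {
    # Reads HTML on stdin, prints each distinct address-like match on its own line.
    grep -oE "$email_regex" | sort -u
}

printf '%s\n' "$sample_html" | find_emails
```

Both the mailto: link and the "[at]" spelling get caught, which is why swapping @ for [at] doesn't help against this particular pattern.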

For ease of copying here it is as a single line:

curl -s "https://www.google.co.uk/search?q=web+designers&num=100" --user-agent "Chrome" | grep -oE '<a href="/url\?q=https?://[^/"&]+/' | awk -F'=|&' '{print $3}' | xargs -L 1 bash -c 'curl -s "$0" --user-agent "Chrome" | grep -oE "\b[A-Za-z0-9._%+-]+(@|\[at\])[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" ' | sort -u

As I said, this is far from perfect, but it shows how quickly you can knock up a script to do this type of thing.

When I wrote this, it gave me 54 email addresses.

Many crawlers out there are more thorough in the way they request sites, crawl their pages and search for email addresses. The above script may not work in a month or so, but that's fine, as I wrote it to prove it works now.

"Ok, I get it, how do I fix this?"

If you want to spare your clients' inboxes a bit of spam, then you're going to want to make the job a little harder for these robots.

Email obfuscation is a technique for encoding the email address so it isn't shown in plain text in the source. Instead it's either written as HTML entities, which the browser decodes, or added to the page with JavaScript.
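As a rough illustration of the HTML entities approach, here's a small shell sketch that turns every character of an address into a decimal entity. The encode_entities name is my own invention, and it assumes a plain ASCII address; treat it as a demonstration of the idea rather than a finished tool.

```shell
# encode_entities is a hypothetical helper, not part of any tool mentioned here.
# It runs in a subshell (parentheses body) so its variables don't leak out.
encode_entities() (
    rest=$1
    out=''
    while [ -n "$rest" ]; do
        char=${rest%"${rest#?}"}   # first character of what is left
        rest=${rest#?}             # drop that character
        # printf with a leading quote prints the character's numeric code,
        # so "@" becomes 64 and ends up as "&#64;". Assumes ASCII input.
        out="$out&#$(printf '%d' "'$char");"
    done
    printf '%s\n' "$out"
)

encoded=$(encode_entities 'email@address.com')
printf '<a href="mailto:%s">%s</a>\n' "$encoded" "$encoded"
```

The browser decodes the entities, so visitors still see and click a normal address, while a pattern looking for a literal @ in the source finds nothing. A crawler that decodes entities before searching would still get through, so this only raises the bar.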

Some systems may come with simple ways to obscure email addresses.

Working in WordPress?

You can use <?php echo antispambot('email@address.com'); ?>

Ref: https://codex.wordpress.org/Function_Reference/antispambot

Working in REC?

You can use {{ "email@address.com" | safe_email }}

Other systems often have similar techniques or plugins to help.

In fact some CDN services such as Cloudflare offer this as a setting enabled by default.

But if you're looking for a manual way to do this, there are plenty of tools out there to help.

Here's a few that seem good: