DEV Community

Douxx

An Attempt to Ban Bad Bots Crawling My Sites

I don't really like bad bots, and by that I mean crawlers that ignore robots.txt. The reason is simple: I don't want my data fed into obscure systems. It's also a matter of principle: if we give you rules, follow them.

Credit where it's due: the idea came from Caolan's website.

The idea is simple: get the bad bots to follow a link they aren't supposed to, then ban them. To do that, I added a robots.txt at the root of my site that explicitly disallows all robots from a specific page (I went with /roboty/, because why not):

```
User-agent: *
Disallow: /roboty/
```
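For illustration, here is roughly what a rule-abiding crawler does with that file before fetching a URL. This is a simplified sketch, not a full parser: the function name `isDisallowedForAll` is made up for this example, and it only understands the `User-agent: *` group with plain prefix `Disallow` rules (real parsers also handle `Allow`, wildcards, and per-bot groups).

```php
<?php
// Hypothetical helper: is $path disallowed for all robots ("User-agent: *")?
// Simplified on purpose: plain prefix matching only.
function isDisallowedForAll(string $robotsTxt, string $path): bool
{
    $inStarGroup = false;
    $prevWasAgent = false;
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            // Consecutive User-agent lines share one group of rules.
            $inStarGroup = ($prevWasAgent && $inStarGroup) || $value === '*';
            $prevWasAgent = true;
            continue;
        }
        $prevWasAgent = false;
        if ($field === 'disallow' && $inStarGroup && $value !== '') {
            if (strncmp($path, $value, strlen($value)) === 0) {
                return true; // path starts with a disallowed prefix
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /roboty/\n";
var_dump(isDisallowedForAll($robots, '/roboty/')); // bool(true)
var_dump(isDisallowedForAll($robots, '/'));        // bool(false)
```

A crawler that runs a check like this never requests anything under /roboty/, which is why a hit on that path is a decent signal that the visitor isn't playing by the rules.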

Then I slipped a link to that page somewhere on the root page.


Since I don't want curious humans getting instantly banned, the page itself just explains what's going on and links to article.php, the actual dangerous script. I gave it that innocuous name to get around possible keyword blacklists that match things like ban or ban-ip. ¯\_(ツ)_/¯

Speaking of the script, here it is:

```php
<?php

$cf_api_token = '...';
$zone_id = '...';
$note = 'Auto banned by dtech/roboty at ' . date("H:i d/m/y");
$ip = $_SERVER['REMOTE_ADDR']; // the visitor's address as seen by this server

// Body of the access rule: block this single IP.
$payload = json_encode([
    'mode' => 'block',
    'configuration' => [
        'target' => 'ip',
        'value' => $ip,
    ],
    'notes' => $note,
]);

$ch = curl_init("https://api.cloudflare.com/client/v4/zones/{$zone_id}/firewall/access_rules/rules");
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $payload,
    CURLOPT_IPRESOLVE      => CURL_IPRESOLVE_V4,
    CURLOPT_HTTPHEADER     => [
        "Authorization: Bearer {$cf_api_token}",
        "Content-Type: application/json",
    ],
]);

$raw = curl_exec($ch); // false when the request itself failed
curl_close($ch);

// Cloudflare replies with a JSON body carrying a boolean "success" field.
$response = is_string($raw) ? json_decode($raw, true) : null;
if (empty($response['success'])) {
    error_log('roboty: ban failed for ' . $ip);
}

header("Location: /?blehhhhh"); // redirect to '/', which should now be blocked
echo "Bye ;)";
exit;
```
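Two things are easy to trip over here: which address actually ends up banned, and who should never be banned. Below is a minimal sketch of how that could be handled. The helper names `visitorIp` and `banPayload` and the allowlist are hypothetical, not part of the script above. It assumes that behind Cloudflare's proxy `REMOTE_ADDR` may be a Cloudflare edge address unless the web server restores the visitor IP, and that IPv6 addresses use the `ip6` target in the access rules API (that is how I read the API reference, so double-check it).

```php
<?php
// Hypothetical helpers; names are illustrative, not part of the original script.

/**
 * Pick the visitor address, preferring Cloudflare's CF-Connecting-IP header.
 * Only trust that header if the origin accepts traffic from Cloudflare alone,
 * otherwise anyone can spoof it.
 */
function visitorIp(array $server): ?string
{
    $candidate = $server['HTTP_CF_CONNECTING_IP'] ?? $server['REMOTE_ADDR'] ?? '';
    // Reject malformed, private and reserved addresses (e.g. 10.x, 127.0.0.1).
    $ip = filter_var(
        $candidate,
        FILTER_VALIDATE_IP,
        FILTER_FLAG_NO_PRIV_RANGE | FILTER_FLAG_NO_RES_RANGE
    );
    return $ip === false ? null : $ip;
}

/** Build the access-rule body, or null when this IP must not be banned. */
function banPayload(?string $ip, array $allowlist, string $note): ?array
{
    if ($ip === null || in_array($ip, $allowlist, true)) {
        return null;
    }
    // Assumption: IPv6 addresses need the 'ip6' target instead of 'ip'.
    $isV6 = filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
    return [
        'mode' => 'block',
        'configuration' => [
            'target' => $isV6 ? 'ip6' : 'ip',
            'value'  => $ip,
        ],
        'notes' => $note,
    ];
}
```

With something like this, the script would call `banPayload(visitorIp($_SERVER), $myOwnIps, $note)` and skip the API request entirely when it gets `null` back, so a misclick from your own address or a request from the local network doesn't lock you out.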

Right now it only bans the bot's IP on douxx.tech (proxied through Cloudflare), but I plan to eventually move it into an internal API that blocks across every domain I own, and maybe throw in some iptables rules too.
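As a rough idea of what the iptables side could look like, here is a sketch that only builds the command string and never runs it. The function name `dropCommand` is hypothetical, and inserting into the `INPUT` chain is an assumption about the firewall layout. Keep in mind that on a Cloudflare-proxied site the server sees Cloudflare's addresses at the network level, so a rule like this only matters for traffic that reaches the server directly.

```php
<?php
// Hypothetical helper: build (not run) a firewall command that drops one IP.
function dropCommand(string $ip): ?string
{
    if (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false) {
        $binary = 'iptables';
    } elseif (filter_var($ip, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false) {
        $binary = 'ip6tables';
    } else {
        return null; // never hand unvalidated input to a shell
    }
    // -I INPUT inserts the rule at the top of the chain so it is hit first.
    return $binary . ' -I INPUT -s ' . escapeshellarg($ip) . ' -j DROP';
}

echo dropCommand('203.0.113.7'), PHP_EOL;
```

Validating the address first and quoting it with `escapeshellarg` keeps a crafted header value from turning into shell injection if the command is later passed to something like `exec`.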

So yeah, I'll keep it running for a bit and see how many IPs we get.

For the record, the first one to be banned is an IP from Tencent datacenters 🤡

[Screenshots: the Tencent IP ban entry and the IP info lookup]
