View Full Version : ANSWERED! --> Firefox Extension requests robots.txt: makes spider detection hard!



JAB Creations
01-29-2008, 11:13 PM
There is an extension for Firefox and other Gecko-based browsers that requests robots.txt. I do not remember the name of the extension offhand, but I remember trying, unsuccessfully, to contact its creator. This is very obnoxious because it makes it difficult to detect new spiders: I have to manually delete the line from the access log, delete the script's log, run the script again, and repeat for every single request! To work around this I want all requests from user agents containing the string 'Gecko' to receive an HTTP 403 error code when requesting the robots.txt file (since that will not be counted as a successful request for the file).

I do not have the ability to execute PHP in .txt files, nor do I have access to httpd.conf to allow this. So...

1.) How do we detect 'Gecko'?
2.) How do we forbid 'Gecko' from accessing robots.txt?

I've been searching, and this is my current best guess, though it generates a server error (Apache 1.3.39).

Here are some examples that I've tried, but none of them work! I also don't understand Apache's syntax very well, as their documentation could use more examples...lots more. :rolleyes:


RewriteCond %{HTTP_USER_AGENT} !Gecko [NC]
RewriteRule !^(robots\.txt) - [F]

This works too well; can anyone reform it so the rule applies only to the robots.txt file?

SetEnvIf User-Agent "Gecko" Gecko
<Files /error/error-403.php>
order allow,deny
allow from all
</Files>
deny from env=Gecko
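
For reference, a minimal mod_rewrite sketch of what the attempts above seem to be aiming for. This assumes mod_rewrite is enabled and usable from .htaccess; note the condition is not negated (unlike the attempt above), so only 'Gecko' user agents are forbidden, and only for robots.txt:

RewriteEngine On
# Match user agents containing "Gecko" (case-insensitive)...
RewriteCond %{HTTP_USER_AGENT} Gecko [NC]
# ...and forbid them from requesting robots.txt only; "-" means no substitution.
RewriteRule ^robots\.txt$ - [F]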

JAB Creations
01-29-2008, 11:48 PM
Working Answer: I was unaware that PHP execution could be enabled by modifying .htaccess. So there are three simple steps to do this correctly...

Step One
.htaccess
First you must allow PHP to execute on files with the txt extension...

AddType application/x-httpd-php .txt

Do not use the following, however (with the handler set specifically to PHP 5, I encountered problems on Apache 1.3.39 running PHP 5.2.4)...

AddType application/x-httpd-php5 .txt
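
One caveat the step above glosses over: AddType applied in .htaccess makes every .txt file in that directory tree run through PHP. Assuming your host permits <Files> sections in .htaccess (AllowOverride including FileInfo), the handler can be scoped to robots.txt alone, a minimal sketch:

# Assumption: <Files> sections are permitted in this .htaccess.
<Files "robots.txt">
    AddType application/x-httpd-php .txt
</Files>

With this scoping, other .txt files on the site keep their normal plain-text handling.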

Step Two
robots.txt
You must ensure that the media type (MIME type) stays the same as it was; otherwise it will be changed (and Firefox will ask you to save the file instead of simply displaying it). Since the file is now parsed as PHP, you must insert the following code at the very top of the file, without any leading whitespace...

<?php
header("Content-type: text/plain");
?>

Step Three
robots.txt
Now it's time to have PHP define Gecko requests to the file as HTTP code 403 (Forbidden)...

<?php
header("Content-Type: text/plain");
$useragent = $_SERVER['HTTP_USER_AGENT'];
// Forbid browser requests: Gecko-based browsers and Internet Explorer.
if (preg_match("/Gecko|MSIE/", $useragent)) {
    header('HTTP/1.0 403 Forbidden');
    die('Error 403: This file is forbidden for browser access.');
}
?>

You can remove the die() call if you still want the contents of the file to be displayed (for example, if you are manually checking it online yourself).

- John