robots.txt scanner proof of concept
Prompted by the recent scandal in which American Express listed its publicly reachable admin debug panel in its robots.txt file, here’s a sloppy proof of concept that can be used to find similar security issues.
Remember:
- robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
- the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.
http://www.robotstxt.org/robotstxt.html
Either way, a location such as /us/admin/ would be in any wordlist of interesting locations, so you should always require authorization for the sensitive parts of your system.
<?php
/**
* Quick hack to determine HTTP status codes of locations listed in a host's
* robots.txt file.
* Author: Niklas Femerstrand, qnrq.se, 2011
* License: http://sam.zoy.org/wtfpl/
*/
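// Determine the HTTP status line of $file on $host
function httpstatus($host, $file) {
    $fp = @fsockopen($host, 80, $errno, $errstr, 30);
    if (!$fp)
        return "connection failed: $errstr ($errno)";
    // Paths taken from robots.txt already start with "/", so avoid sending "//path"
    $out = "GET /" . ltrim($file, "/") . " HTTP/1.1\r\n".
        "Host: $host\r\n".
        "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $response = fgets($fp);
    fclose($fp);
    return rtrim((string)$response);
}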
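if (empty($argv[1]))
    die("usage: $ php robotscan.php <host>\n eg: $ php robotscan.php www.google.com\n");
// Strip any scheme and trailing slash so $host is a bare host name for fsockopen()
$host = preg_replace('#^https?://|/+$#i', "", $argv[1]);
$target_url = "http://{$host}/robots.txt";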
$filters = array("/[ ]?Allow:[ ]?/",
"/[ ]?Disallow:[ ]?/",
"/[ ]?Sitemap:.*/",
"/[ ]?Request-rate:.*/",
"/[ ]?Crawl-delay:.*/",
"/[ ]?Visit-time:.*/",
"/.*[$*].*/",
"/User-agent: .*/",
"/#+.*/");
echo "[+] Getting robots.txt content\n";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$robots = curl_exec($ch);
curl_close($ch);
if (!$robots)
{
    $httpStatus = httpstatus($host, "robots.txt");
    if (preg_match("/200 OK/", $httpStatus))
        die("[-] robots.txt exists but seems empty\n");
    else
        die("[-] {$httpStatus}\n");
}
elseif (preg_match("/<[^>]+>/", $robots))
    die("[-] robots.txt is incorrectly formatted (html?)\n");
else
{
    $robots = preg_replace($filters, "", $robots);
    // Trim each remaining line and drop the empty ones left behind by the filters
    $arr = array_filter(array_map("trim", explode("\n", $robots)), "strlen");
    foreach ($arr as $loc)
        printf("[+] %s: %s\n", "{$host}{$loc}", httpstatus($host, $loc));
}
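
The script above gets its list of locations by deleting everything that is not a path. As a rough alternative sketch, the paths could be pulled out of the Allow/Disallow lines directly. The function name extract_robots_paths() is hypothetical and not part of the original script; it assumes that only paths starting with "/" are interesting and, like the filter list above, skips wildcard rules.

```php
<?php
// Hypothetical helper (not in the original script): collect the unique
// literal paths named by Allow/Disallow rules in a robots.txt body.
function extract_robots_paths(string $robots): array
{
    $paths = [];
    foreach (preg_split('/\r\n|\r|\n/', $robots) as $line) {
        // Drop trailing comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        // Match "Allow: /path" or "Disallow: /path", case-insensitively.
        if (!preg_match('~^(?:dis)?allow\s*:\s*(/\S*)~i', $line, $m)) {
            continue;
        }
        // Skip wildcard patterns, which are not literal locations.
        if (strpbrk($m[1], '*$') !== false) {
            continue;
        }
        $paths[$m[1]] = true; // array keys deduplicate repeated paths
    }
    return array_keys($paths);
}
```

Each returned path could then be passed to httpstatus() in the same way as the entries of $arr.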


October 12th, 2011 at 7:12 pm
It is much easier:
google: “robots filetype:txt admin”
:D
Ionizer