robots.txt scanner proof of concept

In light of the recent American Express scandal, where their publicly reachable admin debug panel was listed in their robots.txt file, here's a sloppy proof of concept that can be used to find similar security issues.

Remember:

  • robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
  • the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.

http://www.robotstxt.org/robotstxt.html
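
To illustrate the second point: a robots.txt that gives away a sensitive location can be as short as this (the paths are made up for illustration):

User-agent: *
Disallow: /us/admin/
Disallow: /backup/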

Either way, a location such as /us/admin/ would be in any wordlist of interesting locations, and you should always protect sensitive parts of your system with authorization requirements, as sketched below.
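
Here's a minimal sketch of that, assuming HTTP Basic authentication is acceptable; the username, password and realm are made up, and a real system should of course check a proper user store with hashed passwords:

<?php
// Hypothetical Basic auth gate for a sensitive location, illustration only.
if (!isset($_SERVER['PHP_AUTH_USER']) ||
    $_SERVER['PHP_AUTH_USER'] !== "admin" ||  // hypothetical username
    $_SERVER['PHP_AUTH_PW'] !== "s3cr3t")     // hypothetical password
{
	header('WWW-Authenticate: Basic realm="Admin area"');
	header('HTTP/1.1 401 Unauthorized');
	die("Authorization required\n");
}
// ...the sensitive admin functionality would follow here...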

<?php
/**
 * Quick hack to determine HTTP status codes of locations listed in a host's
 * robots.txt file.
 * Author: Niklas Femerstrand, qnrq.se, 2011
 * License: http://sam.zoy.org/wtfpl/
 */

// Determine the HTTP status line of $file on $host
function httpstatus($host, $file) {
   $file = "/" . ltrim($file, "/"); // normalize to exactly one leading slash
   $fp = fsockopen($host, 80, $errno, $errstr, 30);
   if (!$fp)
      return "connection failed: $errstr ($errno)";
   $out = "GET $file HTTP/1.1\r\n".
          "Host: $host\r\n".
          "Connection: Close\r\n\r\n";
   fwrite($fp, $out);
   $response = fgets($fp); // first line of the reply, e.g. "HTTP/1.1 200 OK"
   fclose($fp);
   return chop($response);
}

if (!isset($argv[1]))
	die("usage: $ php robotscan.php <host>\n   eg: $ php robotscan.php www.google.com\n");

$host = preg_replace("/\/$/", "", $argv[1]); // strip a trailing slash, if any
$target_url = $host . "/robots.txt";

// Strip directives, comments and wildcard lines so only plain paths remain
$filters = array("/[ ]?Allow:[ ]?/",
                 "/[ ]?Disallow:[ ]?/",
                 "/[ ]?Sitemap:.*/",
                 "/[ ]?Request-rate:.*/",
                 "/[ ]?Crawl-delay:.*/",
                 "/[ ]?Visit-time:.*/",
                 "/.*[$*].*/",
                 "/User-agent: .*/",
                 "/#+.*/");

echo "[+] Getting robots.txt content\n";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $target_url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$robots = curl_exec($ch);
curl_close($ch);

if(!$robots)
{
	$httpStatus = httpstatus($host, "robots.txt");
	if(preg_match("/200 OK/", $httpStatus))
		die("[-] robots.txt exists but seems empty\n");
	else
		die("[-] {$httpStatus}\n");
}
elseif(preg_match("/<[^>]+>/", $robots))
	die("[-] robots.txt is incorrectly formatted (html?)\n");
else
{
	$robots = preg_replace($filters, "", $robots);
	// Collapse the blank lines left behind by the filters
	$robots = preg_replace("/(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+/", "\n", $robots);
	$arr = explode("\n", $robots);

	foreach($arr as $loc)
	{
		$loc = trim($loc);
		if($loc == "")
			continue;
		printf("[+] %s%s: %s\n", $host, $loc, httpstatus($host, $loc));
	}
}
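
An example run against a hypothetical target, just to show the output format (host, paths and status codes are made up):

$ php robotscan.php www.example.com
[+] Getting robots.txt content
[+] www.example.com/us/admin/: HTTP/1.1 200 OK
[+] www.example.com/backup/: HTTP/1.1 403 Forbidden

A 200 on a disallowed location doesn't automatically mean a vulnerability, but it's a good hint of where to look next.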

One Response to “robots.txt scanner proof of concept”

  1. Ionizer Says:

    It is much easier:

    google: “robots filetype:txt admin”

    :D

