This article discusses a method for regulating HTTP accesses from robots (a.k.a. crawlers, spiders, or bots) using the F5 LineRate Precision Load Balancer.
A growing number of accesses from robots can potentially affect the performance of your web services. Studies show that robots accounted for 35% of accesses in 2005 [1], a figure that grew to 61.5% by 2013 [2]. Many sites employ the de facto standard Robots Exclusion Protocol [3] to regulate access; however, not all robots follow this advisory mechanism [4]. You could filter out the disobedient robots on the web servers themselves, but that would be an extra burden for already heavily loaded servers. In this use-case scenario, we utilize LineRate scripting to exclude the known robots before they reach the backend servers.
The story is simple. When a request hits LineRate, it checks the HTTP User-Agent request header. If the value belongs to one of the known robots, LineRate sends a 403 Forbidden response back to the client without conveying the request to the backend servers. A list of known robots can be obtained from a number of web sites. In this article, user-agents.org was chosen as the source because it provides an XML-formatted list. The list also contains legitimate user agents, so only the entries marked as Robot or Spam must be extracted.
Here is the code.
'use strict';
var vsm = require('lrs/virtualServerModule');
var http = require('http');
var event = require('events');
var xml = require('xml2js');
First, include the necessary modules. lrs/virtualServerModule is a LineRate-specific module that handles traffic. http and events are standard Node.js modules: the former is used to access the user-agents.org server as an HTTP client, and the latter is for custom event handling. xml2js is an NPM module that translates an XML-formatted string into a JavaScript object.
function GetRobots () {
this.vsname = 'vs40';
this.ops = {
host: '80.67.17.172',
path: '/allagents.xml',
headers: {'Host': 'www.user-agents.org',
'Accept': '*/*'}
};
this.stat403 = 403;
this.body403 = 'Forbidden';
this.head403 = {'Content-Type': 'text/plain',
'Content-length': this.body403.length};
this.xml = ''; // populated by getter()
this.list = {}; // populated by parser()
};
GetRobots.prototype = new event.EventEmitter;
The GetRobots class stores information such as the HTTP access parameters and the 403 response message. In order to handle custom events, the class inherits from the events module's EventEmitter. The class contains two methods (functions): GetRobots.prototype.parser() parses the XML string into an object, and GetRobots.prototype.getter() retrieves the XML data.
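If the prototype-plus-EventEmitter pattern is unfamiliar, here is a minimal standalone sketch of the same idea. The Worker name and the 'done' event are made up for illustration and are not part of the script above.
'use strict';
var events = require('events');
// A tiny class that inherits from EventEmitter, just like GetRobots does.
function Worker() {}
Worker.prototype = new events.EventEmitter();
// An asynchronous method that signals completion via a custom event.
Worker.prototype.doWork = function() {
    var self = this;
    setImmediate(function() {
        self.emit('done', 42);
    });
};
var w = new Worker();
w.on('done', function(result) {
    console.log('finished with ' + result); // prints "finished with 42"
});
w.doWork();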
// Parse XML string into an object
GetRobots.prototype.parser = function() {
var reg = /[RS]/i;
var self = this;
try {
xml.parseString(self.xml, function(e, res) {
if (e || ! res)
console.error('robot: parser error: ' + e);
else if (! res['user-agents'] || ! res['user-agents']['user-agent'])
console.error('robot: parser got malformed data.');
else {
var array = res['user-agents']['user-agent'];
for (var i = 0; i < array.length; i++) {
if (reg.test(array[i].Type))
self.list[(array[i].String)[0]] = 1;
}
self.emit('parser');
}
});
}
catch(e) {
console.error('robot: parser got unknown error ' + e);
}
};
This is the parser method. The retrieved XML data is structured in the <user-agents><user-agent>....</user-agent></user-agents> format. Each <user-agent>....</user-agent> element contains information about one user agent. The tags we are after are <String> and <Type>. The <String> tag contains the value of the HTTP User-Agent header. The <Type> tag contains the type of the agent: we are after Type R(obot) or S(pam), as expressed in the regular expression in the code. After parsing completes, the method emits the custom 'parser' event.
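To make the data shape concrete, here is a small standalone sketch of how xml2js handles a fragment shaped like the user-agents.org list (the two entries are made up for illustration). With the xml2js defaults, every child element becomes an array, which is why the script reads (array[i].String)[0].
'use strict';
var xml = require('xml2js');
// A made-up fragment in the same shape as the real list.
var sample =
    '<user-agents>' +
    '<user-agent><String>ExampleBot/1.0</String><Type>R</Type></user-agent>' +
    '<user-agent><String>SomeBrowser/5.0</String><Type>B</Type></user-agent>' +
    '</user-agents>';
xml.parseString(sample, function(e, res) {
    var array = res['user-agents']['user-agent'];
    console.log(array[0].String[0]); // 'ExampleBot/1.0'
    console.log(array[0].Type[0]);   // 'R' -> matched by /[RS]/i
    console.log(array[1].Type[0]);   // 'B' -> skipped
});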
// Retrieve the XML formatted user agent list
GetRobots.prototype.getter = function() {
var self = this;
try {
var client = http.request(self.ops, function(res) {
var data = [];
res.on('data', function(str) {
data.push(str);
});
res.on('end', function() {
self.xml = data.join('');
self.emit('getter');
});
}).end();
}
catch(e) {
console.error('robot: getter error: ' + e.message);
}
};
This snippet is the getter. It sends an HTTP GET request to the server and receives the XML string. After all the XML data has been received, it emits the custom 'getter' event.
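Before wiring the class into a virtual server, the two methods can be exercised on their own by chaining the custom events. This is only a sketch for sanity-checking the class outside the proxy path; it is not part of the original script.
// Standalone check: fetch the list, parse it, and print the count.
var test = new GetRobots();
test.on('getter', function() {  // fires once the XML download completes
    test.parser();
});
test.on('parser', function() {  // fires once the XML has been parsed
    console.log(Object.keys(test.list).length + ' robot user-agents loaded');
});
test.getter();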
// main part
var robo = new GetRobots();
vsm.on('exist', robo.vsname, function(vso) {
robo.on('getter', function() {
console.log('robot: got XML file. ' + robo.xml.length + ' bytes.');
robo.on('parser', function() {
var num = (Object.keys(robo.list)).length;
console.log('robot: got ' + num + ' robots.');
vso.on('request', function(servReq, servResp, cliReq) {
var agent = servReq.headers['User-Agent'];
if (robo.list[agent]) {
servResp.writeHead(robo.stat403, robo.head403);
servResp.end(robo.body403);
}
else {
cliReq();
}
});
});
robo.parser();
});
robo.getter();
});
console.log('robot: retrieving info from ' + robo.ops.headers['Host']);
Now, combine them together. The main part proceeds through the following steps sequentially: wait for the virtual server vs40 to come up, retrieve the XML list with getter(), parse it into the lookup table with parser(), and only then attach the 'request' handler to the virtual server. From that point on, the User-Agent header of every incoming request is checked against the list: a known robot gets the 403 Forbidden response, and any other request is forwarded to the backend by calling cliReq().
Let's test the script.
Try accessing LineRate with your browser. It should return the backend server's data as if no intermediate processing existed.
Try mimicking a robot (any entry with the R or S mark) using curl as below.
$ curl -D - -H "User-Agent: DoCoMo/1.0/Nxxxi/c10" 192.168.184.40
HTTP/1.1 403 Forbidden
Content-Type: text/plain
Content-length: 9
Date: Tue, 03-Mar-2015 04:03:56 GMT

Forbidden
The User-Agent string must be an exact match to a string that appears in the user-agents.org list, because the header value is looked up as a plain key of the robo.list object.
The script leaves the following log messages upon startup.
robot: retrieving info from www.user-agents.org
robot: got XML file. 693519 bytes.
robot: got 1527 robots.
While the script runs fine, there are a few possible alterations that can make it nicer.
Please leave a comment or reach out to us with any questions or suggestions and if you're not a LineRate user yet, remember you can try it out for free.
References:
[1] Yang Sun, Ziming Zhuang, and C. Lee Giles: "A large-scale study of robots.txt", Proc. 16th Int. Conf. World Wide Web (WWW 2007), 1123-1124 (May 2007).
[2] Igal Zeifman: "Report: Bot traffic is up to 61.5% of all website traffic", Incapsula's Blog (09 Dec 2013).
[3] The Web Robots Pages. The protocol was proposed to the IETF by M. Koster in 1996.
[4] C. Lee Giles, Yang Sun, and Isaac G. Councill: "Measuring the web crawler ethics", Proc. 19th Int. Conf. World Wide Web (WWW 2010), 1101-1102 (Apr 2010).