Demo: Filtering Wikipedia articles with LineRate Proxy

Demo of LineRate Proxy and the xml2js Node.js module

The demo below was created by one of the LineRate developers on a Friday hack-a-thon day.  The goal for the day was to create fun demos of some Node.js modules using LineRate Proxy.  This demo shows how LineRate Proxy can intercept HTTP requests and filter or redirect them based on additional XML data that the proxy fetches from an off-box service.

For the full description and the actual code implementing this demo, see the full posting on GitHub.


Wikipedia articles are publicly edited. Some edits are associated with particular usernames. However, anyone on the Internet can click "edit" and make a change; those edits are "by IP". Often, "by IP" edits are advertising robots, vandals, or people/organizations with a point-of-view to inject.

When a user goes through a proxy running this script, they see the latest revision of an article that was edited by a user with a username. This may be a better revision, because that edit was less likely to be from a robot or vandal or POV-pusher.

In reality, this demo is intended to exhibit the LineRate Proxy Scripting Engine.

How it works

A forward-proxy object is configured through the normal means (CLI, REST JSON API, or web GUI). A range virtual-ip is created that listens for connections intended for the Wikipedia servers. The script is attached to that forward-proxy. The LineRate Proxy system is then placed into the network so that user requests are proxied through it.

When a user browses to a Wikipedia page, the browser does DNS resolution as normal. Then, it submits an HTTP request to the resolved address. Since the range virtual-ip is listening at that address, the forward-proxy receives the request. The script is invoked and gets to choose how to handle it.

One option for the script is to simply call next(), which means "I have nothing else to do for this request"; in this case, the request continues along the datapath, goes out to Wikipedia, gets the normal response, and the response goes back to the client. Once next() is called, the script is no longer involved in processing the response, and the proxying happens through the low-level, high-performance proxy system. This is the path used for all the resources that aren't the Wikipedia article itself. So favicon.ico, the helper JavaScript files, and the Wikipedia logo bypass the rest of the processing.
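The pass-through decision above can be sketched as a small predicate. The path patterns below are assumptions based on Wikipedia's URL layout (article pages under /wiki/, static assets elsewhere or with file extensions); the exact LineRate scripting hooks are not shown here.

```javascript
// Decide whether a request needs revision-history filtering.
// Only main article pages (/wiki/<Title>) are intercepted; everything
// else (favicon, scripts, images) is passed straight through with next().
function shouldIntercept(path) {
  // Article pages look like /wiki/Chunked_transfer_encoding; skip any
  // path with a static-asset extension or outside the /wiki/ prefix.
  return /^\/wiki\/[^\/]+$/.test(path) &&
         !/\.(ico|js|css|png|svg)$/.test(path);
}
```

A request that fails this test would simply get next() and never touch the rest of the script.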

However, for the main article request, the scripting engine holds the request, and makes a new HTTP request to get the history of the article, for instance:

POST /w/index.php?title=Special:Export&pages=Chunked_transfer_encoding&offset=0&limit=5&action=submit&dir=desc HTTP/1.1
Accept: */*
Content-Length: 0

The POST is a Wikipedia requirement to get full version information, even though the body of the POST is empty. The HTTP response has an XML body. The xml2js Node module parses it into a JavaScript object, and the script can walk the object to find the revisions and authors.
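The walk over the parsed history can be sketched as below. With xml2js defaults, every repeated child element becomes an array, hence the [0] indexing; the sample object is a hand-built stand-in for what xml2js.parseString would produce from a real Special:Export response, and listAuthors is a hypothetical helper name.

```javascript
// Walk the object xml2js produces from a Special:Export response and
// extract each revision's author. Registered users appear under
// <contributor><username>; anonymous edits appear under <contributor><ip>.
function listAuthors(parsed) {
  var revisions = parsed.mediawiki.page[0].revision;
  return revisions.map(function (rev) {
    var contributor = rev.contributor[0];
    if (contributor.username) {
      return { registered: true, name: contributor.username[0] };
    }
    return { registered: false, name: contributor.ip[0] };
  });
}

// Stand-in for a parsed export: newest revision first (dir=desc),
// an anonymous edit on top of a registered user's edit.
var sample = {
  mediawiki: { page: [{ revision: [
    { id: ['1002'], contributor: [{ ip: ['198.51.100.7'] }] },
    { id: ['1001'], contributor: [{ username: ['ExampleUser'] }] }
  ] }] }
};
```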

If the most recent revision is by a registered user, then next() is invoked for the original user request, the request is passed through to the Wikipedia servers, and the response is returned to the user.

If the most recent revision is not by a registered user, then the script walks backward through the revision history until it finds a revision that was made by a registered user. Then, it writes a response back to the user that is a temporary redirect to that version of the page, like:

HTTP/1.1 302 Found
Location: ...
Content-Length: xxx

<html><head><title>Redirecting Chunked_transfer_encoding</title></head>
  <h1>Redirecting to a human-edited version</h1>
  <p>The last version of Chunked_transfer_encoding was edited by an anonymous user.</p>
  <p>Redirecting to the last human-edited version:
     <a href="...">...</a></p>
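The walk-backward-and-redirect step can be sketched as follows. The oldid permalink form is Wikipedia's standard way to address a specific revision; the en.wikipedia.org host, the buildRedirect name, and the simplified revision-list shape are assumptions for illustration.

```javascript
// Given revisions newest-first (as dir=desc returns them), find the
// newest one by a registered user and build a 302 redirect to its
// permalink. Returns null if no registered edit is found, in which
// case the script would fall back to next().
function buildRedirect(title, revisions) {
  for (var i = 0; i < revisions.length; i++) {
    if (revisions[i].registered) {
      var url = 'https://en.wikipedia.org/w/index.php?title=' +
                title + '&oldid=' + revisions[i].id;
      var body = '<html><head><title>Redirecting ' + title +
                 '</title></head><body>' +
                 '<p>Redirecting to the last human-edited version: ' +
                 '<a href="' + url + '">' + url + '</a></p></body></html>';
      return {
        status: 302,
        headers: { Location: url, 'Content-Length': Buffer.byteLength(body) },
        body: body
      };
    }
  }
  return null;
}
```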

Published Sep 07, 2013
Version 1.0
