Forum Discussion

Parveez_70209
Oct 03, 2013

How to hide URLs (Virtual Servers) exposed over the Internet from search engines using robots.txt?

Hi,

 

Kindly guide us on this: we want to hide our Virtual Server, which is exposed to the Internet, from search engines using robots.txt.

 

I was also going through the link below, in which Kevin provided guidance:

 

https://devcentral.f5.com/questions/irules-and-robotstxt-question

 

Kindly assist with this so that we can understand the relationship between an iRule and robots.txt with respect to a Virtual Server or URL that is exposed over the Internet, where the requirement is to hide it from search engines.

 

Thanks and Regards Parveez

 

5 Replies

  • As the other post alludes, a robots.txt file is purely advisory. Most of the major search engines do honor it, but they certainly don't have to. The contents of the robots.txt file, assuming you want to block all crawlers, are pretty straightforward:

    User-agent: *
    Disallow: /
    

    This tells all robots to go away.
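
    If you only need to keep crawlers away from part of the site rather than all of it, the same file format supports path-specific rules. The paths below are just placeholders for illustration:

    User-agent: *
    Disallow: /admin/
    Disallow: /internal/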

    So to generate that with an iRule, you might do something like this:

    when HTTP_REQUEST {
        # Intercept requests for /robots.txt and answer from the BIG-IP itself,
        # so the request never reaches the pool members
        if { [string tolower [HTTP::uri]] equals "/robots.txt" } {
            HTTP::respond 200 content "User-agent: *\nDisallow: /"
        }
    }
    

    Something that is a little more forceful, and potentially more dangerous, is to match on the requesting client's User-Agent header. The list of potential crawlers could get large, so I'd probably include those in a string-based data group. Example:

    Crawler data group (ex. my_robots_dg)

    bingbot
    msnbot
    exabot
    googlebot
    slurp
    

    ** Reference: http://user-agent-string.info/list-of-ua/bots
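
    If you prefer the command line over the GUI, a minimal tmsh sketch for creating that string data group (using the my_robots_dg name assumed above) would look something like this:

    create ltm data-group internal my_robots_dg type string records add { bingbot { } msnbot { } exabot { } googlebot { } slurp { } }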

    And then an iRule like this:

    when HTTP_REQUEST {
        # Silently drop the request if the (lower-cased) User-Agent header
        # matches any entry in the my_robots_dg data group
        if { [class match [string tolower [HTTP::header User-Agent]] contains my_robots_dg] } {
            drop
        }
    }
    

    Again, this approach is a bit more exhaustive and potentially dangerous if you don't get one of the bot names right in the data group, or if a legitimate browser client happens to send a matching string.
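
    If that risk concerns you, a less destructive sketch (same hypothetical my_robots_dg data group) is to log the match and return a 403 instead of silently dropping, so any false positives show up in /var/log/ltm while you validate the entries:

    when HTTP_REQUEST {
        # Lower-case the User-Agent once so data group entries can be kept lower-case
        set ua [string tolower [HTTP::header User-Agent]]
        if { [class match $ua contains my_robots_dg] } {
            # Log the hit for review, then reject with an explicit 403
            log local0. "Blocked crawler User-Agent: $ua from [IP::client_addr]"
            HTTP::respond 403 content "Forbidden"
        }
    }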

  • Hi Kevin,

     

    Thank you for the guidance. If we attach this iRule to the Virtual Server, does the server/application team also need to include or configure anything related to robots.txt?

     

    Thanks and Regards Parveez

     

  • In the first example, the robots.txt content is included in the iRule and applies to ALL robots.

     

  • Yes, we are planning to test this with the iRule below, which covers all robots:

     

    when HTTP_REQUEST {
        if { [string tolower [HTTP::uri]] equals "/robots.txt" } {
            HTTP::respond 200 content "User-agent: *\nDisallow: /"
        }
    }

     

    So, nothing specific needs to be done by the server/application team to achieve this objective, correct?

     

    Thanks and Regards Parveez

     

  • Correct. If the crawler makes a request for "/robots.txt", the iRule will serve it. Nothing else needs to be done.
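
    To verify the behavior end to end, a quick check from any client (the hostname below is just a placeholder for your virtual server) might look like:

    curl -s http://vip.example.com/robots.txt

    That should return the "User-agent: *" / "Disallow: /" body generated by the iRule, without the request ever reaching the pool members.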