Forum Discussion
ASM to block indexing of sites.
While you can configure Web Scraping protection, I don't think it's the best way to prevent search-engine indexing. I'm in favor of relying on the /robots.txt standard.
If Google does not receive a response to its /robots.txt request (the request times out), all indexing of your website is skipped, as if the website did not exist at all. If Google receives an HTTP 404 or any other normal response, your website is subject to indexing. There are firms that will not wait a day to file a lawsuit against Google if they find anything indexed that they did not want indexed, and for that reason Google proposed this behavior as a compromise. Most major search engines behave the same way today.
For the simplest solution, just make sure requests to /robots.txt time out, and you're done.
-
Many viable solutions here. Personally, I use an LTM policy (default rule: Enable ASM; conditional rule for /robots.txt: Drop; Policy Strategy: Best-match). A simple iRule that drops requests to /robots.txt will work too; a sketch is shown below.
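For reference, a minimal sketch of such a drop iRule (the exact matching and any logging are up to you; this is an illustration, not the poster's actual configuration):

when HTTP_REQUEST {
    # Silently drop requests for /robots.txt so the crawler's request
    # times out; all other traffic is passed through untouched.
    if { [string tolower [HTTP::path]] eq "/robots.txt" } {
        drop
    }
}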
-
Arguably the cleanest solution is a valid response to the /robots.txt request: an HTTP 200 response with the following payload:
User-agent: *
Disallow: /
(Search engines will respect this statement and understand that no pages are to be indexed.)
This file, named 'robots.txt', can be hosted on the back-end server in the WWW root directory (/). It can also be hosted on the BIG-IP as an iFile, or embedded directly in an iRule. Whichever you choose, to answer the /robots.txt request from the BIG-IP you will use the 'HTTP::respond 200 content' command (example of an iRule responding with a static HTML payload here: https://devcentral.f5.com/questions/irule-response-with-static-html-message-when-pool-members-are-down). A sketch follows below.
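As a minimal sketch, assuming the payload is embedded directly in the iRule rather than loaded from an iFile:

when HTTP_REQUEST {
    # Answer /robots.txt directly from the BIG-IP with a
    # "disallow everything" policy; other requests are unaffected.
    if { [string tolower [HTTP::path]] eq "/robots.txt" } {
        HTTP::respond 200 content "User-agent: *\r\nDisallow: /\r\n" "Content-Type" "text/plain"
    }
}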
Regards,