So I have a dickens of a problem. I am trying to remove some

Removing some links in payload but not all

11 Replies

Deb_Allen_18

Historic F5 Account

Nov 30, 2007

I think you mostly just need to move the release outside of the loop; otherwise you will always release on the first match.

I'd also adjust your loop variables to extract the URI string into a variable, then flip the matchclass comparison to look for the host in the URI string and negate it, which eliminates the empty "if" body. Make sure I got that logic right, because I wasn't really clear which condition should result in the masked URI string.

Also, for regex operations you have to limit collection to a 1MB max payload, so I've modified the code below to cap the collection size even if the Content-Length header value is larger.

Here is an adjusted iRule with those changes:


when HTTP_REQUEST {
  # Don't allow data to be chunked
  if { [HTTP::version] eq "1.1" } {
    if { [HTTP::header is_keepalive] } {
      HTTP::header replace "Connection" "Keep-Alive"
    }
    HTTP::version "1.0"
  }
}
when HTTP_RESPONSE {
  # Only check responses that are a text content type
  # (text/html, text/xml, text/plain, etc).
  if { [HTTP::header "Content-Type"] starts_with "text/" } {
    # Get the content length so we can request the data to be
    # processed in the HTTP_RESPONSE_DATA event.
    if { [HTTP::header exists "Content-Length"] && [HTTP::header "Content-Length"] < 1048577 } {
      set content_length [HTTP::header "Content-Length"]
    } else {
      set content_length 1048576
    }
    log local0.info "Content Length: $content_length"
    if { $content_length > 0 } {
      HTTP::collect $content_length
    }
  }
}
when HTTP_RESPONSE_DATA {
  # Find ALL the possible URLs in one pass
  log local0.info "Time for some regex action baby"
  set url_indices [regexp -all -inline -indices {^((http[s]?):\/)?\/?([^:\/\s]+)((\/\w+)*\/)([\w\-\.]+[^?\s]+)(.*)?([\w\-]+)?$} [HTTP::payload]]
  log local0.info "url_indices: $url_indices"
  foreach url_idx $url_indices {
    set url_start [lindex $url_idx 0]
    set url_end [lindex $url_idx 1]
    set url_len [expr {$url_end - $url_start + 1}]
    log local0.info "url_start: $url_start  url_end: $url_end  url_len: $url_len"
    
    set url_address [string range [HTTP::payload] $url_start $url_end]
    log local0.info "url_address: $url_address"
    # Check to see if URL is not part of allowed hosts data group
    if { !([matchclass $url_address contains $::valid_hosts]) } {
      # If not a valid URL, then mask out URLs with X's
      HTTP::payload replace $url_start $url_len [string repeat "X" $url_len]
    }
  }
  HTTP::release
}
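
One detail worth checking in the loop above: with -all -inline -indices, Tcl's regexp returns an index pair for the whole match followed by one pair per capturing group, so a foreach over the raw list also visits the sub-match ranges. Here is a minimal plain-Tcl sketch that keeps only the whole-match pairs. The sample string and pattern are made up for illustration and are not the poster's data or regex.

```tcl
# Illustrative payload and pattern; both are assumptions, not from the thread.
set payload {see http://a.example/x and https://b.example/y}
set pattern {(https?)://([^/\s]+)(/\S*)?}

# Each match yields one index pair for the full match,
# then one pair per capturing group (3 groups here).
set all_idx [regexp -all -inline -indices $pattern $payload]

# Walk the list four items at a time and keep only the full-match pair.
set whole {}
foreach {full scheme host path} $all_idx {
    lappend whole $full
}
puts $whole
```

The same effect could probably be had by making every group non-capturing with (?:...), in which case each match contributes a single index pair.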

As for your regex, I'd say you don't want to start with ^ or end with $, since a URI could start or end mid-line. Even with the start and end anchors removed, I can see what look like some issues with too-greedy wildcards. You mention you want to replace hrefs, but the regex doesn't look for the string "href", and it would most likely be a good optimization to lock down the regex op as much as possible. A better definition of what you want to mask is in order, and then we can refine the expression.

HTH

/deb
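
Deb's point about the ^ and $ anchors can be seen in a small plain-Tcl sketch. The HTML snippet and the simplified URL pattern are invented for the example; they only show why an anchored pattern finds nothing when the URL sits mid-line.

```tcl
# Illustrative line of HTML; the host name is a made-up example.
set html {<p>Visit <a href="http://other.example/page">this site</a> today.</p>}

# Anchored: matches only if the entire string is a URL, so nothing is found.
set anchored   [regexp -all -inline {^https?://[^"\s]+$} $html]
# Unanchored: finds the URL wherever it sits in the line.
set unanchored [regexp -all -inline {https?://[^"\s]+} $html]

puts "anchored: [llength $anchored]  unanchored: [llength $unanchored]"
```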

hooleylist
Cirrostratus
Dec 03, 2007
Have you considered using a stream profile to do the replacements? You can use a regex for the search parameters.

And as Deb said if you post more detail on exactly what you're trying to replace with X's, we might be able to help with the regex. On a related note, you can save memory consumption by prefixing regex tokens within parens with '?:'. This avoids unnecessarily capturing matches into backreferences. For example, this '(http[s]?)' could be done with this '(?:http[s]?)'. Also, you shouldn't need to escape forward slashes with backslashes. As Deb suggested, you might want to add a question mark .*? to your wildcard .* to do a lazy match.

Aaron
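
The two regex tips above (non-capturing groups and lazy wildcards) can be tried in plain Tcl. This is only a sketch on an invented two-link string, not the rule from the thread.

```tcl
# Illustrative HTML with two links; host names are made up.
set html {<a href="http://one.example/">one</a> <a href="https://two.example/">two</a>}

# Capturing group: -inline returns the full match plus the sub-match.
set with_capture [regexp -inline {(http[s]?)://} $html]
# Non-capturing group: only the full match is returned.
set no_capture   [regexp -inline {(?:http[s]?)://} $html]

# Greedy .* runs to the last </a>; lazy .*? stops at the first one.
set greedy [lindex [regexp -inline {<a .*</a>} $html] 0]
set lazy   [lindex [regexp -inline {<a .*?</a>} $html] 0]

puts "capture: [llength $with_capture]  no capture: [llength $no_capture]"
puts "greedy: $greedy"
puts "lazy:   $lazy"
```
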
David_Homoney
Nimbostratus
Dec 03, 2007
Thanks for the input guys. Ok so the goal is to allow some URLs to be present in the pages and have some scrubbed. There is a data group of hosts (www.domain.com) that needs to be checked to see if the link needs to be scrubbed or not. If the url contains a host from the DG it is to be left alone, otherwise it needs to be x'ed out. Make sense?
hooleylist
Cirrostratus
Dec 03, 2007
How many domains do you want to remove from the HTTP content? I would guess that using a stream profile/expression with the list would be significantly faster than collecting the response data and performing regex operations on it. Using a stream to do this would also simplify the rule needed to do this.

If you can define the 'find' pattern using strings it would be the most efficient. Else, if you need to use regexes, something like this should work:
when HTTP_RESPONSE {
  if { [HTTP::header value Content-Type] contains "text" } {
    STREAM::expression {@https?://(?:www\.)?example1\.com@xxxxxxxxxx@ @https?://(?:www\.)?example2\.com@xxxxxxxxxx@}
    STREAM::enable
  }
}

Aaron
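
If the list of hosts to mask were known, the expression string above could be assembled from a list instead of typed by hand. A rough plain-Tcl sketch follows; build_stream_expr is a hypothetical helper name, and the result would still need to be passed to STREAM::expression inside the rule.

```tcl
# Hypothetical helper: build a string of the form
# "@find@replace@ @find@replace@" from a list of hosts to mask.
proc build_stream_expr {hosts} {
    set parts {}
    foreach host $hosts {
        # Escape dots so they match literally in the find pattern.
        set escaped [string map [list . {\.}] $host]
        lappend parts "@https?://(?:www\\.)?${escaped}@xxxxxxxxxx@"
    }
    return [join $parts " "]
}

puts [build_stream_expr {example1.com example2.com}]
```
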
David_Homoney
Nimbostratus
Dec 03, 2007
Aaron,

Thanks for the input. The issue is that I don't know which domains need to be removed. I only know that all relative links and all the defined hosts must be allowed, while all the rest must be removed.
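
The rule described here (relative links and listed hosts are kept, everything else is masked) can be sketched as a plain-Tcl decision function. The should_mask name and the host list are assumptions for illustration; in an iRule the list would come from the data group and the lookup would use matchclass.

```tcl
# Stand-in for the data group; these host names are invented examples.
set valid_hosts {www.domain.com static.domain.com}

# Return 1 if the link should be masked, 0 if it should be left alone.
proc should_mask {url valid_hosts} {
    # Relative links have no scheme://host part, so they are always allowed.
    if {![regexp -nocase {^https?://([^/:?#\s]+)} $url -> host]} {
        return 0
    }
    # Absolute links are allowed only when the host is in the list.
    if {[lsearch -exact $valid_hosts [string tolower $host]] >= 0} {
        return 0
    }
    return 1
}

puts [should_mask "/images/logo.gif" $valid_hosts]
puts [should_mask "http://www.domain.com/page" $valid_hosts]
puts [should_mask "https://other.example/x" $valid_hosts]
```

Note that this sketch treats protocol-relative links (//host/path) as relative, which may or may not be what is wanted.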
David_Homoney
Nimbostratus
Dec 03, 2007
No problem. Since it would appear that your regexfu is better than mine, do you know how to regex for any ? I have to capture relative links and regular links. Frequently the URL is the text and I need to x that all out.
hooleylist
Cirrostratus
Dec 03, 2007
Sure. Can you post some anonymized examples of strings you want to match and strings you don't, as they would appear in the HTML of a response?

Aaron
David_Homoney
Nimbostratus
Dec 03, 2007
I need to match anything in an href tag. This includes the URL (either relative or explicit) and the text afterwards, as it could contain the link.
hooleylist
Cirrostratus
Dec 03, 2007
Here is a sample of hrefs I assume you don't want to check:

refs to keep

And here is a list of hrefs you do want to compare against the class:

refs to potentially mask out

description

description

description

Jump to the Useful Tips Section

This regex matches all in the list to compare and none in the list not to check:

(?si)<a\s[\w"'=]*?href\s?=[\s'"]?https?://.*?>.*?</a>

Here is an explanation of the tokens from RegexBuddy:

Match the remainder of the regex with the options: dot matches newline (s); case insensitive (i) «(?si)»

Match the characters "<a" literally «<a»

Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s»

Match a single character present in the list below «[\w"'=]*?»

Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

Match a single character that is a "word character" (letters, digits, etc.) «\w»

One of the characters ""'=" «"'=»

Match the characters "href" literally «href»

Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s?»

Between zero and one times, as many times as possible, giving back as needed (greedy) «?»

Match the character "=" literally «=»

Match a single character present in the list below «[\s'"]?»

Between zero and one times, as many times as possible, giving back as needed (greedy) «?»

Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.) «\s»

One of the characters "'"" «'"»

Match the characters "http" literally «http»

Match the character "s" literally «s?»

Between zero and one times, as many times as possible, giving back as needed (greedy) «?»

Match the characters "://" literally «://»

Match any single character that is not a line break character «.*?»

Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

Match the character ">" literally «>»

Match any single character that is not a line break character «.*?»

Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»

Match the characters "</a>" literally «</a>»

I didn't test this extensively. If it doesn't work for you, can you give more examples on what you do/don't want to match?

Thanks,

Aaron
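
For anyone wanting to try the pattern outside the BIG-IP, here is a small plain-Tcl sketch. The pattern is assembled from the token-by-token explanation above; the opening <a and closing </a> literals are an assumption, and the sample HTML is invented.

```tcl
# Pattern assembled from the token explanation; the <a and </a> literals
# are assumed rather than copied from the post.
set pattern {(?si)<a\s[\w"'=]*?href\s?=[\s'"]?https?://.*?>.*?</a>}

# One relative link (should be skipped) and one absolute link (should match).
set html {<a href="/local/page.html">Local</a> <a href="http://other.example/x">Other</a>}

set matches [regexp -all -inline $pattern $html]
puts $matches
```

Only the absolute link is returned, so the relative one never reaches the class comparison.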
Eduardo_Saito_1
Nimbostratus
Dec 21, 2007
Hello homoney,

I have an iRule very similar to yours, but related to images calls.

I'm having trouble with CPU overload from the regex. It takes an (insane) 1% CPU per request, depending on how many images the web page has.

I have the same issues you have with a stream profile.

Are you having the same problem I'm having?

Thanks!!!