Pool setting: "Action On Service Down" set to "Reject" for loaded Pool - bad?
We have a Pool (HTTP) that hosts thousands of connections at any given moment, and the pool has "Action On Service Down" (link to F5 Article below) set to "Reject". We've had a few incidents where most of a pool was impacted after a single server had an issue, and (I cannot be sure, but) I believe part of the problem is that once the first pool member went down, the RSTs sent back to the clients caused them all to re-send their requests at the same time and begin to overwhelm the other servers in the pool.
This config has been around since it was the default (v4???) and I'm wondering if it's time for a change.
I notice the default in recent versions for "Action On Service Down" is "None", and I'm wondering if that might be a better choice for our situation, possibly allowing the servers to recover. Our loads are very "peaky", and perhaps we would be better served by giving the servers more time to recover from a barrage of requests before just sending those requests to another server.
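To put rough numbers on the re-send burst I'm worried about with "Reject", here is a back-of-the-envelope sketch. It assumes connections are spread evenly across pool members and that every client that receives an RST reconnects immediately; neither assumption has been verified on our setup, and the function name and the example figures (6000 connections, 4 members) are made up for illustration.

```python
def burst_after_member_failure(total_connections: int, pool_size: int) -> dict:
    """Estimate per-member load when one pool member fails and its clients
    all reconnect at once (the assumed effect of "Reject").

    Assumes an even spread of connections before the failure.
    """
    if total_connections <= 0:
        raise ValueError("total_connections must be positive")
    if pool_size < 2:
        raise ValueError("need at least two pool members to redistribute load")

    # Share carried by each member before the failure.
    per_member_before = total_connections / pool_size

    # The failed member's whole share lands on the survivors in one burst.
    extra_per_survivor = per_member_before / (pool_size - 1)

    return {
        "per_member_before": per_member_before,
        "extra_per_survivor": extra_per_survivor,
        "per_member_after": per_member_before + extra_per_survivor,
        "increase_pct": 100 * extra_per_survivor / per_member_before,
    }


if __name__ == "__main__":
    # Hypothetical figures, not measurements from our pool.
    print(burst_after_member_failure(total_connections=6000, pool_size=4))
```

The smaller the pool, the larger the jump each surviving member sees, which would fit a single failure cascading when the remaining servers are already near a peak.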
So questions:
Am I characterizing this correctly and do my assumptions seem sound?
What are the drawbacks of using "None" instead of "Reject"?
How do I know how long a request or connection will remain on the DOWN-ed server if it doesn't come back up? Is that a TCP timeout?
Anything else I'm missing?
Thanks,
-Funkdaddy
https://devcentral.f5.com/s/articles/ltm-action-on-service-down