
MySQL active connections never bleed off to other pool members

I am running Galera MySQL behind an F5 with a Performance (Layer 4) virtual server type, and I have set up 3 MySQL nodes as pool members with Priority Group Activation, so only one MySQL node is used and the other two are standby.

Everything was working well, but today I shut down the primary node (the active one) and my application broke. When I checked the logs I found:

(2006, "MySQL server has gone away (error(104, 'Connection reset by peer'))")

The only fix was to restart the application. It looks like active MySQL connections are not bleeding off to the other pool members. What is wrong with my setup?

13 Replies

Can you share your virtual server and pool configuration (from tmsh)?

 

Cheers,

 

Kees

 

Virtual Server

create ltm virtual /OSTACK/OSTACK_VS_GALERA { destination 172.28.0.9:3306 ip-protocol tcp mask 255.255.255.255 pool /OSTACK/OSTACK_POOL_GALERA profiles replace-all-with { /Common/fastL4 { } }  mirror enabled source-address-translation { pool /OSTACK/OSTACK_SNATPOOL type snat } }

Pool

create ltm pool /OSTACK/OSTACK_POOL_GALERA { load-balancing-mode least-connections-node members replace-all-with { OSTACK_NODE_ostack-infra-02_galera_container-fa5d9e98:3306 { priority-group 100 } OSTACK_NODE_ostack-infra-03_galera_container-eaacd880:3306 { priority-group 95 } OSTACK_NODE_ostack-infra-01_galera_container-6c126d29:3306 { priority-group 90 } } min-active-members 1 service-down-action reset slow-ramp-time 0 monitor /OSTACK/OSTACK_MON_GALERA }

The BIG-IP is sending a TCP Reset the moment the pool member with priority 100 goes offline/down.

 

(See article K15095.)

 

The BIG-IP system sends RST or ICMP messages to reset active connections and removes them from the BIG-IP connection table.

 

Note: This selection is named "reset" instead of "reject" when using the TMOS Shell (tmsh).

 

 

And it seems your application is not trying to reconnect when it receives this reset.
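
You can confirm the current setting from tmsh (the pool name is taken from the configuration you posted above):

list ltm pool /OSTACK/OSTACK_POOL_GALERA service-down-action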

 

Hamish
Cirrocumulus

Were you expecting the BigIP to automatically move an existing TCP connection to another host without the client having to participate?

 

Even for the most basic of protocols (e.g. DNS, SNMP), this won't happen. When the pool member goes down you will (by default) receive a RST to indicate that the connection is no longer valid. You have the option to send the mid-stream connection to another host, but as that host has no idea about the connection, you will (again) receive a RST.

Now you can (in theory) write an iRule to migrate MySQL connections from one host to another (triggered when the pool member goes down; this is where you'd normally receive a RST back to the client). I have done it in the dim dark past for LDAP, but in practice it isn't trivial (and possibly impractical, though I'd love to see someone do it). You'd have to implement a protocol-specific proxy to track what was sent and what was received. For a SQL database you'd have to track transactions and be ready to replay the whole transaction to the second server if you had to migrate it for any reason...

 

Given it's usually a pretty application-specific thing, I'd probably suggest altering the app rather than the BigIP to accomplish migration of MySQL connections. Or running MySQL Cluster, which does (apparently) guarantee uninterrupted access for clients... but I've never tried it and I'd suspect it's not cheap either...

 

After lots of debugging I found the following. If I point my application at the F5-based LB, then I see the following error every minute:

(2006, "MySQL server has gone away (error(104, 'Connection reset by peer'))")

Here is the full output of the error:

2018-08-19 09:19:50.789 11159 ERROR oslo_db.sqlalchemy.engines [req-aa221914-d720-490c-a8e8-f9d7b780a353 8ec61b0530b94a699c4dcf164115f365 328fc75d4f944a64ad1b8699c02350ca - default default] Database connection was found disconnected; reconnecting: DBConnectionError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (error(104, 'Connection reset by peer'))") [SQL: u'SELECT 1'] (Background on this error at: http://sqlalche.me/e/e3q8)
2018-08-19 09:19:50.789 11159 ERROR oslo_db.sqlalchemy.engines Traceback (most recent call last):
2018-08-19 09:19:50.789 11159 ERROR oslo_db.sqlalchemy.engines   File "/openstack/venvs/nova-17.0.8/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py", line 73, in _connect_ping_listener
2018-08-19 09:19:50.789 11159 ERROR oslo_db.sqlalchemy.engines     connection.scalar(select([1]))

The interesting thing is that when I point my application at the haproxy (MySQL LB) VIP the error disappears, and if I point the application directly at a Galera MySQL node the error also disappears. It looks like something is going on with the F5-based VIP.

Do you think I should create a "Standard" VIP instead of "Performance (Layer 4)"? I did use source address persistence but still got the same error.
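
For reference, the Standard VIP I have in mind would look something like this (OSTACK_VS_GALERA_STD is just a placeholder name; it swaps the fastL4 profile for the full TCP proxy):

create ltm virtual /OSTACK/OSTACK_VS_GALERA_STD { destination 172.28.0.9:3306 ip-protocol tcp mask 255.255.255.255 pool /OSTACK/OSTACK_POOL_GALERA profiles replace-all-with { /Common/tcp { } } source-address-translation { pool /OSTACK/OSTACK_SNATPOOL type snat } }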

In tcpdump I can see the F5 sending RST packets to both the client and the server, terminating the connection every minute. This is a new installation and there is no customer traffic or any high-volume traffic yet... very odd that the F5 is sending RSTs.
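
This is the kind of capture I am running on the BIG-IP to catch the resets (0.0 captures across all VLANs; the host and port match my VIP):

tcpdump -nni 0.0 'host 172.28.0.9 and port 3306 and tcp[tcpflags] & tcp-rst != 0'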

Ready for FUN: after switching from the SNAT pool to automap, all my issues were fixed. Now I am really, really curious and would like to know: what is the difference here?

After switching to automap, all my MySQL connection errors disappeared and I am not seeing any TCP RST packets from the F5 now.
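
For anyone following along, the change should be just the source-address-translation type on the virtual server:

modify ltm virtual /OSTACK/OSTACK_VS_GALERA source-address-translation { type automap }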

quantiti_170569
Nimbostratus

You can read about SNAT here https://support.f5.com/csp/article/K7820

 

amintej
Cirrus

Hello satish.txt, maybe the problem is related to PVA acceleration. You can try keeping SNAT configured and attaching a new FastL4 profile with PVA Acceleration set to None, or with the Offload State set to EST.
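
Something along these lines (the profile name is just an example):

create ltm profile fastl4 /OSTACK/fastL4_noPVA defaults-from /Common/fastL4 pva-acceleration none
modify ltm virtual /OSTACK/OSTACK_VS_GALERA profiles replace-all-with { /OSTACK/fastL4_noPVA { } }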

 

@amintej,

 

I will try that, but I am curious: what is the relation with PVA acceleration?

 

I have other SNATs running on the same F5 and they are all working great, except this MySQL one.

 

Current setting is PVA Acceleration = FULL

 

@amintej,

 

FYI, I have set PVA Acceleration to None and I am still seeing the error, so it looks like that is not the issue.

 

amintej
Cirrus

OK, and are the SNAT and automap IPs in the same network? Maybe you have an asymmetric routing problem. You can check it using:

tmsh show sys connection
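
You can narrow the output down to the Galera virtual with the connection-table filters, for example:

tmsh show sys connection cs-server-addr 172.28.0.9 cs-server-port 3306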

Ajit
Altostratus

Satish,

 

In your pool settings you should try changing the Action On Service Down (service-down-action in tmsh) from "reset" to "reselect".
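
In tmsh, that would be:

modify ltm pool /OSTACK/OSTACK_POOL_GALERA service-down-action reselect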

 

That should do it for you.

 

Regards,

 

Ajit