Forum Discussion

Aaron_1015
Nimbostratus
Dec 05, 2017

postgres process using excessive CPU

I'm basically going to answer my own question here, but I wanted to document this for any future inquiries on the same topic.

 

I received an RMA unit (LTM 1600) to replace a dead system. I booted it up, did the basic config, and then rejoined it to my cluster. All seemed happy until I noticed one big problem... the CPU usage was in excess of 70%, even though it was the passive node.

 

Looking at the "top" output, I saw "postgres" continuously using 70% or more. I had to research what that was and why it's there: it's an internal PostgreSQL service that I think may be used by ASM/AFM (which we don't use).
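For the record, this is roughly how it showed up (just a sketch; top sorts by %CPU by default, and the exact output varies by version):

    # grab the first screenful of the process list, sorted by CPU
    top -b -n 1 | head -n 20

    # or just pick out the postgres processes
    ps aux | grep [p]ostgres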

 

After some digging I noticed the /var/log/ltm logfile had grown to an enormous size... over 700M in a day. Peeking at the tail of it, I saw that the "pgadmind" service, which kicks off that PostgreSQL instance, was restarting continuously in a horrible loop.
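In case it helps, this is roughly how I was checking it (a sketch, nothing F5-specific):

    # see how big /var/log/ltm has grown
    ls -lh /var/log/ltm*
    du -sh /var/log/ltm

    # watch the end of it live to catch the restart loop
    tail -f /var/log/ltm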

 

It complained that the file "global/pg_filenode.map" was missing... I had to really hunt around to figure out that the full path it looks for is "/var/local/pgsql/data/global/pg_filenode.map".
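If anyone needs to confirm that path themselves, something like this works (a sketch; run the find on a healthy peer, since the file is obviously gone on the broken one):

    # on a healthy peer, locate the file to confirm the full path
    find /var/local/pgsql -name pg_filenode.map 2>/dev/null

    # on the broken unit, the directory is there but the file isn't
    ls -l /var/local/pgsql/data/global/ | grep pg_filenode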

 

Sure enough, that directory existed but that file was nowhere to be found.

 

First, I stopped the service ("tmsh stop /sys service pgadmind") and tried copying just that one file from another system: "scp root@othersystem:/var/local/pgsql/data/global/pg_filenode.map /var/local/pgsql/data/global"

 

Then I did a "chown postgres:root /var/local/pgsql/data/global/pg_filenode.map" to set the correct ownership and started the service back up with "tmsh start /sys service pgadmind".
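Put together, that first attempt was roughly this (with "othersystem" being the healthy peer):

    # stop the daemon that keeps respawning postgres
    tmsh stop /sys service pgadmind

    # pull just the missing map file from the healthy peer
    scp root@othersystem:/var/local/pgsql/data/global/pg_filenode.map \
        /var/local/pgsql/data/global/

    # match the ownership of the other files in that directory
    chown postgres:root /var/local/pgsql/data/global/pg_filenode.map

    # bring the daemon back up
    tmsh start /sys service pgadmind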

 

Well crumb, the service still failed, but at least with a different message now... it complained about other files in that directory being missing. Those other files have names like "12790" and "12788_fsm", so I figured the map file must point to specific files that don't exist on this unit, since I copied it from another system.

 

To fix that, I just wiped that whole directory on the troubled unit, copied all the files from the working node, and restarted the service.
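In command form it was roughly this (a sketch; if I did it again I'd probably move the old directory aside rather than wipe it, in case a rollback is needed):

    # stop pgadmind again so nothing touches the data directory
    tmsh stop /sys service pgadmind

    # move the broken directory aside (I actually deleted it outright)
    mv /var/local/pgsql/data/global /var/local/pgsql/data/global.broken

    # copy the whole directory over from the healthy peer
    scp -r root@othersystem:/var/local/pgsql/data/global /var/local/pgsql/data/

    # fix ownership on everything that came over
    chown -R postgres:root /var/local/pgsql/data/global

    # start the service back up
    tmsh start /sys service pgadmind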

 

Voila, that did it: pgadmind was now running happily, there were no more errors in the ltm log, and CPU usage was back to normal.
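A quick way to double-check that it stuck (a sketch; I believe "bigstart status" also works here as the lower-level equivalent of the tmsh service commands):

    # no more pgadmind restart spam at the end of the log
    tail -n 50 /var/log/ltm

    # postgres should no longer be pinning a CPU
    top -b -n 1 | head -n 15

    # confirm the daemon is actually staying up
    bigstart status pgadmind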

 

But I did wonder if copying that dir from another unit was really the best approach... I have no idea, since I don't really know what uses that database instance.

 

To try and be as sure as possible, I installed a clean 12.1.2 ISO plus the latest hotfix onto a different volume. Before booting into it, I mounted the "var" from that volume into a temp mount directory so I could peek at its data/global directory. That "global" dir doesn't even exist there... I'm pretty sure it's not a symlink to a shared location, so I wonder now if I could have just stopped the service, deleted the global directory, and started back up, and maybe it would have created a new set of files from some default. I just have no experience with PostgreSQL, so I have no idea how that works.
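The peek at the other volume was along these lines (very much a sketch; the logical volume name below is just an example, the real one depends on your hardware and boot location, so check the "lvs" output first):

    # list the logical volumes and find the _var volume for the other boot location
    lvs

    # mount it somewhere temporary and look for the directory
    mkdir -p /mnt/othervol
    mount /dev/vg-db-hda/set.2._var /mnt/othervol   # example LV name, yours will differ
    ls -l /mnt/othervol/local/pgsql/data/global

    # clean up afterwards
    umount /mnt/othervol
    rmdir /mnt/othervol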

 

Anyway, I chalk this up to the RMA unit I was sent having either a damaged or missing map file for whatever random reason... maybe they did a dirty shutdown before packaging it up to ship out, or gamma rays hit that exact spot on the hard drive... whatever the case, I'm glad I was able to remedy it with only a few extra hours of work and also glad it wasn't a hardware issue.

 

If anyone else sees something like this, maybe this will help them out (or if anyone has a better idea of how to properly fix this type of issue, I'd be glad to hear it, since I'm still not sure I did it the correct way).

 

2 Replies

  • Thanks for posting. Not really what you want to happen with an RMA unit; if you have time, report it to F5 support so they can make a note of it.

     

    Personally, I would have considered getting another RMA. This might have solved it, but it kind of gives the feeling that more issues could occur.

     

  • I did mention it in my follow-up to the RMA, so we'll see.

     

    Unfortunately this unit is in a remote datacenter, and I had to travel to handle the swap. I replaced it, made sure it booted up, and then left, assuming it would all be okay; I updated the config later that night. Getting a different RMA would have been troublesome for my schedule, so I'm glad I was able to figure it out, and hopefully this is helpful to someone in the future. :)