Single Account DNS Zone stops Responding

Operating System & Version
CloudLinux v7.9.0 STANDARD kvm
cPanel & WHM Version
110.0.15

Halex_777

Registered
Nov 16, 2023
2
0
1
QLD Australia
cPanel Access Level
Root Administrator
Hey all,

Needing the collective genius of this forum to put me out of my misery.

The Config/Context
3 x cPanel 110.0.15 Servers (CloudLinux 7.9) in a DNS Cluster (PowerDNS).
The customer's account in question was transferred from another cPanel Server a couple of years ago. It's a larger account with various sub-domains.

The Issue
Over the last ~9 months, there has been a periodic issue where the site stops responding. Pinging the domain or any of its sub-domains fails because the names do not resolve.
The issue ONLY affects this account and no others on the same Server or other Servers in the DNS Cluster.
It took me a minute to figure out that if I make any arbitrary change in the DNS Zone, the issue is immediately resolved, the domain name resolves correctly, and the site is back online.
The issue then reoccurs anywhere from an hour to several hours later.

Troubleshooting/Changes thus far
The actual Hosting Provider/Server Admins have always been great and have made some recommendations around correcting the NS records in the Zone, among other things.
In the previous 2 instances of this issue, their recommended changes magically 'fixed' it and I didn't have a repeat for another few months. This 3rd instance, however, is stumping us. The Server Admins are suggesting more top-level Glue/NS changes but, again, there are no other affected accounts, just this one customer.

If it's relevant, I'm noting the Zone Serial# before and after each arbitrary change. Between making the change and the issue reoccurring, I'm seeing the serial# go up by 1-3 increments. I'm unsure whether this is normal or whether it suggests that something else in the cluster is editing the zone.
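One way to see whether the cluster members disagree about the zone is to ask each nameserver directly for its SOA serial and compare the answers. The following is only a rough sketch, assuming `dig` is installed; the function names are made up for illustration and the nameserver hostnames in the usage comment are placeholders, not the real cluster members.

```shell
# Print the SOA serial that one nameserver reports for a zone.
# In `dig +short SOA` output the serial is the third field.
soa_serial() {
    local zone="$1" ns="$2"
    dig +short +norecurse SOA "$zone" @"$ns" | awk '{print $3}'
}

# Return 0 only when every serial passed in is non-empty and identical.
serials_match() {
    local first="$1" s
    [ -n "$first" ] || return 1
    for s in "$@"; do
        [ "$s" = "$first" ] || return 1
    done
    return 0
}

# Query each nameserver in turn, print what it reports, then summarise.
check_zone() {
    local zone="$1" ns serial first="" seen=0 result="serials agree"
    shift
    for ns in "$@"; do
        serial="$(soa_serial "$zone" "$ns")"
        printf '%s\t%s\n' "$ns" "${serial:-NO ANSWER}"
        if [ "$seen" -eq 0 ]; then first="$serial"; seen=1; fi
        serials_match "$first" "$serial" || result="serials differ or missing"
    done
    echo "$result"
}

# Example (placeholder hostnames):
#   check_zone domain.com ns1.example.net ns2.example.net ns3.example.net
```

Running this once while the domain resolves and again while it is failing may show whether one cluster member has a different serial or is not answering for the zone at all.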


If anyone can help me resolve this...love you long time!
 

cPRex

Jurassic Moderator
Staff member
Oct 19, 2014
17,470
2,843
363
cPanel Access Level
Root Administrator
Hey there! I'd only expect the serial number to update one time. Is it possible the zone isn't syncing to the cluster properly immediately after you make the change?

In WHM >> Tweak Settings under the Logging tab, I'd recommend enabling the "Enable verbose logging of DNS zone syncing" option if you haven't already. This will write additional log data to /usr/local/cpanel/logs/dnsadmin_log with every sync of the zone, so you'll be able to get information on why that isn't syncing properly.
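Once verbose logging is on, the log can get busy, so it may help to filter it down to the one zone. This is a minimal sketch that assumes the relevant log lines mention the zone name in plain text; the helper name `zone_log_lines` is hypothetical, not a cPanel tool.

```shell
# Show the most recent log lines that mention one zone.
# Arguments: zone name, log file path, number of lines to keep (default 50).
zone_log_lines() {
    local zone="$1" log="$2" count="${3:-50}"
    # -F treats the zone name as a literal string, so the dots are not wildcards.
    grep -F -- "$zone" "$log" | tail -n "$count"
}

# Example:
#   zone_log_lines domain.com /usr/local/cpanel/logs/dnsadmin_log
```

Because the match is a plain substring, lines for sub-domains of the zone are included as well.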

You can also try running this command to ensure the integrity of the zone file on the local system after a change:

named-checkzone domain.com /var/named/domain.com.db

Just update both entries of "domain.com" to be the actual domain you're working with.
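If you end up running that check repeatedly, a small wrapper saves retyping the domain twice. This is just a sketch: `check_zone_file` and the `ZONE_DIR` variable are names invented here, and the path layout is assumed to match the command above.

```shell
# Run named-checkzone for one domain, building the zone file path from the name.
# ZONE_DIR defaults to /var/named, the directory used in the command above.
check_zone_file() {
    local domain="$1" zone_dir="${ZONE_DIR:-/var/named}"
    local file="$zone_dir/$domain.db"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 2
    fi
    # named-checkzone exits non-zero when the zone fails to load.
    named-checkzone "$domain" "$file"
}

# Example:
#   check_zone_file domain.com && echo "zone file looks OK"
```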

Between those two things, I'd expect you to find *something* relevant that exposes the issue.
 

Halex_777

Thanks cPRex! So far, the logging hasn't revealed anything, but, in addition to enabling the Verbose Logging via Tweak Settings, I saw reference to enabling similar Debug Logging via the DNS Cluster Sync config.

Enabling this on the Server where the account is hosted had no effect and showed no useful info in the log. However, since enabling it on the Cluster Master, I've not had a reoccurrence of the issue. Unsure whether the process of Editing/Saving the Sync config resolved the issue for now or whether it's just a coincidence. Before editing the Cluster Sync config to enable the Debug Logging, everything was green tick & syncing successfully, so I'm really not sure.

If it's resolved it for now, sweet. I shall see if it lasts or reappears again in the coming days/months.

Thanks again for the assistance! At least I have something to review if it does reoccur. If it does reoccur and I manage to find the reason, I'll be sure to share on here to hopefully help others. Cheers!