Unreliable gateway monitoring and recovery from (staged) failure

Pentangle

Hi all,

I've just spent an hour beating my head against a brick wall.

I have two pfSense instances at a customer site, in HA (CARP). The site has a 600 Mbit/s FTTP connection and a 100 Mbit/s leased line. Because they're in HA, the two pfSense boxes (essentially) share their configuration. The FTTP connection is one of those weird ones: the ISP wouldn't give us any static IP addresses and wouldn't entertain us connecting via anything but DHCP. Accordingly, the secondary pfSense has the interface configured but doesn't have the physical FTTP connection.

I think the fact that there are two pfSense boxes in HA is largely irrelevant here. My issue is with the failover between the FTTP and the leased line on the primary pfSense, and CARP stays with the master throughout, so I won't bother expanding the problem to include both boxes.

There's a gateway group containing the FTTP link as Tier 1 and the leased line as Tier 2.
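
To spell out what I expect that group to do, here's a rough Python sketch of my mental model. This is not pfSense's actual code: the function name and gateway names are made up for illustration, and it assumes the group simply picks a gateway from the lowest-numbered tier that the monitor currently considers up.

```python
from typing import Optional


def pick_gateway(tiers: dict[str, int], is_up: dict[str, bool]) -> Optional[str]:
    """Return an 'up' gateway from the lowest-numbered tier, or None if all are down.

    tiers maps a (hypothetical) gateway name to its tier number.
    is_up maps a gateway name to what the monitor currently believes.
    """
    candidates = [name for name in tiers if is_up.get(name, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: tiers[name])


# Hypothetical names standing in for my two WAN gateways.
tiers = {"FTTP_DHCP": 1, "LEASED_LINE": 2}
```

The point is that everything hinges on `is_up`: if the monitor never marks the FTTP gateway as down, this logic never moves to Tier 2, which is what I suspect is happening.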

This site is connected via an IPsec VPN to a third pfSense at a datacentre, behind which is the domain controller hosting the DNS the users use (i.e. all their servers are in colo, with nothing physically on-prem aside from these pfSense boxes). The VPN connects using a dynamic DNS name from DuckDNS, which pfSense is configured to update.

So, the testing:

When both connections are live, everything works.
When I remove the FTTP connection, one of three things happens:

1) There is a single dropped packet, then the leased line takes over, the VPN gets re-established, and everything's fine (this happens very rarely).
2) There is a command prompt page's worth of dropped packets, and then the leased line takes over, but the VPN does not get re-established.
3) Packets drop indefinitely and nothing comes back for the 10 minutes or so I let the test run.

I believe my problems stem from:

A) The gateway group not realising the primary gateway is down
B) The Dynamic DNS service not realising something's happened which would require a new Dynamic DNS Update
C) Something else, I know not what.

I would appreciate some help. Specifically, I'd love to know:

1) Why, if there's an X against the interface in the dashboard, does the gateway monitor still think that things are "Pending" and "Gathering data"?
2) What is the impact of the gateway monitor IP address when there's no static IP address in the primary link path I can use? (i.e. if I use 1.1.1.1 or 8.8.8.8 as a gateway monitor address, is this going to screw things up because it can be "seen" from the other interface(s)?)
3) What can I do to ensure a Dynamic DNS update occurs following a gateway change, so that the VPN can re-establish?
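
For the Dynamic DNS question, what I'm after is roughly the check below. It's only a sketch of the behaviour I want, not anything pfSense exposes: the function name is mine, and the addresses are documentation-range placeholders rather than my real ones.

```python
import ipaddress


def needs_ddns_update(registered_ip: str, current_wan_ip: str) -> bool:
    """True when the address the DDNS name points at differs from the
    address of whichever WAN is now carrying traffic."""
    return ipaddress.ip_address(registered_ip) != ipaddress.ip_address(current_wan_ip)


# Placeholder addresses: the FTTP lease before failover, the leased line after.
fttp_ip = "203.0.113.10"
leased_line_ip = "198.51.100.20"
```

In other words, after failing over to the leased line I'd expect `needs_ddns_update(fttp_ip, leased_line_ip)` to be true and an update to be pushed to DuckDNS, so the datacentre end can find the site again.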

Any help gratefully received with virtual beers all round.