Cellular Network Failover

Most of my devices route internet traffic over cellular and use ethernet for Modbus TCP.

I have some devices with spotty cellular coverage where I’d like to route internet traffic over ethernet.

I don’t have control over these ethernet networks. If the ethernet network configuration changes at a site and I lose internet access, I’d like to fail over to cellular. That would let me reconfigure the device without visiting the site.

It sounds like the recommended approach for this scenario is to configure routing for both network interfaces, and to use a failover script to change the routing metric if internet access via ethernet is lost.
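
To make the metric idea concrete: when two default routes exist, the kernel uses the one with the lowest metric. The sketch below parses `ip route show default` style output and prints the interface that would currently carry internet traffic. It is only an illustration; `lowest_metric_dev` is a made-up helper name and `wwan0` is a hypothetical cellular interface name.

```shell
#!/bin/bash
# Print the interface of the default route with the lowest metric.
# Reads `ip route show default` style text on stdin, so it can be tried offline.
lowest_metric_dev() {
  awk '
    $1 == "default" {
      dev = ""; metric = 0   # ip omits "metric" when the metric is 0
      for (i = 2; i < NF; i++) {
        if ($i == "dev")    dev = $(i + 1)
        if ($i == "metric") metric = $(i + 1)
      }
      if (dev != "" && (best == "" || metric + 0 < best_metric + 0)) {
        best = dev; best_metric = metric
      }
    }
    END { if (best != "") print best }
  '
}

# On a device:
#   ip route show default | lowest_metric_dev
```

A failover script then only has to change which default route has the lowest metric; nothing else about the two interfaces needs to change.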

Does anyone here have experience with this (or a different approach) that they’d be willing to share?

Are there any gotchas I should watch out for?

Note: I’m using networkd, and trying to keep my images small so I can OTA update via cellular.

Cheers,
Greg

FWIW, here’s an example script from Gemini (I’d run it as a service with systemd).

TARGET="amazonaws.com"
INTERFACE="eth0"
CHECK_INTERVAL=5
MAX_FAILURES=3
FAILURE_COUNT=0

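# Caveat: networkctl has no "metric" subcommand, so the two networkctl calls
# below fail as written.
PRIMARY_METRIC=10
FAILOVER_METRIC=2000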

while true; do
    # Ping once, timeout after 2 seconds, specifically on eth0
    if ping -I "$INTERFACE" -c 1 -W 2 "$TARGET" > /dev/null 2>&1; then
        if [ "$FAILURE_COUNT" -ne 0 ]; then
            echo "Internet recovered on $INTERFACE. Restoring primary route."
            networkctl metric "$INTERFACE" "$PRIMARY_METRIC"
            FAILURE_COUNT=0
        fi
    else
        ((FAILURE_COUNT++))
        echo "Check failed on $INTERFACE ($FAILURE_COUNT/$MAX_FAILURES)"

        if [ "$FAILURE_COUNT" -eq "$MAX_FAILURES" ]; then
            echo "Internet dead on $INTERFACE. Switching to cellular."
            networkctl metric "$INTERFACE" "$FAILOVER_METRIC"
        fi
    fi
    sleep "$CHECK_INTERVAL"
done

Sorry. Ignore the example script above.

This is what I’ve been testing.

#!/bin/bash

# Network configured to have two default routes
# Priority: metric 20 gateway for cellular
# Backup: metric 1000 gateway for ethernet

# Each run checks whether internet access via the ethernet gateway is working.
# If it is, add a route with a lower metric than the cellular gateway; if it
# isn't, remove that low-metric route.

# Notes:
#   * Routes added with ip route do not persist across reboots.
#   * Adding an identical route or removing a nonexistent one makes ip return
#     an error, which is harmless here and ignored with `|| true`.

ETH_GATEWAY=$(ip -j route show dev wiredeth0 \
  | jq -r '.[] | select(.dst == "default" and .gateway != null) | .gateway' \
  | head -n 1)

if [ -z "$ETH_GATEWAY" ]; then
  echo "net-gateway-failover: no ethernet gateway found for wiredeth0: '$ETH_GATEWAY'"
  exit 0 # systemd shouldn't rerun this script.
fi

echo "net-gateway-failover: eth gateway ip: '$ETH_GATEWAY'"

if /bin/ping -c 3 -W 5 -I wiredeth0 8.8.8.8 > /dev/null 2>&1; then
  echo "net-gateway-failover: wiredeth0 ok: adding route (ignore failure if already added)"
  ip route add default via "$ETH_GATEWAY" dev wiredeth0 metric 10 || true
  echo "net-gateway-failover: wait for route to be ready"
  sleep 1
  if /bin/ping -c 3 -W 5 8.8.8.8 > /dev/null 2>&1; then
    echo "net-gateway-failover: wiredeth0 route ok"
  else
    echo "net-gateway-failover: wiredeth0 add failure: deleting route (ignore failure if already deleted)"
    ip route del default via "$ETH_GATEWAY" dev wiredeth0 metric 10 || true
  fi
else
  echo "net-gateway-failover: wiredeth0 failure: deleting route (ignore failure if already deleted)"
  ip route del default via "$ETH_GATEWAY" dev wiredeth0 metric 10 || true
fi
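
A possible variation on the add/delete steps, as a sketch only: `ip route replace` adds the route if it is missing and overwrites it if it already exists, so the add step succeeds either way. The helper names (`run`, `prefer_eth`, `unprefer_eth`) and the `DRY_RUN` switch are made up for illustration; `DRY_RUN=1` prints the commands instead of running them, so the logic can be tried without root.

```shell
#!/bin/bash
ETH_IF=wiredeth0   # interface name taken from the script above
LOW_METRIC=10      # must be lower than the cellular route's metric (20)

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Add or refresh the preferred ethernet default route.
# $1 = ethernet gateway IP
prefer_eth() {
  run ip route replace default via "$1" dev "$ETH_IF" metric "$LOW_METRIC"
}

# Remove the preferred route; a route that is already gone is not an error.
unprefer_eth() {
  run ip route del default via "$1" dev "$ETH_IF" metric "$LOW_METRIC" || true
}

# Example: DRY_RUN=1 prefer_eth 192.168.1.1
```

The delete step still needs `|| true`, because `ip route del` returns an error when the route does not exist.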

Hi @greg-blackcurrent,

Thanks for sharing! I have asked around a bit internally, and while we are convinced that comparable approaches are used in a number of cases, none of them has been shared with us.

What we can say is that the Mender Client is supposed to work correctly with a setup like the one you described. If you run into any unexpected or “interesting” observations, I’d love to hear them.

Greetz,
Josef

Thanks for asking around.

One random thought. I wonder if a more robust solution would be to make mender-connect aware of the failover interface.

I.e., if mender-connect is unable to connect via the default gateway, it could try to connect using the failover interface (by specifying the local outgoing IP address of the failover interface when connecting to the server).

This would make mender-connect more robust to bad network configurations.
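
To illustrate the idea outside of mender-connect, here is a rough shell sketch (not mender-connect code, and not an existing mender-connect option): probe the server over the default route first, and only if that fails, retry bound to the failover interface. `pick_path`, `PROBE`, `SERVER_URL`, and `FAILOVER_IF` are made-up names; `wwan0` is a hypothetical cellular interface, and the URL is a placeholder for whatever server the device talks to.

```shell
#!/bin/bash
# Command used to test reachability; override PROBE to try the logic offline.
PROBE=${PROBE:-curl --silent --fail --max-time 10 --output /dev/null}
SERVER_URL=${SERVER_URL:-https://hosted.mender.io}  # placeholder server URL
FAILOVER_IF=${FAILOVER_IF:-wwan0}                   # hypothetical interface name

# Print "default" if the server is reachable via the normal routing table,
# the failover interface name if it is only reachable when bound to that
# interface, or "none" (with a non-zero status) if both attempts fail.
pick_path() {
  if $PROBE "$SERVER_URL"; then
    echo default
  elif $PROBE --interface "$FAILOVER_IF" "$SERVER_URL"; then
    echo "$FAILOVER_IF"
  else
    echo none
    return 1
  fi
}

# Example: path=$(pick_path) && echo "server reachable via: $path"
```

One thing to check when trying this: as far as I know, binding a socket to a source IP address alone does not make the kernel leave through that interface unless source-based policy routing is configured, whereas binding to the device itself (which curl attempts when `--interface` is given an interface name) restricts the route lookup to that interface.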