Groups > linux.debian.bugs.dist > #1090341

Bug#1004139: ifenslave: Hotplugged interfaces sometimes not added to bond-master due to ifquery race condition

From	Jean-Paul Larocque <jpl-debian-bts@thoughtcrime.us>
Newsgroups	linux.debian.bugs.dist
Subject	Bug#1004139: ifenslave: Hotplugged interfaces sometimes not added to bond-master due to ifquery race condition
Date	2022-01-21 17:00 +0100
Message-ID	<DI4uu-8dV-5@gated-at.bofh.it> (permalink)
Organization	linux.* mail to news gateway

Show all headers | View raw

[Multipart message — attachments visible in raw view] - view raw

Package: ifenslave
Version: 2.13
Severity: normal

Hi,

I have a simple bonding configuration with two physical Ethernet
interfaces, both defined with the `allow-hotplug` option.  I use
allow-hotplug because the interfaces are on USB.  And since I'm
using allow-hotplug, I chose the style of configuration where I use
the `bond-master` configuration option under each slave configuration
stanza, rather than `bond-slaves` in the bonding master interface
stanza.

This configuration is similar to the one in
examples/two_hotplug_ethernet.  I'm attaching a copy of my
/etc/network/interfaces file.

The problem is that sometimes after a reboot, I find that one of my
interfaces (wlx0013efd01275: an external, wireless USB interface) is
not a member of the bond:

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: enxb827eb9e4634: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond0 state UP group default qlen 1000
    link/ether b8:27:eb:9e:46:34 brd ff:ff:ff:ff:ff:ff
3: wlx0013efd01275: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:13:ef:d0:12:75 brd ff:ff:ff:ff:ff:ff
    inet6 2002:ce3f:e590:2:213:efff:fed0:1275/64 scope global dynamic mngtmpaddr 
       valid_lft 86179sec preferred_lft 14179sec
    inet6 2002:ce3f:e590:1:213:efff:fed0:1275/64 scope global dynamic mngtmpaddr 
       valid_lft 86179sec preferred_lft 14179sec
    inet6 fe80::213:efff:fed0:1275/64 scope link 
       valid_lft forever preferred_lft forever
4: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether b8:27:eb:9e:46:34 brd ff:ff:ff:ff:ff:ff
    inet 192.168.66.19/21 brd 192.168.71.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 2002:ce3f:e590:1:1::19/64 scope global nodad 
       valid_lft forever preferred_lft forever
    inet6 fe80::ba27:ebff:fe9e:4634/64 scope link 
       valid_lft forever preferred_lft forever
$ cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v5.10.0-10-rpi

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: enxb827eb9e4634 (primary_reselect always)
Currently Active Slave: enxb827eb9e4634
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: enxb827eb9e4634
MII Status: up
Speed: 100 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: b8:27:eb:9e:46:34
Slave queue ID: 0
$ /sbin/ifquery --state
lo=lo
bond0=bond0
wlx0013efd01275=wlx0013efd01275
enxb827eb9e4634=enxb827eb9e4634

In this situation, this is logged to syslog (via systemd):

sh[299]: Failed to enslave wlx0013efd01275 to bond0.

I think I understand the cause, and propose a workaround or a bug fix
(depending on how you look at it).

When both devices are present at boot, two ifup@<iface>.service units
(one for each interface) are started simultaneously.  It seems like
the ifenslave ifupdown scripts are meant to handle this, but in
/etc/network/if-pre-up.d/ifenslave, in the definition of setup_slave_device(),
there is:

    # Ensure the master is up or being configured
    export IFENSLAVE_ENV_NAME="IFUPDOWN_$IF_BOND_MASTER"
    IFUPDOWN_IF_BOND_MASTER="$(printenv "$IFENSLAVE_ENV_NAME")"
    unset IFENSLAVE_ENV_NAME
    if [ -z "$IFUPDOWN_IF_BOND_MASTER" ] ; then
            ifquery --state "$IF_BOND_MASTER" >/dev/null 2>&1 || ifup "$IF_BOND_MASTER"
    fi

I've added loads of debugging and done many reboot cycles to find out
that when the problem occurs (or at least in one case), both
simultaneously-running processes get to the `ifquery` line.  One of
the processes executes ifquery and gets a non-zero return code,
leading it to run `ifup "$IF_BOND_MASTER"`.  After ifup starts, the
other script process executes ifquery and gets a zero return code.

In this case, the bond interface hasn't come up yet, but the command
to bring it up has started, which I think is why ifquery is returning
zero here.  I can reproduce this behavior of ifquery with a dummy
interface:

iface dummy0 inet manual
        pre-up modprobe dummy
        pre-up sleep 5
        up ip link add dummy0 type dummy
        down ip link del dummy0

$ q() { sudo ifquery --state dummy0; echo "  => $?"; }
$ q
  => 1
$ sudo ifup dummy0 & sleep 1; q; ip link show dummy0; ps $!
[1] 7956
dummy0=dummy0
  => 0
Device "dummy0" does not exist.
  PID TTY      STAT   TIME COMMAND
 7956 pts/1    S      0:00 sudo ifup dummy0

This shows that ifquery does return 0, even while ifup is still
working.

From my reading of ifquery(8), this is possibly a bug ("successful
code is returned if all of interfaces given as arguments are up").
Either ifquery is supposed to return failure in this case (meaning
ifquery has a bug, since the interface hasn't finished coming up yet),
or it is valid for it to return successfully since an ifup process has
been _started_ for this interface (in which case the ifenslave script
has a bug--but ifquery(8) could probably be improved).  Regardless of
whether ifquery has a bug here, the ifenslave script could be
simplified in a way that works around the issue anyway:

    # [omitted context]
    if [ -z "$IFUPDOWN_IF_BOND_MASTER" ] ; then
            ifup "$IF_BOND_MASTER"
    fi

Now, if ifup gets started concurrently for the same bond master
interface, the second process blocks (due to ifupdown locking) until
the first process completes.  Both processes running
if-pre-up.d/ifenslave would now continue after bond0 has finished its
configuration.

ifup is idempotent, so there's no harm in running it every time.  In
fact, with how the script is written now, there's a race condition
where ifquery could return failure, and in-between the return and the
ifup, the interface could come up.  If that happens, ifup gets run on
a bond interface in the up state.  Because of the idempotence of ifup,
there's no harm in this hypothetical race condition, but the script
already relies on it, so I'd suggest removing the ifquery condition
altogether.

The only downside I see to running `ifup` every time is that I now get
extra messages in my bootup scenario:

ifup[283]: ifup: waiting for lock on /run/network/ifstate.bond0
sh[311]: ifup: interface bond0 already configured

But if I understand correctly, those are informational, and not
indicating an error.

When I remove that ifquery condition, I no longer encounter the
problem, even after letting this machine reboot in a loop all night.
I'm attaching my modified version of the script.

Do you think this fix could be accepted?  If you think it'd be more
appropriate for me to file a bug against ifupdown regarding ifquery,
please let me know.

Thanks,
-Jean-Paul

-- System Information:
Debian Release: 11.2
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: armel (armv6l)

Kernel: Linux 5.10.0-10-rpi
Kernel taint flags: TAINT_CRAP, TAINT_UNSIGNED_MODULE
Locale: LANG=C, LC_CTYPE=C (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages ifenslave depends on:
ii  ifupdown  0.8.36
ii  iproute2  5.10.0-4

Versions of packages ifenslave recommends:
ii  net-tools  1.60+git20181103.0eebece-1

ifenslave suggests no packages.

-- no debconf information

Back to linux.debian.bugs.dist | Previous | Next | Find similar

Thread

Bug#1004139: ifenslave: Hotplugged interfaces sometimes not added to bond-master due to ifquery race condition Jean-Paul Larocque <jpl-debian-bts@thoughtcrime.us> - 2022-01-21 17:00 +0100

csiph-web