From: Steffen Nurpmeso
Newsgroups: gnu.bash.bug
Subject: Re: Async processes started in functions not reliably started
Date: Sun, 11 Aug 2019 00:50:44 +0200
To: Robert Elz
Cc: bug-bash@gnu.org
In-Reply-To: <20190807193402.d1ZQM%steffen@sdaoden.eu>

Hello and a nice Saturday evening, Mr.
Elz, and everyone.

While it is not a bash bug, and therefore quite off topic, i come
back to this once more.  Maybe it is of interest for someone.
And maybe someone can shed some light on this.  This would be nice.

Steffen Nurpmeso wrote in <20190807193402.d1ZQM%steffen@sdaoden.eu>:
 |Steffen Nurpmeso wrote in <20190806142527.9HS0i%steffen@sdaoden.eu>:
 ||Robert Elz wrote in <26245.1565045376@jinx.noi.kre.to>:
 |||    Date:        Mon, 05 Aug 2019 14:05:43 +0200
 |||    From:        Steffen Nurpmeso
 |||    Message-ID:  <20190805120543.Bf9-U%steffen@sdaoden.eu>
 | ..
 |||The shell cannot really know - your example was not functional until
 |||after it set up the traps.
 | ..
 |||No temp files, named pipes, or other similar stateful mechanisms needed.
 |
 |Sorry for all that noise once again, but i have then rewritten it
 |using mkfifo etc. with credits for some of you (which collects
 |things i have seen flying by since Saturday night):
 |
 |  They also came up with the solution: do not wait(1) on child
 |  processes until we know about their state, so that anytime before we
 |  actually do wait(1) we can safely kill(1) them (Jilles Tjoelker).
 |  Thus, let's create a FIFO (Chet Ramey) to get a synchronized
 |  device, strip the wild test undertaker to a core that only writes
 |  "timeout" to that FIFO, and also improve its startup-is-completed to
 |  simply send a signal to the parent process (Robert Elz).  So
 |  either the tests finish nicely, in which case they write their job
 |  number to the fifo, or we see "timeout" and kill all remains.
 ...

The problem is that it does not work out portably.  Maybe i am
getting something wrong, but i see failures on multi processor
OpenBSD 6.5/i386 and FreeBSD 11.3-RC2/i386 (in a Linux KVM/Qemu).
On these i see

  mx-test.sh[8467]: can't open t.fifo: Interrupted system call

quite frequently, even if there are no traps installed at all, and
data written to the FIFO is occasionally lost.
It is written in

  (
    trap '' HUP INT TERM EXIT
    if ${mkdir} t.${JOBS}.d; then
      ( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
    fi
    [ -e t.fifo ] && echo ${JOBS} >> t.fifo
  ) > t.${JOBS}.io 2>&1 [...]

  [...]
      echo >&2 JOB $JOBS WRITES FIFO
      echo ${JOBS} >> t.fifo
      sleep 1
    done
  ) > t.${JOBS}.io [...]&1 [...]

  [...]
    echo >&2 FROM FIFO I READ $js
    [ ${?} -ne 0 ] && continue
    JOBDESC=`${awk} -v L="${JOBDESC}" '
      BEGIN{
        while(1){
          sub("^[ ]+", "", L)
          sub("[ ]+$", "", L)
          if(length(L) == 0)
            break
          x = L
          sub("[ ]+.+$", "", x)
          y = z = x
          sub("^[0-9]+=[0-9]+/", "", z)
          sub("/.+$", "", y)
          x = y
          sub("=.+", "", x)
          sub(".+=", "", y)
          print x " " y " " z
          sub("^[^ ]+", "", L)
        }
      }
    ' | {
      l= kl=
      while read j p n; do
        if [ ${js} = timeout ]; then
          kl="${kl} ${j}/[${n}]"
          echo >&2 KILL ING $j=$p/$n
          kill -KILL ${p} >/dev/null 2>&1
          ${rm} -f t.${j}.result
        elif [ ${js} = ${j} ]; then
          echo >&2 USR1 ING $j=$p/$n
          kill -USR1 ${p} >/dev/null 2>&1
        else
          l="${l} ${j}=${p}/${n}"
        fi
      done
      if [ ${js} = timeout ] && [ -n "${kl}" ]; then
        printf >&2 '%s!! Timeout: reaped job(s)%s%s\n' \
          "${COLOR_ERR_ON}" "${kl}" "${COLOR_ERR_OFF}"
      fi
      echo ${l}
    }`
    [ ${js} = timeout ] && break
    # If all jobs finished regularly: done
    [ -z "${JOBDESC}" ] && break
  done

But, even then, see this:

  ...
  [1=X_Y_opt_input_go_stack] [2=X_errexit] [3=Y_errexit] [4=S_freeze]
  ..
  waiting
  JOB 3 WRITES FIFO
  FROM FIFO I READ 3
  USR1 ING 3=8203/Y_errexit
  JOB 4 WRITES FIFO
  JOB 2 WRITES FIFO
  FROM FIFO I READ 4
  USR1 ING 4=8210/S_freeze
  JOB 1 WRITES FIFO
  FROM FIFO I READ 1
  USR1 ING 1=8189/X_Y_opt_input_go_stack
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  ...mx-test.sh[8470]: can't open t.fifo: Interrupted system call
  FROM FIFO I READ timeout
  KILL ING 2=8195/X_errexit

So then i do

  (
    trap '' HUP INT TERM EXIT
    if ${mkdir} t.${JOBS}.d; then
      ( cd t.${JOBS}.d && eval t_${1} ${JOBS} ${1} )
    fi
    if [ -n "${JOBREAPER}" ]; then
      trap 'exit 0' USR1
      while [ 1 ]; do
        echo >&2 JOB $JOBS WRITES FIFO
        echo ${JOBS} >> t.fifo
        sleep 3
      done
    fi
  ) > t.${JOBS}.io [...]&1
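For readers who want to experiment with the scheme discussed in this thread (jobs write their number to a FIFO, a watchdog writes "timeout", the parent reads and kills what remains), here is a minimal, self-contained sketch. It is not the mx-test.sh code. The job numbers, the sleep times and the names `pending`, `finished` and `killed` are made up for illustration. It also differs in one deliberate point, which is an assumption and not something proposed in the thread: the parent opens the FIFO once, read-write, on descriptor 3, and every child writes to that inherited descriptor, so t.fifo is never opened again after startup. POSIX leaves opening a FIFO read-write undefined, so whether this is usable on a given system has to be checked there.

```shell
#!/bin/sh
# Simplified, hypothetical sketch of the FIFO scheme; not the mx-test.sh code.

tmpd=`mktemp -d` || exit 1      # assumes mktemp(1) is available
fifo="${tmpd}/t.fifo"
mkfifo "${fifo}" || exit 1

# Assumption: opening a FIFO read-write does not block.  POSIX leaves the
# result undefined, so treat this as system-specific behaviour.
exec 3<>"${fifo}"

pending=
# Job 1 reports at once, job 2 after one second; job 3 hangs, never reports.
( echo 1 >&3 ) &
pending="${pending} 1=$!"
( sleep 1; echo 2 >&3 ) &
pending="${pending} 2=$!"
sleep 30 &
pending="${pending} 3=$!"

# The watchdog only ever writes "timeout".
( sleep 3; echo timeout >&3 ) &

finished= killed=
while [ -n "${pending}" ]; do
   read -r js <&3 || break
   rest=
   for jp in ${pending}; do
      j=${jp%%=*} p=${jp#*=}       # split "job=pid"
      if [ "${js}" = timeout ]; then
         # Not waited for yet, so the pid still belongs to our child.
         kill -KILL "${p}" >/dev/null 2>&1
         killed="${killed} ${j}"
      elif [ "${js}" = "${j}" ]; then
         finished="${finished} ${j}"
      else
         rest="${rest} ${jp}"
      fi
   done
   pending=${rest}
done

exec 3>&-
wait                               # only now reap all children
rm -rf "${tmpd}"
echo "finished:${finished} killed:${killed}"
```

Each message is a short single line, well below PIPE_BUF, so concurrent writers should not interleave. Because descriptor 3 stays open in the parent for the whole run, nothing written to the FIFO can be dropped by a close between two reads. Whether this also avoids the "Interrupted system call" diagnostics shown above on OpenBSD and FreeBSD is untested here.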