Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Newsgroups: comp.os.linux.development.apps
Subject: Re: Linux O_NONBLOCK bug/ quirk
Date: Sun, 30 Mar 2014 19:42:01 +0100
Lines: 126
Message-ID: <87ha6fr0jq.fsf@sable.mobileactivedefense.com>
References: <878urvu0gx.fsf@sable.mobileactivedefense.com> <lh4vnk$bdu$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
X-Trace: individual.net lmUqxYBkrj+KkiwN3adabQOGjYmPnrUMEKZBBdXKhmoIgNKi4=
Cancel-Lock: sha1:4ANrx068cvi0nAxenvUOTxphLm4= sha1:cyGwFuctYZbbAg39PuGZ0ZHUg/0=
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Xref: csiph.com comp.os.linux.development.apps:674

Lusotec <nomail@nomail.not> writes:
> Rainer Weikusat wrote:
>> As part of one of the usual 'pleasant exchanges' with the people whose
>> ability to make a living depends on controlling access to the Linux code
>> base,
>
> That's nonsense!
>
>> it came to light that a receive operation on a socket in non-blocking mode 
>> can actually be blocked forever on Linux, example code:
>>
>> ---------
>> #include <fcntl.h>
>> #include <string.h>
>> #include <sys/socket.h>
>> #include <sys/un.h>
>> #include <unistd.h>
>> 
>> int main(void)
>> {
>>     struct sockaddr_un sun;
>>     int fd;
>> 
>>     fd = socket(AF_UNIX, SOCK_DGRAM, 0);
>>     sun.sun_family = AF_UNIX;
>>     strncpy(sun.sun_path, "/tmp/bla", sizeof(sun.sun_path));
>>     bind(fd, (struct sockaddr *)&sun, sizeof(sun));
>> 
>>     if (fork() == 0) read(fd, &fd, sizeof(fd));
>> 
>>     sleep(1);
>> 
>>     fcntl(fd, F_SETFL, O_NONBLOCK);
>>     read(fd, &fd, sizeof(fd));
>> 
>>     return 0;
>> }
>> --------
>> 
>> Killing the forked process results in the other aborting the read call
>> with EAGAIN, as can be determined with strace.
>> 
>> I don't think this is of much practical relevance but it is something
>> worth knowing about.
>
> In the above code, both child and parent processes are reading from the same 
> file descriptor.
>
> Reads from a file descriptor are queued and served in a FIFO fashion. This
> is true for blocking and non-blocking reads. Even non-blocking reads still 
> have to wait for any previous reads to complete, even if they are going to 
> just return EAGAIN.

This is not true: In the given case, there's a single mutex in the
recv-function and all readers except the first will block on this mutex
and will afterwards be served in whatever order they actually acquire
the mutex. This may turn out to be FIFO but may well be different, eg,
based on priorities.

> The issue with your code is that the file descriptor is set to non-blocking 
> while the first read, a blocking read, is active. When a second read, this 
> will be non-blocking, is made the first read is still blocking and thus the 
> second non-blocking read has to wait for the first to finish.

Yes. Because the first read has acquired the mutex and is a blocking
read, all subsequent reads are effectively blocked until a message is
received on this socket, regardless of whether they were supposed to be
non-blocking or not. But the definition of "non-blocking read" is "it
won't wait indefinitely until a message is received".

> Currently, in Linux fcntl affects future operations but not previous or 
> current operations. As such, if a read is blocking a file descriptor, future 
> reads, even if non-blocking will have to wait for the current blocking read 
> to complete.
>
> Now, for the code to work as you expect it (or at least as I understood your 
> expectation), an fcntl must affect already running operations. I think this
> is very problematic.

Aborting the blocking read with EAGAIN would indeed be wrong since it is
supposed to block until data is received (or it is interrupted by a
signal). But the second read isn't: It is supposed to return immediately
with either a message or an EAGAIN error. As I wrote in the other
posting: While I'd rate the code above as 'contrived example showing a
theoretic problem' the issue is different for the 'recv with
MSG_DONTWAIT' case: Since 'blocking' or 'non-blocking' semantics can be
demanded with a 'granularity' of individual recv calls, it is perfectly
reasonable to expect that the blocking ones will potentially block and
the non-blocking ones won't. As it stands, the actual behaviour of an
individual call with MSG_DONTWAIT set is effectively unpredictable except
if it is certain that only one thread of execution tries receive
operations on the socket or if only non-blocking receives are
attempted. This could be documented as the usual 'in case ..., the
behaviour is undefined', the usual fig leaf for "the implementation
doesn't handle this case sensibly", but it isn't.
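
For reference, a per-call non-blocking receive could be wrapped as
shown below. This is a minimal sketch; recv_nowait is a hypothetical
helper name, not an existing API. It leaves the O_NONBLOCK flag of
the descriptor alone and asks for non-blocking behaviour for this one
call only. With a single reader it behaves predictably; the point
made above is that this cannot be relied on once another thread of
execution sits in a blocking receive on the same socket.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: one receive attempt with MSG_DONTWAIT.
 * Returns the message length, or -1 with errno set; errno is
 * EAGAIN or EWOULDBLOCK if no message was queued. */
ssize_t recv_nowait(int fd, void *buf, size_t len)
{
    ssize_t n;

    /* Retry only if a signal interrupted the call. */
    do {
        n = recv(fd, buf, len, MSG_DONTWAIT);
    } while (n == -1 && errno == EINTR);

    return n;
}
```

A caller would treat -1 with EAGAIN (or EWOULDBLOCK) as 'nothing
there right now' and any non-negative value as the size of the
datagram which was received.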

This is also specifically a 'feature' of the AF_UNIX socket
implementation (and reportedly, AF_INET, too). Other 'things' capable of
supporting non-blocking I/O, eg, pipes (tested) behave as expected[*]: The
blocking call blocks, the other doesn't. The reason for this is that
the pipe-mutex is released prior to blocking a blocking reader,
something which cannot 'easily' be done for AF_UNIX sockets because the
lock exists in the 'AF_UNIX layer' and the blocking wait is done with a
general 'datagram socket function' blissfully unaware of that.
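
The 'release the lock before sleeping' idea can be illustrated with a
rough user-space analogy. This is a sketch only and not kernel code:
struct mailbox and the mailbox_* functions are names made up for this
illustration, and the mailbox holds a single int. The blocking
receiver sleeps in pthread_cond_wait(), which gives up the mutex
while waiting, so a non-blocking receiver can take the mutex, find
nothing and return -EAGAIN at once.

```c
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical single-slot mailbox protected by one mutex. */
struct mailbox {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             has_msg;
    int             msg;
};

#define MAILBOX_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 }

/* Blocking receive: pthread_cond_wait() drops the mutex while the
 * caller sleeps and reacquires it before returning. */
int mailbox_recv(struct mailbox *mb, int *out)
{
    pthread_mutex_lock(&mb->lock);
    while (!mb->has_msg)
        pthread_cond_wait(&mb->nonempty, &mb->lock);
    *out = mb->msg;
    mb->has_msg = 0;
    pthread_mutex_unlock(&mb->lock);
    return 0;
}

/* Non-blocking receive: holds the mutex only long enough to look. */
int mailbox_tryrecv(struct mailbox *mb, int *out)
{
    int rc = -EAGAIN;

    pthread_mutex_lock(&mb->lock);
    if (mb->has_msg) {
        *out = mb->msg;
        mb->has_msg = 0;
        rc = 0;
    }
    pthread_mutex_unlock(&mb->lock);
    return rc;
}

/* Store a message (overwriting any unread one) and wake a sleeper. */
void mailbox_send(struct mailbox *mb, int msg)
{
    pthread_mutex_lock(&mb->lock);
    mb->msg = msg;
    mb->has_msg = 1;
    pthread_cond_signal(&mb->nonempty);
    pthread_mutex_unlock(&mb->lock);
}

/* Thread entry point for trying the model out: blocks in
 * mailbox_recv() and hands the value back as the exit value. */
void *mailbox_reader(void *arg)
{
    int value = -1;

    mailbox_recv(arg, &value);
    return (void *)(intptr_t)value;
}
```

Starting mailbox_reader in a thread and then calling
mailbox_tryrecv() from another one returns -EAGAIN without waiting
for the sleeping reader. If the blocking receiver instead kept the
mutex while waiting for data, the 'try' call would be stuck behind
it, which is the analogue of the AF_UNIX behaviour described above.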

[*] The pipe_read implementation in pipe.c (3.2.54) actually contains
the following comment:

        if (!pipe->waiting_writers) {
                /* syscall merging: Usually we must not sleep
                 * if O_NONBLOCK is set, or if we got some data.
                 * But if a writer sleeps in kernel space, then
                 * we can wait for that data without violating POSIX.

The kernel seems to disagree with itself on that (or, more likely, the
guy who wrote the pipe-code was a little more far-thinking [or
experienced] than the guy who wrote the AF_UNIX code and thus, didn't
have to have the issue pointed out to him in a 'politically unwelcome
way', namely, by me).

> What are you trying to do by reading from the same socket in two processes, 
> especially when you change the file descriptor status in the middle of the 
> operations? Both are very unusual.

In this case, nothing. The code was supposed to demonstrate a property
of the implementation I'd consider to be not in line with the documented
behaviour of said implementation.