Inconsistent line counts from 3 methods

Started by	DFS <nospam@dfs.com>
First post	2020-10-10 22:37 -0400
Last post	2020-10-20 15:48 +0100
Articles	20 on this page of 47 — 14 participants

  Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-10 22:37 -0400
    Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-10 22:06 -0700
      Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 10:38 -0400
        Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-11 15:36 +0000
          Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 13:51 -0400
            Re: Inconsistent line counts from 3 methods Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 18:33 +0000
              Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:20 -0400
                Re: Inconsistent line counts from 3 methods Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 19:40 +0000
                  Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:47 -0400
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 16:35 -0400
                    Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods) Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 21:13 +0000
                      Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods) DFS <nospam@dfs.com> - 2020-10-11 18:45 -0400
                        Re: NNTP message requirements Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 17:11 -0700
                Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 16:27 -0400
                  Re: Inconsistent line counts from 3 methods Ben Bacarisse <ben.usenet@bsb.me.uk> - 2020-10-11 23:30 +0100
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 23:56 -0400
            Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 14:53 -0400
              Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:15 -0400
              Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-14 20:08 +0000
                Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-14 16:58 -0400
                  Re: Inconsistent line counts from 3 methods Eli the Bearded <*@eli.users.panix.com> - 2020-10-14 23:37 +0000
                    Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-14 17:25 -0700
                      Re: Inconsistent line counts from 3 methods Eli the Bearded <*@eli.users.panix.com> - 2020-10-15 01:55 +0000
                    Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 19:19 +0000
                  Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 19:10 +0000
                    Re: Inconsistent line counts from 3 methods Kaz Kylheku <793-849-0957@kylheku.com> - 2020-10-17 19:36 +0000
            Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-14 20:16 +0000
          Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-11 11:36 -0700
            Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 15:12 -0400
      Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 12:16 -0400
    Re: Inconsistent line counts from 3 methods Johann Klammer <klammerj@NOSPAM.a1.net> - 2020-10-11 15:18 +0200
      Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-11 14:31 +0000
      Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-11 11:31 -0700
      Re: Inconsistent line counts from 3 methods Ben Bacarisse <ben.usenet@bsb.me.uk> - 2020-10-11 23:15 +0100
    Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 14:00 -0700
      Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 17:47 -0400
        Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 17:26 -0700
          Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-12 13:11 -0400
            Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-12 10:56 -0700
              Re: Inconsistent line counts from 3 methods Tim Rentsch <tr.17687@z991.linuxsc.com> - 2020-11-29 00:21 -0800
            Re: Inconsistent line counts from 3 methods scott@slp53.sl.home (Scott Lurndal) - 2020-10-12 19:19 +0000
              Re: Inconsistent line counts from 3 methods dfs <nospam@dfs.com> - 2020-10-12 18:53 -0400
                Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 23:09 +0000
                  Re: Inconsistent line counts from 3 methods Bart <bc@freeuk.com> - 2020-10-18 00:24 +0100
                    Re: Inconsistent line counts from 3 methods Kaz Kylheku <793-849-0957@kylheku.com> - 2020-10-18 16:56 +0000
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-20 09:17 -0400
                      Re: Inconsistent line counts from 3 methods Bart <bc@freeuk.com> - 2020-10-20 15:48 +0100

#155503 — Inconsistent line counts from 3 methods

From	DFS <nospam@dfs.com>
Date	2020-10-10 22:37 -0400
Subject	Inconsistent line counts from 3 methods
Message-ID	<WhugH.334023$Av7.244451@fx34.iad>

$ countlines war_peace.txt
fread-var: 66875  off by 1183
fgetc    : 65692  correct
fgets    : 65692  correct


$ countlines bible.txt
fread-var: 31255  off by 153
fgetc    : 31101  off by 1
fgets    : 31102  correct


======================================================
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
  				
  char line[600] = "";
  char c;

  // use fread to populate a variable
  // open file, go to end, get size, allocate memory, back to
  // beginning, read contents into variable
  FILE *fin = fopen(argv[1],"r");
  fseek(fin, 0, SEEK_END);
  int buffer = ftell(fin);	
  char *myStr = malloc(sizeof(char) * (buffer + 1));
  rewind(fin);
  fread(myStr, sizeof(char), buffer, fin);

  //count newlines in variable
  int lines = 0;
  for (int i = 0; i < buffer; i++) {if(myStr[i]=='\n') {lines++;}}
  printf("fread-var: %d\n",lines);
  free(myStr);
  	
  //count newline from file with getc	
  lines = 0;
  rewind(fin);
  for (c=getc(fin);c!=EOF;c=getc(fin)) {if(c=='\n') {lines++;}}
  printf("fgetc    : %d\n",lines);
  	

  //count lines from file with fgets
  lines = 0;
  rewind(fin);
  while (fgets(line,sizeof line, fin)!= NULL) {lines++;}
  printf("fgets    : %d\n",lines);

  fclose(fin);
  return(0);
}
======================================================

See anything wrong with the fread-var section?  It consistently 
overcounts lines, especially on bigger files.

#155504

From	Barry Schwarz <schwarzb@delq.com>
Date	2020-10-10 22:06 -0700
Message-ID	<qt35ofl3ced5r1kvtoaui0h2omse848j43@4ax.com>
In reply to	#155503

On Sat, 10 Oct 2020 22:37:35 -0400, DFS <nospam@dfs.com> wrote:

>$ countlines war_peace.txt
>fread-var: 66875  off by 1183
>fgetc    : 65692  correct
>fgets    : 65692  correct
>
>
>$ countlines bible.txt
>fread-var: 31255  off by 153
>fgetc    : 31101  off by 1
>fgets    : 31102  correct
>
>
>======================================================
>#include <stdio.h>
>#include <stdlib.h>
>
>int main(int argc, char *argv[])
>{
>  				
>  char line[600] = "";
>  char c;
>
>  // use fread to populate a variable
>  // open file, go to end, get size, allocate memory, back to
>  // beginning, read contents into variable
>  FILE *fin = fopen(argv[1],"r");
>  fseek(fin, 0, SEEK_END);
>  int buffer = ftell(fin);	
>  char *myStr = malloc(sizeof(char) * (buffer + 1));
>  rewind(fin);
>  fread(myStr, sizeof(char), buffer, fin);
>
>  //count newlines in variable
>  int lines = 0;
>  for (int i = 0; i < buffer; i++) {if(myStr[i]=='\n') {lines++;}}
>  printf("fread-var: %d\n",lines);
>  free(myStr);
>  	
>  //count newline from file with getc	
>  lines = 0;
>  rewind(fin);
>  for (c=getc(fin);c!=EOF;c=getc(fin)) {if(c=='\n') {lines++;}}
>  printf("fgetc    : %d\n",lines);
>  	
>
>  //count lines from file with fgets
>  lines = 0;
>  rewind(fin);
>  while (fgets(line,sizeof line, fin)!= NULL) {lines++;}
>  printf("fgets    : %d\n",lines);
>
>  fclose(fin);
>  return(0);
>}
>======================================================
>
>See anything wrong with the fread-var section?  It consistently 
>overcounts lines, especially on bigger files.

Look at the description of ftell in the standard, particularly as it
relates to text files.

"The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary stream,
the value is the number of characters from the beginning of the file.
For a text stream, its file position indicator contains unspecified
information, usable by the fseek function for returning the file
position indicator for the stream to its position at the time of the
ftell call; the difference between two such return values is not
necessarily a meaningful measure of the number of characters written
or read."

Using ftell to determine buffer size probably results in an overly
large buffer.  You should use the return value from fread to determine
how much data was actually read.  You are probably examining residual
data in the buffer that is not part of the file.

Alternatively, you could open the file in binary mode instead of text
mode.  ftell should work then.
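
A minimal sketch of the first suggestion, offered as an illustration rather than as the original poster's eventual fix. It avoids ftell altogether: the stream is read in fixed-size chunks, and only the bytes that fread reports as actually read are examined, so no residual buffer contents can be counted. The function name count_newlines and the 4096-byte chunk size are assumptions made for this example.

```c
#include <stdio.h>

/* Count '\n' bytes in an already-open stream.
   Hypothetical helper: the name and the chunk size are illustrative. */
long count_newlines(FILE *fp)
{
    char buf[4096];
    size_t got;
    long lines = 0;

    /* fread returns the number of elements actually read; trust that
       value, not a size computed beforehand with ftell. */
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0) {
        for (size_t i = 0; i < got; i++) {
            if (buf[i] == '\n') {
                lines++;
            }
        }
    }
    return lines;
}
```

A caller would open the file (for example with fopen(argv[1], "rb")), check the result for NULL, pass the stream to count_newlines, and close it afterwards.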

You never call fgetc so I do not understand why you name the output
with that function.

Are you absolutely certain that bible.txt has a \n at the end of the
very last line?  Use a hex editor to make sure.

-- 
Remove del for email

#155515

From	DFS <nospam@dfs.com>
Date	2020-10-11 10:38 -0400
Message-ID	<eREgH.343567$I15.298775@fx36.iad>
In reply to	#155504

On 10/11/2020 1:06 AM, Barry Schwarz wrote:
> On Sat, 10 Oct 2020 22:37:35 -0400, DFS <nospam@dfs.com> wrote:


>> See anything wrong with the fread-var section?  It consistently
>> overcounts lines, especially on bigger files.
> 
> Look at the description of ftell in the standard, particularly as it
> relates to text files.
> 
> "The ftell function obtains the current value of the file position
> indicator for the stream pointed to by stream. For a binary stream,
> the value is the number of characters from the beginning of the file.
> For a text stream, its file position indicator contains unspecified
> information, usable by the fseek function for returning the file
> position indicator for the stream to its position at the time of the
> ftell call; the difference between two such return values is not
> necessarily a meaningful measure of the number of characters written
> or read."
> 
> Using ftell to determine buffer size probably results in an overly
> large buffer.  You should use the return value from fread to determine
> how much data was actually read.  You are probably examining residual
> data in the buffer that is not part of the file.
> 
> Alternately, you could open the file in binary mode instead of text
> mode.  ftell should work then.

Thanks.  Both your suggestions worked.  I'll use the 'open in binary 
mode' option.


> You never call fgetc so I do not understand why you name the output
> with that function.

typo


> Are you absolutely certain that bible.txt has a \n at the end of the
> very last line?  Use a hex editor to make sure.

It didn't, as shown in Notepad++ | View | Symbol | Show End of Line

I see fgets() is more "reliable" than f/getc in case your final line is 
missing a newline (which I would bet happens frequently):

"fgets() stops when either (n-1) characters are read, the newline 
character is read, or the end-of-file is reached, whichever comes first."
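
One caveat about the fgets loop in the original program: it counts successful fgets calls, not lines, so any line longer than the buffer can hold would be counted more than once. The sketch below is one way to guard against that, under the assumption that the input contains no embedded '\0' characters (strlen would stop early on them). The name count_lines_fgets and the deliberately small 64-byte buffer are illustrative choices, not part of the original code.

```c
#include <stdio.h>
#include <string.h>

/* Count lines with fgets, treating a long line split across several
   fgets calls as one line, and counting an unterminated final line.
   Hypothetical helper; assumes no embedded '\0' characters. */
long count_lines_fgets(FILE *fp)
{
    char chunk[64];
    long lines = 0;
    int in_line = 0;   /* nonzero while text has been seen with no '\n' yet */

    while (fgets(chunk, sizeof chunk, fp) != NULL) {
        size_t len = strlen(chunk);
        if (len > 0 && chunk[len - 1] == '\n') {
            lines++;          /* this chunk completes a line */
            in_line = 0;
        } else {
            in_line = 1;      /* buffer filled up, or EOF without '\n' */
        }
    }
    if (in_line) {
        lines++;              /* final line had no terminating newline */
    }
    return lines;
}
```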

#155516

From	Jorgen Grahn <grahn+nntp@snipabacken.se>
Date	2020-10-11 15:36 +0000
Message-ID	<slrnro69ji.1hpq.grahn+nntp@frailea.sa.invalid>
In reply to	#155515

On Sun, 2020-10-11, DFS wrote:
...
> I see fgets() is more "reliable" than f/getc in case your final line is 
> missing a newline (which I would bet happens frequently):

In the past, on Unix, it used to happen very infrequently, but it
seems recent IDEs generate "endless" text files by default.

I have yet to figure out why they do this.  The only effect is a
saving of one byte, and that a lot of traditional tools break in subtle
ways[1] ... but a conspiracy against Unix seems unlikely.

> "fgets() stops when either (n-1) characters are read, the newline 
> character is read, or the end-of-file is reached, whichever comes first."

What do you mean?  All the functions you list give you all information
available; they all seem reliable.

(fgets() can't handle '\0' characters, but that's a separate thing.)

/Jorgen

[1] cat foo bar

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .

#155521

From	DFS <nospam@dfs.com>
Date	2020-10-11 13:51 -0400
Message-ID	<2FHgH.225012$d95.16295@fx06.iad>
In reply to	#155516

On 10/11/2020 11:36 AM, Jorgen Grahn wrote:
> On Sun, 2020-10-11, DFS wrote:
> ...
>> I see fgets() is more "reliable" than f/getc in case your final line is
>> missing a newline (which I would bet happens frequently):
> 
> In the past, on Unix, it used to happen very infrequently, but it
> seems recent IDEs generate "endless" text files by default.
> 
> I have yet to figure out why they do this.  The only effect is a
> saving of one byte, and that lot of traditional tools break in subtle
> ways[1] ... but a conspiracy against Unix seems unlikely.


I'm not talking about IDEs.  I'm talking about writing in general.  I 
would guess many people write their last line and end it with a period, 
not a period and return.

i.e., the bible.txt file I used doesn't have a terminating \n
http://www.truth.info/bigfiles/bible.txt.zip



>> "fgets() stops when either (n-1) characters are read, the newline
>> character is read, or the end-of-file is reached, whichever comes first."
> 
> What do you mean?  All the functions you list give you all information
> available; they all seem reliable.

I mean "reliable" in the sense that if you forget a terminating \n fgets 
will still count the last line, whereas using f/getc you would usually 
undercount the number of lines if there's no terminating \n.
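
The difference described above can be made explicit in a getc loop by remembering the last character read. This is only a sketch: the function name count_lines_getc and the count_partial flag are invented for the example. Note that c is declared int, because getc returns an int so that EOF can be told apart from every valid character value.

```c
#include <stdio.h>

/* Count newline characters with getc. If count_partial is nonzero, a
   non-empty final line without a terminating '\n' is counted as well.
   Hypothetical helper for illustration. */
long count_lines_getc(FILE *fp, int count_partial)
{
    int c;             /* int, not char: must be able to hold EOF */
    int last = '\n';   /* so that an empty stream yields 0 */
    long lines = 0;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n') {
            lines++;
        }
        last = c;
    }
    if (count_partial && last != '\n') {
        lines++;       /* trailing text with no newline after it */
    }
    return lines;
}
```

With count_partial set to 0 the function counts only newline-terminated lines; with it set to 1 it also counts trailing text that lacks a final newline, which is what the fgets loop does.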


> (fgets() can't handle '\0' characters, but that's a separate thing.)
> 
> /Jorgen
> 
> [1] cat foo bar
>

#155523

From	Lew Pitcher <lew.pitcher@digitalfreehold.ca>
Date	2020-10-11 18:33 +0000
Message-ID	<rlvj4v$bdi$1@dont-email.me>
In reply to	#155521

On Sun, 11 Oct 2020 13:51:10 -0400, DFS wrote:

> On 10/11/2020 11:36 AM, Jorgen Grahn wrote:
>> On Sun, 2020-10-11, DFS wrote:
>> ...
>>> I see fgets() is more "reliable" than f/getc in case your final line
>>> is missing a newline (which I would bet happens frequently):
>> 
>> In the past, on Unix, it used to happen very infrequently, but it seems
>> recent IDEs generate "endless" text files by default.
>> 
>> I have yet to figure out why they do this.  The only effect is a saving
>> of one byte, and that lot of traditional tools break in subtle ways[1]
>> ... but a conspiracy against Unix seems unlikely.
> 
> 
> I'm not talking about IDEs.  I'm talking about writing in general.  I
> would guess many people write their last line and end it with a period,
> not a period and return.

Maybe so.

But, by definition (both the "C" definition (C11 7.21.2 pgph 2), and the 
Unix common definition) that last block of characters would not make up a 
"line". 

> ie, the bible.txt file I used doesn't have a terminating \n
> http://www.truth.info/bigfiles/bible.txt.zip
> 
> 
> 
>>> "fgets() stops when either (n-1) characters are read, the newline
>>> character is read, or the end-of-file is reached, whichever comes
>>> first."
>> 
>> What do you mean?  All the functions you list give you all information
>> available; they all seem reliable.
> 
> I mean "reliable" in the sense that if you forget a terminating \n fgets
> will still count the last line, whereas using f/getc you would usually
> undercount the number of lines if there's no terminating \n.

No, you wouldn't. A line is /specifically/ a sequence of characters 
followed by a newline. 
> 
> 
>> (fgets() can't handle '\0' characters, but that's a separate thing.)
>> 
>> /Jorgen
>> 
>> [1] cat foo bar
>>




-- 
Lew Pitcher
"In Skills, We Trust"

#155529

From	DFS <nospam@dfs.com>
Date	2020-10-11 15:20 -0400
Message-ID	<fZIgH.17219$je1.6227@fx22.iad>
In reply to	#155523

On 10/11/2020 2:33 PM, Lew Pitcher wrote:
> On Sun, 11 Oct 2020 13:51:10 -0400, DFS wrote:
> 
>> On 10/11/2020 11:36 AM, Jorgen Grahn wrote:
>>> On Sun, 2020-10-11, DFS wrote:
>>> ...
>>>> I see fgets() is more "reliable" than f/getc in case your final line
>>>> is missing a newline (which I would bet happens frequently):
>>>
>>> In the past, on Unix, it used to happen very infrequently, but it seems
>>> recent IDEs generate "endless" text files by default.
>>>
>>> I have yet to figure out why they do this.  The only effect is a saving
>>> of one byte, and that lot of traditional tools break in subtle ways[1]
>>> ... but a conspiracy against Unix seems unlikely.
>>
>>
>> I'm not talking about IDEs.  I'm talking about writing in general.  I
>> would guess many people write their last line and end it with a period,
>> not a period and return.
> 
> Maybe so.
> 
> But, by definition (both the "C" definition (C11 7.21.2 pgph 2), and the
> Unix common definition) that last block of characters would not make up a
> "line".


"Whether the last line requires a terminating new-line character is 
implementation-defined."




>> ie, the bible.txt file I used doesn't have a terminating \n
>> http://www.truth.info/bigfiles/bible.txt.zip
>>
>>
>>
>>>> "fgets() stops when either (n-1) characters are read, the newline
>>>> character is read, or the end-of-file is reached, whichever comes
>>>> first."
>>>
>>> What do you mean?  All the functions you list give you all information
>>> available; they all seem reliable.
>>
>> I mean "reliable" in the sense that if you forget a terminating \n fgets
>> will still count the last line, whereas using f/getc you would usually
>> undercount the number of lines if there's no terminating \n.
> 
> No, you wouldn't. A line is /specifically/ a sequence of characters
> followed by a newline.

sez who?

By that bogus definition:

This is only
1 line of text.

#155531

From	Lew Pitcher <lew.pitcher@digitalfreehold.ca>
Date	2020-10-11 19:40 +0000
Message-ID	<rlvn2i$bdi$2@dont-email.me>
In reply to	#155529

On Sun, 11 Oct 2020 15:20:11 -0400, DFS wrote:

> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
>> On Sun, 11 Oct 2020 13:51:10 -0400, DFS wrote:
>> 
>>> On 10/11/2020 11:36 AM, Jorgen Grahn wrote:
>>>> On Sun, 2020-10-11, DFS wrote:
>>>> ...
>>>>> I see fgets() is more "reliable" than f/getc in case your final line
>>>>> is missing a newline (which I would bet happens frequently):
>>>>
>>>> In the past, on Unix, it used to happen very infrequently, but it
>>>> seems recent IDEs generate "endless" text files by default.
>>>>
>>>> I have yet to figure out why they do this.  The only effect is a
>>>> saving of one byte, and that lot of traditional tools break in subtle
>>>> ways[1] ... but a conspiracy against Unix seems unlikely.
>>>
>>>
>>> I'm not talking about IDEs.  I'm talking about writing in general.  I
>>> would guess many people write their last line and end it with a
>>> period, not a period and return.
>> 
>> Maybe so.
>> 
>> But, by definition (both the "C" definition (C11 7.21.2 pgph 2), and
>> the Unix common definition) that last block of characters would not
>> make up a "line".
> 
> 
> "Whether the last line requires a terminating new-line character is
> implementation-defined."

So, either your implementation requires a terminating new-line character to count the last bit of a file as a line or it doesn't.

If it /does/, then any trailing data is not a line.
If it doesn't, then there is no trailing data.

ISTM that, when writing code for an implementation for which you have not established whether or not "lines" require a terminating new-line, you should assume that the implementation /does/ require a new-line. The resulting code will handle either state.


>>> ie, the bible.txt file I used doesn't have a terminating \n
>>> http://www.truth.info/bigfiles/bible.txt.zip
>>>
>>>
>>>
>>>>> "fgets() stops when either (n-1) characters are read, the newline
>>>>> character is read, or the end-of-file is reached, whichever comes
>>>>> first."
>>>>
>>>> What do you mean?  All the functions you list give you all
>>>> information available; they all seem reliable.
>>>
>>> I mean "reliable" in the sense that if you forget a terminating \n
>>> fgets will still count the last line, whereas using f/getc you would
>>> usually undercount the number of lines if there's no terminating \n.
>> 
>> No, you wouldn't. A line is /specifically/ a sequence of characters
>> followed by a newline.
> 
> sez who?
> 
> By that bogus definition:
> 
> This is only 1
> line of text.

Assuming that the post went to EOF immediately after the ".", then yes; that was only 1 line of text, followed by a block of text that cannot be called a "line".

OTOH, in reality, you /posted/ two lines there, both terminated with newlines.

  15:38 $ hexdump -C fZIgH.17219\$je1.6227\@fx22.iad.msg | tail -10
  00000bb0  69 6e 65 20 69 73 20 2f  73 70 65 63 69 66 69 63  |ine is /specific|
  00000bc0  61 6c 6c 79 2f 20 61 20  73 65 71 75 65 6e 63 65  |ally/ a sequence|
  00000bd0  20 6f 66 20 63 68 61 72  61 63 74 65 72 73 0a 3e  | of characters.>|
  00000be0  20 66 6f 6c 6c 6f 77 65  64 20 62 79 20 61 20 6e  | followed by a n|
  00000bf0  65 77 6c 69 6e 65 2e 0a  0a 73 65 7a 20 77 68 6f  |ewline...sez who|
  00000c00  3f 0a 0a 42 79 20 74 68  61 74 20 62 6f 67 75 73  |?..By that bogus|
  00000c10  20 64 65 66 69 6e 69 74  69 6f 6e 3a 0a 0a 54 68  | definition:..Th|
  00000c20  69 73 20 69 73 20 6f 6e  6c 79 0a 31 20 6c 69 6e  |is is only.1 lin|
  00000c30  65 20 6f 66 20 74 65 78  74 2e 0a                 |e of text..|
  00000c3b

Note the newline (0x0a) at offset 0c3a.


-- 
Lew Pitcher
"In Skills, We Trust"

#155532

From	DFS <nospam@dfs.com>
Date	2020-10-11 15:47 -0400
Message-ID	<XnJgH.109602$nI.52951@fx21.iad>
In reply to	#155531

On 10/11/2020 3:40 PM, Lew Pitcher wrote:
> On Sun, 11 Oct 2020 15:20:11 -0400, DFS wrote:
> 
>> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
>>> On Sun, 11 Oct 2020 13:51:10 -0400, DFS wrote:
>>>
>>>> On 10/11/2020 11:36 AM, Jorgen Grahn wrote:
>>>>> On Sun, 2020-10-11, DFS wrote:
>>>>> ...
>>>>>> I see fgets() is more "reliable" than f/getc in case your final line
>>>>>> is missing a newline (which I would bet happens frequently):
>>>>>
>>>>> In the past, on Unix, it used to happen very infrequently, but it
>>>>> seems recent IDEs generate "endless" text files by default.
>>>>>
>>>>> I have yet to figure out why they do this.  The only effect is a
>>>>> saving of one byte, and that lot of traditional tools break in subtle
>>>>> ways[1] ... but a conspiracy against Unix seems unlikely.
>>>>
>>>>
>>>> I'm not talking about IDEs.  I'm talking about writing in general.  I
>>>> would guess many people write their last line and end it with a
>>>> period, not a period and return.
>>>
>>> Maybe so.
>>>
>>> But, by definition (both the "C" definition (C11 7.21.2 pgph 2), and
>>> the Unix common definition) that last block of characters would not
>>> make up a "line".
>>
>>
>> "Whether the last line requires a terminating new-line character is
>> implementation-defined."
> 
> So, either your implementation requires a terminating new-line character to count the last bit of a file as a line or it doesn't.
> 
> If it /does/, then any trailing data is not a line.
> If it doesn't, then there is no trailing data.
> 
> ISTM that, when writing code for an implementation for which you have not established whether or not "lines" require a terminating new-line, you should assume that the implementation /does/ require a new-line. The resulting code will handle either state.
> 
> 
>>>> ie, the bible.txt file I used doesn't have a terminating \n
>>>> http://www.truth.info/bigfiles/bible.txt.zip
>>>>
>>>>
>>>>
>>>>>> "fgets() stops when either (n-1) characters are read, the newline
>>>>>> character is read, or the end-of-file is reached, whichever comes
>>>>>> first."
>>>>>
>>>>> What do you mean?  All the functions you list give you all
>>>>> information available; they all seem reliable.
>>>>
>>>> I mean "reliable" in the sense that if you forget a terminating \n
>>>> fgets will still count the last line, whereas using f/getc you would
>>>> usually undercount the number of lines if there's no terminating \n.
>>>
>>> No, you wouldn't. A line is /specifically/ a sequence of characters
>>> followed by a newline.
>>
>> sez who?
>>
>> By that bogus definition:
>>
>> This is only 1
>> line of text.
> Assuming that the post went to EOF immediately after the ".", then yes; that was only 1 line of text, followed by a block of text that cannot be called a "line".


It can and must be called a line, whether it terminated with \n or not.




> OTOH, in reality, You /posted/ two lines there, both terminated with newlines.

In reality I posted two lines, only one terminated with a newline.

https://imgur.com/a/GjG2E9W

as cut and pasted from Thunderbird, and viewed with Notepad++ on Windows.




>    15:38 $ hexdump -C fZIgH.17219\$je1.6227\@fx22.iad.msg | tail -10
>    00000bb0  69 6e 65 20 69 73 20 2f  73 70 65 63 69 66 69 63  |ine is /specific|
>    00000bc0  61 6c 6c 79 2f 20 61 20  73 65 71 75 65 6e 63 65  |ally/ a sequence|
>    00000bd0  20 6f 66 20 63 68 61 72  61 63 74 65 72 73 0a 3e  | of characters.>|
>    00000be0  20 66 6f 6c 6c 6f 77 65  64 20 62 79 20 61 20 6e  | followed by a n|
>    00000bf0  65 77 6c 69 6e 65 2e 0a  0a 73 65 7a 20 77 68 6f  |ewline...sez who|
>    00000c00  3f 0a 0a 42 79 20 74 68  61 74 20 62 6f 67 75 73  |?..By that bogus|
>    00000c10  20 64 65 66 69 6e 69 74  69 6f 6e 3a 0a 0a 54 68  | definition:..Th|
>    00000c20  69 73 20 69 73 20 6f 6e  6c 79 0a 31 20 6c 69 6e  |is is only.1 lin|
>    00000c30  65 20 6f 66 20 74 65 78  74 2e 0a                 |e of text..|
>    00000c3b
> 
> Note the newline (0x0a) at 0c3a)


That last newline was added by another program.  I didn't send it.

Stop your copy and paste at 00000c30 and see what you get.

#155537

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-11 16:35 -0400
Message-ID	<rlvqaa$ill$1@dont-email.me>
In reply to	#155532

On 10/11/20 3:47 PM, DFS wrote:
> On 10/11/2020 3:40 PM, Lew Pitcher wrote:
>> On Sun, 11 Oct 2020 15:20:11 -0400, DFS wrote:
>>
>>> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
...
>>> "Whether the last line requires a terminating new-line character is
>>> implementation-defined."
...
>>>> No, you wouldn't. A line is /specifically/ a sequence of characters
>>>> followed by a newline.
>>>
>>> sez who?
>>>
>>> By that bogus definition:
>>>
>>> This is only 1
>>> line of text.
>> Assuming that the post went to EOF immediately after the ".", then yes; that was only 1 line of text, followed by a block of text that cannot be called a "line".
> 
> 
> It can and must be called a line, whether it terminated with \n or not.

That might be the way you feel about it, but the C standard expresses a
conflicting and more authoritative point of view on the issue. You
yourself cited the text from the standard (quoted at the top of this
message) that explicitly authorizes each implementation to decide for
itself whether such a sequence of characters qualifies as a line.

#155542 — Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods)

From	Lew Pitcher <lew.pitcher@digitalfreehold.ca>
Date	2020-10-11 21:13 +0000
Subject	Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods)
Message-ID	<rlvsgs$bdi$3@dont-email.me>
In reply to	#155532

On Sun, 11 Oct 2020 15:47:28 -0400, DFS wrote:

> On 10/11/2020 3:40 PM, Lew Pitcher wrote:
>> On Sun, 11 Oct 2020 15:20:11 -0400, DFS wrote:
[snip]
>>> This is only 1
>>> line of text.
>> Assuming that the post went to EOF immediately after the ".", then yes;
>> that was only 1 line of text, followed by a block of text that cannot be
>> called a "line".
> 
> 
> It can and must be called a line, whether it terminated with \n or not.

>> OTOH, in reality, You /posted/ two lines there, both terminated with
>> newlines.
> 
> In reality I posted two lines, only one terminated with a newline.
> 
> https://imgur.com/a/GjG2E9W
> 
> as cut and pasted from Thunderbird, and viewed with Notepad++ on
> Windows.
>
>>    15:38 $ hexdump -C fZIgH.17219\$je1.6227\@fx22.iad.msg | tail -10
[snip]
>>    00000c30  65 20 6f 66 20 74 65 78  74 2e 0a                 |e of text..|
>>    00000c3b
>> 
>> Note the newline (0x0a) at 0c3a)
> 
> 
> That last newline was added by another program.

Yes, /your/ nntp posting application, as part of the requirements of NNTP 
(the protocol that governs the population and management of Usenet 
articles). See RFC 3977, section 3.6, where it defines the format of the 
body of a posting. Ironically, it specifies that the body will contain 
one or more /lines/, each terminated with a CARRIAGE-RETURN, LINEFEED 
combination. There are /no/ unterminated lines in a posting body.
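
As a rough illustration of the rule described above (every line of the body ends in a CR LF pair, with no unterminated trailing text), a check over an in-memory copy of a message body might look like the sketch below. The function name body_is_crlf_terminated is invented for this example, and the check is deliberately narrow: it does not implement NNTP, dot-stuffing, or the end-of-body marker.

```c
#include <stddef.h>

/* Return 1 if buf (len bytes) is non-empty, ends with "\r\n", and every
   '\n' in it is immediately preceded by '\r'; return 0 otherwise.
   Hypothetical helper for illustration only. */
int body_is_crlf_terminated(const char *buf, size_t len)
{
    if (len < 2 || buf[len - 2] != '\r' || buf[len - 1] != '\n') {
        return 0;   /* empty, or trailing text with no CR LF after it */
    }
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n' && (i == 0 || buf[i - 1] != '\r')) {
            return 0;   /* bare LF inside the body */
        }
    }
    return 1;
}
```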

>  I didn't send it.

You may not have intended to, but you did, indeed, "send it".

As for your complaint regarding linecounting, making such a complaint 
here is more than useless. You are arguing against an interpretation that 
has /operationally/ been in place for at least 50 years, and has been 
codified in standards for at least 30 years. Moreover, you are making 
this argument to an audience of amateurs and professionals in a public 
forum, and /NOT/ to a standards body, or anyone who can take action to 
ensure that /your/ interpretation overrides the current, standard 
interpretation. (At least, not within the confines of this forum; some 
participants may be in such a position, should you address your concerns
/formally/ to them in the appropriate forum.)

In other words, you are practicing an exercise in futility. You would 
have better luck arguing with weathermen that the wind should not blow 
your expectoration back on you when you spit into the wind.

-- 
Lew Pitcher
"In Skills, We Trust"

#155546 — Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods)

From	DFS <nospam@dfs.com>
Date	2020-10-11 18:45 -0400
Subject	Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods)
Message-ID	<F_LgH.126376$MQ.9832@fx14.iad>
In reply to	#155542

On 10/11/2020 5:13 PM, Lew Pitcher wrote:
> On Sun, 11 Oct 2020 15:47:28 -0400, DFS wrote:
> 
>> On 10/11/2020 3:40 PM, Lew Pitcher wrote:
>>> On Sun, 11 Oct 2020 15:20:11 -0400, DFS wrote:
> [snip]
>>>> This is only 1
>>>> line of text.
>>> Assuming that the post went to EOF immediately after the ".", then yes;
>>> that was only 1 line of text, followed by a block of text that cannot be
>>> called a "line".
>>
>>
>> It can and must be called a line, whether it terminated with \n or not.
> 
>>> OTOH, in reality, You /posted/ two lines there, both terminated with
>>> newlines.
>>
>> In reality I posted two lines, only one terminated with a newline.
>>
>> https://imgur.com/a/GjG2E9W
>>
>> as cut and pasted from Thunderbird, and viewed with Notepad++ on
>> Windows.
>>
>>>     15:38 $ hexdump -C fZIgH.17219\$je1.6227\@fx22.iad.msg | tail -10
> [snip]
>>>     00000c30  65 20 6f 66 20 74 65 78  74 2e 0a                 |e of text..|
>>>     00000c3b
>>>
>>> Note the newline (0x0a) at 0c3a.
>>
>>
>> That last newline was added by another program.



> Yes, /your/ nntp posting application, as part of the requirements of NNTP
> (the protocol that governs the population and management of Usenet
> articles). See RFC 3977, section 3.6, where it defines the format of the
> body of a posting. Ironically, it specifies that the body will contain
> one or more /lines/, each terminated with a CARRIAGE-RETURN, LINEFEED
> combination. There are /no/ unterminated lines in a posting body.

And what exactly forces Thunderbird or any Usenet app or server to 
adhere to the NNTP protocol?



>>   I didn't send it.
> 
> You may not have intended to, but you did, indeed, "send it".

No #1

I submitted the text to blocknews - via Thunderbird, written with
whatever editor is the default - without a terminating \n, and when the
post showed up in clc no CRLF was present at the end.

I already showed you what it looks like in Notepad++:
https://imgur.com/a/GjG2E9W



No #2

When I retrieve the post from blocknews (via the Python
nntplib.body(articleID) call), it has no terminating \n.

<fZIgH.17219$je1.6227@fx22.iad>



But when I find the post at Howard Knight Usenet lookup it does contain 
a terminating \n, indicating it was probably altered by other Usenet 
server(s).

http://al.howardknight.net/?STYPE=msgid&MSGI=<fZIgH.17219%24je1.6227%40fx22.iad>



How about you send a post to clc with no terminating \n and we'll see 
what happens.



> As for your complaint regarding linecounting, making such a complaint
> here is more than useless. You are arguing against an interpretation that
> has /operationally/ been in place for at least 50 years, and has been
> codified in standards for at least 30 years. Moreover, you are making
> this argument to an audience of amateurs and professionals in a public
> forum, and /NOT/ to a standards body, or anyone who can take action to
> ensure that /your/ interpretation overrides the current, standard
> interpretation. (At least, not within the confines of this forum; some
> participants may be in such a position, should you address your concerns
> /formally/ to them in the appropriate forum.)
> 
> In other words, you are practicing an exercise in futility. You would
> have better luck arguing with weathermen that the wind should not blow
> your expectoration back on you when you spit into the wind.


If anyone doesn't count the last line of text because it hasn't a 
terminating \n they should be flogged.


#155547 — Re: NNTP message requirements

From	Keith Thompson <Keith.S.Thompson+u@gmail.com>
Date	2020-10-11 17:11 -0700
Subject	Re: NNTP message requirements
Message-ID	<875z7g7171.fsf@nosuchdomain.example.com>
In reply to	#155546

DFS <nospam@dfs.com> writes:
[...]
> If anyone doesn't count the last line of text because it hasn't a
> terminating \n they should be flogged.

Thank you for establishing that your opinions are not to be taken
seriously.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
Working, but not speaking, for Philips Healthcare
void Void(void) { Void(); } /* The recursive call of the void */


#155536

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-11 16:27 -0400
Message-ID	<rlvpqq$fgs$1@dont-email.me>
In reply to	#155529

On 10/11/20 3:20 PM, DFS wrote:
> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
...
>> No, you wouldn't. A line is /specifically/ a sequence of characters
>> followed by a newline.
> 
> sez who?

The most authoritative source possible in this context: the ISO C
standard. "A text stream is an ordered sequence of characters composed
into _lines_, each line consisting of zero or more characters plus a
terminating new-line character. ..." (7.21.2p2)
The word "lines" is italicized, an ISO convention indicating that the
sentence containing that word is considered to officially define the
meaning of that word in the context of this document.

The immediately following line, which makes the end of the file a
special case, has already been quoted by Lew Pitcher.
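
A minimal sketch of what counting under that definition might look like, assuming the input is an already opened stdio stream. The names struct line_count and count_lines are hypothetical, not anything from the standard or from this thread; the sketch counts new-line characters and separately flags any characters left over after the last one, so the caller can decide whether to treat that fragment as a line.

```c
#include <stdio.h>

/* Result of a count (hypothetical type, for illustration only). */
struct line_count {
    unsigned long terminated;  /* lines that end in '\n' */
    int partial;               /* 1 if characters follow the last '\n' */
};

/* Read fp to end-of-file and count new-line terminated lines. */
struct line_count count_lines(FILE *fp)
{
    struct line_count lc = {0, 0};
    int c;
    int prev = '\n';  /* an empty stream has no partial line */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lc.terminated++;
        prev = c;
    }
    lc.partial = (prev != '\n');
    return lc;
}
```

A caller who wants the stricter count would use terminated alone; one who wants to include an unterminated tail would use terminated + partial.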


#155545

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2020-10-11 23:30 +0100
Message-ID	<87pn5o8kfw.fsf@bsb.me.uk>
In reply to	#155536

James Kuyper <jameskuyper@alumni.caltech.edu> writes:

> On 10/11/20 3:20 PM, DFS wrote:
>> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
> ...
>>> No, you wouldn't. A line is /specifically/ a sequence of characters
>>> followed by a newline.
>> 
>> sez who?
>
> The most authoritative source possible in this context: the ISO C
> standard.

The answer to who says a line is defined as above is indeed "the ISO C
standard", but saying that this is "the most authoritative source
possible in this context" goes too far.  If I write a program to count
lines, I get to say what a line is, not the C standard.

-- 
Ben.


#155550

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-11 23:56 -0400
Message-ID	<rm0k4h$9qe$1@dont-email.me>
In reply to	#155545

On 10/11/20 6:30 PM, Ben Bacarisse wrote:
> James Kuyper <jameskuyper@alumni.caltech.edu> writes:
> 
>> On 10/11/20 3:20 PM, DFS wrote:
>>> On 10/11/2020 2:33 PM, Lew Pitcher wrote:
>> ...
>>>> No, you wouldn't. A line is /specifically/ a sequence of characters
>>>> followed by a newline.
>>>
>>> sez who?
>>
>> The most authoritative source possible in this context: the ISO C
>> standard.
> 
> The answer to who says a line is defined as above is indeed "the ISO C
> standard", but saying that this is "the most authoritative source
> possible in this context" goes too far.  If I write a program to count
> lines, I get to say what a line is, not the C standard.

I'm not disagreeing with that comment, but it's not relevant to what I
was saying. The fact that you consider it relevant implies that I didn't
say what I meant clearly enough.

The standard uses the term line in (at least?) two different contexts -
a line of source code, or a line as processed using the standard
library's I/O routines. My comment was only about the latter.

You are, of course, free to choose a definition of what a "line" is, and
to write code that counts how many of them there are in an input file.

However, it's the standard's definition of "lines" in 7.21.2p2 that
governs the interpretation of any sentence in the standard that uses
"line" or "lines" in connection with C standard library I/O routines;
the definition you choose has no bearing on the matter. For instance,
7.21.2p9 says "An implementation shall support text files with lines
containing at least 254 characters,". The standard's definition of
"lines" is what determines whether or not a given implementation meets
that requirement, not your definition. In general, a program that counts
something called "lines" with a different definition of that term than
the one in 7.21.2p2 might have to be carefully written to work around
that difference.

When I said "in this context", I was referring specifically to a context
where it is the standard's definition that matters. It's
implementation-defined whether or not a newline is needed at the end of
the last line of a text file, and it's the standard's definition of
"lines", not yours, that determines what constitutes the last line of a
text file for purposes of determining whether that requirement has been
met.
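
One way a program with a looser definition might work around the difference, sketched here under assumptions: count_lines_inclusive (a hypothetical name) reads with fgets, counts every chunk that ends in a new-line, and adds one more if the stream ends in the middle of a line. It assumes the input contains no embedded null characters, since strlen is used to find the end of each chunk.

```c
#include <stdio.h>
#include <string.h>

/* Count lines, treating a final unterminated fragment as a line too.
 * This is the looser definition, not the one in 7.21.2p2.
 * (Hypothetical helper, for illustration only.) */
unsigned long count_lines_inclusive(FILE *fp)
{
    char buf[256];
    unsigned long lines = 0;
    int in_line = 0;  /* nonzero while inside a line not yet terminated */

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t n = strlen(buf);
        if (n > 0 && buf[n - 1] == '\n') {
            lines++;
            in_line = 0;
        } else {
            /* either a long line split by fgets or an unterminated tail */
            in_line = 1;
        }
    }
    return lines + (in_line ? 1UL : 0UL);
}
```

Because a line longer than the buffer arrives in several fgets calls, the sketch counts only on a chunk that ends in '\n', so long lines are not counted twice.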


#155525

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-11 14:53 -0400
Message-ID	<rlvkb2$dbd$1@dont-email.me>
In reply to	#155521

On 10/11/20 1:51 PM, DFS wrote:
...
> I'm not talking about IDEs.  I'm talking about writing in general.  I 
> would guess many people write their last line and end it with a period, 
> not a period and return.

Text file editors (including some IDEs) often put in a final return,
even if you don't. As a result, if you do put one in, you'll often end
up with two newlines at the end of the file.


#155528

From	DFS <nospam@dfs.com>
Date	2020-10-11 15:15 -0400
Message-ID	<0VIgH.17218$je1.9703@fx22.iad>
In reply to	#155525

On 10/11/2020 2:53 PM, James Kuyper wrote:
> On 10/11/20 1:51 PM, DFS wrote:
> ...
>> I'm not talking about IDEs.  I'm talking about writing in general.  I
>> would guess many people write their last line and end it with a period,
>> not a period and return.
> 
> Text file editors (including some IDEs) often put in a final return,
> even if you don't. As a result, if you do put one in, you'll often end
> up with two newlines at the end of the file.

I think the default Thunderbird editor does, since when I save a draft 
it adds a terminating CRLF (Windows).

I'm going to send this post without saving it, and see if it
adds a CRLF to this last line - I'm not going to.


#155665

From	Jorgen Grahn <grahn+nntp@snipabacken.se>
Date	2020-10-14 20:08 +0000
Message-ID	<slrnroemmf.1hpq.grahn+nntp@frailea.sa.invalid>
In reply to	#155525

On Sun, 2020-10-11, James Kuyper wrote:
> On 10/11/20 1:51 PM, DFS wrote:
> ...
>> I'm not talking about IDEs.  I'm talking about writing in general.  I 
>> would guess many people write their last line and end it with a period, 
>> not a period and return.
>
> Text file editors (including some IDEs) often put in a final return,
> even if you don't.

E.g. Emacs and vi in default configurations.

> As a result, if you do put one in, you'll often end
> up with two newlines at the end of the file.

/That/ is something I have never seen.  Which tools do that? Sounds
like a bug to me -- and one that's easily fixed.

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .


#155668

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-14 16:58 -0400
Message-ID	<rm7oq5$grv$1@dont-email.me>
In reply to	#155665

On 10/14/20 4:08 PM, Jorgen Grahn wrote:
> On Sun, 2020-10-11, James Kuyper wrote:
>> On 10/11/20 1:51 PM, DFS wrote:
>> ...
>>> I'm not talking about IDEs.  I'm talking about writing in general.  I 
>>> would guess many people write their last line and end it with a period, 
>>> not a period and return.
>>
>> Text file editors (including some IDEs) often put in a final return,
>> even if you don't.
> 
> E.g. Emacs and vi in default configurations.
> 
>> As a result, if you do put one in, you'll often end
>> up with two newlines at the end of the file.
> 
> /That/ is something I have never seen.  Which tools do that? Sounds
> like a bug to me -- and one that's easily fixed.

I opened a new file with vi, and hit the following keys:

	i 1 Enter Esc : x

Here's what I see in the resulting file:

~(48) od -a linetest
0000000   1  nl  nl
0000003
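
The same check can be made from C rather than od. A minimal sketch, with the hypothetical name trailing_newlines, that reports how many consecutive new-line characters end a stream; for a file with the contents shown in the od output above (1 nl nl) it would return 2.

```c
#include <stdio.h>

/* Return the number of consecutive '\n' characters at the very end of
 * the stream fp. (Hypothetical helper, for illustration only.) */
unsigned long trailing_newlines(FILE *fp)
{
    unsigned long run = 0;
    int c;

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            run++;      /* extend the current run of new-lines */
        else
            run = 0;    /* any other character breaks the run */
    }
    return run;
}
```

A result of 0 means the last line is unterminated, 1 means a single terminating new-line, and anything larger means the file ends with one or more empty lines.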
