Inconsistent line counts from 3 methods

Started by	DFS <nospam@dfs.com>
First post	2020-10-10 22:37 -0400
Last post	2020-10-20 15:48 +0100
Articles	7 on this page of 47 — 14 participants

  Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-10 22:37 -0400
    Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-10 22:06 -0700
      Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 10:38 -0400
        Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-11 15:36 +0000
          Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 13:51 -0400
            Re: Inconsistent line counts from 3 methods Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 18:33 +0000
              Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:20 -0400
                Re: Inconsistent line counts from 3 methods Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 19:40 +0000
                  Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:47 -0400
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 16:35 -0400
                    Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods) Lew Pitcher <lew.pitcher@digitalfreehold.ca> - 2020-10-11 21:13 +0000
                      Re: NNTP message requirements (Was: Inconsistent line counts from 3 methods) DFS <nospam@dfs.com> - 2020-10-11 18:45 -0400
                        Re: NNTP message requirements Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 17:11 -0700
                Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 16:27 -0400
                  Re: Inconsistent line counts from 3 methods Ben Bacarisse <ben.usenet@bsb.me.uk> - 2020-10-11 23:30 +0100
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 23:56 -0400
            Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 14:53 -0400
              Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 15:15 -0400
              Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-14 20:08 +0000
                Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-14 16:58 -0400
                  Re: Inconsistent line counts from 3 methods Eli the Bearded <*@eli.users.panix.com> - 2020-10-14 23:37 +0000
                    Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-14 17:25 -0700
                      Re: Inconsistent line counts from 3 methods Eli the Bearded <*@eli.users.panix.com> - 2020-10-15 01:55 +0000
                    Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 19:19 +0000
                  Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 19:10 +0000
                    Re: Inconsistent line counts from 3 methods Kaz Kylheku <793-849-0957@kylheku.com> - 2020-10-17 19:36 +0000
            Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-14 20:16 +0000
          Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-11 11:36 -0700
            Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 15:12 -0400
      Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-11 12:16 -0400
    Re: Inconsistent line counts from 3 methods Johann Klammer <klammerj@NOSPAM.a1.net> - 2020-10-11 15:18 +0200
      Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-11 14:31 +0000
      Re: Inconsistent line counts from 3 methods Barry Schwarz <schwarzb@delq.com> - 2020-10-11 11:31 -0700
      Re: Inconsistent line counts from 3 methods Ben Bacarisse <ben.usenet@bsb.me.uk> - 2020-10-11 23:15 +0100
    Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 14:00 -0700
      Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-11 17:47 -0400
        Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-11 17:26 -0700
          Re: Inconsistent line counts from 3 methods DFS <nospam@dfs.com> - 2020-10-12 13:11 -0400
            Re: Inconsistent line counts from 3 methods Keith Thompson <Keith.S.Thompson+u@gmail.com> - 2020-10-12 10:56 -0700
              Re: Inconsistent line counts from 3 methods Tim Rentsch <tr.17687@z991.linuxsc.com> - 2020-11-29 00:21 -0800
            Re: Inconsistent line counts from 3 methods scott@slp53.sl.home (Scott Lurndal) - 2020-10-12 19:19 +0000
              Re: Inconsistent line counts from 3 methods dfs <nospam@dfs.com> - 2020-10-12 18:53 -0400
                Re: Inconsistent line counts from 3 methods Jorgen Grahn <grahn+nntp@snipabacken.se> - 2020-10-17 23:09 +0000
                  Re: Inconsistent line counts from 3 methods Bart <bc@freeuk.com> - 2020-10-18 00:24 +0100
                    Re: Inconsistent line counts from 3 methods Kaz Kylheku <793-849-0957@kylheku.com> - 2020-10-18 16:56 +0000
                    Re: Inconsistent line counts from 3 methods James Kuyper <jameskuyper@alumni.caltech.edu> - 2020-10-20 09:17 -0400
                      Re: Inconsistent line counts from 3 methods Bart <bc@freeuk.com> - 2020-10-20 15:48 +0100

#155588

From	scott@slp53.sl.home (Scott Lurndal)
Date	2020-10-12 19:19 +0000
Message-ID	<F_1hH.334040$Av7.7306@fx34.iad>
In reply to	#155583

DFS <nospam@dfs.com> writes:
>On 10/11/2020 8:26 PM, Keith Thompson wrote:

>> If you only care about *how many* lines are in your input, there's
>> no point in using fgets().  Just read a character or a block at
>> a time and scan for '\n' characters (and *maybe* apply special
>> handling if the last character read isn't '\n').
>
>Why maybe?  Shouldn't you test every time, and add one to your linecount 
>if the last character before EOF isn't \n?
>
>----------------------------------------------------
>#include <stdio.h>
>int main(int argc, char *argv[])
>{
>  //count newline with getc
>  FILE *fin = fopen(argv[1],"r");
>  char c;
>  int lines = 0;
>  for (c=getc(fin);c!=EOF;c=getc(fin)) {if(c=='\n') {lines++;}}
>  fseek(fin, ftell(fin)-1, SEEK_SET);
>  c=getc(fin);
>  if(c!='\n') {lines++;printf("Last character = '%c'\n",c);}
>  printf("getc line count: %d\n",lines);
>  fclose(fin);
>  return(0);
>}
>----------------------------------------------------
>
>I tested that code a few times and it worked.  Even though the pointer 
>is at EOF after the for..loop, do you think it's potentially troublesome 
>not to use an explicit fseek(fin, 0, SEEK_END); after the for..loop?
>
>


The fastest way to count lines:

#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#include <sys/mman.h>
#include <sys/stat.h>

int
main(int argc, const char **argv, const char **envp)
{
    int fd;
    uint8_t *cp;
    struct stat st;
    size_t linecount = 0ul;

    if (argc < 2) {
        fprintf(stderr, "%s: The file to scan must be supplied as an argument\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY, 0);
    if (fd == -1) {
       fprintf(stderr, "%s: Unable to open '%s': %s\n",
               argv[0], argv[1], strerror(errno));
       return 2;
    }
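    if (fstat(fd, &st) == -1) {
        fprintf(stderr, "%s: Unable to stat '%s': %s\n",
               argv[0], argv[1], strerror(errno));
        return 4;
    }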
    cp = (uint8_t *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0ul);
    if ((void *)cp == MAP_FAILED) {
        fprintf(stderr, "%s: Unable to map '%s': %s\n",
               argv[0], argv[1], strerror(errno));
        return 3;
    }
    for(size_t s = st.st_size; s > 0; --s) {
        if (*cp++ == '\n') linecount++;
    }

    fprintf(stdout, "Line count is %zu\n", linecount);
    if (*(cp - 1) != '\n') fprintf(stdout, "Last byte of file was not a newline\n");
    return 0;
}

Yes, on x86 (32-bit), this may choke on files over 1GB (depending
on the virtual and physical address space resource limits). In that
case, mapping the file in smaller portions works just fine.

With this approach, the data from the input file is loaded directly
into the application address space during the page fault process. There
are no intermediate kernel or library buffers involved unlike stdio.
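
A minimal sketch of the "mapping smaller portions" idea, assuming a POSIX system. The helper name count_lines_windowed, its parameters, and the 1 GiB cap are hypothetical and not from the program above; the caller is assumed to have obtained size from fstat(). It maps one window at a time, counts '\n' bytes with memchr(), and unmaps before moving on, so the address space in use never exceeds one window.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the post): counts '\n' bytes in the first
 * `size` bytes of fd, mapping at most `window` bytes at a time.
 * Sets *ends_with_newline to 0 if the last byte is not '\n'.
 * Returns 0 on success, -1 on failure. */
int count_lines_windowed(int fd, off_t size, size_t window,
                         size_t *lines, int *ends_with_newline)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || window < (size_t)page)
        return -1;
    if (window > ((size_t)1 << 30))
        window = (size_t)1 << 30;      /* assumed cap: 1 GiB per mapping */
    window -= window % (size_t)page;   /* mmap offsets must be page-aligned */

    *lines = 0;
    *ends_with_newline = 1;            /* an empty file has no unterminated line */

    for (off_t off = 0; off < size; off += (off_t)window) {
        off_t left = size - off;
        size_t len = left < (off_t)window ? (size_t)left : window;
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);

        if (p == MAP_FAILED)
            return -1;

        /* Scan this window only; q always stays within [p, end]. */
        const unsigned char *q = p;
        const unsigned char *end = p + len;
        while ((q = memchr(q, '\n', (size_t)(end - q))) != NULL) {
            ++*lines;
            ++q;
        }
        *ends_with_newline = (p[len - 1] == '\n');

        if (munmap(p, len) != 0)
            return -1;
    }
    return 0;
}
```

A caller would open the file, fstat() it, and pass st.st_size together with a window that is some multiple of the page size. Whether this is faster than a plain read() loop depends on the system and is not something the sketch demonstrates.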

#155597

From	dfs <nospam@dfs.com>
Date	2020-10-12 18:53 -0400
Message-ID	<s75hH.312992$575.308561@fx38.iad>
In reply to	#155588

On 10/12/20 3:19 PM, Scott Lurndal wrote:
> DFS <nospam@dfs.com> writes:
>> On 10/11/2020 8:26 PM, Keith Thompson wrote:
> 
>>> If you only care about *how many* lines are in your input, there's
>>> no point in using fgets().  Just read a character or a block at
>>> a time and scan for '\n' characters (and *maybe* apply special
>>> handling if the last character read isn't '\n').
>>
>> Why maybe?  Shouldn't you test every time, and add one to your linecount
>> if the last character before EOF isn't \n?
>>
>> ----------------------------------------------------
>> #include <stdio.h>
>> int main(int argc, char *argv[])
>> {
>>   //count newline with getc
>>   FILE *fin = fopen(argv[1],"r");
>>   char c;
>>   int lines = 0;
>>   for (c=getc(fin);c!=EOF;c=getc(fin)) {if(c=='\n') {lines++;}}
>>   fseek(fin, ftell(fin)-1, SEEK_SET);
>>   c=getc(fin);
>>   if(c!='\n') {lines++;printf("Last character = '%c'\n",c);}
>>   printf("getc line count: %d\n",lines);
>>   fclose(fin);
>>   return(0);
>> }
>> ----------------------------------------------------
>>
>> I tested that code a few times and it worked.  Even though the pointer
>> is at EOF after the for..loop, do you think it's potentially troublesome
>> not to use an explicit fseek(fin, 0, SEEK_END); after the for..loop?
>>
>>
> 
> 
> The fastest way to count lines:
> 
> #include <errno.h>
> #include <fcntl.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #include <sys/mman.h>
> #include <sys/stat.h>
> 
> int
> main(int argc, const char **argv, const char **envp)
> {
>      int fd;
>      uint8_t *cp;
>      struct stat st;
>      size_t linecount = 0ul;
> 
>      if (argc < 2) {
>          fprintf(stderr, "%s: The file to scan must be supplied as an argument\n", argv[0]);
>          return 1;
>      }
>      fd = open(argv[1], O_RDONLY, 0);
>      if (fd == -1) {
>         fprintf(stderr, "%s: Unable to open '%s': %s\n",
>                 argv[0], argv[1], strerror(errno));
>         return 2;
>      }
>      fstat(fd, &st);
>      cp = (uint8_t *)mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0ul);
>      if ((void *)cp == MAP_FAILED) {
>          fprintf(stderr, "%s: Unable to map '%s': %s\n",
>                 argv[0], argv[1], strerror(errno));
>          return 3;
>      }
>      for(size_t s = st.st_size; s > 0; --s) {
>          if (*cp++ == '\n') linecount++;
>      }
> 
>      fprintf(stdout, "Line count is %zu\n", linecount);
>      if (*(cp - 1) != '\n') fprintf(stdout, "Last byte of file was not a newline\n");
>      return 0;
> }
> 
> Yes, on x86 (32-bit), this may choke on files over 1GB (depending
> on the virtual and physical address space resource limits). In which
> case, mapping smaller portions works just fine.
> 
> With this approach, the data from the input file is loaded directly
> into the application address space during the page fault process. There
> are no intermediate kernel or library buffers involved unlike stdio.


$ gcc -Wall linecounter_lurndal.c -o linecounter_lurndal
$ time ./linecounter_lurndal bible.txt
Line count is 31101
Last byte of file was not a newline

real	0m0.023s
user	0m0.023s
sys	0m0.000s

$ time ./linecounter_lurndal bible4x.txt
Line count is 124408

real	0m0.086s
user	0m0.082s
sys	0m0.004s



$ time ./linecounter_DFS bible.txt
31102 lines

real	0m0.008s
user	0m0.008s
sys	0m0.000s


$ time ./linecounter_DFS bible4x.txt
124408 lines

real	0m0.029s
user	0m0.021s
sys	0m0.008s



--------------------------------------
mine is a 'standard' fgets routine
--------------------------------------
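#include <stdio.h>
int main(int argc, char *argv[])
{
// usage: linecounter_DFS filename
  int  lines = 0;
  char line[1024] = "";
  FILE *fin;
  if (argc < 2) {fprintf(stderr,"usage: %s filename\n",argv[0]); return 1;}
  fin = fopen(argv[1],"r");
  if (fin == NULL) {perror(argv[1]); return 1;}
  // counts fgets calls, so a line longer than 1023 chars is counted more than once
  while (fgets(line,sizeof line, fin)!= NULL) {lines++;}
  fclose(fin);
  printf("%d lines\n",lines);
  return 0;
}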
--------------------------------------
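
The one-line difference on bible.txt (31101 vs 31102) is consistent with a file whose last line has no trailing '\n': an fgets loop counts that final fragment, while counting '\n' bytes does not. A minimal sketch that makes the choice explicit, assuming nothing beyond standard C. The name count_lines and the count_partial flag are hypothetical, not from any post in the thread. Note that c is an int, so EOF stays distinguishable from a byte with value 0xFF.

```c
#include <stdio.h>

/* Hypothetical helper, not from the thread: counts '\n' characters in fp.
 * If count_partial is nonzero, a final line lacking '\n' is counted too,
 * which matches what an fgets() loop reports when no line exceeds its buffer. */
long count_lines(FILE *fp, int count_partial)
{
    long lines = 0;
    int c;            /* int, not char: getc() returns an unsigned char value or EOF */
    int last = '\n';  /* treat an empty stream as properly terminated */

    while ((c = getc(fp)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (count_partial && last != '\n')
        lines++;
    return lines;
}
```

Because the last character is remembered during the scan, no fseek()/ftell() step is needed after the loop.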

#155737

From	Jorgen Grahn <grahn+nntp@snipabacken.se>
Date	2020-10-17 23:09 +0000
Message-ID	<slrnromuco.1hpq.grahn+nntp@frailea.sa.invalid>
In reply to	#155597

On Mon, 2020-10-12, dfs wrote:
...
> $ gcc -Wall linecounter_lurndal.c -o linecounter_lurndal
> $ time ./linecounter_lurndal bible.txt

Why do you time code that you built with optimization disabled?  You
can argue (maybe) that it doesn't matter in this case, but you would
have saved yourself the trouble by typing four more characters.

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .

#155738

From	Bart <bc@freeuk.com>
Date	2020-10-18 00:24 +0100
Message-ID	<P2LiH.2749107$1Eh.2063211@fx46.ams4>
In reply to	#155737

On 18/10/2020 00:09, Jorgen Grahn wrote:
> On Mon, 2020-10-12, dfs wrote:
> ...
>> $ gcc -Wall linecounter_lurndal.c -o linecounter_lurndal
>> $ time ./linecounter_lurndal bible.txt
> 
> Why do you time code that you built with optimization disabled?  You
> can argue (maybe) that it doesn't matter in this case, but you would
> have saved yourself the trouble by typing four more characters.
> 

So, if one program was faster than another, was it because of the 
approach and algorithm (eg. memory mapped files vs. calls to fread etc), 
or because one was more amenable to be optimised?

Programs this tiny can be unfairly optimised (sometimes to nothing), in 
a way that might not be practical in a real, sprawling application.

Comparisons can therefore be less meaningful optimised than unoptimised.

#155750

From	Kaz Kylheku <793-849-0957@kylheku.com>
Date	2020-10-18 16:56 +0000
Message-ID	<20201018095536.418@kylheku.com>
In reply to	#155738

On 2020-10-17, Bart <bc@freeuk.com> wrote:
> On 18/10/2020 00:09, Jorgen Grahn wrote:
>> On Mon, 2020-10-12, dfs wrote:
>> ...
>>> $ gcc -Wall linecounter_lurndal.c -o linecounter_lurndal
>>> $ time ./linecounter_lurndal bible.txt
>> 
>> Why do you time code that you built with optimization disabled?  You
>> can argue (maybe) that it doesn't matter in this case, but you would
>> have saved yourself the trouble by typing four more characters.
>> 
>
> So, if one program was faster than another, was it because of the 
> approach and algorithm (eg. memory mapped files vs. calls to fread etc), 
> or because one was more amenable to be optimised?


> Programs this tiny can be unfairly optimised (sometimes to nothing), in 
> a way that might not be practical in a real, sprawling application.

Probably not if you just use -O for basic optimizations.

Without any optimizations at all, programs can be confounded by silly
coding that moves data from one register to another, only to move it
back again, and which jumps to unconditional jump instructions and such.

#155806

From	James Kuyper <jameskuyper@alumni.caltech.edu>
Date	2020-10-20 09:17 -0400
Message-ID	<rmmo24$aju$1@dont-email.me>
In reply to	#155738

On 2020-10-17, Bart <bc@freeuk.com> wrote:
> On 18/10/2020 00:09, Jorgen Grahn wrote:
...
>> Why do you time code that you built with optimization disabled?  You
>> can argue (maybe) that it doesn't matter in this case, but you would
>> have saved yourself the trouble by typing four more characters.
>> 
>
> So, if one program was faster than another, was it because of the 
> approach and algorithm (eg. memory mapped files vs. calls to fread etc), 
> or because one was more amenable to be optimised?


> Programs this tiny can be unfairly optimised (sometimes to nothing), in 
> a way that might not be practical in a real, sprawling application.

If performance is an issue, turning on safe optimizations should be the
norm in real-world applications. In that case, testing without
optimization is essentially meaningless, unfairly failing to favor code
that optimizes easily over code that does not.

#155808

From	Bart <bc@freeuk.com>
Date	2020-10-20 15:48 +0100
Message-ID	<UMCjH.242004$ZL3.70823@fx33.am4>
In reply to	#155806

On 20/10/2020 14:17, James Kuyper wrote:
> On 2020-10-17, Bart <bc@freeuk.com> wrote:
>> On 18/10/2020 00:09, Jorgen Grahn wrote:
> ...
>>> Why do you time code that you built with optimization disabled?  You
>>> can argue (maybe) that it doesn't matter in this case, but you would
>>> have saved yourself the trouble by typing four more characters.
>>>
>>
>> So, if one program was faster than another, was it because of the
>> approach and algorithm (eg. memory mapped files vs. calls to fread etc),
>> or because one was more amenable to be optimised?
> 
> 
>> Programs this tiny can be unfairly optimised (sometimes to nothing), in
>> a way that might not be practical in a real, sprawling application.
> 
> If performance is an issue, turning on safe optimizations should be the
> norm in real-world applications. In that case, testing without
> optimization is essentially meaningless, unfairly failing to favor code
> that optimizes easily over code that does not.
> 

Failing to favour code that unfairly optimises easily.

For example, you are testing a function in the same module. It's called 
from one place, and with arguments that might be all or partly constant 
values.

In this situation, a compiler might inline the function, replace 
instances of the parameters in the body with the constants, and 
everything reduces down to some compact expression.

Naturally, the results will be very good, and you might deduce that the 
algorithm used in the function is highly performant.

Until you try it in a real program where the function is in another 
module, where it cannot be inlined, and those constant reductions cannot 
be performed.

Now you might find that algorithm wasn't that great after all.

If I want to find out whether car A is faster than car B over a circuit, I 
can't have B taking short-cuts.
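
The same-module scenario can be sketched as follows. Both function names are made up for illustration, and whether a particular compiler actually folds the first call to a constant depends on the compiler, its version, and the optimization level; the code only shows the two shapes of call being compared.

```c
/* Function under test: sum of squares 1..n, deliberately written as a loop.
 * Called with a literal from the same translation unit, an optimising
 * compiler may inline it and evaluate the whole loop at compile time. */
unsigned long sum_squares(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++)
        total += i * i;
    return total;
}

/* Same computation, but the argument passes through a volatile object,
 * so the compiler must load it at run time and cannot treat it as a
 * known constant. This approximates a call from another module. */
unsigned long sum_squares_opaque(unsigned long n)
{
    volatile unsigned long hidden = n;
    return sum_squares(hidden);
}
```

Timing sum_squares(1000) against sum_squares_opaque(1000) in a loop, built with and without -O2, is one way to see how much of a measured speed-up comes from the call site rather than from the algorithm. An optimiser may still rewrite the loop itself, so the opaque version is not a guarantee of "unoptimised" behaviour.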
