
Re: program to remove duplicates

From fir <fir@grunge.pl>
Newsgroups comp.lang.c
Subject Re: program to remove duplicates
Date 2024-09-22 16:26 +0200
Organization i2pn2 (i2pn.org)
Message-ID <66F02929.3020901@grunge.pl>
References (4 earlier) <66EF8293.30803@grunge.pl> <vcoh04$24ioi$1@dont-email.me> <66EFF046.8010709@grunge.pl> <vcos2o$264lk$1@dont-email.me> <66F02808.8030404@grunge.pl>


fir wrote:
> Bart wrote:
>> On 22/09/2024 11:24, fir wrote:
>>> Paul wrote:
>>
>>>> The normal way to do this, is do a hash check on the
>>>> files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
>>>> as a means to compare two files. If you want to be picky about
>>>> it, stick with SHA256SUM.
>>
>>
>>> the code I posted works OK, and if someone has Windows and MinGW/TDM
>>> they may compile it and check the application if they want
>>>
>>> hashing is not necessary imo, though it probably could speed things up -
>>> I'm not strongly convinced that the probability of a mistake with
>>> hashing is strictly zero (as I have never used it and would probably
>>> need to write my own hashing).. probably it's mathematically proven
>>> that it's almost zero, but for now at least it is more interesting to me
>>> whether the code I posted is OK
>>
>> I was going to post similar ideas (doing a linear pass working out
>> checksums for each file, sorting the list by checksum and size, then
>> candidates for a byte-by-byte comparison, if you want to do that, will
>> be grouped together).
>>
>> But if you're going to reject everyone's suggestions in favour of your
>> own already working solution, then I wonder why you bothered posting.
>>
>> (I didn't post after all because I knew it would be futile.)
>>
>>
>
> one more thing to say about the efficiency
>
> when I observe how it works - this program is quadratic in the sense that
> it has a half-square loop over the directory file list, so it may be like
> 20k*20k/2-20k comparisons, but it mostly compares only sizes, so I'm not
> sure how serious this kind of being quadratic is.. are 200M int
> comparisons a problem? - maybe it becomes one for larger sets
>
> in terms of real binary comparisons it is not fully quadratic, but
> it's like a set of smaller squares on the diagonal of this large square,
> if you (some of you) know what I mean... and that may be a problem, as
> if in those 20k files 100 have the same size then it makes about 100x100
> full loads and 100x100 full binary compares byte by byte, which
> is practically the full square if there are indeed 100 duplicates
> (maybe it's less than 100x100, as on first finding a duplicate I mark it
> as a duplicate and skip it in the loop from then on)
>
> but indeed it shows in practice that for folders bigger than 3k
> files it probably slows down disproportionately, so optimisation is
> in order / needed for large folders
>
> that's from observing it
>
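To put rough numbers on the quoted estimate, here is a minimal sketch of such a half-square pass. It is not the program posted earlier in the thread: `struct entry`, `content_id` (a stand-in for "the bytes are equal"), `pair_count` and `count_full_compares` are all hypothetical names made up for illustration. It only counts how many full loads/compares the loop would trigger when sizes are checked first and marked duplicates are skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct pairs among n items: n*(n-1)/2. */
unsigned long long pair_count(unsigned long long n)
{
    return n < 2 ? 0 : n * (n - 1) / 2;
}

/* Hypothetical stand-in for a file: its size plus an id that is equal
   for two files exactly when their bytes are equal. */
struct entry {
    long size;
    int content_id;
    int is_dup;   /* set once the file is known to duplicate an earlier one */
};

/* Half-square pass over the list. Returns how many full content
   compares (load + byte-by-byte compare in a real tool) were needed. */
unsigned long long count_full_compares(struct entry *e, size_t n)
{
    unsigned long long full = 0;
    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].is_dup)
            continue;                       /* already marked, skip it */
        for (size_t j = i + 1; j < n; j++) {
            if (e[j].is_dup)
                continue;
            if (e[i].size != e[j].size)
                continue;                   /* cheap integer compare only */
            full++;                         /* a real tool would compare bytes here */
            if (e[i].content_id == e[j].content_id)
                e[j].is_dup = 1;
        }
    }
    return full;
}
```

Under these assumptions the size-only part is `pair_count(20000)`, just under 200M integer compares. The expensive part depends on the same-size groups: 100 same-size files that are all identical need 99 full compares because of the marking, while 100 same-size files that all differ need the whole 100*99/2 = 4950.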


but as I said, I mainly wanted this done to free up some space taken by
these recovered, somewhat junk files.. and having it done the partially
quadratic way is more important than having it optimised

it works, and if I see it slow down on large folders I can divide those
big folders into a few of about 3k files each and run this duplicate
mover in each one

more manual work, but it can be done by hand
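
For reference, if the size loop itself ever becomes the bottleneck, a sketch of the sort-first idea suggested earlier in the thread could look like the following. It leaves out the checksum part of that suggestion and only sorts by size, so same-size files end up next to each other. `struct file_info`, `by_size` and `same_size_pairs` are hypothetical names for illustration, not taken from the posted program.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical directory entry: only the fields this sketch needs. */
struct file_info {
    const char *name;
    long size;
};

/* qsort comparator: ascending by size, without overflow-prone subtraction. */
static int by_size(const void *a, const void *b)
{
    const struct file_info *x = a;
    const struct file_info *y = b;
    return (x->size > y->size) - (x->size < y->size);
}

/* Sorts the list by size, then walks each run of equal sizes.
   Returns the number of same-size pairs, i.e. the upper bound on
   byte-by-byte compares that are still needed after sorting. */
unsigned long long same_size_pairs(struct file_info *f, size_t n)
{
    unsigned long long pairs = 0;
    assert(f != NULL || n == 0);
    if (n < 2)
        return 0;
    qsort(f, n, sizeof f[0], by_size);
    for (size_t i = 0; i < n; ) {
        size_t j = i + 1;
        while (j < n && f[j].size == f[i].size)
            j++;                            /* extend the run of equal sizes */
        unsigned long long run = j - i;
        pairs += run * (run - 1) / 2;       /* candidates inside this run */
        i = j;
    }
    return pairs;
}
```

The sort costs on the order of n log n size compares instead of n*n/2, and only files inside the same run need to be loaded and compared. The same-size runs themselves stay as expensive as before, so this does not help with the "smaller squares on the diagonal".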


