Re: program to remove duplicates

From fir <fir@grunge.pl>
Newsgroups comp.lang.c
Subject Re: program to remove duplicates
Date 2024-09-22 16:32 +0200
Organization i2pn2 (i2pn.org)
Message-ID <66F02A65.3000802@grunge.pl>
References (5 earlier) <vcoh04$24ioi$1@dont-email.me> <66EFF046.8010709@grunge.pl> <vcos2o$264lk$1@dont-email.me> <66F02808.8030404@grunge.pl> <66F02929.3020901@grunge.pl>


fir wrote:
> fir wrote:
>> Bart wrote:
>>> On 22/09/2024 11:24, fir wrote:
>>>> Paul wrote:
>>>
>>>>> The normal way to do this, is do a hash check on the
>>>>> files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
>>>>> as a means to compare two files. If you want to be picky about
>>>>> it, stick with SHA256SUM.
>>>
>>>
>>>> the code I posted works OK, and anyone who has Windows and mingw/tdm
>>>> can compile it and check the application if they want
>>>>
>>>> hashing is not necessary imo, though it probably could speed things up -
>>>> I'm not strongly convinced that the probability of a mistake with
>>>> hashing is strictly zero (I have never used it and would probably need
>>>> to write my own hashing).. it is probably mathematically proven
>>>> to be almost zero, but for now I am more interested in whether
>>>> the code I posted is OK
>>>
>>> I was going to post similar ideas (doing a linear pass working out
>>> checksums for each file, sorting the list by checksum and size, then
>>> candidates for a byte-by-byte comparison, if you want to do that, will
>>> be grouped together).
>>>
>>> But if you're going to reject everyone's suggestions in favour of your
>>> own already working solution, then I wonder why you bothered posting.
>>>
>>> (I didn't post after all because I knew it would be futile.)
>>>
>>>
>>
>> one more thing to say about efficiency
>>
>> when I observe how it works - this program is quadratic in the sense that
>> it has a half-square loop over the directory file list, so it may be like
>> 20k*20k/2-20k comparisons, but it mostly compares only sizes, so I'm not
>> sure how serious this quadratic behaviour is.. are 200M int comparisons
>> a problem? - maybe they become one for larger sets
>>
>> in terms of real binary comparisons it is not fully quadratic;
>> it is like a set of smaller squares on the diagonal of this large square,
>> if you (some of you) know what I mean... and that may be a problem, because
>> if 100 of those 20k files have the same size, that makes about 100x100 full
>> loads and 100x100 full byte-to-byte binary compares, which
>> is practically the full cost if there are indeed 100 duplicates
>> (maybe it is less than 100x100, since on first finding a duplicate I mark
>> it as a duplicate and skip it in the loop from then on)
>>
>> but in practice it does show that for folders bigger than 3k
>> files it slows down, probably disproportionately, so optimisation is
>> called for / needed for large folders
>>
>> that's from observing it
>>
>>
>
>
> but as I said, I mainly wanted this done to free some space taken by
> these recovered, somewhat junk files.. and having it work, even in this
> partially quadratic way, is more important than having it optimised
>
> it works, and if I see it slow down on large folders I can split those
> big folders into a few of 3k files each and run this duplicate mover in
> each one
>
> more hand work, but it can be done by hand

however, having said that, the checksumming/hashing idea is kind of good ofc
(sorting probably less so, as it may be a bit harder to write, since I'm never
sure my old hand-written quicksort has no error; I once tested about 30
quicksort versions in my life while trying to rewrite it, and once I got a
mistake in that code, and since then I have never been strictly sure that the
version I finally ended up with is good - it's probably good, but I'm not sure)

but I would need to understand that my own way of hashing has
practically no chance of generating the same hash for different files..
and I have never done such things, so I haven't thought it through.. and now
it's a side thing, possibly not worth studying
