| From | fir <fir@grunge.pl> |
|---|---|
| Newsgroups | comp.lang.c |
| Subject | Re: program to remove duplicates |
| Date | 2024-09-22 16:32 +0200 |
| Organization | i2pn2 (i2pn.org) |
| Message-ID | <66F02A65.3000802@grunge.pl> |
| References | (5 earlier) <vcoh04$24ioi$1@dont-email.me> <66EFF046.8010709@grunge.pl> <vcos2o$264lk$1@dont-email.me> <66F02808.8030404@grunge.pl> <66F02929.3020901@grunge.pl> |
fir wrote:
> fir wrote:
>> Bart wrote:
>>> On 22/09/2024 11:24, fir wrote:
>>>> Paul wrote:
>>>
>>>>> The normal way to do this, is do a hash check on the
>>>>> files and compare the hash. You can use MD5SUM, SHA1SUM, SHA256SUM,
>>>>> as a means to compare two files. If you want to be picky about
>>>>> it, stick with SHA256SUM.
>>>
>>>
>>>> the code i posted works ok, and if someone has windows and mingw/tdm
>>>> they may compile it and check the application if they want
>>>>
>>>> hashing is not necessary imo, though it probably could speed things up -
>>>> i'm not strongly convinced that the probability of a mistake in this
>>>> hashing is strictly zero (as i never used this and would probably need to
>>>> produce my own hashing).. probably it's mathematically proven
>>>> it's almost zero, but as for now at least it is more interesting for me
>>>> whether the code i posted is ok
>>>
>>> I was going to post similar ideas (doing a linear pass working out
>>> checksums for each file, sorting the list by checksum and size, then
>>> candidates for a byte-by-byte comparison, if you want to do that, will
>>> be grouped together).
>>>
>>> But if you're going to reject everyone's suggestions in favour of your
>>> own already working solution, then I wonder why you bothered posting.
>>>
>>> (I didn't post after all because I knew it would be futile.)
>>>
>>>
>>
>> yet to say about this efficiency
>>
>> when i observe how it works - this program is square in the sense that it
>> has a half-square loop over the directory file list, so it may be like
>> 20k*20k/2 - 20k comparisons, but it mostly compares only sizes, so i'm not
>> sure how serious this kind of being square is.. 200M int comparisons
>> is a problem? - maybe it becomes one for larger sets
>>
>> in terms of real binary comparisons it is not fully square, but
>> it's like sets of smaller squares on the diagonal of this large square,
>> if you (some) know what i mean... and that may be a problem, as
>> if in those 20k files 100 have the same size then it makes about 100x100
>> full loads and 100x100 full binary compares byte to byte, which
>> is practically full if there are indeed 100 duplicates
>> (maybe it's less than 100x100, as at the first finding of a duplicate i
>> mark it as a duplicate and skip it in the loop then)
>>
>> but indeed it shows in practice that in the case of folders bigger than 3k
>> files it slows down, probably disproportionately, so optimisation is
>> needed for large folders
>>
>> that's from observing it
>>
>
>
> but as i said i mainly wanted this done to free some space taken by
> these recovered, somewhat junk files.. and having it work the partially
> square way is more important than having it optimised
>
> it works, and if i see it slows down on large folders i can divide those
> big folders into a few of 3k files each and run this duplicate mover in
> each one
>
> more hand work but can be done by hand

however, saying that, the checksumming/hashing idea is kinda good ofc
(sorting probably less so, as it's maybe a bit harder to write, as i'm never
sure if my old hand-written quicksort code has no error: i once tested like
30 quicksort versions in my life trying to rewrite it, and once i got some
mistake in this code and later was never strictly sure if the version i
finally got is good - it's probably good but i'm not sure)

but i would need to understand that my own way of hashing has practically
no chance of generating the same hash on different files.. and i was never
doing those things so i haven't thought it through.. and now it's a side
thing possibly not worth studying
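
For reference, the size-plus-checksum grouping suggested in the quoted text could look roughly like the sketch below. It is not the program posted earlier in the thread. The `Entry` struct, `find_duplicates`, and the choice of 64-bit FNV-1a as the checksum are assumptions made for illustration, and it assumes the file contents are already loaded into memory. The checksum is used only to group candidates; a `memcmp` still confirms every match, so a hash collision can cost an extra comparison but cannot mark two different files as duplicates. The sorting is left to the standard `qsort`, so no hand-written quicksort is needed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One file already loaded into memory (hypothetical layout for this sketch). */
typedef struct {
    const unsigned char *data;
    size_t size;
    uint64_t hash;      /* filled in by find_duplicates */
    int is_duplicate;   /* 1 if an earlier entry in the group has the same bytes */
} Entry;

/* 64-bit FNV-1a: a simple non-cryptographic hash, used here only for grouping. */
static uint64_t fnv1a64(const unsigned char *p, size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV offset basis */
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;              /* FNV prime */
    }
    return h;
}

/* Order by size first, then by hash, so candidates end up adjacent. */
static int cmp_entry(const void *a, const void *b)
{
    const Entry *x = a, *y = b;
    if (x->size != y->size) return x->size < y->size ? -1 : 1;
    if (x->hash != y->hash) return x->hash < y->hash ? -1 : 1;
    return 0;
}

/* Sorts e in place, marks duplicates, and returns how many were marked. */
static size_t find_duplicates(Entry *e, size_t n)
{
    assert(e != NULL || n == 0);
    if (n == 0) return 0;

    /* Linear pass: one checksum per file. */
    for (size_t i = 0; i < n; i++) {
        e[i].hash = fnv1a64(e[i].data, e[i].size);
        e[i].is_duplicate = 0;
    }
    qsort(e, n, sizeof *e, cmp_entry);

    size_t dups = 0;
    for (size_t i = 0; i < n; ) {
        /* Find the run e[i..j) sharing both size and hash. */
        size_t j = i + 1;
        while (j < n && cmp_entry(&e[i], &e[j]) == 0) j++;

        /* Byte-by-byte confirmation, only inside this small group. */
        for (size_t a = i; a < j; a++) {
            if (e[a].is_duplicate) continue;
            for (size_t b = a + 1; b < j; b++) {
                if (e[b].is_duplicate) continue;
                if (e[a].size == 0 ||
                    memcmp(e[a].data, e[b].data, e[a].size) == 0) {
                    e[b].is_duplicate = 1;
                    dups++;
                }
            }
        }
        i = j;
    }
    return dups;
}
```

With this layout the only part that stays "square" is the confirmation inside each group, and an entry already marked as a duplicate is skipped in the inner loops, the same trick described above for the size-only version.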
program to remove duplicates fir <fir@grunge.pl> - 2024-09-21 20:53 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-21 20:56 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-21 21:27 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-21 22:12 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-21 23:13 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 00:48 +0200
Re: program to remove duplicates "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-09-21 14:54 -0700
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 00:18 +0200
Re: program to remove duplicates "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-09-21 16:46 -0700
Re: program to remove duplicates Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-09-22 02:06 +0000
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 04:36 +0200
Re: program to remove duplicates "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-09-21 21:18 -0700
Re: program to remove duplicates Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-09-22 07:09 +0000
Re: program to remove duplicates Paul <nospam@needed.invalid> - 2024-09-22 03:29 -0400
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 12:24 +0200
Re: program to remove duplicates Bart <bc@freeuk.com> - 2024-09-22 11:38 +0100
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 14:46 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 14:48 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 16:06 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 16:22 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 16:26 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 16:32 +0200
Re: program to remove duplicates fir <fir@grunge.pl> - 2024-09-22 16:51 +0200
Re: program to remove duplicates "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> - 2024-09-22 11:47 -0700
Re: program to remove duplicates DFS <nospam@dfs.com> - 2024-09-22 17:11 -0400
Re: program to remove duplicates Lawrence D'Oliveiro <ldo@nz.invalid> - 2024-09-22 01:28 +0000
Re: program to remove duplicates Josef Möllers <josef@invalid.invalid> - 2024-10-01 16:34 +0200
Off Topic (Was: program to remove duplicates) gazelle@shell.xmission.com (Kenny McCormack) - 2024-10-01 20:38 +0000