Groups > comp.os.linux.misc > #56526 > unrolled thread
| Started by | J Newman <jenniferkatenewman@gmail.com> |
|---|---|
| First post | 2024-06-11 14:53 +0800 |
| Last post | 2024-06-15 11:30 +0800 |
| Articles | 16 — 7 participants |
Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-11 14:53 +0800
Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-11 10:51 +0200
Re: Script to conditionally find and compress files recursively Joe Beanfish <joebeanfish@nospam.duh> - 2024-06-11 14:58 +0000
Re: Script to conditionally find and compress files recursively Grant Taylor <gtaylor@tnetconsulting.net> - 2024-06-11 22:21 -0500
Re: Script to conditionally find and compress files recursively Richard Kettlewell <invalid@invalid.invalid> - 2024-06-12 08:17 +0100
Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-12 10:13 +0200
Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-13 12:46 +0800
Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-13 11:55 +0200
Re: Script to conditionally find and compress files recursively Grant Taylor <gtaylor@tnetconsulting.net> - 2024-06-13 22:35 -0500
Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-14 11:07 +0200
Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-13 12:43 +0800
Re: Script to conditionally find and compress files recursively Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> - 2024-06-13 10:13 +0300
Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-13 11:55 +0200
Re: Script to conditionally find and compress files recursively not@telling.you.invalid (Computer Nerd Kev) - 2024-06-14 09:06 +1000
Re: Script to conditionally find and compress files recursively not@telling.you.invalid (Computer Nerd Kev) - 2024-06-14 12:25 +1000
Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-15 11:30 +0800
| From | J Newman <jenniferkatenewman@gmail.com> |
|---|---|
| Date | 2024-06-11 14:53 +0800 |
| Subject | Script to conditionally find and compress files recursively |
| Message-ID | <v48s96$u6fg$1@dont-email.me> |
Hi, I'm interested in writing a script that will:

1. Find and compress files recursively
2. After the first 5 seconds of compressing, if the compression ratio >1 (i.e. the compressed file will be larger than the uncompressed file), it tries another compression algorithm.
3. If the other compression algorithm still has a ratio >1, it tries another algorithm, until a list is exhausted.
4. If the list is exhausted, it skips compressing that file.

Any suggestions on how to proceed?
| From | D <nospam@example.net> |
|---|---|
| Date | 2024-06-11 10:51 +0200 |
| Message-ID | <2e0ae86d-ae03-5231-b2c3-1da13d22de72@example.net> |
| In reply to | #56526 |
On Tue, 11 Jun 2024, J Newman wrote:
> Hi, I'm interested in writing a script that will:
>
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1
> (i.e. the compressed file will be larger than the uncompressed file), it
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries another
> algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
>
> Any suggestions on how to proceed?

Difficult to estimate compression ratio without analyzing the entire file. In theory you could say something based on the file type, but that's the best I can come up with.
| From | Joe Beanfish <joebeanfish@nospam.duh> |
|---|---|
| Date | 2024-06-11 14:58 +0000 |
| Message-ID | <v49omf$12c3q$1@dont-email.me> |
| In reply to | #56526 |
On Tue, 11 Jun 2024 14:53:27 +0800, J Newman wrote:
> Hi, I'm interested in writing a script that will:
>
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1
> (i.e. the compressed file will be larger than the uncompressed file), it
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries
> another algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
>
> Any suggestions on how to proceed?

You could use dd to extract a representative chunk of the file to compress and compare size.

    uncompressedsize=$(dd status=none if="$file" bs=1M count=1|wc -c)
    compressedsize=$(dd status=none if="$file" bs=1M count=1|$compresscmd|wc -c)

You could get fancy and try all the compression commands you have and pick the one with smallest output for the actual compression.

That's all assuming the beginning of the file is representative of the content throughout. If it's not, no way to tell without compressing the whole thing.
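
A minimal sketch of the "try all the compression commands and pick the smallest" idea above, applied to the first 1 MiB only. The function names `sample_size` and `best_compressor` and the candidate list (gzip, bzip2, xz) are assumptions made for illustration, not something from the post, and `status=none` assumes GNU dd as in the commands above.

```bash
#!/usr/bin/env bash
# Sketch only: sample the first 1 MiB of a file and report which candidate
# compressor produced the smallest output for that sample.

sample_size() {
    # $1 = file, $2 = compressor that reads stdin and writes stdout with -c.
    # Prints the compressed size, in bytes, of the first 1 MiB of the file.
    dd status=none if="$1" bs=1M count=1 | "$2" -c | wc -c
}

best_compressor() {
    local file best best_size cmd size
    file=$1
    best=none
    # Baseline: the uncompressed sample size. A compressor has to beat it.
    best_size=$(dd status=none if="$file" bs=1M count=1 | wc -c)
    for cmd in gzip bzip2 xz; do      # hypothetical candidate list
        command -v "$cmd" >/dev/null 2>&1 || continue   # not installed
        size=$(sample_size "$file" "$cmd")
        if [ "$size" -lt "$best_size" ]; then
            best=$cmd
            best_size=$size
        fi
    done
    # Prints "none" when no candidate made the sample smaller.
    echo "$best"
}
```

Calling `best_compressor "$file"` prints one command name, or `none` if the sample did not shrink under any candidate. The same caveat applies as in the post: the answer is only as good as the first megabyte is representative.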
| From | Grant Taylor <gtaylor@tnetconsulting.net> |
|---|---|
| Date | 2024-06-11 22:21 -0500 |
| Message-ID | <v4b46s$7dh$1@tncsrv09.home.tnetconsulting.net> |
| In reply to | #56526 |
On 6/11/24 01:53, J Newman wrote:
> Any suggestions on how to proceed?

As others have said, it's very difficult to tell within the first five seconds what the ultimate compression ratio will be.

If you have the disk space, compress using all of the compression options and then remove all but the smallest file.

Then go on to the next file.

--
Grant. . . .
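
The "compress with everything, keep the smallest" approach could be sketched roughly as follows. The function name `compress_smallest`, the candidate list and the output suffixes are assumptions for illustration; `stat -c %s` assumes GNU stat. The original file is left in place, and an existing `FILE.gz`, `FILE.bz2` or `FILE.xz` would be overwritten.

```bash
#!/usr/bin/env bash
# Sketch only: compress a file with each candidate, keep the smallest result
# if it beats the original, and delete the other attempts.

compress_smallest() {
    local file best best_size cmd ext out size
    file=$1
    best=
    best_size=$(stat -c %s "$file")   # a result must be smaller than this
    for cmd in gzip bzip2 xz; do      # hypothetical candidate list
        command -v "$cmd" >/dev/null 2>&1 || continue   # not installed
        case $cmd in
            gzip)  ext=gz ;;
            bzip2) ext=bz2 ;;
            xz)    ext=xz ;;
        esac
        out=$file.$ext
        # -c writes to stdout, so the original file is never touched.
        if ! "$cmd" -c -- "$file" > "$out"; then
            rm -f -- "$out"
            continue
        fi
        size=$(stat -c %s "$out")
        if [ "$size" -lt "$best_size" ]; then
            # New winner: drop the previous best attempt, if there was one.
            if [ -n "$best" ]; then rm -f -- "$best"; fi
            best=$out
            best_size=$size
        else
            rm -f -- "$out"
        fi
    done
    # Print the path of the kept file; print nothing if no attempt was smaller.
    if [ -n "$best" ]; then echo "$best"; fi
}
```

This needs scratch space for at most two compressed copies at a time (the current best and the current attempt), which is less than keeping every attempt until the end.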
| From | Richard Kettlewell <invalid@invalid.invalid> |
|---|---|
| Date | 2024-06-12 08:17 +0100 |
| Message-ID | <wwvo7868waw.fsf@LkoBDZeT.terraraq.uk> |
| In reply to | #56534 |
Grant Taylor <gtaylor@tnetconsulting.net> writes:
> On 6/11/24 01:53, J Newman wrote:
>> Any suggestions on how to proceed?
>
> As others have said, it's very difficult to tell within the first five
> seconds what the ultimate compression ratio will be.

Not just difficult but impossible in general: the input file could change character in its second half, switching the overall result from one that is (for example) a gzip win to an xz win.

--
https://www.greenend.org.uk/rjk/
| From | D <nospam@example.net> |
|---|---|
| Date | 2024-06-12 10:13 +0200 |
| Message-ID | <083d0e35-e02d-8668-726f-7aa89980e9b2@example.net> |
| In reply to | #56537 |
On Wed, 12 Jun 2024, Richard Kettlewell wrote:
> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>> On 6/11/24 01:53, J Newman wrote:
>>> Any suggestions on how to proceed?
>>
>> As others have said, it's very difficult to tell within the first five
>> seconds what the ultimate compression ratio will be.
>
> Not just difficult but impossible in general: the input file could
> change character in its second half, switching the overall result from
> that that is (for example) a gzip win to an xz win.
>
>

This is true! The only thing I can imagine is parsing the file type, and from that file type, drawing conclusions about the compressibility of the data, or doing a flawed statistical analysis, but as said, the end could be vastly different from the start.
| From | J Newman <jenniferkatenewman@gmail.com> |
|---|---|
| Date | 2024-06-13 12:46 +0800 |
| Message-ID | <v4dtih$23kjq$2@dont-email.me> |
| In reply to | #56542 |
On 12/06/2024 16:13, D wrote:
>
>
> On Wed, 12 Jun 2024, Richard Kettlewell wrote:
>
>> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>>> On 6/11/24 01:53, J Newman wrote:
>>>> Any suggestions on how to proceed?
>>>
>>> As others have said, it's very difficult to tell within the first five
>>> seconds what the ultimate compression ratio will be.
>>
>> Not just difficult but impossible in general: the input file could
>> change character in its second half, switching the overall result from
>> that that is (for example) a gzip win to an xz win.
>>
>>
>
> This is true! The only thing I can imagine are parsing the file type,
> and from that file type, drawing conclusions about the compressability
> of the data, or doing a flawed statistical analysis, but as said, the
> end could be vastly different from the start.

OK good point...as mentioned elsewhere my experience is with compressing video files with lzma.

But if we accept that the script will make mistakes sometimes in choosing the right algorithm for compression, do you suggest parsing the file type, or trying to compress each file for the first 5 seconds, as the option with the least errors in choosing the right compression algorithm?
| From | D <nospam@example.net> |
|---|---|
| Date | 2024-06-13 11:55 +0200 |
| Message-ID | <647f0226-265e-2757-bd2a-3aa89de38107@example.net> |
| In reply to | #56557 |
On Thu, 13 Jun 2024, J Newman wrote:
> On 12/06/2024 16:13, D wrote:
>>
>>
>> On Wed, 12 Jun 2024, Richard Kettlewell wrote:
>>
>>> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>>>> On 6/11/24 01:53, J Newman wrote:
>>>>> Any suggestions on how to proceed?
>>>>
>>>> As others have said, it's very difficult to tell within the first five
>>>> seconds what the ultimate compression ratio will be.
>>>
>>> Not just difficult but impossible in general: the input file could
>>> change character in its second half, switching the overall result from
>>> that that is (for example) a gzip win to an xz win.
>>>
>>>
>>
>> This is true! The only thing I can imagine are parsing the file type, and
>> from that file type, drawing conclusions about the compressability of the
>> data, or doing a flawed statistical analysis, but as said, the end could be
>> vastly different from the start.
>
> OK good point...as mentioned elsewhere my experience is with compressing
> video files with lzma.
>
> But if we accept that the script will make mistakes sometimes in choosing the
> right algorithm for compression, do you suggest parsing the file type, or
> trying to compress each file for the first 5 seconds, as the option with the
> least errors in choosing the right compression algorithm?
>

Hmm, I'd say parsing file types first, and perhaps have a little database that maps file type to compression algorithm, and if that doesn't yield anything, proceed with "brute force".
| From | Grant Taylor <gtaylor@tnetconsulting.net> |
|---|---|
| Date | 2024-06-13 22:35 -0500 |
| Message-ID | <v4gdpu$cts$1@tncsrv09.home.tnetconsulting.net> |
| In reply to | #56562 |
On 6/13/24 04:55, D wrote:
> perhaps have a little database that maps file type to compression algorithm
case ${FILE##*.} in
txt)
#...
;;
jpg|jpeg)
# Jpeg
;;
*)
echo "unknown file type"
;;
esac
;-)
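
One way the skeleton above might be filled in, as a sketch only. The function name `pick_compressor` and the extension-to-compressor mapping are assumptions made for illustration (the thread does not give a mapping); the only idea taken from the discussion is that formats which are already compressed are skipped and unknown types fall through to brute force.

```bash
#!/usr/bin/env bash
# Sketch only: map a file name's extension to a compressor name.
# Prints "none" for formats assumed to be compressed already, and "unknown"
# when the caller should fall back to trying every compressor.
# A name without a dot is treated as its own extension, as in ${FILE##*.}.

pick_compressor() {
    local ext
    # Lowercase the extension so that JPG and jpg take the same branch.
    ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
    case $ext in
        txt|log|csv|json|xml|html)
            echo xz        # plain text (assumed mapping)
            ;;
        wav|bmp|tar)
            echo gzip      # uncompressed containers (assumed mapping)
            ;;
        jpg|jpeg|png|gif|mp3|mp4|avi|mov|mkv|gz|bz2|xz|zip|7z)
            echo none      # usually compressed already, so skip
            ;;
        *)
            echo unknown   # fall back to brute force
            ;;
    esac
}
```

A caller could then branch on the result, for example running the chosen command for a known type and trying every compressor only when the answer is `unknown`.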
| From | D <nospam@example.net> |
|---|---|
| Date | 2024-06-14 11:07 +0200 |
| Message-ID | <34907e1a-2413-bfc6-724b-f4798e73cd17@example.net> |
| In reply to | #56570 |
On Thu, 13 Jun 2024, Grant Taylor wrote:
> On 6/13/24 04:55, D wrote:
>> perhaps have a little database that maps file type to compression algorithm
>
> case ${FILE##*.} in
> txt)
> #...
> ;;
> jpg|jpeg)
> # Jpeg
> ;;
> *)
> echo "unknown file type"
> ;;
> esac
>
> ;-)
>
See.. half way there! Just cut n' paste and fill in the details. =)
| From | J Newman <jenniferkatenewman@gmail.com> |
|---|---|
| Date | 2024-06-13 12:43 +0800 |
| Message-ID | <v4dtdt$23kjq$1@dont-email.me> |
| In reply to | #56534 |
On 12/06/2024 11:21, Grant Taylor wrote:
> On 6/11/24 01:53, J Newman wrote:
>> Any suggestions on how to proceed?
>
> As others have said, it's very difficult to tell within the first five
> seconds what the ultimate compression ratio will be.
>
> If you have the disk space, compress using all of the compression
> options and then remove all but the smallest file.
>
> Then go on to the next file.
>
>
>

It's true that you cannot tell within the first 5 seconds what the ultimate compression ratio will be, but it seems to me (from compressing avi/mp4/mov files with lzma -9evv) that you can tell within +/- 5% to a high degree of confidence, what the ultimate compression ratio will be given the first 5 seconds.
| From | Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> |
|---|---|
| Date | 2024-06-13 10:13 +0300 |
| Message-ID | <sm05xudwc1b.fsf@lakka.kapsi.fi> |
| In reply to | #56556 |
J Newman <jenniferkatenewman@gmail.com> writes:

> It's true that you cannot tell within the first 5 seconds what the
> ultimate compression ratio will be, but it seems to me (from
> compressing avi/mp4/mov files with lzma -9evv) that you can tell
> within +/- 5% to a high degree of confidence, what the ultimate
> compression ratio will be given the first 5 seconds.

Well then, I believe the solution was already posted. Grab 5% of your files with dd and see how it compresses.

I'm a little curious, what kind of space savings do you expect to get by doing this? And wouldn't it make more sense to re-encode for lower bitrate if space saving is your goal?
| From | D <nospam@example.net> |
|---|---|
| Date | 2024-06-13 11:55 +0200 |
| Message-ID | <909e65ae-69f4-8619-e563-7d6565a48bc3@example.net> |
| In reply to | #56558 |
On Thu, 13 Jun 2024, Anssi Saari wrote:
> J Newman <jenniferkatenewman@gmail.com> writes:
>
>> It's true that you cannot tell within the first 5 seconds what the
>> ultimate compression ratio will be, but it seems to me (from
>> compressing avi/mp4/mov files with lzma -9evv) that you can tell
>> within +/- 5% to a high degree of confidence, what the ultimate
>> compression ratio will be given the first 5 seconds.
>
> Well then, I believe the solution was already posted. Grab 5% of your
> files with dd and see how it compresses.
>
> I'm a little curious, what kind of space savings do you expect to get by
> doing this? And wouldn't it make more sense to re-encode for lower
> bitrate if space saving is your goal?
>

If it's about space saving, don't forget deduplication. Alternatively, depending on your file system of choice, you could maybe use file system functionality to save space as well, but caveat emptor, always have off site (or off machine) backups.
| From | not@telling.you.invalid (Computer Nerd Kev) |
|---|---|
| Date | 2024-06-14 09:06 +1000 |
| Message-ID | <666b7b6c@news.ausics.net> |
| In reply to | #56558 |
Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
> J Newman <jenniferkatenewman@gmail.com> writes:
>
>> It's true that you cannot tell within the first 5 seconds what the
>> ultimate compression ratio will be, but it seems to me (from
>> compressing avi/mp4/mov files with lzma -9evv) that you can tell
>> within +/- 5% to a high degree of confidence, what the ultimate
>> compression ratio will be given the first 5 seconds.
>
> Well then, I believe the solution was already posted. Grab 5% of your
> files with dd and see how it compresses.

The solution that I see grabs the first 1MB, but it would make more sense to sample eg. 1% of the file size in five places within the file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab one 1MB sample from the start of the file then four more at an offset that increments by 20MB each time. Store these separately, compress them separately, then average the compression ratio of all the samples.

> I'm a little curious, what kind of space savings do you expect to get by
> doing this? And wouldn't it make more sense to re-encode for lower
> bitrate if space saving is your goal?

Maybe he's using lossless video compression? Otherwise yes it seems like the wrong approach.

--
__ __ #_ < |\| |< _#
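
The five-sample idea above might look something like this. It is a sketch under assumptions: the function name `sampled_ratio_percent` is made up for illustration, `stat -c %s` assumes GNU stat, and using the sample size as the dd block size is a simplification that suits files in the 100MB range from the example but means one block of 1% of the file is held in memory. Because all five samples are the same size, dividing the summed compressed sizes by the summed sample sizes gives the same number as averaging the five ratios.

```bash
#!/usr/bin/env bash
# Sketch only: estimate how well a file compresses from five samples of
# about 1% each, taken at 0%, 20%, 40%, 60% and 80% of the file.

sampled_ratio_percent() {
    # $1 = file, $2 = compressor that reads stdin and writes stdout with -c.
    # Prints compressed size as a percentage of sample size (100 = no gain).
    local file cmd size chunk i n in_total out_total
    file=$1
    cmd=$2
    in_total=0
    out_total=0
    size=$(stat -c %s "$file")
    [ "$size" -ge 100 ] || return 1   # too small to sample meaningfully
    chunk=$(( size / 100 ))           # one sample is about 1% of the file
    for i in 0 1 2 3 4; do
        # With bs=chunk, skipping i*20 blocks lands at about i*20% of the
        # file; each sample is compressed on its own.
        n=$(dd if="$file" bs="$chunk" skip=$(( i * 20 )) count=1 2>/dev/null \
            | "$cmd" -c | wc -c)
        in_total=$(( in_total + chunk ))
        out_total=$(( out_total + n ))
    done
    echo $(( out_total * 100 / in_total ))
}
```

For example, `sampled_ratio_percent video.avi xz` prints an integer; a value of 100 or more suggests the file is not worth compressing with that command. The file name here is only a placeholder.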
| From | not@telling.you.invalid (Computer Nerd Kev) |
|---|---|
| Date | 2024-06-14 12:25 +1000 |
| Message-ID | <666baa01@news.ausics.net> |
| In reply to | #56567 |
Computer Nerd Kev <not@telling.you.invalid> wrote:
> Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
>>
>> Well then, I believe the solution was already posted. Grab 5% of your
>> files with dd and see how it compresses.
>
> The solution that I see grabs the first 1MB, but it would make more
> sense to sample eg. 1% of the file size in five places within the
> file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab
> one 1MB sample from the start of the file then four more at an
> offset that increments by 20MB each time. Store these separately,
> compress them separately, then average the compression ratio of all
> the samples.

Also for some types of data (if it's not all video), like text, some more advanced compressors build a dictionary to better compress larger files. But this requires a minimum file size, so the small samples might not represent the compression ratio of the whole file with a dictionary included. A solution is to pre-generate a dictionary based on a collection of the same type of files you're compressing, then you could compress the small samples using that dictionary and get a more accurate result.

--
__ __ #_ < |\| |< _#
| From | J Newman <jenniferkatenewman@gmail.com> |
|---|---|
| Date | 2024-06-15 11:30 +0800 |
| Message-ID | <v4j1sp$39rdv$1@dont-email.me> |
| In reply to | #56526 |
On 11/06/2024 14:53, J Newman wrote:
> Hi, I'm interested in writing a script that will:
>
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1
> (i.e. the compressed file will be larger than the uncompressed file), it
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries
> another algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
>
> Any suggestions on how to proceed?
This is the script ChatGPT gives. After some thought, I decided to just
go with one compression algorithm for simplicity, and just not compress
the files if the compression ratio >1.
#!/bin/bash
# Compress each file with lzma and keep the result only if it is smaller
# than the original (compression ratio < 1).
compress_file() {
    local file=$1
    local orig_size lzma_size
    orig_size=$(stat --printf="%s" "$file")
    # Compress with lzma; -c writes to stdout, so the original is left in place
    if ! lzma -z -k -c -- "$file" > "$file.lzma"; then
        rm -f -- "$file.lzma"
        echo "lzma failed for $file" >&2
        return 1
    fi
    lzma_size=$(stat --printf="%s" "$file.lzma")
    # If the lzma compressed file is smaller than the original, keep it
    if (( lzma_size < orig_size )); then
        mv -- "$file.lzma" "$file.compressed"
        echo "File compressed using lzma: $file -> $file.compressed"
    else
        rm -f -- "$file.lzma"
        echo "No compression applied for $file as the compressed size was not smaller than the original."
    fi
}
# Export the function so it's available to the bash started by find -exec
export -f compress_file
# Recursively find all files and compress them. Skip *.lzma and *.compressed
# so that output created during the walk is not compressed again.
find . -type f ! -name '*.lzma' ! -name '*.compressed' \
    -exec bash -c 'compress_file "$1"' _ {} \;
echo "Compression process complete."