Script to conditionally find and compress files recursively

Started by	J Newman <jenniferkatenewman@gmail.com>
First post	2024-06-11 14:53 +0800
Last post	2024-06-15 11:30 +0800
Articles	16 — 7 participants

  Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-11 14:53 +0800
    Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-11 10:51 +0200
    Re: Script to conditionally find and compress files recursively Joe Beanfish <joebeanfish@nospam.duh> - 2024-06-11 14:58 +0000
    Re: Script to conditionally find and compress files recursively Grant Taylor <gtaylor@tnetconsulting.net> - 2024-06-11 22:21 -0500
      Re: Script to conditionally find and compress files recursively Richard Kettlewell <invalid@invalid.invalid> - 2024-06-12 08:17 +0100
        Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-12 10:13 +0200
          Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-13 12:46 +0800
            Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-13 11:55 +0200
              Re: Script to conditionally find and compress files recursively Grant Taylor <gtaylor@tnetconsulting.net> - 2024-06-13 22:35 -0500
                Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-14 11:07 +0200
      Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-13 12:43 +0800
        Re: Script to conditionally find and compress files recursively Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> - 2024-06-13 10:13 +0300
          Re: Script to conditionally find and compress files recursively D <nospam@example.net> - 2024-06-13 11:55 +0200
          Re: Script to conditionally find and compress files recursively not@telling.you.invalid (Computer Nerd Kev) - 2024-06-14 09:06 +1000
            Re: Script to conditionally find and compress files recursively not@telling.you.invalid (Computer Nerd Kev) - 2024-06-14 12:25 +1000
    Re: Script to conditionally find and compress files recursively J Newman <jenniferkatenewman@gmail.com> - 2024-06-15 11:30 +0800

#56526 — Script to conditionally find and compress files recursively

From	J Newman <jenniferkatenewman@gmail.com>
Date	2024-06-11 14:53 +0800
Subject	Script to conditionally find and compress files recursively
Message-ID	<v48s96$u6fg$1@dont-email.me>

Hi, I'm interested in writing a script that will:

1. Find and compress files recursively
2. After the first 5 seconds of compressing, if the compression ratio >1 
(i.e. the compressed file will be larger than the uncompressed file), it 
tries another compression algorithm.
3. If the other compression algorithm still has a ratio >1, it tries 
another algorithm, until a list is exhausted.
4. If the list is exhausted, it skips compressing that file.

Any suggestions on how to proceed?

#56528

From	D <nospam@example.net>
Date	2024-06-11 10:51 +0200
Message-ID	<2e0ae86d-ae03-5231-b2c3-1da13d22de72@example.net>
In reply to	#56526


On Tue, 11 Jun 2024, J Newman wrote:

> Hi, I'm interested in writing a script that will:
>
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1 
> (i.e. the compressed file will be larger than the uncompressed file), it 
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries another 
> algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
>
> Any suggestions on how to proceed?
>

Difficult to estimate compression ratio without analyzing the entire file. 
In theory you could say something based on the file type, but that's the 
best I can come up with.

#56529

From	Joe Beanfish <joebeanfish@nospam.duh>
Date	2024-06-11 14:58 +0000
Message-ID	<v49omf$12c3q$1@dont-email.me>
In reply to	#56526

On Tue, 11 Jun 2024 14:53:27 +0800, J Newman wrote:

> Hi, I'm interested in writing a script that will:
> 
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1 
> (i.e. the compressed file will be larger than the uncompressed file), it 
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries 
> another algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
> 
> Any suggestions on how to proceed?

You could use dd to extract a representative chunk of the file to
compress and compare size.

uncompressedsize=$(dd status=none if="$file" bs=1M count=1|wc -c)
compressedsize=$(dd status=none if="$file" bs=1M count=1|$compresscmd|wc -c)

You could get fancy and try all the compression commands you have
and pick the one with smallest output for the actual compression.
That's all assuming the beginning of the file is representative of
the content throughout. If it's not, no way to tell without compressing
the whole thing.
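
A minimal sketch of the "try them all on a sample" idea, assuming gzip, bzip2 and xz as the candidate list (the list and the function name are illustrative, not from the post). It compresses only the first 1 MiB, so it inherits the caveat above about the start of the file being representative.

```shell
#!/bin/bash
# Sketch: try each candidate compressor on the first 1 MiB of a file and
# print the one with the smallest output, or "none" if nothing shrank it.
# The candidate list and the function name are illustrative assumptions.

best_compressor_for_sample() {
    local file=$1
    local candidates=(gzip bzip2 xz)
    local best=none best_size size cmd

    # Size of the uncompressed sample (less than 1 MiB for short files).
    best_size=$(dd if="$file" bs=1048576 count=1 2>/dev/null | wc -c)

    for cmd in "${candidates[@]}"; do
        # Skip tools that are not installed.
        command -v "$cmd" >/dev/null 2>&1 || continue
        size=$(dd if="$file" bs=1048576 count=1 2>/dev/null | "$cmd" -c | wc -c)
        if (( size < best_size )); then
            best=$cmd
            best_size=$size
        fi
    done

    printf '%s\n' "$best"
}
```

Called as `best_compressor_for_sample somefile`, it prints a tool name that the caller can then run on the whole file, or "none" to signal that the file should be skipped.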

#56534

From	Grant Taylor <gtaylor@tnetconsulting.net>
Date	2024-06-11 22:21 -0500
Message-ID	<v4b46s$7dh$1@tncsrv09.home.tnetconsulting.net>
In reply to	#56526

On 6/11/24 01:53, J Newman wrote:
> Any suggestions on how to proceed?

As others have said, it's very difficult to tell within the first five 
seconds what the ultimate compression ratio will be.

If you have the disk space, compress using all of the compression 
options and then remove all but the smallest file.

Then go on to the next file.
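
A rough sketch of that approach, assuming gzip, bzip2 and xz are the tools to compare (the list, the output suffixes and the function name are assumptions). It never deletes the original file, and it keeps no compressed copy at all when none of them is smaller than the original.

```shell
#!/bin/bash
# Sketch: compress one file with every available tool, keep only the
# smallest result, and keep nothing if no tool beats the original size.
# The tool list is an assumption; the original file is left untouched.
# Note: an existing "$file.gz" (etc.) would be overwritten.

keep_smallest() {
    local file=$1
    local tools=(gzip:gz bzip2:bz2 xz:xz)    # command:suffix pairs
    local entry cmd ext out size best_out="" best_size

    best_size=$(wc -c < "$file")

    for entry in "${tools[@]}"; do
        cmd=${entry%%:*}
        ext=${entry##*:}
        command -v "$cmd" >/dev/null 2>&1 || continue
        out=$file.$ext
        # Read from stdin so that odd file names cannot be taken as options.
        "$cmd" -c < "$file" > "$out" || { rm -f -- "$out"; continue; }
        size=$(wc -c < "$out")
        if (( size < best_size )); then
            # New winner: drop the previous one, if any.
            [[ -n $best_out ]] && rm -f -- "$best_out"
            best_out=$out
            best_size=$size
        else
            rm -f -- "$out"
        fi
    done

    # Print the surviving file name, or nothing if compression never helped.
    [[ -n $best_out ]] && printf '%s\n' "$best_out"
    return 0
}
```

This needs temporary disk space for at most two compressed copies at a time, rather than one per tool.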



-- 
Grant. . . .

#56537

From	Richard Kettlewell <invalid@invalid.invalid>
Date	2024-06-12 08:17 +0100
Message-ID	<wwvo7868waw.fsf@LkoBDZeT.terraraq.uk>
In reply to	#56534

Grant Taylor <gtaylor@tnetconsulting.net> writes:
> On 6/11/24 01:53, J Newman wrote:
>> Any suggestions on how to proceed?
>
> As others have said, it's very difficult to tell within the first five
> seconds what the ultimate compression ratio will be.

Not just difficult but impossible in general: the input file could
change character in its second half, switching the overall result from
(for example) a gzip win to an xz win.

-- 
https://www.greenend.org.uk/rjk/

#56542

From	D <nospam@example.net>
Date	2024-06-12 10:13 +0200
Message-ID	<083d0e35-e02d-8668-726f-7aa89980e9b2@example.net>
In reply to	#56537

On Wed, 12 Jun 2024, Richard Kettlewell wrote:

> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>> On 6/11/24 01:53, J Newman wrote:
>>> Any suggestions on how to proceed?
>>
>> As others have said, it's very difficult to tell within the first five
>> seconds what the ultimate compression ratio will be.
>
> Not just difficult but impossible in general: the input file could
> change character in its second half, switching the overall result from
> (for example) a gzip win to an xz win.
>
>

This is true! The only things I can imagine are parsing the file type and 
drawing conclusions from it about the compressibility of the data, or 
doing a flawed statistical analysis. But as said, the end could be vastly 
different from the start.

#56557

From	J Newman <jenniferkatenewman@gmail.com>
Date	2024-06-13 12:46 +0800
Message-ID	<v4dtih$23kjq$2@dont-email.me>
In reply to	#56542

On 12/06/2024 16:13, D wrote:
> 
> 
> On Wed, 12 Jun 2024, Richard Kettlewell wrote:
> 
>> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>>> On 6/11/24 01:53, J Newman wrote:
>>>> Any suggestions on how to proceed?
>>>
>>> As others have said, it's very difficult to tell within the first five
>>> seconds what the ultimate compression ratio will be.
>>
>> Not just difficult but impossible in general: the input file could
>> change character in its second half, switching the overall result from
>> (for example) a gzip win to an xz win.
>>
>>
> 
> This is true! The only things I can imagine are parsing the file type and 
> drawing conclusions from it about the compressibility of the data, or 
> doing a flawed statistical analysis. But as said, the end could be vastly 
> different from the start.

OK good point...as mentioned elsewhere my experience is with compressing 
video files with lzma.

But if we accept that the script will sometimes choose the wrong 
compression algorithm, which do you suggest as the option with the fewest 
errors: parsing the file type, or trying to compress each file for the 
first 5 seconds?

#56562

From	D <nospam@example.net>
Date	2024-06-13 11:55 +0200
Message-ID	<647f0226-265e-2757-bd2a-3aa89de38107@example.net>
In reply to	#56557


On Thu, 13 Jun 2024, J Newman wrote:

> On 12/06/2024 16:13, D wrote:
>> 
>> 
>> On Wed, 12 Jun 2024, Richard Kettlewell wrote:
>> 
>>> Grant Taylor <gtaylor@tnetconsulting.net> writes:
>>>> On 6/11/24 01:53, J Newman wrote:
>>>>> Any suggestions on how to proceed?
>>>> 
>>>> As others have said, it's very difficult to tell within the first five
>>>> seconds what the ultimate compression ratio will be.
>>> 
>>> Not just difficult but impossible in general: the input file could
>>> change character in its second half, switching the overall result from
>>> (for example) a gzip win to an xz win.
>>> 
>>> 
>> 
>> This is true! The only things I can imagine are parsing the file type and 
>> drawing conclusions from it about the compressibility of the data, or 
>> doing a flawed statistical analysis. But as said, the end could be vastly 
>> different from the start.
>
> OK good point...as mentioned elsewhere my experience is with compressing 
> video files with lzma.
>
> But if we accept that the script will make mistakes sometimes in choosing the 
> right algorithm for compression, do you suggest parsing the file type, or 
> trying to compress each file for the first 5 seconds, as the option with the 
> least errors in choosing the right compression algorithm?
>

Hmm, I'd say parsing file types first, and perhaps have a little database 
that maps file type to compression algorithm, and if that doesn't yield 
anything, proceed with "brute force".

#56570

From	Grant Taylor <gtaylor@tnetconsulting.net>
Date	2024-06-13 22:35 -0500
Message-ID	<v4gdpu$cts$1@tncsrv09.home.tnetconsulting.net>
In reply to	#56562

On 6/13/24 04:55, D wrote:
> perhaps have a little database that maps file type to compression algorithm

case ${FILE##*.} in
	txt)
		#...
		;;
	jpg|jpeg)
		# Jpeg
		;;
	*)
		echo "unknown file type"
		;;
esac

;-)

#56575

From	D <nospam@example.net>
Date	2024-06-14 11:07 +0200
Message-ID	<34907e1a-2413-bfc6-724b-f4798e73cd17@example.net>
In reply to	#56570


On Thu, 13 Jun 2024, Grant Taylor wrote:

> On 6/13/24 04:55, D wrote:
>> perhaps have a little database that maps file type to compression algorithm
>
> case ${FILE##*.} in
> 	txt)
> 		#...
> 		;;
> 	jpg|jpeg)
> 		# Jpeg
> 		;;
> 	*)
> 		echo "unknown file type"
> 		;;
> esac
>
> ;-)
>

See.. half way there! Just cut n' paste and fill in the details. =)
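
One way the details might be filled in, as a sketch only: the extension-to-tool pairings below are illustrative assumptions, not recommendations made in the thread, and the function name is hypothetical. Unknown types are reported as "unknown" so the caller can fall back to brute force.

```shell
#!/bin/bash
# Sketch: map a file extension to a compression choice, in the spirit of
# the case statement above. The pairings are illustrative assumptions.

choose_compressor() {
    local file=$1
    local name=${file##*/}      # strip leading directories
    local ext=${name##*.}       # text after the last dot

    # A name without a dot has no extension to go on.
    [[ $name == *.* ]] || ext=""

    # Lowercase so that .JPG and .jpg match the same pattern.
    ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')

    case $ext in
        txt|log|csv|xml|json)
            echo xz ;;          # plain text usually compresses well
        jpg|jpeg|png|gz|bz2|xz|zip)
            echo none ;;        # already compressed, so skip
        *)
            echo unknown ;;     # caller falls back to trying each tool
    esac
}
```

Matching on the extension is cheap but trusts the file name; a content-based check would be more reliable at the cost of reading each file.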

#56556

From	J Newman <jenniferkatenewman@gmail.com>
Date	2024-06-13 12:43 +0800
Message-ID	<v4dtdt$23kjq$1@dont-email.me>
In reply to	#56534

On 12/06/2024 11:21, Grant Taylor wrote:
> On 6/11/24 01:53, J Newman wrote:
>> Any suggestions on how to proceed?
> 
> As others have said, it's very difficult to tell within the first five 
> seconds what the ultimate compression ratio will be.
> 
> If you have the disk space, compress using all of the compression 
> options and then remove all but the smallest file.
> 
> Then go on to the next file.
> 
> 
> 

It's true that you cannot tell within the first 5 seconds what the 
ultimate compression ratio will be, but it seems to me (from compressing 
avi/mp4/mov files with lzma -9evv) that you can tell within +/- 5% to a 
high degree of confidence, what the ultimate compression ratio will be 
given the first 5 seconds.

#56558

From	Anssi Saari <anssi.saari@usenet.mail.kapsi.fi>
Date	2024-06-13 10:13 +0300
Message-ID	<sm05xudwc1b.fsf@lakka.kapsi.fi>
In reply to	#56556

J Newman <jenniferkatenewman@gmail.com> writes:

> It's true that you cannot tell within the first 5 seconds what the
> ultimate compression ratio will be, but it seems to me (from
> compressing avi/mp4/mov files with lzma -9evv) that you can tell
> within +/- 5% to a high degree of confidence, what the ultimate
> compression ratio will be given the first 5 seconds.

Well then, I believe the solution was already posted. Grab 5% of your
files with dd and see how it compresses. 

I'm a little curious, what kind of space savings do you expect to get by
doing this? And wouldn't it make more sense to re-encode for lower
bitrate if space saving is your goal?

#56561

From	D <nospam@example.net>
Date	2024-06-13 11:55 +0200
Message-ID	<909e65ae-69f4-8619-e563-7d6565a48bc3@example.net>
In reply to	#56558


On Thu, 13 Jun 2024, Anssi Saari wrote:

> J Newman <jenniferkatenewman@gmail.com> writes:
>
>> It's true that you cannot tell within the first 5 seconds what the
>> ultimate compression ratio will be, but it seems to me (from
>> compressing avi/mp4/mov files with lzma -9evv) that you can tell
>> within +/- 5% to a high degree of confidence, what the ultimate
>> compression ratio will be given the first 5 seconds.
>
> Well then, I believe the solution was already posted. Grab 5% of your
> files with dd and see how it compresses.
>
> I'm a little curious, what kind of space savings do you expect to get by
> doing this? And wouldn't it make more sense to re-encode for lower
> bitrate if space saving is your goal?
>

If it's about saving space, don't forget deduplication. Alternatively, 
depending on your file system of choice, you could use file system 
functionality to save space as well. But caveat emptor: always have 
off-site (or off-machine) backups.

#56567

From	not@telling.you.invalid (Computer Nerd Kev)
Date	2024-06-14 09:06 +1000
Message-ID	<666b7b6c@news.ausics.net>
In reply to	#56558

Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
> J Newman <jenniferkatenewman@gmail.com> writes:
> 
>> It's true that you cannot tell within the first 5 seconds what the
>> ultimate compression ratio will be, but it seems to me (from
>> compressing avi/mp4/mov files with lzma -9evv) that you can tell
>> within +/- 5% to a high degree of confidence, what the ultimate
>> compression ratio will be given the first 5 seconds.
> 
> Well then, I believe the solution was already posted. Grab 5% of your
> files with dd and see how it compresses. 

The solution that I see grabs the first 1MB, but it would make more
sense to sample, e.g., 1% of the file size in five places within the
file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab
one 1MB sample from the start of the file then four more at an
offset that increments by 20MB each time. Store these separately,
compress them separately, then average the compression ratio of all
the samples.
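
A sketch of that sampling scheme, under a few assumptions: the function name is hypothetical, xz is only a default tool, and the result is reported as compressed size in percent of the sampled input. Because all five samples are the same size, pooling their byte counts gives the same figure as averaging the five ratios.

```shell
#!/bin/bash
# Sketch: estimate a file's compression ratio from five samples spread
# through the file instead of one sample from the start.
# Sample count (5) and size (1% of the file) follow the description above;
# the function name and the default tool (xz) are assumptions.

sampled_ratio_percent() {
    local file=$1 cmd=${2:-xz}
    local size sample i in_bytes out_bytes total_in=0 total_out=0

    size=$(wc -c < "$file")
    sample=$(( size / 100 ))            # each sample is 1% of the file
    if (( sample == 0 )); then
        echo "file too small to sample: $file" >&2
        return 1
    fi

    for i in 0 1 2 3 4; do
        # With bs equal to the sample size, skip counts whole samples, so
        # skipping i*20 blocks starts the samples at 0%, 20%, 40%, 60%, 80%.
        in_bytes=$(dd if="$file" bs="$sample" skip=$(( i * 20 )) count=1 2>/dev/null | wc -c)
        out_bytes=$(dd if="$file" bs="$sample" skip=$(( i * 20 )) count=1 2>/dev/null | "$cmd" -c | wc -c)
        total_in=$(( total_in + in_bytes ))
        total_out=$(( total_out + out_bytes ))
    done

    # Compressed size as a percentage of the sampled input; 100 or more
    # means the samples did not shrink.
    echo $(( total_out * 100 / total_in ))
}
```

Each sample is compressed on its own, so fixed per-stream overhead is counted five times; for very small files this makes the estimate pessimistic.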

> I'm a little curious, what kind of space savings do you expect to get by
> doing this? And wouldn't it make more sense to re-encode for lower
> bitrate if space saving is your goal?

Maybe he's using lossless video compression? Otherwise yes it seems
like the wrong approach.

-- 
__          __
#_ < |\| |< _#

#56569

From	not@telling.you.invalid (Computer Nerd Kev)
Date	2024-06-14 12:25 +1000
Message-ID	<666baa01@news.ausics.net>
In reply to	#56567

Computer Nerd Kev <not@telling.you.invalid> wrote:
> Anssi Saari <anssi.saari@usenet.mail.kapsi.fi> wrote:
>> 
>> Well then, I believe the solution was already posted. Grab 5% of your
>> files with dd and see how it compresses. 
> 
> The solution that I see grabs the first 1MB, but it would make more
> sense to sample, e.g., 1% of the file size in five places within the
> file. 100MB file = 1MB sample, 100MB/5 = 20MB, so use dd to grab
> one 1MB sample from the start of the file then four more at an
> offset that increments by 20MB each time. Store these separately,
> compress them separately, then average the compression ratio of all
> the samples.

Also for some types of data (if it's not all video), like text, some
more advanced compressors build a dictionary to better compress
larger files. But this requires a minimum file size, so the small
samples might not represent the compression ratio of the whole file
with a dictionary included. A solution is to pre-generate a
dictionary based on a collection of the same type of files you're
compressing, then you could compress the small samples using that
dictionary and get a more accurate result.

-- 
__          __
#_ < |\| |< _#

#56591

From	J Newman <jenniferkatenewman@gmail.com>
Date	2024-06-15 11:30 +0800
Message-ID	<v4j1sp$39rdv$1@dont-email.me>
In reply to	#56526

On 11/06/2024 14:53, J Newman wrote:
> Hi, I'm interested in writing a script that will:
> 
> 1. Find and compress files recursively
> 2. After the first 5 seconds of compressing, if the compression ratio >1 
> (i.e. the compressed file will be larger than the uncompressed file), it 
> tries another compression algorithm.
> 3. If the other compression algorithm still has a ratio >1, it tries 
> another algorithm, until a list is exhausted.
> 4. If the list is exhausted, it skips compressing that file.
> 
> Any suggestions on how to proceed?


This is the script ChatGPT gives. After some thought, I decided to just 
go with one compression algorithm for simplicity, and just not compress 
the files if the compression ratio >1.

#!/bin/bash

# Compress a file with lzma and keep the result only if it is smaller
# than the original (compression ratio < 1).
compress_file() {
     local file=$1
     local orig_size lzma_size

     orig_size=$(stat --printf="%s" "$file")

     # Compress with lzma; -c writes to stdout, so the original is left in place
     if ! lzma -z -k -c "$file" > "$file.lzma"; then
         rm -f "$file.lzma"
         echo "Compression failed for $file" >&2
         return 1
     fi
     lzma_size=$(stat --printf="%s" "$file.lzma")

     # If the lzma compressed file is smaller than the original, keep it
     if (( lzma_size < orig_size )); then
         mv "$file.lzma" "$file.compressed"
         echo "File compressed using lzma: $file -> $file.compressed"
     else
         rm -f "$file.lzma"
         echo "No compression applied for $file as the compressed size was not smaller than the original."
     fi
}

# Export the function so it's available to the bash started by find -exec
export -f compress_file

# Recursively find all files and compress them, skipping output from
# earlier runs. The file name is passed as $1 (_ is a placeholder for $0).
find . -type f ! -name '*.lzma' ! -name '*.compressed' \
     -exec bash -c 'compress_file "$1"' _ {} \;

echo "Compression process complete."
