Groups > comp.compilers > #2585 > unrolled thread

Bit swizzling

Started by	"Rick C. Hodgin" <rick.c.hodgin@gmail.com>
First post	2020-09-05 12:05 -0400
Last post	2020-09-10 17:13 -0400
Articles	12 — 9 participants

Back to article view | Back to comp.compilers

  Bit swizzling "Rick C. Hodgin" <rick.c.hodgin@gmail.com> - 2020-09-05 12:05 -0400
    Re: Bit swizzling Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-09-05 19:38 +0200
    Re: Bit Swizzling "John Levine" <johnl@taugh.com> - 2020-09-05 18:50 +0000
      Re: Bit Swizzling Chris <xxx.syseng.yyy@gfsys.co.uk> - 2020-09-06 17:48 +0100
    Re: Bit swizzling Kaz Kylheku <793-849-0957@kylheku.com> - 2020-09-05 20:15 +0000
    Re: Bit swizzling "davidl...@gmail.com" <davidlovemore@gmail.com> - 2020-09-06 00:31 -0700
    Re: Bit swizzling Martin Ward <martin@gkc.org.uk> - 2020-09-07 12:08 +0100
      Re: Bit swizzling Tom Crick <tomcrick@gmail.com> - 2020-09-08 09:35 +0100
    Re: Bit swizzling "Rick C. Hodgin" <rick.c.hodgin@gmail.com> - 2020-09-07 10:55 -0400
      Re: Bit swizzling Hans-Peter Diettrich <DrDiettrich1@netscape.net> - 2020-09-08 01:45 +0200
    Re: Bit swizzling gah4 <gah4@u.washington.edu> - 2020-09-10 10:34 -0700
      Re: Bit swizzling "Rick C. Hodgin" <rick.c.hodgin@gmail.com> - 2020-09-10 17:13 -0400

#2585 — Bit swizzling

From	"Rick C. Hodgin" <rick.c.hodgin@gmail.com>
Date	2020-09-05 12:05 -0400
Subject	Bit swizzling
Message-ID	<20-09-014@comp.compilers>

Are there any algorithms which take a known-at-compile-time sequence
of bitwise operations on an 8-bit to 64-bit quantity, and optimize
them down to their minimal set of operations?

For example, if I have an 8-bit byte and I want to swizzle the bits
thusly:

     Input:   07 06 05 04 03 02 01 00
    Output:   05 04 07 02 01 03 00 06

I can easily swizzle the data using a brute force method:

     v = get_value();
     o = 0;
     swizzle1(o, v, 0, 6);
     swizzle1(o, v, 1, 0);
     swizzle1(o, v, 2, 3);
     swizzle1(o, v, 3, 1);
     swizzle1(o, v, 4, 2);
     swizzle1(o, v, 5, 7);
     swizzle1(o, v, 6, 4);
     swizzle1(o, v, 7, 5);

     // Untested, off the top of my head
     void swizzle(unsigned char& o, unsigned char v, int s, int d)
     {
         o |= (((v & (1 << s)) >> s) << d);
     }

And, of course, that algorithm can be optimized based on relative
values of s and d, and if s is 0, etc.

However, there also exists a minimal set of steps to complete that
swizzling because in the operation above, bits 5 and 4 are used in
sequence.  It could be done using a swizzle2() operation, for example
and handle 2 bits at a time.

In addition, given the bit operator abilities that exist on various
CPUs there are potentially other combinations that exist behind an
operation, such as bitswap, where the order of bits flips or mirrors
across the center position.

Are there any existing algorithms which examine the operations that
must be conducted and then create an optimized / minimal sequence of
mechanical steps to conduct it given a constrained set of features
(such as those present on a given CPU)?

Thank you in advance.

--
Rick C. Hodgin

[toc] | [next] | [standalone]

#2586

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-09-05 19:38 +0200
Message-ID	<20-09-015@comp.compilers>
In reply to	#2585

Am 05.09.2020 um 18:05 schrieb Rick C. Hodgin:
> Are there any algorithms which take a known-at-compile-time sequence
> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
> them down to their minimal set of operations?
>
> For example, if I have an 8-bit byte and I want to swizzle the bits
> thusly:
>
>      Input:   07 06 05 04 03 02 01 00
>     Output:   05 04 07 02 01 03 00 06

In your 8-bit case an array of outputs is sufficient, indexed by the input.


> In addition, given the bit operator abilities that exist on various
> CPUs there are potentially other combinations that exist behind an
> operation, such as bitswap, where the order of bits flips or mirrors
> across the center position.

Somebody must build the tables for those operations, distinct for each
CPU, then write code to make use of these tables. I don't think that it
has been done yet, except perhaps for the basic operations (AND, OR...)

> Are there any existing algorithms which examine the operations that
> must be conducted and then create an optimized / minimal sequence of
> mechanical steps to conduct it given a constrained set of features
> (such as those present on a given CPU)?

I often used Quine-McCluskey to minimize my logic circuits.

In most applications a toggled bit in the outputs invalidates all prior
computation optimization attempts, and everything has to be analyzed and
constructed again.

In general a hash code of the inputs could be used to index an array of
outputs. That code is insensitive to changes of output bits, only the
affected array element has to be uptdated then.

DoDi

[toc] | [prev] | [next] | [standalone]

#2587

From	"John Levine" <johnl@taugh.com>
Date	2020-09-05 18:50 +0000
Message-ID	<20-09-016@comp.compilers>
In reply to	#2585

In article <rivvah$1neg$2@gioia.aioe.org>,
>> -----
>> Are there any algorithms which take a known-at-compile-time sequence
>> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
>> them down to their minimal set of operations?

>Why not just use a lookup table ?. Minimum ops and fast...

Assuming you're looking for something you can implement in logic
rather than by table lookup, it sounds like a set of Karnaugh maps.

--
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly

[toc] | [prev] | [next] | [standalone]

#2590

From	Chris <xxx.syseng.yyy@gfsys.co.uk>
Date	2020-09-06 17:48 +0100
Message-ID	<20-09-019@comp.compilers>
In reply to	#2587

On 09/05/20 19:50, John Levine wrote:
> In article<rivvah$1neg$2@gioia.aioe.org>,
>>> -----
>>> Are there any algorithms which take a known-at-compile-time sequence
>>> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
>>> them down to their minimal set of operations?
>
>> Why not just use a lookup table ?. Minimum ops and fast...
>
> Assuming you're looking for something you can implement in logic
> rather than by table lookup, it sounds like a set of Karnaugh maps.

Unless this is just an intellectual exercise for the fun of it, an
engineer would choose the minimal design at lowest cost to
satisfy the requirements. Table methods don't have to be in
software as a single eprom or gate array could do it in hardware.
8 inputs to address lines, then 8 bits of output, scale as required,
so why make life more difficult than necessary ?.

Old project here, where a programmer spent nearly 5 pages of 6800
asm to translate an input connect pin layout to that required for
the internal functions. Code was impenetrable, so substituted a
256 byte lookup table. Less space that the code and easily
modified for new requirements...

Chris
[I think the question is whether there is a way to mechanically
generate a version of the opaque assembler. -John]

[toc] | [prev] | [next] | [standalone]

#2588

From	Kaz Kylheku <793-849-0957@kylheku.com>
Date	2020-09-05 20:15 +0000
Message-ID	<20-09-017@comp.compilers>
In reply to	#2585

On 2020-09-05, Rick C. Hodgin <rick.c.hodgin@gmail.com> wrote:
> Are there any existing algorithms which examine the operations that
> must be conducted and then create an optimized / minimal sequence of
> mechanical steps to conduct it given a constrained set of features
> (such as those present on a given CPU)?

For mapping 8 bit quantities to other 8 bit quantities, you can always
use a 256 byte look up table.

Of course, it's not practical for larger spaces. Still, it may be possibel to
identify the possibility of involving multiple smaller look-up tables.

[toc] | [prev] | [next] | [standalone]

#2589

From	"davidl...@gmail.com" <davidlovemore@gmail.com>
Date	2020-09-06 00:31 -0700
Message-ID	<20-09-018@comp.compilers>
In reply to	#2585

On Saturday, September 5, 2020 at 5:45:43 PM UTC+1, Rick C. Hodgin wrote:
> Are there any algorithms which take a known-at-compile-time sequence
> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
> them down to their minimal set of operations?

There's some good resources on bit permutations including this online calculator here:

http://programming.sirrida.de/calcperm.php

For:
> Input: 07 06 05 04 03 02 01 00
> Output: 05 04 07 02 01 03 00 06

It gives:

x = bit_permute_step_simple(x, 0x55555555, 1);  // Bit index complement 0
x = bit_permute_step_simple(x, 0x33333333, 2);  // Bit index complement 1
x = bit_permute_step_simple(x, 0x0f0f0f0f, 4);  // Bit index complement 2

where

t_bits bit_permute_step_simple(t_bits x, t_bits m, t_uint shift) {
    return ((x & m) << shift) | ((x >> shift) & m);
    }

Approx 12 cycles on superscalar processor.

Source code for a more sophisticated permutation code generator is also available linked from the page.

Some notes here on the online generator:

http://programming.sirrida.de/bit_perm.html#calculator

[toc] | [prev] | [next] | [standalone]

#2591

From	Martin Ward <martin@gkc.org.uk>
Date	2020-09-07 12:08 +0100
Message-ID	<20-09-020@comp.compilers>
In reply to	#2585

On 05/09/2020 17:05, Rick C. Hodgin wrote:
> Are there any existing algorithms which examine the operations that
> must be conducted and then create an optimized / minimal sequence of
> mechanical steps to conduct it given a constrained set of features
> (such as those present on a given CPU)?

The process you are describing is called "Superoptimization":
finding the optimal  code sequence for one loop-free sequence
of instructions.

https://en.wikipedia.org/wiki/Superoptimization

The term superoptimization was first coined by Alexia Massalin in the
1987 paper "Superoptimizer: A Look at the Smallest Program":

http://www.stanford.edu/class/cs343/resources/superoptimizer.pdf

Back in 1980 there was a discussion in the magazine
"Pratical Electronics" regarding the shortest sequence of
instructions needed to reverse the bits in the accumulator
for a 6800 system (an operation that is needed in some Fast Fourier
Transform algorithms).

One solution (given in June 1980, page 70) consisted of just
two instructions and used no additional data.
The instructions were:

STAA PORTA  write to output port
LDAB PORTB  read from input port

with the output port being wired directly to the input port with
the bits reversed!

--
			Martin

Dr Martin Ward | Email: martin@gkc.org.uk | http://www.gkc.org.uk
G.K.Chesterton site: http://www.gkc.org.uk/gkc | Erdos number: 4

[toc] | [prev] | [next] | [standalone]

#2594

From	Tom Crick <tomcrick@gmail.com>
Date	2020-09-08 09:35 +0100
Message-ID	<20-09-023@comp.compilers>
In reply to	#2591

On Mon, 7 Sep 2020 at 18:57, Martin Ward <martin@gkc.org.uk> wrote:
>
> On 05/09/2020 17:05, Rick C. Hodgin wrote:
> > Are there any existing algorithms which examine the operations that
> > must be conducted and then create an optimized / minimal sequence of
> > mechanical steps to conduct it given a constrained set of features
> > (such as those present on a given CPU)?
>
> The process you are describing is called "Superoptimization":
> finding the optimal  code sequence for one loop-free sequence
> of instructions. ...

Back in the distant past (2009), I did my PhD on superoptimisation —
provably optimal code generation using answer set programming:

https://proftomcrick.com/2012/02/18/three-papers-on-superoptimisation/

Still an area with lots of potential (IMHO)...

Best wishes,

Tom

[toc] | [prev] | [next] | [standalone]

#2592

From	"Rick C. Hodgin" <rick.c.hodgin@gmail.com>
Date	2020-09-07 10:55 -0400
Message-ID	<20-09-021@comp.compilers>
In reply to	#2585

Thank you, Martin.

I saw a Google developer talk on C/C++ where they discussed generating
optimum code for a given logic flow (as dictated by source code).

I remember thinking, what if your base logic breakout isn't optimum?
What if the way the developer was designing the algorithm wasn't optimum?
What if the entire premise of what you're trying to do needed to be first
broken down into fundamentals, and then re-assembled in a truly optimum
form?

There are cases where, if you know specific input ranges, you could sub-
stitute the algorithm with something completely unlike the actual function,
yet yield the same results.

One such idea is writing to a port, and reading back from another port,
where the external thing does the processing for you as a co-processor or
custom ASIC.  The actual 6800 algorithm as dictated by the developer may
have been in C code or other lagugae, and would've used Nn asm opcodes to
conduct the same workload with even an optimizing compiler.  But an intel-
ligent compiler, one that could break down the operations to fundamentals,
understand what's trying to be accomplished, and then re-engineer the
logic to accomodate the abilities which exist (such as those extra read/
write ports), is what's really quired.

In some algorithm cases, the nature of what you're trying to do may be
handled fully by a completely unexpected series of instructions.

I've long desired for my CAlive language to support that type of opti-
mization.  Walter Banks and I used to discuss the possibility, and had
he lived we would've continued down that line.  It goes along with
another fundamental idea I have for processing that I would like to
explore someday ... if there's time.

I appreciate your response.

--
Rick C. Hodgin

On 9/7/20 7:08 AM, Martin Ward wrote:
 > On 05/09/2020 17:05, Rick C. Hodgin wrote:
 >> Are there any existing algorithms which examine the operations that
 >> must be conducted and then create an optimized / minimal sequence of
 >> mechanical steps to conduct it given a constrained set of features
 >> (such as those present on a given CPU)?
 >
 > The process you are describing is called "Superoptimization":
 > finding the optimal  code sequence for one loop-free sequence
 > of instructions.
 >
 > https://en.wikipedia.org/wiki/Superoptimization
 >
 > The term superoptimization was first coined by Alexia Massalin in the
1987 paper "Superoptimizer: A Look at the Smallest Program":
 >
 > http://www.stanford.edu/class/cs343/resources/superoptimizer.pdf
 >
 > Back in 1980 there was a discussion in the magazine
 > "Pratical Electronics" regarding the shortest sequence of
 > instructions needed to reverse the bits in the accumulator
 > for a 6800 system (an operation that is needed in some Fast Fourier
 > Transform algorithms).
 >
 > One solution (given in June 1980, page 70) consisted of just
 > two instructions and used no additional data.
 > The instructions were:
 >
 > STAA PORTA  write to output port
 > LDAB PORTB  read from input port
 >
 > with the output port being wired directly to the input port with
 > the bits reversed!

[toc] | [prev] | [next] | [standalone]

#2593

From	Hans-Peter Diettrich <DrDiettrich1@netscape.net>
Date	2020-09-08 01:45 +0200
Message-ID	<20-09-022@comp.compilers>
In reply to	#2592

Am 07.09.2020 um 16:55 schrieb Rick C. Hodgin:

> One such idea is writing to a port, and reading back from another port,
> where the external thing does the processing for you as a co-processor or
> custom ASIC.

In a more general approach you connect a ROM between the output and
input port, for any possible mapping of n-into-n bits. Nowadays a table
in flash memory is better applicable.

DoDi

[toc] | [prev] | [next] | [standalone]

#2597

From	gah4 <gah4@u.washington.edu>
Date	2020-09-10 10:34 -0700
Message-ID	<20-09-026@comp.compilers>
In reply to	#2585

On Saturday, September 5, 2020 at 9:45:43 AM UTC-7, Rick C. Hodgin wrote:
> Are there any algorithms which take a known-at-compile-time sequence
> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
> them down to their minimal set of operations?
>
> For example, if I have an 8-bit byte and I want to swizzle the bits
> thusly:
>
> Input: 07 06 05 04 03 02 01 00
> Output: 05 04 07 02 01 03 00 06

There is a lot of work, and many algorithms, for logic optimization,
or minimization, that, for example, will find the optimal combination
of NAND and NOR gates to evaluate some logical operation.

This is used to compile Verilog or VHDL into either FPGA bit files
or ASIC masks.   On the other hand, logic minimization tends to
believe that you can wire anything to anything else.  That is, bit
swizzle is just wiring.  The question you ask doesn't come up.
(Later on, routing has to get all the wires to the appropriate
place, but that is true in general, not just for this case.)

Note that it isn't just adjacent bits, but bits with the same spacing
in the input and output, such that you can mask with AND, shift,
and combine with OR.  I suspect that with the operations usually
available: AND, OR, XOR, and shift, it wouldn't be hard to find an
algorithm for the optimal case.  If you add more operations,
maybe allow also add and multiply, it gets interesting.

[toc] | [prev] | [next] | [standalone]

#2598

From	"Rick C. Hodgin" <rick.c.hodgin@gmail.com>
Date	2020-09-10 17:13 -0400
Message-ID	<20-09-027@comp.compilers>
In reply to	#2597

On 9/10/20 1:34 PM, gah4 wrote:
> On Saturday, September 5, 2020 at 9:45:43 AM UTC-7, Rick C. Hodgin wrote:
>> Are there any algorithms which take a known-at-compile-time sequence
>> of bitwise operations on an 8-bit to 64-bit quantity, and optimize
>> them down to their minimal set of operations? ...

> There is a lot of work, and many algorithms, for logic optimization,
> or minimization, that, for example, will find the optimal combination
> of NAND and NOR gates to evaluate some logical operation. ...

I have a preliminary algorithm that works, but it's still not fully
optimized and is a little clunky.

I want a way to abstract the logic and work with it there.

I haven't put in any additional time on this algorithm yet, due to
some life things happening.  But I plan to come back to it.

--
Rick C. Hodgin

[toc] | [prev] | [standalone]

csiph-web

Bit swizzling

Contents

#2585 — Bit swizzling

#2586

#2587

#2590

#2588

#2589

#2591

#2594

#2592

#2593

#2597

#2598