Groups > comp.lang.c > #123009 > unrolled thread

Recasting data as long ints and chars

Started by	James Harris <james.harris.1@gmail.com>
First post	2017-11-19 21:11 +0000
Last post	2018-02-10 05:46 +0000
Articles	20 on this page of 42 — 11 participants

Back to article view | Back to comp.lang.c

  Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-19 21:11 +0000
    Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-23 00:50 +0000
      Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-26 15:55 +0000
        Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-26 23:55 +0000
          Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-11-27 06:43 +0000
            Re: Recasting data as long ints and chars "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-11-27 09:26 +0100
              Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-11-27 08:50 +0000
              Re: Recasting data as long ints and chars "James R. Kuyper" <jameskuyper@verizon.net> - 2017-11-27 10:35 -0500
                Re: Recasting data as long ints and chars supercat@casperkitty.com - 2017-11-27 08:00 -0800
                Re: Recasting data as long ints and chars "Pascal J. Bourguignon" <pjb@informatimago.com> - 2017-11-27 23:49 +0100
                  Re: Recasting data as long ints and chars jameskuyper@verizon.net - 2017-11-27 15:56 -0800
              Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-15 12:45 -0800
            Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-27 15:08 +0000
          Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-27 14:55 +0000
            Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-27 15:51 +0000
              Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-28 17:13 +0000
                Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-29 01:07 +0000
                  Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-11-29 18:59 +0000
                    Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-30 00:53 +0000
                      Re: Recasting data as long ints and chars Keith Thompson <kst-u@mib.org> - 2017-11-29 17:01 -0800
                        Re: Recasting data as long ints and chars Ben Bacarisse <ben.usenet@bsb.me.uk> - 2017-11-30 11:40 +0000
                          Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-30 19:53 +0000
                Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-11-29 18:37 +0000
                  Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-11-30 22:38 +0000
                    Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-12-07 05:06 +0000
                      Re: Recasting data as long ints and chars supercat@casperkitty.com - 2017-12-07 07:51 -0800
                      Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2017-12-08 23:14 +0000
                        Re: Recasting data as long ints and chars supercat@casperkitty.com - 2017-12-12 16:19 -0800
                          Re: Recasting data as long ints and chars Keith Thompson <kst-u@mib.org> - 2017-12-12 17:10 -0800
                            Re: Recasting data as long ints and chars supercat@casperkitty.com - 2017-12-13 09:02 -0800
                              Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-15 12:39 -0800
                                Re: Recasting data as long ints and chars Robert Wessel <robertwessel2@yahoo.com> - 2017-12-15 17:39 -0600
                                  Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-15 15:52 -0800
                                    Re: Recasting data as long ints and chars Robert Wessel <robertwessel2@yahoo.com> - 2017-12-16 00:35 -0600
                                      Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-17 14:02 -0800
                                        Re: Recasting data as long ints and chars "Chris M. Thomasson" <invalid_chris_thomasson@invalid.invalid> - 2017-12-17 15:15 -0800
                                          Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-17 18:47 -0800
                                      Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-17 14:09 -0800
                                      Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-17 14:19 -0800
                                      Re: Recasting data as long ints and chars herrmannsfeldt@gmail.com - 2017-12-17 14:45 -0800
                        Re: Recasting data as long ints and chars Spiros Bousbouras <spibou@gmail.com> - 2017-12-13 19:13 +0000
                          Re: Recasting data as long ints and chars James Harris <james.harris.1@gmail.com> - 2018-02-10 05:46 +0000

Page 1 of 3 [1] 2 3 Next page →

#123009 — Recasting data as long ints and chars

From	James Harris <james.harris.1@gmail.com>
Date	2017-11-19 21:11 +0000
Subject	Recasting data as long ints and chars
Message-ID	<ouss2m$sfm$1@dont-email.me>

IME it's unusual to need to view C variables in their "raw" form but I 
seem to have come across a requirement to do just that. A basic solution 
would be easy enough but I wondered if you guys had any thoughts on how 
to do it efficiently. This is not the kind of optimisation that can be 
left to the compiler, as you will see from the description below.

I've written a bespoke sort routine. It works well enough on ints but I 
am looking to extend it to handle arbitrary data types. The keys on 
which the sort would operate would be a whole number of bytes in length 
(i.e. no partial bytes), and all keys would be of the same length. The 
relatively simple solution would be to pass the custom sort routine an 
array which specified the offset of each byte position in order. That 
has the first issue: coping with endianness. For example, if the sort 
routine were to sort 4-byte ints by working from MSB to LSB it would 
need to be passed an array of offsets: 3,2,1,0 on a little-endian 
machine, and 0,1,2,3 on a big-endian machine.

Sorting byte-by-byte would not always be ideal, however. It would be 
faster to pick up and manipulate parts of a long key in multi-byte 
chunks, where possible. As an example, if the keys were character 
strings the same size as unsigned longs and were stored in big-endian 
order and the routine were to be running on a big-endian machine, and 
the keys were sufficiently aligned ... (phew) ... it would be better to 
treat the keys as unsigned longs. As another example, if the keys were 
structs which included embedded ints it would be worth picking those 
parts up as ints even on a little-endian machine.

I can think of two basic ways to do the mechanics:

1. Call the sort routine with a data structure saying how keys are laid 
out in memory:

   custom_sort(array_of_uints, keyspec_for_uints);
   custom_sort(array_of_longs, keyspec_for_longs);

The second parameter would point to an data structure which specified 
the layout of sort keys. The data structure would include an array of 
the parts of the key.


2. Provide a callback which would return the requested part of the key spec:

   custom_sort(array_of_uints, callback_for_uints);

The sort routine would call callback_for_uints with a parameter saying 
where it had got to, and callback_for_uints would return a value which 
included offset, alignment and datatype of the next part of the key.


All of this is, of course, very low level compared with much C 
programming. Hence the question: Have you any thoughts on how to write 
it in C and make the keyspec portable?


-- 
James Harris

[toc] | [next] | [standalone]

#123348

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-23 00:50 +0000
Message-ID	<87efopstrv.fsf@bsb.me.uk>
In reply to	#123009

James Harris <james.harris.1@gmail.com> writes:

> IME it's unusual to need to view C variables in their "raw" form but I
> seem to have come across a requirement to do just that.
<snip>
> I've written a bespoke sort routine.
<snip>
> That has the first issue: coping with endianness. For example, if the
> sort routine were to sort 4-byte ints by working from MSB to LSB it
> would need to be passed an array of offsets: 3,2,1,0 on a
> little-endian machine, and 0,1,2,3 on a big-endian machine.

<snip lots more>

I'm just going to say why (from my point of view) you've had no answers
yet.  There's quite a lot of text but I did not find enough detail to
get my teeth into.  You have a solution, but I have no idea how it might
be made better, partly because I don't really know what solution you
have right now.

And at the detailed level there are some confusing things.  Endianness
is very rarely an issue but I suspect it (or something like it), is for
you for reasons not yet explained.  The bytes from LSB to MSB can easily
be obtained by shifting and masking (and I am sure you know that) so I
suspect that is not available to you for some of the data you are
considering.

<snip>
-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123521

From	James Harris <james.harris.1@gmail.com>
Date	2017-11-26 15:55 +0000
Message-ID	<oveo59$aa0$1@dont-email.me>
In reply to	#123348

On 23/11/2017 00:50, Ben Bacarisse wrote:
> James Harris <james.harris.1@gmail.com> writes:
> 
>> IME it's unusual to need to view C variables in their "raw" form but I
>> seem to have come across a requirement to do just that.
> <snip>
>> I've written a bespoke sort routine.
> <snip>
>> That has the first issue: coping with endianness. For example, if the
>> sort routine were to sort 4-byte ints by working from MSB to LSB it
>> would need to be passed an array of offsets: 3,2,1,0 on a
>> little-endian machine, and 0,1,2,3 on a big-endian machine.
> 
> <snip lots more>

Thanks for the reply, Ben. Your snip comment made me smile. The post was 
only 57 lines long but maybe I included too much in one post - and it 
just felt like "lots"...!    :-)

The basic query is difficult to describe due to the inherent underlying 
complexity but I have been trying to think of a better way to put it. 
Perhaps it's best to use an example and to speak about just one part of 
the problem at a time - which I'll do, below. I think I can explain the 
main points by analogy with radix sorting.

> 
> I'm just going to say why (from my point of view) you've had no answers
> yet.  There's quite a lot of text but I did not find enough detail to
> get my teeth into.  You have a solution, but I have no idea how it might
> be made better, partly because I don't really know what solution you
> have right now.
> 
> And at the detailed level there are some confusing things.  Endianness
> is very rarely an issue but I suspect it (or something like it), is for
> you for reasons not yet explained.  The bytes from LSB to MSB can easily
> be obtained by shifting and masking (and I am sure you know that) so I
> suspect that is not available to you for some of the data you are
> considering.

Imagine a function to carry out a radix sort. A programmer could write 
one such function for each type of record to be sorted. The solution I 
have now effectively does that: it only sorts integers and was intended 
as a proof of concept.

But it would, of course, be more flexible to have one routine which was 
able to sort arbitrary records. Then it could be called to sort 
different types of data. That, though, would require a way to specify 
the parts of the record which have to be used as sort keys. In turn, 
that raises the question of how to tell the sort function about the 
structure of the keys. E.g. imagine records like the following.

   struct rec {
     uint32_t key_part_1;
     char[20] name;
     char[3] key_part_2;
   };

Assuming the radix sort works one byte at a time, to sort records of 
struct rec type the sort function would have to be told that the key was 
made up of bytes 0 to 3 and 24 to 26. And, what's more, the order of the 
bytes of key_part_1 would depend on the architecture. So a list of bytes 
could be supplied to an MSD radix sort as either of these two.

   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
   0, 1, 2, 3, 24, 25, 26  (big-endian machine)

Given this scenario, the first part of the original query was about a 
good way to tell the sort routine such a list of byte positions, and to 
do so portably.

In summary, if calling a general purpose MSD radix sort which sorted 
byte-by-byte (i.e. where each 'digit' was a byte) what would be a good 
and portable way to specify the offsets of the bytes of the sort key 
fields in descending order of significance?

-- 
James Harris

[toc] | [prev] | [next] | [standalone]

#123526

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-26 23:55 +0000
Message-ID	<87fu90mw6q.fsf@bsb.me.uk>
In reply to	#123521

James Harris <james.harris.1@gmail.com> writes:
<snip>
> Imagine a function to carry out a radix sort. A programmer could write
> one such function for each type of record to be sorted. The solution I
> have now effectively does that: it only sorts integers and was
> intended as a proof of concept.
>
> But it would, of course, be more flexible to have one routine which
> was able to sort arbitrary records. Then it could be called to sort
> different types of data. That, though, would require a way to specify
> the parts of the record which have to be used as sort keys. In turn,
> that raises the question of how to tell the sort function about the
> structure of the keys. E.g. imagine records like the following.
>
>   struct rec {
>     uint32_t key_part_1;
>     char[20] name;
>     char[3] key_part_2;
>   };
>
> Assuming the radix sort works one byte at a time, to sort records of
> struct rec type the sort function would have to be told that the key
> was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
> of the bytes of key_part_1 would depend on the architecture. So a list
> of bytes could be supplied to an MSD radix sort as either of these
> two.
>
>   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>   0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>
> Given this scenario, the first part of the original query was about a
> good way to tell the sort routine such a list of byte positions, and
> to do so portably.

The last bit is confusing me in that I would have to work hard to come
up with a non-portable way to tell the sort routine the list of byte
positions.  That makes me think you mean something else.  Do you want a
portable way to *determine* the list of positions in some portable and
quasi-automatic way?  If so, how automatic do you want it to be?

> In summary, if calling a general purpose MSD radix sort which sorted
> byte-by-byte (i.e. where each 'digit' was a byte) what would be a good
> and portable way to specify the offsets of the bytes of the sort key
> fields in descending order of significance?

An array of unsigned int?  The fact that the answer seems so clear to me
again makes me think I'm missing something.

-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123528

From	Spiros Bousbouras <spibou@gmail.com>
Date	2017-11-27 06:43 +0000
Message-ID	<g2V32ppLhPDKcENafQyUTwdjN1mwQ@bongo-ra.co>
In reply to	#123526

On Sun, 26 Nov 2017 23:55:57 +0000
Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
> James Harris <james.harris.1@gmail.com> writes:
> <snip>
> > Imagine a function to carry out a radix sort. A programmer could write
> > one such function for each type of record to be sorted. The solution I
> > have now effectively does that: it only sorts integers and was
> > intended as a proof of concept.
> >
> > But it would, of course, be more flexible to have one routine which
> > was able to sort arbitrary records. Then it could be called to sort
> > different types of data. That, though, would require a way to specify
> > the parts of the record which have to be used as sort keys. In turn,
> > that raises the question of how to tell the sort function about the
> > structure of the keys. E.g. imagine records like the following.
> >
> >   struct rec {
> >     uint32_t key_part_1;
> >     char[20] name;
> >     char[3] key_part_2;
> >   };
> >
> > Assuming the radix sort works one byte at a time, to sort records of
> > struct rec type the sort function would have to be told that the key
> > was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
> > of the bytes of key_part_1 would depend on the architecture. So a list
> > of bytes could be supplied to an MSD radix sort as either of these
> > two.
> >
> >   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
> >   0, 1, 2, 3, 24, 25, 26  (big-endian machine)

Do you know that the only possibilities are little-endian and big-endian ?

> > Given this scenario, the first part of the original query was about a
> > good way to tell the sort routine such a list of byte positions, and
> > to do so portably.
> 
> The last bit is confusing me in that I would have to work hard to come
> up with a non-portable way to tell the sort routine the list of byte
> positions.  That makes me think you mean something else.  Do you want a
> portable way to *determine* the list of positions in some portable and
> quasi-automatic way?  If so, how automatic do you want it to be?
> 
> > In summary, if calling a general purpose MSD radix sort which sorted
> > byte-by-byte (i.e. where each 'digit' was a byte) what would be a good
> > and portable way to specify the offsets of the bytes of the sort key
> > fields in descending order of significance?
> 
> An array of unsigned int?

size_t  seems to me like a better choice of type than  unsigned int .

> The fact that the answer seems so clear to me
> again makes me think I'm missing something.

[toc] | [prev] | [next] | [standalone]

#123530

From	"Pascal J. Bourguignon" <pjb@informatimago.com>
Date	2017-11-27 09:26 +0100
Message-ID	<m2efokcelh.fsf@informatimago.com>
In reply to	#123528

Spiros Bousbouras <spibou@gmail.com> writes:

>> >   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>> >   0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>
> Do you know that the only possibilities are little-endian and big-endian ?

You mean that little-endian and big-endian are NOT the only
possibilities?

In practice there are (old) processors where you'd have:

      1, 0, 3, 2, 24, 25, 26

-- 
__Pascal J. Bourguignon
http://www.informatimago.com

[toc] | [prev] | [next] | [standalone]

#123531

From	Spiros Bousbouras <spibou@gmail.com>
Date	2017-11-27 08:50 +0000
Message-ID	<g2V32ppLhPDKcENafQyUTwdjN1mwb@bongo-ra.co>
In reply to	#123530

On Mon, 27 Nov 2017 09:26:02 +0100
"Pascal J. Bourguignon" <pjb@informatimago.com> wrote:
> Spiros Bousbouras <spibou@gmail.com> writes:
> 
> >> >   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
> >> >   0, 1, 2, 3, 24, 25, 26  (big-endian machine)
> >
> > Do you know that the only possibilities are little-endian and big-endian ?
> 
> You mean that little-endian and big-endian are NOT the only
> possibilities?
> 
> 
> In practice there are (old) processors where you'd have:
> 
>       1, 0, 3, 2, 24, 25, 26

I don't know how old they are or what the full range of possibilities is
but yes , I was referring to the fact that they exist(ed).

[toc] | [prev] | [next] | [standalone]

#123535

From	"James R. Kuyper" <jameskuyper@verizon.net>
Date	2017-11-27 10:35 -0500
Message-ID	<4c779337-fbfa-6fd9-c727-834e074ee673@verizon.net>
In reply to	#123530

On 11/27/2017 03:26 AM, Pascal J. Bourguignon wrote:
> Spiros Bousbouras <spibou@gmail.com> writes:
> 
>>>>    3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>>>>    0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>>
>> Do you know that the only possibilities are little-endian and big-endian ?
> 
> You mean that little-endian and big-endian are NOT the only
> possibilities?

Yes, though the complete list of possibilities depends in part on what 
you mean by "possibilities". Some people, especially those who would 
describe themselves as "practical, common sense" people, only consider 
something possible if has actually been done; more accurately, they 
would consider those to be the only possibilities worth worrying about. 
Other people (myself included) believe that something should be 
considered "possible" if it could be done without violating any of the 
applicable constraints, even it if has never yet been done.

Given an integer type that occupies N bytes, there's N! different 
possible (in the second sense) orderings of the bytes. For N==4, that's 
24 orderings. A decade or two ago I found a web site listing the various 
orderings that had actually been used, and examples of devices where 
they were used. Unfortunately, I didn't appreciate the value of what I'd 
found, or I would have saved a copy of the web page, or at least a link 
to it. I've never been able to locate it again, so all of the following 
statements are based upon my fallible organic memory - feel free to 
discount them accordingly:

Of those 24 possible orderings, 8 had been put into actual use for 
devices that the authors were familiar with (possible in the first sense 
described above). Given the difficulty of collecting such information, I 
think it's virtually guaranteed that there's a least a couple of 
orderings in actual use that they didn't know about.
Overwhelmingly, the two most common orderings were big-endian and 
little-endian. The other possible orderings are generically called 
middle-endian. Of the middle-endian orderings, overwhelmingly the most 
common were 1 0 3 2 and 2 3 0 1, and some people use the term 
middle-endian exclusively for those two orderings - either because they 
are ignorant of the other orderings, or because they use a different 
term for those other orderings.

I'd like to emphasize that not all 8 of those orderings were used by 
CPUs - many of them were used only by specialized devices. The web page 
was not about C, but about byte orderings, and therefore did not 
identify which of those orderings were in use by CPUs for which there 
was a conforming implementation of C.

Note that not only does the standard not constrain the byte ordering in 
any way, it also does not constrain the bit ordering, either. I can't 
imagine any reason why anyone would do so, but it wouldn't violate any 
requirement imposed by the C standard to, for instance, store each of 
the four lowest-order bits in a different byte of a four-byte integer.

[toc] | [prev] | [next] | [standalone]

#123538

From	supercat@casperkitty.com
Date	2017-11-27 08:00 -0800
Message-ID	<a5efc5bb-bbf8-4571-8a85-62a9fb9e201c@googlegroups.com>
In reply to	#123535

On Monday, November 27, 2017 at 9:36:17 AM UTC-6, James R. Kuyper wrote:
Eight?  Really?  What devices used something other than 0123, 3210, or 2301?

> Note that not only does the standard not constrain the byte ordering in 
> any way, it also does not constrain the bit ordering, either. I can't 
> imagine any reason why anyone would do so, but it wouldn't violate any 
> requirement imposed by the C standard to, for instance, store each of 
> the four lowest-order bits in a different byte of a four-byte integer.

How much value is gained by having a Standard regard 2.6313e+35 arrangements
of bits in a 32-bit word as equally likely, when in practice 99.9% of
machines will use one of TWO, and 99.9999999%+ of the possible arrangements
will NEVER be used by ANY implementation anywhere?

I would favor having the Standard define a category of conformance that
allows for things like weird bit orderings, padding bits, byte sizes other
than eight, non-two's-complement representations, etc. but that doesn't
mean that it shouldn't recognize traits that are common to 99.9%+ of
implementations.

[toc] | [prev] | [next] | [standalone]

#123541

From	"Pascal J. Bourguignon" <pjb@informatimago.com>
Date	2017-11-27 23:49 +0100
Message-ID	<m27eubcp68.fsf@informatimago.com>
In reply to	#123535

"James R. Kuyper" <jameskuyper@verizon.net> writes:

> Note that not only does the standard not constrain the byte ordering
> in any way, it also does not constrain the bit ordering, either. I
> can't imagine any reason why anyone would do so, but it wouldn't
> violate any requirement imposed by the C standard to, for instance,
> store each of the four lowest-order bits in a different byte of a
> four-byte integer.

The notion of thing-ordering has a sense only if there's is a way to
_address_ things.

You cannot have bit-ordering if you can't have a pointer to a bit.

In the case of the C programming language, pointers may address at most
char (sizeof(char)=1).  Therefore the notion of bit ordering is
meaningless in C.

It is also meaningless in hardware, in the sense that bits are not
ordered logically (by an increasing address), but physically,
spacio-temporally.  You can have some bits laid out on the right or on
the left, perhaps in front or behind, or even above or below other bits,
some bits may be transmitted before or after some other bits when
serialized.  And this changes from chip to chip, from bus to bus.  It
doesn't matter, as long as the hardware is consistent (you read back the
same bit you wrote, in the same order, when you read or write a byte).

-- 
__Pascal J. Bourguignon
http://www.informatimago.com

[toc] | [prev] | [next] | [standalone]

#123542

From	jameskuyper@verizon.net
Date	2017-11-27 15:56 -0800
Message-ID	<8e037c7d-9582-4915-aa8c-3802dab7b1b5@googlegroups.com>
In reply to	#123541

On 11/27/2017 05:49 PM, Pascal J. Bourguignon wrote:
> "James R. Kuyper" <jameskuyper@verizon.net> writes:
> 
>> Note that not only does the standard not constrain the byte ordering
>> in any way, it also does not constrain the bit ordering, either. I
>> can't imagine any reason why anyone would do so, but it wouldn't
>> violate any requirement imposed by the C standard to, for instance,
>> store each of the four lowest-order bits in a different byte of a
>> four-byte integer.
> 
> The notion of thing-ordering has a sense only if there's is a way to
> _address_ things.

You're right - you can't determine absolute bit orderings, since there's no way
to address individual bits. However, in any context which allows you to access
memory using two different types, you can determine the relative bit order of
those types. Unions provide one such context, and any object can be accessed
either as an array of unsigned char or by using it's effective type.

Unless those two types are a signed type and the corresponding unsigned type,
the standard imposes no requirements that the order be consistent between them.

Consider the following code:

   for(uint32_t i=1; i; i<<=1)
   {
       printf("%" PRIx32 "\t", i);
       for(unsigned char *p=(unsigned char*)&i;
           p < (unsigned char*)(1+&i); p++)
           printf("%02x ", *p);
       printf("\n");
   }

If int32_t is a supported type, then CHAR_BIT must be 8, 16, or 32, and that
code should print exactly 32 lines.

From what the C standard does tell us about the representation of integer types,
including the fact that int32_t is not allowed to have any padding bits, it's
possible to deduce that every single line should show only one byte with a non-zero value, and that value should be a power of 2 that is <= UCHAR_MAX. For any
permitted value of CHAR_BIT, there are 32 possible sets of byte values meeting that requirement, and all of those sets will occur exactly once in the output.

People who pay too much attention to what's actually being done by real world
systems, and not enough attention to what the standard actually permits, would
claim that all of the lines where the non-zero bit is in a given byte must
appear consecutively, and in increasing order of the value of that bit,
interpreted as an unsigned char. And if they tested with virtually any real world implementation of C, they would get results consistent with that claim.

However, the standard imposes no such requirement. There are 32! different
possible orders in which those 32 distinct sets of byte values can occur, and
not one of those orders would violate any requirement imposed by the C standard.

[toc] | [prev] | [next] | [standalone]

#124368

From	herrmannsfeldt@gmail.com
Date	2017-12-15 12:45 -0800
Message-ID	<bcb652f4-7e65-4476-bb27-2bf1748a7c58@googlegroups.com>
In reply to	#123530

On Monday, November 27, 2017 at 12:26:18 AM UTC-8, Pascal J. Bourguignon wrote:
> Spiros Bousbouras <spi...@gmail.com> writes:

> >> >   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
> >> >   0, 1, 2, 3, 24, 25, 26  (big-endian machine)
> >
> > Do you know that the only possibilities are little-endian and big-endian ?

> You mean that little-endian and big-endian are NOT the only
> possibilities?

> In practice there are (old) processors where you'd have:

>       1, 0, 3, 2, 24, 25, 26

VAX uses middle endian for floating point, but little endian
for integers. There are additional problems with sorting floating
point than bit (and byte) order.

The PDP-10 uses two's complement for floating point, such that
you can (and do) compare floating point values with fixed point
compare instructions.

Note that even for twos complement fixed point, you have to
treat the byte with sign different from the other bytes,
when comparing a byte at a time.

[toc] | [prev] | [next] | [standalone]

#123534

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-27 15:08 +0000
Message-ID	<871skj3gje.fsf@bsb.me.uk>
In reply to	#123528

Spiros Bousbouras <spibou@gmail.com> writes:

> On Sun, 26 Nov 2017 23:55:57 +0000
> Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
>> James Harris <james.harris.1@gmail.com> writes:
>> <snip>
>> > Imagine a function to carry out a radix sort. A programmer could write
>> > one such function for each type of record to be sorted. The solution I
>> > have now effectively does that: it only sorts integers and was
>> > intended as a proof of concept.
>> >
>> > But it would, of course, be more flexible to have one routine which
>> > was able to sort arbitrary records. Then it could be called to sort
>> > different types of data. That, though, would require a way to specify
>> > the parts of the record which have to be used as sort keys. In turn,
>> > that raises the question of how to tell the sort function about the
>> > structure of the keys. E.g. imagine records like the following.
>> >
>> >   struct rec {
>> >     uint32_t key_part_1;
>> >     char[20] name;
>> >     char[3] key_part_2;
>> >   };
>> >
>> > Assuming the radix sort works one byte at a time, to sort records of
>> > struct rec type the sort function would have to be told that the key
>> > was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
>> > of the bytes of key_part_1 would depend on the architecture. So a list
>> > of bytes could be supplied to an MSD radix sort as either of these
>> > two.
>> >
>> >   3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>> >   0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>
> Do you know that the only possibilities are little-endian and
> big-endian ?

I do.  The OP probably does too.  In fact the idea of listing byut
offsets might be because of such possibilities.

>> > Given this scenario, the first part of the original query was about a
>> > good way to tell the sort routine such a list of byte positions, and
>> > to do so portably.
>> 
>> The last bit is confusing me in that I would have to work hard to come
>> up with a non-portable way to tell the sort routine the list of byte
>> positions.  That makes me think you mean something else.  Do you want a
>> portable way to *determine* the list of positions in some portable and
>> quasi-automatic way?  If so, how automatic do you want it to be?
>> 
>> > In summary, if calling a general purpose MSD radix sort which sorted
>> > byte-by-byte (i.e. where each 'digit' was a byte) what would be a good
>> > and portable way to specify the offsets of the bytes of the sort key
>> > fields in descending order of significance?
>> 
>> An array of unsigned int?
>
> size_t  seems to me like a better choice of type than  unsigned int .

I was not certain.  It seems unlikely that you need to index bytes that
far into individual records, but the exact type is not really the point.

-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123533

From	James Harris <james.harris.1@gmail.com>
Date	2017-11-27 14:55 +0000
Message-ID	<ovh906$kj4$1@dont-email.me>
In reply to	#123526

On 26/11/2017 23:55, Ben Bacarisse wrote:
> James Harris <james.harris.1@gmail.com> writes:
> <snip>
>> Imagine a function to carry out a radix sort. A programmer could write
>> one such function for each type of record to be sorted. The solution I
>> have now effectively does that: it only sorts integers and was
>> intended as a proof of concept.
>>
>> But it would, of course, be more flexible to have one routine which
>> was able to sort arbitrary records. Then it could be called to sort
>> different types of data. That, though, would require a way to specify
>> the parts of the record which have to be used as sort keys. In turn,
>> that raises the question of how to tell the sort function about the
>> structure of the keys. E.g. imagine records like the following.
>>
>>    struct rec {
>>      uint32_t key_part_1;
>>      char[20] name;
>>      char[3] key_part_2;
>>    };
>>
>> Assuming the radix sort works one byte at a time, to sort records of
>> struct rec type the sort function would have to be told that the key
>> was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
>> of the bytes of key_part_1 would depend on the architecture. So a list
>> of bytes could be supplied to an MSD radix sort as either of these
>> two.
>>
>>    3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>>    0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>>
>> Given this scenario, the first part of the original query was about a
>> good way to tell the sort routine such a list of byte positions, and
>> to do so portably.
> 
> The last bit is confusing me in that I would have to work hard to come
> up with a non-portable way to tell the sort routine the list of byte
> positions.  That makes me think you mean something else.  Do you want a
> portable way to *determine* the list of positions in some portable and
> quasi-automatic way?  If so, how automatic do you want it to be?

When I speak of "portable" I am thinking that the source would be the 
same. The offsets would have to be determined at compile time (or 
later). It's easy to do this non-portably. For example, a non-portable 
way to carry out the sort on 32-bit integers might define an array of 
offsets such as

   int i32_offsets[] = {3, 2, 1, 0, -1};

The -1 indicates the end of the list. In any call to the sort function 
the offsets would be given, as in

   radm_sort(values_array, <other parameters>, i32_offsets, 0);

But in a number of respects this would not be portable. E.g. it would 
not work on a big-endian machine.

> 
>> In summary, if calling a general purpose MSD radix sort which sorted
>> byte-by-byte (i.e. where each 'digit' was a byte) what would be a good
>> and portable way to specify the offsets of the bytes of the sort key
>> fields in descending order of significance?
> 
> An array of unsigned int?  The fact that the answer seems so clear to me
> again makes me think I'm missing something.

Does it make more sense now?


-- 
James Harris

[toc] | [prev] | [next] | [standalone]

#123537

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-27 15:51 +0000
Message-ID	<87d143k9e1.fsf@bsb.me.uk>
In reply to	#123533

James Harris <james.harris.1@gmail.com> writes:

> On 26/11/2017 23:55, Ben Bacarisse wrote:
>> James Harris <james.harris.1@gmail.com> writes:
>> <snip>
>>> Imagine a function to carry out a radix sort. A programmer could write
>>> one such function for each type of record to be sorted. The solution I
>>> have now effectively does that: it only sorts integers and was
>>> intended as a proof of concept.
>>>
>>> But it would, of course, be more flexible to have one routine which
>>> was able to sort arbitrary records. Then it could be called to sort
>>> different types of data. That, though, would require a way to specify
>>> the parts of the record which have to be used as sort keys. In turn,
>>> that raises the question of how to tell the sort function about the
>>> structure of the keys. E.g. imagine records like the following.
>>>
>>>    struct rec {
>>>      uint32_t key_part_1;
>>>      char[20] name;
>>>      char[3] key_part_2;
>>>    };
>>>
>>> Assuming the radix sort works one byte at a time, to sort records of
>>> struct rec type the sort function would have to be told that the key
>>> was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
>>> of the bytes of key_part_1 would depend on the architecture. So a list
>>> of bytes could be supplied to an MSD radix sort as either of these
>>> two.
>>>
>>>    3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>>>    0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>>>
>>> Given this scenario, the first part of the original query was about a
>>> good way to tell the sort routine such a list of byte positions, and
>>> to do so portably.
>>
>> The last bit is confusing me in that I would have to work hard to come
>> up with a non-portable way to tell the sort routine the list of byte
>> positions.  That makes me think you mean something else.  Do you want a
>> portable way to *determine* the list of positions in some portable and
>> quasi-automatic way?  If so, how automatic do you want it to be?
>
> When I speak of "portable" I am thinking that the source would be the
> same. The offsets would have to be determined at compile time (or
> later). It's easy to do this non-portably. For example, a non-portable
> way to carry out the sort on 32-bit integers might define an array of
> offsets such as
>
>   int i32_offsets[] = {3, 2, 1, 0, -1};
>
> The -1 indicates the end of the list.

Ah, so the problem is generating the list at compile time?  That's
tricky without a build-phase -- i.e. a program that runs "at compile
time".

With running code there are lots of options.  The end result could be to
write a series of macros to a file that get #included into the source:

  #define UINT64_BNUMS 7, 6, 5, 4, 3, 2, 1, 0
  #define UINT32_BNUMS 3, 2, 1, 0
  #define UINT16_BNUMS 1, 0

so you can write

  int i32_offsets[] = { UINT32_BNUMS, -1 };

To handle integer types deeper in the struct you'd need a macro that can
add an offset to a list.  I think that can be done with macro magic.
I'll have go if you think this is the way you will be going.

The other issue is padding between struct members but that can be done
with the standard offsetof macro.

<snip>
> Does it make more sense now?

You'll have to tell me if I've got what you mean!

-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123576

From	James Harris <james.harris.1@gmail.com>
Date	2017-11-28 17:13 +0000
Message-ID	<ovk5ev$ee3$1@dont-email.me>
In reply to	#123537

On 27/11/2017 15:51, Ben Bacarisse wrote:
> James Harris <james.harris.1@gmail.com> writes:
> 
>> On 26/11/2017 23:55, Ben Bacarisse wrote:
>>> James Harris <james.harris.1@gmail.com> writes:
>>> <snip>
>>>> Imagine a function to carry out a radix sort. A programmer could write
>>>> one such function for each type of record to be sorted. The solution I
>>>> have now effectively does that: it only sorts integers and was
>>>> intended as a proof of concept.
>>>>
>>>> But it would, of course, be more flexible to have one routine which
>>>> was able to sort arbitrary records. Then it could be called to sort
>>>> different types of data. That, though, would require a way to specify
>>>> the parts of the record which have to be used as sort keys. In turn,
>>>> that raises the question of how to tell the sort function about the
>>>> structure of the keys. E.g. imagine records like the following.
>>>>
>>>>     struct rec {
>>>>       uint32_t key_part_1;
>>>>       char[20] name;
>>>>       char[3] key_part_2;
>>>>     };
>>>>
>>>> Assuming the radix sort works one byte at a time, to sort records of
>>>> struct rec type the sort function would have to be told that the key
>>>> was made up of bytes 0 to 3 and 24 to 26. And, what's more, the order
>>>> of the bytes of key_part_1 would depend on the architecture. So a list
>>>> of bytes could be supplied to an MSD radix sort as either of these
>>>> two.
>>>>
>>>>     3, 2, 1, 0, 24, 25, 26  (little-endian machine)
>>>>     0, 1, 2, 3, 24, 25, 26  (big-endian machine)
>>>>
>>>> Given this scenario, the first part of the original query was about a
>>>> good way to tell the sort routine such a list of byte positions, and
>>>> to do so portably.
>>>
>>> The last bit is confusing me in that I would have to work hard to come
>>> up with a non-portable way to tell the sort routine the list of byte
>>> positions.  That makes me think you mean something else.  Do you want a
>>> portable way to *determine* the list of positions in some portable and
>>> quasi-automatic way?  If so, how automatic do you want it to be?
>>
>> When I speak of "portable" I am thinking that the source would be the
>> same. The offsets would have to be determined at compile time (or
>> later). It's easy to do this non-portably. For example, a non-portable
>> way to carry out the sort on 32-bit integers might define an array of
>> offsets such as
>>
>>    int i32_offsets[] = {3, 2, 1, 0, -1};
>>
>> The -1 indicates the end of the list.
> 
> Ah, so the problem is generating the list at compile time?  That's
> tricky without a build-phase -- i.e. a program that runs "at compile
> time".

Compile time would be ideal, if feasible.

> 
> With running code there are lots of options.  The end result could be to
> write a series of macros to a file that get #included into the source:
> 
>    #define UINT64_BNUMS 7, 6, 5, 4, 3, 2, 1, 0
>    #define UINT32_BNUMS 3, 2, 1, 0
>    #define UINT16_BNUMS 1, 0
> 
> so you can write
> 
>    int i32_offsets[] = { UINT32_BNUMS, -1 };
> 
> To handle integer types deeper in the struct you'd need a macro that can
> add an offset to a list.  I think that can be done with macro magic.
> I'll have go if you think this is the way you will be going.
> 
> The other issue is padding between struct members but that can be done
> with the standard offsetof macro.
> 
> <snip>
>> Does it make more sense now?
> 
> You'll have to tell me if I've got what you mean!

I think so. And I see that picking code in a #if block can be done if 
suitable tests can be devised. But I am not sure that such an approach 
would scale well and I wonder if there would be a more flexible way. To 
illustrate, consider

   unsigned U_ORDER = 0x00010203;
   unsigned char *p = (unsigned char *) &U_ORDER;

The idea would be that p[0] through p[3] would give the offsets of the 
bytes of an unsigned int in descending order of significance, and that 
it would work irrespective of endianness - big, little or anything else. 
But it has problems: it assumes sizeof(unsigned) is 4 and that CHAR_BIT 
is 8. Hence it is not portable.

Signed ints would be harder. At least the byte with the sign would need 
to be treated differently. That could be done with a simple mapping 
function (possibly using xor or subtract operations) which would be 
fine. But the code would need to adapt to how signed ints were laid out 
in memory on different machines. And, again, that seems to be hard to do 
portably.

I guess I am looking for a portable way for C datatypes to be treated as 
a sequence of bytes, and hoping there would be some language support 
(i.e. set of definitions in the C standards) for doing so. Although my 
sort routine is not a radix sort it faces the same problems of 
decomposing an arbitrary key into manageable chunks so I was thinking 
that this kind of problem might have already been solved.

AIUI any C object apart from a bit field can be viewed as a sequence of 
bytes but that only gets me part way to a portable approach to knowing 
what is in each byte.

It seems like a simple requirement, but implementing it portably is 
where it becomes difficult.

-- 
James Harris

[toc] | [prev] | [next] | [standalone]

#123593

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-29 01:07 +0000
Message-ID	<87efohsxin.fsf@bsb.me.uk>
In reply to	#123576

James Harris <james.harris.1@gmail.com> writes:

> On 27/11/2017 15:51, Ben Bacarisse wrote:
>> James Harris <james.harris.1@gmail.com> writes:
<snip>
>>> When I speak of "portable" I am thinking that the source would be the
>>> same. The offsets would have to be determined at compile time (or
>>> later). It's easy to do this non-portably. For example, a non-portable
>>> way to carry out the sort on 32-bit integers might define an array of
>>> offsets such as
>>>
>>>    int i32_offsets[] = {3, 2, 1, 0, -1};
>>>
>>> The -1 indicates the end of the list.
>>
>> Ah, so the problem is generating the list at compile time?  That's
>> tricky without a build-phase -- i.e. a program that runs "at compile
>> time".
>
> Compile time would be ideal, if feasible.

I don't think you can do it "at compile time" rather than "at
configuration time".

<snip>
>> You'll have to tell me if I've got what you mean!
>
> I think so. And I see that picking code in a #if block can be done if
> suitable tests can be devised.

That's was not what I was suggesting.  But I'm not sure what I was
suggesting will be what you want anyway.

> But I am not sure that such an approach
> would scale well and I wonder if there would be a more flexible
> way. To illustrate, consider
>
>   unsigned U_ORDER = 0x00010203;
>   unsigned char *p = (unsigned char *) &U_ORDER;
>
> The idea would be that p[0] through p[3] would give the offsets of the
> bytes of an unsigned int in descending order of significance, and that
> it would work irrespective of endianness - big, little or anything
> else. But it has problems: it assumes sizeof(unsigned) is 4 and that
> CHAR_BIT is 8. Hence it is not portable.

This is run-time code, yes?  If you are prepared to do it at run-time
then you don't need to assume the size of anything.  You can start with

  unsigned nb = 0;
  unsigned U_ORDER = 0;

and loop setting

  (U_ORDER << CHAR_BIT) | nb++

sizeof (unsigned) times.  And everywhere you are tempted to use p[0] to
p[3] use p[0] to p[sizeof U_ORDER - 1] instead.

<snip>
-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123614

From	Spiros Bousbouras <spibou@gmail.com>
Date	2017-11-29 18:59 +0000
Message-ID	<RrDMy1cnJtEcOsyJAqgdzUr1TUNmG@bongo-ra.co>
In reply to	#123593

On Wed, 29 Nov 2017 01:07:28 +0000
Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
>   unsigned nb = 0;
>   unsigned U_ORDER = 0;
> 
> and loop setting
> 
>   (U_ORDER << CHAR_BIT) | nb++

If  sizeof(unsigned) == 1    then   U_ORDER << CHAR_BIT
is undefined behaviour.

> sizeof (unsigned) times.  And everywhere you are tempted to use p[0] to
> p[3] use p[0] to p[sizeof U_ORDER - 1] instead.

[toc] | [prev] | [next] | [standalone]

#123629

From	Ben Bacarisse <ben.usenet@bsb.me.uk>
Date	2017-11-30 00:53 +0000
Message-ID	<87bmjklh7u.fsf@bsb.me.uk>
In reply to	#123614

Spiros Bousbouras <spibou@gmail.com> writes:

> On Wed, 29 Nov 2017 01:07:28 +0000
> Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
>>   unsigned nb = 0;
>>   unsigned U_ORDER = 0;
>> 
>> and loop setting
>> 
>>   (U_ORDER << CHAR_BIT) | nb++
>
> If  sizeof(unsigned) == 1    then   U_ORDER << CHAR_BIT
> is undefined behaviour.

Why?

But that's a side issue.  The key thing is "loop setting" -- i.e. rather
than a fixed assignment the code should loop setting the byte numbers.
When sizeof U_ORDER == 1, you won't loop.  There is only one byte, it
gets given number 0 and no others need to be set so the code in question
is not run when the size is 1.

>> sizeof (unsigned) times.  And everywhere you are tempted to use p[0] to
>> p[3] use p[0] to p[sizeof U_ORDER - 1] instead.

As here -- everything is to be done from i = 0 to i < sizeof U_ORDER.

-- 
Ben.

[toc] | [prev] | [next] | [standalone]

#123630

From	Keith Thompson <kst-u@mib.org>
Date	2017-11-29 17:01 -0800
Message-ID	<lnindsy3y6.fsf@kst-u.example.com>
In reply to	#123629

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
> Spiros Bousbouras <spibou@gmail.com> writes:
>> On Wed, 29 Nov 2017 01:07:28 +0000
>> Ben Bacarisse <ben.usenet@bsb.me.uk> wrote:
>>>   unsigned nb = 0;
>>>   unsigned U_ORDER = 0;
>>> 
>>> and loop setting
>>> 
>>>   (U_ORDER << CHAR_BIT) | nb++
>>
>> If  sizeof(unsigned) == 1    then   U_ORDER << CHAR_BIT
>> is undefined behaviour.
>
> Why?

Because CHAR_BIT (which would have to be at least 16) would be equal to
the width of the left operand.

N1570 6.5.7p3:
    The integer promotions are performed on each of the operands. The
    type of the result is that of the promoted left operand. If the
    value of the right operand is negative or is greater than or
    equal to the width of the promoted left operand, the behavior
    is undefined

> But that's a side issue.
[snip]

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"

[toc] | [prev] | [next] | [standalone]

Page 1 of 3 [1] 2 3 Next page →

csiph-web

Recasting data as long ints and chars

Contents

#123009 — Recasting data as long ints and chars

#123348

#123521

#123526

#123528

#123530

#123531

#123535

#123538

#123541

#123542

#124368

#123534

#123533

#123537

#123576

#123593

#123614

#123629

#123630