
"convert" string to bytes without changing data (encoding)

Started by	Peter Daum <gator@cs.tu-berlin.de>
First post	2012-03-28 10:56 +0200
Last post	2012-03-28 13:16 -0400
Articles	17 on this page of 57 — 22 participants


  "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 10:56 +0200
    Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-28 20:02 +1100
      Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 11:43 +0200
        Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 12:42 +0200
          Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 19:43 +0200
            Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 20:13 +0200
            Re: "convert" string to bytes without changing data (encoding) Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-03-28 21:13 +0300
              RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:31 +0000
              Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:49 -0700
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:20 +0000
            Re: "convert" string to bytes without changing data (encoding) Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-28 12:20 -0600
            Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:26 +0000
              Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:40 +0000
            Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:17 -0700
            Re: "convert" string to bytes without changing data (encoding) John Nagle <nagle@animats.com> - 2012-03-28 12:30 -0700
            Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 17:37 -0400
              Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-29 16:57 +0200
            Re: "convert" string to bytes without changing data (encoding) Serhiy Storchaka <storchaka@gmail.com> - 2012-03-30 22:06 +0300
            Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-31 06:10 +1100
        Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 13:25 +0200
        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:12 +0000
      Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 11:36 -0400
        Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 03:18 +1100
          Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 16:33 +0000
          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:05 -0400
            Re: "convert" string to bytes without changing data (encoding) Tim Chase <python.list@tim.thechases.com> - 2012-03-28 13:49 -0500
              Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:10 -0400
            Re: "convert" string to bytes without changing data (encoding) "Albert W. Hopkins" <marduk@letterboxes.org> - 2012-03-28 15:22 -0400
        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 17:54 +0000
          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:22 -0400
            Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 14:20 -0500
              Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:43 -0400
                Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-28 21:44 +0100
                Re: "convert" string to bytes without changing data (encoding) Neil Cerutti <neilc@norwich.edu> - 2012-03-28 20:56 +0000
                Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 00:02 +0000
                Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 19:11 -0500
                  Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:04 -0400
                    Re: Re: Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 14:31 +1100
                      Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:58 -0400
                        Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-29 07:01 +0100
                        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 06:51 +0000
                          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 11:30 -0400
                            Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-29 12:49 -0400
                              Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 14:00 -0400
                                Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-30 07:41 +1100
                            Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:16 +0000
                    Re: Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-29 11:31 -0500
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 19:02 +0000
              Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:44 +0000
            Re: "convert" string to bytes without changing data (encoding) MRAB <python@mrabarnett.plus.com> - 2012-03-28 20:50 +0100
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-29 17:36 +0000
              Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:10 +0000
                Re: "convert" string to bytes without changing data (encoding) Michael Ströder <michael@stroeder.com> - 2012-03-30 09:04 +0200
        Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 14:11 -0400
    Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 11:08 +0200
    Re: "convert" string to bytes without changing data (encoding) Dave Angel <d@davea.name> - 2012-03-28 13:16 -0400


#22326

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-03-29 07:01 +0100
Message-ID	<mailman.1109.1333000850.3037.python-list@python.org>
In reply to	#22325

On 29/03/2012 04:58, Ross Ridge wrote:
> Chris Angelico<rosuav@gmail.com>  wrote:
>> Actually, he is justified. It's one thing to work in C or assembly and
>> write code that depends on certain bit-pattern representations of data
>> (although even that causes trouble - assuming that
> >> sizeof(int)==sizeof(int*) isn't good for portability), but in a high
>> level language, you cannot assume any correlation between objects and
>> bytes. Any code that depends on implementation details is risky.
>
> > How does that in any way justify Evan Driscoll maliciously lying about
> code he's never seen?
>
> 					Ross Ridge
>

We appear to have a case of "would you stand up please, your voice is 
rather muffled".  I can hear all the *plonks* from miles away.

-- 
Cheers.

Mark Lawrence.


#22328

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-29 06:51 +0000
Message-ID	<4f740687$0$29884$c3e8da3$5496439d@news.astraweb.com>
In reply to	#22325

On Wed, 28 Mar 2012 23:58:53 -0400, Ross Ridge wrote:

> How does that in any way justify Evan Driscoll maliciously lying about
> code he's never seen?

You are perfectly justified to complain about Evan making sweeping 
generalisations about your code when he has not seen it; you are NOT 
justified in making your own sweeping generalisations that he is not just 
lying but *maliciously* lying. He might be just confused by the strength 
of his emotions and so making an honest mistake. Or he might have guessed 
perfectly accurately about your code, and you are the one being 
dishonest. Who knows?

Evan's impassioned rant is based on his estimate of your mindset, namely 
that you are the sort of developer who writes code making assumptions 
about implementation details even when explicitly told not to by the 
library authors. I have no idea whether Evan's estimate is right or not, 
but I don't think it is justified based on the little amount we've seen 
of you.

Your reaction is to make an equally unjustified estimate of Evan's 
mindset, namely that he is not just wrong about you, but *deliberately 
and maliciously* lying about you in the full knowledge that he is wrong. 
If anything, I would say that you have less justification for calling 
Evan a malicious liar than he has for calling you the sort of person who 
would write to an implementation instead of an interface.

-- 
Steven


#22345

From	Ross Ridge <rridge@csclub.uwaterloo.ca>
Date	2012-03-29 11:30 -0400
Message-ID	<jl1v6b$67a$1@rumours.uwaterloo.ca>
In reply to	#22328

Steven D'Aprano  <steve+comp.lang.python@pearwood.info> wrote:
>Your reaction is to make an equally unjustified estimate of Evan's 
>mindset, namely that he is not just wrong about you, but *deliberately 
>and maliciously* lying about you in the full knowledge that he is wrong. 

No, Evan in his own words admitted that his post was meant to be harsh,
"a bit harsher than it deserves", showing his malicious intent.  He made
accusations that were neither supported by anything I've said in this
thread nor by the code I actually write.  His accusations about me were
completely made up, he was not telling the truth and had no reasonable
basis to believe he was telling the truth.  He was maliciously lying and
I'm completely justified in saying so.

Just to make it clear to all you zealots.  I've not once advocated writing
any sort of "risky code" in this thread.  I have not once advocated writing
any style of code in this thread.  Just because I refuse to drink the "it's
impossible to represent strings as a series of bytes" kool-aid doesn't mean
that I'm a heretic that must oppose everything you believe in.

					Ross Ridge

-- 
 l/  //	  Ross Ridge -- The Great HTMU
[oo][oo]  rridge@csclub.uwaterloo.ca
-()-/()/  http://www.csclub.uwaterloo.ca/~rridge/ 
 db  //


#22350

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-03-29 12:49 -0400
Message-ID	<mailman.1126.1333039783.3037.python-list@python.org>
In reply to	#22345

On 3/29/2012 11:30 AM, Ross Ridge wrote:

> No, Evan in his own words admitted that his post was meant to be harsh,

I agree that he should have restrained and censored his writing.

> Just because I refuse to drink the
> "it's impossible to represent strings as a series of bytes" kool-aid

I do not believe *anyone* has made that claim. Is this meant to be a 
wild exaggeration? As wild as Evan's?

In my first post on this thread, I made three truthful claims.

1. A 3.x text string is logically a sequence of unicode 'characters' 
(codepoints).

2. The Python language definition does not require that a string be 
bytes or become bytes unless and until it is explicitly encoded.

3. The intentionally hidden byte implementation of strings on byte 
machines is version and system dependent. The bytes used for a 
particular character are (in 3.3) context dependent.

As it turns out, the OP had mistakenly assumed that the hidden byte 
implementation of 3.3 strings was both well-defined and something 
(utf-8) that it is not and (almost certainly) never will be. Guido and 
most other devs strongly want string indexing (and hence slice endpoint 
finding) to be O(1).

So all of the above is moot as far as the OP's problem is concerned. I 
already gave him the three standard solutions.
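
To illustrate claims 1 and 2, here is a minimal sketch (not part of the original post; the sample string and variable names are made up for illustration). It shows that a 3.x str is measured in code points and only becomes bytes once an encoding is named, and that each encoding produces different bytes for the same text.

```python
# Sample text: "naive" with a diaeresis, a space, and a euro sign.
s = "na\u00efve \u20ac"

# len() counts code points, not bytes.
print(len(s))

# Bytes only appear once an encoding is chosen explicitly.
utf8 = s.encode("utf-8")
utf16 = s.encode("utf-16-le")
utf32 = s.encode("utf-32-le")

# Same text, three different byte sequences of three different lengths.
print(len(utf8), len(utf16), len(utf32))

# Decoding with the matching encoding gets the original text back.
print(utf8.decode("utf-8") == s)
```

None of these byte strings is "the" representation of the string; each is just the result of one particular encoding.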

-- 
Terry Jan Reedy


#22354

From	Ross Ridge <rridge@csclub.uwaterloo.ca>
Date	2012-03-29 14:00 -0400
Message-ID	<jl280a$pr5$1@rumours.uwaterloo.ca>
In reply to	#22350

Ross Ridge wrote:
> Just because I refuse to drink the
> "it's impossible to represent strings as a series of bytes" kool-aid

Terry Reedy  <tjreedy@udel.edu> wrote:
>I do not believe *anyone* has made that claim. Is this meant to be a 
>wild exaggeration? As wild as Evan's?

Sorry, it would've been more accurate to label the flavour of kool-aid
Chris Angelico was trying to push as "it's impossible ... without
encoding":

	What is a string? It's not a series of bytes. You can't convert
	it without encoding those characters into bytes in some way.

>In my first post on this thread, I made three truthful claims.

I'm not objecting to every post made in this thread.  If your post had
been made before the original poster had figured it out on his own,
I would've hoped he would have found it much more convincing than what
I quoted above.

					Ross Ridge

-- 
 l/  //	  Ross Ridge -- The Great HTMU
[oo][oo]  rridge@csclub.uwaterloo.ca
-()-/()/  http://www.csclub.uwaterloo.ca/~rridge/ 
 db  //


#22360

From	Chris Angelico <rosuav@gmail.com>
Date	2012-03-30 07:41 +1100
Message-ID	<mailman.1135.1333053694.3037.python-list@python.org>
In reply to	#22354

On Fri, Mar 30, 2012 at 5:00 AM, Ross Ridge <rridge@csclub.uwaterloo.ca> wrote:
> Sorry, it would've been more accurate to label the flavour of kool-aid
> Chris Angelico was trying to push as "it's impossible ... without
> encoding":
>
>        What is a string? It's not a series of bytes. You can't convert
>        it without encoding those characters into bytes in some way.

I still stand by that statement. Do you try to convert a "dictionary
of filename to open file object" into a "series of bytes" inside
Python? It doesn't matter that, on some level, it's *stored as* a
series of bytes; the actual object *is not* a series of bytes. There
is no logical equivalency, ergo it is illogical and nonsensical to
expect to turn one into the other without some form of encoding.
Python does include an encoding that can handle lists and
dictionaries. It's called Pickle, and it returns (in Python 3) a bytes
object - which IS a series of bytes. It doesn't simply return some
internal representation.
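
A minimal sketch of the Pickle point, assuming Python 3. The dictionary and its contents are illustrative only; plain strings stand in for the values because open file objects themselves cannot be pickled.

```python
import pickle

# A stand-in mapping of filenames to some picklable value.
table = {"a.txt": "first", "b.txt": "second"}

# pickle.dumps() encodes the object into a bytes object.
blob = pickle.dumps(table)
print(type(blob).__name__, len(blob))

# pickle.loads() decodes those bytes back into an equal object.
restored = pickle.loads(blob)
print(restored == table)
```

The bytes come from a defined serialization format, not from a dump of however the interpreter happens to lay the dictionary out in memory.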

ChrisA


#22367

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-30 01:16 +0000
Message-ID	<4f750965$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#22345

On Thu, 29 Mar 2012 11:30:19 -0400, Ross Ridge wrote:

> Steven D'Aprano  <steve+comp.lang.python@pearwood.info> wrote:
>>Your reaction is to make an equally unjustified estimate of Evan's
>>mindset, namely that he is not just wrong about you, but *deliberately
>>and maliciously* lying about you in the full knowledge that he is wrong.
> 
> No, Evan in his own words admitted that his post was meant to be harsh,
> "a bit harsher than it deserves", showing his malicious intent.

Being harsher than it deserves is not synonymous with malicious. You are 
making assumptions about Evan's mental state that are not supported by 
the evidence. Evan may believe that by "punishing" (for some feeble sense 
of punishment) you harshly, he is teaching you better behaviour that will 
be to your own benefit; or that it will act as a warning to others. 
Either way he may believe that he is actually doing good. 

And then he entirely undermined his own actions by admitting that he was 
over-reacting. This suggests that, in fact, he wasn't really motivated by 
either malice or beneficence but mere frustration.

It is quite clear that Evan let his passions about writing maintainable 
code get the best of him. His rant was more about "people like you" than 
you personally.

Evan, if you're reading this, I think you owe Ross an apology for flying 
off the handle. Ross, I think you owe Evan an apology for unjustified 
accusations of malice.

> He made
> accusations that were neither supported by anything I've said

Now that is not actually true. Your posts have defended the idea that 
copying the raw internal byte representation of strings is a reasonable 
thing to do. You even claimed to know how to do so, for any version of 
Python (but so far have ignored my request for you to demonstrate).

> in this
> thread nor by the code I actually write.  His accusations about me were
> completely made up, he was not telling the truth and had no reasonable
> basis to believe he was telling the truth.  He was maliciously lying and
> I'm completely justified in saying so.

No, they were not completely made up. Your posts give many signs of being 
somebody who might very well write code to the implementation rather than 
the interface. Whether you are or not is a separate question, but your 
posts in this thread indicate that you very likely could be.

If this is not the impression you want to give, then you should 
reconsider your posting style.

Ross, to be frank, your posting style in this thread has been cowardly 
and pedantic, an obnoxious combination. Please take this as constructive 
criticism and not an attack -- you have alienated people in this thread, 
leading at least one person to publicly kill-file your future posts. I 
choose to assume you aren't aware of why that is, rather than that you are 
doing so deliberately.

Without actually coming out and making a clear, explicit statement that 
you approve or disapprove of the OP's attempt to use implementation 
details, you *imply* support without explicitly giving it; you criticise 
others for saying it can't be done without demonstrating that it can be 
done. If this is a deliberate rhetorical trick, then shame on you for 
being a coward without the conviction to stand behind concrete 
expressions of your opinion. If not, then you should be aware that you 
are using a rhetorical style that will make many people predisposed to 
think you are a twat.

You *might* have said 

    Guys, you're technically wrong about this. This is how you can
    retrieve the internal representation of a string as a sequence
    of bytes: ...code... but you shouldn't use this in production 
    code because it is fragile and depends on implementation details 
    that may break in PyPy and Jython and IronPython.

But you didn't.

You *might* have said 

    Wrong, you can convert a string into a sequence of bytes without
    encoding or decoding: ...code... but don't do this.

But you didn't.

Instead you puffed yourself up as a big shot who was more technically 
correct than everyone else, but without *actually* demonstrating that you 
can do what you said you can do. You labelled as "bullshit" our attempts 
to discourage the OP from his misguided approach.

If your intention was to put people off-side, you succeeded very well. If 
not, you should be aware that you have, and consider how you might avoid 
this in the future.

-- 
Steven


#22348

From	Evan Driscoll <driscoll@cs.wisc.edu>
Date	2012-03-29 11:31 -0500
Message-ID	<mailman.1124.1333038695.3037.python-list@python.org>
In reply to	#22323

Ross Ridge wrote:
> Evan Driscoll<driscoll@cs.wisc.edu>  wrote:
>> People like you -- who write to assumptions which are not even remotely
>> guaranteed by the spec -- are part of the reason software sucks.
> ...
>> This email is a bit harsher than it deserves -- but I feel not by much.
>
> I don't see how you could feel the least bit justified.  Well meaning,
> if unhelpful, lies about the nature of Python strings in order to try to
> convince someone to follow what you think are good programming practices
> is one thing.  Maliciously lying about someone else's code that you've
> never seen is another thing entirely.

I'm not even talking about code that you or the OP has written. I'm 
talking about your suggestion that

    I can in fact say what the internal byte string representation
    of strings is in any given build of Python 3.

Aside from the questionable truth of this assertion (there's no 
guarantee that an implementation uses a single encoding or data 
structure consistently), that's of no consequence because 
you can't depend on what the representation is. So why even bring it up?

Also irrelevant is:

   In practice the number of ways that CPython (the only Python 3
   implementation) represents strings is much more limited.
   Pretending otherwise really isn't helpful.

If you can't depend on CPython's implementation (and, I would argue, 
your code is broken if you do), then it *is* helpful. Saying that "you 
can just look at what CPython does" is what is unhelpful.

That said, looking again I did misread your post that I sent that harsh 
reply to; I was looking at it perhaps a bit too much through the lens of 
the CPython comment I quoted above, and interpreting it as "I can say what 
the internal representation is of CPython, so just give me that" and 
launched into my spiel. If that's not what was intended, I retract my 
statement. As long as everyone is clear on the fact that Python 3 
implementations can use whatever encoding and data structures they want, 
perhaps even different encodings or data structures for equal strings, 
and that as a consequence saying "what's the internal representation of 
this string" is a meaningless question as far as Python itself is 
concerned, I'm happy.
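
A rough sketch of why the question has no portable answer. It assumes CPython 3.3 or later, where the storage used per character depends on the widest character in the string; other implementations and versions are free to report something entirely different. `sys.getsizeof` measures the implementation's memory footprint, which is an implementation detail and not part of the string's value. The variable names are illustrative.

```python
import sys

# Three strings of the same length, differing only in the widest character.
ascii_s = "a" * 100
bmp_s = "\u20ac" * 100        # euro sign, outside Latin-1
astral_s = "\U0001F600" * 100  # a character outside the Basic Multilingual Plane

# At the language level all three are simply 100 code points long.
print(len(ascii_s), len(bmp_s), len(astral_s))

# The in-memory size is whatever this particular implementation chose.
sizes = [sys.getsizeof(x) for x in (ascii_s, bmp_s, astral_s)]
print(sys.implementation.name, sizes)
```

Code that relies on those sizes, or on the bytes behind them, is written to one build of one implementation.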

Evan


#22306

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-03-28 19:02 +0000
Message-ID	<mailman.1094.1332962975.3037.python-list@python.org>
In reply to	#22299

> >The right way to convert bytes to strings, and vice versa, is via
> >encoding and decoding operations.
> 
> If you want to dictate to the original poster the correct way to do
> things then you don't need to do anything more than that.  You don't need to
> pretend like Chris Angelico that there isn't a direct mapping from
> his Python 3 implementation's internal representation of strings
> to bytes in order to label what he's asking for as being "silly".

It might be technically possible to recreate the internal implementation,
or get the byte data. That does not mean it will make any sense or
be understood in a meaningful manner. I think Ian summarized it
very well:

>You can't generally just "deal with the ascii portions" without
>knowing something about the encoding.  Say you encounter a byte
>greater than 127.  Is it a single non-ASCII character, or is it the
>leading byte of a multi-byte character?  If the next character is less
>than 127, is it an ASCII character, or a continuation of the previous
>character?  For UTF-8 you could safely assume ASCII, but without
>knowing the encoding, there is no way to be sure.  If you just assume
>it's ASCII and manipulate it as such, you could be messing up
>non-ASCII characters.
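
A small sketch of the ambiguity Ian describes. The byte strings are made-up examples, and Shift-JIS is picked only as one well-known multi-byte encoding in which a trailing byte can fall below 128.

```python
raw = b"caf\xc3\xa9"  # five bytes, two of them above 127

# The same bytes give different text depending on the assumed encoding.
as_utf8 = raw.decode("utf-8")      # 0xC3 0xA9 together form one character
as_latin1 = raw.decode("latin-1")  # 0xC3 and 0xA9 are two separate characters
print(repr(as_utf8), repr(as_latin1))

# In UTF-8 every byte of a multi-byte character is >= 0x80, so a byte
# below 0x80 can safely be treated as ASCII.
print(all(b >= 0x80 for b in "\u00e9\u20ac".encode("utf-8")))

# In Shift-JIS that assumption fails: the second byte of KATAKANA LETTER SO
# is 0x5C, which on its own would be read as a backslash.
sjis = "\u30bd".encode("shift_jis")
print(sjis)
```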

Technically, ASCII goes up to 256 but they are not A-z letters.

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423

--


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.


#22310

From	Grant Edwards <invalid@invalid.invalid>
Date	2012-03-28 19:44 +0000
Message-ID	<jkvpm2$9nk$2@reader1.panix.com>
In reply to	#22306

On 2012-03-28, Prasad, Ramit <ramit.prasad@jpmorgan.com> wrote:
> 
>>You can't generally just "deal with the ascii portions" without
>>knowing something about the encoding.  Say you encounter a byte
>>greater than 127.  Is it a single non-ASCII character, or is it the
>>leading byte of a multi-byte character?  If the next character is less
>>than 127, is it an ASCII character, or a continuation of the previous
>>character?  For UTF-8 you could safely assume ASCII, but without
>>knowing the encoding, there is no way to be sure.  If you just assume
>>it's ASCII and manipulate it as such, you could be messing up
>>non-ASCII characters.
> 
> Technically, ASCII goes up to 256

No, ASCII only defines 0-127.  Values >=128 are not ASCII.

From https://en.wikipedia.org/wiki/ASCII:

  ASCII includes definitions for 128 characters: 33 are non-printing
  control characters (now mostly obsolete) that affect how text and
  space is processed and 95 printable characters, including the space
  (which is considered an invisible graphic).
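
A quick check of those numbers, as a sketch using Python 3's built-in "ascii" codec (the variable names are illustrative):

```python
# All 128 ASCII code values decode cleanly.
text = bytes(range(128)).decode("ascii")

# str.isprintable() treats the space as printable and the control
# characters (0-31 and 127) as non-printable.
printable = sum(ch.isprintable() for ch in text)
control = len(text) - printable

# Byte value 128 is outside ASCII, so the codec refuses it.
try:
    bytes([128]).decode("ascii")
    decoded_128 = True
except UnicodeDecodeError:
    decoded_128 = False

print(printable, control, decoded_128)
```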

-- 
Grant Edwards               grant.b.edwards        Yow! Used staples are good
                                  at               with SOY SAUCE!
                              gmail.com


#22311

From	MRAB <python@mrabarnett.plus.com>
Date	2012-03-28 20:50 +0100
Message-ID	<mailman.1096.1332964201.3037.python-list@python.org>
In reply to	#22299

On 28/03/2012 20:02, Prasad, Ramit wrote:
>>  >The right way to convert bytes to strings, and vice versa, is via
>>  >encoding and decoding operations.
>>
>>  If you want to dictate to the original poster the correct way to do
>>  things then you don't need to do anything more than that.  You don't need to
>>  pretend like Chris Angelico that there isn't a direct mapping from
>>  his Python 3 implementation's internal representation of strings
>>  to bytes in order to label what he's asking for as being "silly".
>
> It might be technically possible to recreate the internal implementation,
> or get the byte data. That does not mean it will make any sense or
> be understood in a meaningful manner. I think Ian summarized it
> very well:
>
>>You can't generally just "deal with the ascii portions" without
>>knowing something about the encoding.  Say you encounter a byte
>>greater than 127.  Is it a single non-ASCII character, or is it the
>>leading byte of a multi-byte character?  If the next character is less
>>than 127, is it an ASCII character, or a continuation of the previous
>>character?  For UTF-8 you could safely assume ASCII, but without
>>knowing the encoding, there is no way to be sure.  If you just assume
>>it's ASCII and manipulate it as such, you could be messing up
>>non-ASCII characters.
>
> Technically, ASCII goes up to 256 but they are not A-z letters.
>
Technically, ASCII is 7-bit, so it goes up to 127.


#22352

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-03-29 17:36 +0000
Message-ID	<mailman.1128.1333042614.3037.python-list@python.org>
In reply to	#22299

> > Technically, ASCII goes up to 256 but they are not A-z letters.
> >
> Technically, ASCII is 7-bit, so it goes up to 127.

> No, ASCII only defines 0-127.  Values >=128 are not ASCII.
> 
> From https://en.wikipedia.org/wiki/ASCII:
> 
>   ASCII includes definitions for 128 characters: 33 are non-printing
>   control characters (now mostly obsolete) that affect how text and
>   space is processed and 95 printable characters, including the space
>   (which is considered an invisible graphic).


Doh! I was mistaking extended ASCII for ASCII. Thanks for the
correction.

Ramit


Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423



#22366

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-30 01:10 +0000
Message-ID	<4f75080a$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#22352

On Thu, 29 Mar 2012 17:36:34 +0000, Prasad, Ramit wrote:

>> > Technically, ASCII goes up to 256 but they are not A-z letters.
>> >
>> Technically, ASCII is 7-bit, so it goes up to 127.
> 
>> No, ASCII only defines 0-127.  Values >=128 are not ASCII.
>> 
>> From https://en.wikipedia.org/wiki/ASCII:
>> 
>>   ASCII includes definitions for 128 characters: 33 are non-printing
>>   control characters (now mostly obsolete) that affect how text and
>>   space is processed and 95 printable characters, including the space
>>   (which is considered an invisible graphic).
> 
> 
> Doh! I was mistaking extended ASCII for ASCII. Thanks for the
> correction.

There actually is no such thing as "extended ASCII" -- there is a whole 
series of many different "extended ASCIIs". If you look at the encodings 
available in (for example) Thunderbird, many of the ISO-8859-* and 
Windows-* encodings are "extended ASCII" in the sense that they extend 
ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a 
different way (hence they are different encodings).
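As a rough illustration of the point above, the same byte above 127 decodes to a different character under each "extended ASCII", while the ASCII byte stays the same. The three codecs chosen here are just examples and are not taken from the thread.

```python
# One ASCII byte ('a') followed by one byte above 127.
raw = bytes([0x61, 0xA4])

# Three example encodings that all extend ASCII, each in its own way.
codecs = ("latin-1", "iso8859-15", "cp437")
decoded = {name: raw.decode(name) for name in codecs}

for name, text in decoded.items():
    print(name, ascii(text))

# The ASCII part agrees everywhere; only the high byte differs.
first_chars = {text[0] for text in decoded.values()}
second_chars = {text[1] for text in decoded.values()}
```

Here `first_chars` holds a single character and `second_chars` holds three different ones.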

-- 
Steven


#22386

From	Michael Ströder <michael@stroeder.com>
Date	2012-03-30 09:04 +0200
Message-ID	<jl4pru$r1a$1@dont-email.me>
In reply to	#22366

Steven D'Aprano wrote:
> On Thu, 29 Mar 2012 17:36:34 +0000, Prasad, Ramit wrote:
> 
>>>> Technically, ASCII goes up to 256 but they are not A-z letters.
>>>>
>>> Technically, ASCII is 7-bit, so it goes up to 127.
>>
>>> No, ASCII only defines 0-127.  Values >=128 are not ASCII.
>>>
>>> From https://en.wikipedia.org/wiki/ASCII:
>>>
>>>   ASCII includes definitions for 128 characters: 33 are non-printing
>>>   control characters (now mostly obsolete) that affect how text and
>>>   space is processed and 95 printable characters, including the space
>>>   (which is considered an invisible graphic).
>>
>>
>> Doh! I was mistaking extended ASCII for ASCII. Thanks for the
>> correction.
> 
> There actually is no such thing as "extended ASCII" -- there is a whole 
> series of many different "extended ASCIIs". If you look at the encodings 
> available in (for example) Thunderbird, many of the ISO-8859-* and 
> Windows-* encodings are "extended ASCII" in the sense that they extend 
> ASCII to include bytes 128-255. Unfortunately they all extend ASCII in a 
> different way (hence they are different encodings).

Yupp.

Looking at RFC 1345 some years ago (while having to deal with EBCDIC) made
this all pretty clear to me. I appreciate that someone did this heavy work of
collecting historical encodings.

Ciao, Michael.


#22298

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-03-28 14:11 -0400
Message-ID	<mailman.1087.1332959217.3037.python-list@python.org>
In reply to	#22280

On 3/28/2012 11:36 AM, Ross Ridge wrote:
> Chris Angelico<rosuav@gmail.com>  wrote:
>> What is a string? It's not a series of bytes.
>
> Of course it is.  Conceptually you're not supposed to think of it that
> way, but a string is stored in memory as a series of bytes.

*If* it is stored in byte memory. If you execute a 3.x program mentally 
or on paper, then there are no bytes.

If you execute a 3.3 program on a byte-oriented computer, then the 'a' 
in the string might be represented by 1, 2, or 4 bytes, depending on the 
other characters in the string. The actual logical bit pattern will 
depend on the big versus little endianness of the system.
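The effect can be observed indirectly. This sketch assumes CPython 3.3 or later, where the storage width per character depends on the widest character in the string; other implementations may behave differently, and the exact sizes are an implementation detail.

```python
import sys

# Three strings of equal length whose widest character differs.
ascii_only = "a" * 1000
with_bmp = "a" * 999 + "\u20ac"          # one character above U+00FF
with_astral = "a" * 999 + "\U0001F600"   # one character above U+FFFF

# On CPython 3.3+ the reported object size grows with the widest character,
# even though every string has the same number of characters.
sizes = [sys.getsizeof(s) for s in (ascii_only, with_bmp, with_astral)]
for s, size in zip((ascii_only, with_bmp, with_astral), sizes):
    print(len(s), size)
```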

My impression is that if you go down to the physical bit level, then 
again there are, possibly, no 'bytes' as a physical construct as the 
bits, possibly, are stored in parallel on multiple ram chips.

> What he's asking for may not be very useful or practical, but if that's
> your problem here then that's what you should be addressing, not
> pretending that it's fundamentally impossible.

The python-level way to get the bytes of an object that supports the 
buffer interface is memoryview(). 3.x strings intentionally do not 
support the buffer interface as there is not any particular 
correspondence between characters (codepoints) and bytes.
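A short check of that behaviour: memoryview() accepts a bytes object but rejects a str.

```python
# bytes supports the buffer interface, so memoryview() works on it.
data = b"abcde"
view = memoryview(data)
print(list(view))          # the integer value of each byte

# str does not support the buffer interface, so memoryview() raises TypeError.
try:
    memoryview("abcde")
    str_supported = True
except TypeError:
    str_supported = False
print(str_supported)
```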

The OP could get the ordinal for each character and decide how *he* 
wants to convert them to bytes.

ba = bytearray()
for c in s:
   i = ord(c)
   <append bytes to ba corresponding to i>
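One possible way to fill in the placeholder, purely as an illustration: the choice of four big-endian bytes per code point is an assumption made here, not something proposed in the thread, and codepoints_to_bytes is a hypothetical helper name.

```python
def codepoints_to_bytes(s):
    """Hypothetical helper: store each code point as 4 big-endian bytes."""
    ba = bytearray()
    for c in s:
        i = ord(c)
        # Assumed mapping: fixed width, big-endian. Any other rule could go here.
        ba += i.to_bytes(4, "big")
    return bytes(ba)

sample = "abc\u20ac"
result = codepoints_to_bytes(sample)
print(result)
```

With this particular mapping the result matches what sample.encode("utf-32-be") produces, which shows that picking a mapping amounts to picking an encoding.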

To get the particular bytes used for a particular string on a particular 
system, OP should use the C API, possibly through ctypes.

-- 
Terry Jan Reedy


#22268

From	Stefan Behnel <stefan_ml@behnel.de>
Date	2012-03-28 11:08 +0200
Message-ID	<mailman.1066.1332925721.3037.python-list@python.org>
In reply to	#22266

Peter Daum, 28.03.2012 10:56:
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
> 
> s='abcde'
> b=bytes(s, "unchanged")

If you can tell us what you actually want to achieve, i.e. why you want to
do this, we may be able to tell you how to do what you want.

Stefan


#22286

From	Dave Angel <d@davea.name>
Date	2012-03-28 13:16 -0400
Message-ID	<mailman.1083.1332955038.3037.python-list@python.org>
In reply to	#22266

On 03/28/2012 04:56 AM, Peter Daum wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")
>
> Regards,
>                                Peter

You needed to specify that you are using Python 3.x.  In Python 2.x, a 
string is indeed a series of bytes.  But in Python 3.x, you have to be 
much more specific.

For example, if that string is coming from a literal, then you usually 
can convert it back to bytes simply by encoding using the same method as 
the one specified for the source file.  So look at the encoding line at 
the top of the file.
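A small sketch of that suggestion, assuming the source file is saved as UTF-8 (the Python 3 default when there is no encoding line). The latin-1 line is included only for comparison.

```python
s = "abcd\u00e9"                 # 'abcd' followed by e with acute accent

# Encoding with the source file's encoding recovers the bytes as they were
# written in the file (UTF-8 is assumed here).
as_utf8 = s.encode("utf-8")

# A different encoding gives different bytes for the same string.
as_latin1 = s.encode("latin-1")

print(as_utf8)
print(as_latin1)
```

The ASCII characters come out the same in both results; only the non-ASCII character differs.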

-- 

DaveA
