Groups > comp.lang.python > #22266 > unrolled thread

"convert" string to bytes without changing data (encoding)

Started by	Peter Daum <gator@cs.tu-berlin.de>
First post	2012-03-28 10:56 +0200
Last post	2012-03-28 13:16 -0400
Articles	20 on this page of 57 — 22 participants


  "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 10:56 +0200
    Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-28 20:02 +1100
      Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 11:43 +0200
        Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 12:42 +0200
          Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 19:43 +0200
            Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 20:13 +0200
            Re: "convert" string to bytes without changing data (encoding) Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-03-28 21:13 +0300
              RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:31 +0000
              Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:49 -0700
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:20 +0000
            Re: "convert" string to bytes without changing data (encoding) Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-28 12:20 -0600
            Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:26 +0000
              Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:40 +0000
            Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:17 -0700
            Re: "convert" string to bytes without changing data (encoding) John Nagle <nagle@animats.com> - 2012-03-28 12:30 -0700
            Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 17:37 -0400
              Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-29 16:57 +0200
              Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-29 16:57 +0200
            Re: "convert" string to bytes without changing data (encoding) Serhiy Storchaka <storchaka@gmail.com> - 2012-03-30 22:06 +0300
            Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-31 06:10 +1100
        Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 13:25 +0200
        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:12 +0000
      Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 11:36 -0400
        Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 03:18 +1100
          Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 16:33 +0000
          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:05 -0400
            Re: "convert" string to bytes without changing data (encoding) Tim Chase <python.list@tim.thechases.com> - 2012-03-28 13:49 -0500
              Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:10 -0400
            Re: "convert" string to bytes without changing data (encoding) "Albert W. Hopkins" <marduk@letterboxes.org> - 2012-03-28 15:22 -0400
        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 17:54 +0000
          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:22 -0400
            Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 14:20 -0500
              Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:43 -0400
                Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-28 21:44 +0100
                Re: "convert" string to bytes without changing data (encoding) Neil Cerutti <neilc@norwich.edu> - 2012-03-28 20:56 +0000
                Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 00:02 +0000
                Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 19:11 -0500
                  Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:04 -0400
                    Re: Re: Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 14:31 +1100
                      Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:58 -0400
                        Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-29 07:01 +0100
                        Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 06:51 +0000
                          Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 11:30 -0400
                            Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-29 12:49 -0400
                              Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 14:00 -0400
                                Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-30 07:41 +1100
                            Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:16 +0000
                    Re: Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-29 11:31 -0500
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 19:02 +0000
              Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:44 +0000
            Re: "convert" string to bytes without changing data (encoding) MRAB <python@mrabarnett.plus.com> - 2012-03-28 20:50 +0100
            RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-29 17:36 +0000
              Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:10 +0000
                Re: "convert" string to bytes without changing data (encoding) Michael Ströder <michael@stroeder.com> - 2012-03-30 09:04 +0200
        Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 14:11 -0400
    Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 11:08 +0200
    Re: "convert" string to bytes without changing data (encoding) Dave Angel <d@davea.name> - 2012-03-28 13:16 -0400


#22266 — "convert" string to bytes without changing data (encoding)

From	Peter Daum <gator@cs.tu-berlin.de>
Date	2012-03-28 10:56 +0200
Subject	"convert" string to bytes without changing data (encoding)
Message-ID	<9tg21lFmo3U1@mid.dfncis.de>

Hi,

is there any way to convert a string to bytes without
interpreting the data in any way? Something like:

s='abcde'
b=bytes(s, "unchanged")

Regards,
                              Peter


#22267

From	Chris Angelico <rosuav@gmail.com>
Date	2012-03-28 20:02 +1100
Message-ID	<mailman.1065.1332925364.3037.python-list@python.org>
In reply to	#22266

On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <gator@cs.tu-berlin.de> wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")

What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.
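
As a minimal sketch of this point, assuming Python 3 (the variable names are illustrative and not from the thread): even the plain string from the question has several possible byte forms, so some encoding always has to be chosen.

```python
s = 'abcde'

as_ascii = s.encode('ascii')        # one byte per character
as_utf16 = s.encode('utf-16-le')    # two bytes per character
as_utf32 = s.encode('utf-32-le')    # four bytes per character

print(as_ascii)         # b'abcde'
print(as_utf16)         # b'a\x00b\x00c\x00d\x00e\x00'
print(len(as_utf32))    # 20
```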

ChrisA


#22270

From	Peter Daum <gator@cs.tu-berlin.de>
Date	2012-03-28 11:43 +0200
Message-ID	<9tg4qoFbfpU1@mid.dfncis.de>
In reply to	#22267

On 2012-03-28 11:02, Chris Angelico wrote:
> On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <gator@cs.tu-berlin.de> wrote:
>> is there any way to convert a string to bytes without
>> interpreting the data in any way? Something like:
>>
>> s='abcde'
>> b=bytes(s, "unchanged")
> 
> What is a string? It's not a series of bytes. You can't convert it
> without encoding those characters into bytes in some way.

... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

b=bytes(s,'ascii') # or ('utf-8', 'latin1', ...)

would of course work in this case, but in general, if s holds any
data with bytes > 127, the actual data will be changed according
to the provided encoding.

What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...

Regards,
                        Peter


#22272

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-03-28 12:42 +0200
Message-ID	<mailman.1069.1332931371.3037.python-list@python.org>
In reply to	#22270

Am 28.03.2012 11:43, schrieb Peter Daum:
> ... in my example, the variable s points to a "string", i.e. a series 
> of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.

No; a string contains a series of codepoints from the unicode plane, 
representing natural language characters (at least in the simplistic 
view, I'm not talking about surrogates). These can be encoded to 
different binary storage representations, of which ascii is (a common) 
one.

> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...

There is "logically" no raw data in the string, just a series of 
codepoints, as stated above. You'll have to specify the encoding to use 
to get at "raw" data, and from what I gather you're interested in the 
latin-1 (or iso-8859-15) encoding, as you're specifically referencing 
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to 
speak).
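
A small sketch of what choosing latin-1 means in practice, assuming Python 3; the sample characters and variable names are chosen only for illustration. Latin-1 maps each code point below 256 to the byte with the same value, while ISO-8859-15 differs from it in a few positions such as the euro sign.

```python
ch = '\u00e9'                           # e with acute accent, code point 0xE9
latin1 = ch.encode('latin-1')
print(latin1)                           # b'\xe9'
print(latin1[0] == ord(ch))             # True: byte value equals code point

euro = '\u20ac'                         # EURO SIGN, code point above 255
latin9 = euro.encode('iso-8859-15')     # ISO-8859-15 has a slot for it
print(latin9)                           # b'\xa4'

try:
    euro.encode('latin-1')              # Latin-1 has no byte for this character
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
print(euro_fits_latin1)                 # False
```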

-- 
--- Heiko.


#22287

From	Peter Daum <gator@cs.tu-berlin.de>
Date	2012-03-28 19:43 +0200
Message-ID	<9th0u8Fuf2U1@mid.dfncis.de>
In reply to	#22272

On 2012-03-28 12:42, Heiko Wundram wrote:
> Am 28.03.2012 11:43, schrieb Peter Daum:
>> ... in my example, the variable s points to a "string", i.e. a series of
>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
> 
> No; a string contains a series of codepoints from the unicode plane,
> representing natural language characters (at least in the simplistic
> view, I'm not talking about surrogates). These can be encoded to
> different binary storage representations, of which ascii is (a common) one.
> 
>> What I am looking for is a general way to just copy the raw data
>> from a "string" object to a "byte" object without any attempt to
>> "decode" or "encode" anything ...
> 
> There is "logically" no raw data in the string, just a series of
> codepoints, as stated above. You'll have to specify the encoding to use
> to get at "raw" data, and from what I gather you're interested in the
> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
> speak).

... I was under the illusion, that python (like e.g. perl) stored
strings internally in utf-8. In this case the "conversion" would simply
mean to re-label the data. Unfortunately, as I meanwhile found out, this
is not the case (nor the "apple encoding" ;-), so it would indeed be
pretty useless.

The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x. The biggest problem that I am facing is, that I
am often dealing with data, that is basically text, but it can contain
8-bit bytes. In this case, I can not safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.

As it seems, this would be far easier with python 2.x. With python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as innocent-looking
as "s=s+'/'" hidden in a rarely reached branch and a
seemingly correct program will crash with a TypeError 2 years
later ...)
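
A minimal sketch of the failure described above, assuming Python 3; the variable name and the path value are made up for illustration. The error only surfaces when the offending line actually runs.

```python
# Hypothetical value held as bytes, with one byte above 127.
path = b'/tmp/caf\xe9'

try:
    path = path + '/'       # str literal: Python 3 refuses to mix str and bytes
except TypeError as exc:
    print('TypeError:', exc)

path = path + b'/'          # bytes literal: works, and the data is untouched
print(path)                 # b'/tmp/caf\xe9/'
```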

Regards,
                         Peter


#22292

From	Heiko Wundram <modelnine@modelnine.org>
Date	2012-03-28 20:13 +0200
Message-ID	<mailman.1084.1332958393.3037.python-list@python.org>
In reply to	#22287

Am 28.03.2012 19:43, schrieb Peter Daum:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as innocent-looking
> as "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

It seems that you're mixing things up wrt. the string/bytes 
distinction; it's not as "complicated" as it might seem.

1) Strings

s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...

all create/refer to string objects. How Python internally stores them 
is none of your concern (actually, that's rather complicated anyway, at 
least with the upcoming Python 3.3), and processing a string basically 
means that you'll work on the natural language characters present in the 
string. Python strings can store (pretty much) all characters and 
surrogates that unicode allows, and when the python interpreter/compiler 
reads strings from input (I'm talking about source files), a default 
encoding defines how the bytes in your input file get interpreted as 
unicode codepoint encodings (generally, it depends on your system locale 
or file header indications) to construct the internal string object 
you're using to access the data in the string.

There is no such thing as a type for a single character; single 
characters are simply strings of length 1 (and so indexing also returns 
a [new] string object).

Single and double quotes behave identically.

The internal encoding used by the Python interpreter is of no concern 
to you.

2) Bytes

s = b'this is a byte-string'
s = b'\x22\x33\x44'

The above define bytes. Think of the bytes type as arrays of 8-bit 
integers, only representing a buffer which you can process as an array 
of fixed-width integers. Reading from a file opened in binary mode gets 
you bytes, and not a string, because Python cannot automagically guess 
what format the input is in (text mode decodes for you, using a default 
or explicitly given encoding).

Indexing the bytes type returns an integer (which is the clearest 
distinction between string and bytes).
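
The indexing difference can be sketched as follows, assuming Python 3; the two sample values are arbitrary.

```python
s = 'abc'
b = b'abc'

print(type(s[0]), s[0])    # <class 'str'> a
print(type(b[0]), b[0])    # <class 'int'> 97

# Slicing, unlike indexing, keeps the type in both cases.
print(s[0:1], b[0:1])      # a b'a'
```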

Being able to input "string-looking" data in source files as bytes is a 
debatable "feature" (IMHO; see the first example), simply because it 
breaks the semantic difference between the two types in the eye of the 
programmer looking at source.

3) Conversions

To get from bytes to string, you have to decode the bytes buffer, 
telling Python what kind of character data is contained in the array of 
integers. After decoding, you'll get a string object which you can 
process using the standard string methods. For decoding to succeed, you 
have to tell Python how the natural language characters are encoded in 
your array of bytes:

b'hello'.decode('iso-8859-15')

To get from string back to bytes (you want to write the natural 
language character data you've processed to a file), you have to encode 
the data in your string buffer, which gets you an array of 8-bit 
integers to write to the output:

'hello'.encode('iso-8859-15')

Most output methods will happily do the encoding for you, using a 
standard encoding, and if that happens to be ASCII, you're getting 
UnicodeEncodeErrors which tell you that a character in your string 
source is unsuited to be transmitted using the encoding you've 
specified.
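
A short sketch of the round trip described above, assuming Python 3 and assuming the input really is ISO-8859-15; the sample bytes and variable names are made up.

```python
raw = b'gr\xfc\xdfe'                  # made-up input; assumed to be ISO-8859-15
text = raw.decode('iso-8859-15')      # bytes -> str
shouted = text.upper()                # str methods work on characters
back = text.encode('iso-8859-15')     # str -> bytes, same encoding as before

print(ascii(text))                    # 'gr\xfc\xdfe'
print(back == raw)                    # True

try:
    text.encode('ascii')              # ASCII cannot express these characters
except UnicodeEncodeError as exc:
    print('cannot encode, starting at index', exc.start)
```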

If the above doesn't make the string/bytes-distinction and usage 
clearer, and you have a C#-background, check out the distinction between 
byte[] (which the System.IO-streams get you), and how you have to use a 
System.Encoding-derived class to get at actual System.String objects to 
manipulate character data. Python's type system wrt. character data is 
pretty much similar, except for missing the "single character" type 
(char).

Anyway, back to what you wrote: how are you getting the input data? Why 
are "high bytes" in there which you do not know the encoding for? 
Generally, from what I gather, you'll decode data from some source, 
process it, and write it back using the same encoding which you used for 
decoding, which should do exactly what you want and not get you into any 
trouble with encodings.

-- 
--- Heiko.


#22293

From	Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date	2012-03-28 21:13 +0300
Message-ID	<qotbong1tlq.fsf@ruuvi.it.helsinki.fi>
In reply to	#22287

Peter Daum writes:

> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.
> 
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can read as bytes and decode as ASCII but ignoring the troublesome
non-text characters:

>>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.

The paragraph is from the German Wikipedia on ASCII, in UTF-8.
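
The effect on a single word can be sketched like this, assuming Python 3; the sample word and variable names are only for illustration. With 'ignore' the bytes that are not ASCII simply vanish, which is why "für" shows up as "fr" above.

```python
data = 'f\u00fcr'.encode('utf-8')            # b'f\xc3\xbcr', as in a UTF-8 file

ignored = data.decode('ascii', 'ignore')     # undecodable bytes are dropped
replaced = data.decode('ascii', 'replace')   # each one becomes U+FFFD

print(ignored)            # fr
print(ascii(replaced))    # 'f\ufffd\ufffdr'
```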


#22300

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-03-28 18:31 +0000
Message-ID	<mailman.1089.1332959476.3037.python-list@python.org>
In reply to	#22293

> You can read as bytes and decode as ASCII but ignoring the troublesome
> non-text characters:
> 
> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
> 
> The paragraph is from the German Wikipedia on ASCII, in UTF-8.

I see no non-ASCII characters, not sure if that is because the source
has none or something else. From this example I would not say that
the rest of the text is "unchanged".  Decode converts to Unicode,
did you mean encode?

I think "ignore" will remove non-translatable characters and not 
leave them in the returned string.

Ramit



#22308

From	Ethan Furman <ethan@stoneleaf.us>
Date	2012-03-28 11:49 -0700
Message-ID	<mailman.1095.1332963309.3037.python-list@python.org>
In reply to	#22293

Prasad, Ramit wrote:
>> You can read as bytes and decode as ASCII but ignoring the troublesome
>> non-text characters:
>>
>> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
>> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
>> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
>> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
>> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
>> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
>> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
>> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
>> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
>> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>>
>> The paragraph is from the German Wikipedia on ASCII, in UTF-8.
> 
> I see no non-ASCII characters, not sure if that is because the source
> has none or something else.

The 'ignore' argument to .decode() caused all non-ascii characters to be 
removed.

~Ethan~


#22294

From	"Prasad, Ramit" <ramit.prasad@jpmorgan.com>
Date	2012-03-28 18:20 +0000
Message-ID	<mailman.1085.1332958844.3037.python-list@python.org>
In reply to	#22287

> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as innocent-looking
> as "s=s+'/'" hidden in a rarely reached branch and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)

Just a small note as you are new to Python: repeated string concatenation
in a loop can be expensive (quadratic time). The Python (2.x and 3.x) idiom 
for frequent string concatenation is to append to a list and then join 
the pieces, as in the following (linear time). 

>>> lst = ['Hi,']
>>> lst.append('how')
>>> lst.append('are')
>>> lst.append('you?')
>>> sentence = ' '.join(lst)  # use a space separating each element
>>> print(sentence)
Hi, how are you?

You can use join on an empty string, but then they will not be 
separated by spaces.

>>> sentence = ''.join(lst)  # empty string so no separation
>>> print(sentence)
Hi,howareyou?

You can use any string as a separator, length does not matter.

>>> sentence = '@-Q'.join(lst)
>>> print(sentence)
Hi,@-Qhow@-Qare@-Qyou?


Ramit



#22295

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-03-28 12:20 -0600
Message-ID	<mailman.1086.1332958864.3037.python-list@python.org>
In reply to	#22287

On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <gator@cs.tu-berlin.de> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII.  And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without
knowing something about the encoding.  Say you encounter a byte
greater than 127.  Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character?  If the next byte is less
than 128, is it an ASCII character, or a continuation of the previous
character?  For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure.  If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
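
A sketch of the kind of damage meant here, assuming Python 3; the character and the replacement are picked purely as an example. In Shift-JIS the second byte of this character has the same value as an ASCII backslash, whereas in UTF-8 every byte of a multi-byte character is 0x80 or above.

```python
ch = '\u8868'                   # a CJK character, used only as an example

sjis = ch.encode('shift_jis')
utf8 = ch.encode('utf-8')

print(sjis)    # b'\x95\\'         second byte is 0x5C, the ASCII backslash
print(utf8)    # b'\xe8\xa1\xa8'   every byte is >= 0x80

# Treating the Shift-JIS bytes as "ASCII plus unknown stuff" corrupts them:
mangled = sjis.replace(b'\\', b'/')
print(mangled)                                  # b'\x95/'
print(all(byte >= 0x80 for byte in utf8))       # True
```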

Cheers,
Ian


#22296

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-03-28 18:26 +0000
Message-ID	<4f7357d5$0$29981$c3e8da3$5496439d@news.astraweb.com>
In reply to	#22287

On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I am
> often dealing with data, that is basically text, but it can contain
> 8-bit bytes. 

All bytes are 8-bit, at least on modern hardware. I think you have to go 
back to the 1950s to find 10-bit or 12-bit machines.

> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Well you can't do that, because *by definition* you are changing a 
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is, 
*how* do you want to change them?

You can use an error handler to convert any untranslatable characters 
into question marks, or to ignore them altogether:

data = text.encode('ascii', 'replace')   # untranslatable chars become b'?'
data = text.encode('ascii', 'ignore')    # untranslatable chars are dropped

When going the other way, from bytes to strings, it can sometimes be 
useful to use the Latin-1 encoding, which essentially cannot fail:

text = data.decode('latin1')

although the non-ASCII chars that you get may not be sensible or 
meaningful in any way. But if there are only a few of them, and you don't 
care too much, this may be a simple approach.

But in a nutshell, it is physically impossible to map the millions of 
Unicode characters to just 256 possible bytes without either throwing 
some characters away, or performing an encoding.
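
The two error handlers and the Latin-1 fallback can be sketched as follows, assuming Python 3; the sample text and the variable names are illustrative only.

```python
text = 'na\u00efve caf\u00e9'

replaced = text.encode('ascii', 'replace')   # non-ASCII characters become b'?'
ignored = text.encode('ascii', 'ignore')     # non-ASCII characters are dropped

print(replaced)    # b'na?ve caf?'
print(ignored)     # b'nave caf'

# Latin-1 decoding accepts every possible byte value, so it never raises.
every_byte = bytes(range(256))
decoded = every_byte.decode('latin-1')
print(len(decoded))                             # 256
print(decoded.encode('latin-1') == every_byte)  # True
```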

> As it seems, this would be far easier with python 2.x. 

It only seems that way until you try.

-- 
Steven


#22309

From	Grant Edwards <invalid@invalid.invalid>
Date	2012-03-28 19:40 +0000
Message-ID	<jkvpg8$9nk$1@reader1.panix.com>
In reply to	#22296

On 2012-03-28, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
>
>> The longer story of my question is: I am new to python (obviously), and
>> since I am not familiar with either one, I thought it would be advisable
>> to go for python 3.x. The biggest problem that I am facing is, that I am
>> often dealing with data, that is basically text, but it can contain
>> 8-bit bytes. 
>
> All bytes are 8-bit, at least on modern hardware. I think you have to
> go back to the 1950s to find 10-bit or 12-bit machines.

Well, on anything likely to run Python that's true.  There are modern
DSP-oriented CPUs where a byte is 16 or 32 bits (and so is an int and
a long, and a float and a double).

>> As it seems, this would be far easier with python 2.x. 
>
> It only seems that way until you try.

It's easy as long as you deal with nothing but ASCII and Latin-1. ;)

-- 
Grant Edwards               grant.b.edwards        Yow! Somewhere in Tenafly,
                                  at               New Jersey, a chiropractor
                              gmail.com            is viewing "Leave it to
                                                   Beaver"!


#22301

From	Ethan Furman <ethan@stoneleaf.us>
Date	2012-03-28 11:17 -0700
Message-ID	<mailman.1090.1332959710.3037.python-list@python.org>
In reply to	#22287

Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
> 
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is that I
> am often dealing with data that is basically text, but it can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Where is the data coming from?  Files?  In that case, it sounds like you 
will want to decode/encode using 'latin-1', as the bulk of your text is 
plain ascii and you don't really care about the upper-ascii chars.

~Ethan~
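
A minimal sketch of the latin-1 suggestion above, assuming the data comes from a file. The file name and the `upper_ascii_only` helper are made up for illustration. Latin-1 maps every byte value 0-255 to the code point with the same number, so decoding cannot fail and re-encoding gives back the same bytes. `newline=''` is passed so that text mode does not translate line endings.

```python
import os
import tempfile

def upper_ascii_only(text):
    """Hypothetical edit: uppercase ASCII letters, leave everything else alone."""
    return ''.join(c.upper() if 'a' <= c <= 'z' else c for c in text)

# Sample data: ASCII text mixed with bytes >= 0x80 of unknown encoding.
original = b'name=\xe4\xf6\xfc\r\nvalue=\x80\x9f\xff\n'

path = os.path.join(tempfile.mkdtemp(), 'sample.dat')  # made-up file name
with open(path, 'wb') as f:
    f.write(original)

# Read as latin-1 text; newline='' keeps \r\n and \n exactly as stored.
with open(path, encoding='latin-1', newline='') as f:
    text = f.read()

edited = upper_ascii_only(text)

# Write back with the same settings; the non-ASCII bytes come out unchanged.
with open(path, 'w', encoding='latin-1', newline='') as f:
    f.write(edited)

with open(path, 'rb') as f:
    result = f.read()

print(result)
```

The helper restricts itself to ASCII letters on purpose: a plain `text.upper()` would treat 0xe4 as the latin-1 letter "ä" and change it as well.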

#22307

From	John Nagle <nagle@animats.com>
Date	2012-03-28 12:30 -0700
Message-ID	<jkvot5$m3o$1@dont-email.me>
In reply to	#22287

On 3/28/2012 10:43 AM, Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> On 28.03.2012 11:43, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is that I
> am often dealing with data that is basically text, but it can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

    So why let the data get into a "str" type at all? Do everything
end to end with "bytes" or "bytearray" types.

				John Nagle
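
A minimal sketch of the bytes-only approach, assuming the goal is to edit the ASCII parts of some data and pass everything else through. The sample data and the "User:" to "Login:" rewrite are invented for illustration. Nothing is decoded, so bytes >= 0x80 cannot be altered by accident.

```python
# Sample input: mostly ASCII, with some bytes >= 0x80 of unknown encoding.
data = b'Host: ex\xe4mple\nUser: p\xfcter\n'

out = []
for line in data.split(b'\n'):       # bytes has most of the str methods
    if line.startswith(b'User:'):    # every literal needs the b prefix
        line = line.replace(b'User:', b'Login:', 1)
    out.append(line)
result = b'\n'.join(out)
print(result)

# bytearray is the mutable variant, for in-place edits.
buf = bytearray(data)
buf[0:4] = b'HOST'
print(bytes(buf))
```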

#22315

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-03-28 17:37 -0400
Message-ID	<mailman.1098.1332970699.3037.python-list@python.org>
In reply to	#22287

On 3/28/2012 1:43 PM, Peter Daum wrote:

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x.

I strongly agree with that unless you have reason to use 2.7. Python 3.3 
(.0a1 is nearly out) has an improved unicode implementation, among other 
things.

> The biggest problem that I am facing is that I
> am often dealing with data that is basically text, but it can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You are assuming, or must assume, that the text is in an 
ascii-compatible encoding, meaning that bytes 0-127 really represent 
ascii chars. Otherwise, you cannot reliably interpret anything, let 
alone change it.

This problem of knowing that much but not the specific encoding is 
unfortunately common. It has been discussed among core developers and 
others the last few months. Different people prefer one of the following 
approaches.

1. Keep the bytes as bytes and use bytes literals and bytes functions as 
needed. The danger, as you noticed, is forgetting the 'b' prefix.

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1' 
chars. When done, encode back to 'latin-1' and the non-ascii chars will 
be as they originally were. The danger is forgetting the pretense, and 
perhaps passing on the string (as a string, not bytes) to other 
modules that will not know the pretense.

3. Decode using encoding = 'ascii', errors='surrogate_escape'. This 
reversibly encodes the unknown non-ascii chars as 'illegal' non-chars 
(using the surrogate-pair second-half code units). This is probably the 
safest in that invalid operations on the non-chars should raise an 
exception. Re-encoding with the same setting will reproduce the original 
hi-bit chars. The main danger is passing the illegal strings out of your 
local sandbox.

-- 
Terry Jan Reedy
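
A minimal sketch of approaches 2 and 3 from the list above, using invented sample data. One assumption to flag: the error handler name that Python accepts is `surrogateescape`, without an underscore.

```python
raw = b'caf\xe9 costs 5\x80\n'   # encoding unknown, but ASCII-compatible

# Approach 2: pretend the data is latin-1, edit the ASCII part, encode back.
s2 = raw.decode('latin-1')
s2 = s2.replace('costs', 'COSTS')
back2 = s2.encode('latin-1')

# Approach 3: decode as ASCII; each undecodable byte 0xNN becomes the lone
# surrogate U+DCNN and is turned back into the byte when encoding.
s3 = raw.decode('ascii', errors='surrogateescape')
escaped = [hex(ord(c)) for c in s3 if ord(c) > 127]
s3 = s3.replace('costs', 'COSTS')
back3 = s3.encode('ascii', errors='surrogateescape')

# A strict encode of the escaped string fails instead of silently
# writing out something wrong.
try:
    s3.encode('utf-8')
    leaked = True
except UnicodeEncodeError:
    leaked = False

print(back2, back3, escaped, leaked)
```

Both round trips give back the original high bytes. The difference shows up when the string leaves the local code: the latin-1 string encodes to UTF-8 without complaint (producing different bytes), while the surrogate-escaped one raises.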

#22340

From	Peter Daum <gator@cs.tu-berlin.de>
Date	2012-03-29 16:57 +0200
Message-ID	<4F74784F.40804@cs.tu-berlin.de>
In reply to	#22315

On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.

... actually, at the beginning of my quest, I ran into a decoding
exception trying to read data as "latin1" (which was more or less what
I had expected anyway because byte values between 128 and 160 are not
defined there).

Obviously, I must have misinterpreted something there;
I just ran a little test:

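  b = bytes(range(256))            # every possible byte value, 0x00-0xff
  s = b.decode('latin1')           # bytes -> str
  b = s.encode('latin1')           # str -> bytes
  s = b.decode('latin1')           # and back again, to print the result
  for c in s:
      print(hex(ord(c)), end=' ')
      if (ord(c) + 1) % 16 == 0:   # line break after every 16 values
          print()
  print()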

... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into then were pretty real, though ;-)

> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.

Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in the
documentation are not really helpful, because the non-decodable bytes
will be lost. With some trying, I got it to work, too (the option is named
"surrogateescape" without the "_", and in python 3.1 it exists, but
not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)

Thank you very much for your constructive advice!

Regards,
                                 Peter

#22390

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2012-03-30 22:06 +0300
Message-ID	<mailman.1154.1333134438.3037.python-list@python.org>
In reply to	#22287

28.03.12 21:13, Heiko Wundram wrote:
> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.

In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw 
for access to the byte stream. And reading from a file opened in text mode 
gets you a string too.
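
A minimal sketch of the text layer versus the byte layer. To keep it runnable without piping anything in, it uses `io.BytesIO` and `io.TextIOWrapper` as stand-ins for `sys.stdin.buffer` and `sys.stdin`; the payload and the choice of latin-1 are assumptions made for illustration.

```python
import io

payload = b'abc \xe9\xff\n'

# Stand-in for sys.stdin.buffer: a binary stream, read() returns bytes.
binary_in = io.BytesIO(payload)
as_bytes = binary_in.read()

# Stand-in for sys.stdin: a text wrapper around a binary stream,
# read() returns str, decoded with the wrapper's encoding.
text_in = io.TextIOWrapper(io.BytesIO(payload), encoding='latin-1', newline='')
as_text = text_in.read()

print(type(as_bytes).__name__, type(as_text).__name__)

# The wrapper exposes the underlying binary stream as .buffer, which is
# the attribute used in sys.stdin.buffer.
print(type(text_in.buffer).__name__)
```

In a real script the bytes would come from `sys.stdin.buffer.read()`, and from `open(path, 'rb')` for files.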

#22391

From	Chris Angelico <rosuav@gmail.com>
Date	2012-03-31 06:10 +1100
Message-ID	<mailman.1155.1333134606.3037.python-list@python.org>
In reply to	#22287

On Sat, Mar 31, 2012 at 6:06 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 28.03.12 21:13, Heiko Wundram wrote:
>
>> Reading from stdin/a file gets you bytes, and
>> not a string, because Python cannot automagically guess what format the
>> input is in.
>
>
> In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw
> for access to the byte stream. And reading from a file opened in text mode
> gets you a string too.

True. But that's only if it's been told the encoding of stdin (which I
believe is the normal case on Linux). It's still not "automagically
guess(ing)", it's explicitly told.

ChrisA
