| Started by | Peter Daum <gator@cs.tu-berlin.de> |
|---|---|
| First post | 2012-03-28 10:56 +0200 |
| Last post | 2012-03-28 13:16 -0400 |
| Articles | 20 on this page of 57 — 22 participants |
"convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 10:56 +0200
Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-28 20:02 +1100
Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 11:43 +0200
Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 12:42 +0200
Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-28 19:43 +0200
Re: "convert" string to bytes without changing data (encoding) Heiko Wundram <modelnine@modelnine.org> - 2012-03-28 20:13 +0200
Re: "convert" string to bytes without changing data (encoding) Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2012-03-28 21:13 +0300
RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:31 +0000
Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:49 -0700
RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 18:20 +0000
Re: "convert" string to bytes without changing data (encoding) Ian Kelly <ian.g.kelly@gmail.com> - 2012-03-28 12:20 -0600
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:26 +0000
Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:40 +0000
Re: "convert" string to bytes without changing data (encoding) Ethan Furman <ethan@stoneleaf.us> - 2012-03-28 11:17 -0700
Re: "convert" string to bytes without changing data (encoding) John Nagle <nagle@animats.com> - 2012-03-28 12:30 -0700
Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 17:37 -0400
Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-29 16:57 +0200
Re: "convert" string to bytes without changing data (encoding) Peter Daum <gator@cs.tu-berlin.de> - 2012-03-29 16:57 +0200
Re: "convert" string to bytes without changing data (encoding) Serhiy Storchaka <storchaka@gmail.com> - 2012-03-30 22:06 +0300
Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-31 06:10 +1100
Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 13:25 +0200
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 18:12 +0000
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 11:36 -0400
Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 03:18 +1100
Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 16:33 +0000
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:05 -0400
Re: "convert" string to bytes without changing data (encoding) Tim Chase <python.list@tim.thechases.com> - 2012-03-28 13:49 -0500
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:10 -0400
Re: "convert" string to bytes without changing data (encoding) "Albert W. Hopkins" <marduk@letterboxes.org> - 2012-03-28 15:22 -0400
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-28 17:54 +0000
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 14:22 -0400
Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 14:20 -0500
Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 15:43 -0400
Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-28 21:44 +0100
Re: "convert" string to bytes without changing data (encoding) Neil Cerutti <neilc@norwich.edu> - 2012-03-28 20:56 +0000
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 00:02 +0000
Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-28 19:11 -0500
Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:04 -0400
Re: Re: Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-29 14:31 +1100
Re: Re: Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-28 23:58 -0400
Re: "convert" string to bytes without changing data (encoding) Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-03-29 07:01 +0100
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-29 06:51 +0000
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 11:30 -0400
Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-29 12:49 -0400
Re: "convert" string to bytes without changing data (encoding) Ross Ridge <rridge@csclub.uwaterloo.ca> - 2012-03-29 14:00 -0400
Re: "convert" string to bytes without changing data (encoding) Chris Angelico <rosuav@gmail.com> - 2012-03-30 07:41 +1100
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:16 +0000
Re: Re: Re: Re: "convert" string to bytes without changing data (encoding) Evan Driscoll <driscoll@cs.wisc.edu> - 2012-03-29 11:31 -0500
RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-28 19:02 +0000
Re: "convert" string to bytes without changing data (encoding) Grant Edwards <invalid@invalid.invalid> - 2012-03-28 19:44 +0000
Re: "convert" string to bytes without changing data (encoding) MRAB <python@mrabarnett.plus.com> - 2012-03-28 20:50 +0100
RE: "convert" string to bytes without changing data (encoding) "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2012-03-29 17:36 +0000
Re: "convert" string to bytes without changing data (encoding) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-03-30 01:10 +0000
Re: "convert" string to bytes without changing data (encoding) Michael Ströder <michael@stroeder.com> - 2012-03-30 09:04 +0200
Re: "convert" string to bytes without changing data (encoding) Terry Reedy <tjreedy@udel.edu> - 2012-03-28 14:11 -0400
Re: "convert" string to bytes without changing data (encoding) Stefan Behnel <stefan_ml@behnel.de> - 2012-03-28 11:08 +0200
Re: "convert" string to bytes without changing data (encoding) Dave Angel <d@davea.name> - 2012-03-28 13:16 -0400
| From | Peter Daum <gator@cs.tu-berlin.de> |
|---|---|
| Date | 2012-03-28 10:56 +0200 |
| Subject | "convert" string to bytes without changing data (encoding) |
| Message-ID | <9tg21lFmo3U1@mid.dfncis.de> |
Hi,
is there any way to convert a string to bytes without
interpreting the data in any way? Something like:
s='abcde'
b=bytes(s, "unchanged")
Regards,
Peter
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-03-28 20:02 +1100 |
| Message-ID | <mailman.1065.1332925364.3037.python-list@python.org> |
| In reply to | #22266 |
On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <gator@cs.tu-berlin.de> wrote:
> Hi,
>
> is there any way to convert a string to bytes without
> interpreting the data in any way? Something like:
>
> s='abcde'
> b=bytes(s, "unchanged")
What is a string? It's not a series of bytes. You can't convert it
without encoding those characters into bytes in some way.
ChrisA
| From | Peter Daum <gator@cs.tu-berlin.de> |
|---|---|
| Date | 2012-03-28 11:43 +0200 |
| Message-ID | <9tg4qoFbfpU1@mid.dfncis.de> |
| In reply to | #22267 |
On 2012-03-28 11:02, Chris Angelico wrote:
> On Wed, Mar 28, 2012 at 7:56 PM, Peter Daum <gator@cs.tu-berlin.de> wrote:
>> is there any way to convert a string to bytes without
>> interpreting the data in any way? Something like:
>>
>> s='abcde'
>> b=bytes(s, "unchanged")
>
> What is a string? It's not a series of bytes. You can't convert it
> without encoding those characters into bytes in some way.
... in my example, the variable s points to a "string", i.e. a series of
bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
b=bytes(s,'ascii') # or ('utf-8', 'latin1', ...)
would of course work in this case, but in general, if s holds any
data with bytes > 127, the actual data will be changed according
to the provided encoding.
What I am looking for is a general way to just copy the raw data
from a "string" object to a "byte" object without any attempt to
"decode" or "encode" anything ...
Regards,
Peter
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-03-28 12:42 +0200 |
| Message-ID | <mailman.1069.1332931371.3037.python-list@python.org> |
| In reply to | #22270 |
On 28.03.2012 11:43, Peter Daum wrote:
> ... in my example, the variable s points to a "string", i.e. a series of
> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
No; a string contains a series of codepoints from the unicode plane,
representing natural language characters (at least in the simplistic
view, I'm not talking about surrogates). These can be encoded to
different binary storage representations, of which ascii is (a common) one.
> What I am looking for is a general way to just copy the raw data
> from a "string" object to a "byte" object without any attempt to
> "decode" or "encode" anything ...
There is "logically" no raw data in the string, just a series of
codepoints, as stated above. You'll have to specify the encoding to use
to get at "raw" data, and from what I gather you're interested in the
latin-1 (or iso-8859-15) encoding, as you're specifically referencing
chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
speak).
--
--- Heiko.
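To make the "no raw data" point concrete, here is a minimal sketch (not part of the original post) that encodes one and the same one-character string with three codecs. The variable names are made up for illustration; the codec names are standard Python ones.

```python
# One str, several byte representations: the bytes depend on the codec chosen.
text = "\u00e9"  # a single code point, U+00E9 (e with acute accent)

as_latin1 = text.encode("latin-1")    # one byte: 0xE9
as_utf8 = text.encode("utf-8")        # two bytes: 0xC3 0xA9
as_utf16 = text.encode("utf-16-le")   # two bytes: 0xE9 0x00

print(len(text), as_latin1, as_utf8, as_utf16)
```

None of the three results is "the" raw data of the string; each is just one possible serialization of the same code point.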
| From | Peter Daum <gator@cs.tu-berlin.de> |
|---|---|
| Date | 2012-03-28 19:43 +0200 |
| Message-ID | <9th0u8Fuf2U1@mid.dfncis.de> |
| In reply to | #22272 |
On 2012-03-28 12:42, Heiko Wundram wrote:
> On 28.03.2012 11:43, Peter Daum wrote:
>> ... in my example, the variable s points to a "string", i.e. a series of
>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>
> No; a string contains a series of codepoints from the unicode plane,
> representing natural language characters (at least in the simplistic
> view, I'm not talking about surrogates). These can be encoded to
> different binary storage representations, of which ascii is (a common) one.
>
>> What I am looking for is a general way to just copy the raw data
>> from a "string" object to a "byte" object without any attempt to
>> "decode" or "encode" anything ...
>
> There is "logically" no raw data in the string, just a series of
> codepoints, as stated above. You'll have to specify the encoding to use
> to get at "raw" data, and from what I gather you're interested in the
> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
> speak).
... I was under the illusion, that python (like e.g. perl) stored
strings internally in utf-8. In this case the "conversion" would simply
mean to re-label the data. Unfortunately, as I meanwhile found out, this
is not the case (nor the "apple encoding" ;-), so it would indeed be
pretty useless.
The longer story of my question is: I am new to python (obviously), and
since I am not familiar with either one, I thought it would be advisable
to go for python 3.x. The biggest problem that I am facing is, that I
am often dealing with data, that is basically text, but it can contain
8-bit bytes. In this case, I can not safely assume any given encoding,
but I actually also don't need to know - for my purposes, it would be
perfectly good enough to deal with the ascii portions and keep anything
else unchanged.
As it seems, this would be far easier with python 2.x. With python 3
and its strict distinction between "str" and "bytes", things get
syntactically pretty awkward and error-prone (something as
innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
seemingly correct program will crash with a TypeError 2 years
later ...)
Regards,
Peter
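A minimal sketch of the failure mode described above, assuming a hypothetical helper `add_slash` (not from the original post) that always appends a str literal. It works as long as it is handed str and raises TypeError the first time bytes reach it:

```python
def add_slash(path):
    # Hypothetical helper: appends a str "/" whatever the argument type is.
    return path + "/"

print(add_slash("dir"))        # str + str works

try:
    add_slash(b"dir")          # bytes + str is rejected in Python 3
except TypeError as exc:
    print("TypeError:", exc)
```

The error is raised at the point of mixing, so a test that drives bytes through the rarely reached branch would surface it early.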
| From | Heiko Wundram <modelnine@modelnine.org> |
|---|---|
| Date | 2012-03-28 20:13 +0200 |
| Message-ID | <mailman.1084.1332958393.3037.python-list@python.org> |
| In reply to | #22287 |
On 28.03.2012 19:43, Peter Daum wrote:
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as
> innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)
It seems that you're mixing things up wrt. the string/bytes
distinction; it's not as "complicated" as it might seem.
1) Strings
s = "This is a test string"
s = 'This is another test string with single quotes'
s = """
And this is a multiline test string.
"""
s = 'c' # This is also a string...
all create/refer to string objects. How Python internally stores them
is none of your concern (actually, that's rather complicated anyway, at
least with the upcoming Python 3.3), and processing a string basically
means that you'll work on the natural language characters present in the
string. Python strings can store (pretty much) all characters and
surrogates that unicode allows, and when the python interpreter/compiler
reads strings from input (I'm talking about source files), a default
encoding defines how the bytes in your input file get interpreted as
unicode codepoint encodings (generally, it depends on your system locale
or file header indications) to construct the internal string object
you're using to access the data in the string.
There is no such thing as a type for a single character; single
characters are simply strings of length 1 (and so indexing also returns
a [new] string object).
Single/double quotes work no differently.
The internal encoding used by the Python interpreter is of no concern
to you.
2) Bytes
s = b'this is a byte-string'
s = b'\x22\x33\x44'
The above define bytes. Think of the bytes type as arrays of 8-bit
integers, only representing a buffer which you can process as an array
of fixed-width integers. Reading from a file opened in binary mode (or
from sys.stdin.buffer) gets you bytes, and not a string, because Python
cannot automagically guess what format the input is in; in text mode,
Python decodes for you using a default or explicitly given encoding.
Indexing the bytes type returns an integer (which is the clearest
distinction between string and bytes).
Being able to input "string-looking" data in source files as bytes is a
debatable "feature" (IMHO; see the first example), simply because it
breaks the semantic difference between the two types in the eye of the
programmer looking at source.
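The indexing difference mentioned above can be checked directly. This is a small illustrative sketch with made-up variable names:

```python
s = "abc"
b = b"abc"

print(type(s[0]), s[0])   # indexing a str gives a str of length 1
print(type(b[0]), b[0])   # indexing bytes gives an int in range(256)
print(b[0:1])             # slicing bytes, by contrast, gives bytes
print(list(b))            # the buffer viewed as a list of integers
```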
3) Conversions
To get from bytes to string, you have to decode the bytes buffer,
telling Python what kind of character data is contained in the array of
integers. After decoding, you'll get a string object which you can
process using the standard string methods. For decoding to succeed, you
have to tell Python how the natural language characters are encoded in
your array of bytes:
b'hello'.decode('iso-8859-15')
To get from string back to bytes (you want to write the natural
language character data you've processed to a file), you have to encode
the data in your string buffer, which gets you an array of 8-bit
integers to write to the output:
'hello'.encode('iso-8859-15')
Most output methods will happily do the encoding for you, using a
standard encoding, and if that happens to be ASCII, you're getting
UnicodeEncodeErrors which tell you that a character in your string
source is unsuited to be transmitted using the encoding you've
specified.
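As a rough sketch of the decode, process, encode cycle described above (the sample bytes and variable names are assumptions for illustration, not taken from the post):

```python
raw = b"caf\xe9"                     # bytes as they might arrive from a file
text = raw.decode("iso-8859-15")     # bytes -> str
shout = text.upper()                 # ordinary str processing
out = shout.encode("iso-8859-15")    # str -> bytes, same codec as on input

print(text, shout, out)

try:
    shout.encode("ascii")            # this codec cannot represent U+00C9
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

Using the same codec in both directions is what keeps the untouched high bytes consistent with the input.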
If the above doesn't make the string/bytes-distinction and usage
clearer, and you have a C#-background, check out the distinction between
byte[] (which the System.IO-streams get you), and how you have to use a
System.Encoding-derived class to get at actual System.String objects to
manipulate character data. Python's type system wrt. character data is
pretty much similar, except for missing the "single character" type
(char).
Anyway, back to what you wrote: how are you getting the input data? Why
are "high bytes" in there which you do not know the encoding for?
Generally, from what I gather, you'll decode data from some source,
process it, and write it back using the same encoding which you used for
decoding, which should do exactly what you want and not get you into any
trouble with encodings.
--
--- Heiko.
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2012-03-28 21:13 +0300 |
| Message-ID | <qotbong1tlq.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #22287 |
Peter Daum writes:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
You can read as bytes and decode as ASCII but ignoring the troublesome
non-text characters:
>>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
(Parittsbit) auf den Kommunikationsleitungen oder fr andere
Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
so dass alle im ASCII definierten Zeichen auch in den verschiedenen
Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
The paragraph is from the German Wikipedia on ASCII, in UTF-8.
| From | "Prasad, Ramit" <ramit.prasad@jpmorgan.com> |
|---|---|
| Date | 2012-03-28 18:31 +0000 |
| Message-ID | <mailman.1089.1332959476.3037.python-list@python.org> |
| In reply to | #22293 |
> You can read as bytes and decode as ASCII but ignoring the troublesome
> non-text characters:
>
> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>
> The paragraph is from the German Wikipedia on ASCII, in UTF-8.
I see no non-ASCII characters, not sure if that is because the source
has none or something else. From this example I would not say that
the rest of the text is "unchanged". Decode converts to Unicode,
did you mean encode?
I think "ignore" will remove non-translatable characters and not
leave them in the returned string.
Ramit
Ramit Prasad | JPMorgan Chase Investment Bank | Currencies Technology
712 Main Street | Houston, TX 77002
work phone: 713 - 216 - 5423
--
This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2012-03-28 11:49 -0700 |
| Message-ID | <mailman.1095.1332963309.3037.python-list@python.org> |
| In reply to | #22293 |
Prasad, Ramit wrote:
>> You can read as bytes and decode as ASCII but ignoring the troublesome
>> non-text characters:
>>
>> >>> print(open('text.txt', 'br').read().decode('ascii', 'ignore'))
>> Das fr ASCII nicht benutzte Bit kann auch fr Fehlerkorrekturzwecke
>> (Parittsbit) auf den Kommunikationsleitungen oder fr andere
>> Steuerungsaufgaben verwendet werden. Heute wird es aber fast immer zur
>> Erweiterung von ASCII auf einen 8-Bit-Code verwendet. Diese
>> Erweiterungen sind mit dem ursprnglichen ASCII weitgehend kompatibel,
>> so dass alle im ASCII definierten Zeichen auch in den verschiedenen
>> Erweiterungen durch die gleichen Bitmuster kodiert werden. Die
>> einfachsten Erweiterungen sind Kodierungen mit sprachspezifischen
>> Zeichen, die nicht im lateinischen Grundalphabet enthalten sind.
>>
>> The paragraph is from the German Wikipedia on ASCII, in UTF-8.
>
> I see no non-ASCII characters, not sure if that is because the source
> has none or something else.
The 'ignore' argument to .decode() caused all non-ascii characters to be
removed.
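A small sketch of what 'ignore' does on decoding, using one German word instead of the whole paragraph (the sample word and names are chosen here for illustration). The 'replace' handler is shown for contrast:

```python
raw = "f\u00fcr".encode("utf-8")      # the word "für" as UTF-8: b'f\xc3\xbcr'

ignored = raw.decode("ascii", "ignore")    # non-ASCII bytes silently dropped
replaced = raw.decode("ascii", "replace")  # each bad byte becomes U+FFFD

print(repr(ignored), repr(replaced))
print(ignored.encode("ascii") == raw)      # False: the original bytes are gone
```

So 'ignore' is lossy: the result cannot be turned back into the input, which is why it does not keep "anything else unchanged".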
~Ethan~
| From | "Prasad, Ramit" <ramit.prasad@jpmorgan.com> |
|---|---|
| Date | 2012-03-28 18:20 +0000 |
| Message-ID | <mailman.1085.1332958844.3037.python-list@python.org> |
| In reply to | #22287 |
> As it seems, this would be far easier with python 2.x. With python 3
> and its strict distinction between "str" and "bytes", things get
> syntactically pretty awkward and error-prone (something as
> innocent-looking as "s=s+'/'" hidden in a rarely reached branch, and a
> seemingly correct program will crash with a TypeError 2 years
> later ...)
Just a small note as you are new to Python, string concatenation can
be expensive (quadratic time). The Python (2.x and 3.x) idiom for
frequent string concatenation is to append to a list and then join
them like the following (linear time).
>>> lst = [ 'Hi,' ]
>>> lst.append( 'how' )
>>> lst.append( 'are' )
>>> lst.append( 'you?' )
>>> sentence = ' '.join( lst ) # use a space separating each element
>>> print(sentence)
Hi, how are you?
You can use join on an empty string, but then they will not be
separated by spaces.
>>> sentence = ''.join( lst ) # empty string so no separation
>>> print(sentence)
Hi,howareyou?
You can use any string as a separator, length does not matter.
>>> sentence = '@-Q'.join( lst )
>>> print(sentence)
Hi,@-Qhow@-Qare@-Qyou?
Ramit
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-03-28 12:20 -0600 |
| Message-ID | <mailman.1086.1332958864.3037.python-list@python.org> |
| In reply to | #22287 |
On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <gator@cs.tu-berlin.de> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simply
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.
No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII. And those are all
implementation details that could easily change in future versions of
Python.
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
You can't generally just "deal with the ascii portions" without
knowing something about the encoding. Say you encounter a byte
greater than 127. Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character? If the next byte is less than
128, is it an ASCII character, or a continuation of the previous
character? For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure. If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.
Cheers,
Ian
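One concrete case of the ambiguity described above, as a sketch: Shift-JIS is picked here purely as an example of a multi-byte encoding whose trail bytes can fall in the ASCII range; it is not mentioned in the original post.

```python
katakana_so = "\u30bd"                      # KATAKANA LETTER SO

sjis = katakana_so.encode("shift_jis")      # two bytes: 0x83 0x5C
print(sjis, 0x5C in sjis)                   # 0x5C is also ASCII backslash

utf8 = katakana_so.encode("utf-8")          # three bytes, all >= 0x80
print(utf8, all(byte >= 0x80 for byte in utf8))
```

Code that treats every byte below 128 as ASCII would see a backslash inside the Shift-JIS data, while the same assumption is safe for the UTF-8 bytes.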
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-03-28 18:26 +0000 |
| Message-ID | <4f7357d5$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #22287 |
On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I am
> often dealing with data, that is basically text, but it can contain
> 8-bit bytes.
All bytes are 8-bit, at least on modern hardware. I think you have to go
back to the 1950s to find 10-bit or 12-bit machines.
> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
Well you can't do that, because *by definition* you are changing a
CHARACTER into ONE OR MORE BYTES. So the question you have to ask is,
*how* do you want to change them?
You can use an error handler to convert any untranslatable characters
into question marks, or to ignore them altogether:
data = text.encode('ascii', 'replace')
data = text.encode('ascii', 'ignore')
When going the other way, from bytes to strings, it can sometimes be
useful to use the Latin-1 encoding, which essentially cannot fail:
text = data.decode('latin1')
although the non-ASCII chars that you get may not be sensible or
meaningful in any way. But if there are only a few of them, and you don't
care too much, this may be a simple approach.
But in a nutshell, it is impossible to map the more than a million
Unicode code points to just 256 possible byte values without either throwing
some characters away, or performing an encoding.
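A short sketch of the two error handlers and of the Latin-1 property mentioned above; the sample string and variable names are invented for illustration:

```python
text = "na\u00efve \u2603"                       # contains two non-ASCII characters

with_marks = text.encode("ascii", "replace")     # unencodable chars become b'?'
dropped = text.encode("ascii", "ignore")         # unencodable chars vanish
print(with_marks, dropped)

# Latin-1 maps every byte value 0..255 to the code point with the same number,
# so decoding cannot fail and re-encoding restores the input exactly.
every_byte = bytes(range(256))
round_trip = every_byte.decode("latin-1").encode("latin-1")
print(round_trip == every_byte)
```

Both handlers lose information in the encoding direction; only the Latin-1 decode and encode pair is reversible.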
> As it seems, this would be far easier with python 2.x.
It only seems that way until you try.
--
Steven
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2012-03-28 19:40 +0000 |
| Message-ID | <jkvpg8$9nk$1@reader1.panix.com> |
| In reply to | #22296 |
On 2012-03-28, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Wed, 28 Mar 2012 19:43:36 +0200, Peter Daum wrote:
>
>> The longer story of my question is: I am new to python (obviously), and
>> since I am not familiar with either one, I thought it would be advisable
>> to go for python 3.x. The biggest problem that I am facing is, that I am
>> often dealing with data, that is basically text, but it can contain
>> 8-bit bytes.
>
> All bytes are 8-bit, at least on modern hardware. I think you have to
> go back to the 1950s to find 10-bit or 12-bit machines.
Well, on anything likely to run Python that's true. There are modern
DSP-oriented CPUs where a byte is 16 or 32 bits (and so is an int and
a long, and a float and a double).
>> As it seems, this would be far easier with python 2.x.
>
> It only seems that way until you try.
It's easy as long as you deal with nothing but ASCII and Latin-1. ;)
--
Grant Edwards (grant.b.edwards at gmail.com)
Yow! Somewhere in Tenafly, New Jersey, a chiropractor is viewing
"Leave it to Beaver"!
| From | Ethan Furman <ethan@stoneleaf.us> |
|---|---|
| Date | 2012-03-28 11:17 -0700 |
| Message-ID | <mailman.1090.1332959710.3037.python-list@python.org> |
| In reply to | #22287 |
Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> On 28.03.2012 11:43, Peter Daum wrote:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
>
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
Where is the data coming from? Files? In that case, it sounds like
you will want to decode/encode using 'latin-1', as the bulk of your text
is plain ascii and you don't really care about the upper-ascii chars.
~Ethan~
| From | John Nagle <nagle@animats.com> |
|---|---|
| Date | 2012-03-28 12:30 -0700 |
| Message-ID | <jkvot5$m3o$1@dont-email.me> |
| In reply to | #22287 |
On 3/28/2012 10:43 AM, Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> On 28.03.2012 11:43, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.
So why let the data get into a "str" type at all? Do everything
end to end with "bytes" or "bytearray" types.
John Nagle
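As a rough illustration of the bytes-only approach suggested above, the sketch below edits the ASCII part of some data without ever creating a `str`. The sample data and the helper name `normalize_headers` are made up for the example.

```python
import re

# ASCII text with some bytes of unknown encoding mixed in.
data = b"Subject: caf\xe9 menu\nX-Junk: \xff\xfe\nsubject: plain\n"


def normalize_headers(raw: bytes) -> bytes:
    """Normalize the spelling of 'subject:' at line starts; touch nothing else."""
    # Pattern and replacement are both bytes, so the result is bytes too.
    return re.sub(rb"(?im)^subject:", b"Subject:", raw)


result = normalize_headers(data)
print(result)

# Mixing str and bytes fails loudly instead of guessing an encoding.
try:
    data.startswith("Subject")
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

The `TypeError` in the last step is the price of this approach: every literal and every pattern has to carry the `b` prefix.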
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-03-28 17:37 -0400 |
| Message-ID | <mailman.1098.1332970699.3037.python-list@python.org> |
| In reply to | #22287 |
On 3/28/2012 1:43 PM, Peter Daum wrote:
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x.

I strongly agree with that unless you have reason to use 2.7. Python 3.3
(.0a1 is nearly out) has an improved unicode implementation, among other
things.

> The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You are assuming, or must assume, that the text is in an ascii-compatible
encoding, meaning that bytes 0-127 really represent ascii chars.
Otherwise, you cannot reliably interpret anything, let alone change it.

This problem of knowing that much but not the specific encoding is
unfortunately common. It has been discussed among core developers and
others the last few months. Different people prefer one of the following
approaches.

1. Keep the bytes as bytes and use bytes literals and bytes functions as
needed. The danger, as you noticed, is forgetting the 'b' prefix.

2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
chars. When done, encode back to 'latin-1' and the non-ascii chars will
be as they originally were. The danger is forgetting the pretense, and
perhaps passing the string on (as a string, not bytes) to other modules
that will not know the pretense.

3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
(using the surrogate-pair second-half code units). This is probably the
safest in that invalid operations on the non-chars should raise an
exception. Re-encoding with the same setting will reproduce the original
hi-bit chars. The main danger is passing the illegal strings out of your
local sandbox.

--
Terry Jan Reedy
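The three approaches listed above can be sketched side by side on the same input. The sample bytes are invented for the example; note that the error handler Python actually accepts is spelled 'surrogateescape', without an underscore.

```python
# ASCII text plus some high bytes whose encoding we do not know.
raw = b"key=\xe4\xf6\xfc value\x80\n"

# 1. Stay in bytes: every literal needs the b prefix.
out1 = raw.replace(b"key", b"KEY")

# 2. Pretend the data is latin-1: always decodable, because byte N maps
#    to code point N and back.
s2 = raw.decode("latin-1")
out2 = s2.replace("key", "KEY").encode("latin-1")

# 3. Decode as ASCII with surrogateescape: each undecodable byte
#    0x80..0xFF becomes a lone surrogate U+DC80..U+DCFF, and encoding
#    with the same handler turns it back into the original byte.
s3 = raw.decode("ascii", errors="surrogateescape")
out3 = s3.replace("key", "KEY").encode("ascii", errors="surrogateescape")

print(out1 == out2 == out3)
```

All three variants should produce the same bytes; they differ only in what the intermediate object looks like and in how loudly a mistake shows up.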
| From | Peter Daum <gator@cs.tu-berlin.de> |
|---|---|
| Date | 2012-03-29 16:57 +0200 |
| Message-ID | <4F74784F.40804@cs.tu-berlin.de> |
| In reply to | #22315 |
On 2012-03-28 23:37, Terry Reedy wrote:
> 2. Decode as if the text were latin-1 and ignore the non-ascii 'latin-1'
> chars. When done, encode back to 'latin-1' and the non-ascii chars will
> be as they originally were.
... actually, at the beginning of my quest, I ran into a decoding
exception trying to read data as "latin1" (which was more or less what
I had expected anyway because byte values between 128 and 160 are not
defined there).
Obviously, I must have misinterpreted something there;
I just ran a little test:
l=[i for i in range(256)]; b=bytes(l)
s=b.decode('latin1'); b=s.encode('latin1'); s=b.decode('latin1')
for c in s:
    print(hex(ord(c)), end=' ')
    if (ord(c)+1) % 16 == 0: print("")
print()
... and got all the original bytes back. So it looks like I tried to
solve a problem that did not exist to start with (the problems I ran
into then were pretty real, though ;-)
> 3. Decode using encoding = 'ascii', errors='surrogate_escape'. This
> reversibly encodes the unknown non-ascii chars as 'illegal' non-chars
> (using the surrogate-pair second-half code units). This is probably the
> safest in that invalid operations on the non-chars should raise an
> exception. Re-encoding with the same setting will reproduce the original
> hi-bit chars. The main danger is passing the illegal strings out of your
> local sandbox.
Unfortunately, this is a very well-kept secret unless you know that
something with that name exists. The options currently mentioned in the
documentation are not really helpful, because the non-decodable bytes
will be lost. With some trying, I got it to work, too (the option is named
"surrogateescape" without the "_" and in python 3.1 it exists, but
not as a keyword argument: "s=b.decode('utf-8','surrogateescape')" ...)
Thank you very much for your constructive advice!
Regards,
Peter
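A small sketch of what the "surrogateescape" handler mentioned above does to undecodable bytes, and why such strings should not leave the code that created them. The sample bytes are invented; the handler is passed positionally, as in the post.

```python
raw = b"abc\x80\xff"

# Bytes that are not valid UTF-8 become lone surrogates:
# 0x80 -> U+DC80, 0xff -> U+DCFF.
s = raw.decode("utf-8", "surrogateescape")
code_points = [hex(ord(c)) for c in s]
print(code_points)

# A strict encode refuses the lone surrogates ...
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)

# ... while encoding with the same handler restores the original bytes.
round_trip = s.encode("utf-8", "surrogateescape")
print(round_trip == raw)
```

The failing strict encode is the safety net described in the quoted advice: code that does not know about the escaped bytes gets an exception rather than silently corrupted data.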
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2012-03-30 22:06 +0300 |
| Message-ID | <mailman.1154.1333134438.3037.python-list@python.org> |
| In reply to | #22287 |
28.03.12 21:13, Heiko Wundram wrote:
> Reading from stdin/a file gets you bytes, and
> not a string, because Python cannot automagically guess what format the
> input is in.

In Python 3, reading from stdin gets you a string. Use sys.stdin.buffer.raw
for access to the byte stream. And reading from a file opened in text mode
gets you a string too.
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-03-31 06:10 +1100 |
| Message-ID | <mailman.1155.1333134606.3037.python-list@python.org> |
| In reply to | #22287 |
On Sat, Mar 31, 2012 at 6:06 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
> 28.03.12 21:13, Heiko Wundram wrote:
>
>> Reading from stdin/a file gets you bytes, and
>> not a string, because Python cannot automagically guess what format the
>> input is in.
>
>
> In Python3 reading from stdin gets you string. Use sys.stdin.buffer.raw for
> access to byte stream. And reading from file opened in text mode gets you
> string too.

True. But that's only if it's been told the encoding of stdin (which I
believe is the normal case on Linux). It's still not "automagically
guess(ing)", it's explicitly told.

ChrisA