Groups > comp.lang.python > #5163 > unrolled thread
| Started by | harrismh777 <harrismh777@charter.net> |
|---|---|
| First post | 2011-05-11 16:37 -0500 |
| Last post | 2011-05-11 15:34 -0700 |
| Articles | 12 on this page of 32 — 12 participants |
unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 16:37 -0500
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-11 16:09 -0600
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 17:51 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 09:32 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 20:22 -0500
Re: unicode by default MRAB <python@mrabarnett.plus.com> - 2011-05-12 03:31 +0100
Re: unicode by default Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-12 03:16 +0000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 22:44 -0500
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 00:12 -0400
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:43 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:14 +1000
Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 21:14 -0700
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:41 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:14 -0500
Re: unicode by default TheSaint <nobody@nowhere.net.no> - 2011-05-12 20:40 +0800
Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-12 14:07 +1000
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:31 -0500
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 17:58 +1000
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 10:17 -0600
Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-12 23:28 -0700
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-13 14:53 -0500
Re: unicode by default Robert Kern <robert.kern@gmail.com> - 2011-05-13 15:18 -0500
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-13 21:41 -0400
Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-14 02:41 -0500
Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-14 03:26 -0700
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-14 16:26 -0400
Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-15 09:47 +1000
Re: unicode by default Nobody <nobody@nowhere.com> - 2011-05-14 09:34 +0100
Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 16:42 -0400
Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 16:25 -0600
Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 13:54 +1000
Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 15:34 -0700
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-13 14:53 -0500 |
| Message-ID | <j%fzp.1806$7N5.1387@newsfe04.iad> |
| In reply to | #5281 |
jmfauth wrote:
>> to worry about encodings are when you're encoding unicode characters
>> to byte strings, or decoding bytes to unicode characters
>
> A small but important correction/clarification:
>
> In Unicode, "unicode" does not encode a *character*. It
> encodes a *code point*, a number, the integer associated
> to the character.
>
That is a huge code-point... pun intended.
... and there is another point that I continue to be somewhat puzzled
about, and that is the issue of fonts.
One of my hobbies at the moment is ancient Greek (biblical studies,
Septuaginta LXX, and Greek New Testament). I have these texts on my
computer in a folder in several formats... pdf, unicode 'plaintext',
osis.xml, and XML.
These texts may be found at http://sblgnt.com
I am interested for the moment only in the 'plaintext' stream,
because it is unicode. (First, in unicode, according to all the docs,
there is no such thing as 'plaintext,' so keep that in mind.)
When I open the text stream in one of my unicode editors I can see
'most' of the characters in a rudimentary Greek font with accents;
however, I also see many tiny square blocks indicating (I think) that
the code points do *not* have a corresponding character in my unicode
font for that Greek symbol (whatever it is supposed to be).
The point, or question is, how does one go about making sure that
there is a corresponding font glyph to match a specific unicode code
point for display in a particular terminal (editor, browser, whatever) ?
The unicode consortium is very careful to make sure that thousands
of symbols have a unique code point (that's great !) but how do these
thousands of symbols actually get displayed if there is no font
consortium? Are there collections of 'standard' fonts for unicode that
I am not aware? Is there a unix linux package that can be installed
that drops at least 'one' default standard font that will be able to
render all or 'most' (whatever I mean by that) code points in unicode?
Is this a Python issue at all?
kind regards,
m harris
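The tiny square blocks described above can be narrowed down from Python, even though font coverage itself is not a Python matter. A minimal sketch, assuming Python 3 and a hard-coded sample phrase (not taken from the SBL text): it lists each distinct non-ASCII code point with its official Unicode name, so the ones a font fails to render can be identified by name. The helper `describe_code_points` is a name made up for this illustration.

```python
import unicodedata


def describe_code_points(text):
    """Return (code point, name) pairs for each distinct non-ASCII character."""
    seen = []
    for ch in text:
        if ord(ch) > 0x7F and ch not in seen:
            seen.append(ch)
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in seen]


# Illustrative sample only: two Greek words written with escapes.
sample = "\u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7"

for code, name in describe_code_points(sample):
    print(code, name)
```

Code points in the U+1F00 to U+1FFF range (Greek Extended) are the usual candidates for missing glyphs in fonts that only cover modern Greek.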
| From | Robert Kern <robert.kern@gmail.com> |
|---|---|
| Date | 2011-05-13 15:18 -0500 |
| Message-ID | <mailman.1525.1305317927.9059.python-list@python.org> |
| In reply to | #5322 |
On 5/13/11 2:53 PM, harrismh777 wrote:
> The unicode consortium is very careful to make sure that thousands of symbols
> have a unique code point (that's great !) but how do these thousands of symbols
> actually get displayed if there is no font consortium? Are there collections of
> 'standard' fonts for unicode that I am not aware?
There are some well-known fonts that try to cover a large section of the
Unicode standard.
http://en.wikipedia.org/wiki/Unicode_typeface
> Is there a unix linux package
> that can be installed that drops at least 'one' default standard font that will
> be able to render all or 'most' (whatever I mean by that) code points in
> unicode? Is this a Python issue at all?
Not really.
--
Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-13 21:41 -0400 |
| Message-ID | <mailman.1530.1305337302.9059.python-list@python.org> |
| In reply to | #5322 |
On 5/13/2011 3:53 PM, harrismh777 wrote:
> The unicode consortium is very careful to make sure that thousands of
> symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium? Are there collections of 'standard' fonts for unicode that I
> am not aware? Is there a unix linux package that can be installed that
> drops at least 'one' default standard font that will be able to render
> all or 'most' (whatever I mean by that) code points in unicode? Is this
> a Python issue at all?
Easy, practical use of unicode is still a work in progress.
--
Terry Jan Reedy
| From | harrismh777 <harrismh777@charter.net> |
|---|---|
| Date | 2011-05-14 02:41 -0500 |
| Message-ID | <qmqzp.1592$iv4.747@newsfe09.iad> |
| In reply to | #5329 |
Terry Reedy wrote:
>> Is there a unix linux package that can be installed that
>> drops at least 'one' default standard font that will be able to render
>> all or 'most' (whatever I mean by that) code points in unicode? Is this
>> a Python issue at all?
>
> Easy, practical use of unicode is still a work in progress.
Apparently... the good news for me is that SBL provides their unicode
font here:
http://www.sbl-site.org/educational/biblicalfonts.aspx
I'm getting much closer here, but now the problem is typing. The pain
with unicode fonts is that the glyph is tied to the code point for the
represented character, and not tied to any code point that matches any
keyboard scan code for typing. :-}
So, I can now see the ancient text with accents and apparatus in all of
my editors, but I still cannot type any ancient Greek with my
keyboard... because I have to make up a keymap first. <sigh>
I don't find that SBL (nor Logos Software) has provided keymaps as
yet... rats.
I can read the text with Python though... yessss.
m harris
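Until a keymap or input method is in place, accented Greek characters can at least be produced from inside Python. A small sketch, assuming Python 3: `unicodedata.lookup` finds a character by its official name, and NFC normalization composes a base letter plus a combining mark into the precomposed form. This is a workaround for short strings, not a substitute for a real input method.

```python
import unicodedata

# Look a character up by its official Unicode name (no Greek keyboard needed).
alpha_psili = unicodedata.lookup("GREEK SMALL LETTER ALPHA WITH PSILI")

# Or build it from a base letter plus a combining mark, then compose with NFC.
# U+03B1 is GREEK SMALL LETTER ALPHA, U+0313 is COMBINING COMMA ABOVE.
composed = unicodedata.normalize("NFC", "\u03b1\u0313")

print(f"U+{ord(alpha_psili):04X}", alpha_psili == composed)
```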
| From | jmfauth <wxjmfauth@gmail.com> |
|---|---|
| Date | 2011-05-14 03:26 -0700 |
| Message-ID | <f275bbd6-e71f-437e-941f-d3cf875f5636@x6g2000yqj.googlegroups.com> |
| In reply to | #5345 |
On 14 May, 09:41, harrismh777 <harrismh...@charter.net> wrote:
> ...
> I'm getting much closer here,
> ...
You should really understand that Unicode is a domain per se. It is
independent of any OS, programming language or application. It is up
to these tools to be "unicode" compliant.
Working in a full unicode mode (at least for texts) is today
practically a solved problem. But you have to ensure the whole
toolchain is unicode compliant (editors, fonts (OpenType technology),
rendering devices, ...).
Tip. This list is certainly not the best place to gather information.
I suggest you start by reading about XeTeX. XeTeX is the "new" TeX
engine, working only in a unicode mode. From this starting point, you
will come across plenty of web sites speaking about the "unicode
world", tools, fonts, ...
A variant is to visit sites speaking about *typography*.
jmf
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-14 16:26 -0400 |
| Message-ID | <mailman.1561.1305404825.9059.python-list@python.org> |
| In reply to | #5345 |
On 5/14/2011 3:41 AM, harrismh777 wrote:
> Terry Reedy wrote:
>> Easy, practical use of unicode is still a work in progress.
>
> Apparently... the good news for me is that SBL provides their unicode
> font here:
>
> http://www.sbl-site.org/educational/biblicalfonts.aspx
>
> I'm getting much closer here, but now the problem is typing. The pain
> with unicode fonts is that the glyph is tied to the code point for the
> represented character, and not tied to any code point that matches any
> keyboard scan code for typing. :-}
>
> So, I can now see the ancient text with accents and apparatus in all of
> my editors, but I still cannot type any ancient Greek with my
> keyboard... because I have to make up a keymap first. <sigh>
>
> I don't find that SBL (nor Logos Software) has provided keymaps as
> yet... rats.
You need what is called, at least with Windows, an IME -- Input Method
Editor. These are part of (or associated with) the OS, so they can be
used with *any* application that will accept unicode chars (in whatever
encoding) rather than just ascii chars. Windows has about a hundred or
so, including Greek. I do not know if that includes classical Greek with
the extra marks.
> I can read the text with Python though... yessss.
--
Terry Jan Reedy
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2011-05-15 09:47 +1000 |
| Message-ID | <87r580hmkm.fsf@benfinney.id.au> |
| In reply to | #5384 |
Terry Reedy <tjreedy@udel.edu> writes:
> You need what is called, at least with Windows, an IME -- Input Method
> Editor.
For a GNOME or KDE environment you want an input method framework; I
recommend IBus <URL:http://code.google.com/p/ibus/> which comes with the
major GNU+Linux operating systems <URL:http://oswatershed.org/pkg/ibus>
<URL:http://packages.debian.org/squeeze/ibus>.
Then you have a wide range of input methods available. Many of them are
specific to local writing systems. For writing special characters in
English text, I use either ‘rfc1345’ or ‘latex’ within IBus.
That allows special characters to be typed into any program which
communicates with the desktop environment's input routines. Yay, unified
input of special characters!
Except Emacs :-( which fortunately has ‘ibus-el’ available to work with
IBus <URL:http://www.emacswiki.org/emacs/IBusMode> :-).
--
 \      己所不欲、勿施于人。                                     |
  `\    (What is undesirable to you, do not do to others.)      |
_o__)   -- 孔夫子 Confucius, 551 BCE – 479 BCE                  |
Ben Finney
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2011-05-14 09:34 +0100 |
| Message-ID | <pan.2011.05.14.08.34.20.922000@nowhere.com> |
| In reply to | #5322 |
On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote:
> The unicode consortium is very careful to make sure that thousands
> of symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium? Are there collections of 'standard' fonts for unicode that I
> am not aware? Is there a unix linux package that can be installed that
> drops at least 'one' default standard font that will be able to render all
> or 'most' (whatever I mean by that) code points in unicode?
Using the original meaning of "font" (US) or "fount" (commonwealth), you
can't have a single font cover the whole of Unicode. A font isn't a random
set of glyphs, but a set of glyphs in a common style, which can only
practically be achieved for a specific alphabet.
You can bundle multiple fonts covering multiple repertoires into a single
TTF (etc) file, but there's not much point.
In software, the term "font" is commonly used to refer to some ad-hoc
mapping between codepoints and glyphs. This typically works by either
associating each specific font with a specific repertoire (set of
codepoints), or by simply trying each font in order until one is found
with the correct glyph.
This is a sufficiently common problem that the FontConfig library exists
to simplify a large part of it.
> Is this a Python issue at all?
No.
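The "try each font in order until one is found with the correct glyph" strategy described above can be sketched in a few lines. This is a toy model only: the font names and coverage ranges below are hypothetical, and real systems (FontConfig, for instance) read coverage from the font files themselves rather than from a hand-written table.

```python
# Hypothetical fonts, each with the code point ranges it claims to cover.
FONTS = [
    ("HypotheticalLatin", [(0x0020, 0x024F)]),
    ("HypotheticalGreek", [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]),
]


def pick_font(ch, fonts=FONTS):
    """Return the name of the first font covering ch, or None if none does."""
    code_point = ord(ch)
    for name, ranges in fonts:
        if any(low <= code_point <= high for low, high in ranges):
            return name
    return None  # a renderer would draw a placeholder box here


for ch in ("a", "\u1f00", "\u4e2d"):
    print(f"U+{ord(ch):04X}", pick_font(ch))
```

A `None` result corresponds to the small boxes seen in an editor: the code point is valid, but no installed font supplies a glyph for it.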
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2011-05-12 16:42 -0400 |
| Message-ID | <mailman.1497.1305232982.9059.python-list@python.org> |
| In reply to | #5209 |
On 5/12/2011 12:17 PM, Ian Kelly wrote:
> On Thu, May 12, 2011 at 1:58 AM, John Machin<sjmachin@lexicon.net> wrote:
>> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>>
>>>
>>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>>
>> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
>> encodings for the EXTERNAL representation of Unicode characters in byte
>> streams.
>
> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
> same thing as UTF-16, by the way) to represent Unicode strings.
I know some people say that, but according to the definitions of the
unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent
chars in the Supplementary Planes. The later (1996) UTF-16, which Python
uses, can. The standard considers 'UCS-2' obsolete long ago. See
https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14
The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided."
It goes on: "Sometimes in the past an implementation has been labeled
"UCS-2" to indicate that it does not support supplementary characters
and doesn't interpret pairs of surrogate code points as characters. Such
an implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters."
I know that 16-bit Python *does* use surrogate pairs for supplementary
chars and at least some properties work for them. I am not sure exactly
what the rest means.
> However, this is entirely transparent. To the Python programmer, a
> unicode string is just an abstraction of a sequence of code-points.
> You don't need to think about UCS-2 at all. The only times you need
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters, or opening a
> stream in text mode; and in those cases the only encoding that matters
> is the external one.
If one uses unicode chars in the Supplementary Planes above the BMP (the
first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16),
then the abstraction leaks.
--
Terry Jan Reedy
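The surrogate-pair mechanism behind that leak can be worked through by hand. A minimal sketch, assuming Python 3: the helper `to_surrogate_pair` is a name made up for this illustration, and it applies the standard UTF-16 arithmetic (subtract 0x10000, split the remaining 20 bits into two 10-bit halves) to one character outside the BMP, then checks the result against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


clef = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF
high, low = to_surrogate_pair(ord(clef))
encoded = clef.encode("utf-16-be")           # big-endian, no byte order mark

print(f"{high:04X} {low:04X}", encoded.hex())
```

On a 16-bit build of the kind discussed in this thread, such a character occupied two positions in the string, which is where length and indexing stop matching the "sequence of code points" abstraction.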
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2011-05-12 16:25 -0600 |
| Message-ID | <mailman.1500.1305239156.9059.python-list@python.org> |
| In reply to | #5209 |
On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right. *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard considers 'UCS-2' obsolete long ago. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14
At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."
PEP 100 says:
The internal format for Unicode objects should use a Python
specific fixed format <PythonUnicode> implemented as 'unsigned
short' (or another unsigned numeric type having 16 bits). Byte
order is platform dependent.
This format will hold UTF-16 encodings of the corresponding
Unicode ordinals. The Python Unicode implementation will address
these values as if they were UCS-2 values. UCS-2 and UTF-16 are
the same for all currently defined Unicode character points.
UTF-16 without surrogates provides access to about 64k characters
and covers all characters in the Basic Multilingual Plane (BMP) of
Unicode.
It is the Codec's responsibility to ensure that the data they pass
to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use
of surrogates.
I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.
Although I'm not sure how much this might have changed since the
original implementation, especially for Python 3.
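For readers on a current interpreter, the question of what changed can be checked directly. A short sketch, assuming CPython 3.3 or later, where the narrow/wide build distinction discussed in this thread no longer exists (PEP 393); the 2.x and 3.2 builds the posters were using in 2011 could behave differently.

```python
import sys

s = "a\U0001D11Eb"   # one character outside the BMP between two ASCII letters

print(hex(sys.maxunicode))                 # highest code point a str can hold
print(len(s))                              # counted in code points
print(s[1] == "\U0001D11E")                # indexing returns the whole character
print(len(s.encode("utf-16-le")) // 2)     # 16-bit code units in the external form
```

The last line shows that the surrogate pair still exists, but only in the encoded bytes, not in the string object.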
| From | "John Machin" <sjmachin@lexicon.net> |
|---|---|
| Date | 2011-05-12 13:54 +1000 |
| Message-ID | <mailman.1441.1305172465.9059.python-list@python.org> |
| In reply to | #5181 |
On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters)
> and write them to a file (stdout), how does the
> default encoding affect that file, by default? I'm not seeing that
> there is anything unusual going on...
About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.
The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
linux/OSX platform. On a typical American / Western European /[former]
colonies Windows box, this is likely to be cp850 on a Command Prompt
window, and cp1252 in IDLE.
UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
if the terminal can't render the character -- you'll get spaces or blobs
or boxes with hex digits in them or nothing.
Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.
Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.
If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.
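The terminal and file cases above can be tried out in a few lines. A sketch assuming Python 3, where the encoding for a file is best given explicitly rather than left to a platform default; the temporary file name is arbitrary. It encodes the Cyrillic character mentioned earlier with UTF-8, shows that cp850 rejects it, and writes it to a file with a stated encoding.

```python
import os
import sys
import tempfile

text = "\u0404"   # CYRILLIC CAPITAL LETTER UKRAINIAN IE

# What the interpreter believes the terminal accepts (platform dependent).
print(sys.stdout.encoding)

utf8_bytes = text.encode("utf-8")            # always possible

try:
    text.encode("cp850")                     # cp850 has no Cyrillic letters
    cp850_ok = True
except UnicodeEncodeError:
    cp850_ok = False

fallback = text.encode("cp850", errors="replace")   # substitute instead of raising

# Writing to a file: state the encoding instead of relying on a default.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "rb") as f:
        on_disk = f.read()

print(utf8_bytes, cp850_ok, fallback, on_disk)
```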
> If I open the file with vi? If
> I open the file with gedit? emacs?
Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code
> points,
yes
> and if so what is the best way to 'guess' the encoding?
google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)
> ... is it coded in the stream somewhere...protocol?
Should be.
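"Rummage through the mail headers" can be done with the standard library. A minimal sketch, assuming Python 3 and a made-up message (the address and body are illustrative): the declared charset is read from the Content-Type header and used to decode the body. When no charset is declared, a guesser such as the third-party chardet package mentioned above is the fallback; it is not used here.

```python
from email import message_from_bytes

# A made-up message whose body is Latin-1 encoded, as declared in the header.
raw = (
    b"From: someone@example.invalid\r\n"
    b"Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"caf\xe9\r\n"
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset()          # None if the header gives no charset
body = msg.get_payload(decode=True).decode(charset or "ascii", errors="replace")

print(charset, body.strip())
```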
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2011-05-11 15:34 -0700 |
| Message-ID | <mailman.1434.1305153267.9059.python-list@python.org> |
| In reply to | #5163 |
On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
> I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
> I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
> On my installation 2.6 sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?). Do I understand this much correctly?
>
Not really sure about that, but it doesn't matter anyway, because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see this string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have to
use an encoding.
> The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4. If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
> If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?
ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will have
a length of 1. Encode that to UTF-8 and you'll find it has a length of 2
(because UTF-8 uses two to four bytes to encode every code point from
128 upward; the top bit of a byte signals that it is part of a
multi-byte sequence).
If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html
And the biggest reason to use Unicode is so that you don't have to worry
about your program messing up because someone hands you input in a
different encoding than you used.
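The difference between a string's length in characters and its length in encoded bytes is easy to see for yourself. A small sketch, assuming Python 3.3 or later (where `len` on a `str` counts code points); the four sample characters are chosen only to show each UTF-8 width.

```python
samples = [
    "A",            # U+0041, ASCII
    "\u00a0",       # U+00A0, NO-BREAK SPACE
    "\u20ac",       # U+20AC, EURO SIGN
    "\U0001F600",   # U+1F600, outside the BMP
]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Columns: code point, length as a string, length as UTF-8 bytes.
    print(f"U+{ord(ch):04X}", len(ch), len(encoded))

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Every sample is a single character as a string; only the byte count changes with the code point.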