unicode by default

Started by	harrismh777 <harrismh777@charter.net>
First post	2011-05-11 16:37 -0500
Last post	2011-05-11 15:34 -0700
Articles	12 on this page of 32 — 12 participants

  unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 16:37 -0500
    Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-11 16:09 -0600
      Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 17:51 -0500
        Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 09:32 +1000
          Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 20:22 -0500
            Re: unicode by default MRAB <python@mrabarnett.plus.com> - 2011-05-12 03:31 +0100
              Re: unicode by default Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-05-12 03:16 +0000
                Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-11 22:44 -0500
                  Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 00:12 -0400
                    Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:43 -0500
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:14 +1000
                  Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 21:14 -0700
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 14:41 +1000
                    Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:14 -0500
                    Re: unicode by default TheSaint <nobody@nowhere.net.no> - 2011-05-12 20:40 +0800
              Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-12 14:07 +1000
                Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-12 01:31 -0500
                  Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 17:58 +1000
                  Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 10:17 -0600
                    Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-12 23:28 -0700
                      Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-13 14:53 -0500
                        Re: unicode by default Robert Kern <robert.kern@gmail.com> - 2011-05-13 15:18 -0500
                        Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-13 21:41 -0400
                          Re: unicode by default harrismh777 <harrismh777@charter.net> - 2011-05-14 02:41 -0500
                            Re: unicode by default jmfauth <wxjmfauth@gmail.com> - 2011-05-14 03:26 -0700
                            Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-14 16:26 -0400
                              Re: unicode by default Ben Finney <ben+python@benfinney.id.au> - 2011-05-15 09:47 +1000
                        Re: unicode by default Nobody <nobody@nowhere.com> - 2011-05-14 09:34 +0100
                  Re: unicode by default Terry Reedy <tjreedy@udel.edu> - 2011-05-12 16:42 -0400
                  Re: unicode by default Ian Kelly <ian.g.kelly@gmail.com> - 2011-05-12 16:25 -0600
            Re: unicode by default "John Machin" <sjmachin@lexicon.net> - 2011-05-12 13:54 +1000
    Re: unicode by default Benjamin Kaplan <benjamin.kaplan@case.edu> - 2011-05-11 15:34 -0700

#5322

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-13 14:53 -0500
Message-ID	<j%fzp.1806$7N5.1387@newsfe04.iad>
In reply to	#5281

jmfauth wrote:
>> to worry about encodings are when you're encoding unicode characters
>> to byte strings, or decoding bytes to unicode characters
>
> A small but important correction/clarification:
>
> In Unicode, "unicode" does not encode a *character*. It
> encodes a *code point*, a number, the integer associated
> to the character.
>

That is a huge code-point... pun intended.

... and there is another point that I continue to be somewhat puzzled 
about, and that is the issue of fonts.

    One of my hobbies at the moment is ancient Greek (biblical studies, 
Septuaginta LXX, and Greek New Testament).  I have these texts on my 
computer in a folder in several formats... pdf, unicode 'plaintext', 
osis.xml, and XML.

    These texts may be found at http://sblgnt.com

    I am interested for the moment only in the 'plaintext' stream, 
because it is unicode.  ( first, in unicode, according to all the doc 
there is no such thing as 'plaintext,' so keep that in mind).

    When I open the text stream in one of my unicode editors I can see 
'most' of the characters in a rudimentary Greek font with accents; 
however, I also see many tiny square blocks indicating (I think) that 
the code points do *not* have a corresponding character in my unicode 
font for that Greek symbol (whatever it is supposed to be).

    The point, or question is, how does one go about making sure that 
there is a corresponding font glyph to match a specific unicode code 
point for display in a particular terminal (editor, browser, whatever) ?
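
Before hunting for a font, it can help to know exactly which code points the text contains. The following is a minimal sketch using only the standard-library `unicodedata` module; the function name `inventory` and the sample string are made up for illustration, and the code says nothing about which glyphs any installed font actually provides.

```python
import unicodedata
from collections import Counter

def inventory(text):
    """Count each non-ASCII code point, keyed by (hex code point, Unicode name)."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    return {
        ("U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>")): n
        for ch, n in counts.items()
    }

# A short polytonic Greek sample, written with escapes so the source stays ASCII.
sample = "\u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7"

for (code, name), n in sorted(inventory(sample).items()):
    print(code, name, n)
```

Code points in the range U+1F00 to U+1FFF belong to the Greek Extended block, which many general-purpose fonts do not cover; those are plausible candidates for the square boxes.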

    The unicode consortium is very careful to make sure that thousands 
of symbols have a unique code point (that's great !) but how do these 
thousands of symbols actually get displayed if there is no font 
consortium?   Are there collections of 'standard' fonts for unicode that 
I am not aware of?  Is there a unix linux package that can be installed 
that drops at least 'one' default standard font that will be able to 
render all or 'most' (whatever I mean by that) code points in unicode? 
  Is this a Python issue at all?


kind regards,
m harris

#5323

From	Robert Kern <robert.kern@gmail.com>
Date	2011-05-13 15:18 -0500
Message-ID	<mailman.1525.1305317927.9059.python-list@python.org>
In reply to	#5322

On 5/13/11 2:53 PM, harrismh777 wrote:

> The unicode consortium is very careful to make sure that thousands of symbols
> have a unique code point (that's great !) but how do these thousands of symbols
> actually get displayed if there is no font consortium? Are there collections of
> 'standard' fonts for unicode that I am not aware of?

There are some well-known fonts that try to cover a large section of the Unicode 
standard.

   http://en.wikipedia.org/wiki/Unicode_typeface

> Is there a unix linux package
> that can be installed that drops at least 'one' default standard font that will
> be able to render all or 'most' (whatever I mean by that) code points in
> unicode? Is this a Python issue at all?

Not really.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
  that is made terrible by our own mad attempt to interpret it as though it had
  an underlying truth."
   -- Umberto Eco

#5329

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-13 21:41 -0400
Message-ID	<mailman.1530.1305337302.9059.python-list@python.org>
In reply to	#5322

On 5/13/2011 3:53 PM, harrismh777 wrote:

> The unicode consortium is very careful to make sure that thousands of
> symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium? Are there collections of 'standard' fonts for unicode that I
> am not aware of? Is there a unix linux package that can be installed that
> drops at least 'one' default standard font that will be able to render
> all or 'most' (whatever I mean by that) code points in unicode? Is this
> a Python issue at all?

Easy, practical use of unicode is still a work in progress.

-- 
Terry Jan Reedy

#5345

From	harrismh777 <harrismh777@charter.net>
Date	2011-05-14 02:41 -0500
Message-ID	<qmqzp.1592$iv4.747@newsfe09.iad>
In reply to	#5329

Terry Reedy wrote:
>> Is there a unix linux package that can be installed that
>> drops at least 'one' default standard font that will be able to render
>> all or 'most' (whatever I mean by that) code points in unicode? Is this
>> a Python issue at all?
>
> Easy, practical use of unicode is still a work in progress.

Apparently...  the good news for me is that SBL provides their unicode 
font here:

      http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain 
with unicode fonts is that the glyph is tied to the code point for the 
represented character, and not tied to any code point that matches any 
keyboard scan code for typing.   :-}

So, I can now see the ancient text with accents and apparatus in all of 
my editors, but I still cannot type any ancient Greek with my 
keyboard... because I have to make up a keymap first. <sigh>
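
Until a keymap or input method is in place, individual characters can at least be produced from within Python, either by name or by composing a base letter with a combining mark. This is a rough sketch with the standard `unicodedata` module; the variable names are illustrative, and it is a stopgap rather than a way to type running text.

```python
import unicodedata

# Look a character up by its official Unicode name.
alpha_psili = unicodedata.lookup("GREEK SMALL LETTER ALPHA WITH PSILI")

# Or build it from a plain alpha (U+03B1) plus COMBINING COMMA ABOVE (U+0313),
# then normalize to the precomposed form (NFC).
composed = unicodedata.normalize("NFC", "\u03b1\u0313")

print(hex(ord(alpha_psili)))      # 0x1f00
print(composed == alpha_psili)    # True
```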

I don't find that SBL (nor Logos Software) has provided keymaps as 
yet...  rats.

I can read the text with Python though... yessss.

m harris

#5361

From	jmfauth <wxjmfauth@gmail.com>
Date	2011-05-14 03:26 -0700
Message-ID	<f275bbd6-e71f-437e-941f-d3cf875f5636@x6g2000yqj.googlegroups.com>
In reply to	#5345

On 14 mai, 09:41, harrismh777 <harrismh...@charter.net> wrote:

> ...
> I'm getting much closer here,
> ...

You should really understand that Unicode is a domain per
se. It is independent of any OS, programming language,
or application. It is up to these tools to be "unicode"
compliant.

Working in a full unicode mode (at least for texts) is
today practically a solved problem. But you have to ensure
the whole toolchain is unicode compliant (editors,
fonts (OpenType technology), rendering devices, ...).

Tip. This list is certainly not the best place to gather
information. I suggest you start by reading
about XeTeX. XeTeX is the "new" TeX engine working only
in a unicode mode. From this starting point, you will
come across plenty of web sites speaking about the "unicode
world", tools, fonts, ...

A variant is to visit sites speaking about *typography*.

jmf

#5384

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-14 16:26 -0400
Message-ID	<mailman.1561.1305404825.9059.python-list@python.org>
In reply to	#5345

On 5/14/2011 3:41 AM, harrismh777 wrote:
> Terry Reedy wrote:

>> Easy, practical use of unicode is still a work in progress.
>
> Apparently... the good news for me is that SBL provides their unicode
> font here:
>
> http://www.sbl-site.org/educational/biblicalfonts.aspx
>
> I'm getting much closer here, but now the problem is typing. The pain
> with unicode fonts is that the glyph is tied to the code point for the
> represented character, and not tied to any code point that matches any
> keyboard scan code for typing. :-}
>
> So, I can now see the ancient text with accents and apparatus in all of
> my editors, but I still cannot type any ancient Greek with my
> keyboard... because I have to make up a keymap first. <sigh>
>
> I don't find that SBL (nor Logos Software) has provided keymaps as
> yet... rats.

You need what is called, at least with Windows, an IME -- Input Method 
Editor. These are part of (or associated with) the OS, so they can be 
used with *any* application that will accept unicode chars (in whatever 
encoding) rather than just ascii chars. Windows has about a hundred or 
so, including Greek. I do not know if that includes classical Greek with 
the extra marks.

> I can read the text with Python though... yessss.

-- 
Terry Jan Reedy

#5393

From	Ben Finney <ben+python@benfinney.id.au>
Date	2011-05-15 09:47 +1000
Message-ID	<87r580hmkm.fsf@benfinney.id.au>
In reply to	#5384

Terry Reedy <tjreedy@udel.edu> writes:

> You need what is called, at least with Windows, an IME -- Input Method
> Editor.

For a GNOME or KDE environment you want an input method framework; I
recommend IBus <URL:http://code.google.com/p/ibus/> which comes with the
major GNU+Linux operating systems <URL:http://oswatershed.org/pkg/ibus>
<URL:http://packages.debian.org/squeeze/ibus> .

Then you have a wide range of input methods available. Many of them are
specific to local writing systems. For writing special characters in
English text, I use either ‘rfc1345’ or ‘latex’ within IBus.

That allows special characters to be typed into any program which
communicates with the desktop environment's input routines. Yay, unified
input of special characters!

Except Emacs :-( which fortunately has ‘ibus-el’ available to work with
IBus <URL:http://www.emacswiki.org/emacs/IBusMode> :-).

-- 
 \                                                 己所不欲、勿施于人。|
  `\                (What is undesirable to you, do not do to others.) |
_o__)                             —孔夫子 Confucius, 551 BCE – 479 BCE |
Ben Finney

#5350

From	Nobody <nobody@nowhere.com>
Date	2011-05-14 09:34 +0100
Message-ID	<pan.2011.05.14.08.34.20.922000@nowhere.com>
In reply to	#5322

On Fri, 13 May 2011 14:53:50 -0500, harrismh777 wrote:

>     The unicode consortium is very careful to make sure that thousands
> of symbols have a unique code point (that's great !) but how do these
> thousands of symbols actually get displayed if there is no font
> consortium?   Are there collections of 'standard' fonts for unicode that I
> am not aware of?  Is there a unix linux package that can be installed that
> drops at least 'one' default standard font that will be able to render all
> or 'most' (whatever I mean by that) code points in unicode?

Using the original meaning of "font" (US) or "fount" (Commonwealth), you
can't have a single font cover the whole of Unicode. A font isn't a random
set of glyphs, but a set of glyphs in a common style, which can only
practically be achieved for a specific alphabet.

You can bundle multiple fonts covering multiple repertoires into a single
TTF (etc) file, but there's not much point.

In software, the term "font" is commonly used to refer to some ad-hoc
mapping between codepoints and glyphs. This typically works by either
associating each specific font with a specific repertoire (set of
codepoints), or by simply trying each font in order until one is found
with the correct glyph.

This is a sufficiently common problem that the FontConfig library exists
to simplify a large part of it.

>   Is this a Python issue at all?

No.

#5265

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-05-12 16:42 -0400
Message-ID	<mailman.1497.1305232982.9059.python-list@python.org>
In reply to	#5209

On 5/12/2011 12:17 PM, Ian Kelly wrote:
> On Thu, May 12, 2011 at 1:58 AM, John Machin<sjmachin@lexicon.net>  wrote:
>> On Thu, May 12, 2011 4:31 pm, harrismh777 wrote:
>>
>>>
>>> So, the UTF-16 UTF-32 is INTERNAL only, for Python
>>
>> NO. See one of my previous messages. UTF-16 and UTF-32, like UTF-8 are
>> encodings for the EXTERNAL representation of Unicode characters in byte
>> streams.
>
> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
> same thing as UTF-16, by the way) to represent Unicode strings.

I know some people say that, but according to the definitions of the 
unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent 
chars in the Supplementary Planes. The later (1996) UTF-16, which Python 
uses, can. The standard has long considered 'UCS-2' obsolete. See

https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14

The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode 
implementation up to Unicode 1.1, before surrogate code points and 
UTF-16 were added to Version 2.0 of the standard. This term should now 
be avoided."

It goes on: "Sometimes in the past an implementation has been labeled 
"UCS-2" to indicate that it does not support supplementary characters 
and doesn't interpret pairs of surrogate code points as characters. Such 
an implementation would not handle processing of character properties, 
code point boundaries, collation, etc. for supplementary characters."

I know that 16-bit Python *does* use surrogate pairs for supplementary 
chars and at least some properties work for them. I am not sure exactly 
what the rest means.
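
For reference, the surrogate-pair arithmetic that UTF-16 uses for supplementary characters can be sketched as follows. The helper name `to_surrogate_pair` is made up for illustration; the cross-check at the end relies only on Python 3's built-in `utf-16-be` codec.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1D11E)   # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))               # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 codec (big-endian, no BOM).
print("\U0001D11E".encode("utf-16-be").hex())   # d834dd1e
```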

> However, this is entirely transparent.  To the Python programmer, a
> unicode string is just an abstraction of a sequence of code-points.
> You don't need to think about UCS-2 at all.  The only times you need
> to worry about encodings are when you're encoding unicode characters
> to byte strings, or decoding bytes to unicode characters, or opening a
> stream in text mode; and in those cases the only encoding that matters
> is the external one.

If one uses unicode chars in the Supplementary Planes above the BMP (the 
first 2**16 code points), which require surrogate pairs for 16 bit unicode 
(UTF-16), then the abstraction leaks.
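
One way to see the leak is to compare the number of code points in a string with the number of 16-bit code units its UTF-16 form needs. This is a small sketch, not anything posted in the thread; `utf16_length` is a hypothetical helper name, and the behaviour of `len()` noted in the comment depends on the Python build and version.

```python
def utf16_length(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

s = "a\U0001D11Eb"         # 'a', MUSICAL SYMBOL G CLEF, 'b'

print(utf16_length(s))     # 4: the clef needs two code units
print(len(s))              # 3 on wide builds and on Python 3.3+; a narrow build reports 4
```

On a narrow build, indexing and slicing work on those code units, so a slice can cut a surrogate pair in half.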

-- 
Terry Jan Reedy

#5270

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-05-12 16:25 -0600
Message-ID	<mailman.1500.1305239156.9059.python-list@python.org>
In reply to	#5209

On Thu, May 12, 2011 at 2:42 PM, Terry Reedy <tjreedy@udel.edu> wrote:
> On 5/12/2011 12:17 PM, Ian Kelly wrote:
>> Right.  *Under the hood* Python uses UCS-2 (which is not exactly the
>> same thing as UTF-16, by the way) to represent Unicode strings.
>
> I know some people say that, but according to the definitions of the unicode
> consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
> Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
> standard has long considered 'UCS-2' obsolete. See
>
> https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
> or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

    The internal format for Unicode objects should use a Python
    specific fixed format <PythonUnicode> implemented as 'unsigned
    short' (or another unsigned numeric type having 16 bits).  Byte
    order is platform dependent.

    This format will hold UTF-16 encodings of the corresponding
    Unicode ordinals.  The Python Unicode implementation will address
    these values as if they were UCS-2 values. UCS-2 and UTF-16 are
    the same for all currently defined Unicode character points.
    UTF-16 without surrogates provides access to about 64k characters
    and covers all characters in the Basic Multilingual Plane (BMP) of
    Unicode.

    It is the Codec's responsibility to ensure that the data they pass
    to the Unicode object constructor respects this assumption.  The
    constructor does not check the data for Unicode compliance or use
    of surrogates.

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.

Although I'm not sure how much this might have changed since the
original implementation, especially for Python 3.

#5191

From	"John Machin" <sjmachin@lexicon.net>
Date	2011-05-12 13:54 +1000
Message-ID	<mailman.1441.1305172465.9059.python-list@python.org>
In reply to	#5181

On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John.  In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters)
> and write them to a file (stdout) how does
> default encoding affect that file by default?   I'm not seeing that
> there is anything unusual going on...

About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.
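
A quick illustration in Python 3 syntax, using only built-in codecs and the standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

ch = "\u0404"                     # written as an escape, no bytes involved
print(unicodedata.name(ch))       # CYRILLIC CAPITAL LETTER UKRAINIAN IE

encoded = ch.encode("utf-8")
print(encoded)                    # b'\xd0\x84'
print(encoded.decode("utf-8") == ch)   # True

# Pure-ASCII text produces identical bytes under ASCII and UTF-8.
print("plain".encode("ascii") == "plain".encode("utf-8"))   # True
```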

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a
Linux/OS X platform. On a typical American / Western European /[former]
colonies Windows box, this is likely to be cp850 on a Command Prompt
window, and cp1252 in IDLE.

UTF-8: All Unicode characters are encodable in UTF-8. Only problem arises
if the terminal can't render the character -- you'll get spaces or blobs
or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.
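
The cp850 case can be reproduced on any platform by encoding explicitly, without involving a real Command Prompt. This is a sketch in Python 3 syntax; the sample string is invented, and the `replace` error handler is shown only as one possible way to avoid the exception.

```python
text = "caf\u00e9 \u0404"    # e-acute exists in cp850, the Cyrillic letter does not

try:
    text.encode("cp850")
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that could not be encoded.
    print("cannot encode U+%04X" % ord(text[exc.start]))   # cannot encode U+0404

# An explicit error handler trades the exception for a placeholder byte.
print(text.encode("cp850", errors="replace"))              # b'caf\x82 ?'
```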

If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else.
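
Rather than depending on whatever the default turns out to be, the encoding can be named when the file is opened. A minimal sketch in Python 3 syntax (on Python 2.6+, `io.open` accepts the same `encoding` argument); the temporary path and the sample text are placeholders.

```python
import os
import tempfile

text = "\u03bb\u03cc\u03b3\u03bf\u03c2"    # five Greek letters

path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Encode on the way out: the file receives UTF-8 bytes.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading in binary mode shows what is really on disk.
with open(path, "rb") as f:
    raw = f.read()
print(len(text), len(raw))    # 5 10

# Decode on the way in, using the same encoding.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == text)     # True
```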

>   If I open the file with vi?  If
> I open the file with gedit?  emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe?   ... guessing, are these unicode code
> points,

yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is
> it coded in the stream somewhere...protocol?

Should be.
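
For mail, the declared charset normally sits in the Content-Type header, and the standard `email` package can read it. The message below is fabricated purely for illustration; real mail may omit or misdeclare the charset, in which case a guess (for example with chardet) is the fallback.

```python
import email

# A made-up single-part message whose body is Greek text in ISO-8859-7.
raw = (
    b"Content-Type: text/plain; charset=iso-8859-7\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "\u03bb\u03cc\u03b3\u03bf\u03c2".encode("iso-8859-7")

msg = email.message_from_bytes(raw)
charset = msg.get_content_charset()       # read from the Content-Type header
body = msg.get_payload(decode=True)       # still bytes at this point
text = body.decode(charset or "ascii")    # fall back to ASCII if nothing was declared

print(charset)                                       # iso-8859-7
print(text == "\u03bb\u03cc\u03b3\u03bf\u03c2")      # True
```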

#5171

From	Benjamin Kaplan <benjamin.kaplan@case.edu>
Date	2011-05-11 15:34 -0700
Message-ID	<mailman.1434.1305153267.9059.python-list@python.org>
In reply to	#5163

On Wed, May 11, 2011 at 2:37 PM, harrismh777 <harrismh777@charter.net> wrote:
> hi folks,
>   I am puzzled by unicode generally, and within the context of python
> specifically. For one thing, what do we mean that unicode is used in python
> 3.x by default. (I know what default means, I mean, what changed?)
>
>   I think part of my problem is that I'm spoiled (American, ascii heritage)
> and have been either stuck in ascii knowingly, or UTF-8 without knowing
> (just because the code points lined up). I am confused by the implications
> for using 3.x, because I am reading that there are significant things to be
> aware of... what?
>
>   On my installation 2.6  sys.maxunicode comes up with 1114111, and my 2.7
> and 3.2 installs come up with 65535 each. So, I am assuming that 2.6 was
> compiled with UCS-4 (UTF-32) option for 4 byte unicode(?) and that the
> default compile option for 2.7 & 3.2 (I didn't change anything) is set for
> UCS-2 (UTF-16) or 2 byte unicode(?).   Do I understand this much correctly?
>

Not really sure about that, but it doesn't matter anyway. Because even
though internally the string is stored as either a UCS-2 or a UCS-4
string, you never see that. You just see this string as a sequence of
characters. If you want to turn it into a sequence of bytes, you have
to use an encoding.
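
The two values in the question, 65535 and 1114111, are 0xFFFF and 0x10FFFF, so `sys.maxunicode` is enough to tell the two kinds of build apart. A small sketch; `build_width` is a made-up helper name, and since Python 3.3 every build reports 1114111.

```python
import sys

def build_width():
    """Classify the interpreter build from sys.maxunicode."""
    if sys.maxunicode == 0x10FFFF:     # 1114111
        return "wide (full code point range)"
    return "narrow (16-bit code units)"  # sys.maxunicode == 0xFFFF, i.e. 65535

print(sys.maxunicode, build_width())
```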

>   The books say that the .py sources are UTF-8 by default... and that 3.x is
> either UCS-2 or UCS-4.  If I use the file handling capabilities of Python in
> 3.x (by default) what encoding will be used, and how will that affect the
> output?
>
>   If I do not specify any code points above ascii 0xFF does any of this
> matter anyway?

ASCII only goes up to 0x7F. If you were using UTF-8 bytestrings, then
there is a difference for anything over that range. A byte string is a
sequence of bytes. A unicode string is a sequence of these mythical
abstractions called characters. So a unicode string u'\u00a0' will
have a length of 1. Encode that to UTF-8 and you'll find it has a
length of 2 (because UTF-8 uses two or more bytes to encode every code
point from 128 upward; the top bit of a byte signals that it belongs to
a multi-byte sequence)
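
The length difference is easy to check. A short sketch in Python 3 syntax, where the `u` prefix is unnecessary; `utf8_len` is a hypothetical helper, and the last sample assumes a wide build or Python 3.3+.

```python
def utf8_len(ch):
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

nbsp = "\u00a0"
print(len(nbsp), utf8_len(nbsp))    # 1 2

# One sample from each UTF-8 length class: 1, 2, 3 and 4 bytes.
for ch in ("A", "\u00a0", "\u20ac", "\U0001D11E"):
    print("U+%04X" % ord(ch), utf8_len(ch))
```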

If you want the history behind the whole encoding mess, Joel Spolsky
wrote a rather amusing article explaining how this all came about:
http://www.joelonsoftware.com/articles/Unicode.html

And the biggest reason to use Unicode is so that you don't have to
worry about your program messing up because someone hands you input in
a different encoding than you used.
