| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2016-03-17 07:34 -0700 |
| Last post | 2016-03-18 11:18 -0700 |
| Articles | 12 on this page of 72 — 18 participants |
How to waste computer memory? wxjmfauth@gmail.com - 2016-03-17 07:34 -0700
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 12:21 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 20:31 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 07:42 +1100
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-17 21:08 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:13 +1100
Re: How to waste computer memory? Paul Rubin <no.email@nospam.invalid> - 2016-03-17 14:30 -0700
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:32 +0000
Re: How to waste computer memory? cl@isbd.net - 2016-03-17 22:42 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-17 23:11 +0200
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 08:17 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-17 21:26 +0000
Re: How to waste computer memory? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2016-03-17 22:38 +0000
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 10:02 +1100
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:37 +0000
Re: How to waste computer memory? alister <alister.ware@ntlworld.com> - 2016-03-17 21:43 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-17 20:51 -0400
Re: How to waste computer memory? Rick Johnson <rantingrickjohnson@gmail.com> - 2016-03-17 18:47 -0700
Re: How to waste computer memory? cl@isbd.net - 2016-03-18 10:44 +0000
Re: How to waste computer memory? Gene Heskett <gheskett@wdtv.com> - 2016-03-18 10:11 -0400
Re: How to waste computer memory? Grant Edwards <invalid@invalid.invalid> - 2016-03-19 13:50 +0000
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 01:00 -0600
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 10:26 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 17:26 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:58 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:02 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-18 23:28 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 00:03 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 09:49 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 10:22 +0200
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:40 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:38 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 00:14 -0700
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 02:17 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 19:14 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 11:31 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 03:40 -0700
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 13:07 +0200
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 12:24 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:43 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:18 +1100
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:14 +0000
Re: How to waste computer memory? BartC <bc@freeuk.com> - 2016-03-19 15:20 +0000
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 22:32 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 14:42 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:39 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 16:56 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-19 07:01 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 01:56 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 17:02 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 02:47 +1100
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-19 18:12 +0200
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 16:01 +1100
Re: How to waste computer memory? Rustom Mody <rustompmody@gmail.com> - 2016-03-19 23:20 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 22:06 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 22:22 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-20 23:14 +1100
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-20 23:27 +1100
Re: How to waste computer memory? Ben Bacarisse <ben.usenet@bsb.me.uk> - 2016-03-20 14:55 +0000
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 17:36 +0200
Re: How to waste computer memory? Random832 <random832@fastmail.com> - 2016-03-20 14:17 -0400
Re: How to waste computer memory? Marko Rauhamaa <marko@pacujo.net> - 2016-03-20 09:30 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 03:50 -0700
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:46 +1100
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-18 22:58 +1100
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 12:53 -0700
Re: How to waste computer memory? Chris Angelico <rosuav@gmail.com> - 2016-03-18 23:37 +1100
Re: How to waste computer memory? Ian Kelly <ian.g.kelly@gmail.com> - 2016-03-18 07:57 -0600
Re: How to waste computer memory? Steven D'Aprano <steve@pearwood.info> - 2016-03-19 03:44 +1100
Re: How to waste computer memory? Jussi Piitulainen <jussi.piitulainen@helsinki.fi> - 2016-03-18 20:22 +0200
Re: How to waste computer memory? wxjmfauth@gmail.com - 2016-03-18 13:03 -0700
Re: How to waste computer memory? sohcahtoa82@gmail.com - 2016-03-18 11:18 -0700
| From | Random832 <random832@fastmail.com> |
|---|---|
| Date | 2016-03-20 14:17 -0400 |
| Message-ID | <mailman.408.1458498127.12893.python-list@python.org> |
| In reply to | #105303 |
On Sun, Mar 20, 2016, at 10:55, Ben Bacarisse wrote:
> It's 21. The reason being (or at least part of the reason being) that
> 21 bits can be UTF-8 encoded in 4 bytes: 11110xxx 10xxxxxx 10xxxxxx
> 10xxxxxx (3 + 3*6).

The reason is the UTF-16 limit. Prior to that, UTF-8 had no such limit
(it could encode up to 31 bits, as six bytes), and it doesn't account
for the fact that four bytes can encode up to U+1FFFFF rather than
U+10FFFF.
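The bit counts in this exchange can be checked with a few lines of Python. This is only a sketch: the helper name `payload_bits` is made up for illustration, and it models the original UTF-8 layout of up to six bytes rather than the modern four-byte limit.

```python
def payload_bits(nbytes):
    """Payload bits carried by an n-byte sequence in the original UTF-8 layout."""
    if nbytes == 1:
        return 7                          # 0xxxxxxx
    lead_bits = 7 - nbytes                # 110xxxxx carries 5, 1110xxxx carries 4, ...
    return lead_bits + 6 * (nbytes - 1)   # every 10xxxxxx continuation byte adds 6

for n in range(1, 7):
    bits = payload_bits(n)
    print(n, "bytes:", bits, "bits, max U+%X" % ((1 << bits) - 1))

# Python's own codec stops at U+10FFFF, the ceiling imposed by UTF-16.
four_byte_max = chr(0x10FFFF).encode("utf-8")
```

Four bytes give 3 + 3*6 = 21 payload bits, which is enough for U+1FFFFF, so the U+10FFFF ceiling does not come from the four-byte form itself.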
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-03-20 09:30 +0200 |
| Message-ID | <87oaa98v2q.fsf@elektro.pacujo.net> |
| In reply to | #105292 |
Steven D'Aprano <steve@pearwood.info>:

> On Sun, 20 Mar 2016 03:12 am, Marko Rauhamaa wrote:
>> Steven D'Aprano <steve@pearwood.info>:
>>> On Sun, 20 Mar 2016 02:02 am, Marko Rauhamaa wrote:
>>>> Yes, but UTF-16 produces 16-bit values that are outside Unicode.
>>>
>>> Show me.
>>>
>>> Before you answer, if your answer is "surrogate pairs", that is
>>> incorrect. Surrogate pairs is how UTF-16 encodes astral characters.
>>
>> UTF-16 inputs a Unicode stream and produces a stream of 16-bit numbers.
>> Thus, the output of UTF-16 is not Unicode.
>
> [...]
>
> If your point is that the data you get from running UTF-16 on a
> sequence of code points is "not Unicode, but 2-byte words", then I
> agree, but I'm not sure why you think that's significant.

I say the surrogate characters are not Unicode. You say they are because
they are used to encode astral characters. I say that point is
irrelevant. I'm saying the surrogate characters are not Unicode because
you are not allowed to store or communicate them. They are a hole in the
Unicode fabric.

They could have, probably should have, specified a UTF-16 encoding for
the surrogate characters as well. That would have left the Unicode range
uninterrupted.

Well, the silver lining is that Python gained a number of extra code
points it was free to use for special purposes, although to be faithful
to Unicode, Python should refuse to store them.

Marko
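Both halves of that last paragraph can be seen in Python 3. The sketch below assumes CPython 3 and uses only the standard codec error handlers: a `str` will hold a lone surrogate, the strict UTF-8 codec refuses to encode it, and the `surrogateescape` handler is one of the "special purposes" such code points are put to.

```python
lone = "\ud800"                     # a lone surrogate; str stores it without complaint

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False               # the strict codec will not communicate it

# Opt-in escape hatch: encode the surrogate as if it were an ordinary code point.
passed = lone.encode("utf-8", "surrogatepass")

# Special-purpose use: undecodable bytes are smuggled through str as lone surrogates.
raw = b"caf\xe9"                    # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")
```

Here `round_trip` equals `raw`, so the bytes survive the trip through `str` even though they were never valid UTF-8.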
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-18 03:50 -0700 |
| Message-ID | <54949dca-af8b-4c7a-8b3e-03492b84bd56@googlegroups.com> |
| In reply to | #105182 |
On Friday, 18 March 2016 at 08:01:05 UTC+1, Ian wrote:
>
> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purpose

No, not at all. I do not really care about utf-8/16/32.

> [...] , ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation

Again no. I'm very aware of this, as were the designers of
"Unicode" or "ISO-10646", a long time ago.
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-18 22:46 +1100 |
| Message-ID | <56ebea83$0$1599$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105182 |
On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:

> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
> <rantingrickjohnson@gmail.com> wrote:
>> In the event that i change my mind about Unicode, and/or for
>> the sake of others, who may want to know, please provide a
>> list of languages that *YOU* think handle Unicode better than
>> Python, starting with the best first. Thanks.

Better than Python? Easy-peasy:

List of languages with Unicode handling which is better than Python = []

I'm not aware of any language with better or more complete Unicode
functionality than Python's. (That doesn't necessarily mean that they don't
exist.)

> jmf has been asked this before, and as I recall he seems to feel that
> UTF-8 should be used for all purposes, ignoring the limitations of
> that encoding such as that indexing becomes a O(n) operation.

Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
instance, your UTF-8 string might consist of an array of bytes containing
the string, plus an array of indexes to the start of each code point. For
example, the string:

“abcπßЊ•𒀁”

(including the quote marks) is 10 code points in length and 22 bytes as
UTF-8. Grouping the (hex) bytes for each code point, we have:

e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d

so we could get a O(1) UTF-8 string by recording the bytes (in hex) plus the
indexes (in decimal) in which each code point starts:

e2809c616263cf80c39fd08ae280a2f0928081e2809d

0 3 4 5 6 8 10 12 15 19

but (assuming each index needs 2 bytes, which supports strings up to 65535
characters in length), that's actually LESS memory efficient than UTF-32:
42 bytes versus 40.

> He has
> pointed at Go as an example of a language wherein Unicode "just
> works", although I think that others do not necessarily agree [1].

I think it is typical of JMF that his idea of a language where Unicode
"just works" is one where it *does work at all* (at least not as strings).
Python 1.5 strings supported Unicode just as well as Go's string class.

In Go, the right way to handle Unicode is to use "runes", not strings. I
don't know how well that works though -- I suspect it is still pretty
primitive.

> [1] https://coderwall.com/p/k7zvyg/dealing-with-unicode-in-go

Nice link, thanks!

--
Steven
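The index-table idea from this post can be sketched in Python as follows. The class name `IndexedUTF8` and its methods are hypothetical, invented here for illustration; this is not how any Python implementation stores strings. The table uses 16-bit entries to match the "2 bytes per index" assumption in the post.

```python
from array import array

class IndexedUTF8:
    """UTF-8 bytes plus a table of code point start offsets (illustrative only)."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.starts = array("H")          # unsigned 16-bit start offsets
        pos = 0
        for ch in text:
            self.starts.append(pos)       # raises OverflowError past 65535
            pos += len(ch.encode("utf-8"))

    def __len__(self):
        return len(self.starts)

    def __getitem__(self, i):
        n = len(self.starts)
        if i < 0:
            i += n
        if not 0 <= i < n:
            raise IndexError("string index out of range")
        start = self.starts[i]            # O(1) table lookup, no scanning
        end = self.starts[i + 1] if i + 1 < n else len(self.data)
        return self.data[start:end].decode("utf-8")

    def nbytes(self):
        return len(self.data) + self.starts.itemsize * len(self.starts)

text = "“abcπßЊ•𒀁”"
s = IndexedUTF8(text)
utf32_bytes = len(text.encode("utf-32-le"))   # fixed 4 bytes per code point
print(list(s.starts), s.nbytes(), utf32_bytes)
```

For the example string the table holds the ten offsets listed in the post, and the payload comes to 22 + 20 = 42 bytes against 40 for UTF-32, ignoring any per-object overhead.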
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-18 22:58 +1100 |
| Message-ID | <56ebed84$0$1609$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105200 |
On Fri, 18 Mar 2016 10:46 pm, Steven D'Aprano wrote:

> I think it is typical of JMF that his idea of a language where Unicode
> "just works" is one where it *does work at all* (at least not as strings).

Er, does NOT work at all.

> Python 1.5 strings supported Unicode just as well as Go's string class.

Since I'm replying to myself, I guess I can take the opportunity to expand
on this. Go's concept of strings is, more or less, byte strings:

https://blog.golang.org/strings

They are handled as an array of bytes and indexing produces bytes. That's
exactly the same functionality as Python strings provided in version 1.5.

In fairness, Go does provide a second type, "runes", which is equivalent to
Python 2.7 unicode using a wide build (i.e. equivalent to UTF-32).

--
Steven
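The difference between indexing bytes and indexing code points can be shown with Python 3's `bytes` and `str`. This is an analogy in Python, not Go code, and the variable names are illustrative.

```python
text = "aπ"
raw = text.encode("utf-8")        # b'a\xcf\x80'

byte_at_1 = raw[1]                # indexing bytes yields one byte of the encoded π
char_at_1 = text[1]               # indexing str yields the whole code point
lengths = (len(raw), len(text))   # byte count versus code point count
```

Indexing `raw` lands in the middle of the two-byte encoding of π, which is the behaviour the post attributes to byte-oriented strings.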
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-18 12:53 -0700 |
| Message-ID | <d3e920a1-4026-4657-843d-0cfb51287dd8@googlegroups.com> |
| In reply to | #105203 |
On Friday, 18 March 2016 at 19:34:27 UTC+1, Terry Reedy wrote:
>
> Python's space-saving FSR was developed years after Go was.
>
> --

The FSR does not save memory; it uses less memory only
if one *explicitly* uses chars consuming less memory.
Exactly the opposite of what utf-8/16 do!

(You know how to use an interactive interpreter.)
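The behaviour under dispute can be measured in an interactive session. The sketch below assumes CPython 3.3 or later, where the flexible string representation (FSR) stores each string at 1, 2 or 4 bytes per code point according to its widest character. Exact byte counts vary by version and platform, so none are quoted here.

```python
import sys

samples = {
    "ascii":   "a" * 1000,
    "latin-1": "\xe9" * 1000,
    "bmp":     "\u20ac" * 1000,
    "astral":  "\U0001F600" * 1000,
}
sizes = {name: sys.getsizeof(s) for name, s in samples.items()}

# A single astral character widens the storage of the whole string.
mixed = sys.getsizeof("a" * 999 + "\U0001F600")

for name, size in sizes.items():
    print(name, size)
print("mixed", mixed)
```

The per-string width is the trade-off being argued over: ASCII-only text stays compact, while one wide character makes every character in that string wide.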
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-03-18 23:37 +1100 |
| Message-ID | <mailman.314.1458304644.12893.python-list@python.org> |
| In reply to | #105200 |
On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Fri, 18 Mar 2016 06:00 pm, Ian Kelly wrote:
>
>> On Thu, Mar 17, 2016 at 1:21 PM, Rick Johnson
>> <rantingrickjohnson@gmail.com> wrote:
>>> In the event that i change my mind about Unicode, and/or for
>>> the sake of others, who may want to know, please provide a
>>> list of languages that *YOU* think handle Unicode better than
>>> Python, starting with the best first. Thanks.
>
> Better than Python? Easy-peasy:
>
> List of languages with Unicode handling which is better than Python = []
>
> I'm not aware of any language with better or more complete Unicode
> functionality than Python's. (That doesn't necessarily mean that they don't
> exist.)
And this also doesn't preclude languages that have *as good* handling
as Python's, of which I know of one off-hand, and there may be any
number. (Trivial case: Take Python 3.5, change the definition of a
block to be { } instead of indentation, and release it as Bracethon
1.0. Voila, a distinct-yet-related language whose Unicode handling is
exactly as good as Python's.)
>> jmf has been asked this before, and as I recall he seems to feel that
>> UTF-8 should be used for all purposes, ignoring the limitations of
>> that encoding such as that indexing becomes a O(n) operation.
>
> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
> instance, your UTF-8 string might consist of an array of bytes containing
> the string, plus an array of indexes to the start of each code point. For
> example, the string:
>
> “abcπßЊ•𒀁”
>
> (including the quote marks) is 10 code points in length and 22 bytes as
> UTF-8. Grouping the (hex) bytes for each code point, we have:
>
> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>
> so we could get a O(1) UTF-8 string by recording the bytes (in hex) plus the
> indexes (in decimal) in which each code point starts:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>
> 0 3 4 5 6 8 10 12 15 19
>
> but (assuming each index needs 2 bytes, which supports strings up to 65535
> characters in length), that's actually LESS memory efficient than UTF-32:
> 42 bytes versus 40.
A lot of strings will have no more than 255 non-ASCII characters in
them. (For example, all strings which have no more than 255 total
characters.) You could store, instead of the indexes themselves, a
series of one-byte offsets:
e2809c616263cf80c39fd08ae280a2f0928081e2809d
0 2 2 2 2 3 4 5 7 10
Locating a byte based on its character position is still O(1); you
look up that position in the offset table, add that to your original
character position, and you have the byte location. For strings with
too many non-ASCII codepoints, you'd need some other representation,
but at that point, it might be worth just switching to UTF-32.
Of course, O(1) isn't the ultimate goal to the exclusion of all else.
For a simple sequential parser, indexing might be such a rare
operation that it's okay for it to be O(N), as you're never going to
index more than a few characters from a known position. Or if you're
trying to search a few gig of text, it's entirely possible that
transcoding into an indexable format is a complete waste of time, and
it's better to just work with a stream of bytes straight off the disk.
But for a general string type in a high level language, I'm normally
going to assume that indexing is fairly cheap.
ChrisA
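The one-byte offset table described in this post can be sketched the same way. `OffsetUTF8` is a hypothetical name, and the sketch simply raises an error once an offset no longer fits in a byte, where the post suggests switching representation.

```python
class OffsetUTF8:
    """UTF-8 bytes plus one-byte (byte index - character index) offsets."""

    def __init__(self, text):
        self.data = text.encode("utf-8")
        offsets = bytearray()
        byte_pos = 0
        for char_pos, ch in enumerate(text):
            offset = byte_pos - char_pos
            if offset > 255:
                raise ValueError("offset no longer fits in one byte")
            offsets.append(offset)
            byte_pos += len(ch.encode("utf-8"))
        self.offsets = bytes(offsets)

    def byte_index(self, i):
        if not 0 <= i < len(self.offsets):
            raise IndexError("string index out of range")
        return i + self.offsets[i]        # O(1): one lookup and one addition

    def __getitem__(self, i):
        start = self.byte_index(i)
        last = len(self.offsets) - 1
        end = self.byte_index(i + 1) if i < last else len(self.data)
        return self.data[start:end].decode("utf-8")

s = OffsetUTF8("“abcπßЊ•𒀁”")
print(list(s.offsets), len(s.data) + len(s.offsets))
```

For the example string this takes 22 + 10 = 32 bytes of payload, and the offsets match the row given in the post.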
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2016-03-18 07:57 -0600 |
| Message-ID | <mailman.318.1458309520.12893.python-list@python.org> |
| In reply to | #105200 |
On Fri, Mar 18, 2016 at 6:37 AM, Chris Angelico <rosuav@gmail.com> wrote:
> On Fri, Mar 18, 2016 at 10:46 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>> Technically, UTF-8 doesn't *necessarily* imply indexing is O(n). For
>> instance, your UTF-8 string might consist of an array of bytes containing
>> the string, plus an array of indexes to the start of each code point. For
>> example, the string:
>>
>> “abcπßЊ•𒀁”
>>
>> (including the quote marks) is 10 code points in length and 22 bytes as
>> UTF-8. Grouping the (hex) bytes for each code point, we have:
>>
>> e2809c 61 62 63 cf80 c39f d08a e280a2 f0928081 e2809d
>>
>> so we could get a O(1) UTF-8 string by recording the bytes (in hex) plus the
>> indexes (in decimal) in which each code point starts:
>>
>> e2809c616263cf80c39fd08ae280a2f0928081e2809d
>>
>> 0 3 4 5 6 8 10 12 15 19
>>
>> but (assuming each index needs 2 bytes, which supports strings up to 65535
>> characters in length), that's actually LESS memory efficient than UTF-32:
>> 42 bytes versus 40.
>
> A lot of strings will have no more than 255 non-ASCII characters in
> them. (For example, all strings which have no more than 255 total
> characters.) You could store, instead of the indexes themselves, a
> series of one-byte offsets:
>
> e2809c616263cf80c39fd08ae280a2f0928081e2809d
> 0 2 2 2 2 3 4 5 7 10
>
> Locating a byte based on its character position is still O(1); you
> look up that position in the offset table, add that to your original
> character position, and you have the byte location. For strings with
> too many non-ASCII codepoints, you'd need some other representation,
> but at that point, it might be worth just switching to UTF-32.

So this uses approximately twice as much memory as the FSR and still
requires switching on some form of character width in the
implementation? Yeah, I don't think the RUE is going to go for that. 8-)
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-03-19 03:44 +1100 |
| Message-ID | <56ec3073$0$1587$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #105142 |
On Sat, 19 Mar 2016 02:31 am, Random832 wrote:
> On Fri, Mar 18, 2016, at 11:17, Ian Kelly wrote:
>> > Just to play devil's advocate, here, why is it so bad for indexing to
>> > be O(n)? Some simple caching is all that's needed to prevent it from
>> > making iteration O(n^2), if that's what you're worried about.
>>
>> What kind of caching do you have in mind?
>
> The byte index of the last character index accessed.
Some people, when faced with a problem, think, I know, I'll use a cache. Now
they have 99 problems (but a bitch ain't one).
> When accessing a
> new index greater than that one, start from there (_maybe_ also support
> iterating backwards in this way, if accessing an index that is much
> closer to the cached one than to zero). And even that's only necessary
> if you actually _care_ about forward iteration by character indices
> (i.e. "for i in range(len(s))"), rather than writing it off as bad
> coding style.
Without locking, this kills thread safety for strings. With locking, it
probably kills performance. Worse, even in single-threaded code, functions
can mess up the cache, destroying any usefulness it may have:
x = mystring[9999] # sets the cache to index 9999
spam(mystring) # calls mystring[0], setting the cache back to 0
y = mystring[10000]
And I don't understand this meme that indexing strings is not important.
Have people never (say) taken a slice of a string, or a look-ahead, or
something similar?
i = mystring.find(":")
next_char = mystring[i+1]
# Strip the first and last chars from a string
mystring[1:-1]
Perhaps you might argue that for some applications, O(N) string operations
don't matter. After all, in Linux, terminals usually are set to UTF-8, and
people can copy and paste substrings out of the buffer, etc. So there may
be a case to say that for applications with line-oriented buffers typically
containing 70 or 170 characters per line, like a terminal, UTF-8 is
perfectly adequate. I have no argument against that.
But I think that for a programming language which may be dealing with
strings of multiple tens of megabytes in size, I'm skeptical that UTF-8
doesn't cause a performance hit for at least some operations.
>> It's not the only drawback, either. If you want to know anything about
>> the characters in the string that you're looking at, you need to know
>> their codepoints.
>
> Nonsense. That depends on what you want to know about it. You can
> extract a single character from a string, as a string, without knowing
> anything about it except what range the first byte is in. You can use
> this string directly as an index to a hash table containing information
> such as unicode properties, names, etc.
I don't understand your comment. If I give you the index of the character,
how do you know where its first byte is? With UTF-8, character i can be
anywhere between byte i and 4*i.
(And by character I actually mean code point -- let's not get into arguments
about normalisation.)
>> If the string is simple UCS-2, that's easy.
Hmmm, well, nobody uses UCS-2 any more, since that only covers the first
65536 code points. Rather, languages like Javascript and Java, and the
Windows OS, use UTF-16, which is a *variable width* extension to UCS-2. I
don't know about Windows, but Javascript implements this badly, so that
4-byte UTF-16 code points are treated as *two* surrogate code points
instead of the single code point they are meant to be. We can get the same
result in Python 2.7 narrow builds:
py> s = u'\U0010FFFF' # definitely a single code point
py> len(s) # WTF?
2
py> s[0] # a surrogate.
u'\udbff'
py> s[1] # another surrogate
u'\udfff'
The only fixed-width encoding of the entire Unicode character set is UTF-32.
--
Steven
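The two surrogates in the narrow-build session follow from the UTF-16 encoding rule, which can be reproduced in Python 3. The helper name `surrogate_pair` is made up for this sketch.

```python
def surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point")
    v = code_point - 0x10000                        # 20 bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

high, low = surrogate_pair(0x10FFFF)
print(hex(high), hex(low))

# The codec produces the same two 16-bit units, while a Python 3.3+ str
# still counts one code point.
units = "\U0010FFFF".encode("utf-16-be")
length = len("\U0010FFFF")
```

The pair comes out as 0xDBFF and 0xDFFF, the same values the narrow build exposes as `s[0]` and `s[1]`.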
| From | Jussi Piitulainen <jussi.piitulainen@helsinki.fi> |
|---|---|
| Date | 2016-03-18 20:22 +0200 |
| Message-ID | <lf5egb78x30.fsf@ling.helsinki.fi> |
| In reply to | #105226 |
Steven D'Aprano writes:
> And I don't understand this meme that indexing strings is not
> important. Have people never (say) taken a slice of a string, or a
> look-ahead, or something similar?
>
> i = mystring.find(":")
> next_char = mystring[i+1]
The point is that O(1) indexing and slicing *can be done* in UTF-8! Once
you've found the byte index of that ':', the next index can be computed
locally.
Here's a UTF-8 string in Julia, with some 2-byte characters. Indexing is
by byte (1-based).
julia> s = "Heinäpaalin kierittäminen turvaköyttä pitkin on ältsin vaikeeta.";
The index of the 'ö' is found in O(n) time, same as you find the ":" in
Python:
julia> search(s, 'ö')
35
There are methods for finding the next valid index, in O(1) time:
julia> nextind(s, 35)
37
julia> next(s, 35)
('ö',37)
The character at a known index is found in O(1) time:
julia> s[37]
'y'
The index of the previous character is found in O(1) time:
julia> prevind(s, 35)
34
The index of the first space after the 'ö' is again found in O(n) time,
just like it would be in Python:
julia> search(s, ' ', 35)
42
The index of the character preceding that space is found in O(1) time;
this one is a two-byte character, so it's not the byte at 41 but at 40:
julia> prevind(s, 42)
40
And slicing involves no search:
julia> s[34:40]
"köyttä"
> # Strip the first and last chars from a string
> mystring[1:-1]
julia> s[nextind(s, start(s)):prevind(s, endof(s))]
"einäpaalin kierittäminen turvaköyttä pitkin on ältsin vaikeeta"
That's a bit wordy but there's no O(n) search involved.
(I may not know the simplest way to do this in Julia. Or the above may
be the simplest way out of the box. Not sure. And not too worried.)
> But I think that for a programming language which may be dealing with
> strings of multiple tens of megabytes in size, I'm skeptical that
> UTF-8 doesn't cause a performance hit for at least some operations.
Possibly. I've never had an unstructured string of such size. Who works
with such?
I am processing hundreds of such text files right now. Thousands. Why
would anyone know the millionth character in any of them? It's just not
meaningful. The meaningful questions involve positions that are already
somehow identified.
>>> It's not the only drawback, either. If you want to know anything
>>> about the characters in the string that you're looking at, you need
>>> to know their codepoints.
>>
>> Nonsense. That depends on what you want to know about it. You can
>> extract a single character from a string, as a string, without
>> knowing anything about it except what range the first byte is in. You
>> can use this string directly as an index to a hash table containing
>> information such as unicode properties, names, etc.
>
> I don't understand your comment. If I give you the index of the
> character, how do you know where its first byte is? With UTF-8,
> character i can be anywhere between byte i and 4*i.
If you give me the byte index, then the character is at that index.
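The Julia session above translates to Python `bytes` roughly as follows. The helpers `next_index`, `prev_index` and `char_at` are hypothetical names written for this sketch, indexes are 0-based rather than Julia's 1-based, and the input is assumed to be valid UTF-8.

```python
def is_continuation(byte):
    return (byte & 0xC0) == 0x80          # 10xxxxxx marks a continuation byte

def next_index(data, i):
    """Byte index of the code point after the one starting at i."""
    i += 1
    while i < len(data) and is_continuation(data[i]):
        i += 1                            # at most 3 steps in valid UTF-8
    return i

def prev_index(data, i):
    """Byte index of the code point before the one starting at i."""
    i -= 1
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i

def char_at(data, i):
    return data[i:next_index(data, i)].decode("utf-8")

s = "Heinäpaalin kierittäminen turvaköyttä pitkin on ältsin vaikeeta.".encode("utf-8")
o = s.find("ö".encode("utf-8"))           # O(n) search, like str.find
space = s.find(b" ", o)                   # O(n) search for the next space
word = s[prev_index(s, o):space].decode("utf-8")
stripped = s[next_index(s, 0):prev_index(s, len(s))].decode("utf-8")
print(o, next_index(s, o), char_at(s, o), word)
```

Each step from a known byte index to its neighbour inspects at most a few bytes, so only the searches are O(n), which is the point the post makes.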
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-03-18 13:03 -0700 |
| Message-ID | <a5cb0df1-3b14-448b-ab24-8f234620c5d6@googlegroups.com> |
| In reply to | #105231 |
On Friday, 18 March 2016 at 19:23:13 UTC+1, Jussi Piitulainen wrote:
>
> If you give me the byte index, then the character is at that index.

Yes, this is probably the most important point of the utf-X
constructs. Given a stream of bytes and an index position,
it is always possible to recover the "corresponding" encoded
code point [*], without doing any "real" decoding.

[*] I deliberately wrote *CODE POINT* and not *CHAR*.
| From | sohcahtoa82@gmail.com |
|---|---|
| Date | 2016-03-18 11:18 -0700 |
| Message-ID | <eb49e08d-a744-440e-bff6-730bcf288c15@googlegroups.com> |
| In reply to | #105093 |
On Thursday, March 17, 2016 at 7:34:46 AM UTC-7, wxjm...@gmail.com wrote:
> Very simple. Use Python and its (buggy) character encoding
> model.
>
> How to save memory?
> It's also very simple. Use a programming language, which
> handles Unicode correctly.

*looks at the other messages in this thread*

... Whatever happened to the idea of not feeding the trolls?

This image depicts this thread perfectly: http://i.imgur.com/sVtpZDK.png