| Started by | Devyn Collier Johnson <devyncjohnson@gmail.com> |
|---|---|
| First post | 2013-07-11 19:44 -0400 |
| Last post | 2013-07-18 13:17 -0700 |
| Articles | 20 on this page of 136 — 25 participants |
RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-30 07:01 -0700 |
| Message-ID | <43ce1b65-9d6d-47dd-b209-9a3bbafc0b8c@googlegroups.com> |
| In reply to | #51380 |
On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM, <wxjmfauth@gmail.com> wrote:
> > Back to utf. utfs are not only elements of a unique set of encoded
> > code points. They have an interesting feature. Each "utf chunk"
> > holds intrinsically the character (in fact the code point) it is
> > supposed to represent. In utf-32, the obvious case, it is just
> > the code point. In utf-8, that's the first chunk which helps and
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example, how to find the char under the caret? In fact,
> > a silly example, how can the caret be positioned or moved, if
> > the underlying corresponding encoded code point can not be
> > discerned!
>
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted. The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index. Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16
> without an auxiliary data structure that is an O(n) operation.
------
Same conceptual mistake as Steven's example with its buffers:
the buffer does not know it holds characters.
This is not the point under discussion.
-----
I am pretty sure that once you have typed your 127504
ascii characters, you are very happy that the buffer of your
editor does not waste time reencoding the buffer as
soon as you enter an €, the 127505th char. Sorry, I wanted
to say z instead of euro, just to show that backspacing the
last char and entering a new char implies reencoding twice.
Somebody wrote that the "FSR" is just an optimization. Yes, but in
the case of an editor à la FSR, this optimization takes place every
time you enter a char. Your poor editor, in fact the FSR, ends up
spending its time optimizing and in the end it optimizes nothing.
(It is even worse.)
If you correctly type a z instead of an €, it is not necessary
to reencode the buffer. Problem: how do you know that you do
not have to reencode? Simple, just check it. But that check
wastes time testing whether you have to optimize or not, and hurts
a little bit more what is supposed to be an optimization.
Do not confuse the process of optimization and the result of
optimization (funny, it's like the utf's).
There is a trick to let the editor know whether it has
to be "optimized". Just put a flag somewhere. Then
you fall into the "Houston" syndrome. Houston, we've got a
problem, our buffer consumes many more bytes than expected.
>>> import sys
>>> sys.getsizeof('€')
40
>>> sys.getsizeof('a')
26
Now the good news. In an editor à la FSR, the
"composition" is not so important. You know,
"practicality beats purity". The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not working à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.
jmf
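The two operations contrasted in the quoted exchange can be sketched on raw UTF-8 bytes. This is an illustration only, not code from any poster: `codepoint_start` and `byte_offset` are hypothetical helper names. Stepping back from an arbitrary byte position to the start of its code point needs at most three steps, while finding the byte offset of the character at a given index means scanning from the beginning unless an auxiliary index is kept.

```python
def codepoint_start(data: bytes, pos: int) -> int:
    """Step back from any byte position to the first byte of its code point."""
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


def byte_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the character at `index` by scanning: O(n)."""
    seen = -1
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # any non-continuation byte starts a code point
            seen += 1
            if seen == index:
                return offset
    raise IndexError("character index out of range")


# One, three, one, four and one bytes per character respectively.
text = "a\u20acb" + "\U0001D11E" + "c"
data = text.encode("utf-8")
print([byte_offset(data, i) for i in range(len(text))])
print(codepoint_start(data, 3))
```

The first function is the resynchronization case; the second is what a caret expressed as a character index would need on every lookup.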
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-30 16:38 +0200 |
| Message-ID | <mailman.5311.1375195157.3114.python-list@python.org> |
| In reply to | #51558 |
On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 127505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.

Using a single string as an editor buffer is a bad idea in python for
the simple reason that strings are immutable. So adding characters would
mean continuously copying the string buffer into a new string with the
next character added. Copying 127504 characters into a new string will
not make that much of a difference whether the octets are just copied to
octets or are unpacked into 32 bit words.

> Somebody wrote "FSR" is just an optimization. Yes, but in case
> of an editor à la FSR, this optimization take place everytime you
> enter a char. Your poor editor, in fact the FSR, is finally
> spending its time in optimizing and finally it optimizes nothing.
> (It is even worse).

Even if you would do it this way, it would *not* take place every time
you enter a char. Once your buffer would contain a wide character, it
would just need to convert the single character that is added after each
keystroke. It would not need to convert the whole buffer after each
key stroke.

> If you type correctly a z instead of an €, it is not necessary
> to reencode the buffer. Problem, you do you know that you do
> not have to reencode? simple just check it, and by just checking
> it wastes time to test it you have to optimized or not and hurt
> a little bit more what is supposed to be an optimization.

Your scenario is totally unrealistic. First of all because of the
immutable nature of python strings, second because you suggest that
real time usage would result in frequent conversions which is highly
unlikely.

--
Antoon Pardon
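The claim that widening is a one-off event can be checked roughly as follows. This is a sketch assuming CPython 3.3 or later (PEP 393); `bytes_per_char` is a hypothetical helper, not part of any library, and it estimates the storage width from `sys.getsizeof` differences.

```python
import sys


def bytes_per_char(s: str) -> int:
    """Estimate the storage width per character of a non-empty str.

    Doubling the string adds len(s) characters of the same width, so the
    size difference divided by len(s) is the per-character cost.
    """
    return (sys.getsizeof(s + s) - sys.getsizeof(s)) // len(s)


ascii_text = "a" * 127504
widened = ascii_text + "\u20ac"  # adding a euro sign widens the new string once
more = widened + "z"             # later additions keep the same width

for label, s in [("ascii", ascii_text), ("with euro", widened), ("plus z", more)]:
    print(label, len(s), bytes_per_char(s))
```

The copy made by each `+` happens regardless of the character width, which is the cost the post attributes to immutability rather than to the FSR.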
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-30 15:45 +0100 |
| Message-ID | <mailman.5312.1375195560.3114.python-list@python.org> |
| In reply to | #51558 |
On Tue, Jul 30, 2013 at 3:01 PM, <wxjmfauth@gmail.com> wrote:
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 127505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.
You're still thinking that the editor's buffer is a Python string. As
I've shown earlier, this is a really bad idea, and that has nothing to
do with FSR/PEP 393. An immutable string is *horribly* inefficient at
this; if you want to keep concatenating onto a string, the recommended
method is a list of strings that gets join()d at the end, and the same
technique works well here. Here's a little demo class that could make
the basis for such a system:
class EditorBuffer:
    """Editor buffer kept as a list of string chunks."""
    def __init__(self, fn):
        self.fn = fn
        # Start with the whole file as a single chunk
        with open(fn) as f:
            self.buffer = [f.read()]
    def insert(self, pos, char):
        if pos == 0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0]) > 1024: self.buffer.insert(0, char)
            else: self.buffer[0] = char + self.buffer[0]
            return
        for idx, part in enumerate(self.buffer):
            l = len(part)
            if pos > l:
                # Cursor is beyond this chunk; skip it
                pos -= l
                continue
            if pos < l:
                # Cursor is somewhere inside this string
                splitme = self.buffer[idx]
                self.buffer[idx:idx+1] = splitme[:pos], splitme[pos:]
                l = pos
            # Cursor is now at the end of this string
            if l > 1024: self.buffer[idx:idx+1] = self.buffer[idx], char
            else: self.buffer[idx] += char
            return
        raise ValueError("Cannot insert past end of buffer")
    def __str__(self):
        return ''.join(self.buffer)
    def save(self):
        # Use self.fn: a bare fn is not defined inside this method
        with open(self.fn, "w") as f:
            f.write(str(self))
It guarantees that inserts will never need to resize more than 1KB of
text. As a real basis for an editor, it still sucks, but it's purely
to prove this one point.
ChrisA
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-07-30 17:13 +0100 |
| Message-ID | <mailman.5321.1375200818.3114.python-list@python.org> |
| In reply to | #51558 |
On 30/07/2013 15:38, Antoon Pardon wrote:
> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>
>> I am pretty sure that once you have typed your 127504 ascii
>> characters, you are very happy the buffer of your editor does not
>> waste time in reencoding the buffer as soon as you enter an €, the
>> 127505th char. Sorry, I wanted to say z instead of euro, just to
>> show that backspacing the last char and reentering a new char
>> implies twice a reencoding.
>
> Using a single string as an editor buffer is a bad idea in python for
> the simple reason that strings are immutable.

Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.

> So adding characters would mean continuously copying the string
> buffer into a new string with the next character added. Copying
> 127504 characters into a new string will not make that much of a
> difference whether the octets are just copied to octets or are
> unpacked into 32 bit words.
>
>> Somebody wrote "FSR" is just an optimization. Yes, but in case of
>> an editor à la FSR, this optimization take place everytime you
>> enter a char. Your poor editor, in fact the FSR, is finally
>> spending its time in optimizing and finally it optimizes nothing.
>> (It is even worse).
>
> Even if you would do it this way, it would *not* take place every
> time you enter a char. Once your buffer would contain a wide
> character, it would just need to convert the single character that is
> added after each keystroke. It would not need to convert the whole
> buffer after each key stroke.
>
>> If you type correctly a z instead of an €, it is not necessary to
>> reencode the buffer. Problem, you do you know that you do not have
>> to reencode? simple just check it, and by just checking it wastes
>> time to test it you have to optimized or not and hurt a little bit
>> more what is supposed to be an optimization.
>
> Your scenario is totally unrealistic. First of all because of the
> immutable nature of python strings, second because you suggest that
> real time usage would result in frequent conversions which is highly
> unlikely.
>
What you would have is a list of mutable chunks. Inserting into a chunk
would be fast, and a chunk would be split if it's already full. Also,
small adjacent chunks would be joined together. Finally, a chunk could
use FSR to reduce memory usage.
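The chunk scheme described above might look roughly like this. It is a minimal sketch under stated assumptions, not MRAB's code: the class name `ChunkedBuffer`, the `max_chunk` parameter, and the merge threshold of half a chunk are all invented for illustration, and each chunk is a plain Python list of characters rather than a compact FSR-style array.

```python
class ChunkedBuffer:
    """Text buffer kept as a list of small mutable chunks (lists of characters)."""

    def __init__(self, text="", max_chunk=1024):
        self.max_chunk = max_chunk
        # Keep at least one chunk so that insert() always has a target.
        self.chunks = [list(text[i:i + max_chunk])
                       for i in range(0, len(text), max_chunk)] or [[]]

    def _locate(self, pos, for_insert):
        """Map a character position to (chunk index, offset inside that chunk)."""
        if pos < 0:
            raise IndexError("negative position")
        for idx, chunk in enumerate(self.chunks):
            # Inserting may target the slot just past the end of a chunk.
            limit = len(chunk) + 1 if for_insert else len(chunk)
            if pos < limit:
                return idx, pos
            pos -= len(chunk)
        raise IndexError("position out of range")

    def insert(self, pos, char):
        idx, off = self._locate(pos, for_insert=True)
        chunk = self.chunks[idx]
        chunk.insert(off, char)              # only this chunk's items move
        if len(chunk) > self.max_chunk:      # chunk is over-full: split it
            half = len(chunk) // 2
            self.chunks[idx:idx + 1] = [chunk[:half], chunk[half:]]

    def delete(self, pos):
        idx, off = self._locate(pos, for_insert=False)
        del self.chunks[idx][off]
        if not self.chunks[idx] and len(self.chunks) > 1:
            del self.chunks[idx]             # drop a chunk that became empty
        elif idx + 1 < len(self.chunks):
            # Join small adjacent chunks so fragments do not accumulate.
            nxt = self.chunks[idx + 1]
            if len(self.chunks[idx]) + len(nxt) <= self.max_chunk // 2:
                self.chunks[idx:idx + 2] = [self.chunks[idx] + nxt]

    def __str__(self):
        return "".join("".join(chunk) for chunk in self.chunks)


buf = ChunkedBuffer("hello world", max_chunk=4)
buf.insert(5, ",")
print(str(buf), [len(c) for c in buf.chunks])
```

A tiny `max_chunk` is used in the usage lines only to make splits visible; locating a chunk is still a linear walk over the chunk list here, which a real editor would likely replace with a tree or a cached position.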
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-30 18:39 +0200 |
| Message-ID | <mailman.5323.1375202438.3114.python-list@python.org> |
| In reply to | #51558 |
On 30-07-13 18:13, MRAB wrote:
> On 30/07/2013 15:38, Antoon Pardon wrote:
>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>
>>> I am pretty sure that once you have typed your 127504 ascii
>>> characters, you are very happy the buffer of your editor does not
>>> waste time in reencoding the buffer as soon as you enter an €, the
>>> 127505th char. Sorry, I wanted to say z instead of euro, just to
>>> show that backspacing the last char and reentering a new char
>>> implies twice a reencoding.
>>
>> Using a single string as an editor buffer is a bad idea in python for
>> the simple reason that strings are immutable.
>
> Using a single string as an editor buffer is a bad idea in _any_
> language because an insertion would require all the following
> characters to be moved.

Not if you use a gap buffer.

--
Antoon Pardon.
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2013-07-30 18:14 +0100 |
| Message-ID | <mailman.5328.1375204498.3114.python-list@python.org> |
| In reply to | #51558 |
On 30/07/2013 17:39, Antoon Pardon wrote:
> On 30-07-13 18:13, MRAB wrote:
>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>
>>>> I am pretty sure that once you have typed your 127504 ascii
>>>> characters, you are very happy the buffer of your editor does not
>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>> 127505th char. Sorry, I wanted to say z instead of euro, just to
>>>> show that backspacing the last char and reentering a new char
>>>> implies twice a reencoding.
>>>
>>> Using a single string as an editor buffer is a bad idea in python for
>>> the simple reason that strings are immutable.
>>
>> Using a single string as an editor buffer is a bad idea in _any_
>> language because an insertion would require all the following
>> characters to be moved.
>
> Not if you use a gap buffer.
>
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2013-07-31 13:09 +1000 |
| Message-ID | <OaWdnR_YnbQuHWXMnZ2dnUVZ_vWdnZ2d@westnet.com.au> |
| In reply to | #51586 |
MRAB:
> The disadvantage there is that when you move the cursor you must move
> characters around. For example, what if the cursor was at the start and
> you wanted to move it to the end? Also, when the gap has been filled,
> you need to make a new one.
The normal technique is to only move the gap when text is added or
removed, not when the cursor moves. Code that reads the contents, such
as for display, handles the gap by checking the requested position and
using a different offset when the position is after the gap.
Gap buffers work well because changes are generally close to the
previous change, so require moving only a relatively small amount of
text. Even an occasional move of the whole contents won't cause too much
trouble for interactivity with current processors moving multiple
megabytes per millisecond.
Neil
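The technique described in this post can be sketched as below. It is an illustration under assumptions, not Neil's code or any real editor's implementation: `GapBuffer` and its `gap_size` parameter are hypothetical names, and the storage is a plain Python list. Reads never move the gap; they add the gap length to positions at or after it. Only `insert` and `delete` move the gap, and only by the distance from the previous edit.

```python
class GapBuffer:
    """Text stored in one list with a movable gap of unused slots."""

    def __init__(self, text="", gap_size=64):
        self.gap_size = gap_size                 # must be at least 1
        self.data = list(text) + [None] * gap_size
        self.gap_start = len(text)               # gap is data[gap_start:gap_end]
        self.gap_end = len(self.data)

    def __len__(self):
        return len(self.data) - (self.gap_end - self.gap_start)

    def __getitem__(self, pos):
        """Read without moving the gap: positions after it use a shifted offset."""
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        if pos >= self.gap_start:
            pos += self.gap_end - self.gap_start
        return self.data[pos]

    def _move_gap(self, pos):
        """Slide the gap so that it starts at text position `pos`."""
        if pos < self.gap_start:                 # move text from before the gap to after it
            n = self.gap_start - pos
            self.data[self.gap_end - n:self.gap_end] = self.data[pos:self.gap_start]
            self.gap_start, self.gap_end = pos, self.gap_end - n
        elif pos > self.gap_start:               # move text from after the gap to before it
            n = pos - self.gap_start
            self.data[self.gap_start:pos] = self.data[self.gap_end:self.gap_end + n]
            self.gap_start, self.gap_end = pos, self.gap_end + n

    def insert(self, pos, char):
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        if self.gap_start == self.gap_end:       # gap has been filled: open a new one
            self.data[self.gap_start:self.gap_start] = [None] * self.gap_size
            self.gap_end += self.gap_size
        self._move_gap(pos)
        self.data[self.gap_start] = char
        self.gap_start += 1

    def delete(self, pos):
        """Delete the character at `pos` by growing the gap over it."""
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        self.gap_end += 1

    def __str__(self):
        return "".join(self.data[:self.gap_start] + self.data[self.gap_end:])


buf = GapBuffer("hello world", gap_size=2)
buf.insert(5, ",")
buf.insert(12, "!")
print(str(buf), len(buf), buf[0])
```

Consecutive edits at neighbouring positions move only a character or two, which is the locality argument made above; a jump from one end of the text to the other moves everything once.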
| From | Tim Delaney <timothy.c.delaney@gmail.com> |
|---|---|
| Date | 2013-07-31 03:27 +1000 |
| Message-ID | <mailman.5329.1375205232.3114.python-list@python.org> |
| In reply to | #51558 |
On 31 July 2013 00:01, <wxjmfauth@gmail.com> wrote:
>
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 127505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.
>

And here we come to the root of your complete misunderstanding and
mischaracterisation of the FSR. You don't appear to understand that
strings in Python are immutable and that to add a character to an
existing string requires copying the entire string + new character.

In your hypothetical situation above, you have already performed 127504
copy + new character operations before you ever get to a single widening
operation. The overhead of the copy + new character repeated 127504
times dwarfs the overhead of a single widening operation.

Given your misunderstanding, it's no surprise that you are focused on
microbenchmarks that demonstrate that copying entire strings and adding
a character can be slower in some situations than others. When the only
use case you have is implementing the buffer of an editor using an
immutable string I can fully understand why you would be concerned about
the performance of adding and removing individual characters. However,
in that case *you're focused on the wrong problem*.

Until you can demonstrate an understanding that doing the above in any
language which has immutable strings is completely insane you will have
no credibility and the only interest anyone will pay to your posts is
refuting your FUD so that people new to the language are not driven off
by you.

Tim Delaney
| From | Joshua Landau <joshua@landau.ws> |
|---|---|
| Date | 2013-07-30 18:40 +0100 |
| Message-ID | <mailman.5330.1375206059.3114.python-list@python.org> |
| In reply to | #51558 |
On 30 July 2013 17:39, Antoon Pardon <antoon.pardon@rece.vub.ac.be> wrote:
> On 30-07-13 18:13, MRAB wrote:
>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>
>>>> I am pretty sure that once you have typed your 127504 ascii
>>>> characters, you are very happy the buffer of your editor does not
>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>> 127505th char. Sorry, I wanted to say z instead of euro, just to
>>>> show that backspacing the last char and reentering a new char
>>>> implies twice a reencoding.
>>>
>>> Using a single string as an editor buffer is a bad idea in python for
>>> the simple reason that strings are immutable.
>>
>> Using a single string as an editor buffer is a bad idea in _any_
>> language because an insertion would require all the following
>> characters to be moved.
>
> Not if you use a gap buffer.

Additionally, who says a language couldn't use, say, B-Trees for all of
its list-like types, including strings?
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-30 20:19 +0200 |
| Message-ID | <mailman.5334.1375208411.3114.python-list@python.org> |
| In reply to | #51558 |
On 30-07-13 19:14, MRAB wrote:
> On 30/07/2013 17:39, Antoon Pardon wrote:
>> On 30-07-13 18:13, MRAB wrote:
>>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>>
>>>>> I am pretty sure that once you have typed your 127504 ascii
>>>>> characters, you are very happy the buffer of your editor does not
>>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>>> 127505th char. Sorry, I wanted to say z instead of euro, just to
>>>>> show that backspacing the last char and reentering a new char
>>>>> implies twice a reencoding.
>>>>
>>>> Using a single string as an editor buffer is a bad idea in python for
>>>> the simple reason that strings are immutable.
>>>
>>> Using a single string as an editor buffer is a bad idea in _any_
>>> language because an insertion would require all the following
>>> characters to be moved.
>>
>> Not if you use a gap buffer.
>>
> The disadvantage there is that when you move the cursor you must move
> characters around. For example, what if the cursor was at the start and
> you wanted to move it to the end? Also, when the gap has been filled,
> you need to make a new one.

So? Why are you making this a point of discussion? I was not aware that
the pros and cons of various editor buffer implementations were relevant
to the point I was trying to make.

If you prefer another data structure in the editor you are working on,
I will not dissuade you.

--
Antoon Pardon
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-30 12:09 -0700 |
| Message-ID | <39155ddf-437c-459e-ad7c-dd841810a592@googlegroups.com> |
| In reply to | #51594 |
Mutable, immutable, copying + xxx, buffering, O(n) ....
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (e.g.).
>>> timeit.timeit("'a'*10000 +'€'")
7.087220684719967
>>> timeit.timeit("'a'*10000 +'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*10000; z = z +'€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*10000; z = z +'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*10000; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*10000; z += '€'")
7.160483334521416
And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-30 21:04 +0100 |
| Message-ID | <mailman.5339.1375214662.3114.python-list@python.org> |
| In reply to | #51598 |
On Tue, Jul 30, 2013 at 8:09 PM, <wxjmfauth@gmail.com> wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.
> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (e.g.).
But probably not on the entire document. With even a brainless scheme
like I posted code for, no more than 1024 bytes will need to be
recoded at a time (except in some odd edge cases, and even then, no
more than once for any given file).
> And do not forget, in a pure utf coding scheme, your
> char or a char will *never* be larger than 4 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('\U000101000')
> 48
Yeah, you have a few odd issues like, oh, I dunno, GC overhead,
reference count, object class, and string length, all stored somewhere
there. Honestly jmf, if you want raw assembly you know where to get
it.
ChrisA
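The code Chris refers to is not part of this page, so what follows is not his scheme. It is a separate, minimal sketch of the same general idea, assuming the buffer is held as a list of short `str` objects: inserting a wide character then rebuilds only the one chunk it lands in. The class name `ChunkedBuffer`, its methods and the constant `CHUNK` are hypothetical names chosen for this example; 1024 simply echoes the figure in the post.

```python
import sys

CHUNK = 1024  # maximum characters per chunk (assumed value, echoing the post)


class ChunkedBuffer:
    """Text stored as a list of short str objects instead of one big str."""

    def __init__(self, text=""):
        self._chunks = [text[i:i + CHUNK]
                        for i in range(0, len(text), CHUNK)] or [""]

    def insert(self, pos, s):
        """Insert s before character index pos, rebuilding only one chunk."""
        for i, chunk in enumerate(self._chunks):
            if pos <= len(chunk):
                merged = chunk[:pos] + s + chunk[pos:]
                # Re-split so that no chunk grows past CHUNK characters.
                self._chunks[i:i + 1] = [merged[j:j + CHUNK]
                                         for j in range(0, len(merged), CHUNK)] or [""]
                return
            pos -= len(chunk)
        raise IndexError("position past end of buffer")

    def text(self):
        return "".join(self._chunks)

    def chunk_sizes(self):
        """Size in bytes of each chunk object, as reported by sys.getsizeof."""
        return [sys.getsizeof(c) for c in self._chunks]


buf = ChunkedBuffer("a" * 3000)
before = buf.chunk_sizes()
buf.insert(3000, "€")          # the non-ASCII character lands in the last chunk
after = buf.chunk_sizes()
print(before)
print(after)                   # only the final entry should differ
```

The untouched chunks are the very same objects before and after the insert, so whatever widening the interpreter performs is confined to the final chunk.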
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-07-30 21:54 -0600 |
| Message-ID | <mailman.5349.1375242848.3114.python-list@python.org> |
| In reply to | #51598 |
On 07/30/2013 01:09 PM, wxjmfauth@gmail.com wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.
> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (e.g.).

So what major python project are you working on where you've found FSR
in general to be a problem? Maybe we can help you work out a more
appropriate data structure and algorithm to use.

But if you're not developing something, and not developing in Python,
perhaps you should withdraw and let us use our horrible FSR in peace,
because it doesn't seem to bother the vast majority of python
programmers, and does not bother some large python projects out there.
In fact I think most of us welcome integrated, correct, full unicode.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-07-31 05:45 +0000 |
| Message-ID | <51f8a46e$0$30000$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #51598 |
On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:
> And do not forget, in a pure utf coding scheme, your char or a char will
> *never* be larger than 4 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('\U000101000')
> 48
Neither character above is larger than 4 bytes. You forgot to deduct the
size of the object header. Python is a high-level object-oriented
language; if you care about minimizing every possible byte, you should
use a low-level language like C. Then you can give every character 21
bits, and be happy that you don't waste even one bit.
--
Steven
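One way to "deduct the size of the object header" without knowing its size is to compare two strings that differ only in length, so the fixed overhead cancels out. The sketch below assumes CPython 3.3 or later (the flexible string representation of PEP 393); the helper name `bytes_per_char` is made up for this example. On such an interpreter it should report 1, 2 and 4 bytes per character for the three samples.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes,
    which cancels the fixed per-object header."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) / n


for ch in ("a", "€", "\U0001d11e"):
    print(repr(ch), bytes_per_char(ch))

# The literal in the quoted session has nine hex digits after \U.
# \U consumes exactly eight, so it is U+10100 followed by the digit '0':
# a two-character string, not a single character.
print(len("\U000101000"))
```

The second part also explains why the 48 in the quoted session covers two characters plus the header rather than one.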
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-31 08:17 +0100 |
| Message-ID | <mailman.5355.1375255056.3114.python-list@python.org> |
| In reply to | #51623 |
On Wed, Jul 31, 2013 at 6:45 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> if you care about minimizing every possible byte, you should
> use a low-level language like C. Then you can give every character 21
> bits, and be happy that you don't waste even one bit.

Could go better! Since not every character has been assigned, and some
are specifically banned (eg U+FFFE and U+D800-U+DFFF), you could cut
them out of your representation system and save memory!

ChrisA
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-31 13:15 -0700 |
| Message-ID | <7a4be3ec-4665-4262-9cc6-286362fe2932@googlegroups.com> |
| In reply to | #51623 |
On Wednesday, 31 July 2013 07:45:18 UTC+2, Steven D'Aprano wrote:
> On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:
>
> > And do not forget, in a pure utf coding scheme, your char or a char will
> > *never* be larger than 4 bytes.
> >
> > >>> sys.getsizeof('a')
> > 26
> > >>> sys.getsizeof('\U000101000')
> > 48
>
> Neither character above is larger than 4 bytes. You forgot to deduct the
> size of the object header. Python is a high-level object-oriented
> language, if you care about minimizing every possible byte, you should
> use a low-level language like C. Then you can give every character 21
> bits, and be happy that you don't waste even one bit.
>
> --
> Steven

... char never consumes or requires more than 4 bytes ...
jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-07-31 21:41 +0100 |
| Message-ID | <mailman.58.1375303310.1251.python-list@python.org> |
| In reply to | #51700 |
On Wed, Jul 31, 2013 at 9:15 PM, <wxjmfauth@gmail.com> wrote:
> ... char never consumes or requires more than 4 bytes ...
>

The integer 5 should be able to be stored in 3 bits.

>>> sys.getsizeof(5)
14

Clearly Python is doing something really horribly wrong here. In fact,
sys.getsizeof needs to be changed to return a float, to allow it to
more properly reflect these important facts.

ChrisA
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-31 10:11 +0200 |
| Message-ID | <mailman.0.1375258272.1251.python-list@python.org> |
| In reply to | #51598 |
On 30-07-13 21:09, wxjmfauth@gmail.com wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.
Which is a far cry from your previous claim that it happened
every time you enter a char.
This of course makes your case harder to argue, because the
impact of something that happens sometime, somewhere is
vastly less than that of something that happens every time you
enter a char.
> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (eg).
It will just start producing wrong results when someone starts
using characters that don't fit into ucs-2.
> >>> timeit.timeit("'a'*10000 +'€'")
> 7.087220684719967
> >>> timeit.timeit("'a'*10000 +'z'")
> 1.5685214234430873
> >>> timeit.timeit("z = 'a'*10000; z = z +'€'")
> 7.169538866162213
> >>> timeit.timeit("z = 'a'*10000; z = z +'z'")
> 1.5815893830557286
> >>> timeit.timeit("z = 'a'*10000; z += 'z'")
> 1.606955741596181
> >>> timeit.timeit("z = 'a'*10000; z += '€'")
> 7.160483334521416
>
> And do not forget, in a pure utf coding scheme, your
> char or a char will *never* be larger than 4 bytes.
>
> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('\U000101000')
> 48
Nonsense.
>>> sys.getsizeof('a'.encode('utf-8'))
18
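The remark above about a fixed "ucs-2" producing wrong results can be made concrete. The sketch below uses only the standard library and does not model any particular interpreter's internals: it shows that a code point beyond U+FFFF cannot be packed into one 16-bit slot, and that UTF-16 represents it with two 16-bit units (a surrogate pair) while Python 3.3+ still counts it as one character. The variable names are chosen for this example only.

```python
import struct

clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

# A fixed 16-bit slot cannot hold this code point at all.
try:
    struct.pack(">H", ord(clef))
    fits_in_16_bits = True
except struct.error:
    fits_in_16_bits = False

# UTF-16 copes by spending two 16-bit code units (a surrogate pair).
utf16_units = len(clef.encode("utf-16-be")) // 2

print(fits_in_16_bits, utf16_units, len(clef))
```

An implementation that indexed by 16-bit unit would report a length of 2 for this one-character string, which is the kind of wrong result being alluded to.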
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-07-31 01:32 -0700 |
| Message-ID | <797da2f0-5f62-43b9-ab4d-c5eb8d6c64a2@googlegroups.com> |
| In reply to | #51630 |
FSR:
===
The 'a' in 'a€' and 'a\U0001d11e':
>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']
Has to be done.
>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27
Unicode/utf*
============
i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32
Two advantages in the light of the above example:
iii) The "a" never has to be reencoded.
iv) An "a" size never exceeds 4 bytes.
Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.
jmf
| From | Antoon Pardon <antoon.pardon@rece.vub.ac.be> |
|---|---|
| Date | 2013-07-31 10:59 +0200 |
| Message-ID | <mailman.2.1375261163.1251.python-list@python.org> |
| In reply to | #51632 |
On 31-07-13 10:32, wxjmfauth@gmail.com wrote:
> Unicode/utf*
> ============
>
> i) ("primary key") Create and use a unique set of encoded
> code points.
FSR does this.
>>> st1 = 'a€'
>>> st2 = 'aa'
>>> ord(st1[0])
97
>>> ord(st2[0])
97
>>>
> ii) ("secondary key") Depending on the wish,
> memory/performance: utf-8/16/32
Whose wish? I don't know any language that allows the
programmer to choose the internal representation of its
strings. If it is the designer's choice, FSR does this;
if it is the programmer's choice, I don't see why
this is necessary for compliance.
> Two advantages in the light of the above example:
> iii) The "a" never has to be reencoded.
FSR: check. Using a container with wider slots is not a reëncoding.
If such widening is encoding then your 'choice' between utf-8/16/32
implies that it will also have to reencode when it changes from
utf-8 to utf-16 or utf-32.
> iv) An "a" size never exceeds 4 bytes.
FSR: check.
> Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
> Is it possible? ;-) The solution is in the problem.
Maybe you should use bytes or bytearrays if that is really what you want.
--
Antoon Pardon
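The distinction drawn in this reply, between the code point of a character and the width of the slot it is stored in, can be checked with a few lines. This is a minimal sketch using the same sample strings as the thread plus one assumed extra (`st3`); it inspects encodings produced by `str.encode`, not the interpreter's internal buffer.

```python
st1 = "a€"
st2 = "aa"
st3 = "a\U0001d11e"

# The code point of 'a' is the same whatever its neighbours are.
code_points = [ord(s[0]) for s in (st1, st2, st3)]

# The encoded size of that same 'a' depends only on the encoding form chosen.
a_sizes = {enc: len("a".encode(enc))
           for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# In the wider forms the value is the same number, padded with zero bytes.
padded_only = "a".encode("utf-32-be") == b"\x00\x00\x00" + "a".encode("utf-8")

print(code_points)
print(a_sizes)
print(padded_only)
```

Points i), iii) and iv) of the list under discussion all concern the first result; only the storage width in the second result ever varies.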