


RE Module Performance

Started by: Devyn Collier Johnson <devyncjohnson@gmail.com>
First post: 2013-07-11 19:44 -0400
Last post: 2013-07-18 13:17 -0700
Articles: 20 on this page of 136 (25 participants)



Contents

  RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-11 19:44 -0400
    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-12 02:23 -0700
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:27 +1000
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 10:39 +0100
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-12 19:40 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 06:45 -0400
      Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-12 16:59 +0100
      Re: RE Module Performance Peter Otten <__peter__@web.de> - 2013-07-12 18:15 +0200
      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-13 02:21 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-12 13:58 -0400
        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-13 05:37 +0000
          Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-14 11:17 -0700
            Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 06:06 -0400
              Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-15 12:36 +0000
                Dihedral Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-15 08:52 -0400
                Re: Dihedral Joel Goldstick <joel.goldstick@gmail.com> - 2013-07-15 09:03 -0400
                Re: Dihedral Wayne Werner <wayne@waynewerner.com> - 2013-07-15 17:43 -0500
                Re: Dihedral Fábio Santos <fabiosantosart@gmail.com> - 2013-07-15 23:54 +0100
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-16 08:59 +1000
                Re: Dihedral Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-16 16:06 +1000
                Re: Dihedral Stefan Behnel <stefan_ml@behnel.de> - 2013-07-24 20:08 +0200
                Re: Dihedral Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:23 +1000
                Re: Dihedral Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-24 20:15 -0400
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-13 08:16 +1000
      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-12 17:13 -0600
        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-24 06:40 -0700
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-24 23:48 +1000
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:17 -0400
          Re: RE Module Performance David Hutto <dwightdhutto@gmail.com> - 2013-07-24 10:19 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:34 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:02 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:39 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 08:47 -0600
            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 02:27 -0700
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:14 +1000
                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-25 12:07 -0700
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 05:18 +1000
                  RE: RE Module Performance "Prasad, Ramit" <ramit.prasad@jpmorgan.com> - 2013-07-25 19:30 +0000
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:06 -0600
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 09:00 -0600
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 05:56 +0000
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 00:56 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 13:52 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 04:15 +1000
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 07:15 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 17:58 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 09:22 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 20:07 +1000
          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-24 18:09 -0400
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 08:19 +1000
          Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-24 16:59 -0600
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 09:24 +1000
          Re: RE Module Performance Serhiy Storchaka <storchaka@gmail.com> - 2013-07-25 08:49 +0300
          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-25 15:58 +1000
          Re: RE Module Performance Jeremy Sanders <jeremy@jeremysanders.net> - 2013-07-25 14:36 +0100
            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 15:26 +0000
              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 01:36 +1000
                Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-25 17:18 +0000
                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-26 03:27 +1000
                  Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:45 -0500
                    Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-26 02:48 +0000
                      Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 21:20 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:36 -0700
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 08:46 -0700
                          Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 06:28 +0000
                        Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 03:37 +0000
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-26 22:12 -0600
                            Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-27 05:04 +0000
                          Re: RE Module Performance Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-07-27 12:13 -0400
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:19 -0700
                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-25 21:09 -0600
                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-26 06:21 -0700
                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-26 20:05 -0600
                        Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-27 11:21 -0700
                          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-27 21:53 -0600
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 11:13 -0700
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:04 +0100
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:30 -0700
                                  Re: RE Module Performance Lele Gaifax <lele@metapensiero.it> - 2013-07-28 22:45 +0200
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 22:01 +0200
                            Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 07:01 -0700
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 16:38 +0200
                              Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 15:45 +0100
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 17:13 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 18:39 +0200
                              Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-30 18:14 +0100
                                Re: RE Module Performance Neil Hodgson <nhodgson@iinet.net.au> - 2013-07-31 13:09 +1000
                              Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-31 03:27 +1000
                              Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-30 18:40 +0100
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-30 20:19 +0200
                                Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-30 12:09 -0700
                                  Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-30 21:04 +0100
                                  Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:54 -0600
                                  Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-31 05:45 +0000
                                    Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 08:17 +0100
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 13:15 -0700
                                      Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-31 21:41 +0100
                                  Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:11 +0200
                                    Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-31 01:32 -0700
                                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 10:59 +0200
                                      Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:44 -0600
                              Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-30 17:05 -0400
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-30 21:30 -0600
                              Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-31 09:23 +0200
                              Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-31 08:27 -0600
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 10:45 +0200
                          FSR and unicode compliance - was Re: RE Module Performance Michael Torrie <torriem@gmail.com> - 2013-07-28 09:52 -0600
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-28 12:23 -0700
                              Re: FSR and unicode compliance - was Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-28 20:44 +0100
                              Re: FSR and unicode compliance - was Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 21:55 +0200
                              Re: FSR and unicode compliance - was Re: RE Module Performance Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-07-28 20:52 +0000
                                Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 04:43 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 12:57 +0100
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 05:56 -0700
                                    Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 07:20 -0700
                                      Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 15:49 +0100
                                        Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 09:31 -0700
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Heiko Wundram <modelnine@modelnine.org> - 2013-07-29 14:06 +0200
                                  Re: FSR and unicode compliance - was Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-29 08:43 -0400
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 18:03 +0100
                          Re: FSR and unicode compliance - was Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 13:36 -0400
                            Re: FSR and unicode compliance - was Re: RE Module Performance wxjmfauth@gmail.com - 2013-07-29 06:36 -0700
                          Re: FSR and unicode compliance - was Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:03 +0100
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 19:19 +0100
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-28 19:29 +0100
                          Re: RE Module Performance Terry Reedy <tjreedy@udel.edu> - 2013-07-28 15:06 -0400
                          Re: RE Module Performance Joshua Landau <joshua@landau.ws> - 2013-07-28 23:14 +0100
                          Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-28 20:51 +0200
                          Re: RE Module Performance Chris Angelico <rosuav@gmail.com> - 2013-07-29 00:07 +0100
                      Re: RE Module Performance Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-07-26 22:38 +0200
          Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-25 09:44 -0400
          Re: RE Module Performance Ian Kelly <ian.g.kelly@gmail.com> - 2013-07-25 15:53 -0500
      Re: RE Module Performance MRAB <python@mrabarnett.plus.com> - 2013-07-13 00:16 +0100
      Re: RE Module Performance Tim Delaney <timothy.c.delaney@gmail.com> - 2013-07-14 05:34 +1000
      Re: RE Module Performance Devyn Collier Johnson <devyncjohnson@gmail.com> - 2013-07-16 06:30 -0400
        Re: RE Module Performance 88888 Dihedral <dihedral88888@gmail.com> - 2013-07-18 13:17 -0700



#51558

From: wxjmfauth@gmail.com
Date: 2013-07-30 07:01 -0700
Message-ID: <43ce1b65-9d6d-47dd-b209-9a3bbafc0b8c@googlegroups.com>
In reply to: #51380
On Sunday, 28 July 2013 05:53:22 UTC+2, Ian wrote:
> On Sat, Jul 27, 2013 at 12:21 PM,  <wxjmfauth@gmail.com> wrote:
> > Back to utf. utfs are not only elements of a unique set of encoded
> > code points. They have an interesting feature. Each "utf chunk"
> > holds intrinsically the character (in fact the code point) it is
> > supposed to represent. In utf-32, the obvious case, it is just
> > the code point. In utf-8, that's the first chunk which helps and
> > utf-16 is a mixed case (utf-8 / utf-32). In other words, in an
> > implementation using bytes, for any pointer position it is always
> > possible to find the corresponding encoded code point and from this
> > the corresponding character without any "programmed" information. See
> > my editor example, how to find the char under the caret? In fact,
> > a silly example, how can the caret be positioned or moved, if
> > the underlying corresponding encoded code point can not be
> > discerned!
>
> Yes, given a pointer location into a utf-8 or utf-16 string, it is
> easy to determine the identity of the code point at that location.
> But this is not often a useful operation, save for resynchronization
> in the case that the string data is corrupted.  The caret of an editor
> does not conceptually correspond to a pointer location, but to a
> character index.  Given a particular character index (e.g. 127504), an
> editor must be able to determine the identity and/or the memory
> location of the character at that index, and for UTF-8 and UTF-16
> without an auxiliary data structure that is an O(n) operation.
>
------

This is the same conceptual mistake as in Steven's example with
its buffers: the buffer does not know it holds characters.
That is not the point under discussion.

-----

I am pretty sure that once you have typed your 127504
ASCII characters, you are very happy that the buffer of your
editor does not waste time reencoding itself as
soon as you enter an €, the 127505th char. Sorry, I meant
to type z instead of the euro sign; this just shows that backspacing
the last char and entering a new one implies reencoding twice.

Somebody wrote that the "FSR" is just an optimization. Yes, but in
the case of an editor à la FSR, this optimization takes place every
time you enter a char. Your poor editor, in fact the FSR, ends up
spending its time optimizing and in the end it optimizes nothing.
(It is even worse.)

If you correctly type a z instead of an €, it is not necessary
to reencode the buffer. The problem: how do you know that you do
not have to reencode? Simple, just check. But the check itself
wastes time testing whether you have to optimize or not, and it
hurts a little bit more what is supposed to be an optimization.

Do not confuse the process of optimization with the result of
optimization (funny, it's like the utf's).

There is a trick to let the editor know whether it has
to be "optimized". Just put some flag somewhere. Then
you fall into the "Houston" syndrome: Houston, we have a
problem, our buffer consumes many more bytes than expected.

>>> sys.getsizeof('€')
40
>>> sys.getsizeof('a')
26

Now the good news. In an editor à la FSR, the
"composition" is not so important. You know,
"practicality beats purity". The hard job
is the text rendering engine and the handling
of the font (even in a raw unicode editor).
And as these tools are luckily not working à la FSR
(probably because they understand the coding
of the characters), your editor is still working
not so badly.

jmf
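
Editorial illustration (not part of the original post): the two sys.getsizeof figures above mix a fixed per-object header with the per-character storage. The following is a minimal sketch, assuming CPython 3.3 or later (PEP 393 strings), that separates the two by differencing the sizes of two strings of the same kind. The helper name bytes_per_char is made up for this illustration, and absolute sizes depend on the build (the 26 and 40 above look like a 32-bit build).

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    The fixed object header cancels out, leaving only the per-character cost.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

widths = {
    "ascii": bytes_per_char("a"),
    "bmp": bytes_per_char("\u20ac"),         # EURO SIGN
    "astral": bytes_per_char("\U0001F600"),  # a code point outside the BMP
}
for name, width in widths.items():
    print(name, width)

# A single wide character makes the whole string use the wider storage.
narrow = "a" * 1001
mixed = "a" * 1000 + "\u20ac"
print(sys.getsizeof(narrow), sys.getsizeof(mixed))
```

On CPython 3.3+ the three widths are expected to come out as 1, 2 and 4 bytes per character, so the difference between '€' and 'a' as one-character strings is mostly header overhead rather than a cost that scales with buffer length.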



#51562

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-30 16:38 +0200
Message-ID: <mailman.5311.1375195157.3114.python-list@python.org>
In reply to: #51558
On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
> 
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 125505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.

Using a single string as an editor buffer is a bad idea in Python
for the simple reason that strings are immutable. So adding
characters would mean continuously copying the string buffer
into a new string with the next character added. Copying
127504 characters into a new string will not make that much
of a difference whether the octets are just copied to octets
or are unpacked into 32 bit words.

> Somebody wrote "FSR" is just an optimization. Yes, but in case
> of an editor à la FSR, this optimization take place everytime you
> enter a char. Your poor editor, in fact the FSR, is finally
> spending its time in optimizing and finally it optimizes nothing.
> (It is even worse).

Even if you did it this way, it would *not* take place
every time you enter a char. Once your buffer contained
a wide character, it would just need to convert the single
character that is added after each keystroke. It would not
need to convert the whole buffer after each keystroke.

> If you type correctly a z instead of an €, it is not necessary
> to reencode the buffer. Problem, you do you know that you do
> not have to reencode? simple just check it, and by just checking
> it wastes time to test it you have to optimized or not and hurt
> a little bit more what is supposed to be an optimization.

Your scenario is totally unrealistic: first because of
the immutable nature of Python strings, and second because you
suggest that real-time usage would result in frequent conversions,
which is highly unlikely.

-- 
Antoon Pardon



#51563

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-30 15:45 +0100
Message-ID: <mailman.5312.1375195560.3114.python-list@python.org>
In reply to: #51558
On Tue, Jul 30, 2013 at 3:01 PM,  <wxjmfauth@gmail.com> wrote:
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 125505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.

You're still thinking that the editor's buffer is a Python string. As
I've shown earlier, this is a really bad idea, and that has nothing to
do with FSR/PEP 393. An immutable string is *horribly* inefficient at
this; if you want to keep concatenating onto a string, the recommended
method is a list of strings that gets join()d at the end, and the same
technique works well here. Here's a little demo class that could make
the basis for such a system:

class EditorBuffer:
    def __init__(self, fn):
        self.fn = fn
        # Start with the whole file as a single chunk
        with open(fn) as f:
            self.buffer = [f.read()]

    def insert(self, pos, char):
        if pos == 0:
            # Special case: insertion at beginning of buffer
            if len(self.buffer[0]) > 1024:
                self.buffer.insert(0, char)
            else:
                self.buffer[0] = char + self.buffer[0]
            return
        for idx, part in enumerate(self.buffer):
            l = len(part)
            if pos > l:
                # Cursor is beyond this string; try the next one
                pos -= l
                continue
            if pos < l:
                # Cursor is somewhere inside this string
                self.buffer[idx:idx+1] = part[:pos], part[pos:]
                l = pos
            # Cursor is now at the end of this string
            if l > 1024:
                self.buffer[idx:idx+1] = self.buffer[idx], char
            else:
                self.buffer[idx] += char
            return
        raise ValueError("Cannot insert past end of buffer")

    def __str__(self):
        return ''.join(self.buffer)

    def save(self):
        # Write back to the file the buffer was loaded from
        with open(self.fn, "w") as f:
            f.write(str(self))

It guarantees that inserts will never need to resize more than 1KB of
text. As a real basis for an editor, it still sucks, but it's purely
to prove this one point.

ChrisA



#51578

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-07-30 17:13 +0100
Message-ID: <mailman.5321.1375200818.3114.python-list@python.org>
In reply to: #51558
On 30/07/2013 15:38, Antoon Pardon wrote:
> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>
>> I am pretty sure that once you have typed your 127504 ascii
>> characters, you are very happy the buffer of your editor does not
>> waste time in reencoding the buffer as soon as you enter an €, the
>> 125505th char. Sorry, I wanted to say z instead of euro, just to
>> show that backspacing the last char and reentering a new char
>> implies twice a reencoding.
>
> Using a single string as an editor buffer is a bad idea in python for
> the simple reason that strings are immutable.

Using a single string as an editor buffer is a bad idea in _any_
language because an insertion would require all the following
characters to be moved.

> So adding characters would mean continuously copying the string
> buffer into a new string with the next character added. Copying
> 127504 characters into a new string will not make that much of a
> difference whether the octets are just copied to octets or are
> unpacked into 32 bit words.
>
>> Somebody wrote "FSR" is just an optimization. Yes, but in case of
>> an editor à la FSR, this optimization take place everytime you
>> enter a char. Your poor editor, in fact the FSR, is finally
>> spending its time in optimizing and finally it optimizes nothing.
>> (It is even worse).
>
> Even if you would do it this way, it would *not* take place every
> time you enter a char. Once your buffer would contain a wide
> character, it would just need to convert the single character that is
> added after each keystroke. It would not need to convert the whole
> buffer after each key stroke.
>
>> If you type correctly a z instead of an €, it is not necessary to
>> reencode the buffer. Problem, you do you know that you do not have
>> to reencode? simple just check it, and by just checking it wastes
>> time to test it you have to optimized or not and hurt a little bit
>> more what is supposed to be an optimization.
>
> Your scenario is totally unrealistic. First of all because of the
> immutable nature of python strings, second because you suggest that
> real time usage would result in frequent conversions which is highly
> unlikely.
>
What you would have is a list of mutable chunks.

Inserting into a chunk would be fast, and a chunk would be split if
it's already full. Also, small adjacent chunks would be joined together.

Finally, a chunk could use FSR to reduce memory usage.



#51581

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-30 18:39 +0200
Message-ID: <mailman.5323.1375202438.3114.python-list@python.org>
In reply to: #51558
On 30-07-13 18:13, MRAB wrote:
> On 30/07/2013 15:38, Antoon Pardon wrote:
>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>
>>> I am pretty sure that once you have typed your 127504 ascii
>>> characters, you are very happy the buffer of your editor does not
>>> waste time in reencoding the buffer as soon as you enter an €, the
>>> 125505th char. Sorry, I wanted to say z instead of euro, just to
>>> show that backspacing the last char and reentering a new char
>>> implies twice a reencoding.
>>
>> Using a single string as an editor buffer is a bad idea in python for
>> the simple reason that strings are immutable.
>
> Using a single string as an editor buffer is a bad idea in _any_
> language because an insertion would require all the following
> characters to be moved.

Not if you use a gap buffer.

-- 
Antoon Pardon.



#51586

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-07-30 18:14 +0100
Message-ID: <mailman.5328.1375204498.3114.python-list@python.org>
In reply to: #51558
On 30/07/2013 17:39, Antoon Pardon wrote:
> On 30-07-13 18:13, MRAB wrote:
>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>
>>>> I am pretty sure that once you have typed your 127504 ascii
>>>> characters, you are very happy the buffer of your editor does not
>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>> 125505th char. Sorry, I wanted to say z instead of euro, just to
>>>> show that backspacing the last char and reentering a new char
>>>> implies twice a reencoding.
>>>
>>> Using a single string as an editor buffer is a bad idea in python for
>>> the simple reason that strings are immutable.
>>
>> Using a single string as an editor buffer is a bad idea in _any_
>> language because an insertion would require all the following
>> characters to be moved.
>
> Not if you use a gap buffer.
>
The disadvantage there is that when you move the cursor you must move
characters around. For example, what if the cursor was at the start and
you wanted to move it to the end? Also, when the gap has been filled,
you need to make a new one.



#51615

From: Neil Hodgson <nhodgson@iinet.net.au>
Date: 2013-07-31 13:09 +1000
Message-ID: <OaWdnR_YnbQuHWXMnZ2dnUVZ_vWdnZ2d@westnet.com.au>
In reply to: #51586
MRAB:

> The disadvantage there is that when you move the cursor you must move
> characters around. For example, what if the cursor was at the start and
> you wanted to move it to the end? Also, when the gap has been filled,
> you need to make a new one.

    The normal technique is to only move the gap when text is added or 
removed, not when the cursor moves. Code that reads the contents, such 
as for display, handles the gap by checking the requested position and 
using a different offset when the position is after the gap.

    Gap buffers work well because changes are generally close to the 
previous change, so require moving only a relatively small amount of 
text. Even an occasional move of the whole contents won't cause too much 
trouble for interactivity with current processors moving multiple 
megabytes per millisecond.

    Neil
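
Editorial illustration (not part of the original post): the technique described above, where only edits move the gap and reads simply skip over it, might look roughly like this. It is a minimal sketch, not code from any real editor; the class name GapBuffer, the gap_size parameter, and the growth policy are assumptions made for the example.

```python
class GapBuffer:
    """Character buffer with a movable gap; edits at the gap are cheap."""

    def __init__(self, text="", gap_size=16):
        self.data = list(text) + [None] * gap_size
        self.gap_start = len(text)       # first free slot
        self.gap_end = len(self.data)    # one past the last free slot

    def __len__(self):
        return len(self.data) - (self.gap_end - self.gap_start)

    def __getitem__(self, pos):
        # Reads never move the gap: positions after it use a shifted offset.
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        if pos >= self.gap_start:
            pos += self.gap_end - self.gap_start
        return self.data[pos]

    def _move_gap(self, pos):
        """Slide the gap so that it starts at text position pos."""
        while self.gap_start > pos:      # gap moves left, chars hop right
            self.gap_start -= 1
            self.gap_end -= 1
            self.data[self.gap_end] = self.data[self.gap_start]
        while self.gap_start < pos:      # gap moves right, chars hop left
            self.data[self.gap_start] = self.data[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, pos, char):
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        if self.gap_start == self.gap_end:
            # The gap has been filled: open a new one at the same place.
            extra = max(16, len(self.data))
            self.data[self.gap_start:self.gap_start] = [None] * extra
            self.gap_end += extra
        self._move_gap(pos)
        self.data[self.gap_start] = char
        self.gap_start += 1

    def delete(self, pos):
        """Delete the character at pos by absorbing it into the gap."""
        if not 0 <= pos < len(self):
            raise IndexError(pos)
        self._move_gap(pos)
        self.gap_end += 1

    def __str__(self):
        return "".join(self.data[:self.gap_start] + self.data[self.gap_end:])


gb = GapBuffer("hello world", gap_size=2)
gb.insert(5, ",")
print(gb, gb[7])
```

Consecutive edits at nearby positions move only a few characters, which matches the locality argument above; a jump from one end of the text to the other costs one full move at the next edit, not at the cursor movement itself.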



#51587

From: Tim Delaney <timothy.c.delaney@gmail.com>
Date: 2013-07-31 03:27 +1000
Message-ID: <mailman.5329.1375205232.3114.python-list@python.org>
In reply to: #51558


On 31 July 2013 00:01, <wxjmfauth@gmail.com> wrote:

>
> I am pretty sure that once you have typed your 127504
> ascii characters, you are very happy the buffer of your
> editor does not waste time in reencoding the buffer as
> soon as you enter an €, the 125505th char. Sorry, I wanted
> to say z instead of euro, just to show that backspacing the
> last char and reentering a new char implies twice a reencoding.
>

And here we come to the root of your complete misunderstanding and
mischaracterisation of the FSR. You don't appear to understand that
strings in Python are immutable and that to add a character to an
existing string requires copying the entire string + new character. In
your hypothetical situation above, you have already performed 127504
copy + new character operations before you ever get to a single widening
operation. The overhead of the copy + new character repeated 127504
times dwarfs the overhead of a single widening operation.

Given your misunderstanding, it's no surprise that you are focused on
microbenchmarks that demonstrate that copying entire strings and adding
a character can be slower in some situations than others. When the only
use case you have is implementing the buffer of an editor using an
immutable string I can fully understand why you would be concerned about
the performance of adding and removing individual characters. However,
in that case *you're focused on the wrong problem*.

Until you can demonstrate an understanding that doing the above in any
language which has immutable strings is completely insane you will have
no credibility and the only interest anyone will pay to your posts is
refuting your FUD so that people new to the language are not driven off
by you.

Tim Delaney
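
Editorial illustration (not part of the original post): the "copy + new character" argument above can be put into numbers with a simple worst-case model. The sketch assumes that every append allocates a fresh string and copies all existing characters; CPython can sometimes resize a string in place when nothing else references it, so real costs may be lower. The function names are made up for the example.

```python
def chars_copied_by_appends(n):
    """Characters copied while growing a string one char at a time to
    length n, assuming every append copies the whole string."""
    return n * (n + 1) // 2


def simulate_appends(n):
    """Cross-check the formula by actually building the string."""
    copied = 0
    s = ""
    for _ in range(n):
        s = s + "a"        # model: allocate len(s) + 1 chars, copy everything
        copied += len(s)
    return copied


n = 127504
append_cost = chars_copied_by_appends(n)
widen_cost = n + 1         # one pass over the buffer plus the new wide char
print(append_cost, widen_cost, append_cost // widen_cost)
# prints: 8128698760 127505 63752
```

Under this model the appends that came before the first wide character cost about 63752 times as much copying as the single widening step.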



#51588

From: Joshua Landau <joshua@landau.ws>
Date: 2013-07-30 18:40 +0100
Message-ID: <mailman.5330.1375206059.3114.python-list@python.org>
In reply to: #51558


On 30 July 2013 17:39, Antoon Pardon <antoon.pardon@rece.vub.ac.be> wrote:

> On 30-07-13 18:13, MRAB wrote:
>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>
>>>> I am pretty sure that once you have typed your 127504 ascii
>>>> characters, you are very happy the buffer of your editor does not
>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>> 125505th char. Sorry, I wanted to say z instead of euro, just to
>>>> show that backspacing the last char and reentering a new char
>>>> implies twice a reencoding.
>>>
>>> Using a single string as an editor buffer is a bad idea in python for
>>> the simple reason that strings are immutable.
>>
>> Using a single string as an editor buffer is a bad idea in _any_
>> language because an insertion would require all the following
>> characters to be moved.
>
> Not if you use a gap buffer.


Additionally, who says a language couldn't use, say, B-Trees for all of its
list-like types, including strings?
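
For readers unfamiliar with the gap buffer mentioned in the quoted exchange, here is a minimal sketch. The class name `GapBuffer` and its methods are hypothetical, chosen for illustration; this is not code from anyone in the thread. The text lives in a list with an unused "gap" at the cursor, so inserting at the cursor is cheap and only moving the cursor shifts characters.

```python
class GapBuffer:
    """Minimal gap buffer: text is stored in a list with a gap at the cursor."""

    def __init__(self, text: str = "", gap_size: int = 16) -> None:
        self._grow_by = max(1, gap_size)
        self._buf = list(text) + [None] * self._grow_by
        self._gap_start = len(text)      # cursor position
        self._gap_end = len(self._buf)   # first index after the gap

    def __len__(self) -> int:
        return len(self._buf) - (self._gap_end - self._gap_start)

    def move_cursor(self, pos: int) -> None:
        """Move the gap to pos, shifting one character per step."""
        if not 0 <= pos <= len(self):
            raise IndexError("cursor position out of range")
        while self._gap_start > pos:     # move left: char jumps over the gap
            self._gap_start -= 1
            self._gap_end -= 1
            self._buf[self._gap_end] = self._buf[self._gap_start]
        while self._gap_start < pos:     # move right: char jumps back
            self._buf[self._gap_start] = self._buf[self._gap_end]
            self._gap_start += 1
            self._gap_end += 1

    def insert(self, ch: str) -> None:
        """Insert one character at the cursor."""
        if self._gap_start == self._gap_end:
            self._grow()
        self._buf[self._gap_start] = ch
        self._gap_start += 1

    def backspace(self) -> None:
        """Delete the character before the cursor by widening the gap."""
        if self._gap_start > 0:
            self._gap_start -= 1

    def _grow(self) -> None:
        """Open a new gap at the cursor once the old one is full."""
        tail = self._buf[self._gap_end:]
        self._buf[self._gap_start:] = [None] * self._grow_by + tail
        self._gap_end = self._gap_start + self._grow_by

    def text(self) -> str:
        return "".join(self._buf[:self._gap_start] + self._buf[self._gap_end:])


buf = GapBuffer("hello", gap_size=1)
buf.insert("!")
buf.insert("?")            # gap is full here, so a new one is created
buf.move_cursor(0)
buf.insert(">")
buf.move_cursor(len(buf))
buf.backspace()
print(buf.text())
```

The sketch also shows the two costs raised later in the thread: `move_cursor` shifts characters, and `_grow` has to open a new gap when the old one fills up.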



#51594

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-30 20:19 +0200
Message-ID: <mailman.5334.1375208411.3114.python-list@python.org>
In reply to: #51558

On 30-07-13 19:14, MRAB wrote:
> On 30/07/2013 17:39, Antoon Pardon wrote:
>> On 30-07-13 18:13, MRAB wrote:
>>> On 30/07/2013 15:38, Antoon Pardon wrote:
>>>> On 30-07-13 16:01, wxjmfauth@gmail.com wrote:
>>>>>
>>>>> I am pretty sure that once you have typed your 127504 ascii
>>>>> characters, you are very happy the buffer of your editor does not
>>>>> waste time in reencoding the buffer as soon as you enter an €, the
>>>>> 125505th char. Sorry, I wanted to say z instead of euro, just to
>>>>> show that backspacing the last char and reentering a new char
>>>>> implies twice a reencoding.
>>>>
>>>> Using a single string as an editor buffer is a bad idea in python for
>>>> the simple reason that strings are immutable.
>>>
>>> Using a single string as an editor buffer is a bad idea in _any_
>>> language because an insertion would require all the following
>>> characters to be moved.
>>
>> Not if you use a gap buffer.
>>
> The disadvantage there is that when you move the cursor you must move
> characters around. For example, what if the cursor was at the start and
> you wanted to move it to the end? Also, when the gap has been filled,
> you need to make a new one.

So? Why are you making this a point of discussion? I was not aware that
the pros and cons of various editor buffer implementations were relevant
to the point I was trying to make.

If you prefer another data structure in the editor you are working on,
I will not dissuade you.

-- 
Antoon Pardon



#51598

From: wxjmfauth@gmail.com
Date: 2013-07-30 12:09 -0700
Message-ID: <39155ddf-437c-459e-ad7c-dd841810a592@googlegroups.com>
In reply to: #51594

Mutable, immutable, copying + xxx, buffering, O(n) ....
Yes, but conceptually the reencoding happens sometime, somewhere.
The internal "ucs-2" will never automagically be transformed
into "ucs-4" (eg).

>>> import timeit
>>> timeit.timeit("'a'*10000 +'€'")
7.087220684719967
>>> timeit.timeit("'a'*10000 +'z'")
1.5685214234430873
>>> timeit.timeit("z = 'a'*10000; z = z +'€'")
7.169538866162213
>>> timeit.timeit("z = 'a'*10000; z = z +'z'")
1.5815893830557286
>>> timeit.timeit("z = 'a'*10000; z += 'z'")
1.606955741596181
>>> timeit.timeit("z = 'a'*10000; z += '€'")
7.160483334521416


And do not forget, in a pure utf coding scheme, your
char or a char will *never* be larger than 4 bytes.

>>> import sys
>>> sys.getsizeof('a')
26
>>> sys.getsizeof('\U000101000')
48


jmf



#51602

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-30 21:04 +0100
Message-ID: <mailman.5339.1375214662.3114.python-list@python.org>
In reply to: #51598

On Tue, Jul 30, 2013 at 8:09 PM,  <wxjmfauth@gmail.com> wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.
> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (eg).

But probably not on the entire document. With even a brainless scheme
like I posted code for, no more than 1024 bytes will need to be
recoded at a time (except in some odd edge cases, and even then, no
more than once for any given file).
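
The code Chris refers to is not part of this page, so the following is only a hypothetical sketch of the general idea: keep the document as a list of small strings, so that a wide character affects the representation of one chunk rather than the whole buffer. The `ChunkedText` class and the 1024-character limit are assumptions for illustration, and `str.isascii()` needs Python 3.7 or later.

```python
CHUNK = 1024  # assumed chunk size, in characters


class ChunkedText:
    """Document stored as a list of short strings (hypothetical sketch)."""

    def __init__(self) -> None:
        self.chunks = [""]

    def append(self, text: str) -> None:
        for ch in text:
            if len(self.chunks[-1]) >= CHUNK:
                self.chunks.append("")   # start a new chunk when one is full
            self.chunks[-1] += ch        # only the last chunk is rebuilt

    def __str__(self) -> str:
        return "".join(self.chunks)


doc = ChunkedText()
doc.append("a" * 127504)
doc.append("€")

# Indices of the chunks that contain at least one non-ASCII character.
wide_chunks = [i for i, c in enumerate(doc.chunks) if not c.isascii()]
print(f"{len(doc.chunks)} chunks, non-ASCII chunks: {wide_chunks}")
```

Only the final chunk holds the euro sign, so every other chunk can stay in its narrow form.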

> And do not forget, in a pure utf coding scheme, your
> char or a char will *never* be larger than 4 bytes.
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('\U000101000')
> 48

Yeah, you have a few odd issues like, oh, I dunno, GC overhead,
reference count, object class, and string length, all stored somewhere
there. Honestly jmf, if you want raw assembly you know where to get
it.

ChrisA



#51618

From: Michael Torrie <torriem@gmail.com>
Date: 2013-07-30 21:54 -0600
Message-ID: <mailman.5349.1375242848.3114.python-list@python.org>
In reply to: #51598

On 07/30/2013 01:09 PM, wxjmfauth@gmail.com wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.
> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (eg).

So what major python project are you working on where you've found FSR
in general to be a problem?  Maybe we can help you work out a more
appropriate data structure and algorithm to use.

But if you're not developing something, and not developing in Python,
perhaps you should withdraw and let us use our horrible FSR in peace,
because it doesn't seem to bother the vast majority of python
programmers, and does not bother some large python projects out there.
In fact I think most of us welcome integrated, correct, full unicode.



#51623

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-07-31 05:45 +0000
Message-ID: <51f8a46e$0$30000$c3e8da3$5496439d@news.astraweb.com>
In reply to: #51598

On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:

> And do not forget, in a pure utf coding scheme, your char or a char will
> *never* be larger than 4 bytes.
> 
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('\U000101000')
> 48

Neither character above is larger than 4 bytes. You forgot to deduct the 
size of the object header. Python is a high-level object-oriented 
language, if you care about minimizing every possible byte, you should 
use a low-level language like C. Then you can give every character 21 
bits, and be happy that you don't waste even one bit.
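
One way to take the object header out of the measurement is to compare two strings that differ by a single character. This is a sketch that assumes CPython 3.3 or later; `sys.getsizeof` is implementation-specific and the helper name is made up for illustration.

```python
import sys


def bytes_per_extra_char(ch: str, n: int = 1000) -> int:
    """Growth in sys.getsizeof when one more copy of ch is appended.

    Subtracting two sizes cancels the fixed object header, leaving only
    the per-character storage cost.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_cost = bytes_per_extra_char("a")            # ASCII / Latin-1 range
bmp_cost = bytes_per_extra_char("€")              # Basic Multilingual Plane
astral_cost = bytes_per_extra_char("\U0001d11e")  # beyond the BMP

# The \U escape takes exactly eight hex digits, so the literal used in the
# quoted session is U+10100 followed by the digit "0".
odd_literal_len = len("\U000101000")

print(ascii_cost, bmp_cost, astral_cost, odd_literal_len)
```

On such a build the per-character cost is 1, 2 or 4 bytes, and the quoted `'\U000101000'` literal is a two-character string.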


-- 
Steven



#51627

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-31 08:17 +0100
Message-ID: <mailman.5355.1375255056.3114.python-list@python.org>
In reply to: #51623

On Wed, Jul 31, 2013 at 6:45 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> if you care about minimizing every possible byte, you should
> use a low-level language like C. Then you can give every character 21
> bits, and be happy that you don't waste even one bit.

Could go better! Since not every character has been assigned, and some
are specifically banned (eg U+FFFE and U+D800-U+DFFF), you could cut
them out of your representation system and save memory!

ChrisA



#51700

From: wxjmfauth@gmail.com
Date: 2013-07-31 13:15 -0700
Message-ID: <7a4be3ec-4665-4262-9cc6-286362fe2932@googlegroups.com>
In reply to: #51623

On Wednesday, 31 July 2013 07:45:18 UTC+2, Steven D'Aprano wrote:
> On Tue, 30 Jul 2013 12:09:11 -0700, wxjmfauth wrote:
>
> > And do not forget, in a pure utf coding scheme, your char or a char will
> > *never* be larger than 4 bytes.
> >
> >>>> sys.getsizeof('a')
> > 26
> >>>> sys.getsizeof('\U000101000')
> > 48
>
> Neither character above is larger than 4 bytes. You forgot to deduct the
> size of the object header. Python is a high-level object-oriented
> language, if you care about minimizing every possible byte, you should
> use a low-level language like C. Then you can give every character 21
> bits, and be happy that you don't waste even one bit.
>
> --
> Steven

... char never consumes or requires more than 4 bytes ...

jmf



#51706

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-07-31 21:41 +0100
Message-ID: <mailman.58.1375303310.1251.python-list@python.org>
In reply to: #51700

On Wed, Jul 31, 2013 at 9:15 PM,  <wxjmfauth@gmail.com> wrote:
> ... char never consumes or requires more than 4 bytes ...
>

The integer 5 should be able to be stored in 3 bits.

>>> sys.getsizeof(5)
14

Clearly Python is doing something really horribly wrong here. In fact,
sys.getsizeof needs to be changed to return a float, to allow it to
more properly reflect these important facts.

ChrisA



#51630

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-31 10:11 +0200
Message-ID: <mailman.0.1375258272.1251.python-list@python.org>
In reply to: #51598

On 30-07-13 21:09, wxjmfauth@gmail.com wrote:
> Mutable, immutable, copying + xxx, buffering, O(n) ....
> Yes, but conceptually the reencoding happens sometime, somewhere.

Which is a far cry from your previous claim that it happened
every time you enter a char.

This of course makes your case harder to argue, because the
impact of something that happens sometime, somewhere is
vastly less than something that happens every time you enter
a char.

> The internal "ucs-2" will never automagically be transformed
> into "ucs-4" (eg).

It will just start producing wrong results when someone starts
using characters that don't fit into ucs-2.
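
A small sketch of what "don't fit into ucs-2" means in practice. It uses the UTF-16 encoder as a stand-in for a fixed 16-bit internal representation, which is an assumption made for illustration; the helper name is hypothetical.

```python
def utf16_code_units(s: str) -> int:
    """Number of 16-bit units needed to store s as UTF-16."""
    return len(s.encode("utf-16-be")) // 2


clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range

code_points = len(clef)              # what the user thinks of as characters
code_units = utf16_code_units(clef)  # what a 16-bit representation stores

raw = clef.encode("utf-16-be")
high = int.from_bytes(raw[:2], "big")  # first half of the surrogate pair
low = int.from_bytes(raw[2:], "big")   # second half of the surrogate pair

print(code_points, code_units, hex(high), hex(low))
```

Code that treats each 16-bit unit as one character would report a length of 2 here and could slice the string between the two surrogates.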


>>>> timeit.timeit("'a'*10000 +'€'")
> 7.087220684719967
>>>> timeit.timeit("'a'*10000 +'z'")
> 1.5685214234430873
>>>> timeit.timeit("z = 'a'*10000; z = z +'€'")
> 7.169538866162213
>>>> timeit.timeit("z = 'a'*10000; z = z +'z'")
> 1.5815893830557286
>>>> timeit.timeit("z = 'a'*10000; z += 'z'")
> 1.606955741596181
>>>> timeit.timeit("z = 'a'*10000; z += '€'")
> 7.160483334521416
> 
> 
> And do not forget, in a pure utf coding scheme, your
> char or a char will *never* be larger than 4 bytes.
> 
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('\U000101000')
> 48

Nonsense.

>>> sys.getsizeof('a'.encode('utf-8'))
18





#51632

From: wxjmfauth@gmail.com
Date: 2013-07-31 01:32 -0700
Message-ID: <797da2f0-5f62-43b9-ab4d-c5eb8d6c64a2@googlegroups.com>
In reply to: #51630

FSR:
===

The 'a' in 'a€' and 'a\U0001d11e':

>>> ['{:#010b}'.format(c) for c in 'a€'.encode('utf-16-be')]
['0b00000000', '0b01100001', '0b00100000', '0b10101100']
>>> ['{:#010b}'.format(c) for c in 'a\U0001d11e'.encode('utf-32-be')]
['0b00000000', '0b00000000', '0b00000000', '0b01100001',
'0b00000000', '0b00000001', '0b11010001', '0b00011110']

Has to be done.

>>> sys.getsizeof('a€')
42
>>> sys.getsizeof('a\U0001d11e')
48
>>> sys.getsizeof('aa')
27


Unicode/utf*
============

i) ("primary key") Create and use a unique set of encoded
code points.
ii) ("secondary key") Depending on the wish,
memory/performance: utf-8/16/32

Two advantages in light of the above example:
iii) The "a" never has to be reencoded.
iv) The size of an "a" never exceeds 4 bytes.

Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
Is it possible? ;-) The solution is in the problem.

jmf



#51633

From: Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date: 2013-07-31 10:59 +0200
Message-ID: <mailman.2.1375261163.1251.python-list@python.org>
In reply to: #51632

On 31-07-13 10:32, wxjmfauth@gmail.com wrote:
> Unicode/utf*
> ============
>
> i) ("primary key") Create and use a unique set of encoded
> code points.

FSR does this.

>>> st1 = 'a€'
>>> st2 = 'aa'
>>> ord(st1[0])
97
>>> ord(st2[0])
97
>>>

> ii) ("secondary key") Depending on the wish,
> memory/performance: utf-8/16/32

Whose wish? I don't know any language that allows the
programmer to choose the internal representation of its
strings. If it is the designer's choice, FSR does this;
if it is the programmer's choice, I don't see why
this is necessary for compliance.

> Two advantages in light of the above example:
> iii) The "a" never has to be reencoded.

FSR: check. Using a container with wider slots is not a reëncoding.
If such widening is encoding then your 'choice' between utf-8/16/32
implies that it will also have to reencode when it changes from
utf-8 to utf-16 or utf-32.
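
The distinction between a code point and its storage width can be checked directly. In this sketch the code point of "a" is the same whatever else the string contains, while the bytes for "a" differ between the three UTF encodings; the variable names are illustrative.

```python
samples = ["aa", "a€", "a\U0001d11e"]

# The code point of the leading "a" does not depend on its neighbours.
code_points = [ord(s[0]) for s in samples]

# The same "a" serialised with three different encodings.
encoded_a = {
    enc: "a".encode(enc)
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}

print(code_points)
for enc, raw in encoded_a.items():
    print(f"{enc:10} {raw!r} ({len(raw)} byte(s))")
```

Switching between utf-8, utf-16 and utf-32 therefore changes the stored bytes for "a" as well, which is the point being made about the proposed 'choice'.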

> iv) An "a" size never exceeds 4 bytes.

FSR: check.

> Hard job to solve/satisfy i), ii), iii) and iv) at the same time.
> Is it possible? ;-) The solution is in the problem.

Maybe you should use bytes or bytearrays if that is really what you want.

-- 
Antoon Pardon


