Groups > comp.lang.python > #6811 > unrolled thread

Re: how to avoid leading white spaces

Started by: Chris Rebert <clp2@rebertia.com>
First post: 2011-06-01 10:11 -0700
Last post: 2011-06-05 04:17 -0700
Articles: 20 on this page of 64 (19 participants)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: how to avoid leading white spaces Chris Rebert <clp2@rebertia.com> - 2011-06-01 10:11 -0700
    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-01 12:39 -0700
      Re: how to avoid leading white spaces Karim <karim.liateni@free.fr> - 2011-06-01 22:34 +0200
      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-02 13:21 +0000
        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 21:57 -0400
          Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 03:41 +0100
          Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 02:58 +0000
            Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 23:44 -0400
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:52 +1000
              Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:54 +1000
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 04:30 +0000
                Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:11 +0100
            Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:18 +0100
            Re: how to avoid leading white spaces Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-06-04 13:41 +1200
              Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 20:44 +0100
            Re: how to avoid leading white spaces Ian <hobson42@gmail.com> - 2011-06-06 22:04 +0100
              Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-09 02:32 +0000
          Re: how to avoid leading white spaces Thorsten Kampe <thorsten@thorstenkampe.de> - 2011-06-03 10:32 +0200
        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 05:51 -0700
          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 13:17 +0000
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 08:14 -0700
          Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-03 14:25 +0000
            Re: how to avoid leading white spaces "D'Arcy J.M. Cain" <darcy@druid.net> - 2011-06-03 10:58 -0400
            Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 12:29 -0700
              Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 20:49 +0000
                Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 21:45 +0000
                  Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-03 15:11 -0700
                  Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 23:38 +0100
                  Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:47 -0700
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:44 -0700
                  Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 16:08 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:29 -0600
                      Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:17 +0000
                        Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:40 -0600
                          Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:56 +0000
                    Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-06 10:48 -0700
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:42 -0600
              Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 02:05 +0000
                Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-04 03:24 +0100
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 04:59 +0000
                Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-03 22:30 -0400
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 05:14 +0000
                    Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-04 09:39 -0400
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 00:44 +0000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-04 09:36 -0700
                    Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 21:02 +0100
                      Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 01:01 +0000
                  Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-04 16:04 +1000
                Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 23:03 -0700
                  Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-06 07:11 +0000
                    Re: how to avoid leading white spaces "Octavian Rasnita" <orasnita@gmail.com> - 2011-06-06 11:51 +0300
                    Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-06 19:01 +1000
                    Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-06 07:33 -0700
                      Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 11:37 -0700
                        Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-07 20:30 -0400
                          Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:38 -0700
                            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 09:14 -0700
                        Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 01:27 -0700
                  Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-06 15:29 +0000
                    Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:06 -0600
                    Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 09:00 -0700
                      Re: how to avoid leading white spaces Duncan Booth <duncan.booth@invalid.invalid> - 2011-06-08 09:01 +0000
                        Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:39 -0700
            Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-05 04:17 -0700


#6811 — Re: how to avoid leading white spaces

From: Chris Rebert <clp2@rebertia.com>
Date: 2011-06-01 10:11 -0700
Subject: Re: how to avoid leading white spaces
Message-ID: <mailman.2373.1306948264.9059.python-list@python.org>
On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
<rakeshkumar.techie@gmail.com> wrote:
>
> Hi
>
> i have a file which contains data
>
> //ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '
> //ACCT          EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT        '
> //ACCUM         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM       '
> //ACCUM1        EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1      '
>
> i want to cut the white spaces which are in between single quotes after TABLE=.
>
> for example :
>                                'ACCT[spaces] '
>                                'ACCUM           '
>                                'ACCUM1         '
> the above is the output of another python script but its having a leading spaces.

Er, you mean trailing spaces. Since this is easy enough to be
homework, I will only give an outline:

1. Use str.index() and str.rindex() to find the positions of the
starting and ending single-quotes in the line.
2. Use slicing to extract the inside of the quoted string.
3. Use str.rstrip() to remove the trailing spaces from the extracted string.
4. Use slicing and concatenation to join together the rest of the line
with the now-stripped inner string.

Relevant docs: http://docs.python.org/library/stdtypes.html#string-methods
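
The four steps above can be sketched roughly as follows. This is only one possible reading of the outline, not code from the post; the function name strip_quoted_trailing_spaces is made up for illustration, and the sketch assumes each line holds at most one single-quoted value.

```python
def strip_quoted_trailing_spaces(line):
    """Remove trailing spaces inside the single-quoted value on a line.

    Hypothetical helper: assumes at most one quoted value per line.
    """
    try:
        # Step 1: locate the opening and closing single quotes.
        start = line.index("'")
        end = line.rindex("'")
    except ValueError:
        return line  # no quotes on this line, leave it untouched
    if start == end:
        return line  # only one quote found, nothing to strip
    # Steps 2 and 3: slice out the quoted text and strip its right side.
    inner = line[start + 1:end].rstrip()
    # Step 4: rebuild the line around the stripped value.
    return line[:start + 1] + inner + line[end:]
```

Lines without a quoted value, such as the EXEC lines in the sample data, pass through unchanged.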

Cheers,
Chris
--
http://rebertia.com

#6821

From: "rurpy@yahoo.com" <rurpy@yahoo.com>
Date: 2011-06-01 12:39 -0700
Message-ID: <9e861b0e-e768-401b-b5ca-190f20830a08@s9g2000yqm.googlegroups.com>
In reply to: #6811
On Jun 1, 11:11 am, Chris Rebert <c...@rebertia.com> wrote:
> On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
> > Hi
> >
> > i have a file which contains data
> >
> > //ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '
> > //ACCT          EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT        '
> > //ACCUM         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM       '
> > //ACCUM1        EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1      '
> >
> > i want to cut the white spaces which are in between single quotes after TABLE=.
> >
> > for example :
> >                                'ACCT[spaces] '
> >                                'ACCUM           '
> >                                'ACCUM1         '
> > the above is the output of another python script but its having a leading spaces.
>
> Er, you mean trailing spaces. Since this is easy enough to be
> homework, I will only give an outline:
>
> 1. Use str.index() and str.rindex() to find the positions of the
> starting and ending single-quotes in the line.
> 2. Use slicing to extract the inside of the quoted string.
> 3. Use str.rstrip() to remove the trailing spaces from the extracted string.
> 4. Use slicing and concatenation to join together the rest of the line
> with the now-stripped inner string.
>
> Relevant docs:http://docs.python.org/library/stdtypes.html#string-methods

For some odd reason (perhaps because they are used a lot in Perl),
this group seems to have a great aversion to regular expressions.
Too bad, because this is a typical problem where their use is the
best solution.

    import re
    f = open ("your file")
    for line in f:
        fixed = re.sub (r"(TABLE='\S+)\s+'$", r"\1'", line)
        print fixed,

(The above is for Python-2, adjust as needed for Python-3)
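
One way the snippet might be adjusted for Python 3, as a sketch only: the regular expression is the one from the post, while the helper names fix_line and fix_lines are placeholders introduced here.

```python
import re

# Same pattern as in the post: capture TABLE=' plus the name, then match
# the padding and the closing quote at the end of the line.
TRAILING_PAD = re.compile(r"(TABLE='\S+)\s+'$")


def fix_line(line):
    """Return the line with the padding before the closing quote removed."""
    return TRAILING_PAD.sub(r"\1'", line)


def fix_lines(lines):
    """Lazily apply fix_line to every line of an iterable (e.g. a file)."""
    for line in lines:
        yield fix_line(line)


# Possible usage (the file name is whatever your input is called):
#     with open("your file") as f:
#         for fixed in fix_lines(f):
#             print(fixed, end="")
```

Because $ also matches just before a trailing newline, lines read straight from a file work without stripping them first.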

#6823

From: Karim <karim.liateni@free.fr>
Date: 2011-06-01 22:34 +0200
Message-ID: <mailman.2379.1306960465.9059.python-list@python.org>
In reply to: #6821
On 06/01/2011 09:39 PM, rurpy@yahoo.com wrote:
> On Jun 1, 11:11 am, Chris Rebert<c...@rebertia.com>  wrote:
>> On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
>>> Hi
>>>
>>> i have a file which contains data
>>>
>>> //ACCDJ         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ       '
>>> //ACCT          EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT        '
>>> //ACCUM         EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM       '
>>> //ACCUM1        EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> //         UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1      '
>>>
>>> i want to cut the white spaces which are in between single quotes after TABLE=.
>>>
>>> for example :
>>>                                 'ACCT[spaces] '
>>>                                 'ACCUM           '
>>>                                 'ACCUM1         '
>>> the above is the output of another python script but its having a leading spaces.
>> Er, you mean trailing spaces. Since this is easy enough to be
>> homework, I will only give an outline:
>>
>> 1. Use str.index() and str.rindex() to find the positions of the
>> starting and ending single-quotes in the line.
>> 2. Use slicing to extract the inside of the quoted string.
>> 3. Use str.rstrip() to remove the trailing spaces from the extracted string.
>> 4. Use slicing and concatenation to join together the rest of the line
>> with the now-stripped inner string.
>>
>> Relevant docs:http://docs.python.org/library/stdtypes.html#string-methods
> For some odd reason (perhaps because they are used a lot in Perl),
> this group seems to have a great aversion to regular expressions.
> Too bad, because this is a typical problem where their use is the
> best solution.
>
>      import re
>      f = open ("your file")
>      for line in f:
>          fixed = re.sub (r"(TABLE='\S+)\s+'$", r"\1'", line)
>          print fixed,
>
> (The above is for Python-2, adjust as needed for Python-3)
Rurpy,
Your solution is neat.
Simple is better than complicated... (at least for this simple issue)


#6863

From: Neil Cerutti <neilc@norwich.edu>
Date: 2011-06-02 13:21 +0000
Message-ID: <94ph22FrhvU5@mid.individual.net>
In reply to: #6821
On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> For some odd reason (perhaps because they are used a lot in
> Perl), this group seems to have a great aversion to regular
> expressions. Too bad because this is a typical problem where
> their use is the best solution.

Python's str methods, when they're sufficient, are usually more
efficient.

Perl integrated regular expressions, while Python relegated them
to a library.

There is thus a large class of problems that are best solved with
regular expressions in Perl, but str methods in Python.

-- 
Neil Cerutti

#6902

From: Roy Smith <roy@panix.com>
Date: 2011-06-02 21:57 -0400
Message-ID: <roy-E2FA6F.21571602062011@news.panix.com>
In reply to: #6863
In article <94ph22FrhvU5@mid.individual.net>,
 Neil Cerutti <neilc@norwich.edu> wrote:

> On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> > For some odd reason (perhaps because they are used a lot in
> > Perl), this group seems to have a great aversion to regular
> > expressions. Too bad because this is a typical problem where
> > their use is the best solution.
> 
> Python's str methods, when they're sufficient, are usually more
> efficient.

I was all set to say, "prove it!" when I decided to try an experiment.  
Much to my surprise, for at least one common case, this is indeed 
correct.

-------------------------------------------------
#!/usr/bin/env python

import timeit

text = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Mauris congue risus et purus lobortis facilisis. In
nec quam dolor, non blandit tellus. Suspendisse tempus,
sapien ac mattis volutpat, lectus elit auctor lacus, vitae
accumsan nunc elit in ligula. Curabitur quis mauris
neque. Etiam auctor eleifend arcu in egestas. Pellentesque
non mauris sit amet nulla aliquam hendrerit pretium id
arcu. Ut fringilla tempor lorem eget tincidunt. Duis nibh
nisi, iaculis sed scelerisque in, facilisis quis
dui. Aliquam varius diam in turpis auctor dapibus. Fusce
aliquet erat vestibulum mauris volutpat id laoreet enim
fermentum. Nam at justo nibh, ut vulputate dui.
libero. Nunc ac risus justo, in sodales erat.
'''
# Collapse all whitespace runs (including newlines) into single spaces
# so the text can be embedded in a one-line string literal below.
text = ' '.join(text.split())

# Substring test with the "in" operator.
t1 = timeit.Timer("'laoreet' in text",
                  "text = '%s'" % text)
# The same search with a precompiled regular expression.
t2 = timeit.Timer("pattern.search(text)",
                  "import re; pattern = re.compile('laoreet'); "
                  "text = '%s'" % text)
print(t1.timeit())
print(t2.timeit())
-------------------------------------------------
./contains.py
0.990975856781
1.91417002678
-------------------------------------------------
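
For Python 3, a comparable measurement could be sketched like this. It is not the script above: the function name compare_search and the number parameter are assumptions made here, and both statements are timed through a lambda, which adds the same call overhead to each side.

```python
import re
import timeit


def compare_search(text, needle, number=100000):
    """Time `needle in text` against a precompiled regex search.

    Returns (seconds_for_in, seconds_for_regex) for `number` repetitions.
    """
    # re.escape keeps the comparison a plain fixed-string search.
    pattern = re.compile(re.escape(needle))
    t_in = timeit.timeit(lambda: needle in text, number=number)
    t_re = timeit.timeit(lambda: pattern.search(text), number=number)
    return t_in, t_re
```

Absolute numbers depend on the interpreter version and the machine, so only the ratio between the two results is worth comparing.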

> Perl integrated regular expressions, while Python relegated them
> to a library.

The same way Python relegates most everything to a library :-)

#6905

From: MRAB <python@mrabarnett.plus.com>
Date: 2011-06-03 03:41 +0100
Message-ID: <mailman.2412.1307068910.9059.python-list@python.org>
In reply to: #6902
On 03/06/2011 02:57, Roy Smith wrote:
> In article<94ph22FrhvU5@mid.individual.net>,
>   Neil Cerutti<neilc@norwich.edu>  wrote:
>
>> On 2011-06-01, rurpy@yahoo.com<rurpy@yahoo.com>  wrote:
>>> For some odd reason (perhaps because they are used a lot in
>>> Perl), this group seems to have a great aversion to regular
>>> expressions. Too bad because this is a typical problem where
>>> their use is the best solution.
>>
>> Python's str methods, when they're sufficient, are usually more
>> efficient.
>
> I was all set to say, "prove it!" when I decided to try an experiment.
> Much to my surprise, for at least one common case, this is indeed
> correct.
>
[snip]

I've tested it on my PC for Python 2.7 (bytestring) and Python 3.1
(Unicode) and included the "regex" module on PyPI:

Python 2.7:
0.949936333562
4.31320052965
1.14035334748

Python 3.1:
1.27268308633
4.2509511537
1.16866839819

#6906

From: Chris Torek <nospam@torek.net>
Date: 2011-06-03 02:58 +0000
Message-ID: <is9ikg083h@news1.newsguy.com>
In reply to: #6902
>In article <94ph22FrhvU5@mid.individual.net>
> Neil Cerutti <neilc@norwich.edu> wrote:
>> Python's str methods, when they're sufficient, are usually more
>> efficient.

In article <roy-E2FA6F.21571602062011@news.panix.com>
Roy Smith  <roy@panix.com> replied:
>I was all set to say, "prove it!" when I decided to try an experiment.  
>Much to my surprise, for at least one common case, this is indeed 
>correct.
 [big snip]
>t1 = timeit.Timer("'laoreet' in text",
>                 "text = '%s'" % text)
>t2 = timeit.Timer("pattern.search(text)",
>                  "import re; pattern = re.compile('laoreet'); text = 
>'%s'" % text)
>print t1.timeit()
>print t2.timeit()
>-------------------------------------------------
>./contains.py
>0.990975856781
>1.91417002678
>-------------------------------------------------

This is a bit surprising, since both "s1 in s2" and re.search()
could use a Boyer-Moore-based algorithm for a sufficiently-long
fixed string, and the time required should be proportional to that
needed to set up the skip table.  The re.compile() gets to re-use
the table every time.  (I suppose "in" could as well, with some sort
of cache of recently-built tables.)

Boyer-Moore search is roughly O(M/N) where M is the length of the
text being searched and N is the length of the string being sought.
(However, it depends on the form of the string, e.g., searching
for "ababa" is not as good as searching for "abcde".)

Python might be penalized by its use of Unicode here, since a
Boyer-Moore table for a full 16-bit Unicode string would need
65536 entries (one per possible ord() value).  However, if the
string being sought is all single-byte values, a 256-element
table suffices; re.compile(), at least, could scan the pattern
and choose an appropriate underlying search algorithm.
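
To make the skip-table idea concrete, here is a small sketch of the Horspool simplification of Boyer-Moore (bad-character rule only). It is an illustration, not what CPython's "in" operator or the re module actually does; the function name horspool_find is made up, and the table is a dict so that it only holds the characters present in the pattern.

```python
def horspool_find(text, pattern):
    """Return the index of the first match of pattern in text, or -1.

    Boyer-Moore-Horspool sketch: on a mismatch, shift by a distance
    looked up from the text character aligned with the pattern's end.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # Skip table: distance from the rightmost occurrence of each character
    # (excluding the final position) to the end of the pattern.
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Characters absent from the pattern allow a full-length shift.
        pos += skip.get(text[pos + m - 1], m)
    return -1
```

A pattern such as "abcde" yields large shifts, while a repetitive one such as "ababa" yields small ones, which is the dependence on the form of the string mentioned above.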

There is an interesting article here as well:
   http://effbot.org/zone/stringlib.htm
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html

#6907

From: Roy Smith <roy@panix.com>
Date: 2011-06-02 23:44 -0400
Message-ID: <roy-751FAC.23443902062011@news.panix.com>
In reply to: #6906
In article <is9ikg083h@news1.newsguy.com>,
 Chris Torek <nospam@torek.net> wrote:

> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).

I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't 
unicode inherently 32 bit?  Or at least 20-something bit?  Things like 
UTF-16 are just one way to encode it.

In any case, while I could imagine building a 2^16 entry jump table, 
clearly it's infeasible (with today's hardware) to build a 2^32 entry 
table. But, there's nothing that really requires you to build a table at 
all.  If I understand the algorithm right, all that's really required is 
that you can map a character to a shift value.

For an 8 bit character set, an indexed jump table makes sense.  For a 
larger character set, I would imagine you would do some heuristic 
pre-processing to see if your search string consisted only of characters 
in one unicode plane and use that fact to build a table which only 
indexes that plane.  Or, maybe use a hash table instead of a regular 
indexed table.  Not as fast, but only slower by a small constant factor, 
which is not a horrendous price to pay in a fully i18n world :-)

#6908

From: Chris Angelico <rosuav@gmail.com>
Date: 2011-06-03 13:52 +1000
Message-ID: <mailman.2413.1307073127.9059.python-list@python.org>
In reply to: #6907
On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith <roy@panix.com> wrote:
> In article <is9ikg083h@news1.newsguy.com>,
>  Chris Torek <nospam@torek.net> wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't
> unicode inherently 32 bit?  Or at least 20-something bit?  Things like
> UTF-16 are just one way to encode it.

The size of a Unicode character is like the size of a number. It's not
defined in terms of a maximum. However, Unicode planes 0-2 have all
the defined printable characters, and there are only 16 planes in
total, so (since each plane is 2^16 characters) that kinda makes
Unicode 18-bit or 20-bit. UTF-16 / UCS-2, therefore, uses two 16-bit
numbers to store a 20-bit number. Why do I get the feeling I've met
that before...
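
For reference, a short Python 3 sketch of how UTF-16 represents code points. Note that the Unicode standard defines 17 planes (0 through 16), so the largest code point, U+10FFFF, needs 21 bits; a surrogate pair carries a 20-bit offset into the 16 supplementary planes. The helper name utf16_units is made up for this example.

```python
def utf16_units(ch):
    """Return the 16-bit code units UTF-16 uses for a one-character string."""
    data = ch.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


# A character in plane 0 fits in a single 16-bit unit.
bmp_units = utf16_units("A")
# A character outside plane 0 needs a surrogate pair (two units).
astral_units = utf16_units("\U0001F600")
# Number of bits needed for the largest code point.
max_bits = (0x10FFFF).bit_length()
```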

Chris Angelico
136E:0100 CD 20   INT 20

#6909

From: Chris Angelico <rosuav@gmail.com>
Date: 2011-06-03 13:54 +1000
Message-ID: <mailman.2414.1307073267.9059.python-list@python.org>
In reply to: #6907
On Fri, Jun 3, 2011 at 1:52 PM, Chris Angelico <rosuav@gmail.com> wrote:
> However, Unicode planes 0-2 have all
> the defined printable characters

PS. I'm fully aware that there are ranges defined in page 14 / E.
They're non-printing characters, and unlikely to be part of a text
string, although it is possible. So you can't shortcut things and
treat Unicode as 18-bit numbers; has to be 20-bit. Doesn't have to be
32-bit unless that's really convenient.

Chris Angelico

#6913

From: Chris Torek <nospam@torek.net>
Date: 2011-06-03 04:30 +0000
Message-ID: <is9o1m0kur@news4.newsguy.com>
In reply to: #6907
>In article <is9ikg083h@news1.newsguy.com>,
> Chris Torek <nospam@torek.net> wrote:
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).

In article <roy-751FAC.23443902062011@news.panix.com>
Roy Smith  <roy@panix.com> wrote:
>I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't 
>unicode inherently 32 bit?

Well, not exactly.  As I understand it, Python is normally built
with a 16-bit "unicode character" type though (using either UCS-2
or UTF-16 internally; but I admit I have been far too lazy to look
up stuff like surrogates here :-) ).

>In any case, while I could imagine building a 2^16 entry jump table, 
>clearly it's infeasible (with today's hardware) to build a 2^32 entry 
>table. But, there's nothing that really requires you to build a table at 
>all.  If I understand the algorithm right, all that's really required is 
>that you can map a character to a shift value.

Right.  See the URL I included for an example.  The point here,
though, is ... well:

>For an 8 bit character set, an indexed jump table makes sense.  For a 
>larger character set, I would imagine you would do some heuristic 
>pre-processing to see if your search string consisted only of characters 
>in one unicode plane and use that fact to build a table which only 
>indexes that plane.  Or, maybe use a hash table instead of a regular 
>indexed table.

Just so.  You have to pay for one scan through the string to build
a hash-table of offsets -- an expense similar to that for building
the 256-entry 8-bit table, perhaps, depending on string length --
but then you pay again for each character looked-at, since:

    skip = hashed_lookup(table, this_char);

is a more complex operation than:

    skip = table[this_char];

(where table is a simple array, hence the C-style semicolons: this
is not Python pseudo-code :-) ).  Hence, a "penalty".
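
The two kinds of table can be sketched in Python as well, purely as an illustration of the trade-off; the function names are hypothetical, and Python's own list and dict costs do not mirror the C-level difference described above.

```python
def build_array_table(pattern):
    """256-entry skip table for a bytes pattern, indexed by byte value."""
    m = len(pattern)
    table = [m] * 256  # default: shift by the whole pattern length
    for i, byte in enumerate(pattern[:-1]):
        table[byte] = m - 1 - i
    return table


def build_hash_table(pattern):
    """Skip table for a str pattern, holding only characters it contains."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


array_table = build_array_table(b"abcde")
hash_table = build_hash_table("abcde")

# Direct indexing: every possible byte value has an entry.
skip_direct = array_table[ord("a")]
# Hashed lookup: characters not in the pattern fall back to the default.
skip_hashed = hash_table.get("\u2345", len("abcde"))
```

The array needs one slot per possible character value, while the hash table stays as small as the pattern at the price of a more involved lookup.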

>Not as fast, but only slower by a small constant factor, 
>which is not a horrendous price to pay in a fully i18n world :-)

Indeed.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html

#6942

From: Nobody <nobody@nowhere.com>
Date: 2011-06-03 14:11 +0100
Message-ID: <pan.2011.06.03.13.11.53.844000@nowhere.com>
In reply to: #6913
On Fri, 03 Jun 2011 04:30:46 +0000, Chris Torek wrote:

>>I'm not sure what you mean by "full 16-bit Unicode string"?  Isn't 
>>unicode inherently 32 bit?
> 
> Well, not exactly.  As I understand it, Python is normally built
> with a 16-bit "unicode character" type though

It's normally 32-bit on platforms where wchar_t is 32-bit (e.g. Linux).
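
Which kind of build is running can be checked from sys.maxunicode, as in this sketch (the function name is made up here). On Python 2 and on Python 3 up to 3.2 the value is 0xFFFF for a narrow build and 0x10FFFF for a wide one; from Python 3.3 on it is always 0x10FFFF, because the narrow/wide distinction was removed.

```python
import sys


def unicode_build():
    """Report whether this interpreter exposes 16-bit or full code points."""
    if sys.maxunicode > 0xFFFF:
        return "wide"    # len() of a non-BMP character is 1
    return "narrow"      # non-BMP characters are stored as surrogate pairs
```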

#6944

From: Nobody <nobody@nowhere.com>
Date: 2011-06-03 14:18 +0100
Message-ID: <pan.2011.06.03.13.18.36.891000@nowhere.com>
In reply to: #6906
On Fri, 03 Jun 2011 02:58:24 +0000, Chris Torek wrote:

> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).  However, if the
> string being sought is all single-byte values, a 256-element
> table suffices; re.compile(), at least, could scan the pattern
> and choose an appropriate underlying search algorithm.

The table can be truncated or compressed at the cost of having to map
codepoints to table indices. Or use a hash table instead of an array.

#6990

From: Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date: 2011-06-04 13:41 +1200
Message-ID: <94tgqfF4tiU1@mid.individual.net>
In reply to: #6906
Chris Torek wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries

But is there any need for the Boyer-Moore algorithm to
operate on characters?

Seems to me you could just as well chop the UTF-16 up
into bytes and apply Boyer-Moore to them, and it would
work about as well.

-- 
Greg

#7019

From: Nobody <nobody@nowhere.com>
Date: 2011-06-04 20:44 +0100
Message-ID: <pan.2011.06.04.19.44.55.938000@nowhere.com>
In reply to: #6990
On Sat, 04 Jun 2011 13:41:33 +1200, Gregory Ewing wrote:

>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries
> 
> But is there any need for the Boyer-Moore algorithm to
> operate on characters?
> 
> Seems to me you could just as well chop the UTF-16 up
> into bytes and apply Boyer-Moore to them, and it would
> work about as well.

No, because that won't care about alignment. E.g. on a big-endian
architecture, if you search for '\u2345' in the string '\u0123\u4567', it
will find a match (at an offset of 1 byte).
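
The false match is easy to reproduce in Python 3 by encoding both strings as big-endian UTF-16 and searching the bytes. The helper find_aligned is a hypothetical sketch of one way to reject misaligned hits, not something from the thread.

```python
haystack = "\u0123\u4567".encode("utf-16-be")  # bytes 01 23 45 67
needle = "\u2345".encode("utf-16-be")          # bytes 23 45

# A plain byte search reports a hit that straddles two characters.
byte_offset = haystack.find(needle)


def find_aligned(data, sub, unit=2):
    """Byte search that only accepts matches starting on a unit boundary."""
    pos = data.find(sub)
    while pos != -1 and pos % unit != 0:
        pos = data.find(sub, pos + 1)
    return pos


aligned_offset = find_aligned(haystack, needle)
```

Searching the decoded strings directly shows no match, which is the behaviour a character-based search has to preserve.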

#7111

From: Ian <hobson42@gmail.com>
Date: 2011-06-06 22:04 +0100
Message-ID: <mailman.2508.1307394262.9059.python-list@python.org>
In reply to: #6906
On 03/06/2011 03:58, Chris Torek wrote:
>
>> -------------------------------------------------
> This is a bit surprising, since both "s1 in s2" and re.search()
> could use a Boyer-Moore-based algorithm for a sufficiently-long
> fixed string, and the time required should be proportional to that
> needed to set up the skip table.  The re.compile() gets to re-use
> the table every time.
Is that true?  My immediate thought is that Boyer-Moore would quickly give
the number of characters to skip, but skipping them would be slow because
UTF8 encoded characters are variable sized, and the string would have to be
walked anyway.

Or am I misunderstanding something?

Ian


#7267

From: Chris Torek <nospam@torek.net>
Date: 2011-06-09 02:32 +0000
Message-ID: <ispbb8024r6@news2.newsguy.com>
In reply to: #7111
>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table.  The re.compile() gets to re-use
>> the table every time.

In article <mailman.2508.1307394262.9059.python-list@python.org>
Ian  <hobson42@gmail.com> wrote:
>Is that true?  My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.

As I understand it, strings in python 3 are Unicode internally and
(apparently) use wchar_t.  Byte strings in python 3 are of course
byte strings, not UTF-8 encoded.

>Or am I misunderstanding something.

Here's python 2.7 on a Linux box:

    >>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
    38 39 40
    >>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
    56 60 64

This implies that strings in Python 2.x are just byte strings (same
as b"..." in Python 3.x) and never actually contain unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
-- 
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W)  +1 801 277 2603
email: gmail (figure it out)      http://web.torek.net/torek/index.html



#6928

From: Thorsten Kampe <thorsten@thorstenkampe.de>
Date: 2011-06-03 10:32 +0200
Message-ID: <MPG.2852ba74fbd084598981d@news.individual.de>
In reply to: #6902
* Roy Smith (Thu, 02 Jun 2011 21:57:16 -0400)
> In article <94ph22FrhvU5@mid.individual.net>,
>  Neil Cerutti <neilc@norwich.edu> wrote:
> > On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> > > For some odd reason (perhaps because they are used a lot in
> > > Perl), this group seems to have a great aversion to regular
> > > expressions. Too bad because this is a typical problem where
> > > their use is the best solution.
> > 
> > Python's str methods, when they're sufficient, are usually more
> > efficient.
> 
> I was all set to say, "prove it!" when I decided to try an experiment.  
> Much to my surprise, for at least one common case, this is indeed 
> correct.
> [...]
> t1 = timeit.Timer("'laoreet' in text",
>                   "text = '%s'" % text)
> t2 = timeit.Timer("pattern.search(text)",
>                   "import re; pattern = re.compile('laoreet'); text = '%s'" % text)
> print t1.timeit()
> print t2.timeit()
> -------------------------------------------------
> ./contains.py
> 0.990975856781
> 1.91417002678
> -------------------------------------------------
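
The quoted script elides the definition of text, so here is a
self-contained sketch of the same comparison in Python 3.  The sample
text is made up for illustration (an assumption, not the original
data), as are the helper names time_in and time_search; absolute
numbers will vary by machine and Python version.

```python
import re
import timeit

# Hypothetical sample text; the text used in the original post was elided.
text = "lorem ipsum dolor sit amet " * 200 + "laoreet"
pattern = re.compile("laoreet")


def time_in(number=100000):
    """Time the plain substring test."""
    return timeit.timeit(lambda: "laoreet" in text, number=number)


def time_search(number=100000):
    """Time the precompiled regular expression search."""
    return timeit.timeit(lambda: pattern.search(text), number=number)


if __name__ == "__main__":
    print("in operator:", time_in())
    print("re search:  ", time_search())
```

Both timings include the small overhead of calling a lambda, so only
the relative difference is meaningful.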

Strange that a lot of people (still) automatically associate 
"efficiency" with "takes two seconds to run instead of one" (which I 
guess no one really cares about).

Efficiency is much better measured by how much time it saves you in 
writing and maintaining readable code.

Thorsten



#6940

From: "rurpy@yahoo.com" <rurpy@yahoo.com>
Date: 2011-06-03 05:51 -0700
Message-ID: <bc814b92-82f1-4fca-9282-c22bfafb3cae@j23g2000yqc.googlegroups.com>
In reply to: #6863
On 06/02/2011 07:21 AM, Neil Cerutti wrote:
> > On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>> >> For some odd reason (perhaps because they are used a lot in
>> >> Perl), this group seems to have a great aversion to regular
>> >> expressions. Too bad because this is a typical problem where
>> >> their use is the best solution.
> >
> > Python's str methods, when they're sufficient, are usually more
> > efficient.

Unfortunately, except for the very simplest cases, they are often
not sufficient.  I often find myself changing, for example, a
startswith() to a RE when I realize that the input can contain mixed
case or that I have to treat commas as well as spaces as delimiters.
After doing this a number of times, one starts to use an RE right
from the get go unless one is VERY sure that there will be no
requirements creep.
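
A minimal sketch of that kind of change, in Python 3.  The "set"
command, the function names, and the sample inputs are all made up
for illustration; the only point is how the string-method version
stops being sufficient once case and delimiters vary:

```python
import re


def is_set_simple(line):
    # Fine while the input is always lowercase and space-delimited.
    return line.startswith("set ")


# Hypothetical relaxed requirement: any case, comma or whitespace delimiter.
_SET_RE = re.compile(r"set[,\s]", re.IGNORECASE)


def is_set_flexible(line):
    return _SET_RE.match(line) is not None


def split_fields(line):
    # Split on runs of commas and/or whitespace, dropping empty fields.
    return [field for field in re.split(r"[,\s]+", line) if field]


print(is_set_simple("SET,x,1"), is_set_flexible("SET,x,1"))
print(split_fields("SET, x  1"))
```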

And to regurgitate the mantra frequently used to defend Python when
it is criticized for being slow, the real question should be, are
REs fast enough?  The answer almost always is yes.

> > Perl integrated regular expressions, while Python relegated them
> > to a library.

Which means that one needs one extra "import re" line that is
not required in Perl.

Since RE strings are compiled and cached, one often need not compile
them explicitly.  Using match results often requires more lines
than in Perl:
   m = re.match (...)
   if m: do something with m
rather than Perl's
   if (m/.../) {do something with capture group globals}
Any true Python fan should not find this a problem, the stock
response being, "what's the matter, your Enter key broken?"
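
Spelled out as a minimal sketch in Python 3, assuming a made-up
"key = value" line format (the pattern and the name parse_kv are
hypothetical, chosen only to show the match-then-test idiom):

```python
import re

# Hypothetical pattern: optional whitespace, a key, "=", then a value
# with surrounding whitespace trimmed.
_KV_RE = re.compile(r"\s*(\w+)\s*=\s*(.*\S)\s*$")


def parse_kv(line):
    m = _KV_RE.match(line)   # line one: attempt the match
    if m:                    # line two: test it before using the groups
        return m.group(1), m.group(2)
    return None


print(parse_kv("  name = Guido van Rossum  "))
print(parse_kv("no equals sign here"))
```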

> > There are thus a large class of problems that are best solved with
> > regular expressions in Perl, but str methods in Python.

Guess that depends on what one's definition of "large" is.

There are a few simple things, admittedly common, that Python
provides functions for that Perl uses REs for: replace(), for
example.  But so what?  I don't know if Perl does it or not but
there is no reason why functions called with string arguments or
REs with no "magic" characters can't be optimized to something
about as efficient as a corresponding Python function.  Such uses
are likely to be naively counted as "using an RE in Perl".
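
For comparison, here is a small Python 3 sketch of the two ways to do
a fixed-string replacement.  The function names are made up; the
calls themselves (str.replace, re.sub, re.escape) are standard.  It
also shows what goes wrong if the "magic" characters are not escaped:

```python
import re


def replace_with_str(text, old, new):
    # Plain string method: old is always taken literally.
    return text.replace(old, new)


def replace_with_re(text, old, new):
    # re.escape neutralizes any "magic" characters in old; the callable
    # replacement keeps backslashes in new from being interpreted.
    return re.sub(re.escape(old), lambda m: new, text)


print(replace_with_str("a.b.c", ".", "-"))
print(replace_with_re("a.b.c", ".", "-"))
# Unescaped, "." matches every character.
print(re.sub(".", "-", "a.b.c"))
```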

I would agree though that the selection of string manipulation
functions in Perl is not as nice or orthogonal as in Python, and
that this contributes to a tendency to use REs in Perl when one
doesn't need to.  But that is a programmer tradeoff (as in Python)
between fast-coding/slow-execution and slow-coding/fast-execution.
I for one would use Perl's index() and substr() to identify and
manipulate fixed patterns when performance was an issue.
One runs into the same tradeoff in Python pretty quickly too
so I'm not sure I'd call that space between the two languages
"large".

The other tradeoff, applying both to Perl and Python is with
maintenance.  As mentioned above, even when today's requirements
can be solved with some code involving several string functions,
indexes, and conditionals, when those requirements change, it is
usually a lot harder to modify that code than a RE.

In short, although your observations are true to some extent, they
are not sufficient to justify the anti-RE attitude often seen here.



#6943

From: Neil Cerutti <neilc@norwich.edu>
Date: 2011-06-03 13:17 +0000
Message-ID: <94s587Fs2eU1@mid.individual.net>
In reply to: #6940
On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> The other tradeoff, applying both to Perl and Python is with
> maintenance.  As mentioned above, even when today's
> requirements can be solved with some code involving several
> string functions, indexes, and conditionals, when those
> requirements change, it is usually a lot harder to modify that
> code than a RE.
>
> In short, although your observations are true to some extent,
> they are not sufficient to justify the anti-RE attitude often
> seen here.

Very good article. Thanks. I mostly wanted to combat the notion
that the alleged anti-RE attitude here might be caused by an
opposition to Perl culture.

I contend that the anti-RE attitude sometimes seen here is caused
by dissatisfaction with regexes in general combined with an
aversion to the re module. I agree that it's not that bad, but
it's clunky enough that it does contribute to making it my last
resort.

-- 
Neil Cerutti


