Groups > comp.lang.python > #6811 > unrolled thread
| Started by | Chris Rebert <clp2@rebertia.com> |
|---|---|
| First post | 2011-06-01 10:11 -0700 |
| Last post | 2011-06-05 04:17 -0700 |
| Articles | 20 on this page of 64 — 19 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: how to avoid leading white spaces Chris Rebert <clp2@rebertia.com> - 2011-06-01 10:11 -0700
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-01 12:39 -0700
Re: how to avoid leading white spaces Karim <karim.liateni@free.fr> - 2011-06-01 22:34 +0200
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-02 13:21 +0000
Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 21:57 -0400
Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 03:41 +0100
Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 02:58 +0000
Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-02 23:44 -0400
Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:52 +1000
Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-03 13:54 +1000
Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 04:30 +0000
Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:11 +0100
Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-03 14:18 +0100
Re: how to avoid leading white spaces Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2011-06-04 13:41 +1200
Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 20:44 +0100
Re: how to avoid leading white spaces Ian <hobson42@gmail.com> - 2011-06-06 22:04 +0100
Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-09 02:32 +0000
Re: how to avoid leading white spaces Thorsten Kampe <thorsten@thorstenkampe.de> - 2011-06-03 10:32 +0200
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 05:51 -0700
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 13:17 +0000
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 08:14 -0700
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-03 14:25 +0000
Re: how to avoid leading white spaces "D'Arcy J.M. Cain" <darcy@druid.net> - 2011-06-03 10:58 -0400
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-03 12:29 -0700
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-03 20:49 +0000
Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-03 21:45 +0000
Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-03 15:11 -0700
Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-03 23:38 +0100
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:47 -0700
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 22:44 -0700
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 16:08 +0000
Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:29 -0600
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:17 +0000
Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:40 -0600
Re: how to avoid leading white spaces Neil Cerutti <neilc@norwich.edu> - 2011-06-06 17:56 +0000
Re: how to avoid leading white spaces Ethan Furman <ethan@stoneleaf.us> - 2011-06-06 10:48 -0700
Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 11:42 -0600
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 02:05 +0000
Re: how to avoid leading white spaces MRAB <python@mrabarnett.plus.com> - 2011-06-04 03:24 +0100
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 04:59 +0000
Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-03 22:30 -0400
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-04 05:14 +0000
Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-04 09:39 -0400
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 00:44 +0000
Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-04 09:36 -0700
Re: how to avoid leading white spaces Nobody <nobody@nowhere.com> - 2011-06-04 21:02 +0100
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-05 01:01 +0000
Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-04 16:04 +1000
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-05 23:03 -0700
Re: how to avoid leading white spaces Chris Torek <nospam@torek.net> - 2011-06-06 07:11 +0000
Re: how to avoid leading white spaces "Octavian Rasnita" <orasnita@gmail.com> - 2011-06-06 11:51 +0300
Re: how to avoid leading white spaces Chris Angelico <rosuav@gmail.com> - 2011-06-06 19:01 +1000
Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-06 07:33 -0700
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 11:37 -0700
Re: how to avoid leading white spaces Roy Smith <roy@panix.com> - 2011-06-07 20:30 -0400
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:38 -0700
Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 09:14 -0700
Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-08 01:27 -0700
Re: how to avoid leading white spaces Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-06-06 15:29 +0000
Re: how to avoid leading white spaces Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-06 10:06 -0600
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-07 09:00 -0700
Re: how to avoid leading white spaces Duncan Booth <duncan.booth@invalid.invalid> - 2011-06-08 09:01 +0000
Re: how to avoid leading white spaces "rurpy@yahoo.com" <rurpy@yahoo.com> - 2011-06-08 07:39 -0700
Re: how to avoid leading white spaces rusi <rustompmody@gmail.com> - 2011-06-05 04:17 -0700
| From | Chris Rebert <clp2@rebertia.com> |
|---|---|
| Date | 2011-06-01 10:11 -0700 |
| Subject | Re: how to avoid leading white spaces |
| Message-ID | <mailman.2373.1306948264.9059.python-list@python.org> |
On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar <rakeshkumar.techie@gmail.com> wrote:
>
> Hi
>
> i have a file which contains data
>
> //ACCDJ EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ '
> //ACCT EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT '
> //ACCUM EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM '
> //ACCUM1 EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1 '
>
> i want to cut the white spaces which are in between single quotes after TABLE=.
>
> for example :
> 'ACCT[spaces] '
> 'ACCUM '
> 'ACCUM1 '
> the above is the output of another python script but its having a leading spaces.
Er, you mean trailing spaces. Since this is easy enough to be
homework, I will only give an outline:
1. Use str.index() and str.rindex() to find the positions of the
   starting and ending single-quotes in the line.
2. Use slicing to extract the inside of the quoted string.
3. Use str.rstrip() to remove the trailing spaces from the extracted string.
4. Use slicing and concatenation to join together the rest of the line
   with the now-stripped inner string.
Relevant docs: http://docs.python.org/library/stdtypes.html#string-methods
Cheers,
Chris
--
http://rebertia.com
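The four steps in the outline above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name is made up here, it is written for Python 3, and it treats the first and last single quote on a line as the delimiters of the value.

```python
def strip_quoted_trailing_spaces(line):
    """Remove trailing spaces inside the single-quoted value on a line.

    Hypothetical helper; the name is not from the thread.
    """
    try:
        start = line.index("'")    # step 1: position of the opening quote
        end = line.rindex("'")     # step 1: position of the closing quote
    except ValueError:
        return line                # no quote on this line; leave it alone
    if start == end:
        return line                # a single quote only; nothing to strip
    inner = line[start + 1:end]    # step 2: slice out the quoted text
    inner = inner.rstrip()         # step 3: drop the trailing spaces
    # step 4: rejoin the rest of the line around the stripped text
    return line[:start + 1] + inner + line[end:]


sample = "// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT   '"
print(strip_quoted_trailing_spaces(sample))
```

Lines without a quoted value (the `//ACCT EXEC ...` lines in the sample data) pass through unchanged.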
| From | "rurpy@yahoo.com" <rurpy@yahoo.com> |
|---|---|
| Date | 2011-06-01 12:39 -0700 |
| Message-ID | <9e861b0e-e768-401b-b5ca-190f20830a08@s9g2000yqm.googlegroups.com> |
| In reply to | #6811 |
On Jun 1, 11:11 am, Chris Rebert <c...@rebertia.com> wrote:
> On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
> > Hi
> >
> > i have a file which contains data
> >
> > //ACCDJ EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ '
> > //ACCT EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT '
> > //ACCUM EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM '
> > //ACCUM1 EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
> > // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1 '
> >
> > i want to cut the white spaces which are in between single quotes after TABLE=.
> >
> > for example :
> > 'ACCT[spaces] '
> > 'ACCUM '
> > 'ACCUM1 '
> > the above is the output of another python script but its having a leading spaces.
>
> Er, you mean trailing spaces. Since this is easy enough to be
> homework, I will only give an outline:
>
> 1. Use str.index() and str.rindex() to find the positions of the
> starting and ending single-quotes in the line.
> 2. Use slicing to extract the inside of the quoted string.
> 3. Use str.rstrip() to remove the trailing spaces from the extracted string.
> 4. Use slicing and concatenation to join together the rest of the line
> with the now-stripped inner string.
>
> Relevant docs:http://docs.python.org/library/stdtypes.html#string-methods
For some odd reason (perhaps because they are used a lot in Perl),
this group seems to have a great aversion to regular expressions.
Too bad, because this is a typical problem where their use is the
best solution.
    import re
    f = open ("your file")
    for line in f:
        fixed = re.sub (r"(TABLE='\S+)\s+'$", r"\1'", line)
        print fixed,
(The above is for Python-2, adjust as needed for Python-3)
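For readers on Python 3, the same substitution might look like the sketch below. The pattern is the one posted above; the function name and the in-memory sample lines are assumptions added so the example runs without a file.

```python
import re

# Same pattern as in the post: keep "TABLE='NAME", drop the whitespace
# that sits between the name and the closing quote at the end of the line.
TRAILING_SPACES = re.compile(r"(TABLE='\S+)\s+'$")


def fix_line(line):
    """Return the line with trailing spaces removed from the TABLE value."""
    return TRAILING_SPACES.sub(r"\1'", line)


# Sample lines standing in for the poster's file (assumed data).
lines = [
    "//ACCT EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,\n",
    "// UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT   '\n",
]
for line in lines:
    # end="" plays the role of the trailing comma in the Python 2 print.
    print(fix_line(line), end="")
```

Note that `$` also matches just before a final newline, which is why lines read straight from a file work without stripping them first.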
| From | Karim <karim.liateni@free.fr> |
|---|---|
| Date | 2011-06-01 22:34 +0200 |
| Message-ID | <mailman.2379.1306960465.9059.python-list@python.org> |
| In reply to | #6821 |
On 06/01/2011 09:39 PM, rurpy@yahoo.com wrote:
> On Jun 1, 11:11 am, Chris Rebert<c...@rebertia.com> wrote:
>> On Wed, Jun 1, 2011 at 12:31 AM, rakesh kumar
>>> Hi
>>>
>>> i have a file which contains data
>>>
>>> //ACCDJ EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCDJ '
>>> //ACCT EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCT '
>>> //ACCUM EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM '
>>> //ACCUM1 EXEC DB2UNLDC,DFLID=&DFLID,PARMLIB=&PARMLIB,
>>> // UNLDSYST=&UNLDSYST,DATABAS=MBQV1D0A,TABLE='ACCUM1 '
>>>
>>> i want to cut the white spaces which are in between single quotes after TABLE=.
>>>
>>> for example :
>>> 'ACCT[spaces] '
>>> 'ACCUM '
>>> 'ACCUM1 '
>>> the above is the output of another python script but its having a leading spaces.
>> Er, you mean trailing spaces. Since this is easy enough to be
>> homework, I will only give an outline:
>>
>> 1. Use str.index() and str.rindex() to find the positions of the
>> starting and ending single-quotes in the line.
>> 2. Use slicing to extract the inside of the quoted string.
>> 3. Use str.rstrip() to remove the trailing spaces from the extracted string.
>> 4. Use slicing and concatenation to join together the rest of the line
>> with the now-stripped inner string.
>>
>> Relevant docs:http://docs.python.org/library/stdtypes.html#string-methods
> For some odd reason (perhaps because they are used a lot in Perl),
> this groups seems to have a great aversion to regular expressions.
> Too bad because this is a typical problem where their use is the
> best solution.
>
>     import re
>     f = open ("your file")
>     for line in f:
>         fixed = re.sub (r"(TABLE='\S+)\s+'$", r"\1'", line)
>         print fixed,
>
> (The above is for Python-2, adjust as needed for Python-3)
Rurpy,
Your solution is neat.
Simple is better than complicated... (at least for this simple issue)
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2011-06-02 13:21 +0000 |
| Message-ID | <94ph22FrhvU5@mid.individual.net> |
| In reply to | #6821 |
On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> For some odd reason (perhaps because they are used a lot in
> Perl), this groups seems to have a great aversion to regular
> expressions. Too bad because this is a typical problem where
> their use is the best solution.
Python's str methods, when they're sufficient, are usually more
efficient.
Perl integrated regular expressions, while Python relegated them
to a library.
There is thus a large class of problems that are best solved with
regular expressions in Perl, but str methods in Python.
--
Neil Cerutti
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-06-02 21:57 -0400 |
| Message-ID | <roy-E2FA6F.21571602062011@news.panix.com> |
| In reply to | #6863 |
In article <94ph22FrhvU5@mid.individual.net>,
Neil Cerutti <neilc@norwich.edu> wrote:
> On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> > For some odd reason (perhaps because they are used a lot in
> > Perl), this groups seems to have a great aversion to regular
> > expressions. Too bad because this is a typical problem where
> > their use is the best solution.
>
> Python's str methods, when they're sufficent, are usually more
> efficient.
I was all set to say, "prove it!" when I decided to try an experiment.
Much to my surprise, for at least one common case, this is indeed
correct.
-------------------------------------------------
#!/usr/bin/env python
import timeit
text = '''Lorem ipsum dolor sit amet, consectetur adipiscing
elit. Mauris congue risus et purus lobortis facilisis. In
nec quam dolor, non blandit tellus. Suspendisse tempus,
sapien ac mattis volutpat, lectus elit auctor lacus, vitae
accumsan nunc elit in ligula. Curabitur quis mauris
neque. Etiam auctor eleifend arcu in egestas. Pellentesque
non mauris sit amet nulla aliquam hendrerit pretium id
arcu. Ut fringilla tempor lorem eget tincidunt. Duis nibh
nisi, iaculis sed scelerisque in, facilisis quis
dui. Aliquam varius diam in turpis auctor dapibus. Fusce
aliquet erat vestibulum mauris volutpat id laoreet enim
fermentum. Nam at justo nibh, ut vulputate dui.
libero. Nunc ac risus justo, in sodales erat.
'''
text = ' '.join(text.split())
t1 = timeit.Timer("'laoreet' in text",
                  "text = '%s'" % text)
t2 = timeit.Timer("pattern.search(text)",
                  "import re; pattern = re.compile('laoreet'); text = '%s'" % text)
print t1.timeit()
print t2.timeit()
-------------------------------------------------
./contains.py
0.990975856781
1.91417002678
-------------------------------------------------
> Perl integrated regular expressions, while Python relegated them
> to a library.
The same way Python relegates most everything to a library :-)
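A comparable measurement on Python 3.5 or later might be sketched as below. The haystack text, the `bench` helper and the iteration count are assumptions chosen for illustration; the timings will differ by machine and Python version, so none are quoted here.

```python
import re
import timeit

# Assumed stand-in for the Lorem ipsum text in the post.
text = "lorem ipsum dolor sit amet " * 40 + "laoreet enim fermentum"
pattern = re.compile("laoreet")


def bench(number=100000):
    """Time substring search with `in` and with a precompiled regex.

    Returns (seconds_for_in, seconds_for_regex) for `number` iterations.
    """
    namespace = {"text": text, "pattern": pattern}
    t_in = timeit.timeit("'laoreet' in text", globals=namespace, number=number)
    t_re = timeit.timeit("pattern.search(text)", globals=namespace, number=number)
    return t_in, t_re


t_in, t_re = bench()
print("str 'in':  %.4f s" % t_in)
print("re.search: %.4f s" % t_re)
```

Passing `globals=` avoids the string-interpolated setup statement, which is where the original script was fragile.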
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-06-03 03:41 +0100 |
| Message-ID | <mailman.2412.1307068910.9059.python-list@python.org> |
| In reply to | #6902 |
On 03/06/2011 02:57, Roy Smith wrote:
> In article<94ph22FrhvU5@mid.individual.net>,
> Neil Cerutti<neilc@norwich.edu> wrote:
>
>> On 2011-06-01, rurpy@yahoo.com<rurpy@yahoo.com> wrote:
>>> For some odd reason (perhaps because they are used a lot in
>>> Perl), this groups seems to have a great aversion to regular
>>> expressions. Too bad because this is a typical problem where
>>> their use is the best solution.
>>
>> Python's str methods, when they're sufficent, are usually more
>> efficient.
>
> I was all set to say, "prove it!" when I decided to try an experiment.
> Much to my surprise, for at least one common case, this is indeed
> correct.
>
[snip]
I've tested it on my PC for Python 2.7 (bytestring) and Python 3.1
(Unicode) and included the "regex" module on PyPI:
Python 2.7:
0.949936333562
4.31320052965
1.14035334748
Python 3.1:
1.27268308633
4.2509511537
1.16866839819
| From | Chris Torek <nospam@torek.net> |
|---|---|
| Date | 2011-06-03 02:58 +0000 |
| Message-ID | <is9ikg083h@news1.newsguy.com> |
| In reply to | #6902 |
>In article <94ph22FrhvU5@mid.individual.net>
> Neil Cerutti <neilc@norwich.edu> wrote:
>> Python's str methods, when they're sufficent, are usually more
>> efficient.
In article <roy-E2FA6F.21571602062011@news.panix.com>
Roy Smith <roy@panix.com> replied:
>I was all set to say, "prove it!" when I decided to try an experiment.
>Much to my surprise, for at least one common case, this is indeed
>correct.
[big snip]
>t1 = timeit.Timer("'laoreet' in text",
> "text = '%s'" % text)
>t2 = timeit.Timer("pattern.search(text)",
> "import re; pattern = re.compile('laoreet'); text =
>'%s'" % text)
>print t1.timeit()
>print t2.timeit()
>-------------------------------------------------
>./contains.py
>0.990975856781
>1.91417002678
>-------------------------------------------------
This is a bit surprising, since both "s1 in s2" and re.search()
could use a Boyer-Moore-based algorithm for a sufficiently-long
fixed string, and the time required should be proportional to that
needed to set up the skip table. The re.compile() gets to re-use
the table every time. (I suppose "in" could as well, with some sort
of cache of recently-built tables.)
Boyer-Moore search is roughly O(M/N) where M is the length of the
text being searched and N is the length of the string being sought.
(However, it depends on the form of the string, e.g., searching
for "ababa" is not as good as searching for "abcde".)
Python might be penalized by its use of Unicode here, since a
Boyer-Moore table for a full 16-bit Unicode string would need
65536 entries (one per possible ord() value). However, if the
string being sought is all single-byte values, a 256-element
table suffices; re.compile(), at least, could scan the pattern
and choose an appropriate underlying search algorithm.
There is an interesting article here as well:
http://effbot.org/zone/stringlib.htm
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2011-06-02 23:44 -0400 |
| Message-ID | <roy-751FAC.23443902062011@news.panix.com> |
| In reply to | #6906 |
In article <is9ikg083h@news1.newsguy.com>,
Chris Torek <nospam@torek.net> wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value).
I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
unicode inherently 32 bit? Or at least 20-something bit? Things like
UTF-16 are just one way to encode it.
In any case, while I could imagine building a 2^16 entry jump table,
clearly it's infeasible (with today's hardware) to build a 2^32 entry
table. But, there's nothing that really requires you to build a table at
all. If I understand the algorithm right, all that's really required is
that you can map a character to a shift value.
For an 8 bit character set, an indexed jump table makes sense. For a
larger character set, I would imagine you would do some heuristic
pre-processing to see if your search string consisted only of characters
in one unicode plane and use that fact to build a table which only
indexes that plane. Or, maybe use a hash table instead of a regular
indexed table. Not as fast, but only slower by a small constant factor,
which is not a horrendous price to pay in a fully i18n world :-)
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-03 13:52 +1000 |
| Message-ID | <mailman.2413.1307073127.9059.python-list@python.org> |
| In reply to | #6907 |
On Fri, Jun 3, 2011 at 1:44 PM, Roy Smith <roy@panix.com> wrote:
> In article <is9ikg083h@news1.newsguy.com>,
> Chris Torek <nospam@torek.net> wrote:
>
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
>
> I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
> unicode inherently 32 bit? Or at least 20-something bit? Things like
> UTF-16 are just one way to encode it.
The size of a Unicode character is like the size of a number. It's not
defined in terms of a maximum. However, Unicode planes 0-2 have all
the defined printable characters, and there are only 16 planes in
total, so (since each plane is 2^16 characters) that kinda makes
Unicode 18-bit or 20-bit. UTF-16 / UCS-2, therefore, uses two 16-bit
numbers to store a 20-bit number. Why do I get the feeling I've met
that before...
Chris Angelico
136E:0100 CD 20 INT 20
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2011-06-03 13:54 +1000 |
| Message-ID | <mailman.2414.1307073267.9059.python-list@python.org> |
| In reply to | #6907 |
On Fri, Jun 3, 2011 at 1:52 PM, Chris Angelico <rosuav@gmail.com> wrote:
> However, Unicode planes 0-2 have all
> the defined printable characters
PS. I'm fully aware that there are ranges defined in plane 14 / E.
They're non-printing characters, and unlikely to be part of a text
string, although it is possible. So you can't shortcut things and
treat Unicode as 18-bit numbers; has to be 20-bit. Doesn't have to be
32-bit unless that's really convenient.
Chris Angelico
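The bit counts under discussion can be checked with a little arithmetic. A small sketch, assuming Python 3.3 or later (where `sys.maxunicode` no longer depends on how the interpreter was built) and taking the Unicode standard's highest code point, U+10FFFF, as the starting point. Counted this way there are 17 planes, 0 through 16, and a code point needs 21 bits.

```python
import sys

MAX_CODE_POINT = 0x10FFFF                 # highest code point Unicode defines

planes = (MAX_CODE_POINT >> 16) + 1       # each plane holds 2**16 code points
bits_needed = MAX_CODE_POINT.bit_length() # bits for the largest code point

print("planes:", planes)
print("bits per code point:", bits_needed)
# On Python 3.3+ this is True; narrow builds of older Pythons reported 0xFFFF.
print("sys.maxunicode == 0x10FFFF:", sys.maxunicode == MAX_CODE_POINT)
```

The extra bit over 20 comes from plane 16: planes 1 to 16 need 20 bits once 0x10000 is subtracted, which is what a UTF-16 surrogate pair actually stores.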
| From | Chris Torek <nospam@torek.net> |
|---|---|
| Date | 2011-06-03 04:30 +0000 |
| Message-ID | <is9o1m0kur@news4.newsguy.com> |
| In reply to | #6907 |
>In article <is9ikg083h@news1.newsguy.com>,
> Chris Torek <nospam@torek.net> wrote:
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries (one per possible ord() value).
In article <roy-751FAC.23443902062011@news.panix.com>
Roy Smith <roy@panix.com> wrote:
>I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
>unicode inherently 32 bit?
Well, not exactly. As I understand it, Python is normally built
with a 16-bit "unicode character" type though (using either UCS-2
or UTF-16 internally; but I admit I have been far too lazy to look
up stuff like surrogates here :-) ).
>In any case, while I could imagine building a 2^16 entry jump table,
>clearly it's infeasible (with today's hardware) to build a 2^32 entry
>table. But, there's nothing that really requires you to build a table at
>all. If I understand the algorithm right, all that's really required is
>that you can map a character to a shift value.
Right. See the URL I included for an example. The point here,
though, is ... well:
>For an 8 bit character set, an indexed jump table makes sense. For a
>larger character set, I would imagine you would do some heuristic
>pre-processing to see if your search string consisted only of characters
>in one unicode plane and use that fact to build a table which only
>indexes that plane. Or, maybe use a hash table instead of a regular
>indexed table.
Just so. You have to pay for one scan through the string to build
a hash-table of offsets -- an expense similar to that for building
the 256-entry 8-bit table, perhaps, depending on string length --
but then you pay again for each character looked-at, since:
skip = hashed_lookup(table, this_char);
is a more complex operation than:
skip = table[this_char];
(where table is a simple array, hence the C-style semicolons: this
is not Python pseudo-code :-) ). Hence, a "penalty".
>Not as fast, but only slower by a small constant factor,
>which is not a horrendous price to pay in a fully i18n world :-)
Indeed.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html
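The hash-table variant discussed above can be sketched in Python using the simplified Horspool form of Boyer-Moore, with a dict in place of the indexed skip array. This is an illustration only: it is not how CPython's `in` or the `re` module is implemented, and the function name is made up here.

```python
def horspool_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1.

    Boyer-Moore-Horspool with a dict as the skip table, so the table
    holds only the characters that occur in the pattern, whatever the
    size of the character set.
    """
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # For every pattern character except the last, record how far the
    # window may slide when that character sits under the window's end.
    # Later occurrences overwrite earlier ones, giving the smallest shift.
    skip = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Hashed lookup; characters not in the pattern allow a full shift.
        pos += skip.get(text[pos + m - 1], m)
    return -1


haystack = "mauris volutpat id laoreet enim fermentum"
print(horspool_find(haystack, "laoreet") == haystack.find("laoreet"))
```

The `skip.get(...)` call is the per-character cost being described: a hash lookup where an 8-bit table would need only an array index.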
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2011-06-03 14:11 +0100 |
| Message-ID | <pan.2011.06.03.13.11.53.844000@nowhere.com> |
| In reply to | #6913 |
On Fri, 03 Jun 2011 04:30:46 +0000, Chris Torek wrote:
>>I'm not sure what you mean by "full 16-bit Unicode string"? Isn't
>>unicode inherently 32 bit?
>
> Well, not exactly. As I understand it, Python is normally built
> with a 16-bit "unicode character" type though
It's normally 32-bit on platforms where wchar_t is 32-bit (e.g. Linux).
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2011-06-03 14:18 +0100 |
| Message-ID | <pan.2011.06.03.13.18.36.891000@nowhere.com> |
| In reply to | #6906 |
On Fri, 03 Jun 2011 02:58:24 +0000, Chris Torek wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries (one per possible ord() value). However, if the
> string being sought is all single-byte values, a 256-element
> table suffices; re.compile(), at least, could scan the pattern
> and choose an appropriate underlying search algorithm.
The table can be truncated or compressed at the cost of having to map
codepoints to table indices. Or use a hash table instead of an array.
| From | Gregory Ewing <greg.ewing@canterbury.ac.nz> |
|---|---|
| Date | 2011-06-04 13:41 +1200 |
| Message-ID | <94tgqfF4tiU1@mid.individual.net> |
| In reply to | #6906 |
Chris Torek wrote:
> Python might be penalized by its use of Unicode here, since a
> Boyer-Moore table for a full 16-bit Unicode string would need
> 65536 entries
But is there any need for the Boyer-Moore algorithm to
operate on characters?
Seems to me you could just as well chop the UTF-16 up
into bytes and apply Boyer-Moore to them, and it would
work about as well.
--
Greg
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2011-06-04 20:44 +0100 |
| Message-ID | <pan.2011.06.04.19.44.55.938000@nowhere.com> |
| In reply to | #6990 |
On Sat, 04 Jun 2011 13:41:33 +1200, Gregory Ewing wrote:
>> Python might be penalized by its use of Unicode here, since a
>> Boyer-Moore table for a full 16-bit Unicode string would need
>> 65536 entries
>
> But is there any need for the Boyer-Moore algorithm to
> operate on characters?
>
> Seems to me you could just as well chop the UTF-16 up
> into bytes and apply Boyer-Moore to them, and it would
> work about as well.
No, because that won't care about alignment. E.g. on a big-endian
architecture, if you search for '\u2345' in the string '\u0123\u4567',
it will find a match (at an offset of 1 byte).
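The misaligned match described above can be reproduced in Python 3 by encoding both strings as big-endian UTF-16 and searching the bytes. The `find_aligned` helper is a hypothetical addition showing one way a byte-level search could reject hits that do not fall on a code-unit boundary.

```python
haystack = "\u0123\u4567"
needle = "\u2345"

# Character-level search: the needle is not in the haystack.
print(needle in haystack)

# Byte-level search on the big-endian encoding: bytes 01 23 45 67
# contain 23 45 starting at byte offset 1, a false positive.
hay_bytes = haystack.encode("utf-16-be")
needle_bytes = needle.encode("utf-16-be")
print(needle_bytes in hay_bytes, hay_bytes.find(needle_bytes))


def find_aligned(hay, pat, width=2):
    """Byte search that accepts only matches on a width-byte boundary.

    Returns the byte offset of the first aligned match, or -1.
    """
    pos = hay.find(pat)
    while pos != -1 and pos % width:
        pos = hay.find(pat, pos + 1)   # skip the misaligned hit, keep looking
    return pos


print(find_aligned(hay_bytes, needle_bytes))
```

So the byte-wise approach can work, but only with an extra alignment check on every candidate match.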
| From | Ian <hobson42@gmail.com> |
|---|---|
| Date | 2011-06-06 22:04 +0100 |
| Message-ID | <mailman.2508.1307394262.9059.python-list@python.org> |
| In reply to | #6906 |
On 03/06/2011 03:58, Chris Torek wrote:
>
>> -------------------------------------------------
> This is a bit surprising, since both "s1 in s2" and re.search()
> could use a Boyer-Moore-based algorithm for a sufficiently-long
> fixed string, and the time required should be proportional to that
> needed to set up the skip table. The re.compile() gets to re-use
> the table every time.
Is that true? My immediate thought is that Boyer-Moore would quickly give
the number of characters to skip, but skipping them would be slow because
UTF8 encoded characters are variable sized, and the string would have to be
walked anyway.
Or am I misunderstanding something.
Ian
| From | Chris Torek <nospam@torek.net> |
|---|---|
| Date | 2011-06-09 02:32 +0000 |
| Message-ID | <ispbb8024r6@news2.newsguy.com> |
| In reply to | #7111 |
>On 03/06/2011 03:58, Chris Torek wrote:
>>> -------------------------------------------------
>> This is a bit surprising, since both "s1 in s2" and re.search()
>> could use a Boyer-Moore-based algorithm for a sufficiently-long
>> fixed string, and the time required should be proportional to that
>> needed to set up the skip table. The re.compile() gets to re-use
>> the table every time.
In article <mailman.2508.1307394262.9059.python-list@python.org>
Ian <hobson42@gmail.com> wrote:
>Is that true? My immediate thought is that Boyer-Moore would quickly give
>the number of characters to skip, but skipping them would be slow because
>UTF8 encoded characters are variable sized, and the string would have to be
>walked anyway.
As I understand it, strings in python 3 are Unicode internally and
(apparently) use wchar_t. Byte strings in python 3 are of course
byte strings, not UTF-8 encoded.
>Or am I misunderstanding something.
Here's python 2.7 on a Linux box:
>>> print sys.getsizeof('a'), sys.getsizeof('ab'), sys.getsizeof('abc')
38 39 40
>>> print sys.getsizeof(u'a'), sys.getsizeof(u'ab'), sys.getsizeof(u'abc')
56 60 64
This implies that strings in Python 2.x are just byte strings (same
as b"..." in Python 3.x) and never actually contain unicode; and
unicode strings (same as "..." in Python 3.x) use 4-byte "characters"
per that box's wchar_t.
--
In-Real-Life: Chris Torek, Wind River Systems
Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
email: gmail (figure it out) http://web.torek.net/torek/index.html
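The same kind of measurement can be repeated on a current interpreter. The sketch below assumes CPython 3.3 or later, where a str picks its storage width from the widest character it holds, so the per-character cost is no longer fixed by the platform's wchar_t. The helper name is made up here, and the sizes are implementation details that other Python implementations need not share.

```python
import sys


def bytes_per_char(ch):
    """Estimate the storage used per character for strings made of ch.

    Compares strings of two and three copies so that the fixed object
    header cancels out of the difference.
    """
    return sys.getsizeof(ch * 3) - sys.getsizeof(ch * 2)


for label, ch in [("ASCII 'a'", "a"),
                  ("BMP U+0123", "\u0123"),
                  ("astral U+1F600", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On such an interpreter an all-ASCII search text takes one byte per character, which bears on the earlier worry about skip tables sized for 16-bit or 32-bit characters.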
| From | Thorsten Kampe <thorsten@thorstenkampe.de> |
|---|---|
| Date | 2011-06-03 10:32 +0200 |
| Message-ID | <MPG.2852ba74fbd084598981d@news.individual.de> |
| In reply to | #6902 |
* Roy Smith (Thu, 02 Jun 2011 21:57:16 -0400)
> In article <94ph22FrhvU5@mid.individual.net>,
> Neil Cerutti <neilc@norwich.edu> wrote:
> > On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> > > For some odd reason (perhaps because they are used a lot in
> > > Perl), this groups seems to have a great aversion to regular
> > > expressions. Too bad because this is a typical problem where
> > > their use is the best solution.
> >
> > Python's str methods, when they're sufficent, are usually more
> > efficient.
>
> I was all set to say, "prove it!" when I decided to try an experiment.
> Much to my surprise, for at least one common case, this is indeed
> correct.
> [...]
> t1 = timeit.Timer("'laoreet' in text",
> "text = '%s'" % text)
> t2 = timeit.Timer("pattern.search(text)",
> "import re; pattern = re.compile('laoreet'); text =
> '%s'" % text)
> print t1.timeit()
> print t2.timeit()
> -------------------------------------------------
> ./contains.py
> 0.990975856781
> 1.91417002678
> -------------------------------------------------
Strange that a lot of people (still) automatically associate
"efficiency" with "takes two seconds to run instead of one" (which I
guess no one really cares about).
Efficiency is much better measured in which time it saves you to write
and maintain the code in a readable way.
Thorsten
| From | "rurpy@yahoo.com" <rurpy@yahoo.com> |
|---|---|
| Date | 2011-06-03 05:51 -0700 |
| Message-ID | <bc814b92-82f1-4fca-9282-c22bfafb3cae@j23g2000yqc.googlegroups.com> |
| In reply to | #6863 |
On 06/02/2011 07:21 AM, Neil Cerutti wrote:
> > On 2011-06-01, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
>> >> For some odd reason (perhaps because they are used a lot in
>> >> Perl), this group seems to have a great aversion to regular
>> >> expressions. Too bad because this is a typical problem where
>> >> their use is the best solution.
> >
> > Python's str methods, when they're sufficient, are usually more
> > efficient.
Unfortunately, except for the very simplest cases, they are often
not sufficient. I often find myself changing, for example, a
startswith() to a RE when I realize that the input can contain mixed
case or that I have to treat commas as well as spaces as delimiters.
After doing this a number of times, one starts to use an RE right
from the get go unless one is VERY sure that there will be no
requirements creep.
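A small sketch of the kind of change described above, in Python 3. The keyword `select` and the function names are hypothetical, chosen only to illustrate the two requirement changes mentioned (mixed case, and commas as well as spaces as delimiters).

```python
import re

def starts_with_keyword_str(line, keyword="select"):
    # Plain str method: exact, case-sensitive prefix test.
    return line.startswith(keyword)

# Case-insensitive prefix; \b keeps "selection" from counting as "select".
PREFIX = re.compile(r"select\b", re.IGNORECASE)

def starts_with_keyword_re(line):
    return PREFIX.match(line) is not None

# Treat any run of commas and/or whitespace as a single delimiter.
DELIMS = re.compile(r"[,\s]+")

def split_fields(line):
    # Drop empty strings produced by leading or trailing delimiters.
    return [field for field in DELIMS.split(line.strip()) if field]
```

The str-method version can be stretched to cover each new case (`line.lower().startswith(...)`, chained `replace()` and `split()`), but every added requirement is another call, whereas the pattern usually absorbs it in place.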
And to regurgitate the mantra frequently used to defend Python when
it is criticized for being slow, the real question should be, are
REs fast enough? The answer almost always is yes.
> > Perl integrated regular expressions, while Python relegated them
> > to a library.
Which means that one needs one extra "import re" line that is
not required in Perl.
Since RE strings are compiled and cached, one often need not compile
them explicitly. Using match results often requires more lines
than in Perl:
  m = re.match (...)
  if m: do something with m
rather than Perl's
  if m/.../ {do something with capture group globals}
Any true Python fan should not find this a problem, the stock
response being, "what's the matter, your Enter key broken?"
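To make the line-count point concrete, here is a runnable Python 3 sketch of the two-step form. The `key=number` pattern and the function names are hypothetical. The second function uses an assignment expression, which only exists in Python 3.8 and later, so it was not an option when this thread was written.

```python
import re

# Hypothetical "key=number" format, used only for illustration.
PAIR = r"(\w+)=(\d+)"

def parse_pair(line):
    # Two-step form from the post: bind the match object, then test it.
    # re.match() compiles and caches the pattern string internally.
    m = re.match(PAIR, line)
    if m:
        return m.group(1), int(m.group(2))
    return None

def parse_pair_walrus(line):
    # Python 3.8+ only: bind and test in a single expression.
    if (m := re.match(PAIR, line)):
        return m.group(1), int(m.group(2))
    return None
```

Either way the match object is an ordinary local variable, not a set of capture-group globals as in Perl.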
> > There is thus a large class of problems that are best solved with
> > regular expressions in Perl, but str methods in Python.
Guess that depends on what one's definition of "large" is.
There are a few simple things, admittedly common, that Python
provides functions for that Perl uses REs for: replace(), for
example. But so what? I don't know if Perl does it or not but
there is no reason why functions called with string arguments or
REs with no "magic" characters can't be optimized to something
about as efficient as a corresponding Python function. Such uses
are likely to be naively counted as "using an RE in Perl".
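As an illustration of the replace() case in Python 3: a literal replacement can be written either with the str method or with an escaped pattern, and the two agree. The sample string is made up. The third call shows why the escaping matters once the literal contains a "magic" character.

```python
import re

text = "1.50 and 1250"  # made-up sample input

# str method: ".50" is taken literally.
plain = text.replace(".50", ".99")

# Regex with the literal escaped: same result as the str method.
via_re = re.sub(re.escape(".50"), ".99", text)

# Regex without escaping: "." matches any character, so "250" matches too.
unescaped = re.sub(".50", ".99", text)

print(plain)      # 1.99 and 1250
print(via_re)     # 1.99 and 1250
print(unescaped)  # 1.99 and 1.99
```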
I would agree though that the selection of string manipulation
functions in Perl is not as nice or orthogonal as in Python, and
that this contributes to a tendency to use REs in Perl when one
doesn't need to. But that is a programmer tradeoff (as in Python)
between fast-coding/slow-execution and slow-coding/fast-execution.
I for one would use Perl's index() and substr() to identify and
manipulate fixed patterns when performance was an issue.
One runs into the same tradeoff in Python pretty quickly too
so I'm not sure I'd call that space between the two languages
"large".
The other tradeoff, applying both to Perl and Python is with
maintenance. As mentioned above, even when today's requirements
can be solved with some code involving several string functions,
indexes, and conditionals, when those requirements change, it is
usually a lot harder to modify that code than a RE.
In short, although your observations are true to some extent, they
are not sufficient to justify the anti-RE attitude often seen here.
| From | Neil Cerutti <neilc@norwich.edu> |
|---|---|
| Date | 2011-06-03 13:17 +0000 |
| Message-ID | <94s587Fs2eU1@mid.individual.net> |
| In reply to | #6940 |
On 2011-06-03, rurpy@yahoo.com <rurpy@yahoo.com> wrote:
> The other tradeoff, applying both to Perl and Python is with
> maintenance. As mentioned above, even when today's
> requirements can be solved with some code involving several
> string functions, indexes, and conditionals, when those
> requirements change, it is usually a lot harder to modify that
> code than a RE.
>
> In short, although your observations are true to some extent,
> they are not sufficient to justify the anti-RE attitude often
> seen here.
Very good article. Thanks. I mostly wanted to combat the notion
that the alleged anti-RE attitude here might be caused by an
opposition to Perl culture.
I contend that the anti-RE attitude sometimes seen here is caused
by dissatisfaction with regexes in general combined with an
aversion to the re module. I agree that it's not that bad, but
it's clunky enough that it does contribute to making it my last
resort.
--
Neil Cerutti