| Started by | Peter Otten <__peter__@web.de> |
|---|---|
| First post | 2013-09-28 16:52 +0200 |
| Last post | 2013-09-28 16:59 -0400 |
| Articles | 2 — 2 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Weird bahaviour from shlex - line no Peter Otten <__peter__@web.de> - 2013-09-28 16:52 +0200
Re: Weird bahaviour from shlex - line no Piet van Oostrum <piet@vanoostrum.org> - 2013-09-28 16:59 -0400
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-28 16:52 +0200 |
| Subject | Re: Weird bahaviour from shlex - line no |
| Message-ID | <mailman.412.1380379896.18130.python-list@python.org> |
Dave Angel wrote:
> On 28/9/2013 02:26, Daniel Stojanov wrote:
>
>> Can somebody explain this. The line number reported by shlex depends
>> on the previous token. I want to be able to tell if I have just popped
>> the last token on a line.
>>
>
> I agree that it seems weird. However, I don't think you have made
> clear why it's not what you (and I) expect.
>
> import shlex
>
> def parseit(string):
>     print
>     print "Parsing -", string
>     first = shlex.shlex(string)
>     token = "dummy"
>     while token:
>         token = first.get_token()
>         print token, " -- line", first.lineno
>
> parseit("word1 word2\nword3") #first
> parseit("word1 word2,\nword3") #second
> parseit("word1 word2,word3\nword4")
> parseit("word1 word2+,?\nword3")
>
> This will display the lineno attribute for every token.
>
> shlex is documented at:
>
> http://docs.python.org/2/library/shlex.html
>
> And lineno is documented on that page as:
>
> """shlex.lineno
> Source line number (count of newlines seen so far plus one).
> """
>
> It's not at all clear what "seen so far" is intended to mean, but in
> practice, the line number is incremented for the last token on the
> line. Thus your first example
>
> Parsing - word1 word2
> word3
> word1 -- line 1
> word2 -- line 2
> word3 -- line 2
> -- line 2
>
> word2 has the incremented line number.
>
> But when the token is neither whitespace nor ASCII letters, then it
> doesn't increment lineno. Thus second example:
>
> Parsing - word1 word2,
> word3
> word1 -- line 1
> word2 -- line 1
> , -- line 1 #we would expect this to be "line 2"
> word3 -- line 2
> -- line 2
>
> Anybody else have some explanation
The explanation seems obvious: a word may be continued by the next character
if that is in wordchars, so the parser has to look at that character. If it
happens to be '\n' the lineno is immediately incremented. Non-wordchars are
returned as single characters, so there is no need to peek ahead and the
lineno is not altered.
In short: this looks like an implementation accident.
OP: I don't see a use case for the current behaviour -- I suggest that you
file a bug report.
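The peek-ahead effect described above can be reproduced with a short script. This is only a minimal sketch for Python 3 (the code quoted in this thread is Python 2), and the helper name `linenos` is made up for illustration. It records `lineno` right after each token is returned:

```python
import shlex

def linenos(text):
    """Return (token, lexer.lineno after reading it) pairs."""
    lexer = shlex.shlex(text)
    pairs = []
    while True:
        token = lexer.get_token()
        if not token:  # in non-POSIX mode '' signals end of input
            break
        pairs.append((token, lexer.lineno))
    return pairs

# "word2" is ended by the newline, so "\n" has already been read
# (and counted) by the time the token is returned.
print(linenos("word1 word2\nword3"))

# "word2" is ended by ",", and "," is returned on its own, so the
# newline is still unread when both tokens come back.
print(linenos("word1 word2,\nword3"))
```

In the first call `word2` should be reported with line 2, in the second both `word2` and `,` with line 1, matching the output quoted earlier.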
> or advice for Daniel, other than
> preprocessing the string by stripping any non letters off the end of the
> line?
The following gives the tokens' starting line for your examples:

import shlex

def shlexiter(s):
    p = shlex.shlex(s)
    # Stop treating "\n" as whitespace so it comes back as its own token
    p.whitespace = p.whitespace.replace("\n", "")
    while True:
        lineno = p.lineno  # line number before the next token is read
        token = p.get_token()
        if not token:
            break
        if token == "\n":
            continue
        yield lineno, token

def parseit(string):
    print("Parsing - {!r}".format(string))
    for lineno, token in shlexiter(string):
        print("{:3} {!r}".format(lineno, token))
    print("")
but I have no idea about the implications for more complex input.
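The original question was how to tell whether the token just read is the last one on its line. One possible way to build on the generator above is to group tokens by their starting line, so that the last element of each group is the last token on that line. A rough sketch for Python 3 follows; `tokens_by_line` is a hypothetical helper name, and like the generator it borrows from it has only been reasoned through for plain words and punctuation, not for comments or quoted strings containing newlines.

```python
import shlex

def tokens_by_line(text):
    """Map each starting line number to the tokens that begin on it."""
    lexer = shlex.shlex(text)
    # Assumption carried over from the generator above: with "\n" removed
    # from the whitespace set, newlines are returned as separate tokens.
    lexer.whitespace = lexer.whitespace.replace("\n", "")
    lines = {}
    while True:
        lineno = lexer.lineno  # captured before the next token is read
        token = lexer.get_token()
        if not token:
            break
        if token != "\n":
            lines.setdefault(lineno, []).append(token)
    return lines

for lineno, tokens in tokens_by_line("word1 word2,\nword3").items():
    print(lineno, tokens, "last:", tokens[-1])
```

Collecting everything into a dict gives up streaming, but it keeps the "last on this line" test to a simple `tokens[-1]` lookup.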
| From | Piet van Oostrum <piet@vanoostrum.org> |
|---|---|
| Date | 2013-09-28 16:59 -0400 |
| Message-ID | <m2eh887jch.fsf@cochabamba.vanoostrum.org> |
| In reply to | #54942 |
Peter Otten <__peter__@web.de> writes:
> Dave Angel wrote:
>
>> On 28/9/2013 02:26, Daniel Stojanov wrote:
>>
>>> Can somebody explain this. The line number reported by shlex depends
>>> on the previous token. I want to be able to tell if I have just popped
>>> the last token on a line.
[...]
> The explanation seems obvious: a word may be continued by the next character
> if that is in wordchars, so the parser has to look at that character. If it
> happens to be '\n' the lineno is immediately incremented. Non-wordchars are
> returned as single characters, so there is no need to peek ahead and the
> lineno is not altered.
>
> In short: this looks like an implementation accident.
I think shlex should be changed to give the line number of the start of the
token in self.lineno. It isn't hard.
--
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]