Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #19291 > unrolled thread
| Started by | "M.Pekala" <mcdpekala@gmail.com> |
|---|---|
| First post | 2012-01-23 13:48 -0800 |
| Last post | 2012-01-24 11:15 +0100 |
| Articles | 11 — 7 participants |
Back to article view | Back to comp.lang.python
Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 13:48 -0800
Re: Parsing a serial stream too slowly Jon Clements <joncle@googlemail.com> - 2012-01-23 14:00 -0800
Re: Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 14:03 -0800
Re: Parsing a serial stream too slowly Nick Dokos <nicholas.dokos@hp.com> - 2012-01-23 18:28 -0500
Re: Parsing a serial stream too slowly Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-01-24 00:13 +0100
Re: Parsing a serial stream too slowly Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-01-24 14:04 +0100
Re: Parsing a serial stream too slowly Cameron Simpson <cs@zip.com.au> - 2012-01-24 10:49 +1100
Re: Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 17:07 -0800
Re: Parsing a serial stream too slowly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-01-24 05:08 +0000
Re: Parsing a serial stream too slowly Cameron Simpson <cs@zip.com.au> - 2012-01-24 19:23 +1100
Re: Parsing a serial stream too slowly Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-01-24 11:15 +0100
| From | "M.Pekala" <mcdpekala@gmail.com> |
|---|---|
| Date | 2012-01-23 13:48 -0800 |
| Subject | Parsing a serial stream too slowly |
| Message-ID | <d250b98e-6409-474b-ba56-146e5b490e01@l1g2000vbc.googlegroups.com> |
Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:
Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$
I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...
def OnSerialRead(self, event):
text = event.data
self.sensorabuffer = self.sensorabuffer + text
self.sensorbbuffer = self.sensorbbuffer + text
self.sensorcbuffer = self.sensorcbuffer + text
if sensoraenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorAValue = s
self.sensorabuffer = ''
if sensorbenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorBValue = s
self.sensorbenable= ''
if sensorcenable:
sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
if sensorresult:
s = sensorresult.group(0)
s = s[2:-1]
if self.sensor_enable_chkbox.GetValue():
self.SensorCValue = s
self.sensorcenable= ''
self.DisplaySensorReadings()
I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.
[toc] | [next] | [standalone]
| From | Jon Clements <joncle@googlemail.com> |
|---|---|
| Date | 2012-01-23 14:00 -0800 |
| Message-ID | <760df574-f13d-46e5-93e7-6cc51de5510a@s18g2000vby.googlegroups.com> |
| In reply to | #19291 |
On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote: > Hello, I am having some trouble with a serial stream on a project I am > working on. I have an external board that is attached to a set of > sensors. The board polls the sensors, filters them, formats the > values, and sends the formatted values over a serial bus. The serial > stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor > value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect... > > When one sensor is running my python script grabs the data just fine, > removes the formatting, and throws it into a text control box. However > when 3 or more sensors are running, I get output like the following: > > Sensor 1: 373 > Sensor 2: 112$$M-160$G373 > Sensor 3: 763$$A892$ > > I am fairly certain this means that my code is running too slow to > catch all the '$' markers. Below is the snippet of code I believe is > the cause of this problem... > > def OnSerialRead(self, event): > text = event.data > self.sensorabuffer = self.sensorabuffer + text > self.sensorbbuffer = self.sensorbbuffer + text > self.sensorcbuffer = self.sensorcbuffer + text > > if sensoraenable: > sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer ) > if sensorresult: > s = sensorresult.group(0) > s = s[2:-1] > if self.sensor_enable_chkbox.GetValue(): > self.SensorAValue = s > self.sensorabuffer = '' > > if sensorbenable: > sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable) > if sensorresult: > s = sensorresult.group(0) > s = s[2:-1] > if self.sensor_enable_chkbox.GetValue(): > self.SensorBValue = s > self.sensorbenable= '' > > if sensorcenable: > sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable) > if sensorresult: > s = sensorresult.group(0) > s = s[2:-1] > if self.sensor_enable_chkbox.GetValue(): > self.SensorCValue = s > self.sensorcenable= '' > > self.DisplaySensorReadings() > > I think that regex is too slow for this operation, but I'm uncertain > of another method in python that could be faster. A little help would > be appreciated. You sure that's your code? Your re.search()'s are all the same.
[toc] | [prev] | [next] | [standalone]
| From | "M.Pekala" <mcdpekala@gmail.com> |
|---|---|
| Date | 2012-01-23 14:03 -0800 |
| Message-ID | <b16c3dd9-e7c3-49ba-906c-c1c6999252b0@t7g2000vbg.googlegroups.com> |
| In reply to | #19292 |
On Jan 23, 5:00 pm, Jon Clements <jon...@googlemail.com> wrote: > On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote: > > > > > > > > > > > Hello, I am having some trouble with a serial stream on a project I am > > working on. I have an external board that is attached to a set of > > sensors. The board polls the sensors, filters them, formats the > > values, and sends the formatted values over a serial bus. The serial > > stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor > > value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect... > > > When one sensor is running my python script grabs the data just fine, > > removes the formatting, and throws it into a text control box. However > > when 3 or more sensors are running, I get output like the following: > > > Sensor 1: 373 > > Sensor 2: 112$$M-160$G373 > > Sensor 3: 763$$A892$ > > > I am fairly certain this means that my code is running too slow to > > catch all the '$' markers. Below is the snippet of code I believe is > > the cause of this problem... > > > def OnSerialRead(self, event): > > text = event.data > > self.sensorabuffer = self.sensorabuffer + text > > self.sensorbbuffer = self.sensorbbuffer + text > > self.sensorcbuffer = self.sensorcbuffer + text > > > if sensoraenable: > > sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer ) > > if sensorresult: > > s = sensorresult.group(0) > > s = s[2:-1] > > if self.sensor_enable_chkbox.GetValue(): > > self.SensorAValue = s > > self.sensorabuffer = '' > > > if sensorbenable: > > sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable) > > if sensorresult: > > s = sensorresult.group(0) > > s = s[2:-1] > > if self.sensor_enable_chkbox.GetValue(): > > self.SensorBValue = s > > self.sensorbenable= '' > > > if sensorcenable: > > sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable) > > if sensorresult: > > s = sensorresult.group(0) > > s = s[2:-1] > > if self.sensor_enable_chkbox.GetValue(): > > self.SensorCValue = s > > self.sensorcenable= '' > > > self.DisplaySensorReadings() > > > I think that regex is too slow for this operation, but I'm uncertain > > of another method in python that could be faster. A little help would > > be appreciated. > > You sure that's your code? Your re.search()'s are all the same. Whoops you are right. the search for the second should be re.search(r'\ $B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*', self.sensorcbuffer )
[toc] | [prev] | [next] | [standalone]
| From | Nick Dokos <nicholas.dokos@hp.com> |
|---|---|
| Date | 2012-01-23 18:28 -0500 |
| Message-ID | <mailman.4998.1327361819.27778.python-list@python.org> |
| In reply to | #19293 |
M.Pekala <mcdpekala@gmail.com> wrote: > On Jan 23, 5:00 pm, Jon Clements <jon...@googlemail.com> wrote: > > On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > Hello, I am having some trouble with a serial stream on a project I am > > > working on. I have an external board that is attached to a set of > > > sensors. The board polls the sensors, filters them, formats the > > > values, and sends the formatted values over a serial bus. The serial > > > stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor > > > value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect... > > > > > When one sensor is running my python script grabs the data just fine, > > > removes the formatting, and throws it into a text control box. However > > > when 3 or more sensors are running, I get output like the following: > > > > > Sensor 1: 373 > > > Sensor 2: 112$$M-160$G373 > > > Sensor 3: 763$$A892$ > > > > > I am fairly certain this means that my code is running too slow to > > > catch all the '$' markers. Below is the snippet of code I believe is > > > the cause of this problem... > > > > > def OnSerialRead(self, event): > > > text = event.data > > > self.sensorabuffer = self.sensorabuffer + text > > > self.sensorbbuffer = self.sensorbbuffer + text > > > self.sensorcbuffer = self.sensorcbuffer + text > > > > > if sensoraenable: > > > sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer ) > > > if sensorresult: > > > s = sensorresult.group(0) > > > s = s[2:-1] > > > if self.sensor_enable_chkbox.GetValue(): > > > self.SensorAValue = s > > > self.sensorabuffer = '' > > > > > if sensorbenable: > > > sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable) > > > if sensorresult: > > > s = sensorresult.group(0) > > > s = s[2:-1] > > > if self.sensor_enable_chkbox.GetValue(): > > > self.SensorBValue = s > > > self.sensorbenable= '' > > > > > if sensorcenable: > > > sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable) > > > if sensorresult: > > > s = sensorresult.group(0) > > > s = s[2:-1] > > > if self.sensor_enable_chkbox.GetValue(): > > > self.SensorCValue = s > > > self.sensorcenable= '' > > > > > self.DisplaySensorReadings() > > > > > I think that regex is too slow for this operation, but I'm uncertain > > > of another method in python that could be faster. A little help would > > > be appreciated. > > > > You sure that's your code? Your re.search()'s are all the same. > > Whoops you are right. the search for the second should be re.search(r'\ > $B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*', > self.sensorcbuffer ) > The regex is probably still wrong: r'\$A.*\$.*' will e.g. match all of your initial example "$A1234$$B-10$$C987$", so s will lose the initial and final '$' and end up as "1234$$B-10$$C987" - I doubt that's what you want: >>> sensor_result = "$A123$$B456$$C789$$A456$" >>> r = re.search(r'\$A.*\$.*', sensor_result) >>> s = r.group(0) >>> s = s[2:-1] >>> s '123$$B456$$C789$$A456' Is this perhaps closer to what you want? >>> r = re.search(r'\$A[^$]+\$', sensor_result) >>> r.group(0) '$A123$' >>> I'm sure there are more problems too - e.g. why are there three buffers? If they all start empty and get modified the same way, they will all contain the same string - are they modified differently in the part of the program you have not shown? They will presumably need to be trimmed appropriately to indicate which part has been consumed already. And, as somebody pointed out already, the searches should probably be against the *buffer* variables rather than the *enable* variables. Nick
[toc] | [prev] | [next] | [standalone]
| From | Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> |
|---|---|
| Date | 2012-01-24 00:13 +0100 |
| Message-ID | <jfkpi7$ibv$1@r03.glglgl.gl> |
| In reply to | #19291 |
Am 23.01.2012 22:48 schrieb M.Pekala:
> Hello, I am having some trouble with a serial stream on a project I am
> working on. I have an external board that is attached to a set of
> sensors. The board polls the sensors, filters them, formats the
> values, and sends the formatted values over a serial bus. The serial
> stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
> value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
>
> When one sensor is running my python script grabs the data just fine,
> removes the formatting, and throws it into a text control box. However
> when 3 or more sensors are running, I get output like the following:
>
> Sensor 1: 373
> Sensor 2: 112$$M-160$G373
> Sensor 3: 763$$A892$
>
> I am fairly certain this means that my code is running too slow to
> catch all the '$' markers.
This would just result in the receive buffer constantly growing.
Probably the thing with the RE which has been mentionned by Jon is the
cause.
But I have some remarks to your code.
First, you have code repetition. You could use functions to avoid this.
Second, you have discrepancies between your 3 blocks: with A, you work
with sensorabuffer, the others have sensor[bc]enable.
Third, if you have a buffer content of '$A1234$$B-10$$C987$', your "A
code" will match the whole buffer and thus do
# s = sensorresult.group(0) ->
s = '$A1234$$B-10$$C987$'
# s = s[2:-1]
s = '1234$$B-10$$C987'
# maybe put that into self.SensorAValue
self.sensorabuffer = ''
I suggest the following way to go:
* Process your data only once.
* Do something like
[...]
theonebuffer = '$A1234$$B-10$$C987$' # for now
while True:
sensorresult = re.search(r'\$(.)(.*?)\$(.*)', theonebuffer)
if sensorresult:
sensor, value, rest = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
theonebuffer = rest
else: break # out of the while
If you execute this code, you'll end with a self.sensorvalues of
{'A': '1234', 'C': '987', 'B': '-10'}
and a theonebuffer of ''.
Let's make another test with an incomplete sensor value.
theonebuffer = '$A1234$$B-10$$C987$$D65'
[code above]
-> the dict is the same, but theonebuffer='$D65'.
* Why did I do this? Well, you are constantly receiving data. I do this
with the hope that the $ terminating the D value is yet to come.
* Why does this happen? The regex does not match this incomplete packet,
the while loop terminates (resp. breaks) and the buffer will contain the
last known value.
But you might be right - speed might become a concern if you are
processing your data slower than they come along. Then your buffer fills
up and eventually kills your program due to full memory. As the buffer
fills up, the string copying becomes slower and slower, making things
worse. Whether this becomes relevant, you'll have to test.
BTW, as you use this one regex quite often, a way to speed up could be
to compile the regex. This will change your code to
sensorre = re.compile(r'\$(.)(.*?)\$(.*)')
theonebuffer = '$A1234$$B-10$$C987$' # for now
while True:
sensorresult = sensorre.search(theonebuffer)
if sensorresult:
sensor, value, rest = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
theonebuffer = rest
else: break # out of the while
And finally, you can make use of re.finditer() resp.
sensorre.finditer(). So you can do
sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
theonebuffer = '$A1234$$B-10$$C987$' # for now
sensorresult = None # init it for later
for sensorresult in sensorre.finditer(theonebuffer):
sensor, value = sensorresult.groups()
# replace the self.SensorAValue concept with a dict
self.sensorvalues[sensor] = value
# and now, keep the rest
if sensorresult is not None:
# the for loop has done something - cut out the old stuff
# and keep a possible incomplete packet at the end
theonebuffer = theonebuffer[sensorresult.end():]
This removes the mentionned string copying as source of increased slowness.
HTH,
Thomas
[toc] | [prev] | [next] | [standalone]
| From | Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> |
|---|---|
| Date | 2012-01-24 14:04 +0100 |
| Message-ID | <jfma8g$jib$1@r03.glglgl.gl> |
| In reply to | #19298 |
Am 24.01.2012 00:13 schrieb Thomas Rachel:
[sorry, my Thunderbird kills the indentation]
> And finally, you can make use of re.finditer() resp.
> sensorre.finditer(). So you can do
>
> sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
> theonebuffer = '$A1234$$B-10$$C987$' # for now
>
> sensorresult = None # init it for later
> for sensorresult in sensorre.finditer(theonebuffer):
> sensor, value = sensorresult.groups()
> # replace the self.SensorAValue concept with a dict
> self.sensorvalues[sensor] = value
> # and now, keep the rest
> if sensorresult is not None:
> # the for loop has done something - cut out the old stuff
> # and keep a possible incomplete packet at the end
> theonebuffer = theonebuffer[sensorresult.end():]
>
> This removes the mentionned string copying as source of increased slowness.
But it has one major flaw: If you lose synchronization, it may happen
that only the data *between* the packets is returned - which are mostly
empty strings.
So it would be wise to either change the firmware of the device to use
different characters for starting end ending a packet, or to return
every data between "$"s and discarding the empty strings.
As regexes might be overkill here, we could do
def splitup(theonebuffer):
l = theonebuffer.split("$")
for i in l[:-1]: yield i + "$"
if l: yield l[-1]
sensorvalues = {}
theonebuffer = '1garbage$A1234$$B-10$2garbage$C987$D3' # for now
for part in splitup(theonebuffer):
if not part.endswith("$"):
theonebuffer = part
break # it is the last one which is probably not a full packet
part = part[:-1] # cut off the $
if not part: continue # $$ -> gap between packets
# now part is the contents of one packet which may be valid or not.
# TODO: Do some tests - maybe for valid sensor names and values.
sensor = part[0]
value = part[1:]
sensorvalues[sensor] = value # add the "self." in your case -
# for now, it is better without
Now I get sensorvalues, theonebuffer as ({'1': 'garbage', 'A': '1234',
'2': 'garbage', 'B': '-10', 'C': '987'}, 'D3').
D3 is not (yet) a value; it might come out as D3, D342 or whatever, as
the packet is not complete yet.
Thomas
[toc] | [prev] | [next] | [standalone]
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2012-01-24 10:49 +1100 |
| Message-ID | <mailman.4999.1327363168.27778.python-list@python.org> |
| In reply to | #19291 |
On 23Jan2012 13:48, M.Pekala <mcdpekala@gmail.com> wrote:
| Hello, I am having some trouble with a serial stream on a project I am
| working on. I have an external board that is attached to a set of
| sensors. The board polls the sensors, filters them, formats the
| values, and sends the formatted values over a serial bus. The serial
| stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
| value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
|
| When one sensor is running my python script grabs the data just fine,
| removes the formatting, and throws it into a text control box. However
| when 3 or more sensors are running, I get output like the following:
|
| Sensor 1: 373
| Sensor 2: 112$$M-160$G373
| Sensor 3: 763$$A892$
|
| I am fairly certain this means that my code is running too slow to
| catch all the '$' markers. Below is the snippet of code I believe is
| the cause of this problem...
Your code _is_ slow, but as you can see above you're not missing data,
you're gathering too much data.
Some point by point remarks below. The actual _bug_ is your use of ".*"
in your regexps. Some change suggestions below the code.
| def OnSerialRead(self, event):
| text = event.data
| self.sensorabuffer = self.sensorabuffer + text
| self.sensorbbuffer = self.sensorbbuffer + text
| self.sensorcbuffer = self.sensorcbuffer + text
Slow and memory wasteful. Supposing a sensor never reports? You will
accumulate an ever growing buffer string. And extending a string gets
expensive as it grows.
| if sensoraenable:
| sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
Slow and buggy.
The slow: You're compiling the regular expression _every_ time you come
here (unless the re module caches things, which I seem to recall it may.
But that efficiency is only luck.
The bug: supposing you get multiple sensor reports, like this:
$A1$$B2$$C3$
Your regexp matches the whole thing! Because ".*" is greedy.
You want "[^$]*" - characters that are not a "$".
| if sensorresult:
| s = sensorresult.group(0)
| s = s[2:-1]
| if self.sensor_enable_chkbox.GetValue():
| self.SensorAValue = s
| self.sensorabuffer = ''
What if there are multiple values in the buffer? After fixing your
regexp you will now be throwing them away. Better to go:
self.sensorabuffer = self.sensorabuffer[sensorresult.end():]
[...]
| I think that regex is too slow for this operation, but I'm uncertain
| of another method in python that could be faster. A little help would
| be appreciated.
Regex _is_ slow. It is good for flexible lexing, but generally Not
Fast. It can be faster than in-Python lexing because the inner
interpreation of the regex is C code, but is often overkill when speed
matters. (Which you may find it does not for your app - fix the bugs
first and see how it behaves).
I would be making the following changes if it were me:
- keep only one buffer, and parse it into sensor "tokens"
pass each token to the right sensor as needed
- don't use regexps
this is a speed thing; if you code is more readable with regexps and
still faster enough you may not do this
To these ends, untested attempt 1 (one buffer, lex into tokens, still
using regexps):
re_token = re.compile( r'\$([A-Z])([^$]*)\$' )
def OnSerialRead(self, event):
# accessing a local var is quicker and more readable
buffer = self.buffer
text = event.data
buffer += text
m = re_token.search(buffer)
while m:
sensor, value = m.group(1), m.group(2)
buffer = buffer[m.end():]
if sensor == 'A':
# ...
elif sensor == 'B':
# ...
else:
warning("unsupported sensor: %s", sensor)
# stash the updated buffer for later
self.buffer = buffer
I'm assuming here that you can get noise in the serial stream. If you
are certain to get only clean "$Ax$" sequences and nothing else you can
make the code much simpler. And faster again.
Pretending clean data and no regexps:
def OnSerialRead(self, event):
# accessing a local var is quicker and more readable
buffer = self.buffer
text = event.data
buffer += text
while buffer:
if not buffer.startswith('$'):
raise ValueError("bad data in buffer! code rewrite needed!")
mark2 = buffer.find('$', 1)
if mark2 < 0:
# end of token not present
# look again later
break
token = buffer[1:mark2]
buffer = buffer[mark2+1:]
if not token:
raise ValueError("no data in packet!")
sensorm value = token[1], token[1:]
... hand off to sensors as above ...
Cheers,
--
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/
If your new theorem can be stated with great simplicity, then there
will exist a pathological exception. - Adrian Mathesis
[toc] | [prev] | [next] | [standalone]
| From | "M.Pekala" <mcdpekala@gmail.com> |
|---|---|
| Date | 2012-01-23 17:07 -0800 |
| Message-ID | <9c55c4c7-c218-4ecb-8c9e-3349cb53268a@t7g2000vbg.googlegroups.com> |
| In reply to | #19301 |
On Jan 23, 6:49 pm, Cameron Simpson <c...@zip.com.au> wrote:
> On 23Jan2012 13:48, M.Pekala <mcdpek...@gmail.com> wrote:
> | Hello, I am having some trouble with a serial stream on a project I am
> | working on. I have an external board that is attached to a set of
> | sensors. The board polls the sensors, filters them, formats the
> | values, and sends the formatted values over a serial bus. The serial
> | stream comes out like $A1234$$B-10$$C987$, where "$A.*$" is a sensor
> | value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
> |
> | When one sensor is running my python script grabs the data just fine,
> | removes the formatting, and throws it into a text control box. However
> | when 3 or more sensors are running, I get output like the following:
> |
> | Sensor 1: 373
> | Sensor 2: 112$$M-160$G373
> | Sensor 3: 763$$A892$
> |
> | I am fairly certain this means that my code is running too slow to
> | catch all the '$' markers. Below is the snippet of code I believe is
> | the cause of this problem...
>
> Your code _is_ slow, but as you can see above you're not missing data,
> you're gathering too much data.
>
> Some point by point remarks below. The actual _bug_ is your use of ".*"
> in your regexps. Some change suggestions below the code.
>
> | def OnSerialRead(self, event):
> | text = event.data
> | self.sensorabuffer = self.sensorabuffer + text
> | self.sensorbbuffer = self.sensorbbuffer + text
> | self.sensorcbuffer = self.sensorcbuffer + text
>
> Slow and memory wasteful. Supposing a sensor never reports? You will
> accumulate an ever growing buffer string. And extending a string gets
> expensive as it grows.
>
> | if sensoraenable:
> | sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
>
> Slow and buggy.
>
> The slow: You're compiling the regular expression _every_ time you come
> here (unless the re module caches things, which I seem to recall it may.
> But that efficiency is only luck.
>
> The bug: supposing you get multiple sensor reports, like this:
>
> $A1$$B2$$C3$
>
> Your regexp matches the whole thing! Because ".*" is greedy.
> You want "[^$]*" - characters that are not a "$".
>
> | if sensorresult:
> | s = sensorresult.group(0)
> | s = s[2:-1]
> | if self.sensor_enable_chkbox.GetValue():
> | self.SensorAValue = s
> | self.sensorabuffer = ''
>
> What if there are multiple values in the buffer? After fixing your
> regexp you will now be throwing them away. Better to go:
>
> self.sensorabuffer = self.sensorabuffer[sensorresult.end():]
>
> [...]
> | I think that regex is too slow for this operation, but I'm uncertain
> | of another method in python that could be faster. A little help would
> | be appreciated.
>
> Regex _is_ slow. It is good for flexible lexing, but generally Not
> Fast. It can be faster than in-Python lexing because the inner
> interpreation of the regex is C code, but is often overkill when speed
> matters. (Which you may find it does not for your app - fix the bugs
> first and see how it behaves).
>
> I would be making the following changes if it were me:
>
> - keep only one buffer, and parse it into sensor "tokens"
> pass each token to the right sensor as needed
>
> - don't use regexps
> this is a speed thing; if you code is more readable with regexps and
> still faster enough you may not do this
>
> To these ends, untested attempt 1 (one buffer, lex into tokens, still
> using regexps):
>
> re_token = re.compile( r'\$([A-Z])([^$]*)\$' )
>
> def OnSerialRead(self, event):
> # accessing a local var is quicker and more readable
> buffer = self.buffer
>
> text = event.data
> buffer += text
>
> m = re_token.search(buffer)
> while m:
> sensor, value = m.group(1), m.group(2)
> buffer = buffer[m.end():]
> if sensor == 'A':
> # ...
> elif sensor == 'B':
> # ...
> else:
> warning("unsupported sensor: %s", sensor)
>
> # stash the updated buffer for later
> self.buffer = buffer
>
> I'm assuming here that you can get noise in the serial stream. If you
> are certain to get only clean "$Ax$" sequences and nothing else you can
> make the code much simpler. And faster again.
>
> Pretending clean data and no regexps:
>
> def OnSerialRead(self, event):
> # accessing a local var is quicker and more readable
> buffer = self.buffer
>
> text = event.data
> buffer += text
>
> while buffer:
> if not buffer.startswith('$'):
> raise ValueError("bad data in buffer! code rewrite needed!")
> mark2 = buffer.find('$', 1)
> if mark2 < 0:
> # end of token not present
> # look again later
> break
> token = buffer[1:mark2]
> buffer = buffer[mark2+1:]
>
> if not token:
> raise ValueError("no data in packet!")
> sensorm value = token[1], token[1:]
>
> ... hand off to sensors as above ...
>
> Cheers,
> --
> Cameron Simpson <c...@zip.com.au> DoD#743http://www.cskk.ezoshosting.com/cs/
>
> If your new theorem can be stated with great simplicity, then there
> will exist a pathological exception. - Adrian Mathesis
Okay I altered my code a bit based on all of your suggestions:
self.sensor_re = re.compile(r'\$([A-E])([^$]*)\$')
def OnSerialRead(self, event):
text = event.data
buffer = self.sensorbuffer + text
if self.SensorLogging:
result = self.sensor_re.search(buffer)
if result:
sensor,value = result.groups()
buffer = buffer[result.end():]
if sensor == 'A' and self.sensora_chkbox.GetValue():
self.SensorAValue = value
elif sensor == 'B' and self.sensorb_chkbox.GetValue():
self.SensorBValue = value
elif sensor == 'C' and self.sensorc_chkbox.GetValue():
self.SensorCValue = value
elif sensor == D' and self.sensord_chkbox.GetValue():
self.SensorDValue = value
elif sensor == 'E' and self.sensore_chkbox.GetValue():
self.SensorEValue = value
self.LogSensor()
self.sensorbuffer = buffer
...and it works far better, in so much as the problems I was having
before seem to be gone. I'm probably going to change this to something
without re, though for the time being this will work for me. The
reason there was a whole lot of junk, such as the three buffers,
around this code was because I am somewhat of a novice python
programmer and I was just doing random stuff to see if it would stick.
The reason I was using regex in the first place was that I thought it
would be faster than the python code due its c origins, but apparently
that may not be the case. Regardless thank you all for your help.
Cheers,
Miles Pekala
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-01-24 05:08 +0000 |
| Message-ID | <4f1e3cb5$0$11127$c3e8da3@news.astraweb.com> |
| In reply to | #19301 |
On Tue, 24 Jan 2012 10:49:41 +1100, Cameron Simpson wrote:
> | def OnSerialRead(self, event):
> | text = event.data
> | self.sensorabuffer = self.sensorabuffer + text
> | self.sensorbbuffer = self.sensorbbuffer + text
> | self.sensorcbuffer = self.sensorcbuffer + text
>
> Slow and memory wasteful. Supposing a sensor never reports? You will
> accumulate an ever growing buffer string. And extending a string gets
> expensive as it grows.
I admit I haven't read this entire thread, but one thing jumps out at me.
It looks like the code is accumulating strings by repeated +
concatenation. This is risky.
In general, you should accumulate strings into a list buffer, then join
them into a single string in one call:
buffer = []
while something:
buffer.append(text)
return ''.join(buffer)
Use of repeated string addition risks slow quadratic behaviour. The OP is
reporting slow behaviour... alarms bells ring.
For anyone who doesn't understand what I mean about slow quadratic
behaviour, read this:
http://www.joelonsoftware.com/articles/fog0000000319.html
Recent versions of CPython includes an optimization which *sometimes* can
avoid this poor performance, but it can be defeated easily, and does not
apply to Jython and IronPython, so it is best to not rely on it.
I don't know whether this is the cause of the OP's slow behaviour, but it
is worth investigating. Especially since it is likely to not just be
slow, but SLLLLLOOOOOOWWWWWWWWW -- a bad quadratic algorithm can be tens
of thousands or millions of times slower than it need be.
[...]
> The slow: You're compiling the regular expression _every_ time you come
> here (unless the re module caches things, which I seem to recall it may.
It does.
> But that efficiency is only luck.
More deliberate design than good luck :)
Nevertheless, good design would have you compile the regex once, and not
rely on the re module's cache.
[...]
> Regex _is_ slow. It is good for flexible lexing, but generally Not Fast.
I hope I will never be mistaken for a re fanboy, but credit where credit
is due: if you need the full power of a regex, you almost certainly can't
write anything in Python that will beat the re module.
However, where regexes become a trap is that often people use them for
things which are best coded as simple Python tests that are much faster,
such as using a regex where a simple str.startswith() would do.
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2012-01-24 19:23 +1100 |
| Message-ID | <mailman.5017.1327393428.27778.python-list@python.org> |
| In reply to | #19314 |
On 24Jan2012 05:08, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote: | On Tue, 24 Jan 2012 10:49:41 +1100, Cameron Simpson wrote: | | > | def OnSerialRead(self, event): | > | text = event.data | > | self.sensorabuffer = self.sensorabuffer + text | > | self.sensorbbuffer = self.sensorbbuffer + text | > | self.sensorcbuffer = self.sensorcbuffer + text | > | > Slow and memory wasteful. Supposing a sensor never reports? You will | > accumulate an ever growing buffer string. And extending a string gets | > expensive as it grows. | | I admit I haven't read this entire thread, but one thing jumps out at me. | It looks like the code is accumulating strings by repeated + | concatenation. This is risky. | | In general, you should accumulate strings into a list buffer, then join | them into a single string in one call: | | buffer = [] | while something: | buffer.append(text) | return ''.join(buffer) Yeah, but the OP needs to examine the string after every packet arrival, so he doesn't get to defer the joining of the strings. | Use of repeated string addition risks slow quadratic behaviour. The OP is | reporting slow behaviour... alarms bells ring. He's _inferring_ slow behaviour from the wrong results he was getting. In fact he doesn't have slow behaviour (in that python isn't slow enough in his case to cause trouble). [...snip references to more depth on string concatenation...] | > The slow: You're compiling the regular expression _every_ time you come | > here (unless the re module caches things, which I seem to recall it may. | | It does. | | > But that efficiency is only luck. | | More deliberate design than good luck :) The luck is in his program benefiting from it, not the decision of the re authors to cache the source->compiled mapping. | > Regex _is_ slow. It is good for flexible lexing, but generally Not Fast. | | I hope I will never be mistaken for a re fanboy, but credit where credit | is due: if you need the full power of a regex, you almost certainly can't | write anything in Python that will beat the re module. Indeed not. But many many uses of the re module are for trivial lexing stuff that is way simpler than the complexities involved in regexps themselves (parse regexp, compiler from parse, parse string from regexp compilation). And perhaps equally important, regexps are both cryptic and powerful, a combination that makes it very easy to write either highly expensive regexps or bighly buggy regexps - in his case a small regexp error was his bug, and programs speed had nothing to do with it. His program did have inefficiencies though, which were worth discussing. Cheers, -- Cameron Simpson <cs@zip.com.au> DoD#743 http://www.cskk.ezoshosting.com/cs/ There are two ways of dealing with this problem: one is complicated and messy, and the other is simple and very elegant. We don't have much time left, so I'll just show you the complicated and messy way. - Richard Feynman, 1981
[toc] | [prev] | [next] | [standalone]
| From | Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> |
|---|---|
| Date | 2012-01-24 11:15 +0100 |
| Message-ID | <juc2v8-gi8.ln1@satorlaser.homedns.org> |
| In reply to | #19291 |
Am 23.01.2012 22:48, schrieb M.Pekala:
> I think that regex is too slow for this operation, but I'm uncertain
> of another method in python that could be faster. A little help would
> be appreciated.
Regardless of the outcome here, I would say that your code is still a
bit wonky on the handling of partial data and gaps due to buffer
overflows in the serial port buffers.
I'd start with something like this:
import unittest
class SensorReader(object):
def __init__(self):
pass
def parse(self, buffer):
"""store and parse given input"""
pass
def get_sample(self):
"""retrieve a single sample parsed from input data
This returns a tuple containing the sensor name and the
value. In case some input data was not parseable, it
returns a tuple with an empty string and the unrecognized
data.
"""
pass
class Test(unittest.TestCase):
def setUp(self):
self.handler = SensorReader()
def test_empty_buffer(self):
self.handler.parse(buffer='')
r = self.handler.get_sample()
self.assertEqual(r, None)
def test_input_1(self):
self.handler.parse(buffer='$A1234$')
r = self.handler.get_sample()
self.assertEqual(r, ('A', 1234))
def test_input_2(self):
self.handler.parse(buffer='$A1234$$B5678$')
r = self.handler.get_sample()
self.assertEqual(r, ('A', 1234))
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))
def test_invalid_input(self):
self.handler.parse(buffer='234$$B5678$')
r = self.handler.get_sample()
self.assertEqual(r, ('', '234$'))
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))
def test_partial_input(self):
self.handler.parse(buffer='$B56')
r = self.handler.get_sample()
self.assertEqual(r, None)
self.handler.parse(buffer='78$$C90')
r = self.handler.get_sample()
self.assertEqual(r, ('B', 5678))
Concerning the parsing itself, there is a "find" function in the str
class which you can use. Just search for dollar signs and take the data
in between. If it's empty, the first one is the end and the second one
the start of the next sample, so you can discard the content. Having
tests as above will make it a breeze for you to figure out actual
implementation. You could also start off with regexes and then switch to
linear scanning for speed.
Uli
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web