Groups > comp.lang.python > #19291 > unrolled thread

Parsing a serial stream too slowly

Started by	"M.Pekala" <mcdpekala@gmail.com>
First post	2012-01-23 13:48 -0800
Last post	2012-01-24 11:15 +0100
Articles	11 — 7 participants

Back to article view | Back to comp.lang.python

  Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 13:48 -0800
    Re: Parsing a serial stream too slowly Jon Clements <joncle@googlemail.com> - 2012-01-23 14:00 -0800
      Re: Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 14:03 -0800
        Re: Parsing a serial stream too slowly Nick Dokos <nicholas.dokos@hp.com> - 2012-01-23 18:28 -0500
    Re: Parsing a serial stream too slowly Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-01-24 00:13 +0100
      Re: Parsing a serial stream too slowly Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-01-24 14:04 +0100
    Re: Parsing a serial stream too slowly Cameron Simpson <cs@zip.com.au> - 2012-01-24 10:49 +1100
      Re: Parsing a serial stream too slowly "M.Pekala" <mcdpekala@gmail.com> - 2012-01-23 17:07 -0800
      Re: Parsing a serial stream too slowly Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-01-24 05:08 +0000
        Re: Parsing a serial stream too slowly Cameron Simpson <cs@zip.com.au> - 2012-01-24 19:23 +1100
    Re: Parsing a serial stream too slowly Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-01-24 11:15 +0100

#19291 — Parsing a serial stream too slowly

From	"M.Pekala" <mcdpekala@gmail.com>
Date	2012-01-23 13:48 -0800
Subject	Parsing a serial stream too slowly
Message-ID	<d250b98e-6409-474b-ba56-146e5b490e01@l1g2000vbc.googlegroups.com>

Hello, I am having some trouble with a serial stream on a project I am
working on. I have an external board that is attached to a set of
sensors. The board polls the sensors, filters them, formats the
values, and sends the formatted values over a serial bus. The serial
stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...

When one sensor is running my python script grabs the data just fine,
removes the formatting, and throws it into a text control box. However
when 3 or more sensors are running, I get output like the following:

Sensor 1: 373
Sensor 2: 112$$M-160$G373
Sensor 3: 763$$A892$

I am fairly certain this means that my code is running too slow to
catch all the '$' markers. Below is the snippet of code I believe is
the cause of this problem...

def OnSerialRead(self, event):
	text = event.data
	self.sensorabuffer = self.sensorabuffer + text
	self.sensorbbuffer = self.sensorbbuffer + text
	self.sensorcbuffer = self.sensorcbuffer + text

	if sensoraenable:
		sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
			if sensorresult:
				s = sensorresult.group(0)
				s = s[2:-1]
				if self.sensor_enable_chkbox.GetValue():
					self.SensorAValue = s
				self.sensorabuffer = ''

	if sensorbenable:
		sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
			if sensorresult:
				s = sensorresult.group(0)
				s = s[2:-1]
				if self.sensor_enable_chkbox.GetValue():
					self.SensorBValue = s
				self.sensorbenable= ''

	if sensorcenable:
		sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
			if sensorresult:
				s = sensorresult.group(0)
				s = s[2:-1]
				if self.sensor_enable_chkbox.GetValue():
					self.SensorCValue = s
				self.sensorcenable= ''

        self.DisplaySensorReadings()

I think that regex is too slow for this operation, but I'm uncertain
of another method in python that could be faster. A little help would
be appreciated.

[toc] | [next] | [standalone]

#19292

From	Jon Clements <joncle@googlemail.com>
Date	2012-01-23 14:00 -0800
Message-ID	<760df574-f13d-46e5-93e7-6cc51de5510a@s18g2000vby.googlegroups.com>
In reply to	#19291

On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote:
> Hello, I am having some trouble with a serial stream on a project I am
> working on. I have an external board that is attached to a set of
> sensors. The board polls the sensors, filters them, formats the
> values, and sends the formatted values over a serial bus. The serial
> stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
>
> When one sensor is running my python script grabs the data just fine,
> removes the formatting, and throws it into a text control box. However
> when 3 or more sensors are running, I get output like the following:
>
> Sensor 1: 373
> Sensor 2: 112$$M-160$G373
> Sensor 3: 763$$A892$
>
> I am fairly certain this means that my code is running too slow to
> catch all the '$' markers. Below is the snippet of code I believe is
> the cause of this problem...
>
> def OnSerialRead(self, event):
>         text = event.data
>         self.sensorabuffer = self.sensorabuffer + text
>         self.sensorbbuffer = self.sensorbbuffer + text
>         self.sensorcbuffer = self.sensorcbuffer + text
>
>         if sensoraenable:
>                 sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
>                         if sensorresult:
>                                 s = sensorresult.group(0)
>                                 s = s[2:-1]
>                                 if self.sensor_enable_chkbox.GetValue():
>                                         self.SensorAValue = s
>                                 self.sensorabuffer = ''
>
>         if sensorbenable:
>                 sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
>                         if sensorresult:
>                                 s = sensorresult.group(0)
>                                 s = s[2:-1]
>                                 if self.sensor_enable_chkbox.GetValue():
>                                         self.SensorBValue = s
>                                 self.sensorbenable= ''
>
>         if sensorcenable:
>                 sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
>                         if sensorresult:
>                                 s = sensorresult.group(0)
>                                 s = s[2:-1]
>                                 if self.sensor_enable_chkbox.GetValue():
>                                         self.SensorCValue = s
>                                 self.sensorcenable= ''
>
>         self.DisplaySensorReadings()
>
> I think that regex is too slow for this operation, but I'm uncertain
> of another method in python that could be faster. A little help would
> be appreciated.

You sure that's your code? Your re.search()'s are all the same.

[toc] | [prev] | [next] | [standalone]

#19293

From	"M.Pekala" <mcdpekala@gmail.com>
Date	2012-01-23 14:03 -0800
Message-ID	<b16c3dd9-e7c3-49ba-906c-c1c6999252b0@t7g2000vbg.googlegroups.com>
In reply to	#19292

On Jan 23, 5:00 pm, Jon Clements <jon...@googlemail.com> wrote:
> On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote:
>
>
>
>
>
>
>
>
>
> > Hello, I am having some trouble with a serial stream on a project I am
> > working on. I have an external board that is attached to a set of
> > sensors. The board polls the sensors, filters them, formats the
> > values, and sends the formatted values over a serial bus. The serial
> > stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> > value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
>
> > When one sensor is running my python script grabs the data just fine,
> > removes the formatting, and throws it into a text control box. However
> > when 3 or more sensors are running, I get output like the following:
>
> > Sensor 1: 373
> > Sensor 2: 112$$M-160$G373
> > Sensor 3: 763$$A892$
>
> > I am fairly certain this means that my code is running too slow to
> > catch all the '$' markers. Below is the snippet of code I believe is
> > the cause of this problem...
>
> > def OnSerialRead(self, event):
> >         text = event.data
> >         self.sensorabuffer = self.sensorabuffer + text
> >         self.sensorbbuffer = self.sensorbbuffer + text
> >         self.sensorcbuffer = self.sensorcbuffer + text
>
> >         if sensoraenable:
> >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
> >                         if sensorresult:
> >                                 s = sensorresult.group(0)
> >                                 s = s[2:-1]
> >                                 if self.sensor_enable_chkbox.GetValue():
> >                                         self.SensorAValue = s
> >                                 self.sensorabuffer = ''
>
> >         if sensorbenable:
> >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
> >                         if sensorresult:
> >                                 s = sensorresult.group(0)
> >                                 s = s[2:-1]
> >                                 if self.sensor_enable_chkbox.GetValue():
> >                                         self.SensorBValue = s
> >                                 self.sensorbenable= ''
>
> >         if sensorcenable:
> >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
> >                         if sensorresult:
> >                                 s = sensorresult.group(0)
> >                                 s = s[2:-1]
> >                                 if self.sensor_enable_chkbox.GetValue():
> >                                         self.SensorCValue = s
> >                                 self.sensorcenable= ''
>
> >         self.DisplaySensorReadings()
>
> > I think that regex is too slow for this operation, but I'm uncertain
> > of another method in python that could be faster. A little help would
> > be appreciated.
>
> You sure that's your code? Your re.search()'s are all the same.

Whoops you are right. the search for the second should be re.search(r'\
$B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*',
self.sensorcbuffer )

[toc] | [prev] | [next] | [standalone]

#19300

From	Nick Dokos <nicholas.dokos@hp.com>
Date	2012-01-23 18:28 -0500
Message-ID	<mailman.4998.1327361819.27778.python-list@python.org>
In reply to	#19293

M.Pekala <mcdpekala@gmail.com> wrote:

> On Jan 23, 5:00 pm, Jon Clements <jon...@googlemail.com> wrote:
> > On Jan 23, 9:48 pm, "M.Pekala" <mcdpek...@gmail.com> wrote:
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > > Hello, I am having some trouble with a serial stream on a project I am
> > > working on. I have an external board that is attached to a set of
> > > sensors. The board polls the sensors, filters them, formats the
> > > values, and sends the formatted values over a serial bus. The serial
> > > stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> > > value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
> >
> > > When one sensor is running my python script grabs the data just fine,
> > > removes the formatting, and throws it into a text control box. However
> > > when 3 or more sensors are running, I get output like the following:
> >
> > > Sensor 1: 373
> > > Sensor 2: 112$$M-160$G373
> > > Sensor 3: 763$$A892$
> >
> > > I am fairly certain this means that my code is running too slow to
> > > catch all the '$' markers. Below is the snippet of code I believe is
> > > the cause of this problem...
> >
> > > def OnSerialRead(self, event):
> > >         text = event.data
> > >         self.sensorabuffer = self.sensorabuffer + text
> > >         self.sensorbbuffer = self.sensorbbuffer + text
> > >         self.sensorcbuffer = self.sensorcbuffer + text
> >
> > >         if sensoraenable:
> > >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
> > >                         if sensorresult:
> > >                                 s = sensorresult.group(0)
> > >                                 s = s[2:-1]
> > >                                 if self.sensor_enable_chkbox.GetValue():
> > >                                         self.SensorAValue = s
> > >                                 self.sensorabuffer = ''
> >
> > >         if sensorbenable:
> > >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorbenable)
> > >                         if sensorresult:
> > >                                 s = sensorresult.group(0)
> > >                                 s = s[2:-1]
> > >                                 if self.sensor_enable_chkbox.GetValue():
> > >                                         self.SensorBValue = s
> > >                                 self.sensorbenable= ''
> >
> > >         if sensorcenable:
> > >                 sensorresult = re.search(r'\$A.*\$.*', self.sensorcenable)
> > >                         if sensorresult:
> > >                                 s = sensorresult.group(0)
> > >                                 s = s[2:-1]
> > >                                 if self.sensor_enable_chkbox.GetValue():
> > >                                         self.SensorCValue = s
> > >                                 self.sensorcenable= ''
> >
> > >         self.DisplaySensorReadings()
> >
> > > I think that regex is too slow for this operation, but I'm uncertain
> > > of another method in python that could be faster. A little help would
> > > be appreciated.
> >
> > You sure that's your code? Your re.search()'s are all the same.
> 
> Whoops you are right. the search for the second should be re.search(r'\
> $B.*\$.*', self.sensorbbuffer ), for the third re.search(r'\$C.*\$.*',
> self.sensorcbuffer )
> 

The regex is probably still wrong: r'\$A.*\$.*' will e.g. match all of
your initial example "$A1234$$B-10$$C987$", so s will lose the initial
and final '$' and end up as "1234$$B-10$$C987" - I doubt that's what you
want:

>>> sensor_result = "$A123$$B456$$C789$$A456$"
>>> r = re.search(r'\$A.*\$.*', sensor_result)
>>> s = r.group(0)
>>> s = s[2:-1]
>>> s
'123$$B456$$C789$$A456'

Is this perhaps closer to what you want?

>>> r = re.search(r'\$A[^$]+\$', sensor_result)
>>> r.group(0)
'$A123$'
>>> 

I'm sure there are more problems too - e.g. why are there three buffers?
If they all start empty and get modified the same way, they will all
contain the same string - are they modified differently in the part of
the program you have not shown? They will presumably need to be trimmed
appropriately to indicate which part has been consumed already. And, as
somebody pointed out already, the searches should probably be against
the *buffer* variables rather than the *enable* variables.

Nick

[toc] | [prev] | [next] | [standalone]

#19298

From	Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date	2012-01-24 00:13 +0100
Message-ID	<jfkpi7$ibv$1@r03.glglgl.gl>
In reply to	#19291

Am 23.01.2012 22:48 schrieb M.Pekala:
> Hello, I am having some trouble with a serial stream on a project I am
> working on. I have an external board that is attached to a set of
> sensors. The board polls the sensors, filters them, formats the
> values, and sends the formatted values over a serial bus. The serial
> stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
>
> When one sensor is running my python script grabs the data just fine,
> removes the formatting, and throws it into a text control box. However
> when 3 or more sensors are running, I get output like the following:
>
> Sensor 1: 373
> Sensor 2: 112$$M-160$G373
> Sensor 3: 763$$A892$
>
> I am fairly certain this means that my code is running too slow to
> catch all the '$' markers.

This would just result in the receive buffer constantly growing.

Probably the thing with the RE which has been mentionned by Jon is the 
cause.

But I have some remarks to your code.

First, you have code repetition. You could use functions to avoid this.

Second, you have discrepancies between your 3 blocks: with A, you work 
with sensorabuffer, the others have sensor[bc]enable.

Third, if you have a buffer content of '$A1234$$B-10$$C987$', your "A 
code" will match the whole buffer and thus do

     # s = sensorresult.group(0) ->
     s = '$A1234$$B-10$$C987$'
     # s = s[2:-1]
     s = '1234$$B-10$$C987'
     # maybe put that into self.SensorAValue
     self.sensorabuffer = ''


I suggest the following way to go:

* Process your data only once.
* Do something like

[...]
theonebuffer = '$A1234$$B-10$$C987$' # for now

while True:
     sensorresult = re.search(r'\$(.)(.*?)\$(.*)', theonebuffer)
     if sensorresult:
         sensor, value, rest = sensorresult.groups()
         # replace the self.SensorAValue concept with a dict
         self.sensorvalues[sensor] = value
         theonebuffer = rest
     else: break # out of the while

If you execute this code, you'll end with a self.sensorvalues of

     {'A': '1234', 'C': '987', 'B': '-10'}

and a theonebuffer of ''.


Let's make another test with an incomplete sensor value.

theonebuffer = '$A1234$$B-10$$C987$$D65'

[code above]

-> the dict is the same, but theonebuffer='$D65'.

* Why did I do this? Well, you are constantly receiving data. I do this 
with the hope that the $ terminating the D value is yet to come.

* Why does this happen? The regex does not match this incomplete packet, 
the while loop terminates (resp. breaks) and the buffer will contain the 
last known value.


But you might be right - speed might become a concern if you are 
processing your data slower than they come along. Then your buffer fills 
up and eventually kills your program due to full memory. As the buffer 
fills up, the string copying becomes slower and slower, making things 
worse. Whether this becomes relevant, you'll have to test.

BTW, as you use this one regex quite often, a way to speed up could be 
to compile the regex. This will change your code to

sensorre = re.compile(r'\$(.)(.*?)\$(.*)')
theonebuffer = '$A1234$$B-10$$C987$' # for now

while True:
     sensorresult = sensorre.search(theonebuffer)
     if sensorresult:
         sensor, value, rest = sensorresult.groups()
         # replace the self.SensorAValue concept with a dict
         self.sensorvalues[sensor] = value
         theonebuffer = rest
     else: break # out of the while


And finally, you can make use of re.finditer() resp. 
sensorre.finditer(). So you can do

sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
theonebuffer = '$A1234$$B-10$$C987$' # for now

sensorresult = None # init it for later
for sensorresult in sensorre.finditer(theonebuffer):
     sensor, value = sensorresult.groups()
     # replace the self.SensorAValue concept with a dict
     self.sensorvalues[sensor] = value
# and now, keep the rest
if sensorresult is not None:
     # the for loop has done something - cut out the old stuff
     # and keep a possible incomplete packet at the end
     theonebuffer = theonebuffer[sensorresult.end():]

This removes the mentionned string copying as source of increased slowness.

HTH,

Thomas

[toc] | [prev] | [next] | [standalone]

#19340

From	Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date	2012-01-24 14:04 +0100
Message-ID	<jfma8g$jib$1@r03.glglgl.gl>
In reply to	#19298

Am 24.01.2012 00:13 schrieb Thomas Rachel:

[sorry, my Thunderbird kills the indentation]

> And finally, you can make use of re.finditer() resp.
> sensorre.finditer(). So you can do
>
> sensorre = re.compile(r'\$(.)(.*?)\$') # note the change
> theonebuffer = '$A1234$$B-10$$C987$' # for now
>
> sensorresult = None # init it for later
> for sensorresult in sensorre.finditer(theonebuffer):
> sensor, value = sensorresult.groups()
> # replace the self.SensorAValue concept with a dict
> self.sensorvalues[sensor] = value
> # and now, keep the rest
> if sensorresult is not None:
> # the for loop has done something - cut out the old stuff
> # and keep a possible incomplete packet at the end
> theonebuffer = theonebuffer[sensorresult.end():]
>
> This removes the mentionned string copying as source of increased slowness.

But it has one major flaw: If you lose synchronization, it may happen 
that only the data *between* the packets is returned - which are mostly 
empty strings.

So it would be wise to either change the firmware of the device to use 
different characters for starting end ending a packet, or to return 
every data between "$"s and discarding the empty strings.

As regexes might be overkill here, we could do

def splitup(theonebuffer):
     l = theonebuffer.split("$")
     for i in l[:-1]: yield i + "$"
     if l: yield l[-1]


sensorvalues = {}
theonebuffer = '1garbage$A1234$$B-10$2garbage$C987$D3' # for now
for part in splitup(theonebuffer):
     if not part.endswith("$"):
	theonebuffer = part
         break # it is the last one which is probably not a full packet
     part = part[:-1] # cut off the $
     if not part: continue # $$ -> gap between packets
     # now part is the contents of one packet which may be valid or not.
     # TODO: Do some tests - maybe for valid sensor names and values.
     sensor = part[0]
     value = part[1:]
     sensorvalues[sensor] = value # add the "self." in your case -
     # for now, it is better without

Now I get sensorvalues, theonebuffer as ({'1': 'garbage', 'A': '1234', 
'2': 'garbage', 'B': '-10', 'C': '987'}, 'D3').

D3 is not (yet) a value; it might come out as D3, D342 or whatever, as 
the packet is not complete yet.


Thomas

[toc] | [prev] | [next] | [standalone]

#19301

From	Cameron Simpson <cs@zip.com.au>
Date	2012-01-24 10:49 +1100
Message-ID	<mailman.4999.1327363168.27778.python-list@python.org>
In reply to	#19291

On 23Jan2012 13:48, M.Pekala <mcdpekala@gmail.com> wrote:
| Hello, I am having some trouble with a serial stream on a project I am
| working on. I have an external board that is attached to a set of
| sensors. The board polls the sensors, filters them, formats the
| values, and sends the formatted values over a serial bus. The serial
| stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
| value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
| 
| When one sensor is running my python script grabs the data just fine,
| removes the formatting, and throws it into a text control box. However
| when 3 or more sensors are running, I get output like the following:
| 
| Sensor 1: 373
| Sensor 2: 112$$M-160$G373
| Sensor 3: 763$$A892$
| 
| I am fairly certain this means that my code is running too slow to
| catch all the '$' markers. Below is the snippet of code I believe is
| the cause of this problem...

Your code _is_ slow, but as you can see above you're not missing data,
you're gathering too much data.

Some point by point remarks below. The actual _bug_ is your use of ".*"
in your regexps. Some change suggestions below the code.

| def OnSerialRead(self, event):
| 	text = event.data
| 	self.sensorabuffer = self.sensorabuffer + text
| 	self.sensorbbuffer = self.sensorbbuffer + text
| 	self.sensorcbuffer = self.sensorcbuffer + text

Slow and memory wasteful. Supposing a sensor never reports? You will
accumulate an ever growing buffer string. And extending a string gets
expensive as it grows.

| 	if sensoraenable:
| 		sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )

Slow and buggy.

The slow: You're compiling the regular expression _every_ time you come
here (unless the re module caches things, which I seem to recall it may.
But that efficiency is only luck.

The bug: supposing you get multiple sensor reports, like this:

  $A1$$B2$$C3$

Your regexp matches the whole thing! Because ".*" is greedy.
You want "[^$]*" - characters that are not a "$".


| 			if sensorresult:
| 				s = sensorresult.group(0)
| 				s = s[2:-1]
| 				if self.sensor_enable_chkbox.GetValue():
| 					self.SensorAValue = s
| 				self.sensorabuffer = ''

What if there are multiple values in the buffer? After fixing your
regexp you will now be throwing them away. Better to go:

  self.sensorabuffer = self.sensorabuffer[sensorresult.end():]

[...]
| I think that regex is too slow for this operation, but I'm uncertain
| of another method in python that could be faster. A little help would
| be appreciated.

Regex _is_ slow. It is good for flexible lexing, but generally Not
Fast. It can be faster than in-Python lexing because the inner
interpreation of the regex is C code, but is often overkill when speed
matters. (Which you may find it does not for your app - fix the bugs
first and see how it behaves).

I would be making the following changes if it were me:

  - keep only one buffer, and parse it into sensor "tokens"
    pass each token to the right sensor as needed

  - don't use regexps
    this is a speed thing; if you code is more readable with regexps and
    still faster enough you may not do this

To these ends, untested attempt 1 (one buffer, lex into tokens, still
using regexps):

    re_token = re.compile( r'\$([A-Z])([^$]*)\$' )

    def OnSerialRead(self, event):
        # accessing a local var is quicker and more readable
        buffer = self.buffer

        text = event.data
        buffer += text

        m = re_token.search(buffer)
        while m:
            sensor, value = m.group(1), m.group(2)
            buffer = buffer[m.end():]
            if sensor == 'A':
                # ...
            elif sensor == 'B':
                # ...
            else:
                warning("unsupported sensor: %s", sensor)

        # stash the updated buffer for later
        self.buffer = buffer

I'm assuming here that you can get noise in the serial stream. If you
are certain to get only clean "$Ax$" sequences and nothing else you can
make the code much simpler. And faster again.

Pretending clean data and no regexps:

    def OnSerialRead(self, event):
        # accessing a local var is quicker and more readable
        buffer = self.buffer

        text = event.data
        buffer += text

        while buffer:
            if not buffer.startswith('$'):
                raise ValueError("bad data in buffer! code rewrite needed!")
            mark2 = buffer.find('$', 1)
            if mark2 < 0:
                # end of token not present
                # look again later
                break
            token = buffer[1:mark2]
            buffer = buffer[mark2+1:]

            if not token:
                raise ValueError("no data in packet!")
            sensorm value = token[1], token[1:]

            ... hand off to sensors as above ...

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

If your new theorem can be stated with great simplicity, then there
will exist a pathological exception.    - Adrian Mathesis

[toc] | [prev] | [next] | [standalone]

#19308

From	"M.Pekala" <mcdpekala@gmail.com>
Date	2012-01-23 17:07 -0800
Message-ID	<9c55c4c7-c218-4ecb-8c9e-3349cb53268a@t7g2000vbg.googlegroups.com>
In reply to	#19301

On Jan 23, 6:49 pm, Cameron Simpson <c...@zip.com.au> wrote:
> On 23Jan2012 13:48, M.Pekala <mcdpek...@gmail.com> wrote:
> | Hello, I am having some trouble with a serial stream on a project I am
> | working on. I have an external board that is attached to a set of
> | sensors. The board polls the sensors, filters them, formats the
> | values, and sends the formatted values over a serial bus. The serial
> | stream comes out like $A1234$$B-10$$C987$,  where "$A.*$" is a sensor
> | value, "$B.*$" is a sensor value, "$C.*$" is a sensor value, ect...
> |
> | When one sensor is running my python script grabs the data just fine,
> | removes the formatting, and throws it into a text control box. However
> | when 3 or more sensors are running, I get output like the following:
> |
> | Sensor 1: 373
> | Sensor 2: 112$$M-160$G373
> | Sensor 3: 763$$A892$
> |
> | I am fairly certain this means that my code is running too slow to
> | catch all the '$' markers. Below is the snippet of code I believe is
> | the cause of this problem...
>
> Your code _is_ slow, but as you can see above you're not missing data,
> you're gathering too much data.
>
> Some point by point remarks below. The actual _bug_ is your use of ".*"
> in your regexps. Some change suggestions below the code.
>
> | def OnSerialRead(self, event):
> |       text = event.data
> |       self.sensorabuffer = self.sensorabuffer + text
> |       self.sensorbbuffer = self.sensorbbuffer + text
> |       self.sensorcbuffer = self.sensorcbuffer + text
>
> Slow and memory wasteful. Supposing a sensor never reports? You will
> accumulate an ever growing buffer string. And extending a string gets
> expensive as it grows.
>
> |       if sensoraenable:
> |               sensorresult = re.search(r'\$A.*\$.*', self.sensorabuffer )
>
> Slow and buggy.
>
> The slow: You're compiling the regular expression _every_ time you come
> here (unless the re module caches things, which I seem to recall it may.
> But that efficiency is only luck.
>
> The bug: supposing you get multiple sensor reports, like this:
>
>   $A1$$B2$$C3$
>
> Your regexp matches the whole thing! Because ".*" is greedy.
> You want "[^$]*" - characters that are not a "$".
>
> |                       if sensorresult:
> |                               s = sensorresult.group(0)
> |                               s = s[2:-1]
> |                               if self.sensor_enable_chkbox.GetValue():
> |                                       self.SensorAValue = s
> |                               self.sensorabuffer = ''
>
> What if there are multiple values in the buffer? After fixing your
> regexp you will now be throwing them away. Better to go:
>
>   self.sensorabuffer = self.sensorabuffer[sensorresult.end():]
>
> [...]
> | I think that regex is too slow for this operation, but I'm uncertain
> | of another method in python that could be faster. A little help would
> | be appreciated.
>
> Regex _is_ slow. It is good for flexible lexing, but generally Not
> Fast. It can be faster than in-Python lexing because the inner
> interpreation of the regex is C code, but is often overkill when speed
> matters. (Which you may find it does not for your app - fix the bugs
> first and see how it behaves).
>
> I would be making the following changes if it were me:
>
>   - keep only one buffer, and parse it into sensor "tokens"
>     pass each token to the right sensor as needed
>
>   - don't use regexps
>     this is a speed thing; if you code is more readable with regexps and
>     still faster enough you may not do this
>
> To these ends, untested attempt 1 (one buffer, lex into tokens, still
> using regexps):
>
>     re_token = re.compile( r'\$([A-Z])([^$]*)\$' )
>
>     def OnSerialRead(self, event):
>         # accessing a local var is quicker and more readable
>         buffer = self.buffer
>
>         text = event.data
>         buffer += text
>
>         m = re_token.search(buffer)
>         while m:
>             sensor, value = m.group(1), m.group(2)
>             buffer = buffer[m.end():]
>             if sensor == 'A':
>                 # ...
>             elif sensor == 'B':
>                 # ...
>             else:
>                 warning("unsupported sensor: %s", sensor)
>
>         # stash the updated buffer for later
>         self.buffer = buffer
>
> I'm assuming here that you can get noise in the serial stream. If you
> are certain to get only clean "$Ax$" sequences and nothing else you can
> make the code much simpler. And faster again.
>
> Pretending clean data and no regexps:
>
>     def OnSerialRead(self, event):
>         # accessing a local var is quicker and more readable
>         buffer = self.buffer
>
>         text = event.data
>         buffer += text
>
>         while buffer:
>             if not buffer.startswith('$'):
>                 raise ValueError("bad data in buffer! code rewrite needed!")
>             mark2 = buffer.find('$', 1)
>             if mark2 < 0:
>                 # end of token not present
>                 # look again later
>                 break
>             token = buffer[1:mark2]
>             buffer = buffer[mark2+1:]
>
>             if not token:
>                 raise ValueError("no data in packet!")
>             sensorm value = token[1], token[1:]
>
>             ... hand off to sensors as above ...
>
> Cheers,
> --
> Cameron Simpson <c...@zip.com.au> DoD#743http://www.cskk.ezoshosting.com/cs/
>
> If your new theorem can be stated with great simplicity, then there
> will exist a pathological exception.    - Adrian Mathesis

Okay I altered my code a bit based on all of your suggestions:

	self.sensor_re = re.compile(r'\$([A-E])([^$]*)\$')

	def OnSerialRead(self, event):
		text = event.data
		buffer = self.sensorbuffer + text

		if self.SensorLogging:
			result = self.sensor_re.search(buffer)
			if result:
				sensor,value = result.groups()
				buffer = buffer[result.end():]
				if sensor == 'A' and self.sensora_chkbox.GetValue():
					self.SensorAValue = value
				elif sensor == 'B' and self.sensorb_chkbox.GetValue():
					self.SensorBValue = value
				elif sensor == 'C' and self.sensorc_chkbox.GetValue():
					self.SensorCValue = value
				elif sensor == D' and self.sensord_chkbox.GetValue():
					self.SensorDValue = value
				elif sensor == 'E' and self.sensore_chkbox.GetValue():
					self.SensorEValue = value
				self.LogSensor()
			self.sensorbuffer = buffer

...and it works far better, in so much as the problems I was having
before seem to be gone. I'm probably going to change this to something
without re, though for the time being this will work for me. The
reason there was a whole lot of junk, such as the three buffers,
around this code was because I am somewhat of a novice python
programmer and I was just doing random stuff to see if it would stick.
The reason I was using regex in the first place was that I thought it
would be faster than the python code due its c origins, but apparently
that may not be the case. Regardless thank you all for your help.

Cheers,
Miles Pekala

[toc] | [prev] | [next] | [standalone]

#19314

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-01-24 05:08 +0000
Message-ID	<4f1e3cb5$0$11127$c3e8da3@news.astraweb.com>
In reply to	#19301

On Tue, 24 Jan 2012 10:49:41 +1100, Cameron Simpson wrote:

> | def OnSerialRead(self, event):
> | 	text = event.data
> | 	self.sensorabuffer = self.sensorabuffer + text 
> | 	self.sensorbbuffer = self.sensorbbuffer + text 
> | 	self.sensorcbuffer = self.sensorcbuffer + text
> 
> Slow and memory wasteful. Supposing a sensor never reports? You will
> accumulate an ever growing buffer string. And extending a string gets
> expensive as it grows.

I admit I haven't read this entire thread, but one thing jumps out at me. 
It looks like the code is accumulating strings by repeated + 
concatenation. This is risky.

In general, you should accumulate strings into a list buffer, then join 
them into a single string in one call:

buffer = []
while something:
    buffer.append(text)
return ''.join(buffer)

Use of repeated string addition risks slow quadratic behaviour. The OP is 
reporting slow behaviour... alarms bells ring.

For anyone who doesn't understand what I mean about slow quadratic 
behaviour, read this: 

http://www.joelonsoftware.com/articles/fog0000000319.html

Recent versions of CPython includes an optimization which *sometimes* can 
avoid this poor performance, but it can be defeated easily, and does not 
apply to Jython and IronPython, so it is best to not rely on it.

I don't know whether this is the cause of the OP's slow behaviour, but it 
is worth investigating. Especially since it is likely to not just be 
slow, but SLLLLLOOOOOOWWWWWWWWW -- a bad quadratic algorithm can be tens 
of thousands or millions of times slower than it need be.

[...]
> The slow: You're compiling the regular expression _every_ time you come
> here (unless the re module caches things, which I seem to recall it may.

It does.

> But that efficiency is only luck.

More deliberate design than good luck :)

Nevertheless, good design would have you compile the regex once, and not 
rely on the re module's cache.

[...]
> Regex _is_ slow. It is good for flexible lexing, but generally Not Fast.

I hope I will never be mistaken for a re fanboy, but credit where credit 
is due: if you need the full power of a regex, you almost certainly can't 
write anything in Python that will beat the re module. 

However, where regexes become a trap is that often people use them for 
things which are best coded as simple Python tests that are much faster, 
such as using a regex where a simple str.startswith() would do.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#19328

From	Cameron Simpson <cs@zip.com.au>
Date	2012-01-24 19:23 +1100
Message-ID	<mailman.5017.1327393428.27778.python-list@python.org>
In reply to	#19314

On 24Jan2012 05:08, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
| On Tue, 24 Jan 2012 10:49:41 +1100, Cameron Simpson wrote:
| 
| > | def OnSerialRead(self, event):
| > | 	text = event.data
| > | 	self.sensorabuffer = self.sensorabuffer + text 
| > | 	self.sensorbbuffer = self.sensorbbuffer + text 
| > | 	self.sensorcbuffer = self.sensorcbuffer + text
| > 
| > Slow and memory wasteful. Supposing a sensor never reports? You will
| > accumulate an ever growing buffer string. And extending a string gets
| > expensive as it grows.
| 
| I admit I haven't read this entire thread, but one thing jumps out at me. 
| It looks like the code is accumulating strings by repeated + 
| concatenation. This is risky.
| 
| In general, you should accumulate strings into a list buffer, then join 
| them into a single string in one call:
| 
| buffer = []
| while something:
|     buffer.append(text)
| return ''.join(buffer)

Yeah, but the OP needs to examine the string after every packet arrival,
so he doesn't get to defer the joining of the strings.

| Use of repeated string addition risks slow quadratic behaviour. The OP is 
| reporting slow behaviour... alarms bells ring.

He's _inferring_ slow behaviour from the wrong results he was getting.
In fact he doesn't have slow behaviour (in that python isn't slow enough
in his case to cause trouble).

[...snip references to more depth on string concatenation...]

| > The slow: You're compiling the regular expression _every_ time you come
| > here (unless the re module caches things, which I seem to recall it may.
| 
| It does.
| 
| > But that efficiency is only luck.
| 
| More deliberate design than good luck :)

The luck is in his program benefiting from it, not the decision of the
re authors to cache the source->compiled mapping.

| > Regex _is_ slow. It is good for flexible lexing, but generally Not Fast.
| 
| I hope I will never be mistaken for a re fanboy, but credit where credit 
| is due: if you need the full power of a regex, you almost certainly can't 
| write anything in Python that will beat the re module. 

Indeed not. But many many uses of the re module are for trivial lexing
stuff that is way simpler than the complexities involved in regexps
themselves (parse regexp, compiler from parse, parse string from regexp
compilation).

And perhaps equally important, regexps are both cryptic and powerful, a
combination that makes it very easy to write either highly expensive
regexps or bighly buggy regexps - in his case a small regexp error was
his bug, and programs speed had nothing to do with it.

His program did have inefficiencies though, which were worth discussing.

Cheers,
-- 
Cameron Simpson <cs@zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

There are two ways of dealing with this problem: one is complicated and
messy, and the other is simple and very elegant. We don't have much time
left, so I'll just show you the complicated and messy way.
- Richard Feynman, 1981

[toc] | [prev] | [next] | [standalone]

#19336

From	Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com>
Date	2012-01-24 11:15 +0100
Message-ID	<juc2v8-gi8.ln1@satorlaser.homedns.org>
In reply to	#19291

Am 23.01.2012 22:48, schrieb M.Pekala:
> I think that regex is too slow for this operation, but I'm uncertain
> of another method in python that could be faster. A little help would
> be appreciated.

Regardless of the outcome here, I would say that your code is still a 
bit wonky on the handling of partial data and gaps due to buffer 
overflows in the serial port buffers.

I'd start with something like this:

import unittest

class SensorReader(object):
     def __init__(self):
         pass
     def parse(self, buffer):
         """store and parse given input"""
         pass
     def get_sample(self):
         """retrieve a single sample parsed from input data

         This returns a tuple containing the sensor name and the
         value. In case some input data was not parseable, it
         returns a tuple with an empty string and the unrecognized
         data.

         """
         pass

class Test(unittest.TestCase):
     def setUp(self):
         self.handler = SensorReader()

     def test_empty_buffer(self):
         self.handler.parse(buffer='')
         r = self.handler.get_sample()
         self.assertEqual(r, None)

     def test_input_1(self):
         self.handler.parse(buffer='$A1234$')
         r = self.handler.get_sample()
         self.assertEqual(r, ('A', 1234))

     def test_input_2(self):
         self.handler.parse(buffer='$A1234$$B5678$')
         r = self.handler.get_sample()
         self.assertEqual(r, ('A', 1234))
         r = self.handler.get_sample()
         self.assertEqual(r, ('B', 5678))

     def test_invalid_input(self):
         self.handler.parse(buffer='234$$B5678$')
         r = self.handler.get_sample()
         self.assertEqual(r, ('', '234$'))
         r = self.handler.get_sample()
         self.assertEqual(r, ('B', 5678))

     def test_partial_input(self):
         self.handler.parse(buffer='$B56')
         r = self.handler.get_sample()
         self.assertEqual(r, None)
         self.handler.parse(buffer='78$$C90')
         r = self.handler.get_sample()
         self.assertEqual(r, ('B', 5678))

Concerning the parsing itself, there is a "find" function in the str 
class which you can use. Just search for dollar signs and take the data 
in between. If it's empty, the first one is the end and the second one 
the start of the next sample, so you can discard the content. Having 
tests as above will make it a breeze for you to figure out actual 
implementation. You could also start off with regexes and then switch to 
linear scanning for speed.


Uli

[toc] | [prev] | [standalone]

csiph-web

Parsing a serial stream too slowly

Contents

#19291 — Parsing a serial stream too slowly

#19292

#19293

#19300

#19298

#19340

#19301

#19308

#19314

#19328

#19336