Re: splitting file/content into lines based on regex termination

In-Reply-To	<527BD83A.7020604@mrabarnett.plus.com>
References	<CAP16ngqrVAnJPxBXi8B-cAL_Q+yr47pQs1WxhvCHLM8oeVKpsg@mail.gmail.com> <CAP16ngpgdF=uYr5j8OLtBZqEmsUw9f6XPg-5Rd8HpzhBjoSpyw@mail.gmail.com> <527BD83A.7020604@mrabarnett.plus.com>
Date	Thu, 7 Nov 2013 13:45:36 -0500
Subject	Re: splitting file/content into lines based on regex termination
From	bruce <badouglas@gmail.com>
To	python-list@python.org
Content-Type	text/plain; charset=ISO-8859-1
Newsgroups	comp.lang.python
Message-ID	<mailman.2149.1383849945.18130.python-list@python.org>


hi.

thanks for the reply.

I tried what you suggested. What I see now is that the lines print,
but the regex data (the matched numbers) is gone entirely. My initial
try gave me the line, then the three captured items, then the next
line, and so on.
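
For illustration, a minimal sketch of the difference between the two
split calls. The sample string is shortened and made up (it is not the
real file), and the variable names are only for this example:

```python
import re

# shortened, made-up sample in the same shape as the real data
s = "lineA <br>#45 / 58#0#lineB <br>#9 / 58#0#"

# capture groups in the pattern: split() also returns the matched groups
with_groups = re.split(r"<br>#(\d+) / (\d+)#(\d+)#", s)

# no capture groups: split() returns only the text between separators
without_groups = re.split(r"<br>#\d+ / \d+#\d+#", s)

print(with_groups)
print(without_groups)
```

The first list interleaves each line with its three numbers; the second
has the lines only, plus a trailing empty string because the data ends
with a separator.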

What I then tried was to capture the numbers with findall and combine
the two outputs in separate loops, which will be ugly but will
work:

  import re
  import sys

  ff = "byu2.dat"
  #fff = "sdsu2.dat"
  with open(ff, "r") as myfile:
    s = myfile.read()

  s = s.replace("&nbsp", "")

  #with open(fff, "w") as myfile2:
  #  myfile2.write(s)

  # the separator looks like:
  #   <br>#45 / 58#0#

  # dat1: only the captured numbers, one 3-tuple per separator
  dat1 = re.findall(r"<br>#(\d+) / (\d+)#(\d+)#", s)
  # dat: split with capture groups, so lines and numbers are interleaved
  dat = re.compile(r"<br>#(\d+) / (\d+)#(\d+)#").split(s)
  # dat2: split without capture groups (as suggested), so lines only
  dat2 = re.compile(r"<br>#\d+ / \d+#\d+#").split(s)
  #dat = re.compile(r"<br>#(\d+)").split(s)

  for m in dat:
    if m:
      print("m = " + m)

  print("dat1")
  print(dat1)
  print(len(dat1))
  print("dat2a")

  for m in dat2:
    if m:
      print("m = " + m)

  sys.exit()
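
A rough sketch of the combine step done in a single loop instead of
two, assuming every separator carries exactly three numbers and sits on
one line. SEP and split_records are made-up names for this example, and
the sample string is shortened, not the real data:

```python
import re

# separator with three capture groups, e.g. "<br>#45 / 58#0#"
SEP = r"<br>#(\d+) / (\d+)#(\d+)#"

def split_records(s):
    """Return one combined string per record: 'text a / b,c'."""
    # parts is [text, a, b, c, text, a, b, c, ..., tail]
    parts = re.split(SEP, s)
    records = []
    # step through in groups of four; the final tail piece is skipped
    for i in range(0, len(parts) - 1, 4):
        text, a, b, c = parts[i:i + 4]
        records.append("%s %s / %s,%s" % (text.strip(), a, b, c))
    return records

sample = "lineA <br>#45 / 58#0#lineB <br>#9 / 58#0#"
for rec in split_records(sample):
    print("m = " + rec)
```

This relies on split() returning the captured groups right after the
text they terminate, so each line and its numbers stay together without
a second findall pass.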


The test data is pasted at http://bpaste.net/show/kYzBUIfhc5023phOVmcu/

Thanks!!


On Thu, Nov 7, 2013 at 1:13 PM, MRAB <python@mrabarnett.plus.com> wrote:
> On 07/11/2013 17:45, bruce wrote:
>>
>> update...
>>
>>    dat=re.compile("<br>#(\d+) / (\d+)#(\d+)#").split(s)
>>
>> almost works..
>>
>> except i get
>> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>> William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL
>> m = 45
>> m = 58
>> m = 0
>> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>> William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL
>> m = 9
>> m = 58
>> m = 0
>>
>> and what i want is:
>> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>> William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL 45 / 58,0
>> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>> William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL 9 / 58,0
>>
>>
>> so i'd have the results of the "compile/regex process" to be added to
>> the split lines
>>
>> thoughts/comments??
>>
>> thanks
>>
> The split method also returns what's matched in any capture groups,
> i.e. "(\d+)". Try omitting the parentheses:
>
>     dat = re.compile(r"<br>#\d+ / \d+#\d+#").split(s)
>
> You should also be using raw string literals as above (r"..."). It
> doesn't matter in this instance, but it might in others.
>
>>
>>
>> On Thu, Nov 7, 2013 at 12:15 PM, bruce <badouglas@gmail.com> wrote:
>>>
>>> hi.
>>>
>>> got a test file with the sample content listed below:
>>>
>>> the content is one long string, and needs to be split into separate lines
>>>
>>> I'm thinking the pattern to split on should be a kind of regex like::
>>> <br>#45 / 58#0#
>>> or
>>> <br>#9 / 58#0
>>> but i have no idea how to make this happen!!
>>>
>>> if i read the content into a buf -> s
>>>
>>> import re
>>> dat = re.compile("what goes here??").split(s)
>>>
>>> --i'm not sure what goes in the compile() to get the process to work..
>>>
>>> thoughts/comments would be helpful.
>>>
>>> thanks
>>>
>>>
>>> test dat::
>>> 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>>> William#3#MWF<br>#08:00am<br>#08:50am<br>#3718 HBLL <br>#45 /
>>> 58#0#10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>>> William#3#MWF<br>#09:00am<br>#09:50am<br>#3718 HBLL <br>#9 /
>>> 58#0#10178#000#C S#S#124##001##DAY#Computer Systems#Roper,
>>> Paul#3#MWF<br>#11:00am<br>#11:50am<br>#1170 TMCB <br>#41 /
>>> 145#0#10178#000#C S#S#124##002##DAY#Computer Systems#Roper,
>>> Paul#3#MWF<br>#2:00pm<br>#2:50pm<br>#1170 TMCB <br>#40 /
>>> 120#0#01489#002#C S#S#142##001##DAY#Intro to Computer
>>> Programming#Burton, Robert <div class='instructors'>Seppi, Kevin<br
>>> /></div><span
>>
>>
>

