From: bruce
Newsgroups: comp.lang.python
Subject: Re: splitting file/content into lines based on regex termination
Date: Thu, 7 Nov 2013 13:45:36 -0500
To: python-list@python.org
In-Reply-To: <527BD83A.7020604@mrabarnett.plus.com>

hi.

thanks for the reply. tried what you suggested. what I see now is that
I print out the lines, but not the regex data at all.

my initial try gave me the line, and then the next items, followed by
the next line, etc...

what I then tried was to do a capture/findall of the regex, and combine
the outputs in separate loops, which will be ugly but will work....

ff= "byu2.dat"
#fff= "sdsu2.dat"
with open(ff,"r") as myfile:
  s=myfile.read()

s=s.replace(" ", "")
#with open(fff,"w") as myfile2:
#  myfile2.write(s)

#
#45 / 58#0#
#
#45 / 58#0#
#dat1=re.compile("
#(\d+) / (\d+)#(\d+)#").search(s).findall()
dat1=re.findall("
#(\d+) / (\d+)#(\d+)#",s)
dat=re.compile("
#(\d+) / (\d+)#(\d+)#").split(s)
dat2 = re.compile(r"
#\d+ / \d+#\d+#").split(s)
#dat=re.split('("
#(\d+) / (\d+)#(\d+)#")',s)
#dat=re.compile("
#(\d+)").split(s)

for m in dat:
  if m:
    print "m = "+m
#sys.exit()

print "dat1"
print dat1
print len(dat1)
print "dat2a"
#sys.exit()

# for m in dat1:
#   if m:
#     print "m = "+m
#
#
#sys.exit()

for m in dat2:
  if m:
    print "m = "+m
#sys.exit()

sys.exit()
return

the test data is pasted to -->>> http://bpaste.net/show/kYzBUIfhc5023phOVmcu/

thanks !!

On Thu, Nov 7, 2013 at 1:13 PM, MRAB wrote:
> On 07/11/2013 17:45, bruce wrote:
>>
>> update...
>>
>> dat=re.compile("
#(\d+) / (\d+)#(\d+)#").split(s)
>>
>> almost works..
>>
>> except i get
>> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>> William#3#MWF
#08:00am
#08:50am
#3718 HBLL
>> m = 45
>> m = 58
>> m = 0
>> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>> William#3#MWF
#09:00am
#09:50am
#3718 HBLL
>> m = 9
>> m = 58
>> m = 0
>>
>> and what i want is:
>> m = 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>> William#3#MWF
#08:00am
#08:50am
#3718 HBLL 45 / 58,0
>> m = 10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>> William#3#MWF
#09:00am
#09:50am
#3718 HBLL 9 / 58,0
>>
>>
>> so i'd have the results of the "compile/regex process" to be added to
>> the split lines
>>
>> thoughts/comments??
>>
>> thanks
>>
> The split method also returns what's matched in any capture groups,
> i.e. "(\d+)". Try omitting the parentheses:
>
> dat = re.compile(r"
#\d+ / \d+#\d+#").split(s)
>
> You should also be using raw string literals as above (r"..."). It
> doesn't matter in this instance, but it might in others.
>
>>
>>
>> On Thu, Nov 7, 2013 at 12:15 PM, bruce wrote:
>>>
>>> hi.
>>>
>>> got a test file with the sample content listed below:
>>>
>>> the content is one long string, and needs to be split into separate lines
>>>
>>> I'm thinking the pattern to split on should be a kind of regex like::
>>>
#45 / 58#0#
>>> or
>>>
#9 / 58#0
>>> but i have no idea how to make this happen!!
>>>
>>> if i read the content into a buf -> s
>>>
>>> import re
>>> dat = re.compile("what goes here??").split(s)
>>>
>>> --i'm not sure what goes in the compile() to get the process to work..
>>>
>>> thoughts/comments would be helpful.
>>>
>>> thanks
>>>
>>>
>>> test dat::
>>> 10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett,
>>> William#3#MWF
#08:00am
#08:50am
#3718 HBLL
#45 /
>>> 58#0#10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett,
>>> William#3#MWF
#09:00am
#09:50am
#3718 HBLL
#9 /
>>> 58#0#10178#000#C S#S#124##001##DAY#Computer Systems#Roper,
>>> Paul#3#MWF
#11:00am
#11:50am
#1170 TMCB
#41 /
>>> 145#0#10178#000#C S#S#124##002##DAY#Computer Systems#Roper,
>>> Paul#3#MWF
#2:00pm
#2:50pm
#1170 TMCB
#40 /
>>> 120#0#01489#002#C S#S#142##001##DAY#Intro to Computer
>>> Programming#Burton, Robert
Seppi, Kevin
>> />
>
>>
> > --
> https://mail.python.org/mailman/listinfo/python-list
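
For anyone reading this thread later, here is a rough sketch of the behaviour MRAB describes: `re.split` puts the text of every capture group into its result, which is why `45`, `58` and `0` came back as separate items, and dropping the parentheses discards them entirely. One possible way to keep each record together with its counts is to walk the matches with `re.finditer` instead of splitting. Everything specific here is an assumption made for illustration: the `SEP` placeholder stands in for whatever token precedes `#45 / 58#0#` in the real file (`<br>` is only a guess), the sample string is shortened and rebuilt by hand, and `split_records` is a hypothetical helper name, not part of the original script.

```python
import re

# Assumption: SEP stands in for the token that precedes each field in the
# real file; "<br>" is a guess used only for this illustration.
SEP = "<br>"

# Shortened, hand-built sample in the same shape as the posted test data.
# "|" is just a stand-in that gets swapped for SEP.
s = (
    "10116#000#C S#S#100##001##DAY#Fund of Computing#Barrett, William#3#MWF"
    "|#08:00am|#08:50am|#3718 HBLL|#45 / 58#0#"
    "10116#000#C S#S#100##002##DAY#Fund of Computing#Barrett, William#3#MWF"
    "|#09:00am|#09:50am|#3718 HBLL|#9 / 58#0#"
).replace("|", SEP)

# Raw string literals, as MRAB recommends, so backslashes reach the regex intact.
with_groups = re.compile(re.escape(SEP) + r"#(\d+) / (\d+)#(\d+)#")
without_groups = re.compile(re.escape(SEP) + r"#\d+ / \d+#\d+#")

# split() with capture groups interleaves the captured text with the pieces;
# without groups only the pieces between terminators come back.
pieces_with = with_groups.split(s)
pieces_without = without_groups.split(s)


def split_records(text):
    """Return a list of (record, first, second, third) tuples.

    Each record is the text before a terminator; the three numbers are the
    terminator's captured digits, kept as strings.
    """
    rows = []
    start = 0
    for match in with_groups.finditer(text):
        rows.append((text[start:match.start()],) + match.groups())
        start = match.end()
    return rows


rows = split_records(s)
for record, first, second, third in rows:
    # Mirrors the "what i want" layout from the thread.
    print("m = %s %s / %s,%s" % (record, first, second, third))
```

Using `finditer` avoids the two separate loops over `split` and `findall` results, since each match object carries both its position and its groups. If the terminator in the real data differs from the assumed `SEP`, only that one constant needs changing.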