From: Patricia Shanahan
Newsgroups: comp.lang.java.programmer
Subject: Re: regex capability
Date: Mon, 04 Apr 2011 05:25:22 -0700
Message-ID: <5eGdnckj3q8mJQTQnZ2dnUVZ_jWdnZ2d@earthlink.com>

On 4/4/2011 5:03 AM, Eric Sosman wrote:
> On 4/4/2011 3:50 AM, Roedy Green wrote:
>> On Mon, 04 Apr 2011 02:34:30 -0500, Leif Roar Moldskred
>> wrote, quoted or indirectly quoted someone who
>> said :
>>
>>>
>>> Easiest is to just use split. You can always do a regex of the type
>>> "(\\d+)/((\\d+)/)?((\\d+)/)?((\\d+)/)?" but that's just pointlessly
>>> complicated. There's no reason why you should use a regex when "normal"
>>> string parsing is simpler and easier to read.
>>
>> (xxx|yyy)+ seems to generate only one group item, no matter how many
>> repetitions there are.
>> That strikes me as a bug, but likely someone
>> can explain why it is a feature or inevitability.
>
> A (section of a) regex matches a (section of a) string, and the
> Matcher machinery can tell you what substring was matched. The
> machinery has no provision for doing further processing on that
> matched substring, like saying "Oh, your regex didn't match a
> string this time, but an array of strings."
>
> You could, perhaps, cook up substitutes for Pattern and Matcher
> to do such a thing. But I'm not sure you'd want to, because it
> could make the API rather complicated. For example, consider a
> fanex (for "fancy expression," like "regular expression" only
> more so) along the lines of "(pat1)(pat2)" where "pat1" and "pat2"
> can match and return arrays of substrings. The FancyMatcher says
> "I matched five substrings." So you call group(3) to get the
> third of them -- was it matched by "pat1" or by "pat2"? Yes, you
> could invent an API to deal with this -- maybe FancyMatcher returns
> a tree of nodes that point to other nodes and/or to substrings --
> but I'm not confident this would be an unqualified improvement.
>

Not only would it make the API complicated, but it would also encourage
a problem I've already seen in code posted in newsgroups - use of
regexes that are very complicated and messy, just for the sake of
fitting a complete job into one regex.

Sometimes a single regex match really is the simplest, cleanest, most
readable way of expressing some data extraction. Quite often, it is not.

Patricia
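
For anyone who wants to see the behaviour discussed in this thread, here
is a minimal sketch. The sample input "12/34/56/", the class name and the
method names are assumptions made up for illustration; the original
question did not give its exact data. The sketch shows that a repeated
capturing group in java.util.regex retains only its last repetition, and
that either a find() loop or a plain split recovers every piece.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample data are illustrative only.
class RepeatedGroupDemo {

    // Matches the whole input against (\d+/)+ and returns group(1).
    // A repeated group keeps only the text of its final repetition.
    static String lastRepetition(String input) {
        Matcher m = Pattern.compile("(\\d+/)+").matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    // Collects every run of digits by calling find() repeatedly
    // with a simple pattern instead of one all-encompassing regex.
    static List<String> viaFind(String input) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile("\\d+").matcher(input);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    // The "just use split" approach: String.split drops trailing
    // empty strings, so the final "/" adds no extra element.
    static List<String> viaSplit(String input) {
        return Arrays.asList(input.split("/"));
    }

    public static void main(String[] args) {
        String input = "12/34/56/";
        System.out.println(lastRepetition(input)); // 56/
        System.out.println(viaFind(input));        // [12, 34, 56]
        System.out.println(viaSplit(input));       // [12, 34, 56]
    }
}
```

Both alternatives stay readable as the number of fields grows, which is
the point made above about not forcing a complete job into one regex.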