Groups > comp.lang.java.help > #2197 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2012-10-29 13:46 -0700 |
| Last post | 2012-10-31 07:33 -0700 |
| Articles | 16 — 5 participants |
regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-29 13:46 -0700
Re: regex puzzle Lew <lewbloch@gmail.com> - 2012-10-29 14:48 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-30 05:59 -0700
Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-10-30 14:48 +0100
Re: regex puzzle markspace <-@.> - 2012-10-30 14:16 -0700
Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-10-31 09:54 +0100
Re: regex puzzle markspace <-@.> - 2012-10-31 11:25 -0700
Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-01 13:56 +0100
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-11-01 18:46 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:09 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:11 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 16:22 -0700
Re: regex puzzle markspace <-@.> - 2012-10-31 17:29 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-11-01 18:43 -0700
Re: regex puzzle Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-10-30 16:39 -0700
Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:33 -0700
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-29 13:46 -0700 |
| Subject | regex puzzle |
| Message-ID | <olqt88d9p21pf9nau0j4pke7kmhq08u5o4@4ax.com> |
Let's say you wanted to find strings of the form

&quot;cat&quot;
&quot;na&iuml;ve&quot;

What sort of regex would you use? Or would you resort to custom code?
--
Roedy Green Canadian Mind Products http://mindprod.com
Let all things be done decently and in order. ~ I Corinthians 14:40.
Apparently, Jehovah disapproves of Java Threads.
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-10-29 14:48 -0700 |
| Message-ID | <f7a2eec3-eca9-468a-8d8a-3d8bf360a530@googlegroups.com> |
| In reply to | #2197 |
Roedy Green wrote:
> Let's say you wanted to find strings of the form
> &quot;cat&quot;

Do you mean "cat" including the quotation marks?

> &quot;na&iuml;ve&quot;

Do you mean "naïve" including the quotation marks? Or do you mean to
literally find the escape characters, &quot;, e.g., as an HTML parser might?

> What sort of regex would you use? Or would you resort to custom code?

Are you trying to find all substrings either "cat" or "naïve", or just
one or the other for any given search? Or does "of the form" mean
something else?

If you're looking for ways to parse HTML escape characters, you could just
look for the '&' character, then match against a table such as the one on
http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php
--
Lew
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-30 05:59 -0700 |
| Message-ID | <pgiv881g37e73fek318423bvrmtncgto4e@4ax.com> |
| In reply to | #2198 |
On Mon, 29 Oct 2012 14:48:28 -0700 (PDT), Lew <lewbloch@gmail.com>
wrote, quoted or indirectly quoted someone who said :

>including the quotation marks?

I am scanning postable HTML, trying to convert things surrounded in
&quot; to a style, no embedded space allowed, but embedded entity
allowed to be left intact.

e.g.
&quot;cat&quot; (hex 2671756F743B6361742671756F743B in ASCII.)
to <span class="quoted">cat</span> From there the style will be
refined manually.

&quot;na&iuml;ve&quot; (hex 2671756F743B6E612669756D6C3B76652671756F743B in ASCII)
to <span class="quoted">na&iuml;ve</span>

The tricky part is the search where & inside is ok, even though the
terminator also starts with &.

This is further complicated by the fact I have not yet written a regex
utility that uses Java conventions. Funduc has its own quite different
ones. I thought one proffered solution could be modified for Funduc,
or I might cook up a one-shot Java utility with hard-coded constants
to avoid the problem of having to deal with escaping parameters.
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-10-30 14:48 +0100 |
| Message-ID | <slrnk8vmks.svs.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #2200 |
On 2012-10-30 12:59, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Mon, 29 Oct 2012 14:48:28 -0700 (PDT), Lew <lewbloch@gmail.com>
> wrote, quoted or indirectly quoted someone who said :
>>including the quotation marks?
>
> I am scanning postable HTML trying to convert things surrounded in
> &quot; to a style, no embedded space allowed, but embedded entity
> allowed to be left intact.
>
> e.g.
> &quot;cat&quot; (hex 2671756F743B6361742671756F743B in ASCII.)
> to <span class="quoted">cat</span> From there the style will be
> refined manually.
Java Regexps seem to be Perl-compatible, so
s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
should do the trick.
At least unless you have HTML code like this
<img src="cat.jpg" alt="image of a &quot;cat&quot;">
This would be translated into
<img src="cat.jpg" alt="image of a <span class="quoted">cat</span>">
which isn't valid HTML.
It is possible to handle that in a regexp, but this would be really
cumbersome. If you want to process HTML, use a proper HTML parser.
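The attribute problem described above can be reproduced with a short sketch. The class name `AttributeMangleDemo` and the helper `quoteToSpan` are made up for illustration; the pattern is the one from the post, with the `&quot;` entity spelled out and the backslash doubled for a Java string literal.

```java
final class AttributeMangleDemo {
    // Hypothetical helper: wraps &quot;...&quot; runs (no whitespace inside) in a span.
    static String quoteToSpan(String html) {
        return html.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
    }

    public static void main(String[] args) {
        // In running text the replacement is what was asked for,
        // and an embedded entity such as &iuml; is left intact.
        System.out.println(quoteToSpan("a &quot;cat&quot; sat"));
        System.out.println(quoteToSpan("&quot;na&iuml;ve&quot;"));
        // Inside an attribute value the same replacement produces invalid HTML.
        System.out.println(quoteToSpan("<img src=\"cat.jpg\" alt=\"image of a &quot;cat&quot;\">"));
    }
}
```

The regex has no notion of being inside a tag, which is why the last call nests a span inside the alt attribute.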
hp
--
_ | Peter J. Holzer | Curse of electronic word processing:
|_|_) | Sysadmin WSR | One keeps polishing one's text until
| | | hjp@hjp.at | the parts of the sentence no longer
__/ | http://www.hjp.at/ | fits together. -- Ralph Babel
| From | markspace <-@.> |
|---|---|
| Date | 2012-10-30 14:16 -0700 |
| Message-ID | <k6pg3f$bgs$1@dont-email.me> |
| In reply to | #2201 |
On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>
> Java Regexps seem to be Perl-compatible, so
>
> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>
I don't think this will work, in the general case. What about input like:
Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
You'll end up with one replacement, where I think Roedy would require
two. However, that's a good point as Roedy didn't show any examples
involving white-space. Some clarification beyond simple examples is in
order, methinks.
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-10-31 09:54 +0100 |
| Message-ID | <slrnk91pqj.v21.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #2202 |
On 2012-10-30 21:16, markspace <-@> wrote:
> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>
>>
>> Java Regexps seem to be Perl-compatible, so
>>
>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>
>
> I don't think this will work, in the general case. What about input like:
>
> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>
> You'll end up with one replacement,
No, that should be two replacements: The *? operator is non-greedy, so
\\S*? matches the shortest possible sequence of non-space characters.
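package at.hjp.regexptest;

public class RegExpTest {
    /**
     * Shows that the reluctant quantifier yields two replacements.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        // The input contains the literal entity &quot;, not the " character.
        String s = "Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;";
        String s1 = s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
        System.out.println(s1);
    }
}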
prints:
Hi<span class="quoted">I'm-a</span>-dash-seperated-<span class="quoted">string.</span>
hp
| From | markspace <-@.> |
|---|---|
| Date | 2012-10-31 11:25 -0700 |
| Message-ID | <k6rqea$ps4$1@dont-email.me> |
| In reply to | #2204 |
On 10/31/2012 1:54 AM, Peter J. Holzer wrote:
> On 2012-10-30 21:16, markspace <-@> wrote:
>> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>>
>>>
>>> Java Regexps seem to be Perl-compatible, so
>>>
>>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>>
>>
>> I don't think this will work, in the general case. What about input like:
>>
>> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>>
>> You'll end up with one replacement,
>
> No, that should be two replacements: The *? operator is non-greedy, so
> \\S*? matches the shortest possible sequence of non-space characters.
OK, what about the obverse? What if the quoted string contains
whitespace?
Not trying to bug you, but regex is tricky, and I don't often see it as
an ideal solution, so I'm trying to learn its corner cases.
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-11-01 13:56 +0100 |
| Message-ID | <slrnk94sc9.5vl.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #2209 |
On 2012-10-31 18:25, markspace <-@> wrote:
> On 10/31/2012 1:54 AM, Peter J. Holzer wrote:
>> On 2012-10-30 21:16, markspace <-@> wrote:
>>> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>>>> Java Regexps seem to be Perl-compatible, so
>>>>
>>>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>>>
>>>
>>> I don't think this will work, in the general case. What about input like:
>>>
>>> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>>>
>>> You'll end up with one replacement,
>>
>> No, that should be two replacements: The *? operator is non-greedy, so
>> \\S*? matches the shortest possible sequence of non-space characters.
>
>
> OK, what about the obverse? What if the quoted string contains
> whitespace?
Then it won't match. /\S/ matches any non-space character (it's the
opposite of /\s/, which matches any space character). If you want to
match any character, use /./.
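A small sketch of the difference between `\S` and `.` in this context follows. The class and method names are invented for the example, and the input is assumed to contain the literal `&quot;` entity as in the earlier posts.

```java
final class WhitespaceInQuoteDemo {
    static final String SPAN = "<span class=\"quoted\">$1</span>";

    // \S*? cannot cross whitespace, so a quoted phrase with a space is left alone.
    static String nonSpaceOnly(String html) {
        return html.replaceAll("&quot;(\\S*?)&quot;", SPAN);
    }

    // .*? accepts any character except line terminators (the default behaviour).
    static String anyChar(String html) {
        return html.replaceAll("&quot;(.*?)&quot;", SPAN);
    }

    public static void main(String[] args) {
        String phrase = "&quot;black cat&quot;";
        System.out.println(nonSpaceOnly(phrase)); // input comes back unchanged
        System.out.println(anyChar(phrase));      // phrase is wrapped in the span
    }
}
```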
> Not trying to bug you, but regex is tricky, and I don't often see it as
> an ideal solution, so I'm trying to learn its corner cases.
As Jamie Zawinski once quipped:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
Regular expressions are a tool. There are situations where they are the
right tool and situations where they aren't. When processing HTML, they
are usually the wrong tool. I already gave one example where a simple
regexp like this fails. The problem is that unless the HTML is tightly
controlled (in this case: No &quot;...&quot; sequences in parameters)
you have to build a complete HTML lexer into the regexp. This is
possible, but cumbersome[1]. But for a one-time job that may not be a
problem: If you have to convert 100 files, a regexp which converts 95 of
them correctly and mangles 5 of them may be a better solution than a
program which handles them all correctly, but takes longer to write than
correcting the 5 mangled files manually.
hp
[1] I demonstrated this some time ago in the German Perl newsgroup.
It wasn't even as bad as I expected, but then Perl makes it easy to
write readable regexps.
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-01 18:46 -0700 |
| Message-ID | <4b969852fg89jec1ju07or57djh5l68ncg@4ax.com> |
| In reply to | #2212 |
On Thu, 1 Nov 2012 13:56:41 +0100, "Peter J. Holzer"
<hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
said :
> If you want to
>match any character, use /./.
Remember to use Pattern.compile("xxx", Pattern.DOTALL) to control
whether \n is considered "any character".
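The effect of the flag can be seen in a minimal sketch. `DotallDemo` and `wrap` are illustrative names only; `Pattern.compile(String, int)` and `Pattern.DOTALL` are the standard `java.util.regex` API.

```java
import java.util.regex.Pattern;

final class DotallDemo {
    // Compiles the any-character variant of the pattern with the given flags.
    static String wrap(String html, int flags) {
        return Pattern.compile("&quot;(.*?)&quot;", flags)
                .matcher(html)
                .replaceAll("<span class=\"quoted\">$1</span>");
    }

    public static void main(String[] args) {
        String twoLines = "&quot;black\ncat&quot;";
        // Without DOTALL, '.' stops at the newline and nothing matches.
        System.out.println(wrap(twoLines, 0));
        // With DOTALL, '.' also matches the newline, so the quote is wrapped.
        System.out.println(wrap(twoLines, Pattern.DOTALL));
    }
}
```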
--
Roedy Green Canadian Mind Products http://mindprod.com
Ironically, even though the Internet was created by the US military
[DARPA (Defense Advanced Research Projects Agency)]
to withstand a nuclear attack, it is almost defenceless against malice
from any of its users
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-31 07:09 -0700 |
| Message-ID | <hoa29851ekjh96mpk1de5c63cogsr2ue42@4ax.com> |
| In reply to | #2202 |
On Tue, 30 Oct 2012 14:16:29 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;

Happily I don't have to deal with that possibility. Some time ago I
wrote a lint program for HTML to detect unbalanced quotes of various
flavours. HTML/Unicode requires some guessing to do this.

Some day I hope someone invents a whole new scheme for escaping and
marking nested dialogue that uses colour or font or automatic
rendering to indicate depth, hiding the problem of absolute depth when
writing markup.

I would also like the ’ character to be cloned, so there are two
slightly different characters used, one for contractions and one for
marking dialogue. The fact that the same character/entity is used for
entirely different purposes creates a mess when trying to parse.

In HTML, &quot;xxx&quot; was the original way of doing things. I have
been getting fancier over time, with 66..99 style quotes, colour coding
for term, quoted words, and so-called semantics. So this means a
tedious process of converting the &quot;...&quot; form or the “...”
form to styles. I do it in two stages: convert to style "quoted", then
search for class="quoted" and manually decide to change some of them to
class="term" or class="socalled". Single words (without embedded
spaces) are statistically most likely to be "socalled" or "term". I
take that into account to save a little work.

I left out all this about why I was doing this, thinking it would just
be a distraction from the problem.
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-31 07:11 -0700 |
| Message-ID | <e9c2989a47kmn6ndfgn6glqaoduqngd0k7@4ax.com> |
| In reply to | #2201 |
On Tue, 30 Oct 2012 14:48:12 +0100, "Peter J. Holzer"
<hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
said :
>s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>
>should do the trick.
Unfortunately Funduc does not support greedy vs. reluctant quantifiers.
That leaves four options:
1. find a command/gui search replace with Java syntax.
2. write one.
3. write some one-shot Java code that uses a Regex.
4. write some one-shot Java code that uses a parser.
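Option 3 might look roughly like the sketch below: the pattern and replacement are hard-coded constants, so nothing has to be escaped on a command line. The class name `OneShotQuoteFixer`, its methods, and the choice of UTF-8 are assumptions for illustration, not anything taken from the thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

final class OneShotQuoteFixer {
    // Hard-coded constants: edit and recompile for each one-shot job.
    static final Pattern QUOTED = Pattern.compile("&quot;(\\S*?)&quot;");
    static final String REPLACEMENT = "<span class=\"quoted\">$1</span>";

    // Applies the search/replace pair to one string.
    static String fix(String html) {
        return QUOTED.matcher(html).replaceAll(REPLACEMENT);
    }

    // Rewrites the file in place only when something changed; returns true if it did.
    // Assumes the files are UTF-8 encoded.
    static boolean fixFile(Path file) throws IOException {
        String before = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        String after = fix(before);
        if (after.equals(before)) {
            return false;
        }
        Files.write(file, after.getBytes(StandardCharsets.UTF_8));
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as a file to process.
        for (String name : args) {
            System.out.println(name + (fixFile(Paths.get(name)) ? ": changed" : ": unchanged"));
        }
    }
}
```

Like any plain regex pass, this shares the attribute-value weakness raised earlier in the thread, so it suits files that are known not to contain `&quot;` inside tags.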
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-31 16:22 -0700 |
| Message-ID | <0sb3981redva1j08nfo2cv52fco0ku70rf@4ax.com> |
| In reply to | #2206 |
On Wed, 31 Oct 2012 07:11:47 -0700, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

I found a utility that might do, PowerRegex. But they want about $150
for it, and there are hints you have to rebuy it for every bug release.
It seems GUI-focused and DEBE.

>2. write one

Let's say I do this and offer it open source, free. I have written a
number of student regex projects. Implementing them all would be quite
a hunk of work. I am trying to think of a minimalist start, enough to
solve my current problem, still be useful to others, and open to
gradual evolution.

I thought it would be script driven. I need a format for a script that
1. describes which files to apply it to.
2. supplies pairs of search/replace regexes.
3. allows search-only.
4. deals with regexes that end in spaces. Presume you can trust
editors to leave them in peace.

It would need the ability to suppress replace, just search.
It would allow you to run single step, or run to end of file, or end of
dir, or everything.
Regex itself handles binary.

Later add:
- some way of dealing with \uxxxx in regexes.
- handling of non-regex, case-insensitive/case-sensitive searches
without having to embed commands in regexes, and negative wildcards.
- a GUI to edit the script.
- an HTML module where space, spaces, tab, nl all match space in
searches.
- a composer/proofreader that uses colour to let you know which chars
are being used as literals/commands.
- anonymous one-shot scripts.
- the ability to touch up searched text in the middle of a search with
a minimal text editor, or launch your favourite with line
number/offset.
- undo.

Any ideas on an ideal format for an extensible script file?
--
Roedy Green Canadian Mind Products http://mindprod.com
When discussing on the Internet, anything you say is presumed to
contradict someone else. If you are not, it is wise to explicitly
state that you agree with or are elaborating on what someone else
said. You can do this efficiently by starting your post with the word
"Further" or "Also".
| From | markspace <-@.> |
|---|---|
| Date | 2012-10-31 17:29 -0700 |
| Message-ID | <k6sfp5$4or$1@dont-email.me> |
| In reply to | #2210 |
On 10/31/2012 4:22 PM, Roedy Green wrote:
>
>> 2. write one
>
> Let's say I do this and offer it open source, free.

Why not use Java?
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-01 18:43 -0700 |
| Message-ID | <0i86985p3pa4ld7jlkguliutg6e1t3ff1c@4ax.com> |
| In reply to | #2211 |
On Wed, 31 Oct 2012 17:29:24 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>> Let's say I do this and offer it open source, free.
>Why not use Java?

Last night I wrote a prototype search/replace utility in Java. It does
not have a UI or script format. I temporarily hard code constants and
run it under IntelliJ. It is surprisingly easy to use. The IntelliJ
IDE itself helps you proofread your regexes. It lets me search or
search/replace regexes in multiple files with or without confirmation.

I discovered the problem of cleaning up HTML quoting is much trickier
than I first thought. Because " and ’ have multiple purposes, their
meaning depends on context. Regexes are not up to the task of dealing
with that. In HTML you want to treat comments, text and tag fields
differently.

A feature I am thinking of implementing soon is multiple choice
replace. While replacing you hit key 0 to leave as is, 1 to replace, 2
to replace with an alternate string, 3 with another alternate etc. So
often what I am doing is making a finer distinction, e.g. converting
class="politician" to class="democrat" or class="republican" or
leaving as is. I would be able to do it in one pass.

Another feature that should be easy to do is to allow hitting F to
continue replacements without confirm till the end of this file, D to
the end of this directory or E for everything (with confirm).

In C there is getc for picking off a single keystroke. Is there an
easy way to do this in Java in a command line program, so you don't
have to hit Enter every time?

I found I had to pre-condition my files using QEV to detect potential
balancing errors http://mindprod.com/products1.html#QEV before I set
regexes loose. Otherwise they "flummox". Even then they have to be
carefully watched to make sure they do not wreck things.
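The multiple choice replace described above could be sketched with `Matcher.appendReplacement`. Everything named here (`MultiChoiceReplace`, `replace`, the `chooser` callback) is hypothetical; the sketch takes the choice from a callback instead of the keyboard, which sidesteps the single-keystroke question (as far as I know, the standard library has no portable unbuffered key read).

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MultiChoiceReplace {
    /**
     * For each match, asks chooser for a number: 0 leaves the match as is,
     * k >= 1 substitutes alternatives.get(k - 1).
     */
    static String replace(String text, Pattern pattern, List<String> alternatives,
                          ToIntFunction<String> chooser) {
        Matcher m = pattern.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int choice = chooser.applyAsInt(m.group());
            String replacement = (choice == 0) ? m.group() : alternatives.get(choice - 1);
            // quoteReplacement keeps any '$' or '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("class=\"politician\"");
        List<String> alts = Arrays.asList("class=\"democrat\"", "class=\"republican\"");
        String text = "<a class=\"politician\">A</a> <a class=\"politician\">B</a> "
                + "<a class=\"politician\">C</a>";
        // Scripted answers standing in for keystrokes: 1, then 2, then 0.
        int[] scripted = {1, 2, 0};
        int[] next = {0};
        System.out.println(replace(text, p, alts, match -> scripted[next[0]++]));
    }
}
```

A console front end would only need to supply a different `chooser`, for example one that reads a line from standard input and parses the digit.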
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2012-10-30 16:39 -0700 |
| Message-ID | <czZjs.7584$pn7.4092@newsfe18.iad> |
| In reply to | #2197 |
On 10/29/12 1:46 PM, Roedy Green wrote:
> Let's say you wanted to find strings of the form
> &quot;cat&quot;
> &quot;na&iuml;ve&quot;
>
> What sort of regex would you use? Or would you resort to custom code?
>

I would resort to StringEscapeUtils
<http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html>
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-10-31 07:33 -0700 |
| Message-ID | <rgc298tulslrs15h894gd2csjncobq5ai1@4ax.com> |
| In reply to | #2203 |
On Tue, 30 Oct 2012 16:39:51 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>I would resort to StringEscapeUtils

So your idea is to convert most of the entities to Unicode, then apply
your regex, then convert most of them back.