


regex puzzle

Started by: Roedy Green <see_website@mindprod.com.invalid>
First post: 2012-10-29 13:46 -0700
Last post: 2012-10-31 07:33 -0700
Articles: 16 (5 participants)



Contents

  regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-29 13:46 -0700
    Re: regex puzzle Lew <lewbloch@gmail.com> - 2012-10-29 14:48 -0700
      Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-30 05:59 -0700
        Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-10-30 14:48 +0100
          Re: regex puzzle markspace <-@.> - 2012-10-30 14:16 -0700
            Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-10-31 09:54 +0100
              Re: regex puzzle markspace <-@.> - 2012-10-31 11:25 -0700
                Re: regex puzzle "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-01 13:56 +0100
                  Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-11-01 18:46 -0700
            Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:09 -0700
          Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:11 -0700
            Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 16:22 -0700
              Re: regex puzzle markspace <-@.> - 2012-10-31 17:29 -0700
                Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-11-01 18:43 -0700
    Re: regex puzzle Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-10-30 16:39 -0700
      Re: regex puzzle Roedy Green <see_website@mindprod.com.invalid> - 2012-10-31 07:33 -0700

#2197 — regex puzzle

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-29 13:46 -0700
Subject: regex puzzle
Message-ID: <olqt88d9p21pf9nau0j4pke7kmhq08u5o4@4ax.com>
Let's say you wanted to find strings of the form
 &quot;cat&quot;
 &quot;na&iuml;ve&quot;

What sort of regex would you use?  or would you resort to custom code?

-- 
Roedy Green Canadian Mind Products http://mindprod.com
Let all things be done decently and in order. ~ I Corinthians 14:40.
Apparently, Jehovah disapproves of Java Threads.



#2198

From: Lew <lewbloch@gmail.com>
Date: 2012-10-29 14:48 -0700
Message-ID: <f7a2eec3-eca9-468a-8d8a-3d8bf360a530@googlegroups.com>
In reply to: #2197
Roedy Green wrote:
> Let's say you wanted to find strings of the form
>  &quot;cat&quot;

Do you mean 

   "cat"

including the quotation marks?

>  &quot;na&iuml;ve&quot;

Do you mean 

   "naïve"

including the quotation marks?

Or do you mean to literally find the escape characters, &quot;, e.g., as an HTML 
parser might?

> What sort of regex would you use?  or would you resort to custom code?

Are you trying to find all substrings either "cat" or "naïve", or just one or the other 
for any given search?

Or does "of the form" mean something else?

If you're looking for ways to parse HTML escape characters, you could just look for the 
'&' character then match against a table such as the one on 
http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

-- 
Lew



#2200

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-30 05:59 -0700
Message-ID: <pgiv881g37e73fek318423bvrmtncgto4e@4ax.com>
In reply to: #2198
On Mon, 29 Oct 2012 14:48:28 -0700 (PDT), Lew <lewbloch@gmail.com>
wrote, quoted or indirectly quoted someone who said :

>including the quotation marks?
 
I am scanning postable HTML, trying to convert text surrounded by
&quot; into a styled span. No embedded spaces are allowed, but
embedded entities are allowed and must be left intact.

e.g.
&quot;cat&quot; (hex 2671756F743B6361742671756F743B in ASCII)
becomes <span class="quoted">cat</span>. From there the style will be
refined manually.

&quot;na&iuml;ve&quot; (hex
2671756F743B6E612669756D6C3B76652671756F743B in ASCII) becomes
<span class="quoted">na&iuml;ve</span>
 

The tricky part is the search where & inside is ok, even though the
terminator also starts with &.

This is further complicated by the fact that I have not yet written a
regex utility that uses Java conventions. Funduc has its own, quite
different ones. I thought a proffered solution could be modified for
Funduc, or I might cook up a one-shot Java utility with hard-coded
constants to avoid the problem of having to deal with escaping
parameters.




#2201

From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Date: 2012-10-30 14:48 +0100
Message-ID: <slrnk8vmks.svs.hjp-usenet2@hrunkner.hjp.at>
In reply to: #2200
On 2012-10-30 12:59, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Mon, 29 Oct 2012 14:48:28 -0700 (PDT), Lew <lewbloch@gmail.com>
> wrote, quoted or indirectly quoted someone who said :
>>including the quotation marks?
>  
> I am scanning postable HTML trying to convert things surrounded in
> &quot; to a style, no embedded space allowed, but embedded entity
> allowed to be left intact.
>
> e.g.
> &quot;cat&quot; (hex  2671756F743B6361742671756F743B  in  ASCII.)
> to <span class="quoted">cat</span>  From there the style will be
> refined manually.

Java Regexps seem to be Perl-compatible, so 

s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");

should do the trick.

At least unless you have HTML code like this

<img src="cat.jpg" alt="image of a &quot;cat&quot;">

This would be translated into 

<img src="cat.jpg" alt="image of a <span class="quoted">cat</span>">

which isn't valid HTML.

It is possible to handle that in a regexp, but this would be really
cumbersome. If you want to process HTML, use a proper HTML parser.
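
A rough sketch of a middle ground between the bare replaceAll and a real
parser: apply the replacement only to the text between tags, and copy the
tags through untouched. The class name, the method names and the TAG
pattern are assumptions made for illustration. TAG is a crude
approximation that breaks on a literal > inside an attribute value, on
comments and on script blocks, so this is not a substitute for an HTML
parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedOutsideTags {
    // Crude stand-in for "an HTML tag": everything from < to the next >.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern QUOTED = Pattern.compile("&quot;(\\S*?)&quot;");

    /** Rewrites &quot;word&quot; in text segments only; tags are copied verbatim. */
    public static String convert(String html) {
        StringBuilder out = new StringBuilder();
        Matcher tag = TAG.matcher(html);
        int pos = 0;
        while (tag.find()) {
            out.append(rewriteText(html.substring(pos, tag.start())));
            out.append(tag.group());          // leave attribute values alone
            pos = tag.end();
        }
        out.append(rewriteText(html.substring(pos)));
        return out.toString();
    }

    private static String rewriteText(String text) {
        return QUOTED.matcher(text).replaceAll("<span class=\"quoted\">$1</span>");
    }

    public static void main(String[] args) {
        String in = "<img src=\"cat.jpg\" alt=\"image of a &quot;cat&quot;\"> a &quot;cat&quot;";
        // prints: <img src="cat.jpg" alt="image of a &quot;cat&quot;"> a <span class="quoted">cat</span>
        System.out.println(convert(in));
    }
}
```

The alt attribute keeps its &quot; entities because the whole tag is
copied as one piece; only the trailing text segment is rewritten.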

	hp


-- 
   _  | Peter J. Holzer    | Curse of electronic word processing:
|_|_) | Sysadmin WSR       | You keep filing away at your text until
| |   | hjp@hjp.at         | the sentence parts of the sentence no
__/   | http://www.hjp.at/ | longer fits together. -- Ralph Babel



#2202

From: markspace <-@.>
Date: 2012-10-30 14:16 -0700
Message-ID: <k6pg3f$bgs$1@dont-email.me>
In reply to: #2201
On 10/30/2012 6:48 AM, Peter J. Holzer wrote:

>
> Java Regexps seem to be Perl-compatible, so
>
> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>

I don't think this will work, in the general case.  What about input like:

Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;

You'll end up with one replacement, where I think Roedy would require 
two.  However, that's a good point as Roedy didn't show any examples 
involving white-space.  Some clarification beyond simple examples is in 
order, methinks.




#2204

From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Date: 2012-10-31 09:54 +0100
Message-ID: <slrnk91pqj.v21.hjp-usenet2@hrunkner.hjp.at>
In reply to: #2202
On 2012-10-30 21:16, markspace <-@> wrote:
> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>
>>
>> Java Regexps seem to be Perl-compatible, so
>>
>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>
>
> I don't think this will work, in the general case.  What about input like:
>
> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>
> You'll end up with one replacement,

No, that should be two replacements: The *? operator is non-greedy, so
\\S*? matches the shortest possible sequence of non-space characters.


package at.hjp.regexptest;

public class RegExpTest {

	/**
	 * @param args
	 */
	public static void main(String[] args) {
		String s = "Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;";
		String s1 = s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
		System.out.println(s1);

	}

}

prints:

Hi<span class="quoted">I'm-a</span>-dash-seperated-<span class="quoted">string.</span>
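
For contrast, a small sketch that runs the same input through the greedy
form \\S* and the reluctant form \\S*?. The class and method names are
made up for the illustration; the greedy variant produces the single
replacement that was feared, because \\S* runs on to the last &quot; it
can reach without crossing whitespace.

```java
public class GreedyVsReluctant {
    private static final String REPLACEMENT = "<span class=\"quoted\">$1</span>";

    /** Greedy: \S* takes as much as possible, then backs off to the LAST &quot;. */
    public static String greedy(String s) {
        return s.replaceAll("&quot;(\\S*)&quot;", REPLACEMENT);
    }

    /** Reluctant: \S*? takes as little as possible, stopping at the NEXT &quot;. */
    public static String reluctant(String s) {
        return s.replaceAll("&quot;(\\S*?)&quot;", REPLACEMENT);
    }

    public static void main(String[] args) {
        String s = "Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;";
        // prints: Hi<span class="quoted">I'm-a&quot;-dash-seperated-&quot;string.</span>
        System.out.println(greedy(s));
        // prints: Hi<span class="quoted">I'm-a</span>-dash-seperated-<span class="quoted">string.</span>
        System.out.println(reluctant(s));
    }
}
```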

	hp




#2209

From: markspace <-@.>
Date: 2012-10-31 11:25 -0700
Message-ID: <k6rqea$ps4$1@dont-email.me>
In reply to: #2204
On 10/31/2012 1:54 AM, Peter J. Holzer wrote:
> On 2012-10-30 21:16, markspace <-@> wrote:
>> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>>
>>>
>>> Java Regexps seem to be Perl-compatible, so
>>>
>>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>>
>>
>> I don't think this will work, in the general case.  What about input like:
>>
>> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>>
>> You'll end up with one replacement,
>
> No, that should be two replacements: The *? operator is non-greedy, so
> \\S*? matches the shortest possible sequence of non-space characters.


OK, what about the obverse?  What if the quoted string contains 
whitespace?

Not trying to bug you, but regex is tricky, and I don't often see it as
an ideal solution, so I'm trying to learn its corner cases.




#2212

From: "Peter J. Holzer" <hjp-usenet2@hjp.at>
Date: 2012-11-01 13:56 +0100
Message-ID: <slrnk94sc9.5vl.hjp-usenet2@hrunkner.hjp.at>
In reply to: #2209
On 2012-10-31 18:25, markspace <-@> wrote:
> On 10/31/2012 1:54 AM, Peter J. Holzer wrote:
>> On 2012-10-30 21:16, markspace <-@> wrote:
>>> On 10/30/2012 6:48 AM, Peter J. Holzer wrote:
>>>> Java Regexps seem to be Perl-compatible, so
>>>>
>>>> s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>>>>
>>>
>>> I don't think this will work, in the general case.  What about input like:
>>>
>>> Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;
>>>
>>> You'll end up with one replacement,
>>
>> No, that should be two replacements: The *? operator is non-greedy, so
>> \\S*? matches the shortest possible sequence of non-space characters.
>
>
> OK, what about the obverse?  What if the quoted string contains 
> whitespace?

Then it won't match. /\S/ matches any non-space character (it's the
opposite of /\s/, which matches any space character). If you want to
match any character, use /./. 
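
A short sketch of that difference in Java, with class, field and method
names invented for the example: the \\S version finds nothing when the
quoted text contains a space, while the . version does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpaceInsideQuotes {
    public static final Pattern NO_SPACE = Pattern.compile("&quot;(\\S*?)&quot;");
    public static final Pattern ANY_CHAR = Pattern.compile("&quot;(.*?)&quot;");

    /** Returns the first quoted text found by the pattern, or null if none. */
    public static String firstQuoted(Pattern p, String s) {
        Matcher m = p.matcher(s);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "a &quot;black cat&quot; sat";
        System.out.println(firstQuoted(NO_SPACE, s)); // prints: null
        System.out.println(firstQuoted(ANY_CHAR, s)); // prints: black cat
    }
}
```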


> Not trying to bug you, but regex is tricky, and I don't often see it as
> an ideal solution, so I'm trying to learn its corner cases.

As Jamie Zawinski once quipped:

    Some people, when confronted with a problem, think "I know, I'll use
    regular expressions." Now they have two problems. 

Regular expressions are a tool. There are situations where they are the
right tool and situations where they aren't. When processing HTML, they
are usually the wrong tool. I already gave one example where a simple
regexp like this fails. The problem is that unless the HTML is tightly
controlled (in this case: no &quot;...&quot; sequences in attribute
values) you have to build a complete HTML lexer into the regexp. This is
possible, but cumbersome[1]. But for a one-time job that may not be a
problem: If you have to convert 100 files, a regexp which converts 95 of
them correctly and mangles 5 of them may be a better solution than a
program which handles them all correctly, but takes longer to write than
correcting the 5 mangled files manually.

	hp

[1] I demonstrated this some time ago in the German Perl newsgroup. 
    It wasn't even as bad as I expected, but then Perl makes it easy to
    write readable regexps.




#2216

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-11-01 18:46 -0700
Message-ID: <4b969852fg89jec1ju07or57djh5l68ncg@4ax.com>
In reply to: #2212
On Thu, 1 Nov 2012 13:56:41 +0100, "Peter J. Holzer"
<hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
said :

> If you want to
>match any character, use /./. 

Remember to use Pattern.compile("xxx", Pattern.DOTALL) to control
whether \n is considered "any character".
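
A minimal sketch of the effect, using a made-up class name and a quoted
string that spans two lines. Without the flag, . stops at the line
terminator and the pattern finds nothing.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    /** True if a &quot;...&quot; pair is found when compiling with the given flags. */
    public static boolean matches(String text, int flags) {
        return Pattern.compile("&quot;(.*?)&quot;", flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String s = "&quot;two\nlines&quot;";
        System.out.println(matches(s, 0));              // prints: false
        System.out.println(matches(s, Pattern.DOTALL)); // prints: true
    }
}
```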
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Ironically, even though the Internet was created by the US military 
[DARPA (Defense Advanced Research Projects Agency)]
to withstand a nuclear attack, it is almost defenceless against malice
from any of its users



#2205

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-31 07:09 -0700
Message-ID: <hoa29851ekjh96mpk1de5c63cogsr2ue42@4ax.com>
In reply to: #2202
On Tue, 30 Oct 2012 14:16:29 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>Hi&quot;I'm-a&quot;-dash-seperated-&quot;string.&quot;

Happily I don't have to deal with that possibility. 

Some time ago I wrote a lint program for HTML to detect unbalanced
quotes of various flavours.  HTML/Unicode requires some guessing to do
this.  Some day I hope someone invents a whole new scheme for escaping
and marking nested dialogue that uses colour or font or automatic
rendering to indicate depth, hiding the problem of absolute depth when
writing markup. I would also like the &rsquo; character to be cloned,
so there are two slightly different characters, one for
contractions and one for marking dialogue.  The fact that the same
character/entity is used for entirely different purposes creates a
mess when trying to parse.

In HTML, &quot;xxx&quot; was the original way of doing things.

I have been getting fancier over time, with 66..99 style quotes,
colour coding for term, quoted words, and so-called semantics.

So this means a tedious process of converting the &quot;...&quot; form
or the &ldquo;...&rdquo; form to styles.

I do it in two stages: convert to style "quoted", then search for
class="quoted" and manually decide to change some of them to
class="term" or class="socalled".

Single words (without embedded spaces) are statistically most likely
to be "socalled" or "term". I take that into account to save a little
work.

I left out all this background about why I was doing this, thinking it
would just be a distraction from the problem.

 


#2206

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-31 07:11 -0700
Message-ID: <e9c2989a47kmn6ndfgn6glqaoduqngd0k7@4ax.com>
In reply to: #2201
On Tue, 30 Oct 2012 14:48:12 +0100, "Peter J. Holzer"
<hjp-usenet2@hjp.at> wrote, quoted or indirectly quoted someone who
said :

>s.replaceAll("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>");
>
>should do the trick.

Unfortunately, Funduc does not support the distinction between greedy
and reluctant quantifiers.

That leaves four options:

1. find a command-line or GUI search/replace tool with Java regex syntax.

2. write one.

3. write some one-shot Java code that uses a Regex.

4. write some one-shot Java code that uses a parser.


#2210

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-31 16:22 -0700
Message-ID: <0sb3981redva1j08nfo2cv52fco0ku70rf@4ax.com>
In reply to: #2206
On Wed, 31 Oct 2012 07:11:47 -0700, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

I found a utility that might do, PowerRegex. But they want about $150
for it, and there are hints you have to rebuy it for every bug release.
It seems GUI focused and DEBE.

>2. write one

Let's say I do this and offer it open source, free.

I have written a number of student regex projects.  Implementing them
all would be quite a hunk of work.

I am trying to think of a minimalist start, enough to solve my current
problem, still be useful to others, and open to gradual evolution.

I thought it would be script driven.
I  need a format for a script that
1. describes which files to apply it to.
2. supplies pairs of search/replace regexes
3. allows search-only
4. deals with regexes that end in spaces.   Presume you can trust
editors to leave them in peace.

It would need the ability to suppress replace and just search.  It
would allow you to run single step, or run to the end of the file, the
end of the directory, or everything.
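
One way the core of such a script might be modelled, as a sketch only:
an ordered list of rules, where a rule with no replacement is
search-only. The names RegexScript, Rule and apply are hypothetical, and
file selection, single-stepping and the on-disk script format are left
out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScript {
    /** One script step; a null replacement means search-only. */
    public static final class Rule {
        final Pattern search;
        final String replacement;

        public Rule(String regex, String replacement) {
            this.search = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    /** Applies the rules in order; search-only matches are collected in hits. */
    public static String apply(List<Rule> rules, String text, List<String> hits) {
        for (Rule rule : rules) {
            Matcher m = rule.search.matcher(text);
            if (rule.replacement == null) {
                while (m.find()) {
                    hits.add(m.group());      // report, do not modify
                }
            } else {
                text = m.replaceAll(rule.replacement);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("&ldquo;", null)); // search-only
        rules.add(new Rule("&quot;(\\S*?)&quot;", "<span class=\"quoted\">$1</span>"));
        List<String> hits = new ArrayList<>();
        // prints: &ldquo;hi&rdquo; <span class="quoted">cat</span>
        System.out.println(apply(rules, "&ldquo;hi&rdquo; &quot;cat&quot;", hits));
        System.out.println(hits);             // prints: [&ldquo;]
    }
}
```

Because each regex is held as a Java string, trailing spaces survive
here; how a text script file would protect them is still an open
question.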

Regex itself handles binary.

Later add:
1. some way of dealing with \uxxxx in regexes.
2. handling of non-regex and case-insensitive/case-sensitive searches
without having to embed commands in regexes.
3. negative wildcards.
4. a GUI to edit the script.
5. an HTML module where space, spaces, tab and nl all match space in
searches.
6. a composer/proofreader that uses colour to let you know which
chars are being used as literals and which as commands.
7. anonymous one-shot scripts.
8. the ability to touch up searched text in the middle of a search with
a minimal text editor, or to launch your favourite editor with line
number/offset.
9. undo.

Any ideas on ideal format for an extensible script file?




-- 
Roedy Green Canadian Mind Products http://mindprod.com
When discussing on the Internet, anything you say is presumed to contradict
someone else. If you are not, it is wise to explicitly state that you agree
with or are elaborating on what someone else said. You can do this
efficiently by starting your post with the word "Further" or "Also".



#2211

From: markspace <-@.>
Date: 2012-10-31 17:29 -0700
Message-ID: <k6sfp5$4or$1@dont-email.me>
In reply to: #2210
On 10/31/2012 4:22 PM, Roedy Green wrote:

>
>> 2. write one
>
> Let's say I do  this and offer it open  source, free.


Why not use Java?



#2215

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-11-01 18:43 -0700
Message-ID: <0i86985p3pa4ld7jlkguliutg6e1t3ff1c@4ax.com>
In reply to: #2211
On Wed, 31 Oct 2012 17:29:24 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>> Let's say I do  this and offer it open  source, free.

>Why not use Java?

Last night I wrote a prototype search/replace utility in Java.  It
does not have a UI or script format.  I temporarily hard code
constants and run it under IntelliJ.  It is surprisingly easy to use.
The IntelliJ IDE itself helps you proofread your regexes.  It lets me
search or search/replace regexes in multiple files with or without
confirmation.

I discovered the problem of cleaning up HTML quoting is much
trickier than I first thought.

Because &quot; and &rsquo; have multiple purposes, their meaning depends
on context.  Regexes are not up to the task of dealing with that.
In HTML you want to treat comments, text and tag fields differently.

A feature I am thinking of implementing soon is multiple choice
replace.

While  replacing you hit key 0 to leave as is, 1 to replace, 2 to
replace with an alternate string, 3 with another alternate etc.

So often what I  am doing is making a finer distinction.
e.g.  converting class="politician" to class="democrat" or
class="republican" or leaving as is.  I would be  able to do it in one
pass.
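
A sketch of how multiple-choice replace could sit on top of
Matcher.appendReplacement and appendTail. The class name, the replace
method and the chooser callback are all hypothetical; in a real tool the
chooser would read a keystroke, whereas here it replays a scripted
sequence of answers.

```java
import java.util.function.ToIntFunction;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiChoiceReplace {
    /**
     * For each match the chooser returns 0 to leave it as is,
     * or k (1-based) to substitute alternatives[k - 1].
     */
    public static String replace(String text, Pattern search,
                                 String[] alternatives, ToIntFunction<String> chooser) {
        Matcher m = search.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int choice = chooser.applyAsInt(m.group());
            String replacement = (choice >= 1 && choice <= alternatives.length)
                    ? alternatives[choice - 1]
                    : m.group();              // 0 or out of range: keep the original
            // quoteReplacement stops $ and \ being treated as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "class=\"politician\" class=\"politician\" class=\"politician\"";
        String[] alternatives = { "class=\"democrat\"", "class=\"republican\"" };
        int[] answers = { 1, 2, 0 };          // stand-in for interactive keystrokes
        int[] next = { 0 };
        Pattern search = Pattern.compile(Pattern.quote("class=\"politician\""));
        // prints: class="democrat" class="republican" class="politician"
        System.out.println(replace(html, search, alternatives, match -> answers[next[0]++]));
    }
}
```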

Another feature that should be easy to do is to allow hitting F to
continue replacements without confirmation till the end of this file,
D to the end of this directory, or E for everything (with confirmation).

In C there is getc for picking off a single keystroke. Is there an
easy way to do this in Java in a command line program? So you don't
have to hit Enter every time?

I found I had to pre-condition my files using QEV to detect potential
balancing errors http://mindprod.com/products1.html#QEV before I set
regexes loose. Otherwise they "flummox". Even then they have to be
carefully watched to make sure they do not wreck things.






#2203

From: Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date: 2012-10-30 16:39 -0700
Message-ID: <czZjs.7584$pn7.4092@newsfe18.iad>
In reply to: #2197
On 10/29/12 1:46 PM, Roedy Green wrote:
> Let's say you wanted to find strings of the form
>   &quot;cat&quot;
>   &quot;na&iuml;ve&quot;
>
> What sort of regex would you use?  or would you resort to custom code?
>

I would resort to StringEscapeUtils

<http://commons.apache.org/lang/api-2.4/org/apache/commons/lang/StringEscapeUtils.html>



#2208

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-10-31 07:33 -0700
Message-ID: <rgc298tulslrs15h894gd2csjncobq5ai1@4ax.com>
In reply to: #2203
On Tue, 30 Oct 2012 16:39:51 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote, quoted or indirectly
quoted someone who said :

>
>I would resort to StringEscapeUtils

So your idea is to convert most of the entities to Unicode, then apply
your regex, then convert most of them back.





