Groups > comp.lang.java.programmer > #22863 > unrolled thread

regexp(ing) Backus-Naurish expressions ...

Started by	qwertmonkey@syberianoutpost.ru
First post	2013-03-10 02:27 +0000
Last post	2013-03-10 11:16 -0700
Articles	20 on this page of 25 — 10 participants


  regexp(ing) Backus-Naurish expressions ... qwertmonkey@syberianoutpost.ru - 2013-03-10 02:27 +0000
    Re: regexp(ing) Backus-Naurish expressions ... Arne Vajhøj <arne@vajhoej.dk> - 2013-03-09 21:33 -0500
    Re: regexp(ing) Backus-Naurish expressions ... Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-09 21:00 -0600
    Re: regexp(ing) Backus-Naurish expressions ... Leif Roar Moldskred <leifm@dimnakorr.com> - 2013-03-09 23:33 -0600
    Re: regexp(ing) Backus-Naurish expressions ... lipska the kat <"nospam at neversurrender dot co dot uk"> - 2013-03-10 10:27 +0000
    Re: regexp(ing) Backus-Naurish expressions ... Martin Gregorie <martin@address-in-sig.invalid> - 2013-03-10 12:55 +0000
    Re: regexp(ing) Backus-Naurish expressions ... Roedy Green <see_website@mindprod.com.invalid> - 2013-03-10 07:57 -0700
      Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-10 22:39 +0100
        Re: regexp(ing) Backus-Naurish expressions ... Roedy Green <see_website@mindprod.com.invalid> - 2013-03-10 15:54 -0700
          Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-11 21:03 +0100
          Re: regexp(ing) Backus-Naurish expressions ... Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-11 17:00 -0500
            Re: regexp(ing) Backus-Naurish expressions ... Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-11 18:31 -0400
              Re: regexp(ing) Backus-Naurish expressions ... Arne Vajhøj <arne@vajhoej.dk> - 2013-03-11 18:40 -0400
                Re: regexp(ing) Backus-Naurish expressions ... Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-11 21:39 -0400
            Re: regexp(ing) Backus-Naurish expressions ... Martin Gregorie <martin@address-in-sig.invalid> - 2013-03-11 23:06 +0000
            Re: regexp(ing) Backus-Naurish expressions ... Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> - 2013-03-11 20:56 -0500
              Re: regexp(ing) Backus-Naurish expressions ... Arne Vajhøj <arne@vajhoej.dk> - 2013-03-11 22:06 -0400
              Re: regexp(ing) Backus-Naurish expressions ... Eric Sosman <esosman@comcast-dot-net.invalid> - 2013-03-12 09:30 -0400
        Re: regexp(ing) Backus-Naurish expressions ... Roedy Green <see_website@mindprod.com.invalid> - 2013-03-10 16:24 -0700
          Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-11 21:08 +0100
            Re: regexp(ing) Backus-Naurish expressions ... Arne Vajhøj <arne@vajhoej.dk> - 2013-03-11 16:59 -0400
              Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-11 22:24 +0100
        Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-11 21:00 +0100
      Re: regexp(ing) Backus-Naurish expressions ... Robert Klemme <shortcutter@googlemail.com> - 2013-03-13 08:07 +0100
    Re: regexp(ing) Backus-Naurish expressions ... markspace <markspace@nospam.nospam> - 2013-03-10 11:16 -0700


#22863 — regexp(ing) Backus-Naurish expressions ...

From	qwertmonkey@syberianoutpost.ru
Date	2013-03-10 02:27 +0000
Subject	regexp(ing) Backus-Naurish expressions ...
Message-ID	<khgr2k$u3b$1@speranza.aioe.org>

 I need to set up some code's running context via properties files and I want
to make sure that users don't get too playful messing with them, because that
could alter results greatly and in unexpected ways (they most probably won't
be able to make sense of the results, and then they would bother the hell out of you)
~ 
 So, I must do some sanity checking of the running parameters, whether they are
entered via the command prompt or taken as defaults from the properties files
~ 
 I am telling you all of that because you may know of libraries that do such
a thing
~ 
 I think one possible way to do that is via a regexp, which should match all
the options included in the test array aISAr
~ 
 One of the problems I am having is that if you enter as options, say, [true|t],
the matcher matches just the "t" of "true", and I want "true" to actually be
matched. Another one is that, say, " true " should be matched, as well
as "false [ nix |mac| windows ] line.separator" ...
~ 
 Any ideas you would share?
~ 
 thanks,
 lbrtchx
~ 
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ TEST CODE ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ 

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// __ 
public class RegexMatches02Test{
// __ 
 public static void main( String args[] ){
  String aRegEx;
  String aIS;
  Pattern Ptrn;
  Matcher Mtchr;
  int iCnt, iMtxStart, iMtxEnd;
// __ 
  aRegEx = "^\\s*[true|false|t|f]{1}\\s*\\[";
  aRegEx = "^\\s*[true|false|t|f]{1}";
  aRegEx = "^\\s*[true|false|t|f]{1}\\s*";
  aRegEx = "^\\s*[true|false t|f]{1}\\s*";

// __ 
  String[] aISAr = new String[]{
     " true[a|b |c ] q"
   , " true [a|b |c ] q"
   , "true [a|b |c ] q"
   , "true[a|b|c] b"
   , "true[a|b|c]q"
   , "False[ y | n | q ] q"
   , "false[nix|windows|mac]line.separator"
   , "false [ nix |mac| windows ] line.separator"
   , "T[y|n]q"
   , "T[y]"
   , "false"
   , "faLse"
   , "true"
   , "TrUe"
   , "F"
   , "T"
  };
  int iISArL = aISAr.length, i = 0;
// __ 
  boolean IsLoop;
  Ptrn = Pattern.compile(aRegEx, Pattern.CASE_INSENSITIVE);

  System.err.println("// __ matching pattern: |" + aRegEx + "|");

  Mtchr = Ptrn.matcher(aISAr[i]); // get a matcher object
  IsLoop = (i < iISArL);
  while(IsLoop){
   System.err.println("// __ |" + i + "|" + aISAr[i] + "|");
   iCnt = 0;
// __ 
   while(Mtchr.find()){
    iMtxStart = Mtchr.start();
    iMtxEnd = Mtchr.end();
    System.err.println("|" + iCnt + "|" + iMtxStart + "|" + iMtxEnd + "|" +
 aISAr[i].substring(iMtxStart, iMtxEnd) + "|");
    ++iCnt;
   }// (Mtchr.find())
   System.err.println("~");
// __ 
   ++i;
   IsLoop = (i < iISArL);
   if(IsLoop){ Mtchr.reset(aISAr[i]); }
  }// while(IsLoop)
 }
}


#22864

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-03-09 21:33 -0500
Message-ID	<513bf104$0$32108$14726298@news.sunsite.dk>
In reply to	#22863

On 3/9/2013 9:27 PM, qwertmonkey@syberianoutpost.ru wrote:
>   I need to set up some code's running context via properties files and I want
> to make sure that users don't get too playful messing with them, because that
> could alter results greatly and in unexpected ways (they most probably won't
> be able to make sense of and then they would bother the hell out of you)
> ~
>   So, I must do some sanity check the running parameters if entered via the
> command prompt or if the defaults are used from the properties files
> ~
>   I am telling you all of that because you may know of libraries to do such
> thing
> ~
>   I think one possible way to do that is via a regexp, which should match all
> the options included in the test array aISAr
> ~
>   One of the problems I am having is that if you enter as options say [true|t],
> the matcher would match just the "t" of "true" and I want for "true" to be
> actually matched another one is that, say, " true ", should be matched, as well
> as "false [ nix |mac| windows ] line.separator" ...
> ~
>   Any ideas you would share?

I would do it as:
- switch from properties to XML
- define a schema for the XML with strict restrictions on data
- let the application parse that with a validating parser and
   read it into some config object, this will ensure that required
   information is there and that the data types are correct
- let the application apply business validation rules in Java code
   on the config objects - this will ensure that the various
   information is consistent
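
A minimal sketch of the schema-validation step in the list above, using the JDK's javax.xml.validation API. The element names (config, verbose, platform) and the class name ConfigValidationDemo are assumptions made up for illustration. Reading the validated document into a config object and applying business rules are left out.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ConfigValidationDemo {
    // Hypothetical schema: one boolean flag and one enumerated platform name.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='config'><xs:complexType><xs:sequence>"
      + "<xs:element name='verbose' type='xs:boolean'/>"
      + "<xs:element name='platform'><xs:simpleType>"
      + "<xs:restriction base='xs:string'>"
      + "<xs:enumeration value='nix'/>"
      + "<xs:enumeration value='windows'/>"
      + "<xs:enumeration value='mac'/>"
      + "</xs:restriction></xs:simpleType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Returns true if the XML text conforms to the schema above. */
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, validation errors are thrown as SAXException.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (Exception e) { // SAXException (invalid) or IOException (unreadable)
            return false;
        }
    }

    public static void main(String[] args) {
        String good = "<config><verbose>true</verbose><platform>nix</platform></config>";
        String bad = "<config><verbose>maybe</verbose><platform>nix</platform></config>";
        System.out.println("good: " + isValid(good));
        System.out.println("bad:  " + isValid(bad));
    }
}
```

The schema takes care of data types and allowed values, so the Java code only has to deal with cross-field consistency.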

Arne


#22865

From	Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Date	2013-03-09 21:00 -0600
Message-ID	<khgst1$tjp$1@dont-email.me>
In reply to	#22863

On 3/9/2013 8:27 PM, qwertmonkey@syberianoutpost.ru wrote:
>   One of the problems I am having is that if you enter as options say [true|t],
> the matcher would match just the "t" of "true" and I want for "true" to be
> actually matched another one is that, say, " true ", should be matched, as well
> as "false [ nix |mac| windows ] line.separator" ...

Do you know the syntax of Java's regular expressions? See 
<http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html>.

In short, anything contained within square brackets is considered to be 
a set of characters to match on, so [true|t] succeeds if the character 
it's matching against is a t, r, u, e, or |. The syntax you probably 
wanted was (true|t), which would either match the string "true" or the 
string "t".
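
A minimal sketch of the grouped form, tried against a few of the test strings from the original post. The class and method names are made up for illustration, and the lookahead assumes the option word must be followed by whitespace, a "[", or the end of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BooleanPrefixDemo {
    // Grouped alternation: matches the whole word "true", "false", "t" or "f".
    // The lookahead keeps "t" from matching the first letter of e.g. "tree".
    private static final Pattern PREFIX = Pattern.compile(
        "^\\s*(true|false|t|f)(?=\\s|\\[|$)", Pattern.CASE_INSENSITIVE);

    /** Returns the matched option word, or null if the input does not start with one. */
    static String leadingBoolean(String input) {
        Matcher m = PREFIX.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = { " true [a|b |c ] q", "False[ y | n | q ] q", "T[y]", "tree" };
        for (String s : samples) {
            System.out.println("|" + s + "| -> " + leadingBoolean(s));
        }
    }
}
```

Because the alternatives are tried left to right, listing "true" before "t" lets the longer word win when both could match.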

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth


#22867

From	Leif Roar Moldskred <leifm@dimnakorr.com>
Date	2013-03-09 23:33 -0600
Message-ID	<kvOdnR7MkYC2hqHMnZ2dnUVZ8gGdnZ2d@giganews.com>
In reply to	#22863

qwertmonkey@syberianoutpost.ru wrote:
>
> I think one possible way to do that is via a regexp, which should match all
> the options included in the test array aISAr
> ~ 
> One of the problems I am having is that if you enter as options say [true|t],
> the matcher would match just the "t" of "true" and I want for "true" to be
> actually matched another one is that, say, " true ", should be matched, as well
> as "false [ nix |mac| windows ] line.separator" ...
> ~ 
> Any ideas you would share?

When working with regular expressions you should always remember that
you don't need to do everything in a single expression. There's no law
against splitting things up into sub-expressions or using "boring old
code" for parts of the match. 

You should also bear in mind that some parsing tasks are just not
suited to regular expressions and if the regular expression starts
getting complicated you should consider if the task might be solved
more easily with another approach.

Here, assuming I've understood the problem right, I might do something
as below (I'm not on my development computer, so note that this has
not been checked for errors):


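  // Alternative to the switch in firstValid() below.
  Set<String> VALID_FIRST_WORDS =
    new HashSet<>( Arrays.asList( "true", "false", "t", "f" ) );
  // A word may contain dots, e.g. "line.separator".
  String WORD = "([\\w.]+)";
  // Group 2 includes the brackets; group 3 is the text between them.
  String BRACKETED_WORD = "(\\[([^\\]]+)\\])";
  // Whitespace around the bracketed part is optional ("true[a|b|c]q").
  Pattern LINE_MATCH = Pattern.compile( WORD + "\\s*" + 
    BRACKETED_WORD + "?\\s*" + WORD + "?" );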
  
  boolean validLine( String inputLine ) {
    String line = inputLine.toLowerCase().trim();
    Matcher matcher = LINE_MATCH.matcher( line );
    if( matcher.matches() ) {
      String firstWord = matcher.group(1);
      // Not .group(2) as that would include the brackets.
      // group(3) is null when the bracketed part is absent.
      String bracketedWord = matcher.group(3) == null
          ? null : matcher.group(3).trim();
      String lastWord = matcher.group(4);

      return firstValid( firstWord ) && 
             bracketedValid( firstWord, bracketedWord ) &&
             lastValid( firstWord, bracketedWord, lastWord );
    }
    return false;
  }

  boolean firstValid( String firstWord ) {
    // Alternatively, use a HashSet 
    switch( firstWord ) {
      case "true" :   /* Fall through */
      case "t" :      /* Fall through */
      case "false" :  /* Fall through */
      case "f" :      return true;
      default : return false;
    }
  }

  // This is assuming the valid values of the bracketed
  // expression depends on what the first word was
  Map<String, Set<String>> LEGAL_BRACKETED = ...;

  boolean bracketedValid( String firstWord, String bracketed ) {
    if( bracketed == null ) {
      return true;
    }
 
    Set<String> legalBracketed = LEGAL_BRACKETED.get( firstWord );

    return legalBracketed != null && 
           legalBracketed.contains( bracketed );     
  }

  boolean lastValid( String first, String bracketed, String last ) {
    if( bracketed == null && last == null ) {
      return true;
    }

    // Implementation depends on the particulars of when certain
    // last words are valid and when not.
    ...
  }
  

-- 
Leif Roar Moldskred


#22871

From	lipska the kat <"nospam at neversurrender dot co dot uk">
Date	2013-03-10 10:27 +0000
Message-ID	<ybednSpj74C-_aHMnZ2dnUVZ7t6dnZ2d@bt.com>
In reply to	#22863

On 10/03/13 02:27, qwertmonkey@syberianoutpost.ru wrote:
>   I need to set up some code's running context via properties files and I want
> to make sure that users don't get too playful messing with them, because that
> could alter results greatly and in unexpected ways (they most probably won't
> be able to make sense of and then they would bother the hell out of you)
> ~
>   So, I must do some sanity check the running parameters if entered via the
> command prompt or if the defaults are used from the properties files
> ~
>   I am telling you all of that because you may know of libraries to do such
> thing

Not sure if this is what you are after as I've never used it myself but

http://commons.apache.org/proper/commons-cli/

might help. I've always had a lot of luck with other commons stuff.

lipska

-- 
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun


#22872

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2013-03-10 12:55 +0000
Message-ID	<khhvrq$jrb$1@localhost.localdomain>
In reply to	#22863

On Sun, 10 Mar 2013 02:27:32 +0000, qwertmonkey wrote:

> I need to set up some code's running context via properties files and I
> want to make sure that users don't get too playful messing with them,
> because that could alter results greatly and in unexpected ways (they
> most probably won't be able to make sense of and then they would bother
> the hell out of you)
>
I wrote my own extended Java equivalent of the C getopt() function, which 
separates arguments from options, accepts both short form (-h) and long 
form (--help) and allows you to specify whether an option must never, 
may, or must have an associated value, which may be written as -xvalue -x 
value -x=value in short form or --opt=value --opt value in long form. 
Option validity and value presence are checked by the parser but both 
argument checks and option value checks are left for the calling code.
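
A rough sketch of the option forms described above. This is not the ArgParser API from environ.jar: the class MiniGetopt, the Value enum and the field names are hypothetical, and bundled short flags such as -hv are not handled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MiniGetopt {
    /** Whether an option never, may, or must carry a value. */
    enum Value { NEVER, MAY, MUST }

    final Map<String, String> options = new LinkedHashMap<>(); // name -> value, "" if none
    final List<String> arguments = new ArrayList<>();          // everything that is not an option

    MiniGetopt(Map<String, Value> spec, String... args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-") || arg.equals("-")) {
                arguments.add(arg);
                continue;
            }
            boolean isLong = arg.startsWith("--");
            String body = arg.substring(isLong ? 2 : 1);
            String name = body;
            String value = null;
            int eq = body.indexOf('=');
            if (eq >= 0) {                                // -x=value or --opt=value
                name = body.substring(0, eq);
                value = body.substring(eq + 1);
            } else if (!isLong && body.length() > 1) {    // -xvalue
                name = body.substring(0, 1);
                value = body.substring(1);
            }
            Value rule = spec.get(name);
            if (rule == null) {
                throw new IllegalArgumentException("unknown option: " + arg);
            }
            if (value == null && rule == Value.MUST) {    // -x value or --opt value
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("missing value for: " + arg);
                }
                value = args[++i];
            }
            if (value != null && rule == Value.NEVER) {
                throw new IllegalArgumentException("unexpected value for: " + arg);
            }
            options.put(name, value == null ? "" : value);
        }
    }

    public static void main(String[] args) {
        Map<String, Value> spec = new HashMap<>();
        spec.put("h", Value.NEVER);
        spec.put("n", Value.MUST);
        spec.put("out", Value.MAY);
        MiniGetopt parsed = new MiniGetopt(spec, "-n", "5", "--out=a.txt", "input.txt");
        System.out.println(parsed.options + " " + parsed.arguments);
    }
}
```

In this sketch an option marked MAY only picks up a value that is attached to it (-xvalue, -x=value, --opt=value). A following word is treated as a plain argument, which sidesteps the ambiguity of "-x value".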

This is implemented as the ArgParser class in my environ.jar library and 
can be found at:

http://sourceforge.net/projects/cdocumenter/files/cdocumenter/environment/

and is fully documented in javadoc comments at class and method level.


-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#22873

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2013-03-10 07:57 -0700
Message-ID	<vo7pj8p9b0rv8pfkp64902inolmml9vm01@4ax.com>
In reply to	#22863

On Sun, 10 Mar 2013 02:27:32 +0000 (UTC),
qwertmonkey@syberianoutpost.ru wrote, quoted or indirectly quoted
someone who said :

> Any ideas you would share?

Regexes are quite limited. When you bang into their limits you can
write a finite state machine or use a parser.

see http://mindprod.com/jgloss/parser.html
http://mindprod.com/jgloss/finitestate.html
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Software gets slower faster than hardware gets faster. 
 ~ Niklaus Wirth (born: 1934-02-15 age: 79) Wirth's Law


#22875

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-03-10 22:39 +0100
Message-ID	<aq4cssFsm5rU1@mid.individual.net>
In reply to	#22873

On 10.03.2013 15:57, Roedy Green wrote:
> On Sun, 10 Mar 2013 02:27:32 +0000 (UTC),
> qwertmonkey@syberianoutpost.ru wrote, quoted or indirectly quoted
> someone who said :
>
>> Any ideas you would share?
>
> Regexes are quite limited.

I beg to differ: it's amazing what you can do with them.  Especially 
modern RX engines are usually much more powerful than those needed for 
the class of regular languages.

> When you bang into their limits you can
> write a finite state machine or use a parser.

What limitations would make me want to write a FSM instead by hand?

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


#22877

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2013-03-10 15:54 -0700
Message-ID	<ke3qj8daj46td8g3lklac14mtr56na4rv4@4ax.com>
In reply to	#22875

Examples where regexes run out of steam:
parsing Java, HTML, BAT language ... to do syntax colouring.
screen scraping, where what you want can appear in arbitrary order, be
missing, or be enclosed in a variety of delimiters.

creating code to simulate the output of forms. You have to do it in
stages. You pick out a string, then you pick out strings from that.


-- 
Roedy Green Canadian Mind Products http://mindprod.com
Software gets slower faster than hardware gets faster. 
 ~ Niklaus Wirth (born: 1934-02-15 age: 79) Wirth's Law


#22892

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-03-11 21:03 +0100
Message-ID	<aq6rkjFf1tsU2@mid.individual.net>
In reply to	#22877

On 10.03.2013 23:54, Roedy Green wrote:
> Examples where regexes run out of steam:

I never said you can do everything with regexps.  You said they are "quite 
limited", to which I responded "I beg to differ: it's amazing what you 
can do with them."  I think you are talking completely past me.

> parsing Java, HTML, BAT language ... to do syntax colouring.

For that you need a context free parser anyway and would not create a 
FSM by hand.

> screen scraping, where what you want can appear in arbiter orders, be
> missing, or enclosed in a variety of delimiters.

Still, I haven't seen a single reason to create a FSM by hand.

> creating code to simulate the output of forms. You have to do it in
> stages. You pick out a string then you pick out strings of that

Regexps are for _parsing_ and not for _generating_.

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/


#22904

From	Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Date	2013-03-11 17:00 -0500
Message-ID	<khlk24$bqu$1@dont-email.me>
In reply to	#22877

On 3/10/2013 5:54 PM, Roedy Green wrote:
> Examples where regexes run out of steam:
> parsing Java, HTML, BAT language ... to do syntax colouring.

Actually, all of those examples fall under the category of lexing, which 
is very easy to do with regular expressions; the python equivalent of 
flex uses regular expressions internally to do the lexing. Basically, 
what you'd have to do is this:

1. For each token, compute the regex that matches the token and enclose 
it in a named capturing group
2. Combine the token regexes into a single regex using disjunctions
3. Run the large regex on the input string by continually finding 
matches until it runs out of them.
4. For each match, use the named capturing group to do actions for that 
part of the input string.
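
The four steps above can be sketched in Java as follows. The token set is a made-up toy language, not Java or HTML, and the names RegexLexerDemo, KINDS and lex are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexLexerDemo {
    // Steps 1 and 2: one named capturing group per token kind, joined with '|'.
    private static final String[] KINDS = { "NUMBER", "IDENT", "STRING", "PUNCT", "SPACE" };
    private static final Pattern TOKENS = Pattern.compile(
          "(?<NUMBER>\\d+)"
        + "|(?<IDENT>[A-Za-z_]\\w*)"
        + "|(?<STRING>\"(?:[^\"\\\\]|\\\\.)*\")"
        + "|(?<PUNCT>[()\\[\\]{};,=+\\-*/|])"
        + "|(?<SPACE>\\s+)");

    /** Returns the tokens of the input as "KIND:text" strings, skipping whitespace. */
    static List<String> lex(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKENS.matcher(input);
        // Step 3: keep finding matches until the input is exhausted.
        while (m.find()) {
            // Step 4: the one group that is non-null identifies the token kind.
            for (String kind : KINDS) {
                String text = m.group(kind);
                if (text != null) {
                    if (!kind.equals("SPACE")) {
                        out.add(kind + ":" + text);
                    }
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("say(\"hi\", 42);"));
    }
}
```

Note that find() silently skips characters that match no token. A real lexer would compare each match's start() with the previous end() and report the gap as an error.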

> screen scraping, where what you want can appear in arbiter orders, be
> missing, or enclosed in a variety of delimiters.

([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^ 
\t\r\n()<>,:;@["])+

That is an example of a production regular expression I use specifically 
for tokenizing. Note in particular that I am matching two separate kinds 
of string literals ("foo" and [foo]). The hard part here is that I'm 
dealing with an idiot language that made comment-parsing context-free, 
but I decided to say "to hell with this" and ignore that fact, banking 
that it's a rare edge case I never have to deal with.

Granted, such large regular expressions can become extremely unwieldy 
(said regex is actually composed out of about five lines of code plus 
detailed comments above each part explaining what it does), but it's 
still very simple to do in a regex.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth


#22908

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2013-03-11 18:31 -0400
Message-ID	<khllr5$mg9$1@dont-email.me>
In reply to	#22904

On 3/11/2013 6:00 PM, Joshua Cranmer 🐧 wrote:
> [...]
> ([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
> \t\r\n()<>,:;@["])+
>
> That is an example of a production regular expression I use specifically
> for tokenizing. [...]

     As Ed Post noted nearly thirty years ago:

	It has been observed that a TECO command sequence
	more closely resembles transmission line noise
	than readable text.
	-- "Real Programmers Don't Use PASCAL"

Nobody I know of uses TECO any more, but regexes satisfy
people's craving for gibberish.

-- 
Eric Sosman
esosman@comcast-dot-net.invalid


#22910

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-03-11 18:40 -0400
Message-ID	<513e5d5a$0$32110$14726298@news.sunsite.dk>
In reply to	#22908

On 3/11/2013 6:31 PM, Eric Sosman wrote:
> On 3/11/2013 6:00 PM, Joshua Cranmer 🐧 wrote:
>> [...]
>> ([()<>,:;@])|(?:[^\\"]|\\.)*|\[(?:[^\\\]]|\\.)*\]|(?:\\.|[^
>> \t\r\n()<>,:;@["])+
>>
>> That is an example of a production regular expression I use specifically
>> for tokenizing. [...]
>
>      As Ed Post noted nearly thirty years ago:
>
>      It has been observed that a TECO command sequence
>      more closely resembles transmission line noise
>      than readable text.
>      -- "Real Programmers Don't Use PASCAL"
>
> Nobody I know of uses TECO any more, but regexes satisfy
> people's craving for gibberish.

$ edit/teco z.z
%Can't find file "Z.Z"
%Creating new file
*ex$$

:-)

(sorry - the only thing I know about TECO is how to exit)

Arne


#22919

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2013-03-11 21:39 -0400
Message-ID	<khm0r4$aqb$1@dont-email.me>
In reply to	#22910

On 3/11/2013 6:40 PM, Arne Vajhøj wrote:
> On 3/11/2013 6:31 PM, Eric Sosman wrote:
>>[...]
>> Nobody I know of uses TECO any more, but regexes satisfy
>> people's craving for gibberish.
>
> $ edit/teco z.z
> %Can't find file "Z.Z"
> %Creating new file
> *ex$$
>
> :-)
>
> (sorry - the only thing I know about TECO is how to exit)

     Perhaps the most important lesson of all!  ;-)

-- 
Eric Sosman
esosman@comcast-dot-net.invalid


#22912

From	Martin Gregorie <martin@address-in-sig.invalid>
Date	2013-03-11 23:06 +0000
Message-ID	<khlo1h$jaf$1@localhost.localdomain>
In reply to	#22904

On Mon, 11 Mar 2013 22:28:42 +0000, Stefan Ram wrote:

> Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
> writes:
>>On 3/10/2013 5:54 PM, Roedy Green wrote:
>>>parsing Java
>>Actually, all of those examples fall under the category of lexing,
> 
>   Parsing is not lexing, usually parsing comes after lexing.

When I need to do that in Java I use the Coco/R parser generator. It 
generates both lexer and parser and IMO is more understandable than the 
classic C equivalent (Lex + Yacc or Flex + Bison), at least partly 
because it's easy to modify or extend the framework it runs in and the 
generated code is fairly readable.

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |


#22920

From	Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
Date	2013-03-11 20:56 -0500
Message-ID	<khm1ti$el1$1@dont-email.me>
In reply to	#22904

On 3/11/2013 5:28 PM, Stefan Ram wrote:
> Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid> writes:
>> On 3/10/2013 5:54 PM, Roedy Green wrote:
>>> parsing Java
>> Actually, all of those examples fall under the category of lexing,
>
>    Parsing is not lexing, usually parsing comes after lexing.
>

I agree, but Roedy wrote:
parsing Java, HTML, BAT language ... to do syntax colouring.

Syntax coloring generally requires nothing more than lexing the input to 
figure out which tokens are which.

-- 
Beware of bugs in the above code; I have only proved it correct, not 
tried it. -- Donald E. Knuth


#22921

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2013-03-11 22:06 -0400
Message-ID	<513e8d96$0$32110$14726298@news.sunsite.dk>
In reply to	#22920

On 3/11/2013 9:56 PM, Joshua Cranmer 🐧 wrote:
> On 3/11/2013 5:28 PM, Stefan Ram wrote:
>> Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
>> writes:
>>> On 3/10/2013 5:54 PM, Roedy Green wrote:
>>>> parsing Java
>>> Actually, all of those examples fall under the category of lexing,
>>
>>    Parsing is not lexing, usually parsing comes after lexing.
>>
>
> I agree, but Roedy wrote:
> parsing Java, HTML, BAT language ... to do syntax colouring.
>
> Syntax coloring generally requires nothing more than lexing the input to
> figure which tokens are which.

Some languages are tricky.

C# has contextual keywords.

dynamic dynamic;

is a valid declaration and the first is a keyword and the second
is a name.

Arne


#22926

From	Eric Sosman <esosman@comcast-dot-net.invalid>
Date	2013-03-12 09:30 -0400
Message-ID	<khnahl$trr$1@dont-email.me>
In reply to	#22920

On 3/11/2013 9:56 PM, Joshua Cranmer 🐧 wrote:
> On 3/11/2013 5:28 PM, Stefan Ram wrote:
>> Joshua Cranmer 🐧 <Pidgeot18@verizon.invalid>
>> writes:
>>> On 3/10/2013 5:54 PM, Roedy Green wrote:
>>>> parsing Java
>>> Actually, all of those examples fall under the category of lexing,
>>
>>    Parsing is not lexing, usually parsing comes after lexing.
>>
>
> I agree, but Roedy wrote:
> parsing Java, HTML, BAT language ... to do syntax colouring.
>
> Syntax coloring generally requires nothing more than lexing the input to
> figure which tokens are which.

     Is that how the NetBeans editor knows to display local
variables in black but class and instance fields in green?

     ;-)

-- 
Eric Sosman
esosman@comcast-dot-net.invalid


#22878

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2013-03-10 16:24 -0700
Message-ID	<g95qj8t80ona2h7ut8msl3hdomg10d40j6@4ax.com>
In reply to	#22875

On Sun, 10 Mar 2013 22:39:22 +0100, Robert Klemme
<shortcutter@googlemail.com> wrote, quoted or indirectly quoted
someone who said :

>What limitations would make me want to write a FSM instead by hand?

Compacting out nugatory space in HTML would be another example.

Though they are quite complicated, I find FSMs very easy to write, and
they almost always work first time. You can narrow your thinking to a
tiny case and ignore the big picture quite safely.
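
A rough sketch of what a small hand-written machine for the whitespace case might look like. It is a simplification under stated assumptions: three states only, no special handling of pre blocks, comments, scripts, or a ">" inside a quoted attribute value. The class name SpaceCompactor is made up for illustration.

```java
public class SpaceCompactor {
    private enum State { TEXT, SPACE, TAG }

    /** Collapses each run of whitespace outside tags into a single space. */
    static String compact(String html) {
        StringBuilder out = new StringBuilder(html.length());
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                case SPACE:
                    if (c == '<') {
                        out.append(c);
                        state = State.TAG;
                    } else if (Character.isWhitespace(c)) {
                        if (state == State.TEXT) {
                            out.append(' ');   // only the first of a run is kept
                        }
                        state = State.SPACE;
                    } else {
                        out.append(c);
                        state = State.TEXT;
                    }
                    break;
                case TAG:
                    out.append(c);             // inside a tag: copy verbatim
                    if (c == '>') {
                        state = State.TEXT;
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(compact("<p>a  \n b</p>"));
    }
}
```

Each case only has to answer one question: given this state and this character, what is emitted and what is the next state.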

In contrast, I find my regexes (of any complexity) nearly always have
some unexpected behaviour, often one that does not show up immediately.

The other complicating factor is I use three different regex schemes
in a day: Java, Funduc and SlickEdit.  I keep borrowing syntax from
a scheme other than the one I am using.  Some day I will
have to write replacements that use Java syntax.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Software gets slower faster than hardware gets faster. 
 ~ Niklaus Wirth (born: 1934-02-15 age: 79) Wirth's Law


#22893

From	Robert Klemme <shortcutter@googlemail.com>
Date	2013-03-11 21:08 +0100
Message-ID	<aq6rtkFf5ceU1@mid.individual.net>
In reply to	#22878

On 11.03.2013 00:24, Roedy Green wrote:
> On Sun, 10 Mar 2013 22:39:22 +0100, Robert Klemme
> <shortcutter@googlemail.com> wrote, quoted or indirectly quoted
> someone who said :
>
>> What limitations would make me want to write a FSM instead by hand?
>
> Compacting out nugatory space in HTML would be another example.

There are tools for processing tag based languages.  Why would I want to 
create a FSM by hand for that?

> Though they are quite complicated, I find FSMs very easy to write, and
> they almost always work first time. You can narrow your thinking to a
> tiny case and ignore the big picture quite safely.

Certainly you can write FSMs for a lot of things.  But you were claiming 
that a manual FSM should be used instead of a regexp engine; so the 
question remains unanswered: why would anyone create a FSM by hand for 
parsing?

> In contrast, I find my regexes (of any complexity) nearly always have
> some unexpected behaviour, often than does not show up immediately.

Well, that certainly depends on your familiarity with the tool.  To me 
this sounds suspiciously like NIH syndrome.  I am so familiar with using 
regular expressions of various kinds that it would not occur to me to 
start writing a FSM for parsing by hand.  That is such a waste of time.

> The other complicating factor is I use three different regex schemes
> in a day: Java, Funduc and SlickEdit.  I keep borrowing syntax from
> one of the other schemes than the one I am using.

And how exactly do you implement a FSM in SlickEdit?

> Some day I will
> have to write replacements that use Java syntax.

Not sure what you mean by that.

Cheers

	robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
