23 August, 2009

Negation in regular expressions

2009/09/07 Update:
David Jones has posted a very simple solution to my contrived problem:

/\b(?!foo\b)[a-zA-Z]+/g
The above syntax does indeed solve the problem as laid out in this post. What I'm curious about now is if there is any situation where my proposed syntax works, and the above does not.

 

Regular expressions are highly useful things, with a variety of purposes. Nonetheless, I believe there is room to improve them. For instance, I have run across the following situation:

Let's suppose you want to find all the words in some text. This is rather simple:

/[a-zA-Z]+/g
But now let's suppose you want to find all the words, except the word "foo". This is incredibly more complicated. You can't use this construct:
/[^f][^o][^o]/g
For two primary reasons:
  1. It would match non-letter characters, like punctuation.
  2. It wouldn't match the word "boo" or "sol", even though neither are equal to the word "foo".

Let's try to rectify the first point:

/[a-eg-zA-EG-Z][a-np-zA-NP-Z]{2}[a-zA-Z]*/g
Well, now it no longer picks up punctuation, but it has grown far more complicated! And it still won't match "boo" or "sol".

What is the correct solution, then? See the update above. There mostly isn't one. You could probably come up with an incredibly long and complicated regular expression, but it's just not worth it. So, I am proposing a new syntax for regular expressions:

/[a-zA-Z]+(?^foo)/g

The above will match every word, except for the word "foo". However, it will match "boo", "sol", and "fool". On the other hand:

/[a-zA-Z]+(?^foo.*)/g
The above will match "boo" and "sol", but not "fool" or "foolhardy". This solves the problem of negation, within regular expressions.

7 comments:

  1. Or we could learn more regular expression syntax:

    /^(?!foo\b)\w+/

    «(?!» introduces a "negative assertion". Basically it says "without advancing the match cursor, attempt to match the thing in brackets here; if it matches, then fail".

    See ECMA-262 section 15.10.2.8 (no, don't really, the regular expression part of the JavaScript spec, is the _worst_ place to find out what regular expressions mean).

    It was inherited from Perl 5 via PCRE.

    (and I only know about this because we made a mug with a regular expression cheat sheet on it)

    ReplyDelete
  2. You are quite right; that regular expression does solve the problem. I knew about the negative look-ahead, but did not think about using it in that way.

    I'm curious about how well that would work in a more complicated expression. Sure, with the contrived example it was simple enough to solve. How about this, however:

    I wish to devise a regular expression that will match either single OR double-quoted strings. It must only match valid strings, and must allow for escapement. This regular expression works, but is overly verbose and complicated:

    /"(?:\\"|\\\n|[^"\n])*"|'(?:\\'|\\\n|[^'\n])*'/g

    It would be great if I could write it like this:

    /(["'])(?:\\\1|\\\n|[^\1\n])*\1/g

    But that doesn't work. Using the syntax I proposed, it would be possible:

    /(["'])(?:\\\1|\\\n|[^\n](?^\1))*\1/g

    With a quick cursory test, it seems the following also works:

    /(["'])(?:\\\1|\\\n|(?!\1)[^\n])*\1/g

    I haven't put it under a great amount of testing, but it seems to work okay. If this construct does indeed do what I desire, then you are correct that new syntax is not required.

    What I'm curious about is if there is any situation where my proposed syntax would work, but the negative look-ahead would not.

    ReplyDelete
  3. Cute. You can put back references in the negative assertion (I never knew that), so your idea of using (?!\1) is sound.

    However, your RE needs some tidying up to handle backslashes correctly. For example your RE matches the whole of:

    "\\""

    but that is not a valid string literal. It's a valid string literal followed by a lone double quote. Might not matter depending on how your tokenizer works.

    See this:

    http://wordaligned.org/articles/string-literals-and-regular-expressions

    Also, what language's string literal syntax are you trying to match? As far as I know, JavaScript does not permit newlines inside a string literal, not even if you backslash escape them, so "\
    " is not a valid literal, you always have to use \n if you want newlines. So I don't know what the \\\n stuff is trying to achieve.

    Soo... I think someone like this will do:

    /(["'])(?:\\.|(?!\1)[^\\])*\1/

    I think you have discovered something new (namely, a reasonable RE to match string literals).

    ReplyDelete
  4. Technically JavaScript does not allow for "\
    ", but all the major browsers support it, and it will be supported (officially) in ECMAScript 5.

    Indeed, I did learn some new things, notably about how the regular expression engine handles negative look-ahead, and also some interesting insights into backslash escaping.

    Thank you for your input.

    P.S. Your regular expression also matches "\\"". It produces the same output, as near as I can see.

    P.P.S. Back references, as far as I know, can occur anywhere except inside character classes. So.. [\1] or [^\1] won't work, but it will work anywhere else.

    ReplyDelete
  5. Ah, I spoke a bit too soon. Your regular expression matches one less character than mine. I will need to test it to verify that it works as expected.

    ReplyDelete
  6. "\
    " Oh yukh (I never knew that!). Still, it's good to plan ahead.

    When I said you discovered something new, I meant that a short regular expression to match string literals doesn't seem to be well known to the world, not just you and me; try googling for one, for example. And you'd think, what with all these syntax highlighters, that it would be a fairly well investigated problem.

    ReplyDelete
  7. Regular expressions in general tend to be short and confusing, so I'm not terribly surprised. The regular expression I was using has worked flawlessly for some time now, so I never thought to question it's validity. Indeed it only causes problems in very weird situations with uneven backslashes. Since I always make sure to properly backslash my code, the problem has simply not come up. That is probably the reason it has not been investigated further.

    On the plus side, I can plug the new regular expression into BRUSH, and it will automagically filter down, so I can benefit from it with almost no effort whatsoever.

    ReplyDelete