23 August, 2009

Negation in regular expressions

2009/09/07 Update:
David Jones has posted a very simple solution to my contrived problem:

The above syntax does indeed solve the problem as laid out in this post. What I'm curious about now is if there is any situation where my proposed syntax works, and the above does not.


Regular expressions are highly useful things, with a variety of purposes. Nonetheless, I believe there is room to improve them. For instance, I have run across the following situation:

Let's suppose you want to find all the words in some text. This is rather simple:

But now let's suppose you want to find all the words, except the word "foo". This is incredibly more complicated. You can't use this construct:
For two primary reasons:
  1. It would match non-letter characters, like punctuation.
  2. It wouldn't match the word "boo" or "sol", even though neither are equal to the word "foo".

Let's try to rectify the first point:

Well, now it no longer picks up punctuation, but it has grown far more complicated! And it still won't match "boo" or "sol".

What is the correct solution, then? See the update above. There mostly isn't one. You could probably come up with an incredibly long and complicated regular expression, but it's just not worth it. So, I am proposing a new syntax for regular expressions:


The above will match every word, except for the word "foo". However, it will match "boo", "sol", and "fool". On the other hand:

The above will match "boo" and "sol", but not "fool" or "foolhardy". This solves the problem of negation, within regular expressions.