RegEx Golf


XKCD #1313
XKCD #1313

Inspired by the famous XKCD cartoon that started it and by Google’s Peter Norvig over on O’Reilly.com, I decided to play some RegEx Golf myself this Easter…

If you’re not familiar with the game - the idea is to write the shortest regular expression possible that will select every element of a list, without selecting any elements from a second list. But what are regular expressions, you ask? Well, they’re a special sequence of characters that specify a particular textual patter – so you can use the appropriate regular expression when programming to automatically extract specific text from a list, file, or other text. There are some great tutorials on the web if you want to learn regular expressions (and you really should – because they’re great) – for example regexone.com

I didn’t want to use lists of Presidents of the United States (as that had been done) – so I decided to have a go at the age-old challenge: animal, mineral, or vegetable?

Having three lists to work with makes things a bit tougher - as we have two sources of potential mismatches to avoid - meaning we’ll have to be more specific than we otherwise might be…

After some after a quick web search I came up with the following lists:

Animals

alligator ant bear bee bird camel cat cheetah 
chicken chimpanzee cow crocodile deer dog 
dolphin duck eagle elephant fish fly fox frog 
giraffe goat goldfish hamster hippopotamus 
horse kangaroo kitten lion lobster monkey 
octopus owl panda pig puppy rabbit rat scorpion
seal shark sheep snail snake spider squirrel 
tiger turtle wolf zebra

Minerals

agate alexandrite amethyst aquamarine aventurine
azurite basalt calcite carnelian citrine coal 
diamond dolomite emerald feldspar fluorite garnet
gold granite gypsum hematite howlite jade jasper 
jet kunzite labradorite limestone magnetite 
malachite mica nephrite obsidian opal peridot 
pumice pyrite quartz ruby salt sandstone sapphire
sard silica soapstone sodalite talc topaz tourmaline
tsavorite turquoise unakite wairakite yeatmanite 
zircon zoisite

Vegetables

artichoke asparagus avocado bamboo basil bean 
broccoli capers carrot cauliflower celeriac 
celery chard chestnut cabbage chives cress 
cucumber fennel garlic ginger gourd kale kohlrabi
leek lentils lettuce maize okra olive onion parsley
parsnip pea pepper potato pumpkin radicchio radish 
rhubarb rocket romaine rutabaga seaweed shallot sorrel
spinach sprouts salsify squash succotash tomato turnip
watercress yam zucchini

Note that I played around slightly with the lists – to include a few tricky words – because this is no fun if it’s too easy, right? 🙂

Inspired by Peter Norvig – I too used Python, and a function (based on Peter’s) to automatically check my expression for me.

 1def mistakes(regex, first_set, second_set, third_set):
 2    """The set of mistakes made by the regex in classifying things in 
 3    the first_set from the second_set and the third_set"""
 4
 5    return ({"Should have matched: " + F
 6        for F in first_set if not re.search(regex, F)} |
 7        
 8        {"Should not have matched: " + S
 9            for S in second_set if re.search(regex, S)} |
10        
11        {"Should not have matched: " + T
12            for T in third_set if re.search(regex, T)})

Now we have that we can start playing. One easy technique might be to try to form a regex that matches all of the animals which start with ‘a’ – for example, ^a(ll|n) would match any word where a is the first letter, and is followed by either a double-l or an n… We can then do the same thing for words starting with b – adding it to the a’s using an OR…

^a(ll|n)|^b(i|ee|ear)

We need to do a bit more this time – to match bee & bear but not bean…

If we do this for the whole string (avoiding matching minerals and vegetables) we might get:

initial_animal = "^a(ll|n)|^b(i|ee|ear)|
^c(h(ee|i(c|m))|ro|ow|a(m|t))|^d(e|u|o(g|lp))|
^e[al]|^f(i|o|r|ly)|^g(ir|o(a|ldf))|^h(a|i|or)|
^k(an|i)|^l(io|o)|^mo|^o[cw]|^p(an|i|up)|
^ra[bt]|^s(c|eal|h(ar|ee)|n|qui|pid)|
^t(i|urt)|^wo|^ze"

And if we try it: print(mistakes(initial_animal, animal, mineral, vegetable)) wet get: set([]), showing us it’s a valid solution for these lists.

But it’s not very short – and we can do better than that.

Let’s take a look at the minerals. Here we can see straight away that there are some common factors we can exploit to help us. For example, quite a lot of words end -ite. So we could use "ite$" to find all words ending -ite – but in fact we don’t need to do that, since none of the animals or vegetables contain ite, we can save ourselves a character an just use "ite".

Going through the word-list using this technique (spotting things like no animals or vegetables begin with ‘j’, and although goldfish contains the word gold – it’s not at the end of the string) I came up with the following:

re_mineral = "ire|ite|ise|(r|l)ine|ian|one|[at]z|
pal|ic[ae]|^j|sum|sar|con|alt|gold$|net|agat|ys|
by|alc|ond|oal|dsp|rid|ald"

Using this substring matching technique for the animals gets us a shorter list too…

re_animal = "be(e|ar)|nk|^ow|ige|ark|ors|tte|ide|
ant|ly|roo|ig|pan|pus|o(x|g|w|at|lf)$|lio|ppo|
qui|rtl|ee[rtp]|ken|fish|ppy|ird|ebr|cr]a[mt]|eal|
ff|ake|bit|uck|pio|ms|ail|glbst|cod|hin$"

And we can do the same thing for the veg too…

re_vegetable = "ess|ok|bba|ya|[pe]k|llo|iz|iac|ut|
cc|rro|voc|bean|urd|ppe|ves|ea$|(r|le|f)y$|agu|nac|
low|ve$|orr|ts]uc|i[pc]$|mai|rh|ock|ato$|boo|uas|
til|eed|cha|nn|ing|lr|kal|oni|ers|umb|asi|dis"

We can now use an assertion to test all of the regular expressions together.

1def verify (re_a, re_m, re_v, animals, minerals, vegetables):
2    assert not (mistakes(re_a, animals, minerals, vegetables) |
3    mistakes(re_m, minerals, vegetables, animals) |
4    mistakes(re_v, vegetables, animals, minerals))
5    return True
6
7verify (re_animal, re_mineral, re_vegetable, animal, mineral, vegetable)

No output means we pass…

So that’s my attempt at this – I don’t think it’s too bad – but I suspect you could get shorter if you tried…

If you want the full source-code (such as it is!) – then go to https://github.com/Auctoris/regex_golf

If you find any shorter solutions, or have any comments about my solutions, then let me know…