Regular expressions in JavaScript, why so painful!?!
Regular expressions are one of those things that have been hanging around in my peripheral vision - I knew what they did, just not how they did it. Every attempt to figure them out has always ended far too quickly, they just look scary! Recently though, I've picked up an admittedly weird taste for diving deep into some pretty dry pieces of documentation and wading around until it all makes sense. Or at least enough sense to me that I feel justified in making an attempt to re-write it all in a (hopefully) slightly less dry manner. So, without further ado, let's begin!
In JavaScript, Regular Expressions are composed of a pattern and some flags. The flags are simply some settings that affect how the results come out, they're all Booleans so we just turn them on and off as needed. The pattern is where the really interesting stuff happens - this is where we can define some really complex and powerful string searches. But, before getting into that, I'm going to round out the context within which Regular Expressions work in JavaScript.
As with everything in this language, we have an object: it's called RegExp. We can create one in two ways:
- Regular expression literal
var re = /pattern/flags;
Everything between the two "/"s is taken as the pattern, the flags come after. This is always the form they take. Writing out the expression directly onto a variable like this gives better performance, but isn't so well suited to receiving variables to build up the pattern. After all, it's hard coded.
- Regular expression constructor
var re = new RegExp('pattern','flags');
A constructor function that lets us pass in values! (We can replace the 'pattern' and 'flags' arguments with variables). This is slower, but is great for dynamically producing our patterns with user input - be careful with that though, users can be evil.
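To put the "users can be evil" warning into something concrete, here's a minimal sketch of building a pattern from user input. The helper name escapeForRegExp is my own invention (it isn't built into the language), and it leans on replace() and the special `$&` replacement token (meaning "whatever was matched"), which goes a little beyond what this article covers. Treat it as one possible approach rather than the definitive one.

```javascript
// Hypothetical helper: put a backslash in front of every character
// that has a special meaning inside a pattern.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var userInput = 'a.c'; // imagine this came from a form field

var naive = new RegExp(userInput, 'g');                    // the '.' means "any character"
var escaped = new RegExp(escapeForRegExp(userInput), 'g'); // the '.' is just a dot

console.log('abc a.c'.match(naive));   // ["abc", "a.c"]
console.log('abc a.c'.match(escaped)); // ["a.c"]
console.log(escaped.source);           // a\.c
```

Without the escaping step, whatever the user types is interpreted as pattern syntax, which is rarely what we want for a plain text search.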
Now we have the RegExp object. As it is an extension of the main Object 'class' it has all the default properties, but it also has a few unique ones of its own:
- RegExp.source : string. The pattern we're searching with. This is the crazy gibberish we'll be getting into after laying out the context within which we use it.
- RegExp.lastIndex : int. The index at which to start the next search, so if you want to start searching from the 5th character in a string, this would be set to 4 (yay for 0 based numbering). Initially this defaults to 0. After a result is found it is set to the index of the first character after our result in the main string. So, if you search for a 3 letter word and the pattern matches one that starts at the 5th character, lastIndex will then be set to 7. Don't worry about it too much yet, we'll see it in action soon enough! Note - this requires the global flag to be set, which we'll also cover soon.
The other unique properties on RegExp hold the values set by the flags we pass/write in:
- RegExp.ignoreCase : Boolean (i). Fairly self explanatory, I hope!
- RegExp.global : Boolean (g). Without this, only one result will be returned from a string - the first one, over and over again. The g flag means search (with your pattern) throughout the entire string, then we can get all the results!
- RegExp.multiline : Boolean (m). This lets us treat lines as separate strings, ^ (beginning of a string) and $ (end of a string) will hook onto the start and end of each line. Without the m flag, they would only match the start and end of the entire string.
- RegExp.sticky : Boolean (y). Will only look for a match starting exactly at lastIndex, and no further along - when this was written it was only supported in Firefox, but it is part of the ES6 spec and current browsers have since implemented it.
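The difference between g and y is easier to see side by side. This is a small sketch, assuming an engine that supports the sticky flag; the variable names are just for illustration.

```javascript
var text = 'aa ab ac';

var globalPattern = /ab/g;
globalPattern.lastIndex = 1;
var globalFound = globalPattern.test(text);         // g searches onwards from index 1
console.log(globalFound, globalPattern.lastIndex);  // true 5

var stickyPattern = /ab/y;
stickyPattern.lastIndex = 1;
var stickyMissed = stickyPattern.test(text);        // y only tries at index 1 itself
console.log(stickyMissed, stickyPattern.lastIndex); // false 0

stickyPattern.lastIndex = 3;
var stickyFound = stickyPattern.test(text);         // 'ab' starts exactly at index 3
console.log(stickyFound, stickyPattern.lastIndex);  // true 5
```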
Here's a simple example. This pattern just looks for any exact matches of "abc" and has the three cross-browser compatible flags set: i, g, and m. The order of the flags doesn't matter. Try copying and pasting this into the console (right click on this page, inspect element, select the 'console' tab then paste and hit enter!):
var myRegExp = /abc/igm;
console.log("\n Source: ", myRegExp.source);
console.log(
"\n IgnoreCase", myRegExp.ignoreCase,
"\n Global", myRegExp.global,
"\n Multiline", myRegExp.multiline);
console.log("\n Last Index: ", myRegExp.lastIndex);
/abc/igm
Look for every match (g) of exactly 'abc', but ignore the case (i) and treat new lines as string boundaries (m).
That's the RegExp object! How to create it and what's stored within it. Minus the pattern bit, but I've got one more area to cover before that completely non-human readable fun!
Functions that use RegExp
exec(), test(), match(), search(), replace(), and split()
These are the guys that take a RegExp object and use it to work out some kind of result from a string. They search, they check, they make new strings, they build new arrays, they do all we could possibly (I think) want them to do! For each example in this section I'll use a fairly simple pattern /a\w+/ which will match an 'a' followed by one or more word characters - in our test string that means any word beginning with 'a'. I'll also use this on the string 'aa ab ac', that way we'll be able to see what's going on with each function quite clearly. You can copy and paste these examples into the console to follow along, just make sure you paste these two values in first:
var stringToSearch = 'aa ab ac';
var myRegExp = /a\w+/g;
/a\w+/g
Look for any match (g), where 'a' is followed by one or more (+) word characters (\w).
exec()
In cases where a search may match multiple substrings, this function lets us iterate through them. It's best to use it within a loop but I'll be explicit in the example for clarity. Also, this will only work with the global g flag set on our RegExp, otherwise the lastIndex property will not increment and we'll be forever stuck on the first result. So set it! Then loop. Eventually we will reach the final result, after that exec will return null once, then start from the beginning again.
console.log( myRegExp.lastIndex); // 0
console.log( myRegExp.exec(stringToSearch) ); //["aa", index: 0, input: "aa ab ac"]
console.log( myRegExp.lastIndex) // 2
console.log( myRegExp.exec(stringToSearch) ); //["ab", index: 3, input: "aa ab ac"]
console.log( myRegExp.lastIndex) // 5
console.log( myRegExp.exec(stringToSearch) ); //["ac", index: 6, input: "aa ab ac"]
console.log( myRegExp.lastIndex) // 8
console.log( myRegExp.exec(stringToSearch) ); // null
console.log( myRegExp.lastIndex); // 0
console.log( myRegExp.exec(stringToSearch) ); // ["aa", index: 0, input: "aa ab ac"]
console.log( myRegExp.lastIndex) // 2
//and so on
See the looping behavior there? It might be clearer if you paste it into the console - I've wrapped everything in console.log statements so the results show up in the console. If it doesn't work for you, remember to paste in the two variables from the beginning of this section! Now looking at each result array - it's giving us the result plus a couple of extra bits:
- index: The index of the result's first character in the original string (so the first result gets 0 as 'aa' is at the beginning of the original string, the second gets 3 as the first letter of 'ab' is the fourth character of the original string - we're counting from 0, remember).
- input: The original string.
With more complex patterns we'll also receive a third piece of extra information with our result - actually we'll get a few of them: a series of 'parenthesized substring matches'. But we'll get to that later!
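Since exec() is really meant to live in a loop, here's a minimal sketch of what that could look like. The `found` array and the ' at ' formatting are just my choices for showing the results.

```javascript
var stringToSearch = 'aa ab ac';
var myRegExp = /a\w+/g; // the g flag is essential: without it this loop never ends
var found = [];
var result;

// exec() returns null once the matches run out, which ends the loop
while ((result = myRegExp.exec(stringToSearch)) !== null) {
  found.push(result[0] + ' at ' + result.index);
}

console.log(found);              // ["aa at 0", "ab at 3", "ac at 6"]
console.log(myRegExp.lastIndex); // 0, ready to start over
```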
test()
Say you'd like to check if your pattern is actually going to match anything in a string. This function will simply return true for yes and false for no. Super simple! Note, with the global flag set on our RegExp, this will also iterate through the results updating lastIndex in exactly the same way as exec(), so if you've done some work with a global pattern and are about to test - it's probably a good idea to set RegExp.lastIndex = 0; so you don't get a false negative by accidentally testing from the end of the original string.
For this example we can again paste into the console (if you've already put in the two variables there's no need to put them in again). Run it a few times to see what happens:
console.log( myRegExp.test(stringToSearch) ); //true
console.log( myRegExp.lastIndex); // n - the index just past the latest match, or 0 after a false
You could also set a non global pattern, var myRegExp = /a\w+/; and try the above examples again just to see how they react.
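Here's a sketch of the false negative described above, and of the reset that avoids it. The useUp helper is hypothetical, it just stands in for "some earlier work with a global pattern".

```javascript
var stringToSearch = 'aa ab ac';

// Hypothetical helper: call exec() a few times, as earlier code might have done.
function useUp(pattern, times) {
  for (var i = 0; i < times; i++) {
    pattern.exec(stringToSearch);
  }
}

var forgotten = /a\w+/g;
useUp(forgotten, 3);
console.log(forgotten.lastIndex);                   // 8
var falseNegative = forgotten.test(stringToSearch);
console.log(falseNegative);                         // false, nothing left after index 8

var reset = /a\w+/g;
useUp(reset, 3);
reset.lastIndex = 0;                                // start again from the beginning
var afterReset = reset.test(stringToSearch);
console.log(afterReset);                            // true

var nonGlobal = /a\w+/.test(stringToSearch);        // no g, so no lastIndex bookkeeping
console.log(nonGlobal);                             // true
```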
match()
For me, this is the most intuitive. It returns an array of the results, handy if you're looking to count how many results we get! Watch out for that global flag though - without it we'll only get the first result. You might also note that the above two functions (exec and test) are properties of the RegExp object (as in we call RegExp.function('and pass in the string here');). The remaining 4 functions (match, search, replace, and split) are properties of the String object so the format flips.
stringToSearch.match(myRegExp); // ["aa","ab","ac"]
Note, match() doesn't consult lastIndex. With the g flag it always starts from the beginning of the string and leaves lastIndex at 0 when it's done. The rest of the string functions we're going through don't pick up from a leftover lastIndex either.
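A quick sketch of match() with and without the global flag, including what happens to a lastIndex left over from earlier work. One extra thing worth knowing before counting results: when nothing matches we get null back, not an empty array.

```javascript
var text = 'aa ab ac';
var globalPattern = /a\w+/g;
var singlePattern = /a\w+/;

globalPattern.lastIndex = 5;          // pretend an earlier exec() left this behind
var all = text.match(globalPattern);
console.log(all, all.length);         // ["aa", "ab", "ac"] 3
console.log(globalPattern.lastIndex); // 0

var first = text.match(singlePattern);
console.log(first[0], first.index);   // "aa" 0

var nothing = text.match(/zz/g);
console.log(nothing);                 // null
```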
search()
This is kind of similar to test() in that it allows us to check if we have anything in the string that our pattern matches. But, this returns the index of the first match - even with the global flag and lastIndex set beyond the first match. Also, given that it returns a number, -1 is used to indicate there are no matches.
myRegExp.lastIndex = 5; //just to prove search ignores this
stringToSearch.search(myRegExp); // 0
stringToSearch.search(/ad/); // -1, as this finds nothing
replace()
This is another pretty intuitive one, it's find and replace! Our RegExp is used to match parts of a string which are then replaced by another string that we define. The replace function takes two arguments. Our RegExp goes in as the first and the replacement string goes in as the second.
stringToSearch.replace(myRegExp, 'replacement!');
That first argument can also take a string, but that's not as fun as RegExp - so I'm going to ignore that bit. The only ever-so-slightly unintuitive bit, in my opinion, is that the replacement doesn't happen on our original string (stringToSearch), instead a new string with the applied replacements is returned. Don't get me wrong though, this is a good thing!
var newString = stringToSearch.replace(myRegExp, 'replacement!');
console.log(stringToSearch); // "aa ab ac"
console.log(newString); // "replacement! replacement! replacement!"
Note, if we don't pass in a replacement string 'undefined' is used instead - am I the only one who finds this weirdly amusing? If you're looking to use replace() to delete substrings, pass in '' as the second argument.
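A few of the replace() behaviours mentioned above, sketched out so they can be pasted into the console. The variable names are only there so each result is easy to inspect.

```javascript
var text = 'aa ab ac';

var deleted = text.replace(/ab /, '');       // an empty string removes the match
console.log(deleted);                        // "aa ac"

var noReplacement = text.replace(/ab/);      // forgot the second argument
console.log(noReplacement);                  // "aa undefined ac"

var firstOnly = text.replace(/a\w+/, 'X');   // no g flag: only the first match changes
console.log(firstOnly);                      // "X ab ac"

var everyMatch = text.replace(/a\w+/g, 'X'); // g flag: every match changes
console.log(everyMatch);                     // "X X X"

var plainString = text.replace('a', 'X');    // a string as first argument: first occurrence only
console.log(plainString);                    // "Xa ab ac"

console.log(text);                           // "aa ab ac", the original is never modified
```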
split()
The final function! For those occasions when we have a string and wish to split it up into an array of substrings, each defined by something we can match - like commas, or something far more complex / variable if the string is a mess.
myRegExp = / /; // just a space
var spaceSeparatedArray = stringToSearch.split(myRegExp);
console.log(spaceSeparatedArray); //["aa", "ab", "ac"]
Note, again, the pattern isn't global but split() doesn't care.
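For the "string is a mess" case, here's a small sketch comparing a plain string separator with a pattern. It uses \s (whitespace) and * (zero or more), both of which are explained further down in the patterns section; the messy string itself is made up.

```javascript
var messy = 'aa, ab ,ac,   ad';

var byComma = messy.split(',');         // plain string separator keeps the stray spaces
console.log(byComma);                   // ["aa", " ab ", "ac", "   ad"]

var byPattern = messy.split(/\s*,\s*/); // a comma plus any whitespace around it
console.log(byPattern);                 // ["aa", "ab", "ac", "ad"]
```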
Regular Expression Patterns
At last! We're here, the part where we get to decipher the patterns which are kind of like a tiny language all of their own. Fortunately the alphabet for this pseudo-language isn't too long and fits quite nicely into a few sections.
Simple patterns
Plain letters. The simplest of searches - it takes a string and looks for any exact matches. Yes, it's pretty much the same as indexOf(), but we've got to start somewhere!
var stringToSearch = 'aa ab ac ba bb bc ca cb cc';
var myRegExp = /ab/g;
stringToSearch.match(myRegExp); //["ab"]
/ab/g
Search for every match (g) of 'ab'.
Boundaries
Words, lines, and the string itself - each one has a defined beginning and end, which we can match! Strings and lines are pretty easy: ^ for the start and $ for the end, although the lines won't match unless the m (multiline) flag is set. Words, on the other hand, have some quirks to remember. On the surface they seem easy: \b matches a word boundary, not any actual characters - just any position that is between a word character and a non-word character. The quirk is in what qualifies as a word character - if you're dealing with a foreign language, it's likely that anything with an accent will be considered 'non-word'. Have fun with that!
var stringToSearch = 'aaa aa a';
var myRegExp = /\ba\b/g;
console.log( myRegExp.exec(stringToSearch) ); //["a", index: 7...
/\ba\b/g
Look for any match (g) of a word boundary (\b), followed by 'a', followed by another word boundary (\b).
Try swapping the pattern for /\ba$/g. As the match 'a' is at the end of the string, it will give us the same result. Alternatively we could search for characters that are preceded or followed by specific 'white space' characters, eg tab: \t, vertical tab: \v (which I've never used, or even heard of until now!), or carriage return: \r. There are a few more, all of which are covered by \s (any white space character) or its opposite: \S. I'll put a small list below, but if you're interested here's a more exhaustive list of special characters we can use.
- ^ : start of string (or line if m - multiline - is set)
- $ : end of string (or line if m is set)
- \b : word boundary.
- \B : non-word boundary, as in any position that is not a word boundary.
- \r : carriage return
- \t : tab
- \v : vertical tab
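To see the m flag and the accent quirk in action, here's a short sketch. The sample strings are made up, and the é is written as the escape \u00e9 so there's no doubt about which character is in there.

```javascript
var lines = 'first line\nsecond line';

var startNoM = lines.match(/^\w+/g);   // without m, ^ only matches at the very start
console.log(startNoM);                 // ["first"]

var startWithM = lines.match(/^\w+/gm);
console.log(startWithM);               // ["first", "second"]

var endWithM = lines.match(/\w+$/gm);
console.log(endWithM);                 // ["line", "line"]

var accented = 'caf\u00e9 au lait';    // "café au lait"
var words = accented.match(/\b\w+\b/g);
console.log(words);                    // ["caf", "au", "lait"] - the é counts as 'non-word'
```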
Multiple characters
By this, I mean a character we're looking for could be one of many. We define our list of possible characters within square brackets: []. So, for example, using the same string as in the simple pattern example we could search for any 'words' that start with either 'a' or 'b':
var stringToSearch = 'aa ab ac ba bb bc ca cb cc';
var myRegExp = /[ab]\w/g;
stringToSearch.match(myRegExp); //["aa", "ab", "ac", "ba", "bb", "bc"]
/[ab]\w/g
Search for every match (g) that begins with 'a' or 'b' ([ab]), and is followed by any word character (\w).
Note, in the result none of the 'words' beginning with 'c' are returned - they don't match!
We can also invert this, as in "find any character but those that are defined", by adding a ^ at the start of our list: [^ab]. Try it with the example above in the console (so the pattern becomes /[^ab]\w/g), the result might not be what you'd immediately expect - just remember we're saying any character except 'a' or 'b'. That includes non-word characters. So our result is this:
[" a", " a", " b", " b", " b", " c", " c", " c"]
* There are only two 'a' results, the first 'word' 'aa' isn't preceded by anything so isn't matched.
* ' ' (space) is matched as it's not 'a' or 'b' and all our spaces are followed by word characters so it fits the pattern [^ab]\w.
Character Ranges can be easier to type out if we're looking for a group of characters that are next to each other in their Unicode order (by the way, if you check out that link be aware that JavaScript Regular Expressions don't really deal well with foreign characters (as already mentioned), most are considered 'non-word'. As far as I can tell we're only really good down to the 0070 range... Mr. Bond). So [a-z] matches all the Latin alphabet lowercase characters (unless the i flag is set, in which case it matches lower and uppercase characters).
To make life even easier we are provided with a few shortcuts for common ranges - the word one we've already seen in the previous examples:
- \d : Any digit character, same as [0-9], or [0123456789].
- \D : A character that is not a digit, same as [^0-9], or [^\d].
- \w : An alphanumeric character or underscore ('word character'), same as [A-Za-z0-9_].
- \W : A non-word character, same as [^A-Za-z0-9_].
- \s : Any whitespace character (space, tab, newline, and similar).
- \S : Any non-whitespace character.
- . : Any character except for newline.
In the first two (digit and non-digit) I've given their alternatives so you can see there are quite a few ways to write the same thing!
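Here are those ranges and shortcuts run against a couple of made-up strings, as a quick sketch to paste into the console:

```javascript
var sample = 'Order_42 shipped: 3 boxes';

var digits = sample.match(/\d+/g);
console.log(digits);                  // ["42", "3"]

var wordRuns = sample.match(/\w+/g);  // note the underscore counts as a word character
console.log(wordRuns);                // ["Order_42", "shipped", "3", "boxes"]

var spaceCount = sample.match(/\s/g).length;
console.log(spaceCount);              // 3

var lower = 'Abc xyz'.match(/[a-c]/g);
console.log(lower);                   // ["b", "c"]

var anyCase = 'Abc xyz'.match(/[a-c]/gi);
console.log(anyCase);                 // ["A", "b", "c"]

var inverted = 'Abc xyz'.match(/[^a-c\s]/g); // anything but a, b, c, or whitespace
console.log(inverted);                // ["A", "x", "y", "z"]
```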
Repeating matches
For when we want to find a single character (or group of characters which we'll get to next) that repeat! We can define the number of repetitions in a few ways:
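- + : one or more
- * : zero or more
- ? : zero or one
- {2} : exactly twice
- {2,4} : min of 2, max of 4
- {0,2} : min zero, max 2 (the 0 has to be written out; JavaScript treats {,2} as literal text rather than a repetition)
- {4,} : 4 or more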
These apply to the element to their left (either a character or a group). So /\d{2,}/g will match any instance (g) of a number (\d) that is 2 or more digits long ({2,}). Pretty simple! Let's make it more complex.
The ? can also be applied to the other operators here. By default each will return the largest number of characters possible, this is called being greedy (no joke!). But if we add ? after one of them (eg, +?) it is turned non-greedy and will return the fewest characters instead.
Time for an example I think, let's look for a number that is one or more digits long:
var stringToSearch = 'abc 0123456789 abc';
console.log( stringToSearch.match(/\d+/) ); //["0123456789", ...
console.log( stringToSearch.match(/\d+?/) ); //["0", ...
/\d+/
Greedily match one or more digits.
/\d+?/
Non-greedily match one or more digits, you could think of this as the 'humble' operator!
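A few more repetition cases as a sketch, using made-up strings. The one to watch is the brace form with the minimum left out: in a regular (non-unicode) JavaScript pattern, {,2} isn't a repetition at all, it's matched as literal text.

```javascript
var runs = 'a aa aaa aaaaa';
var twoToFour = runs.match(/a{2,4}/g);
console.log(twoToFour);                  // ["aa", "aaa", "aaaa"]

var optional = 'colour color'.match(/colou?r/g); // the 'u' may appear zero or one times
console.log(optional);                   // ["colour", "color"]

var literalBraces = 'a{,2}'.match(/a{,2}/);
console.log(literalBraces[0]);           // "a{,2}" - matched character for character
var notARange = 'aa'.match(/a{,2}/);
console.log(notARange);                  // null

var html = '<b>bold</b>';
var greedy = html.match(/<.+>/)[0];
console.log(greedy);                     // "<b>bold</b>" - as much as possible
var lazy = html.match(/<.+?>/)[0];
console.log(lazy);                       // "<b>" - as little as possible
```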
Grouping
Also known as "Remembering substrings", or "Capturing parentheses", or "parenthesized substring matches"... but I find "grouping" a bit easier so let's just use that. We use () to define a group. This does a couple of things. First, anything within the parentheses is treated as a single element by any following operators. For example, let's search a crazy string for any occurrences where 'ab' is repeated 2 or more times:
var stringToSearch = 'aa ab ac aaaa abab acac aaaaaa ababab acacac';
var myRegExp = /(ab){2,}/g;
stringToSearch.match(myRegExp); //["abab", "ababab"]
/(ab){2,}/g
Look for everything (g) that has "ab" ((ab)) repeated 2 or more times ({2,}).
- Also to note, if we left out the parentheses (/ab{2,}/g) it would look for an 'a' followed by 2 or more 'b's ('abb', 'abbb', ...) and, as our string has none of those, return null.
The second thing that happens is that the result 'remembers' matches from within the parentheses. To see what I mean by this let's do the previous example but without the global flag:
var stringToSearch = 'aa ab ac aaaa abab acac aaaaaa ababab acacac';
var myRegExp = /(ab){2,}/;
stringToSearch.match(myRegExp); //["abab", "ab"]
Can you guess what just happened? Remember, without g we'll only ever get the first match "abab". So that second result in the array, "ab", that's the 'remembered' bit! It's the part of the original string that was matched by the group! Only one "ab" is remembered even though the group matched twice: when a group is repeated (here by {2,}) each repetition overwrites the one before, so we're left with whatever the final repetition matched. If you recall from the beginning of this article I mentioned that we would get some extra information with our results - this is that extra information, although at the start we had it wrapped in a console.log statement to expose all the details. So try running console.log( stringToSearch.match(myRegExp) ); to see a fully detailed results array.
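With "abab" it's hard to tell which "ab" ended up being remembered, so here's a sketch with a made-up string where every repetition is different. A repeated group keeps only what its last repetition matched.

```javascript
var result = 'a1b2c3'.match(/(\w\d){2,}/);

console.log(result[0]); // "a1b2c3" - the whole match, three repetitions of the group
console.log(result[1]); // "c3"     - the group only remembers its last repetition
```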
Multiple groups. Yep, we can use more than one group in an expression too, and for each additional group we get one more additional 'remembered' result in the array that's returned. Here's a fairly contrived example to demonstrate (g is set and I'm using exec to iterate through each match):
var stringToSearch = '1a1 1a2 1a3 1b1 1b2 1b3';
var myRegExp = /(\d)a(\d)/g;
console.log( myRegExp.exec(stringToSearch) ); //["1a1", "1", "1", ...]
console.log( myRegExp.exec(stringToSearch) ); //["1a2", "1", "2", ...]
console.log( myRegExp.exec(stringToSearch) ); //["1a3", "1", "3", ...]
/(\d)a(\d)/g
Look for every match (g) where any digit (\d) is followed by 'a' followed by any digit (\d), remember the first digit, remember the second digit.
Using 'remembered results'. Of course we can iterate through the results with exec(), take the results array, pull out the remembered matches and use them manually however we want. But there are also 2 extra ways for us to use these remembered bits: one is within the pattern and another within the replace() function.
Within the pattern, a group that we have defined can be reused later on in the same pattern with \n (where 'n' is the number of the group, so \1, \2, and so on):
var stringToSearch = "aba abb abc";
var myRegExp = /(a)b\1/;
stringToSearch.match(myRegExp); //["aba", "a"]
/(a)b\1/
Look for and remember 'a', followed by 'b', followed by whatever was in the first group (\1) - 'a'.
Because the first (and only) group matched 'a' the reference to that group \1 will only ever be 'a' as well, so this example is an overly complex way of saying - go find 'aba'.
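Back-references get more interesting when the group can match more than one thing, because \1 then means "the same text again, whatever it turned out to be". A small sketch with made-up strings:

```javascript
// Any word character immediately followed by itself
var doubledLetters = 'hello bookkeeper'.match(/(\w)\1/g);
console.log(doubledLetters); // ["ll", "oo", "kk", "ee"]

// A whole word, a space, then the same word again
var doubledWord = 'it was was fine'.match(/\b(\w+) \1\b/);
console.log(doubledWord[0]); // "was was"
console.log(doubledWord[1]); // "was"
```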
The replace() function remembers each group incrementally with $n, so the first group would be $1 which we would be able to use within the replacement string. Here's an example in which we look for 'a' space 'b', remember the two letters, and swap them in the replacement string:
var stringToSearch = "a b";
var myRegExp = /(a)\s(b)/;
var newString = stringToSearch.replace(myRegExp, "$2 $1");
console.log(newString); //b a
/(a)\s(b)/
Look for and remember 'a', followed by a space, followed by (and remember) 'b'
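The same swap trick works on less contrived input. Here's a sketch that rearranges a date and a name; both strings are made up for the example.

```javascript
var isoDate = '2015-03-14';
var ukDate = isoDate.replace(/(\d{4})-(\d{2})-(\d{2})/, '$3/$2/$1');
console.log(ukDate);   // "14/03/2015"

var listed = 'Smith, John';
var natural = listed.replace(/(\w+), (\w+)/, '$2 $1');
console.log(natural);  // "John Smith"
```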
Non-remembering groups / non-capturing parentheses. If we're using groups and don't want to remember one we can do this: (?:pattern). That will take it out of the 'remembered' bits array and will also skip the incremental numbering of those bits. This will help a lot if we're working with a giant pattern using many many groups - if we only wanted to remember a few of them we could apply ?: to all the others and not worry about having to count through every group to get the index of the few that we actually want to use. Let's take that previous replacement example and build on it a little to show what I mean (it's crazy, but it's what came to mind!):
var stringToSearch = "a  b"; // two spaces between the letters, the pattern needs at least 2
var myRegExp = /(a)(?:\s){2,}(b)/;
var newString = stringToSearch.replace(myRegExp, "$2 $1");
console.log(newString); //b a
/(a)(?:\s){2,}(b)/
Look for and remember 'a', followed by 2 or more ({2,}) space characters (\s) but don't remember them (?:), followed by (and remember) 'b'.
Yes we could just use /(a)\s\s(b)/ but it's for a good cause! (learning). In that mad example we have three groups, (a), (?:\s), and (b). Normally they would be referred to in the replace() function as $1, $2, and $3, but because we're telling it not to remember the crazy middle space one, (a) is $1 and (b) is $2. Run that example in the console then try removing the ?: and running it again to see what happens. In conclusion of this tiny section - ?: is magic for organizing our parenthesized substring matches!
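If you'd rather see the difference than take my word for it, here's a sketch that runs both versions side by side. The search string has two spaces between the letters, and slice() is only there to turn the exec() results into plain arrays for comparison.

```javascript
var spaced = 'a  b'; // two spaces

var capturing = /(a)(\s){2,}(b)/.exec(spaced).slice();
console.log(capturing);     // ["a  b", "a", " ", "b"] - the space group takes slot 2

var nonCapturing = /(a)(?:\s){2,}(b)/.exec(spaced).slice();
console.log(nonCapturing);  // ["a  b", "a", "b"] - 'b' moves up to slot 2

var swappedWrong = spaced.replace(/(a)(\s){2,}(b)/, '$2 $1');
console.log(swappedWrong);  // "  a" - $2 is now a space, 'b' has gone missing

var swappedRight = spaced.replace(/(a)(?:\s){2,}(b)/, '$2 $1');
console.log(swappedRight);  // "b a"
```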
OR
If you've made it through the previous 'Grouping' section, don't worry - this one is much simpler! (Also, well done!) If you've ever used pipes (||) to define 'OR' in JavaScript, you already know this one - in a pattern it's just a single |. We can give alternatives to match ('this' or 'that') like this: /this|that/. That example basically gives us two completely different patterns to match with (this) and (that). Although this might be useful in some situations, it's more likely that we'll want to use alternatives only in a specific part of our pattern... we're back to groups again! But you already know how they work. So now for another completely random example:
var stringToSearch = "this didn't do whatever that did!";
var myRegExp = /(this|that) did/g;
stringToSearch.match(myRegExp); //["this did", "that did"]
/(this|that) did/g
Look for every match (g) of either 'this' or 'that' ((this|that)), followed by ' did'.
Hopefully that made sense. We gave the pattern two alternatives for matching in the group only. 'did' gets matched no matter what goes on within the group.
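To show why the group matters, here's the same search with and without the parentheses. Without them, the | splits the whole pattern in two: 'this' on one side and 'that did' on the other.

```javascript
var stringToSearch = "this didn't do whatever that did!";

var grouped = stringToSearch.match(/(this|that) did/g);
console.log(grouped);    // ["this did", "that did"]

var ungrouped = stringToSearch.match(/this|that did/g);
console.log(ungrouped);  // ["this", "that did"]
```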
Lookaround
Sometimes you might find yourself looking for 'x' only if it has 'y' before or after it but you don't want 'y' to show up in the result. Or, you are looking for 'x' that does not have an adjacent 'y', and again don't want the 'not y' definition match to show up in the result. For these things we have lookarounds. When this was written JavaScript only supported lookaheads - checking the context after 'x'. Native lookbehinds were added to the language later (in ES2018); on engines without them they can be emulated but it's a bit of a hack, so let's start with lookaheads:
Lookahead
Confusingly these are also contained within parentheses but are not remembered - don't worry though, we can use groups within the lookahead match and they will be remembered. In fact, it takes a full on RegExp pattern so we can match whatever we want in there! But I'll leave all that fun out for now, we've already covered pretty much everything, so onto the example:
var stringToSearch = "xy yx";
console.log( /x(?=y)/.exec(stringToSearch) ); //["x", index: 0, ...
console.log( /x(?!y)/.exec(stringToSearch) ); //["x", index: 4, ...
x(?=y)
Look for 'x' that is followed by 'y' ((?=y)).
x(?!y)
Look for 'x' that is not followed by 'y' ((?!y)).
'x followed by y' hits the very start of the search string and the opposite hits the very end of the string!
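A slightly more practical sketch of the same idea: pulling numbers out of a made-up string of CSS-ish values, but only the ones measured in 'px', and without the 'px' ending up in the results.

```javascript
var sizes = '10px 20em 30px';

var withUnits = sizes.match(/\d+px/g);       // the unit is part of the match
console.log(withUnits);                      // ["10px", "30px"]

var numbersOnly = sizes.match(/\d+(?=px)/g); // the unit is checked but not included
console.log(numbersOnly);                    // ["10", "30"]
```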
Lookbehind
Native positive lookbehinds look like (?<=y)x (find 'x' preceded by 'y'), and negative ones like (?<!y)x (find 'x' not preceded by 'y'). On engines that don't support that syntax we can't do that, we have to make up our own ways. Here's my idea for each (the negative one first):
var stringToSearch = "xy yx xy";
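console.log( /([^y]|^)(x)/.exec(stringToSearch) ); //["x", "", "x", index: 0, ... ('x' not preceded by 'y')
console.log( /y(x)/.exec(stringToSearch) ); //["yx", "x", index: 3, ... ('x' preceded by 'y')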
([^y]|^)(x)
2 groups to remember, we'll use the second to get our actual result. In the first group: any character that is not y ([^y]) OR (|) the beginning of the string (^).
y(x)
Find 'y' followed by 'x' and remember 'x' - use the remembered match as our result.
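To make the workaround a bit more usable, here's a rough sketch that collects the position of every 'x' found by each emulated lookbehind. The groupPositions helper is hypothetical, and it assumes the group we care about is the last thing in the match (true for both patterns here). The final part builds the native lookbehind with new RegExp inside a try/catch, so the script still runs on an engine that doesn't know the syntax.

```javascript
var stringToSearch = 'xy yx xy';

// Hypothetical helper: run a global pattern over the text and report where
// group `groupNumber` starts, assuming that group sits at the end of each match.
function groupPositions(pattern, text, groupNumber) {
  var positions = [];
  var result;
  while ((result = pattern.exec(text)) !== null) {
    positions.push(result.index + result[0].length - result[groupNumber].length);
  }
  return positions;
}

var precededByY = groupPositions(/y(x)/g, stringToSearch, 1);
console.log(precededByY);     // [4]

var notPrecededByY = groupPositions(/([^y]|^)(x)/g, stringToSearch, 2);
console.log(notPrecededByY);  // [0, 6]

// Compare with the native syntax where the engine supports it.
var nativePositions = null;
try {
  nativePositions = groupPositions(new RegExp('(?<=y)x', 'g'), stringToSearch, 0);
} catch (e) {
  // lookbehind not supported here, the emulation above is all we have
}
console.log(nativePositions); // [4] where supported, otherwise null
```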
If we're going to be using a few groups within our pattern I can see how these approaches would start to become unmanageable, but it did spur an interesting thought - can we use groups within groups?
var stringToSearch = "xy yx xy";
/(((x)(y))\s((y)(x)))/.exec(stringToSearch);
//["xy yx", "xy yx", "xy", "x", "y", "yx", "y", "x"]
Yep! Turns out we can! Let's break that down a bit to see what's going on:
[
"xy yx", //The result
"xy yx", //The first group - the big one surrounding everything
"xy", //The first group within the big group
"x", //The first group within the previous one
"y",
"yx",
"y",
"x"
]
Ah of course, it's a tree! So, for nested groups the numbering starts at the top of the tree then steps down to the bottom before popping back up and along - in other words, groups are numbered by the order of their opening parentheses, left to right. That makes sense! I'm digressing, but it's the end of this article so that's all right!
In conclusion
So that's how Regular Expressions do what they do! Or, how we ask them to do what they do - the actual mechanics behind how it works can definitely be left for another time. For now I'll just give a tiny recap as a reminder. In the first half we covered the general structure of Regular Expressions /pattern/flags, their creation (literal vs constructor), and the functions that use them: exec(), test(), match(), search(), replace(), and split(). In the second half we went through the components that make up the actual patterns: Simple Patterns for exact matches, Boundaries (words, lines, the start and end of a string), Character ranges with [], how to define repetition in matches, how to group parts of our pattern and remember what was in those groups, alternative matches with |, and (finally) how to look ahead and behind(ish) to give our search context.
There are a few extra pieces which you can find in this table of all the special characters from Mozilla or by having a google! Something I've yet to find but would quite like would be a good collection of examples for common RegExp patterns. If you know of one or have written one - let me know!