Random Regex Tips / Coder's Block

As I mentioned before, I really like regex. I actually think it’s fun. So here’s a random smattering of quick tips to help you wrangle your regex.

Beware the Carriage Return

If you’re trying to match a pattern across a line break, then just accounting for the newline character (\n) isn’t always enough. A pattern that expects only \n between two lines won’t match text whose line breaks are stored as \r\n.

A lot of text editors (and browsers) actually insert 2 characters for each line break: a carriage return (\r) and then the newline (\n). Make sure your patterns are aware of carriage returns, for example by matching \r?\n instead of a bare \n, and you’ll have greater success.
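Here’s a quick sketch of the problem in Python. The sample text and variable names are made up for illustration; the point is just that a bare \n misses a Windows-style line break, while \r?\n handles both kinds.

```python
import re

# Hypothetical sample text with a Windows-style (\r\n) line break.
text = "first line\r\nsecond line"

# Only accounts for \n, so the \r sitting in between breaks the match.
naive = re.search(r"line\nsecond", text)

# \r?\n accepts both \r\n and a bare \n.
robust = re.search(r"line\r?\nsecond", text)

print(naive)               # None
print(robust is not None)  # True
```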

The Evil Twin

A lot of shorthand character classes will give you the corresponding opposite when you switch the case of the letter. For example, \s will match white-space characters, while \S will match non-white-space characters. The same goes for \d and \D (digits) and \w and \W (word characters).
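A small Python sketch of the twins in action, using a made-up sample string:

```python
import re

# Hypothetical sample: letters, digits, a space and a tab.
sample = "a1 b2\tc3"

whitespace = re.findall(r"\s", sample)       # white-space characters
non_whitespace = re.findall(r"\S+", sample)  # runs of everything else
digits = re.findall(r"\d", sample)           # digits
non_digits = re.findall(r"\D", sample)       # everything that is not a digit

print(whitespace)      # [' ', '\t']
print(non_whitespace)  # ['a1', 'b2', 'c3']
print(digits)          # ['1', '2', '3']
print(non_digits)      # ['a', ' ', 'b', '\t', 'c']
```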

Try Unicode Categories

The \p{...} syntax lets you match categories of characters that would be really verbose to define otherwise. For example, you can easily match all punctuation characters with \p{P}. Support varies between regex flavors, so check that yours understands it. For a list of all the categories you can use, check out this table.
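One caveat if you’re following along in Python: the built-in re module doesn’t understand \p{...}. As a stand-in (not the regex syntax itself), the sketch below uses the standard unicodedata module to ask for a character’s general category, which is the same information \p{P} relies on. The helper names are my own, purely for illustration.

```python
import unicodedata


def is_punctuation(ch: str) -> bool:
    # General categories starting with "P" (Po, Pd, Ps, Pe, Pi, Pf, Pc)
    # are the ones \p{P} covers.
    return unicodedata.category(ch).startswith("P")


def strip_punctuation(text: str) -> str:
    # Drop every character whose general category is punctuation.
    return "".join(ch for ch in text if not is_punctuation(ch))


print(strip_punctuation("¡Hola, mundo! «ok»"))  # Hola mundo ok
print(is_punctuation("$"))                      # False: "$" is a symbol (Sc)
```

Notice that symbols like $ and + are not punctuation as far as Unicode is concerned, which is easy to trip over.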

Multiple Passes

The zero-width positive lookahead assertion (whew!) is an incredibly useful construct. It lets you check the same part of a string for multiple things. As an example, say you want to check if a string is a good password. To be a good password, it must be at least 8 characters long, and have at least one uppercase letter, one lowercase letter, and one digit.

Here is the pattern to do this:

(?=.{8,})(?=.*[A-Z])(?=.*[a-z])(?=.*\d).*

It looks hairy, but if you look closer you’ll see we have 4 groupings wrapped in (?=...) with very simple patterns in each, one for each good password criterion. These are our assertions. When the assertion patterns are matched, they do not “eat” the matched part of the string, which is what allows us to do multiple passes over the same part. Then finally, we have a .* to actually “eat” the string up after all the assertions have been satisfied.
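To see the assertions at work, here’s a minimal Python sketch that wraps the pattern in a helper. The function name and the sample passwords are hypothetical, and using fullmatch (so the whole string has to be consumed) is my choice rather than anything the pattern requires.

```python
import re

# The pattern from above: four lookahead assertions, then .* to consume the string.
GOOD_PASSWORD = re.compile(r"(?=.{8,})(?=.*[A-Z])(?=.*[a-z])(?=.*\d).*")


def is_good_password(candidate: str) -> bool:
    # Every lookahead is checked from the start of the string,
    # and none of them consumes any characters.
    return GOOD_PASSWORD.fullmatch(candidate) is not None


print(is_good_password("Passw0rdX"))  # True: meets all four criteria
print(is_good_password("password1"))  # False: no uppercase letter
print(is_good_password("PASSWORD1"))  # False: no lowercase letter
print(is_good_password("Passwordd"))  # False: no digit
print(is_good_password("Ab1"))        # False: shorter than 8 characters
```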

Know When to Say No

Regex is a very powerful tool when used in the right places. But sometimes regex is simply the wrong tool for the job. Generally speaking, there is a higher computational cost when using regex in place of simple string operations. There’s also a higher maintenance cost. It’s no fun when you have to wade through someone else’s gigantic regex pattern that looks like they vomited punctuation all over the screen.

Well, I think it’s fun. But most people don’t.
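For a concrete taste of saying no, here’s a small Python sketch where plain string methods do the same job as a regex. The file name and the comma-separated line are made up for illustration.

```python
import re

# Hypothetical inputs.
filename = "report.final.csv"
line = "alpha,beta,gamma"

# Regex versions.
is_csv_regex = re.search(r"\.csv$", filename) is not None
fields_regex = re.split(r",", line)

# Plain string versions: same results, and easier to read at a glance.
is_csv_plain = filename.endswith(".csv")
fields_plain = line.split(",")

print(is_csv_regex == is_csv_plain)  # True
print(fields_regex == fields_plain)  # True
```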