JavaScript String Replace Magic

Replacing strings in JavaScript is a fairly common task, but there are some nuances that might surprise the uninitiated, not to mention some really powerful features just below the surface. Read on to see.

String.prototype.replace()

The Simple Case

The most obvious way to do string replacement is by passing 2 strings to replace(), the string to find and the string to replace it with.
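Here's a minimal sketch of the simple case. The sentence in `str` and the replacement word are made up for illustration.

```javascript
// Hypothetical sample sentence, made up for illustration.
const str = "Please pass the salt.";

// The return value is discarded here, so nothing visible happens.
str.replace("salt", "pepper");

// str is untouched; capture the return value to actually use the result.
const seasoned = str.replace("salt", "pepper");

console.log(str);      // Please pass the salt.
console.log(seasoned); // Please pass the pepper.
```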

replace() does not change the value of str; rather, it returns a new string with the replacements applied (this example doesn’t do anything with the returned string, but you’d probably want to).

It’s worth mentioning that this approach is case sensitive. Searching for “Salt” in the above example would come up empty.

Replacing All Occurrences

There is one major caveat: this approach only replaces the first occurrence.
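A quick sketch of the caveat, using a made-up string with a repeated word:

```javascript
// Hypothetical sample string with the same word repeated.
const chant = "badger badger badger mushroom";

// With a plain string as the search term, only the first match is replaced.
const firstOnly = chant.replace("badger", "snake");

console.log(firstOnly); // snake badger badger mushroom
```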

This is often not what you want. To replace all occurrences, you’ll want to use the regex flavored version of replace().
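Something along these lines, again with a made-up sample string:

```javascript
// Hypothetical sample string with the same word repeated.
const chant = "badger badger badger mushroom";

// A regex literal with the g (global) flag replaces every occurrence.
const allReplaced = chant.replace(/badger/g, "snake");

console.log(allReplaced); // snake snake snake mushroom
```

Newer JavaScript engines also provide `String.prototype.replaceAll()`, which accepts a plain string, but the regex form works everywhere.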

The g flag is crucial. That’s what makes it a global search, finding all occurrences. But what if you want to specify a string to replace via a variable, instead of hardcoding “badger”? It’s a little more typing, but not hard to do with a RegExp object.
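A minimal sketch of the variable-driven version. The names `target` and `pattern` are just illustrative.

```javascript
// The word to find arrives in a variable instead of being hardcoded.
const target = "badger";

// Build the regex at runtime; the second argument carries the flags.
const pattern = new RegExp(target, "g");

const swapped = "badger badger mushroom".replace(pattern, "snake");

console.log(swapped); // snake snake mushroom
```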

We’re dealing with regex now, so don’t forget to escape characters that have special meaning or you’ll get weird syntax errors, or a pattern that quietly matches the wrong thing. Just stick a \ in front of them.
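Here's a sketch of escaping in both forms. Note that when the pattern is built from a string, the special characters in that string need escaping too. `escapeRegExp` is a hypothetical helper written for this example, not a built-in, and the sample text is made up.

```javascript
// Hypothetical helper: puts a backslash in front of every regex
// metacharacter so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Made-up sample text containing the special characters $ and .
const receipt = "Coffee $5.00, tea $5.00";

// Regex literal: escape $ and . by hand.
const literalForm = receipt.replace(/\$5\.00/g, "free");

// RegExp object: escape the search text before building the pattern.
const builtForm = receipt.replace(new RegExp(escapeRegExp("$5.00"), "g"), "free");

console.log(literalForm); // Coffee free, tea free
console.log(builtForm);   // Coffee free, tea free
```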

Doing More with Regex

Opening the door to regex lets us do more interesting things. For example, replacing multiple words at once. While we’re at it, let’s solve that case-sensitivity issue from earlier.
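A small sketch, with a made-up sentence that mixes upper and lower case:

```javascript
// Hypothetical sample text with mixed-case variations of two words.
const shaker = "Salt, pepper, SALT and Pepper";

// salt|pepper matches either word; g replaces every match, i ignores case.
const bland = shaker.replace(/salt|pepper/gi, "spice");

console.log(bland); // spice, spice, spice and spice
```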

The | element lets us create an “or” list of terms. And you probably spotted the i flag, which makes the replacements case insensitive.

Here’s something fancier. The following looks for 3-digit numbers and replaces them with the same number, but with dashes between the digits.
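A sketch of that replacement, using a made-up input string. This simple pattern would also match three digits inside a longer number.

```javascript
// Hypothetical sample text containing 3-digit numbers.
const codes = "Codes: 123 and 456";

// Each (\d) captures one digit; $1, $2 and $3 refer back to those captures.
const dashed = codes.replace(/(\d)(\d)(\d)/g, "$1-$2-$3");

console.log(dashed); // Codes: 1-2-3 and 4-5-6
```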

\d is a special regex element that matches any single digit, so we use 3 of those to find 3-digit numbers. Each one is wrapped in parentheses (), which lets us reference them in the replacement string. $1 is replaced with whatever was matched in the first set of parentheses, $2 with what’s in the second, and so on.

We’ve barely scratched the surface of what can be done with regex. You can do some amazing things, but that’s an entire post (if not an entire book) of its own.

Replacement Functions

Sometimes you may want to do something completely custom when replacing text, something that not even regex can handle. Our next example replaces fractions with their decimal equivalents.
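A minimal sketch of the fraction replacement. The sample sentence and the parameter names `numerator` and `denominator` are chosen for illustration.

```javascript
// Hypothetical sample text containing fractions.
const recipe = "Mix 1/2 cup sugar with 3/4 cup flour.";

// The function runs once per match; whatever it returns replaces the match.
const decimals = recipe.replace(/(\d+)\/(\d+)/g, function (match, numerator, denominator) {
  // Convert both captures to integers, divide, and return the result as a string.
  return String(parseInt(numerator, 10) / parseInt(denominator, 10));
});

console.log(decimals); // Mix 0.5 cup sugar with 0.75 cup flour.
```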

\d+ means “one or more digits in a row”. We do this twice, both times wrapped in parentheses, so we can catch the numerator and denominator. In the middle is \/, which is simply an escaped slash /. Put it all together and we’ve got a fraction finder.

This time, instead of a string for the second parameter of replace(), we’re providing a function. This function is called every time a match is found and its return value is what replaces the match. The function takes several parameters, though we’re only using 3 in this example. The parameters are, in order:

  • match – The matching string that was found.
  • submatch1, submatch2, … – Each set of parentheses used will add a submatch parameter. It’s the same concept as the $1, $2, etc. stuff from earlier.
  • offset – How far into the string this match is, in number of characters.
  • string – The entire original string that we called replace() on.
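To see all of those parameters at once, here's a small sketch that records what the function receives for each match. The input string and the `calls` array are made up for illustration, and the pattern has just one set of parentheses, so there is a single submatch.

```javascript
// Collects the arguments passed to the replacement function on each call.
const calls = [];

const untouched = "a1b22".replace(/(\d+)/g, function (match, submatch1, offset, string) {
  calls.push({ match: match, submatch1: submatch1, offset: offset, string: string });
  return match; // return the match itself, leaving the text as it was
});

console.log(calls[0]); // { match: '1', submatch1: '1', offset: 1, string: 'a1b22' }
console.log(calls[1]); // { match: '22', submatch1: '22', offset: 3, string: 'a1b22' }
```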

The code within the function does the actual fraction/decimal conversion. It takes the first submatch (numerator) and second submatch (denominator), converts them both to integers, divides them to get the decimal value, then returns that as a string.

We’re just dividing numbers here, but you have the freedom to write whatever custom string replacement code you want. Go nuts.


Get Text Snippets with Regex

Say you want to take a snippet from a body of text. A fairly common task, but with a few rules if you want to do it right:

  • It can never exceed n characters/words
  • It can’t cut off in the middle of a word
  • It doesn’t include trailing whitespace or punctuation

Although none of this is terribly difficult, it’s still a couple lines of code. On the other hand, a single regex can do all of this for you.

Snippet by Character Count

This regex gives you up to the first n + 1 characters (100, in this case, with n = 99).

[Image: regex for snippet by character count]

Boring technical explanation: starting at the beginning ^, grab up to 99 {0,99} characters ., but whatever is grabbed must be immediately followed by a non-whitespace character \S immediately preceding a word boundary \b.
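Going by the explanation above, the pattern would look roughly like `^.{0,99}\S\b`. Treat that as a reconstruction from the description rather than the exact original. The sketch below wraps it in a hypothetical helper, `snippetByChars`, so the limit can vary.

```javascript
// Hypothetical helper: returns at most maxChars characters from the start of
// text, ending on a non-whitespace character that sits at a word boundary.
function snippetByChars(text, maxChars) {
  // Up to (maxChars - 1) of any character, then one non-whitespace character.
  const pattern = new RegExp("^.{0," + (maxChars - 1) + "}\\S\\b");
  const match = text.match(pattern);
  return match ? match[0] : "";
}

// Made-up sample text.
const body = "The quick brown fox jumps over the lazy dog.";

console.log(snippetByChars(body, 12));  // The quick
console.log(snippetByChars(body, 100)); // The quick brown fox jumps over the lazy dog
```

Note that `.` doesn't match line breaks in JavaScript by default, so the snippet stops at the first one.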

Snippet by Word Count

This regex gives you up to the first n words (10, in this case).

[Image: regex for snippet by word count]

Another boring technical explanation: starting at the beginning ^, grab up to 10 {0,10} groups (). Each group may or may not start with a space \s?. Either way, each group then has 1 or more non-whitespace characters \S+. The final group must end on a word boundary \b.
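Based on that explanation, the pattern would look roughly like `^(\s?\S+){0,10}\b`. Again, this is a reconstruction from the description, not necessarily the exact original. `snippetByWords` is a hypothetical helper that makes the word limit a parameter.

```javascript
// Hypothetical helper: returns up to maxWords words from the start of text,
// ending on a word boundary.
function snippetByWords(text, maxWords) {
  // Each group is an optional whitespace character followed by non-whitespace.
  const pattern = new RegExp("^(\\s?\\S+){0," + maxWords + "}\\b");
  const match = text.match(pattern);
  return match ? match[0] : "";
}

console.log(snippetByWords("Hello, world! How are you?", 2)); // Hello, world
console.log(snippetByWords("one two three four five", 3));    // one two three
```

Since `\s?` only allows a single whitespace character between words, runs of multiple spaces will cut the snippet short.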

Closing Remarks

Regex often walks a fine line between elegance and WTF. Personally, I’m comfortable using the 2 I’ve shared, but would never use this monstrosity. Your mileage may vary.


Regex Storm Goes Open Source

I love regex. So four years ago, I dedicated dozens upon dozens of hours to create the best regex testing and reference site I could. I called it Regex Storm. I’ve already talked about the experience in a previous retrospective.

[Image: Regex Storm screenshot]

Still Kicking

The site is still online and still fully functional, as far as I know. In fact, despite doing barely anything for it in the past few years, traffic is actually growing.

[Image: Regex Storm visits per month]

I still get email now and then from someone telling me how useful it is. I’m glad! I plan to keep it online as long as I can, but I still have no plans to continue development on it. The thing is, Regex Storm is woefully outdated. The markup is XHTML! The CSS has browser-specific hacks for Firefox 3! The ajax is an antiquated mess of .NET postbacks via UpdatePanels! At this point, it would be better to start from scratch, but that’s not something I’m willing to take on right now.

Open Sourced

I’ve decided to release the codebase for Regex Storm, even if it is a bit outdated. Not that I expect anyone to pick up development. I just don’t have a good reason to keep it closed source. And maybe there are still some good nuggets of code in there somewhere — or at the very least, something for people to rummage through for nostalgia’s sake. Also, to be honest, it’s been taking up one of my paid private repo slots on GitHub, so I may as well.

Feel free to poke around in the Regex Storm GitHub repo.

Random Regex Tips

As I mentioned before, I really like regex. I actually think it’s fun. So here’s a random smattering of quick tips to help you wrangle your regex.

Beware the Carriage Return

If you’re trying to match a pattern across a line break, then just accounting for the newline character isn’t enough. Notice how this pattern doesn’t work.

A lot of text editors (and browsers) actually insert 2 characters for each line break: a carriage return (\r) and then the newline (\n). Make sure your patterns are aware of carriage returns and you’ll have greater success.
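A small sketch of the problem and one way around it, shown in JavaScript with a made-up two-line string that uses \r\n line breaks:

```javascript
// Hypothetical text with a Windows-style line break (\r\n).
const text = "first line\r\nsecond line";

// Only accounts for \n, so the \r in between breaks the match.
const naive = /line\nsecond/.test(text);

// \r? makes the carriage return optional, covering both \n and \r\n.
const aware = /line\r?\nsecond/.test(text);

console.log(naive); // false
console.log(aware); // true
```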

The Evil Twin

A lot of character classes will give you the corresponding opposite when you switch the case of the letter. For example, \s will match white-space characters, while \S will match non-white-space characters.
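A quick illustration in JavaScript, using made-up strings, with \s and \S plus another such pair, \d and \D:

```javascript
// \s matches whitespace; \S matches everything that is not whitespace.
const spacesMasked = "a b\tc".replace(/\s/g, "_");
const lettersMasked = "a b\tc".replace(/\S/g, "_");

// \d matches digits; \D matches everything that is not a digit.
const digitsMasked = "abc123".replace(/\d/g, "#");
const nonDigitsMasked = "abc123".replace(/\D/g, "#");

console.log(spacesMasked);    // a_b_c
console.log(digitsMasked);    // abc###
console.log(nonDigitsMasked); // ###123
```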

Try Unicode Categories

The \p{...} syntax lets you match categories of characters that would be really verbose to define otherwise. For example, you can easily match all punctuation characters with \p{P}. For a list of all the categories you can use, check out this table.
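Here's a sketch of \p{P} in JavaScript specifically, where Unicode categories only work when the regex has the u flag (ES2018 and later). Other regex engines have their own rules. The sample string is made up.

```javascript
// Hypothetical sample text with several kinds of punctuation.
const noisy = "Hello, world! (Really?)";

// \p{P} matches any Unicode punctuation character; the u flag is required.
const stripped = noisy.replace(/\p{P}/gu, "");

console.log(stripped); // Hello world Really
```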

Multiple Passes

The zero-width positive lookahead assertion (whew!) is an incredibly useful construct. It lets you check the same part of a string for multiple things. As an example, say you want to check if a string is a good password. To be a good password, it must be at least 8 characters, and have at least one uppercase letter, one lowercase letter, and one number.

Here is the pattern to do this (play with it here):

It looks hairy, but if you look closer you’ll see we have 4 groupings wrapped in (?=...) with very simple patterns in each, one for each good password criterion. These are our assertions. When the assertion patterns are matched, they do not “eat” the matched part of the string, which is what allows us to do multiple passes over the same part. Then finally, we have a .* to actually “eat” the string up after all the assertions have been satisfied.
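Put together, four lookaheads followed by .* might look like the sketch below. This is a reconstruction from the description, so the exact original pattern may differ, and the sample passwords are made up.

```javascript
// Four lookahead assertions, each checking one rule from the same starting
// point, followed by .* to consume the string once they all pass.
const goodPassword = /^(?=.{8,})(?=.*[A-Z])(?=.*[a-z])(?=.*\d).*$/;

console.log(goodPassword.test("Passw0rdX"));   // true
console.log(goodPassword.test("password"));    // false (no uppercase, no number)
console.log(goodPassword.test("Sh0rt"));       // false (fewer than 8 characters)
console.log(goodPassword.test("ALLUPPER123")); // false (no lowercase)
```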

Know When to Say No

Regex is a very powerful tool when used in the right places. But sometimes regex is simply the wrong tool for the job. Generally speaking, there is a higher computational cost when using regex in place of simple string operations. There’s also a higher maintenance cost. It’s no fun when you have to wade through someone else’s gigantic regex pattern that looks like they vomited punctuation all over the screen.

Well, I think it’s fun. But most people don’t.
