It should come as no surprise that Ruby gives you a lot of flexibility right out of the box when it comes to manipulating text. After all, it originated in the 90s when Perl was on the rise, and Matz took inspiration from that language, which is famous for its text processing prowess.
I’ve needed to do a fair bit of parsing work lately, and as part of that I’ve become more familiar with some of the ins and outs of using Regular Expressions to seek through text to find and possibly replace tokens. This is by no means an exhaustive resource, but it should provide you with a general idea of what’s possible in your day-to-day Ruby programming.
Gsub
If you need to do a search and replace in one or more places throughout your string, gsub is typically the way to go. I think most Rubyists will discover this method pretty early on when learning about string manipulation.
What I didn’t know until recently is you can pass a block to gsub. For each match in the string, the block will be evaluated and the return value will be the replacement for that match. This means you can write code that will determine the replacement values conditionally based on what exactly is getting matched!
For example, if you wanted to change <div> tags to <span> tags, but only if there are no attributes, you could write something like this:
"<div>This is a string</div>" \
"<div class='centered'>This is another string</div>"
  .gsub(/(<.*?[ >])(.*?)(<\/.*?>)/) do |match|
    if $1.end_with?(" ")
      match
    else
      "<span>#{$2}</span>"
    end
  end
# <span>This is a string</span><div class='centered'>This is another string</div>
(Now this isn’t a great example because it doesn’t handle nested tags, but you get the idea…)
In case you’re not familiar with capture groups, $1 and $2 reference the first capture group, which is the opening tag (such as <div>), and the second capture group, which is the text inside the tag.
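To make the mapping between groups and globals concrete, here is a minimal sketch using the same regex on a made-up input string. The `html`, `seen`, and `result` names are just for illustration, and `Regexp.last_match(n)` is shown as the non-global spelling of `$n`.

```ruby
# Hypothetical input, used only to show which global maps to which group.
html = "<p>Hello</p>"

seen = []
result = html.gsub(/(<.*?[ >])(.*?)(<\/.*?>)/) do
  # $1, $2 and $3 are set for each match before the block runs.
  seen << [$1, $2, $3]
  # Regexp.last_match(2) reads the same value as $2.
  "#{$1}#{Regexp.last_match(2).upcase}#{$3}"
end

p seen      # [["<p>", "Hello", "</p>"]]
puts result # <p>HELLO</p>
```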
gsub also lets you provide a hash where matches will be replaced by the values of matched keys:
"Foo is the nicest bar you'll ever meet."
  .gsub(/Foo|bar/, "Foo" => "Joe", "bar" => "guy")
# Joe is the nicest guy you'll ever meet.
I suspect the block syntax is ultimately of more value though.
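One reason the block form tends to be more flexible: with the hash form, matched text that has no corresponding key is replaced by an empty string. The sketch below assumes a `replacements` hash (a name made up for this example) that is deliberately missing an entry, and shows a block falling back to the original match instead.

```ruby
replacements = { "Foo" => "Joe" } # "bar" is deliberately missing

text = "Foo is the nicest bar you'll ever meet."

# Hash form: "bar" matches the regex but has no key, so it becomes "".
with_hash = text.gsub(/Foo|bar/, replacements)

# Block form: fall back to the matched text when there is no key.
with_block = text.gsub(/Foo|bar/) { |match| replacements.fetch(match, match) }

puts with_hash  # Joe is the nicest  you'll ever meet.
puts with_block # Joe is the nicest bar you'll ever meet.
```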
Partition
The partition method lets you divide a string into three pieces: the part of the string before a single match, the match itself, and everything that comes after that match. If you include capture groups in your regular expression, you can utilize those as well. One way you can take advantage of this type of data is by using partition to search a string for tokens, and build a new string up via a buffer as you transform the tokens.
Let’s say you want to be able to put colons around words where you’d like the word length to appear as a kind of footnote after the word. You want text :like: this to turn into text like(4) this.
Here’s how you could write it using partition, a buffer, and an until loop:
string = "This is :something: you'll :want: to try :out: for yourself."
buffer = ""
until string.empty?
  text, token, string = string.partition(/ :(.*?): /)
  buffer << text
  if token.length.positive?
    buffer << " #{$1}"
    buffer << "(#{$1.length}) "
  end
end
puts buffer
# This is something(9) you'll want(4) to try out(3) for yourself.
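The loop above ends because of what partition returns when the pattern no longer matches. A quick sketch with two made-up strings shows both cases:

```ruby
# A match: [text before, the match itself, text after]
parts = "one :two: three".partition(/ :(.*?): /)
p parts    # ["one", " :two: ", "three"]

# No match: the whole string comes back first, followed by two empty strings.
no_match = "nothing to see".partition(/ :(.*?): /)
p no_match # ["nothing to see", "", ""]
```

In the no-match case the third element is empty, which is what gets assigned back to string and makes string.empty? true.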
Now, is this something you could do with a gsub block as described previously? Yes indeed:
string = "This is :something: you'll :want: to try :out: for yourself."
string.gsub!(/ :(.*?): /) do
  " #{$1}(#{$1.length}) "
end
puts string
In fact that’s a lot simpler. However, in this example you don’t have access to any of the text before or after the token. If that’s something that’s important to you (maybe you need to process the token differently depending on what comes before it, or after it), you’ll want to use partition.
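As a rough sketch of what that could look like, here is the partition loop with one invented rule: a token is left without its length footnote when the text right before it ends in "not". The rule and the sample string are assumptions for the sake of illustration.

```ruby
string = "Count :this: but not :that: please."
buffer = ""

until string.empty?
  text, token, string = string.partition(/ :(.*?): /)
  buffer << text
  next if token.empty? # no more tokens; the remaining text is already appended

  word = $1
  if text.end_with?("not")
    # Hypothetical rule: skip the footnote when preceded by "not".
    buffer << " #{word} "
  else
    buffer << " #{word}(#{word.length}) "
  end
end

puts buffer
# Count this(4) but not that please.
```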
Or will you?? There is another way!
StringScanner
Using StringScanner is like bringing a bazooka to a paintball tournament. It’s extraordinarily powerful, but it can also land you in some serious trouble—not to mention get a little mind bend-y if you’re not careful.
StringScanner is actually the name of a Ruby class in the standard library (stdlib), which you’ll need to import by adding require "strscan" to the top of your code. You use it by instantiating a scanner with a string, and then you use various methods to scan the string for patterns and advance a “pointer”.
Let’s say you want to replace “cake” with “pie” in a string, but not if the keyword is preceded by “short” or if it’s followed by “pops”. We’ll use a buffer and do string replacement like in previous examples, but because we have all the benefits of a scanner it’s pretty easy to look backwards and forwards and determine our next course of action.
require "strscan"
string = "Let them eat cake and then more shortcake and finally cake pops!"
scanner = StringScanner.new(string)
buffer = ""
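until scanner.eos?
  # Grab everything from the current pointer through the next "cake".
  portion = scanner.scan_until(/cake/)
  if portion.nil?
    # No more matches: keep the remainder and jump to the end of the string.
    buffer << scanner.rest
    scanner.terminate
    next
  end
  # Leave "shortcake" and "cake pops" alone; replace every other "cake".
  # \z anchors to the end of the string, so multi-line input behaves too.
  if scanner.pre_match =~ /short\z/ || scanner.check(/\s+pops/)
    buffer << portion
  else
    buffer << portion.sub(/cake/, "pie")
  end
end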
puts buffer
# Let them eat pie and then more shortcake and finally cake pops!
Whoa, what’s going on here?
First, we set up an until scanner.eos? loop. This means the loop will iterate until we’ve reached the end of the string.
The scan_until method looks for a pattern and advances the current pointer to the end of the match. (You can verify this by adding puts scanner.pointer below scan_until.) It returns everything from the previous pointer position up to and including the match, so we can use that to perform string substitution to change “cake” to “pie”.
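Here is a small sketch of scan_until on its own, using a made-up string, so you can see both the return value and the pointer after each call:

```ruby
require "strscan"

scanner = StringScanner.new("tea and cake and more cake!")

first = scanner.scan_until(/cake/)
p first           # "tea and cake"
p scanner.pointer # 12

second = scanner.scan_until(/cake/)
p second          # " and more cake"
p scanner.pointer # 26

# No further match: nil comes back and the pointer stays where it was.
third = scanner.scan_until(/cake/)
p third           # nil
p scanner.pointer # 26
```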
However, we don’t want to do the substitution if cake is preceded immediately by “short”, so we’ll check for a regex match on everything that’s come before the portion (scanner.pre_match) to see if it ends with “short”. We also want to check if the very next part of the string is the word “pops”, so we’ll use the scanner.check method. This checks what comes immediately next in the string, but it doesn’t advance the pointer. (There’s also a check_until method which is analogous to scan_until.) By not advancing the pointer, we avoid messing up our position in the string and can continue looping normally.
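To see the difference between looking ahead and actually consuming text, here is a minimal sketch on a made-up string. The local variable names are only for illustration.

```ruby
require "strscan"

scanner = StringScanner.new("shortcake pops")
scanner.scan_until(/cake/)

before_cake = scanner.pre_match # text before the "cake" we just matched
p before_cake                   # "short"

position = scanner.pointer
peeked = scanner.check(/\s+pops/)
p peeked                        # " pops"
p scanner.pointer == position   # true, check did not move the pointer

consumed = scanner.scan(/\s+pops/)
p consumed                      # " pops"
p scanner.eos?                  # true, scan did move the pointer
```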
The if portion.nil? block near the top of the loop handles the case where there are no more instances of “cake” in the string but there’s still more to the string we need to account for. By adding the .rest of the string to our buffer and calling scanner.terminate, we force the scanner to advance to the end of the string, in which case until scanner.eos? will evaluate to true and end the loop.
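A short sketch of that final step in isolation, again with a made-up string:

```ruby
require "strscan"

scanner = StringScanner.new("cake pops!")
scanner.scan_until(/cake/)

no_more = scanner.scan_until(/cake/)
p no_more        # nil, there is no second "cake"

leftover = scanner.rest
p leftover       # " pops!"
p scanner.eos?   # false, we are not at the end yet

scanner.terminate
p scanner.eos?   # true
p scanner.rest   # ""
```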
This example is fairly simple because it’s only changing a single word to another word, so the substitution itself doesn’t require any fancy regex. But combine StringScanner with all of the techniques we’ve already learned (gsub blocks, even partition), and you’re able to build extremely sophisticated routines to handle nearly any kind of text processing imaginable.
Summary
Whew, that’s a lot to take in! Today you’ve learned that gsub is much more than just a way to say that “a” should become “b”. By supplying a block, you have precise control over the replacement strings by first inspecting each match of the source string.
In addition, the partition string method lets you divide a string into pre-match, match, and post-match components, and by doing so over and over in a loop and using a buffer, you can transform a large and complicated string section by section.
Finally, for the most precise control over searching text for one or more tokens and performing elaborate search-and-replace actions based on the relationships those tokens have with the rest of the text, the StringScanner object is there just waiting to unleash its full power. Not only that, your code can benefit from previous techniques in the midst of using StringScanner for maximum Ruby text processing prowess.
“Ruby is simple in appearance, but is very complex inside, just like our human body.”
matz
Banner image by Tara Evans on Unsplash