Reversing a string in 2019

How do you reverse a string?

If you’re a Ruby programmer, you’d try something like this:

str.reverse

And as a Pythonista, you might say:

str[::-1]

And as a JavaScript programmer:

str.split("").reverse().join("")

Or maybe you work with Go, and you’ve seen this discussion where Rob Pike mentioned something about Unicode strings not wanting to be reversed.

The bottom line is that, in 2019, string manipulation is far more involved than it looks.

For example, reversing “👈🏾👆” with the Ruby or Python code above gives you “👆🏾👈” and not “👆👈🏾” which is what you were expecting. “राम” would become “मार” and not “मरा”.
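To see why the naive reversal goes wrong, it helps to look at the code points involved. The Python sketch below reverses both example strings by code point, the same way `str[::-1]` does, and prints what ends up where. The names `hands`, `ram` and `code_points` are illustrative only.

```python
# Build the strings from explicit code points so nothing is hidden.
hands = "\U0001F448\U0001F3FE\U0001F446"  # 👈 + skin-tone modifier 🏾 + 👆
ram = "\u0930\u093e\u092e"                # र + vowel sign ा + म

def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

for text in (hands, ram):
    flipped = text[::-1]  # slicing reverses code point by code point
    print(text, code_points(text))
    print(flipped, code_points(flipped))
```

In the reversed output the modifier U+1F3FE follows 👆 instead of 👈, and the vowel sign U+093E follows म instead of र, which is exactly the mix-up described above.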

You might notice that these discrepancies happen in strings that:

  • have characters outside plain ASCII
  • have some kind of combining characters

What is going on here?

In the world of Unicode, a string is not a collection of characters in the traditional sense. Each ‘letter’ that makes up the string could be a compound of several characters. In the ASCII era, we had one-byte characters, and a string was an array of such one-byte characters. Not anymore.

Now we have Unicode code points, which are the closest thing we have to characters. But a code point may take one or more bytes, depending on the encoding. A string can be thought of as a sequence of these code points.
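As a rough illustration of the gap between code points and bytes, this Python sketch encodes a few single-code-point strings as UTF-8 and compares the counts. The sample characters and the helper name `sizes` are chosen for illustration; `len` on a Python `str` counts code points, and `len` on the encoded `bytes` counts bytes.

```python
# Four strings, each made of exactly one code point.
samples = ["a", "\u00e9", "\u0930", "\U0001F448"]  # a, é, र, 👈

def sizes(text):
    """Return (code point count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))

for text in samples:
    points, nbytes = sizes(text)
    print(f"{text}: {points} code point(s), {nbytes} byte(s) in UTF-8")
```

Each sample is a single code point, yet the UTF-8 byte count ranges from one to four.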

The characters as we humans perceive them are referred to as grapheme clusters. 👈🏾 is a grapheme cluster, and so is रा. You can guess why it is called a cluster: it’s a combination of the emoji character and the skin-tone modifier that colours it. (In the case of रा, it’s the base character र and the matra.)

So now when you’re programming, a given string manipulation task might have different results based on how characters are counted. If you’re working with grapheme clusters, you get one result, and another with code-points (and yet another with bytes). You might need specialized manipulation libraries (like the grapheme package in Python).
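Here is a minimal sketch of counting and reversing by grapheme cluster, using only Python's standard library. It is a deliberate simplification, not Unicode's full segmentation algorithm (UAX #29): it assumes a cluster is a base code point followed by any combining marks or emoji skin-tone modifiers, so it will mishandle ZWJ emoji sequences, flags, Hangul and more. The function names are hypothetical; for real text, a dedicated library is the safer choice.

```python
import unicodedata

# Emoji skin-tone modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = range(0x1F3FB, 0x1F400)

def extends_previous(ch):
    """Simplified rule: combining marks and skin-tone modifiers attach backwards."""
    return unicodedata.category(ch).startswith("M") or ord(ch) in SKIN_TONE_MODIFIERS

def simple_clusters(text):
    """Split text into approximate grapheme clusters (not full UAX #29)."""
    clusters = []
    for ch in text:
        if clusters and extends_previous(ch):
            clusters[-1] += ch   # glue the mark onto the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

def reverse_clusters(text):
    """Reverse text cluster by cluster, keeping each cluster intact."""
    return "".join(reversed(simple_clusters(text)))

hands = "\U0001F448\U0001F3FE\U0001F446"  # 👈🏾👆
ram = "\u0930\u093e\u092e"                # राम

for text in (hands, ram):
    # The same string measured three ways: bytes, code points, clusters.
    print(len(text.encode("utf-8")), len(text), len(simple_clusters(text)))
    print(reverse_clusters(text))
```

For the hands string the three counts are 12 bytes, 3 code points and 2 clusters; for राम they are 9, 3 and 2. Reversing by cluster gives 👆👈🏾 and मरा, the results a reader would expect.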

The key takeaway here is that your programming language has stopped making that decision for you. Rust, for example, refrains from providing a string reverse function in its standard library. Neither does Go. And my guess is that no language invented after UTF-8 became popular (around 2008) would.

But I deal with only English (and emojis)

I’m not a native English speaker, and yet I often used to think like that. It has been a good escape for a long time. As long as you are dealing with non-specialized English, you might get away with ignoring the problem. Your text reverses just fine.

Except that now we have emojis. It is not possible, in today’s world, to imagine user-generated input that will not contain emojis. Their popularity has put the problem right in our faces.

Now all your string manipulation should work correctly with emojis.

  • Your database should be able to store emojis.
  • Your cursor should land in the correct spot around emojis.
  • Your character count should be correct for text with emojis.
  • Your sorting should work correctly with emojis.

Traditional string manipulation is not going to work anymore.

The way forward

Fortunately, the problem has largely been solved by other people. Most popular languages can handle grapheme manipulation, either in the standard library or through a package. Most operating systems are good with UTF-8.
