Rust: Beware of Escape Sequences \n

Dominic E.
11 min read · Mar 1, 2021


Photo by Paul Basel from Pexels

Serde is one of the most popular crates in the Rust ecosystem for efficiently serializing and deserializing data structures. It supports a variety of data formats, including JSON, YAML, MessagePack, and others. Unlike many other (de)serializers, it doesn’t use runtime reflection and instead uses Rust’s compelling trait system. That makes serde exceptionally efficient because data structures essentially know how to serialize or deserialize themselves, and they do so by implementing the Serialize or Deserialize traits. Luckily, this is very straightforward for most types, thanks to the derive macro. However, it’s not always roses and unicorns 🦄. You might encounter pitfalls just like me, and in this blog post, I would like to talk about one in particular: Escape Sequences.

Let’s take a look at what is required to deserialize JSON strings at scale.

Hint: the secret ingredient is zero-copy value deserialization. So let’s dive in and see what we need to bear in mind to achieve this.

To set the stage, suppose we are building a website where you can easily create snippets of data and share them with the world. Let’s call this service Snipster (Snippet + Hipster 🤷‍♂️). Basically, it’s just a clone of what GitHub calls Gists.

So we’ll need a client and a server. The client is used to create the snippets and the server exposes a simple REST API to retrieve and persist the data. We are not going to implement them in this post, because that’s not really relevant. But I am a huge fan of warp for the server side. Give it a whirl.

We pick JSON for exchanging data between the client and server because it’s easy, human-readable, and one of the most common data formats on the web.

The Data Model

With that in mind, we can come up with the following JSON data model.

A snippet has a description, can be public or private, and contains one or more files, each with a name and content. Each file’s content can be anything, for instance JavaScript or a bash script, as long as it’s text-based.

Now, let’s translate this into Rust data types:

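use std::collections::HashMap;

use serde::{Deserialize, Serialize};

// A snippet with its description, visibility, and files.
#[derive(Serialize, Deserialize)]
struct Snippet {
    description: String,
    public: bool,
    files: HashMap<String, File>,
}

// A single text file inside a snippet.
#[derive(Serialize, Deserialize)]
struct File {
    name: String,
    content: String,
}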

Pretty straightforward so far. The client would convert the data, a JavaScript object, into a string by calling JSON.stringify(data) and then send it our way. On the server, we’d grab this JSON string and call:

let snippet: Snippet = serde_json::from_str(json_str)?;

Because we have annotated our structs with the derive macros, serde_json will do the heavy lifting and automagically convert the string into a Snippet. Pretty darn sweet.

But… and here’s the catch.

If we pay close attention to our structs above, we can notice that we used String quite often. If we recall, a String is one of the string types of Rust that has ownership over the contents of the string, meaning we can change the String, for example by appending a char (.push('a')) or a borrowed string slice (.push_str("yay")).

String: A quick Recap

A String in Rust is made up of three components:

  • a pointer to the string data, a sequence of UTF-8 bytes that lives on the heap, not the stack
  • the length of the string, that is, the number of bytes it currently occupies
  • and the capacity, the number of bytes the string can occupy before it has to reallocate

The length of a String will always be less than or equal to its capacity. Shrinking the string only reduces the length. Growing it reuses the existing buffer as long as the new length fits into the capacity; otherwise a larger buffer is allocated, the bytes are moved over, and the old buffer is freed.

Unlike a string slice or its borrowed form (&str), a String can actually allocate or drop memory when the string itself is changed. This is only possible because it owns the data. There can be many read-only borrows but only a single owner.

That said, if we have a raw string slice and want to be able to mutate it, we’d have to own that section of memory, and this is only possible if we create a String from it. In fact, this means that internally the bytes have to be copied.
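As a minimal sketch of this recap, the following program prints the length and capacity of a String and checks whether two strings start at the same memory address. The helper shares_buffer is a made-up name for illustration, not part of the standard library.

```rust
/// Returns true if both string slices start at the same memory address.
/// (Hypothetical helper, only used for this illustration.)
fn shares_buffer(a: &str, b: &str) -> bool {
    a.as_ptr() == b.as_ptr()
}

fn main() {
    // An owned String tracks a pointer, a length, and a capacity.
    let mut owned = String::with_capacity(4);
    owned.push_str("yay");
    println!("len = {}, capacity = {}", owned.len(), owned.capacity());

    // Growing past the capacity forces the String to reallocate.
    owned.push_str(" for owned strings");
    println!("len = {}, capacity = {}", owned.len(), owned.capacity());

    let input = "borrowed text";
    // A slice is just a view into the existing bytes.
    let view: &str = &input[..8];
    // Creating a String from a slice copies the bytes into a new heap buffer.
    let copy: String = input.to_owned();

    println!("view shares the input buffer: {}", shares_buffer(input, view)); // true
    println!("copy shares the input buffer: {}", shares_buffer(input, &copy)); // false
}
```

The exact capacity values depend on the allocator strategy, which is why the sketch prints them instead of assuming specific numbers.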

Ok, but why this detour exactly? Why did we now talk about String at such length? Well, I think it’s time to address the elephant in the room and talk about the problem. This will clear things up and explain why String was the only string variant that would work with our data model.

Meet: Escape Sequences 😈

The Problem

Our application is all about creating snippets, and each snippet contains 1-n files of text content. So it’s pretty much impossible to avoid special characters such as newlines; I mean, if everything was on one line, it wouldn’t be very easy for anyone to understand. So what, you might say. Does it really matter if the content contains these special characters? The short answer is yes. It does, mainly because we are exchanging data as JSON strings.

To understand this, let’s have a look at the following partial snippet:

{
files: [
{
content: `fn main() {
println!("Hello World!");
}
`
}
]
}

The above is a JavaScript object that represents a snippet. It contains one file with some content, which happens to be Rust 🙌. Notice how we use a template literal (`...`) to write out the content of this file across multiple lines. So we intuitively used newline characters to make our code more readable. Cool, but it’s not yet valid JSON. So let’s use JSON.stringify on our object, and we end up with the following JSON string:

{"files":[{"content":"fn main() {\nprintln!(\"Hello World!\");\n}\n"}]}

Yay! Well, maybe not quite. While we now have a valid JSON string that we could send to our server, we should pause for a moment and take a close look at it. Interestingly, every newline in the original template literal has become two visible characters in the JSON string: a backslash followed by n. This is the escape sequence for a newline, \n.

A line break is a control character that tells the rendering device, for example a terminal, to move the print head down one line. This information is essential, and we want to make sure that it doesn’t get lost, so it has to be encoded into the JSON string somehow. That’s where escape sequences come into play: the control characters get escaped. However, it also means the escaping has to be undone again at some point. Why? Because the original string doesn’t contain the two characters \ and n. It contains a line break, which is a single control character rather than printable text. What actually lives in memory is the following:

fn main() {
println!("Hello World!");
}

Line breaks in memory are just 1 byte. In fact, a line break can be represented as 0x0A in hex or 00001010 in binary. To really get to the bottom of this, let’s simplify our string a bit and look at foo\nbar. Let’s convert its escaped form, as it appears in the JSON text, into hex, just because it’s easier to read than binary:

66 6F 6F 5C 6E 62 61 72

However, this is not what is supposed to live in memory. It’s instead this:

66 6F 6F 0A 62 61 72

The \n should not be present in memory, and it should be converted to a proper line feed, which is part of the ASCII table.

If we wanted the string as is to live in memory, we’d have to double escape the control sequence: \\n.
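To check the bytes of foo\nbar yourself, here is a small sketch that prints the escaped and the unescaped variant as hex. The helper to_hex is a made-up name for illustration.

```rust
/// Formats the bytes of `s` as space-separated uppercase hex pairs.
/// (Hypothetical helper, only used for this illustration.)
fn to_hex(s: &str) -> String {
    s.bytes()
        .map(|b| format!("{:02X}", b))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    // As it appears inside the JSON text: a backslash and an 'n', two bytes.
    let escaped = r"foo\nbar";
    // After unescaping: a single line feed byte.
    let unescaped = "foo\nbar";

    println!("{}", to_hex(escaped)); // 66 6F 6F 5C 6E 62 61 72
    println!("{}", to_hex(unescaped)); // 66 6F 6F 0A 62 61 72
}
```

The escaped form is one byte longer, which is exactly why the unescaped string cannot simply be a slice of the JSON input.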

So far, so good, but are those escape sequences really a problem? Yes and no.

It’s a problem because we have to use String and cannot be smarty pants and store &str slice references to the original text in our struct, which would be ideal. Because, after all, serde advertises zero-copy deserialization, which means that we’d just borrow the string or the bytes. More importantly, this avoids allocating memory for every field and avoids copying data from the input over to the newly allocated fields.
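To make the idea of zero-copy concrete, here is a minimal sketch that does not use serde at all. It parses a toy name:content line into a struct whose fields are slices of the input. The FileRef type, the parse function, the points_into helper, and the line format are all assumptions made up for this illustration.

```rust
/// A hypothetical record whose fields borrow from the input instead of owning copies.
#[derive(Debug, PartialEq)]
struct FileRef<'a> {
    name: &'a str,
    content: &'a str,
}

/// Splits a `name:content` line into two borrowed slices; no bytes are copied.
fn parse<'a>(input: &'a str) -> Option<FileRef<'a>> {
    let idx = input.find(':')?;
    Some(FileRef {
        name: &input[..idx],
        content: &input[idx + 1..],
    })
}

/// Returns true if `part` lies within the memory occupied by `whole`.
fn points_into(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}

fn main() {
    let input = String::from("main.rs:fn main() {}");
    let file = parse(&input).expect("input contains a ':' separator");

    println!("{:?}", file);
    // Both fields are views into `input`, so nothing was allocated for them.
    println!("name borrows from input: {}", points_into(&input, file.name));
    println!("content borrows from input: {}", points_into(&input, file.content));
}
```

This only works because the parsed fields appear byte for byte in the input. As soon as a field has to be rewritten, a plain slice is no longer enough.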

However, this won’t work if our strings contain escape sequences for precisely the reasons we discussed above. What’s in memory is different from the original JSON string, and hence we need to copy the string and own the underlying memory to mutate it. It has to be mutated because we have to get rid of those escape sequences and unescape them. Luckily, serde does this for us.

Doesn’t sound that bad if it’s all taken care of for us anyway, right? Not quite. Copying all the data is undoubtedly computational overhead, and on top of that comes unescaping the escape sequences. For small strings this won’t really be an issue, but if you deal with MBs of data, it can become very noticeable, and now there’s a bottleneck. Dang.

This is something to keep in mind. I want you to walk away from this blog post with an awareness of this “problem” and one or two possible solutions. For what it’s worth, this was a real issue for me and it took me a while to get to the bottom of this. I wish I had such a blog post back then. I suppose that’s why we are here 🤓.

So… how do we fix this?

The Solution(s)

Now that we know what we are dealing with, it’s time to look at the solution. In fact, there is more than one solution, but we’ll look at two in particular.

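We already know that plain string slices (&str) won’t work for content with escape sequences: serde_json cannot hand out a borrowed slice for a string it has to unescape, so deserialization fails with an error. We don’t want to use String either, because of the overhead of memory allocation and the additional compute for unescaping the escaped control sequences.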

One solution would be to use smart pointers, in particular Cow.

Smart Pointer: Cow 🐮

Cow is one of Rust’s smart pointers, which provides clone-on-write functionality. This means it allows immutable access to borrowed data but clones the data if needed, that is, if mutation or ownership is required. This is pretty cool, right?
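To illustrate the idea independently of serde, here is a simplified sketch of an unescape function that returns a Cow. It borrows the input when there is nothing to unescape and only allocates when it has to rewrite the string. The function name unescape and the small set of supported escapes (\n, \t, \" and \\) are assumptions for this example; this is not how serde_json is implemented.

```rust
use std::borrow::Cow;

/// Simplified, hypothetical unescaper: handles only \n, \t, \" and \\.
fn unescape<'a>(input: &'a str) -> Cow<'a, str> {
    if !input.contains('\\') {
        // Nothing to rewrite: hand back a view into the input, no allocation.
        return Cow::Borrowed(input);
    }

    // At least one escape sequence: build an owned, unescaped copy.
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            // Unknown escape: keep it untouched in this sketch.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Trailing backslash: keep it as is.
            None => out.push('\\'),
        }
    }
    Cow::Owned(out)
}

fn main() {
    let plain = unescape("README.md");
    let escaped = unescape(r"foo\nbar");

    println!("plain is borrowed: {}", matches!(plain, Cow::Borrowed(_))); // true
    println!("escaped is owned: {}", matches!(escaped, Cow::Owned(_))); // true
}
```

Callers can treat both variants the same way, because Cow<str> dereferences to &str regardless of whether the data is borrowed or owned.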

With that in mind, we could redesign our Rust data types as follows:

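use std::borrow::Cow;
use std::collections::HashMap;

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct Snippet<'a> {
    // Cow must opt in to borrowing from the input.
    #[serde(borrow)]
    description: Cow<'a, str>,
    public: bool,
    // Opt in so the nested &str and Cow fields can borrow as well.
    #[serde(borrow)]
    files: HashMap<&'a str, File<'a>>,
}

#[derive(Serialize, Deserialize)]
struct File<'a> {
    // &str is borrowed implicitly.
    name: &'a str,
    #[serde(borrow)]
    content: Cow<'a, str>,
}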

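Notice how we now use Cow<'a, str> instead of String for the description and content. Other fields like name will likely not contain newlines or other escape sequences, so a plain &str is enough for them. Keep in mind, though, that deserialization returns an error if such a field does contain one.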

Any fields using Cow are now dynamic and can hold either a borrowed &str or an owned String. It’s also necessary to use #[serde(borrow)] in combination with Cow, because only &str and &[u8] fields are borrowed implicitly. Other types of fields need to opt in to borrowing by using the attribute. Without it, deserializing into a Cow will always produce an owned value.

Check out this playground, which demonstrates that, if you’re not using the #[serde(borrow)] attribute, the data is owned even though the content does not contain any special characters. Play around with the code and the content, and add the attribute to see how the output changes.

Because Cow works with general borrowed data, we have to be explicit about the lifetime of the data. The lifetime 'a ties the borrowed strings to the input, so a Snippet cannot outlive the JSON string it was deserialized from.

It’s definitely a solution, and it’s better than using String because it would not copy data where mutation is not required, e.g., if there are no escape sequences in the data. Instead of cloning everything by default, it will lazily clone and create an owned value only if needed. This could potentially save computation. However, it’s rather unlikely that the data does not contain escape sequences.

So what’s better?

Binary Formats

A better solution may be to simply not use JSON for data exchange between the client and the server. I know, JSON is very convenient and easy to use, but if performance matters, then maybe it’s time to rethink the format in which we exchange data.

We have already seen that what’s in memory is different from the actual string. Again, that’s because control sequences get stored as a single byte rather than escaping it and storing multiple characters.

What if we could send bytes directly?

Yes! That’s absolutely possible and also reasonable, and we now enter the territory of binary formats.
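As a rough sketch of why this helps, consider a toy length-prefixed encoding: a 4-byte little-endian length followed by the raw UTF-8 bytes. This is a made-up format for illustration only, not one of the real binary formats, and the encode and decode functions are hypothetical names. Because line breaks travel as plain 0x0A bytes, there is nothing to unescape, and the decoder can return a slice of the received buffer.

```rust
/// Encodes a string as a 4-byte little-endian length followed by its UTF-8 bytes.
/// (Toy format: strings longer than u32::MAX bytes are not supported.)
fn encode(content: &str) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + content.len());
    buf.extend_from_slice(&(content.len() as u32).to_le_bytes());
    buf.extend_from_slice(content.as_bytes());
    buf
}

/// Decodes a string written by `encode`, borrowing it from `buf` without copying.
/// Returns None if the buffer is truncated or the payload is not valid UTF-8.
fn decode(buf: &[u8]) -> Option<&str> {
    if buf.len() < 4 {
        return None;
    }
    let (header, rest) = buf.split_at(4);
    let len = u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    let payload = rest.get(..len)?;
    std::str::from_utf8(payload).ok()
}

fn main() {
    // The line breaks are stored as single bytes; no escape sequences involved.
    let content = "fn main() {\n    println!(\"Hello World!\");\n}\n";
    let wire = encode(content);

    let decoded = decode(&wire).expect("well-formed buffer");
    println!("{} payload bytes, decoded without copying", decoded.len());
    assert_eq!(decoded, content);
}
```

The only work left on the decoding side is validating the length and the UTF-8 encoding, which is much cheaper than scanning for and rewriting escape sequences.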

There are a variety of binary formats, including:

  • FlatBuffers
  • Protocol Buffers (Protobuf)
  • CBOR
  • BSON
  • MessagePack

…and more. There are always pros and cons when choosing a format for data exchange. JSON is for sure the de facto solution on the web, but as we have seen, it has its drawbacks. As always, it really depends on the project, and there are still tradeoffs. For my specific use case, I ended up using a binary format for several reasons:

  • Deserialization was much faster, and I avoided the computational overhead for dealing with escape sequences and memory allocation. After some profiling, I saw a 3x improvement over serde_json. A huge factor was the zero-copy value decoding which was now possible.
  • Binary formats tend to be much more compact and therefore save bandwidth and reduce the request time.

There are, of course, also downsides. For example, the data has to be encoded into binary somewhere, and for us, that’s on the client side. This means we are limited to using JavaScript, or WebAssembly if we are lucky. So there is some cost to pay after all. Whether that cost is higher or lower is for you to find out. An in-depth comparison is out of scope for this blog post. However, in my case, the advantages outweighed the disadvantages, mostly because deserialization was 3x faster. In fact, encoding also outperformed JSON.stringify by more than 2x.

Conclusion

This post discussed escape sequences’ impact when exchanging data between a client and a server. We have seen that zero-copy value decoding might not always work and that deserialization may entail unexpected computational overhead. That said, we want to be careful when designing our data model. We should be thoughtful when choosing data types because certain types work better in some instances than others. For example, Cow is fantastic when you want to avoid cloning data unless ownership is required. If that doesn’t cut it and you still experience performance issues to a certain degree, then maybe look into binary formats. It’s not always the best to default to JSON.

But keep in mind:

Always measure! Premature optimizations are the root cause of all evil 😈.

I hope this blog post was helpful. If you have any questions, feel free to reach out to me on Twitter. I’d love to chat.


Dominic E.

GDE for Web / @Angular • Software Engineer @StackBlitz • Trainer @thoughtram • JavaScript stuff • Node.js • Rust • Deep Learning • Cyclist • Design