String interning in C++

Ever since I started the C++ version of my compiler, there were a few techniques I had in mind to keep memory usage under control. Deduplicating strings turns out to be a pretty general one, so I made a separate library for it, available now as stringpool on Github.

The basic tradeoff at the heart of any string interning system is that you get memory savings by sacrificing speed, and stringpool is no different in this regard. It hashes the string you give it, comparing it to the strings it already knows, and stores a copy of the string if it’s novel. Whether it’s new or not, it then returns to you a string_handle that refers to the interned copy of the string. The caller can then efficiently store and work with that handle instead of the original string. string_handle is not completely interchangeable with const char*, std::string, or whatever you were using before, but it has the basics covered. A notable limitation is that interned strings are immutable, but this is often acceptable in practice.

Of course, putting strings into a cache amounts to a reduction in memory usage only if the cache is hit later, ideally many more times than the number of unique strings. It so happens that many applications do display such a usage pattern. For example, a compiler that reads in the name of a function, variable, or type will likely read in that same name many times, once where it is defined, plus once for each place it is used. String interning can allow that compiler to store only one copy of each string instead of as many copies as appear in the source code.

I am planning to write a few posts going into some detail about certain features of stringpool, including

  • O(1) concatenation of interned strings
  • reference counting

Stay tuned!


Posted

in

by