In this post, I will discuss two interesting x86 instructions, what they do, and why you might want to use them.
PEXT – Parallel Bits Extract
When you want to manipulate the bits in an integer, the built-in bitwise operators in C are sufficient to achieve anything. However, sometimes you can do better. The PEXT instruction offers a particular way to manipulate bits that performs much better and uses much less code than otherwise.
I first used PEXT as a perfect hash function in my chess engine, but this example turns out to be rather general. Imagine we have 64-bit integers that we want to hash for use in a hash table. These numbers may be very large (using all 64 bits), but we know hashing can work because only certain bits (say, 6 of them) are of interest. If the bits we cared about were just the lowest 6 bits, our job would be easy, since the hash function could simply be ANDing the value with 0b111111. But in this scenario, the bits we care about are scattered throughout the 64. The needed mask might be more like 0x20C0000020080010. We could still use AND to turn off bits we don’t care about, but that wouldn’t give us a good hash function, since we’d still be stuck with large numbers. You could try adding a division with a modulus, but you have to check for collisions. Using a larger modulus mitigates collisions but means your hash table has to be larger, which might not be acceptable. More operations (like multiplying, shifting, XORing, etc.) might be added to in an attempt to improve the hash properties.
What we want in the end, whenever we’re using a hash table, is for the hash function’s output range to be a contiguous interval starting at 0. Given a value to be hashed and the mask of bits we care about, PEXT gives us exactly that. PEXT selects the bits indicated by the mask and concatenates them in the least significant part of the output, leaving the rest of the bits in the output zero. This is what makes PEXT great for hash tables, since the output, when interpreted as an integer, will be as small as possible. This means your hash tables can be as compact as possible—no wasted space, and no collisions.
Input value | Input mask | Output |
---|---|---|
0x12345678CAFEBABE | 0xFFFF0000FFFF0000 | 0x1234CAFE |
0b1111 | 0b0101 | 0b11 |
0b1010 | 0b0101 | 0 |
Notice that like with AND, the output will never have more bits set in it than are set in the mask. But unlike with AND, the masked bits with PEXT will be shifted down until they occupy the least significant positions.
Here are a few general properties:
Here is an educational C implementation of PEXT. Although this implementation could be improved, it would still be far slower than the built-in instruction, which on many CPUs takes as little as 3 clock cycles!
uint64_t pext(uint64_t value, uint64_t mask) {
uint64_t ret = 0;
int setMaskBits = 0;
for (int i = 0; i < 64; ++i) {
if (mask & (1ull << i)) {
if (value & (1ull << i))
ret |= 1ull << setMaskBits;
++setMaskBits;
}
}
return ret;
}
PDEP – Parallel Bits Deposit
While we’re at it, let’s look at PDEP, which can be thought of as an opposite of PEXT. Given some bit pattern and a mask, PDEP sends the bits in the bit pattern to the positions indicated by the mask. This scatters whereas PEXT gathers. Output bits not indicated by the mask are set to zero.
Input value | Input mask | Output |
---|---|---|
0x1234CAFE | 0xFFFF0000FFFF0000 | 0x12340000CAFE0000 |
0b11 | 0b0101 | 0b0101 |
0 | 0b0101 | 0 |
The same properties noted above also apply to PDEP:
Here is an educational C implementation of PDEP.
uint64_t pdep(uint64_t value, uint64_t mask) {
uint64_t ret = 0;
int setMaskBits = 0;
for (int i = 0; i < 64; ++i) {
if (mask & (1ull << i)) {
if (value & (1ull << setMaskBits))
ret |= 1ull << i;
++setMaskBits;
}
}
return ret;
}
Note the differences between these functions are beautifully simple.
uint64_t pextpdep(uint64_t value, uint64_t mask) {
uint64_t ret = 0;
int setMaskBits = 0;
for (int i = 0; i < 64; ++i) {
if (mask & (1ull << i)) {
if (value & (1ull << isetMaskBits))
ret |= 1ull << setMaskBitsi;
++setMaskBits;
}
}
return ret;
}
One neat application for PDEP is when you want to enumerate all bit patterns in which only certain bits can change, while the other bits are fixed. For example, say you want to print all the integers whose 0th-2nd bits and 6th-7th bits are the same as those of the constant 0b00101001, while the bits outside these intervals are free to take on all possible values. With PDEP, this is easy. We simply use a mask to dictate which bits are free to change, then invoke PDEP to do the hard part for us.
#include <immintrin.h>
#include <cstdint>
void printAllValues(uint64_t templateValue, uint64_t changeableMask) {
// Turn off the bits we're going to change.
// (This can also be done with one instruction, ANDN.)
templateValue &= ~changeableMask;
// Count the values we're going to produce.
int count = 1 << __builtin_popcount(changeableMask);
// Counting from 0 to N is the same as enumerating all combinations of N bits.
for (int i = 0; i < count; ++i) {
// Use PDEP to distribute the bits to the places we want.
uint64_t v = _pdep_u64(i, changeableMask) | templateValue;
printf("%lx\n", v);
}
}
Calling printAllValues(0b00101001, 0b11000111)
would print the following values, shown here in binary for ease of comparison. Changeable bits are shown in boldface.
00101000
00101001
00101010
00101011
00101100
00101101
00101110
00101111
01101000
01101001
01101010
01101011
01101100
01101101
01101110
01101111
10101000
10101001
10101010
10101011
10101100
10101101
10101110
10101111
11101000
11101001
11101010
11101011
11101100
11101101
11101110
11101111
Compiling
To compile C/C++ programs that use PEXT and PDEP, consult your toolchain’s documentation for what to #include as well as for the exact name of the intrinsic functions. Try <immintrin.h>
or <x86intrin.h>
. Also, most compilers by default target a “generic” x86 CPU which does not assume hardware support for these instructions. Therefore, you may need to tell your compiler to assume the CPU supports these instructions. This can be accomplished in GCC and Clang by setting -march=haswell
or newer. It can be accomplished in MSVC by setting /arch:AVX2
. Note that PEXT and PDEP are part of the BMI2 extensions, not the better-known AVX2. However, it seems that in practice, support for BMI2 and support for AVX2 always occur together.
APL
I later learned that the APL programming language has functions that work like PEXT and PDEP. The one like PEXT is called Compress. Compress takes an array and a mask (a Boolean array), and it outputs the ith element of the array for every i where the ith bit is set in the mask. This is exactly what PEXT does on a register. The one like PDEP is called Expand. This overlap between APL and x86-64 BMI2 is noted on the APL wiki, which also points out the AVX-512 functions VPCOMPRESSQ (Store Sparse Packed Quadword Integer Values Into Dense Memory/Register) and VPEXPANDQ (Load Sparse Packed Quadword Integer Values From Dense Memory/Register), which seemingly take their names from APL.