This still gave me 3 ms, so it's not too bad. Like some other solutions, it splits the 32 bits into smaller pieces (here, eight 4-bit nibbles), reverses each piece with a 16-entry lookup table, and then assembles the result with the pieces in mirrored order.

```
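#include <stdint.h>

uint32_t reverseBits(uint32_t n)
{
    /* nibbles[i] is the 4-bit value i with its bit order reversed. */
    static const uint8_t nibbles[16] =
    {
        0x0, 0x8, 0x4, 0xC,
        0x2, 0xA, 0x6, 0xE,
        0x1, 0x9, 0x5, 0xD,
        0x3, 0xB, 0x7, 0xF
    };
    uint32_t res = 0;
    /* Nibble k of n (counted from the low end) lands in nibble 7 - k of res.
       The cast keeps the shift unsigned: a uint8_t is promoted to int, and
       shifting a set bit into the sign bit of an int is undefined in C. */
    res |= (uint32_t)nibbles[(n >>  0) & 0xf] << 28;
    res |= (uint32_t)nibbles[(n >>  4) & 0xf] << 24;
    res |= (uint32_t)nibbles[(n >>  8) & 0xf] << 20;
    res |= (uint32_t)nibbles[(n >> 12) & 0xf] << 16;
    res |= (uint32_t)nibbles[(n >> 16) & 0xf] << 12;
    res |= (uint32_t)nibbles[(n >> 20) & 0xf] <<  8;
    res |= (uint32_t)nibbles[(n >> 24) & 0xf] <<  4;
    res |= (uint32_t)nibbles[(n >> 28) & 0xf] <<  0;
    return res;
}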
```
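
For comparison, here is a rough sketch of the same table lookup written as a loop, next to a bit-by-bit version that can be used to sanity-check it. The names `kNibbleReversed`, `reverse_bits_loop` and `reverse_bits_naive` are made up for this illustration and are not part of the solution above. I haven't timed this variant, so treat it as a readability alternative rather than a faster one.

```c
#include <stdint.h>

/* Each entry is its own index with the 4 bits in reverse order. */
static const uint8_t kNibbleReversed[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Hypothetical loop form: take the lowest nibble of n, reverse it via the
   table, and append it to the low end of res. After 8 rounds the first
   nibble taken has been pushed up to the top of res. */
uint32_t reverse_bits_loop(uint32_t n)
{
    uint32_t res = 0;
    for (int i = 0; i < 8; ++i) {
        res = (res << 4) | kNibbleReversed[n & 0xF];
        n >>= 4;
    }
    return res;
}

/* Bit-by-bit reference: same idea with 1-bit pieces and no table. */
uint32_t reverse_bits_naive(uint32_t n)
{
    uint32_t res = 0;
    for (int i = 0; i < 32; ++i) {
        res = (res << 1) | (n & 1u);
        n >>= 1;
    }
    return res;
}
```

For the example input 43261596 (`0x02941E9C`), both functions should return 964176192 (`0x39782940`).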