My 2ms C solution and some thoughts

• My first version is:

``````
uint32_t reverseBits(uint32_t n) {
    uint32_t res = 0;
    for (int i = 0; i < 31; i++) {
        res |= (n & 1);
        n >>= 1; res <<= 1;
    }
    res |= (n & 1);
    return res;
}
``````

And this works pretty well. 4ms.

After that, I replaced

``````
for (int i = 0; i < 31; i++)
``````

with

``````
for (int i = 31; i > 0; i--)
``````

This works in 3ms.
I'm not very familiar with x86 assembly and I only know MIPS.
Comparing against zero is, at least in my opinion, faster: the decrement already sets the flags, so the loop needs no separate compare instruction. (Please comment if I'm wrong.)

After that, I thought we could unroll the loop, just as compilers do.
So the next version is:

``````
uint32_t reverseBits(uint32_t n) {
    uint32_t res = 0;

    // the same three statements, 31 times in total
    res |= (n & 1); n >>= 1; res <<= 1;
    res |= (n & 1); n >>= 1; res <<= 1;
    res |= (n & 1); n >>= 1; res <<= 1;
    res |= (n & 1); n >>= 1; res <<= 1;
    /* ... 26 more copies ... */
    res |= (n & 1); n >>= 1; res <<= 1;

    return res | (n & 1);
}
``````

Well it turns out to be slightly slower, 4ms. I don't know why.

I tried to figure out a loop-free solution and failed. (Please comment if there is one.)

So my last version is

``````
uint32_t reverseBits(uint32_t n) {
    uint32_t res = 0;
    for (int i = 31; i > 0; i--) {
        res |= (n & 1);
        n >>= 1; res <<= 1;
    }
    return res | (n & 1);
}
``````

Almost the same as the second one, but it magically runs in 2ms.
I'm not sure the OJ's time measurement is precise enough. I also don't think caching results is a good idea: this function would be called millions of times, and looking an answer up would take more time than just reversing the bits.

EDIT:
I found this one on Stack Overflow:

``````
uint32_t reverseBits(uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
    x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
    x = ((x >> 16) & 0xffffu) | ((x & 0xffffu) << 16);
    return x;
}
``````

And it works in 3ms. Could someone explain it?

1. Time measurement on the OJ is not very accurate. You need a larger benchmark to test performance.
2. Spatial and temporal locality make the cache work better. A very large loop body causes more instruction-cache misses, which is why the unrolled version can be even slower.
3. The last solution may be the simple one to explain. First it swaps odd and even bits, then it swaps each pair of bits, and so on up to the two 16-bit halves.
4. A 64K lookup table would make the last solution faster. (But the OJ may not accept your code if it is too long.)
5. There are some typos in your last solution.

