My first version is:

```
uint32_t reverseBits(uint32_t n) {
uint32_t res = 0;
for (int i = 0; i < 31; i++) {
res |= (n & 1);
n >>= 1; res <<= 1;
}
res |= (n & 1);
return res;
}
```

And this works pretty well. 4ms.

After that, I replaced

```
for (int i = 0; i < 31; i++)
```

with

```
for (int i = 31; i > 0; i--)
```

This works in 3ms.

I'm not very familiar with x86 assembly and I only know MIPS.

Comparing with zero is, well at least in my opinion, faster. (Please comment if I'm wrong.)

After that, I thought we could do loop expand, just as compilers do.

So the next version is:

```
uint32_t reverseBits(uint32_t n) {
uint32_t res = 0;
// repeat 31 times
res |= (n & 1); n >>= 1; res <<= 1;
res |= (n & 1); n >>= 1; res <<= 1;
res |= (n & 1); n >>= 1; res <<= 1;
res |= (n & 1); n >>= 1; res <<= 1;
....
res |= (n & 1); n >>= 1; res <<= 1;
return res | (n & 1);
}
```

Well it turns out to be slightly slower, 4ms. I don't know why.

I tried to figure out some loop-free solution and failed. (Please comment if there is any)

So my last version is

```
uint32_t reverseBits(uint32_t n) {
uint32_t res = 0;
for (int i = 31; i > 0; i--) {
res |= (n & 1);
n >>= 1; res <<= 1;
}
return res | (n & 1);
}
```

Almost the same as the second one but it magically works in 2ms.

I'm not sure if OJ's time measurement is precise enough, and I don't think caching is a good idea since this function would be called millions of times, searching would takes more time than just reversing it.

**EDIT:**

I found this one on Stack Overflow:

```
uint32_t reverseBits(uint32_t x) {
x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
x = ((x >> 8) & 0x00ff00ffu) | ((x & 0x00ff00ffu) << 8);
x = ((x >> 16) & 0xffffu) | ((x & 0xffffu) << 16);
return x;
}
```

And it works in 3ms. Could someone explain it?