I don't do a line by line scan, instead, I do a diagonal scan. And I also only scan what should be updated, and ignore most others.

So my algorithm runs with a O(N^3) time complexity instead of O(mnN). And of course, it doesn't scan a cell out of boundaries so the worst case will be O(mnN), when N^3 > mnN. This makes it faster than a line by line O(mnN) solution.

And also, my algorithm only needs O(mn) space. Note, It is entirely possible to reduce the space to O(min(N^2, mn)) too using my algorithm, if we just store all reachable cells in a N^2 space. However, it will make the implementation utterly messy, so I will still use the traditional O(mn) grid.

Show an example of my idea:

Input, 5,5,3,2,2

At the very beginning, only position 2,2 has a value 1, means by 0 move, there is only 1 unique way to reach (2,2), and 0 unique ways to reach all other places:

```
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0
```

My algorithm scans this way:

```
0 0 0 0 0
0 0 \ 0 0
0 \ 1 \ 0
0 0 \ 0 0
0 0 0 0 0
```

"\" means it scans from top-left to bottom-right, and it only scans the cells marked with a "\". So to calculate move #1, it only scans 4 positions instead of all 25 positions.

All scanned positions will add the values from all its neighbors. The result after first round of update:

```
0 0 0 0 0
0 0 1 0 0
0 1 1 1 0
0 0 1 0 0
0 0 0 0 0
```

Then it scans this way:

```
0 0 \ 0 0
0 \ 1 \ 0
\ 1 \ 1 \
0 \ 1 \ 0
0 0 \ 0 0
```

Results after 2nd round:

```
0 0 1 0 0
0 2 1 2 0
1 1 4 1 1
0 2 1 2 0
0 0 1 0 0
```

For all those cells that touches borders, the number of ways the ball can get out from this cell is the number on that cell times the number of directions it can get out.

For example, a cell at corners have 2 directions to get out, but the one on the border just has 1. So if a corner cell has a value K, the sum should add 2K.

The implementation of this algorithm is rather more complicated than it appears, because you need to do a lot of corner case handling when scanning diagonally. Here is the implementation:

```
struct Solution {
const static int BASE = 1000000007;
int findPaths(int height, int width, int maxmove, const int si, const int sj) {
int grid[height][width];
memset(grid, 0, sizeof(int) * height * width);
int sum = 0;
for (int k = 0; k < maxmove; ++k)
for (int d = si - sj + k, r = k; d >= si - sj - k; d -= 2, r --)
for(int i = max(max(sj - r,0)+ d,0), j = i - d; i < height && j < width && i <= si + r; i ++, j ++)
{
grid[i][j] = (k == 0);
int bound = 4;
if (i - 1 >= 0)
grid[i][j] = (grid[i][j] + grid[i - 1][j]) % BASE, bound --;
if (i + 1 < height)
grid[i][j] = (grid[i][j] + grid[i + 1][j]) % BASE, bound --;
if (j - 1 >= 0)
grid[i][j] = (grid[i][j] + grid[i][j - 1]) % BASE, bound --;
if (j + 1 < width)
grid[i][j] = (grid[i][j] + grid[i][j + 1]) % BASE, bound --;
sum = (sum + bound * (long long)grid[i][j]) % BASE;
}
return sum;
}
};
```