First of all, the standard O(N^2) DP solution gets MLE/TLE on LeetCode when written in C++, but passes comfortably when written in Java/Python :(

It looks like this (Python version):

```
from collections import defaultdict

class Solution(object):
    def numberOfArithmeticSlices(self, A):
        # dp[i][d]: number of arithmetic subsequences (length >= 2)
        # ending at A[i] with common difference d
        dp = [defaultdict(int) for i in range(len(A))]
        res = 0
        for i in range(1, len(A)):
            for j in range(i):
                step = A[i] - A[j]
                dp[i][step] += 1
                if step in dp[j]:
                    dp[i][step] += dp[j][step]
                    res += dp[j][step]
        return res
```

Obviously, it maintains an array of dictionaries to store the number of arithmetic subsequences (including those of length 2) ending with A[i]. As a result, the time and space complexity are both O(N^2), which I think is quite reasonable for 0 ≤ N ≤ 1000 on LC. But it fails in C++. So I tweaked the meaning of the DP equation from:
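As a sanity check, the recurrence can be exercised on the classic example [2, 4, 6, 8, 10], which has 7 arithmetic subsequences of length ≥ 3. Here is a standalone sketch of the same DP (a hypothetical free function, not the class above):

```python
from collections import defaultdict

def count_slices(A):
    # dp[i][step]: number of arithmetic subsequences (length >= 2)
    # ending at A[i] with common difference step
    dp = [defaultdict(int) for _ in range(len(A))]
    res = 0
    for i in range(1, len(A)):
        for j in range(i):
            step = A[i] - A[j]
            # every subsequence ending at A[j] with this difference
            # extends to length >= 3 at A[i], so it counts toward the answer
            res += dp[j][step]
            # the new pair (A[j], A[i]) plus all extended subsequences
            dp[i][step] += dp[j][step] + 1
    return res

print(count_slices([2, 4, 6, 8, 10]))  # → 7
```

The 7 subsequences here are [2,4,6], [4,6,8], [6,8,10], [2,4,6,8], [4,6,8,10], [2,4,6,8,10], and [2,6,10].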

```
DP[i][d] = the number of arithmetic subsequences ending with A[i], difference is d. (NOTE here the length of valid subsequences can be 2)
```

to

```
DP[i][d] = the number of arithmetic subsequences whose second-to-last number is A[i], difference is d.
```

After that, every subsequence we record must be able to grow to length at least 3 (its next value must actually occur in the array). It does save memory and finally passes LC in 423 ms:

```
class Solution {
public:
    int numberOfArithmeticSlices(vector<int>& A) {
        if (A.empty()) return 0;
        int n = A.size();
        // dp[i][d]: number of subsequences (length >= 2) ending at A[i]
        // with difference d that can still be extended by a value in A
        vector<unordered_map<long long, int>> dp(n);
        unordered_set<int> s(A.begin(), A.end());  // all values present in A
        int res = 0;
        for (int i = 1; i < n; ++i) {
            for (int j = i - 1; j >= 0; --j) {
                long long d = (long long)A[i] - (long long)A[j];
                int tmp = dp[j].count(d) ? dp[j][d] : 0;
                // each chain ending at A[j] extends to length >= 3 at A[i]
                if (tmp) res += tmp;
                // record the chain ending at A[i] only if the next value exists in A
                if (s.count(A[i] + d)) dp[i][d] += 1 + tmp;
            }
        }
        return res;
    }
};
```
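For comparison, the same pruning trick ports directly back to Python (a sketch with a hypothetical function name; I haven't submitted this variant, so I can't vouch for its runtime on LC). Note the `dp[j].get(d, 0)` read, which avoids inserting empty entries into the defaultdicts and so preserves the memory saving:

```python
from collections import defaultdict

def count_slices_pruned(A):
    n = len(A)
    s = set(A)  # all values present anywhere in A
    # dp[i][d]: number of chains (length >= 2) ending at A[i] with
    # difference d that can still be extended by a value in A
    dp = [defaultdict(int) for _ in range(n)]
    res = 0
    for i in range(1, n):
        for j in range(i):
            d = A[i] - A[j]
            tmp = dp[j].get(d, 0)  # read without inserting a zero entry
            res += tmp             # each chain extends to length >= 3 at A[i]
            # store the chain ending at A[i] only if it could ever be
            # extended, i.e. the next value A[i] + d occurs somewhere in A
            if A[i] + d in s:
                dp[i][d] += 1 + tmp
    return res

print(count_slices_pruned([2, 4, 6, 8, 10]))  # → 7
```

The membership test only checks that the next value exists somewhere, not after index i; chains stored for a value that occurs only earlier are simply never consumed, so the count stays correct.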