I'm looking for something below O(n^2)
and probably that'll be without nested loops.
I managed to write a linear-ish scan, but I have to re-start it from the last known bad position, and it looks too complicated.
I think I found a linear solution:
public class Solution {
public int longestSubstring(String s, int k) {
Validator validator = new Validator(k);
for (int i = 0; i < s.length(); i++) {
validator.add(s.charAt(i));
}
if (validator.invalids.isEmpty()) { // full input valid
return s.length();
}
Set<Character> ignore = new HashSet<>(validator.invalids);
int best = 0;
int start = 0;
validator.reset();
for (int end = 0; end < s.length(); end++) {
char ch = s.charAt(end);
if (ignore.contains(ch)) {
validator.reset();
start = end + 1;
} else if (validator.add(ch)) {
best = Math.max(best, end - start + 1);
}
}
return best;
}
private static class Validator {
private final int[] counts = new int[256];
private final Set<Character> invalids = new HashSet<>();
private final int k;
public Validator(int k) {
this.k = k;
}
/**
* @param c character visited
* @return whether the window is still valid
*/
public boolean add(char c) {
int count = ++counts[c];
boolean valid = count == 0 || k <= count;
if (valid) {
invalids.remove(c);
} else {
invalids.add(c);
}
return valid;
}
public void reset() {
invalids.clear();
Arrays.fill(counts, 0);
}
}
}
In my original one I had to reset to last bad, because I didn't pre-calculate the bad letters.
This version is not production-ready, but it's already asymptotically linear as I see it.
There could be improvements made around resetting (because if the input only has bad chars it's best case O(256*n)
).
It doesn't look right to me. Suppose you encounter a character that is not exactly invalid (it isn't in the ignored set), but it occurs only once near the beginning of the string. All its other occurrences are near the end. So you encounter this character, mark it as a start, and begin to extend your string. At a certain point you encounter enough repeated characters so your add
function returns true and you consider the whole [begin, end] substring to be a valid one, only it isn't.
As a matter of fact, the comment for add
is wrong. It doesn't return whether the window is still valid. It returns whether the added character is a valid one within the window. But it may contain other, invalid characters. If you modify it to return invalids.isEmpty()
, then the comment will be correct, but the algorithm still won't work. You'll end up extending an invalid string.
A test case like "abbbbbbcaa", 3
can illustrate my point. 'a' and 'b' are valid characters, 'c' is an ignored one. Your algorithm will consider "abbbbbb"
to be a valid string, but it isn't. The correct answer is "bbbbbb"
.
@stachenov Huh, thanks for looking. You're right it's not a correct implementation... it got accepted though with either return value.
Yes, the add
method should return invalids.isEmpty();
, that's a refactoring mistake I made.
Still, it's doesn't find the correct solution as you said. So there's no linear solution, we need to always reset and re-check?
This also a good example of not knowing that complexity to aim for. I was trying to get a linear one, even though I have no clue it exists. I have 2 trivial-ish bad quadratic solutions and one I can't figure out a complexity of (probably best case n
, worst case n^2
, and average in between); the latter rechecks at some positions and calculates the ignore on the fly.
I was looking for O(N lg N) to begin with. Got close to the average-case O(N lg N) (the one with intervals) in my mind, but decided against implementing it (seemed too complicated and not guaranteeing the complexity) and tried the sliding window O(N^2) which was accepted much to my surprise. So yes, I agree, a good example of not knowing the complexity in advance. And it illustrates my idea of spoiling the problem by specifying the complexity. When solving it, I'd definitely prefer penaltyless TLE to a specified complexity requirement.
@TWiStErRob The recursive solution of @StefanPochmann is in some sense O(n), because a total of O(n) work is done at every recursion level, and there are at most 26 levels of recursion.
@hxtang https://discuss.leetcode.com/topic/57092/6-lines-python? Interesting thought on the 26 levels...
Then any interval splitting solution is linear! I forgot that the alphabet size is limited.