O(nlogn) Suffix Array solution. Clear explanation.


  • 2
    Z

    Suffix array is an efficient tool for solving various string problems. The construction O(nlogn) method is very easy to understand, but can be also solved using O(n) construction methods such as DC3 or Skew.

    First, let's make several symbol definitions:

    len(S): Length of a string S

    S: A string S used in suffix array construction and calculation. Let's suppose S' starting index is 0, ending index is len(S') - 1. When we construct suffix array, we usually append '$' to the end of the suffix array to get S. Hence, S = S' + '$'

    S[i...j]: A sub string of S, starting from i, ending in j, inclusive.

    Suffix(i): A sub string of S, starting from i, to the end of the string S. ie, Suffix(i) = S[i..len(S) - 1]

    For example, in string S = "BANANA$", suffixes of S are:

    • Suffix(0) = "BANANA$"
    • Suffix(1) = "ANANA$"
    • Suffix(2) = "NANA$"
    • Suffix(3) = "ANA$"
    • Suffix(4) = "NA$"
    • Suffix(5) = "A$"
    • Suffix(6) = "$"

    sa[]: A one-dimensional array with length = len(S), storing a permutation of 0...len(S) - 1. Besides, Suffix(sa[i]) < Suffix(sa[i + 1]) for 0 ≤ i < len(S) - 1. This means that we first sort suffixes of S and save the beginning index of each suffix into sa[].

    In the above "BANANA$" example, the order of the suffixes are:

    • sa[0] = 6 = Suffix(6) = "$"
    • sa[1] = 5 = Suffix(5) = "A$"
    • sa[2] = 3 = Suffix(3) = "ANA$"
    • sa[3] = 1 = Suffix(6) = "ANANA$"
    • sa[4] = 0 = Suffix(0) = "BANANA$"
    • sa[5] = 4 = Suffix(4) = "NA$"
    • sa[6] = 2 = Suffix(2) = "NANA$"

    rank[]: A one-dimensional array,the inversion of sa[]. For each sa[i] = j, we have rank[j] = i. It is obvious that rank[i] represents where Suffix(i) ranks in sa[].

    In the above example, we have:

    • rank[0] = 4
    • rank[1] = 3
    • rank[2] = 6
    • rank[3] = 2
    • rank[4] = 5
    • rank[5] = 1
    • rank[6] = 0

    we can see that sa[rank[i]] = i

    Constructing suffix array sa[]

    So, how can we construct sa[]? A simple way is to see each Suffix(i) as a string and use string comparisons to sort them. Suppose n = len(S), this approach has O(n^2) complexity.

    We now introduce a doubling algorithm that has O(nlogn) complexity.

    For a string S, we define "k-prefix" as:

    S^k = S[0 .. k - 1] , if len(S) ≥ k

    S^k = S, if len(S) < k

    Define <^k, =^k, ≤^k as follows:

    Suppose there are two strings u and v,

    u <^k v iff u^k < v^k

    u =^k v iff u^k = v^k

    u ≤^k v iff u^k ≤ v^k

    It is clear that these comparisons compares first k characters of two strings. Particularly, if len(u) < k or len(v) < k, it does not matter because comparison ends before reaching length k.

    Hence, we get three important features:

    1. For k ≥ n, Suffix(i) <^k Suffix(j) iff Suffix(i) < Suffix(j)
    2. Suffix(i) =^2k Suffix(j) iff Suffix(i) =^k Suffix(j) and Suffix(i+k) =^k Suffix(j+k)
    3. Suffix(i) <^2k Suffix(j) iff (Suffix(i) <^k Suffix(j)) or (Suffix(i) =^k Suffix(j) and Suffix(i+k) <^k Suffix(j+k))

    There is a problem here: if i + k ≥ len(S) or j + k ≥ len(S), there is no definition for Suffix(i + k) or Suffix(j + k). However, this will not cause trouble because we use '$' as ending character. If '$' is not in character set of string S, then if i + k ≥ len(S), then len(Suffix(i)) ≤ k and hence Suffix(i) =^k Suffix(j) will always be false. Therefore, first k-prefix comparison will generate comparison result in this situation.

    We define following symbols:

    sa^k[]: A one-dimensional array with length = len(S), storing a permutation of 0...len(S) - 1. Besides, Suffix(sa[i]) <^k Suffix(sa[i + 1]) for 0 ≤ i < len(S) - 1. This means that we first sort k-prefix of suffixes of S and save the beginning index of each suffix into sa^k[].

    rank^k[]: A one-dimensional array,the inversion of sa^k[]. For each sa^k[i] = j, we have rank^k[j] = i. It is obvious that rank^k[i] represents where Suffix(i) ranks in sa^k[].

    If we already have sa^k[] and rank^k[], we can easily get sa^2k[] and rank^2k[]

    TO BE CONTINUED

    // sort suffixes of S in O(n*log(n))
    public int[] suffixArray(CharSequence S) {
    	int n = S.length();
    	Integer[] order = new Integer[n];
    	for (int i = 0; i < n; i++)
    		order[i] = n - 1 - i;
    
    	// stable sort of characters
    	Arrays.sort(order, new Comparator<Integer>() {
    
    		@Override
    		public int compare(Integer o1, Integer o2) {
    			return Character.compare(S.charAt(o1), S.charAt(o2));
    		}
    	});
    
    	int[] sa = new int[n];
    	int[] classes = new int[n];
    	for (int i = 0; i < n; i++) {
    		sa[i] = order[i];
    		classes[i] = S.charAt(i);
    	}
    	// sa[i] - suffix on i'th position after sorting by first len characters
    	// classes[i] - equivalence class of the i'th suffix after sorting by
    	// first len characters
    
    	for (int len = 1; len < n; len <<= 1) {
    		int[] c = classes.clone();
    		for (int i = 0; i < n; i++) {
    			// condition sa[i - 1] + len < n simulates 0-symbol at the end
    			// of the string
    			// a separate class is created for each suffix followed by
    			// simulated 0-symbol
    			classes[sa[i]] = i > 0 && c[sa[i - 1]] == c[sa[i]] && sa[i - 1] + len < n
    					&& c[sa[i - 1] + (len >> 1)] == c[sa[i] + (len >> 1)] ? classes[sa[i - 1]] : i;
    		}
    		// Suffixes are already sorted by first len characters
    		// Now sort suffixes by first len * 2 characters
    		int[] cnt = new int[n];
    		for (int i = 0; i < n; i++)
    			cnt[i] = i;
    		int[] s = sa.clone();
    		for (int i = 0; i < n; i++) {
    			// s[i] - order of suffixes sorted by first len characters
    			// (s[i] - len) - order of suffixes sorted only by second len
    			// characters
    			int s1 = s[i] - len;
    			// sort only suffixes of length > len, others are already sorted
    			if (s1 >= 0)
    				sa[cnt[classes[s1]]++] = s1;
    		}
    	}
    	return sa;
    }
    
        // Let i, j be index of CharSequence s
        // height[rank[i]] = LCP(Suffix(sa[rank[i]]), Suffix((sa[rank[i] + 1]))
    // height array can be used to construct longest common prefixes.
        // Let i' = min(rank[i], rank[j]), j' = max(rank[i], rank[j])
        // LCP(i, j) = min{height[k], i' <= k <= j'}
        // This is an RMQ question.
    public int[] height(int[] sa, CharSequence s) {
    	int n = sa.length;
    	int[] rank = new int[n];
    	for (int i = 0; i < n; i++)
    		rank[sa[i]] = i;
    	int[] height = new int[n - 1];
    	for (int i = 0, h = 0; i < n; i++) {
    		if (rank[i] < n - 1) {
    			for (int j = sa[rank[i] + 1]; Math.max(i, j) + h < s.length()
    					&& s.charAt(i + h) == s.charAt(j + h); ++h)
    				;
    			height[rank[i]] = h;
    			if (h > 0)
    				--h;
    		}
    	}
    	return height;
    }
    
    // Mirror index to left side centered. Eg: abc#cba
    private int mirrorLeft(int mid, int x) {
    	if (x <= mid)
    		return x;
    	return mid - (x - mid);
    }
    
    public String longestPalindrome(String s) {
    	StringBuilder ss = new StringBuilder(s);
    	ss.append("#");
    	ss.append(new StringBuilder(s).reverse().toString());
    	int sa[] = suffixArray(ss.toString());
    	int height[] = height(sa, ss.toString());
    	int N = ss.length();
    	int n = s.length();
    	int max = 0, pos = 0;
    	for (int i = 0; i < N - 1; ++i) {
    		int a = mirrorLeft(n, sa[i]);
    		int b = mirrorLeft(n, sa[i + 1]);
    		// Exclude abcdba#abdcba cases with the second condition
    		if (height[i] > max && Math.min(a, b) + height[i] - 1 == Math.max(a, b)) {
    			max = height[i];
    			pos = i;
    		}
    	}
    	return ss.substring(sa[pos], sa[pos] + max);
    }

Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.