Java beats 97%: moving window and REAL bits manipulation


  • 1
    G

    use 'rolling hash' and moving window.

    Since DNA only contains ACGT, we can use only 2 bits to represent one char.

    Define:
    A = 00, C = 01, G = 10, T = 11

    Then a 10-char DNA sequence can be represented by a unique 20-bit number.

    Since we only consider 10-char DNA sequence, we can use 'moving window' strategy. each time we move one step forward the window, remove the last char and append the new char to it:

    code = (code & mask) << 2 | map[s.charAt(i) - 'A']

    where :

    • & mask cause a removal of the top char
    • << 2 moves the window one step
    • | map[s.charAt(i) - 'A'] append the new char to the end.
    public class Solution {
        public List<String> findRepeatedDnaSequences(String s) {
            List<String> res = new ArrayList<>();
            if (s.length() < 11) return res;
            char[] map = new char[26];
            map['C' - 'A'] = 1;
            map['G' - 'A'] = 2;
            map['T' - 'A'] = 3;
            // 0011 1111 1111 1111 1111 
            int mask = 0x3ffff;
            Set<Integer> seen = new HashSet<>();
            Set<Integer> repeat = new HashSet<>();
    
            // init
            int code = 0;
            for (int i = 0; i < 9; i ++) {
                code <<= 2;
                code |= map[s.charAt(i) - 'A'];
            }
    
            for (int i = 9; i < s.length(); i ++) {
                code = (code & mask) << 2 | map[s.charAt(i) - 'A'];
                if (!seen.add(code) && repeat.add(code)) {
                    res.add(s.substring(i - 9, i + 1));
                }
            }
    
            return res;
        }
    }
    

  • 0
    T

    This solution is so brilliant!!


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.