I did it in 10 lines of C++


  • 139
    J

    The main idea is to store the substring as int in map to bypass the memory limits.

    There are only four possible character A, C, G, and T, but I want to use 3 bits per letter instead of 2.

    Why? It's easier to code.

    A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don't see it? Let me write it in octal.

    A is 0101, C is 0103, G is 0107, T is 0124. The last digit in octal are different for all four letters. That's all we need!

    We can simply use s[i] & 7 to get the last digit which are just the last 3 bits, it's much easier than lookup table or switch or a bunch of if and else, right?

    We don't really need to generate the substring from the int. While counting the number of occurrences, we can push the substring into result as soon as the count becomes 2, so there won't be any duplicates in the result.

    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<int, int> m;
        vector<string> r;
        int t = 0, i = 0, ss = s.size();
        while (i < 9)
            t = t << 3 | s[i++] & 7;
        while (i < ss)
            if (m[t = t << 3 & 0x3FFFFFFF | s[i++] & 7]++ == 1)
                r.push_back(s.substr(i - 10, 10));
        return r;
    }
    

    BTW, the OJ doesn't seems to have test cases which the given string length is smaller than 9, so I didn't check it to make the code simpler.

    Any suggestions?

    Update:

    I realised that I can use s[i] >> 1 & 3 to get 2 bits, but then I won't be able to remove the first loop as 1337c0d3r suggested.


  • 62

    Neat idea. The additional 1 bit per letter still encode each substring in 10x3 = 30 bits, just under 4 bytes for a 32-bit integer.

    Your code could be further simplified. By observing that s[i] & 7 is never 0, each of the first nine substrings with length < 10 will have unique hash key and will never collide with other 10-letter long sequences. Therefore the first loop could be removed and be compacted into a single loop.

    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<int, int> m;
        vector<string> r;
        for (int t = 0, i = 0; i < s.size(); i++)
            if (m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1)
                r.push_back(s.substr(i - 9, 10));
        return r;
    }
    

    Another observation is the mapped value need not be an integer counter, and could simply be a boolean to further save space. This requires some extra logic though:

    vector<string> findRepeatedDnaSequences(string s) {
        unordered_map<int, bool> m;
        vector<string> r;
        for (int t = 0, i = 0; i < s.size(); i++) {
            t = t << 3 & 0x3FFFFFFF | s[i] & 7;
            if (m.find(t) != m.end()) {
                if (m[t]) {
                    r.push_back(s.substr(i - 9, 10));
                    m[t] = false;
                }
            } else {
                m[t] = true;
            }
        }
        return r;
    }
    

    PS: OJ does have test cases which the given string has length that is less than 9.


  • 24
    M

    Concise solution! A little longer but more readable solution base on your idea.

    int str2int(string s) {
        int str=0;
        for (int i = 0; i < s.size(); ++i)
            str = (str<<3) +(s[i]&7);
        return str;
    }
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> res;
        unordered_map<int,int> coll;
        for (int i = 0; i+10 <= s.size(); ++i)
            if (coll[str2int(s.substr(i,10))]++ == 1)
                res.push_back(s.substr(i,10));
        return res;
    }

  • 0

    Not really optimal because you are calling substr() for every 10-length substring. What's the point of regenerating the hash using bit-manipulation for each substring when you can do rolling hash?


  • 0
    M

    Yes, rolling is more efficient, but it may be difficult for some people to understand it directly. My solution is just for these guys. BTW, if the pattern is longer, it can save a little more space ...


  • 0
    R

    Thanks for the great idea!

    Another way to calculate the rolling value is to

    t = ((t<<5)>>>2)|(s.charAt(i)&7); // in Java
    

    not saying it's better, just another way doing it.


  • 1
    L

    What happened if s[i] & 7 == 0, could you explain that part more?


  • 0
    C

    for example, if s[0] & 7 == 0 and s[1] & 7 == 0, the hash code of string s[0..0] and s[0..1] are the same, which will be wrong.


  • 0
    Z

    I like short code.


  • 0
    S

    if s = "AAAAAAAAAAA", which is combined with 11 'A', should we consider "AAAAAAAAAA" as a result?


  • 12
    /** 'A' - 64 = 1 (mod 5) = 1 
    *   'C' - 64 = 3 (mod 5) = 3 
    *   'G' - 64 = 7 (mod 5) = 2 
    *   'T' - 64 = 20 (mod 5) = 0
    */
    
    t = t << 2 & 0xfffff | (s[i] - 64) % 5; 
    

    is also a good choose!


  • 0
    H

    My Java solution using prime_tang's idea. Thanks!

     public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        Map<Integer, Boolean> map = new HashMap<Integer, Boolean>();
        List<String> result = new ArrayList<String>();
        for(int t = 0, i = 0; i < s.length(); i++){
            t = t<<2 & 0xfffff | ((int)s.charAt(i)-64) %5 ;
            if(map.containsKey(t)){
                if(map.get(t)){
                    result.add(s.substring(i-9, i+1));
                    map.put(t,false);
                }
            }else{
                map.put(t, true);
            }
        }
        return result;
    }
    

    }


  • 0

    This is not really optimal because you making the algorithm complexity becomes higher!


  • 1

    For once I'm jealous of you guys having overflow :-)
    (I mostly use Python, and ints don't overflow there)


  • 0

    (s[i] - 64) % 5 is 0 when s[i] is 'T', so your solution is not right when s start with "TT", just as "TTAAAAACCCCCAAAAACCCCC", the line "s.substring(i-9, i+1)" could out of range when i=2.

    Here is right solution:

    class Solution {
    public:
        std::vector<std::string> findRepeatedDnaSequences(std::string s) {
            std::vector<std::string> res;
            if (s.size() < 11)
                return res;
            std::unordered_map<int, int> flag;
            int t = 0, i = 0;
            for (; i != 9; ++i)
                t = t << 2 | (s[i] - 64) % 5;
            for (; i != s.size(); ++i)
                if (flag[t = t << 2 & 0xfffff | (s[i] - 64) % 5]++ == 1)
                    res.push_back(s.substr(i - 9, 10));
            return res;
        }
    };

  • 0

    Thank you for all the nice sharers here and their brilliant ideas. I have learnt from them a lot. For those who may have difficulties understanding the concise solution immediately, my blog has a detailed explanation.


  • 0
    J

    Here is the java version I wrote:

       public List<String> findRepeatedDnaSequences(String s) {
    	List<String> ans = new ArrayList<>();
    	if (s.length() < 10) {
            return ans;
    	}
    
        Map<Integer, Integer> map = new HashMap<>();
        int h = 0;
        for (int i = 0; i < 9; i++) {
        	h = h << 3 | s.charAt(i) & 7;
        }
        for (int i = 9; i < s.length(); i++) {
        	h = h << 3 & 0x3FFFFFFF | s.charAt(i) & 7;
        	map.put(h, map.getOrDefault(h, 0) + 1);
        	if (map.getOrDefault(h, 0) == 2) {
        		ans.add(s.substring(i - 9, i + 1));
        	}
        }
        return ans;
    }
    

    Brilliant idea! btw.


  • 0
    R

    I believe this is cleaner :

    vector<string> findRepeatedDnaSequences(string s) {
        map<char, int> ch{{'A',0},{'C',1},{'G',2},{'T',3}}; 
        map<int, int> cache;
        vector<string> result;
        int hash, mask=((1<<20)-1);
        for(int i=0; i<(int)s.size(); ++i){
            hash=(hash<<2)+ch[s[i]];
            if(i>=9){
                hash=(hash&mask);
                cache[hash]++;
                if(cache[hash]==2)
                    result.push_back(s.substr(i-9,10));
            }
        }
        return result;
    }
    

  • 0
    Z

    It really took a long while for me to understand your code since I never see people use assignment of t to get value of t except for things like ++t.


  • 0
    G

    I think your solution is good,very helpful !


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.