Evolve from straightforward solution to optimal, a review of all solutions


  • 0
    1. It is straightforward to use one hashtable to store all sequences and another hashtable to store repeated sequences
        vector<string> findRepeatedDnaSequences(string s) {
            unordered_set<string> seq, res;
            for(int i=0;i<(int)s.size()-9;i++) {
                string sub = s.substr(i,10);
                if(!seq.insert(sub).second) res.insert(sub);
            }
            return vector<string>(res.begin(),res.end());
        }
    
    1. The above solution does not use the information that the string contains only ACGT. Since each substr of length 10 contains only 4 distinct characters, we can encode each char with 2 bits and a substr with 20 bits to save memory.
        vector<string> findRepeatedDnaSequences(string s) {
            unordered_set<int> seq, add;
            char ht[85] = {0};
            ht['C'] = 1;
            ht['G'] = 2;
            ht['T'] = 3;
            vector<string> res;
            int sub=0;
            for(int i=0;i<s.size();i++) {
                sub = (sub<<2)+ht[s[i]] & 0xfffff;
                if(i>8 && !seq.insert(sub).second && add.insert(sub).second) res.push_back(s.substr(i-9,10));
            }
            return res;
        }
    
    1. Two things can still be improved from #2. First, the ascii code of ACGT are differnt in the last 3 digits. So we can just use it to encode without the additonal coding table. Second, we can count the occrences of each substr and add to the result when it reaches 2. So we only need 1 hashtable. Idea is from @JayXon
         vector<string> findRepeatedDnaSequences(string s) {
            unordered_map<int,int> ct;
            vector<string> res;
            int sub=0;
            for(int i=0;i<s.size();i++) {
                sub = (sub<<3)+ (s[i] & 7) & 0x3fffffff;
                if(ct[sub]++==1) res.push_back(s.substr(i-9,10));
            }
            return res;
        }
    
    1. We actually do not need to count the occurences of a substr. We just need to know if it is a repeated and if it has been added to the result. So we can change the value of the hashtable to bool to further save memory.
        vector<string> findRepeatedDnaSequences(string s) {
            unordered_map<int,bool> add;
            vector<string> res;
            int sub=0;
            for(int i=0;i<s.size();i++) {
                sub = (sub<<3)+ (s[i] & 7) & 0x3fffffff;
                auto it = add.find(sub);
                if(it==add.end()) add[sub] = 1;
                else if(it->second) {
                    res.push_back(s.substr(i-9,10));
                    it->second = 0;
                }
            }
            return res;
        }
    

Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.