C++ passed solution using bitmap, any better ideas?


  • 1
    C

    Since there are only four possibilities for each char, we could use 2 bit to represent one char and assign them value like: 00 for A, 01 for C 10 for G and 11 for T.

    Thus each 10 char substring could be transformed into 20 bits. We use a 32-bit int to represent the 10 char sequence. After we have the first 10 chars and move forward, we add the new char and remove the left-most char. We could use some bit operation to represent this update.

    Whenever we have a int value, we insert it into a map, which counts the occurrence of each int value.

    Finally we find out all the int value that appears more than once and transform them back to 10 char string.

    My code could be found here

    https://changhaz.wordpress.com/2015/02/05/leetcode-repeated-dna-sequences/

    Any better ideas?


  • 1
    C
    class Solution {
    public:
        // A - 00; C - 01; G - 10; T - 11;
        int DNAtoInt(string dna_seq) {   //dna_seq.size() == 10
            int num = 0;
            for (int i = 0; i < 10; ++i) {
                num <<= 2;
                switch (dna_seq.at(i)) {
                    case 'A':
                        num |= 0x0;
                        break;
                    case 'C':
                        num |= 0x1;
                        break;
                    case 'G':
                        num |= 0x2;
                        break;
                    case 'T':
                        num |= 0x3;
                        break;
                    default:
                        break;
                }
            }
            return num;
        }
        string IntToDNA(int num) {
            string dna(10, ' ');
            int i = 9;
            while (i >= 0) {
                int x = num & 3;
                switch (x) {
                    case 0:
                        dna.at(i) = 'A';
                        break;
                    case 1:
                        dna.at(i) = 'C';
                        break;
                    case 2:
                        dna.at(i) = 'G';
                        break;
                    case 3:
                        dna.at(i) = 'T';
                        break;
                    default:
                        break;
                }
                num >>= 2;
                --i;
            }
            return dna;
        }
        vector<string> findRepeatedDnaSequences(string s) {
            vector<string> result;
            if (s.size() <= 10) {
                return result;
            }
            unordered_map<int, int> m;
            int xx = DNAtoInt(s.substr(0, 10));
            m.insert(make_pair(xx, 1));
            for (int i = 10; i < s.size(); ++i) {
                xx = xx & 0x3ffff;
                xx <<= 2;
                switch (s.at(i)) {
                    case 'A':
                        xx |= 0;
                        break;
                    case 'C':
                        xx |= 1;
                        break;
                    case 'G':
                        xx |= 2;
                        break;
                    case 'T':
                        xx |= 3;
                        break;
                    default:
                        break;
                }
                unordered_map<int, int>::iterator it = m.find(xx);
                if (it != m.end()) {
                    ++it->second;
                } else {
                    m.insert(make_pair(xx, 1));
                }
            }
            
            for (auto r : m) {
                if (r.second > 1) {
                    result.push_back(IntToDNA(r.first));
                }
            }
            
            return result;
        }
    };
    

    same idea, C++ implementation.


  • 0
    M

    Your code is about 100 lines, it seems impossible to finish it in an 45 mins interview


  • 0
    W

    My solution is similar but much shorter. In fact, this 100-line codes can be cut down to < 40 lines. I wonder whether there is more clever method.


  • 0
    J
    class Solution {
    public:
    int ToNum(char c)
    {
        switch (c)
        {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
        }
    }
    
    string ToString(int n)
    {
        string s(10, ' ');
    
        char t[] = "ACGT";
        for (int i = 0; i < 10; i++)
        {
            s[9 - i] = t[n & 3];
            n >>= 2;
        }
        return s;
    }
    
    vector<string> findRepeatedDnaSequences(string s) {
        if (s.size() <= 10)
        {
            return {};
        }
    
        unordered_map<int, int> table;
        
        int num = 0;
        for (int i = 0; i < s.size(); i++)
        {
            num <<= 2;
            num |= ToNum(s[i]);
            
            if (i >= 9)
            {
                table[num & 0xFFFFF]++;
            }
        }
        
        vector<string> result;
        for (auto it = table.begin(); it != table.end(); ++it)
        {
            if (it->second > 1)
            {
                result.push_back(ToString(it->first));
            }
        }
        return result;
    }
    };

  • 0
    C

    much better code than what I had.

    One further improvement given by 'sen' is, we actually do not need hash-to-string function.

    Before we are adding int value to map, we check how many times it has appeared before. If it has appeared exactly once, we add the string directly to the output array. By this we find all piece appearing more than once and avoid duplicates.


  • 0
    J

    Agree. That's a very good idea!


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.