Clean Java solution (hashmap + bits manipulation)


  • 178
    C
    public List<String> findRepeatedDnaSequences(String s) {
        Set<Integer> words = new HashSet<>();
        Set<Integer> doubleWords = new HashSet<>();
        List<String> rv = new ArrayList<>();
        char[] map = new char[26];
        //map['A' - 'A'] = 0;
        map['C' - 'A'] = 1;
        map['G' - 'A'] = 2;
        map['T' - 'A'] = 3;
    
        for(int i = 0; i < s.length() - 9; i++) {
            int v = 0;
            for(int j = i; j < i + 10; j++) {
                v <<= 2;
                v |= map[s.charAt(j) - 'A'];
            }
            if(!words.add(v) && doubleWords.add(v)) {
                rv.add(s.substring(i, i + 10));
            }
        }
        return rv;
    }

  • 0
    B

    I got MLE with the following code. So strange. I can't see why.

    List<String> ret = new ArrayList<String>();
            if(s == null || s.length() < 10) return ret;
            HashSet<Integer> found = new HashSet<>();
            HashSet<Integer> doubleTime = new HashSet<>();
            for(int i = 0; i <= s.length()-10; i++) {
                int v = 0;
                for(int j = i; j < i+10; j++) {
                    v = v << 2;
                    v = v | (s.charAt(j) - 'A');
                }
                if(found.contains(v)) {
                    if(!doubleTime.contains(v)) {
                        ret.add(s.substring(i, i+10));
                        doubleTime.add(v);
                    }
                } else {
                    found.add(v);
                }
            }
            return ret;
    

  • 1
    C

    v = v | (s.charAt(j) - 'A'); - does it really work?

    My assumption is that it is going to produce random results, since:

    'C' - 'A' = 2 (10 in binary)
    'G' - 'A' = 6 (110)
    'T' - 'A' = 19 (10011)

    With bitwise "OR" this is going to produce chaotic values.


  • 0
    B

    You're right.

    How numb I was!!!!


  • 0
    R

    That it correct


  • 6

    "if(!words.add(v) && doubleWords.add(v)) " is cool


  • 0
    G

    Hi ,
    I am not able to understand the logic. Can some one right few steps why we are doing this.

    Thanks,
    Guru


  • 1
    M

    Thanks for the neat solution! Could you please explain why v |= map[s.charAt(j) - 'A']; could act as something like a hash to distinguish a 10-letter-long-sequence?


  • 107
    C

    The key point is that it is not doing hash, it is doing the exact coding of a 10-letter sequence into a 4-bytes number, which is simply not possible for any generic string, but is possible for strings in this problem because they can have only 4 differfent characters.

    In more detail:

    If two objects have same hash it means that they may or may not be equal (though two equal objects are required to have same hash). So hashing is not enough here (like calling just "AACCCCCGGG".hashCode() and storing it in the map), because there can be another (different) string with same hash and the program will output wrong result.

    We also cannot store the 10-letter substrings themselves because they consume too much memory and the program will exceed memory limit.

    So, instead of hashing or storing strings themselves the solution converts 10 letter string into 4-bytes integer (which is much smaller than string in terms of consumed memory). This would not be possible if the string could contain all 26 letters of English alphabet for example. But it is possible for our case, because there can be only 'A', 'C', 'G' and 'T' letters.

    So we have only 4 possible letters, and we can use as little bits as possible to store each character of our 10-letter string. We really need only 2 bits (bits, not bytes) for this. Specifically the solution uses the following coding:

    0 = 00 (bits in binary number system) = 'A'

    1 = 01 (bits in binary number system) = 'C'

    2 = 10 (bits in binary number system) = 'G'

    3 = 11 (bits in binary number system) = 'T'

    Note that since there 10 letters and each letter requires only 2 bits, we will need only 10 * 2= 20 bits to code the string (which is less then size of integer in java (as well as in all othere popular languages), which is 4 bytes = 32 bits).

    For example, this is how "AACCTCCGGT" string will be coded:

    A A C C T C C G G T

    00 00 01 01 11 01 01 10 10 11 = 00000101110101101011 (binary) = 23915 (decimal)


  • 0
    M

    Understand. Thanks so much for detailed explanation!


  • 0
    S

    @crazyirontoiletpaper Thank you for the detailed explanation. I hope I can thumb up your comment! PS: Funny user name~


  • 0
    I

    Thanks!
    How does " if(!words.add(v) && doubleWords.add(v))" work by the way?
    I didn't get it..


  • 0
    M

    I think that's because the !words.add(v) will first be executed and only when duplicate it becomes true and will execute the second add and see the result.


  • 8
    F

    if(!words.add(v) && doubleWords.add(v))
    The first words.add(v) is to make sure it's duplicated 10-char sequence, doubleWords.add(v) is to make sure it's not duplicated more than 3 times.


  • 25
    E

    Improve a little bit. Maintain a size 10 window rather than loop through every time.

    public class Solution {
        public List<String> findRepeatedDnaSequences(String s) {
            List<String> result = new ArrayList<>();
            Set<Integer> word = new HashSet<>();
            Set<Integer> secondWord = new HashSet<>();
            int[] map = new int[26];
            map['C' - 'A'] = 1;
            map['G' - 'A'] = 2;
            map['T' - 'A'] = 3;
            int value = 0;
            for (int i = 0; i < s.length(); i++) {
                value <<= 2;
                value |= map[s.charAt(i) - 'A'];
                value &= 0xfffff;
                if (i < 9) {
                    continue;
                }
                if (!word.add(value) && secondWord.add(value)) {
                    result.add(s.substring(i - 9, i + 1));
                }
            }
            return result;
        }
    }

  • 0
    M

    Oh, boy.
    So many smart thoughts.


  • 1
    L

    for(int j = i; j < i + 10; j++) {
    v <<= 2;
    v |= map[s.charAt(j) - 'A'];
    }
    this part, why do you construct a hash operation like this? Is there some conventional method to construct hash function?


  • 1
    C

    That is not a hash function construction, that is a conversion from string representation to a binary number representation of a 10-characters long strings, which may contain only 4 different characters (that is why it is possible to represent it as a 32bit number). Hash function is a bit different thing. If x == y, then x.hashCode() == y.hashCode(), but the converse is not true: if x.hashCode() == y.hashCode() it does NOT mean that x == y (though it CAN BE the case), at the same time if x.hashCode() != y.hashCode() it 100% means that x != y.


  • 0
    F

    So great!Thanks for your solution.


  • 0
    R

    why we need doulbleWords? what will happen if we don't have HashSet doulbleWords?


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.