Simple java solution using hashset and 4-based int


  • 5

    maintain two hashSets, one records the 10-letter substrings when scanning the input string,
    the other stores the strings that are already in the result.

    need to convert string to integer to reduce memory used.
    Since there are only four possible characters in string, I can use a 4-based integer to represent each string.

    one tricky thing: do not write comments outside the class, or u will get Memory Limit Exceeded.

    public List<String> findRepeatedDnaSequences(String s) {
        List<String> ans = new ArrayList<>();
        int len = s.length();
        HashSet<Integer> set = new HashSet<>();
        HashSet<Integer> res = new HashSet<>(); 
        for (int i = 10; i<= len; ++i){
            String sub = s.substring(i-10,i);
            int n = convert(sub);
            if (!res.contains(n)){
                if (!set.contains(n))
                    set.add(n);
                else{
                    ans.add(sub);
                    res.add(n);   
                }
            }
        }
        return ans;
    }
    
    public static int convert (String sub){
        int res = 0;
        HashMap<Character, Integer> dict = new HashMap<>();
        dict.put('A',0); dict.put('C',1);
        dict.put('G',2); dict.put('T',3);
        for (int i = 0; i< 10; ++i ){
            res+=dict.get(sub.charAt(i)); 
            res*=4;
        }
        return res;
    }

  • 0
    K

    How can you be sure that 2 strings won't end up returning the same int for your convert method?


  • 0

    imagine some outer space alians' DNA has 10 kinds of bases (letters), like ABCDEFGHIJ, you can just convert each letter to digits and represent the string with a number, eg. "0123456789", without any confusion, and this is just a normal 10-based number. Now similarly, the human DNA has 4 kinds of letter, you can use a 4-based number to represent a sequence, e.g. "0123012301". Here each digit represents a letter at that position, so the numeric string is unique for each distinct DNA-sequence. Then , we convert this 4-based number to the int type supported by java, and this is also a one-to-one mapping. So my method has no problem encoding the DNA sequence.


  • 0
    G

    Hi,
    What about overlapping sequences? it seems that your solution does not take into account this kind of sequences :
    AAATTTAAAATTTAAAA
    your method return [AAATTTAAAA]
    It is actually not precised in the problem, but I think overlapping strings should not be contained into the result


  • 0

    this problem is designed to contain the overlapping sequences.


  • 0
    S

    Thank you so much. You explained very well.


  • 0
    R
    This post is deleted!

  • -2
    Q

    I still get memory limit exceeded without any comments. Using integers does save some space. Apparently not enough...


  • 0
    K

    Use one map instead of two sets will do the trick


  • 0
    S

    I just got MLE error when I ran this code.


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.