Short Java "rolling-hash" solution


  • 40
    S

    Hi guys!

    The idea is to use rolling hash technique or in case of string search also known as Rabin-Karp algorithm. As our alphabet A consists of only 4 letters we can be not afraid of collisions. The hash for a current window slice could be found in a constant time by subtracting the former first character times size of the A in the power of 9 and updating remaining hash by the standard rule: hash = hash*A.size() + curr_char.

    Check out the Java code below.

    Hope it helps!


    public class Solution {
        private static final Map<Character, Integer> A = new HashMap<>();
        static { A.put('A',0); A.put('C',1); A.put('G',2); A.put('T',3); }
        private final int A_SIZE_POW_9 = (int) Math.pow(A.size(), 9);
    
        public List<String> findRepeatedDnaSequences(String s) {
            Set<String> res = new HashSet<>();
            Set<Integer> hashes = new HashSet<>();
            for (int i = 0, rhash = 0; i < s.length(); i++) {
                if (i > 9) rhash -= A_SIZE_POW_9 * A.get(s.charAt(i-10));
                rhash = A.size() * rhash + A.get(s.charAt(i));
                if (i > 8 && !hashes.add(rhash)) res.add(s.substring(i-9,i+1));
            }
            return new ArrayList<>(res);
        }
    }
    

  • 0
    Q

    what's int A_SIZE_POW_9 for?


  • 0
    X

    good idea to use rolling hash!


  • 0
    S

    When we compute a hash we subtract the first char in a window multiplied by "the size of an alphabet in the power of 9" because the size of a window is 10. As it is a fixed value I just extracted it to a constant avoiding recomputing it each time.


  • 0
    G

    What about collisions? You could compute the hash of a string to be the same as another string and then accidentally add it to the result set


  • 0
    G

    Ah NVM I am messing up my basics. the hash function will use the equals() in that case.


  • 0
    C

    this is the right answer!


  • 0
    D

    Re-define hashcode, excellent solution!


  • 0
    G

    @shpolsky A more general observation you want to probably implement a mod m method for more rolling hashes because of integer overflow. It is not an issue in this problem because a rolling hash is always < 3*4^11 thus never overflowing the integer's precision


  • -2
    S

    I red Wiki 's Rabin-Karp Algorithm description. It is said that the base need to be a large prime. But according to your code, can I say the base need to be at least no less than the size of symbols. For example in this case there are 4 symbols (0,1,2,3), you use 4 so that there won't be collision. I feel it's like for decimal numbers, we don't have collision unless each digit is the same.


  • 0
    F

    Hello man. I also tried to use hash method to solve this problem. actually I used the built-in hashCode() function. All test cases passed but one. can you help me figure out what is the problem here?

    here is my code:

    public class Solution {
        public List<String> findRepeatedDnaSequences(String s) {
            List<String> result = new LinkedList<String>();
            Set<Integer> set = new HashSet<Integer>();
            for(int i = 0;  i <= s.length()-10; i++){
                String sub = s.substring(i,i+10);
                int hashNum = sub.hashCode();
                if(set.contains(hashNum)){
                    if(!result.contains(sub))
                        result.add(sub);
                }
                else
                    set.add(hashNum);
            }
            return result;
        }
    }

  • 1
    S

    Hi man! It seems like you've got a collision of default hash codes for different strings. It happens because Java hashCode for strings is the same as I used above but with base 31 which is much bigger than 4. Due to the size of the base for comparably long strings like of 10 chars or so there will be an integer overflow and collisions become an issue. You could avoid that by using Set of Strings solving collisions with linked lists but for this problem it would cause MLE, that's why we need to redefine hash. Hope it helps!


  • 0
    F

    Thank you for your answer! I checked it again and find it is integer overflow like you said. since 31^10 larger than the largest integer value. And also sometimes collisions happen not because of overflow since there are 256 characters but there are only base 31.


  • 0
    G

    Let me explain it to you guys, it is so easy to understand when you treat the 10-char substring as a 10-digit "Quaternary" number!
    How do you identify 12 and 21?
    Ans: 14^1+24^0 = 6;
    24^1+14^0 = 9;
    The mechanism of creating hashcodes without collisions is just like that.


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.