Python solution with detailed explanation


  • 0
    G

    Solution

    Repeated DNA Sequences https://leetcode.com/problems/repeated-dna-sequences/?tab=Description

    Brute Force

    • The brute force solution is to use a rolling window and store every substring in a set and report a match. This can be inefficient in terms of memory usage - you store approximately 10 * N strings.

    Karp Rabin Optimization

    • The optimization should make use of reduced alphabet size and the fixed length of the strings. We use Karp-Rabin to accomplish this.
    • There are 4 characters and this means we must use base 4.
    • Compute the hash of first 10 characters - i.e. the base 4 representation of first 10 characters. How will you compute "123"? 1, 1 * 10 + 2 = 12, 12 * 10 + 3.
    • While computing the hash for first number, compute 4^9
    • Now as we roll the window, say the string was "1234"(base 10). Then hash for "234" = (123-100) * 10 + 4.
    class Solution(object):
        def findRepeatedDnaSequences(self, s):
            """
            :type s: str
            :rtype: List[str]
            """
            if len(s) < 10:
                return []
            tags, N, base = {"A":0, "C":1, "G":2, "T":3}, 10, 4
            cache, repeats = set([]), set([])
            number, factor = 0, 1
            for i in range(N):
                number, factor = number*base + tags[s[i]], factor*base
            factor /= base # Pre-compute 4^9
            cache.add(number)
            for i in range(N, len(s)):
                number = (number - tags[s[i-N]]*factor)*base + tags[s[i]]
                if number in cache:
                    repeats.add(s[i-(N-1):i+1])
                else:
                    cache.add(number)
            return [x for x in repeats]
    

Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.