A "complete" solution for TinyURL (Leetcode System Design)


  • 42
    Z

    Note: This solution is translated from the SegmentFault post over here. Thanks to liuqi627.

    Question Description

    Over here

    S: Scenario

    Convert a long URL to a short URL and back.

    N: Need (Assume the system is not massive if you are not sure)

    QPS (queries per second)

    • Daily users: 100M
    • Daily usage per person: (write) long2short 0.1, (read) short2long 1
    • Daily requests: write 10M, read 100M
    • QPS: a day is 86,400 seconds, roughly 100K, so:

    Write ≈ 100, Read ≈ 1K

    • Peak QPS: write ≈ 200, read ≈ 2K

    (Thousand-level QPS can be handled by a single MySQL machine with an SSD.)

    Storage

    • 10M new mappings (long URL to short URL) per day
    • assume each mapping takes 100 bytes on average
    • about 1GB per day; a 1TB hard drive could last about 3 years

    Storage is not the problem for this kind of system. Services like Netflix may have storage issues.
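
    A quick back-of-envelope check of the numbers above (a minimal sketch in Java; the constants are just the assumptions from this section):

    public class Estimates {
        public static void main(String[] args) {
            long dailyWrites = 10_000_000L;   // 100M users * 0.1 write per day
            long dailyReads = 100_000_000L;   // 100M users * 1 read per day
            long secondsPerDay = 86_400L;     // roughly 100K

            System.out.println("Write QPS ~ " + dailyWrites / secondsPerDay); // ~115, call it 100
            System.out.println("Read QPS  ~ " + dailyReads / secondsPerDay);  // ~1157, call it 1K

            long bytesPerMapping = 100L;      // assumed average mapping size
            long gbPerDay = dailyWrites * bytesPerMapping / 1_000_000_000L;   // ~1 GB per day
            System.out.println("Storage ~ " + gbPerDay + " GB/day");
        }
    }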

    Through the S/N analysis above, we get a big picture of the system. In general, this system is not hard and could be handled by a single SSD machine.

    A: API

    Only one service: URLService

    • Core (Business Logic) Layer:
      • Class: URLService
      • Interface:
        • URLService.encode(String long_url)
        • URLService.decode(String short_url)
    • Web Layer:
      • REST API:
        • GET /{short_url}: returns an HTTP redirect response (301)
        • POST: the goo.gl method (Google's URL shortener)

    Request body: {url=longUrl}, e.g. {"longUrl": "http://www.google.com/"}
    Returns OK (200), with the short_url included in the response data.
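
    A hypothetical request/response exchange (the POST path and the short code "abc123" are made up for illustration):

    POST /urlservice
    {"longUrl": "http://www.google.com/"}
    -> 200 OK, {"shortUrl": "http://tiny.url/abc123"}

    GET /abc123
    -> 301 Moved Permanently, Location: http://www.google.com/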

    K: Data Access

    Step 1: Pick a storage structure

    SQL vs NoSQL?

    1. Does it need to support transactions? NoSQL generally does not support transactions.
    2. Do we need rich SQL queries? NoSQL does not support as many kinds of queries as SQL.
    3. Do we care about development efficiency? Most web frameworks support SQL databases very well (with ORMs), which means less code for the system.
    4. Do we need AUTO_INCREMENT IDs? NoSQL can't do this; it only has a globally unique Object_id.
    5. Does the system have a high QPS requirement? NoSQL has high performance; for example, Memcached can reach millions of QPS, MongoDB tens of thousands, MySQL only thousands.
    6. How much scalability does the system need? SQL requires developers to write their own scaling code, while NoSQL comes with it built in (sharding, replicas).

    Answer:

    1. No -> NoSQL
    2. No -> NoSQL
    3. Doesn't matter, because there is only a little code to write. -> NoSQL
    4. Our algorithm needs AUTO_INCREMENT ID. -> SQL
    5. Write 200, Read 2K. Not high. -> SQL
    6. Not high. -> SQL

    System Algorithm

    OK, let's talk about the system algorithm. There are the following candidate solutions:

    1. Hash function:

      long_url -> md5/sha1

      • MD5 converts a string into 128 bits, generally represented as 32 hex characters (16 bytes):

      http://site.douban.com/chuan -> c93a360dc7f3eb093ab6e304db516653

      • SHA-1 converts a string into 160 bits, generally represented as 40 hex characters (20 bytes):

      http://site.douban.com/chuan -> dff85871a72c73c3eae09e39ffe97aea63047094

    These two algorithms make sure hash values are randomly distributed, but conflicts are inevitable, as they are for any hash algorithm.

    • Pros: Simple. We could take the first 6 chars of the converted string.

    • Cons: Conflicts.

      Solutions: 1. Use (long_url + timestamp) as the hash key. 2. On a conflict, regenerate the hash value (it will differ because the timestamp changes).

      Overall, once there are over 1 billion URLs, there would be a lot of conflicts and the efficiency could become very low.
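
    A minimal sketch of this hashing approach in Java (the class name is made up; collision detection against the database is omitted):

    import java.security.MessageDigest;

    public class HashShortener {
        // Take the first 6 hex chars of md5(long_url). On a conflict, the caller
        // would re-hash with (long_url + timestamp) as described above.
        public static String shorten(String longUrl) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(longUrl.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.substring(0, 6);
        }

        public static void main(String[] args) throws Exception {
            System.out.println(shorten("http://site.douban.com/chuan")); // c93a36
        }
    }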

    2. Base 62
      Treat the short_url as a base-62 number. 6 characters can represent 62^6 ≈ 57 billion URLs.

    Each short_url corresponds to a decimal integer, which can be the AUTO_INCREMENT id in the SQL database.

    public class URLService {
        // In-memory stand-ins for the database: long URL -> id and id -> long URL.
        HashMap<String, Integer> ltos;
        HashMap<Integer, String> stol;
        static int COUNTER;  // plays the role of the AUTO_INCREMENT id
        String elements;     // the 62 characters used as base-62 digits
        URLService() {
            ltos = new HashMap<String, Integer>();
            stol = new HashMap<Integer, String>();
            COUNTER = 1;
            elements = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        }
        public String longToShort(String url) {
            // Assign the next id to this URL and encode the id in base 62.
            String shorturl = base10ToBase62(COUNTER);
            ltos.put(url, COUNTER);
            stol.put(COUNTER, url);
            COUNTER++;
            return "http://tiny.url/" + shorturl;
        }
        public String shortToLong(String url) {
            // Strip the domain, decode the base-62 code back to the id, and look it up.
            url = url.substring("http://tiny.url/".length());
            int n = base62ToBase10(url);
            return stol.get(n);
        }
        public int base62ToBase10(String s) {
            int n = 0;
            for (int i = 0; i < s.length(); i++) {
                n = n * 62 + convert(s.charAt(i));
            }
            return n;
        }
        // Maps a base-62 digit back to its value: 0-9 -> 0-9, a-z -> 10-35, A-Z -> 36-61.
        public int convert(char c) {
            if (c >= '0' && c <= '9')
                return c - '0';
            if (c >= 'a' && c <= 'z')
                return c - 'a' + 10;
            if (c >= 'A' && c <= 'Z')
                return c - 'A' + 36;
            return -1;
        }
        public String base10ToBase62(int n) {
            StringBuilder sb = new StringBuilder();
            while (n != 0) {
                sb.insert(0, elements.charAt(n % 62));
                n /= 62;
            }
            // Left-pad with '0' to the fixed length of 6 characters.
            while (sb.length() < 6) {
                sb.insert(0, '0');
            }
            return sb.toString();
        }
    }
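
    For example, the first call to longToShort returns http://tiny.url/000001 (id 1, left-padded to 6 characters), and shortToLong("http://tiny.url/000001") maps it back to the original URL.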
    

    Step 2: Database Schema

    One table (id, long_url). id is the primary key, with an index on long_url so long-to-short lookups are fast.
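
    A possible way to create that table (a sketch via JDBC; the table/column names, index, and connection string are illustrative assumptions, and a MySQL driver on the classpath is assumed):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class Schema {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/tinyurl", "user", "password");
                 Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS url_mapping ("
                        + " id BIGINT AUTO_INCREMENT PRIMARY KEY,"      // base62-encoded into short_url
                        + " long_url VARCHAR(2048) NOT NULL,"
                        + " INDEX idx_long_url (long_url(255))"         // speeds up long -> short lookups
                        + ")");
            }
        }
    }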

    The basic system architecture:

    Browser <-> Web <-> Core <-> DB

    O: optimize

    How to improve the response speed?

    Improve the response speed between web server and database

    Use Memcached to improve response speed. When resolving a long_url, search the cache first, then the database. We could serve ~90% of read requests from the cache.
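
    A minimal cache-aside sketch (the Map stands in for Memcached, and loadFromDatabase is a hypothetical DB lookup):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CachedDecoder {
        private final Map<String, String> cache = new ConcurrentHashMap<>();

        public String shortToLong(String shortUrl) {
            String longUrl = cache.get(shortUrl);     // 1. try the cache first
            if (longUrl == null) {
                longUrl = loadFromDatabase(shortUrl); // 2. fall back to the database
                if (longUrl != null) {
                    cache.put(shortUrl, longUrl);     // 3. fill the cache for next time
                }
            }
            return longUrl;
        }

        private String loadFromDatabase(String shortUrl) {
            // placeholder: SELECT long_url FROM the mapping table by decoded id
            return null;
        }
    }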

    Improve the response speed between web server and user's browser

    Different locations use different web servers and cache servers. All areas share one DB, which is used to match users to the closest web server (through DNS) when they miss the cache.

    What if we need one more MySQL machine?

    Issues:

    • running out of cache capacity
    • more and more write requests
    • more and more cache misses

    Solutions:

    Database Partitioning

    1. Vertical sharding 2. Horizontal sharding

    The best way is horizontal sharding.

    The current table structure is (id, long_url). So, which column should be the sharding key?

    An easy way is sharding by id modulo the number of machines.

    Here comes another question: How could multiple machines share a global auto_increment_id?

    Two ways: 1. use one more machine to maintain the id; 2. use ZooKeeper. Both suck.

    So, we do not use global auto_increment_id.

    The pro way is to put the sharding key in the first character of the short_url.

    Another way is to use consistent hashing to break the circle into 62 pieces. It doesn't matter exactly how many pieces, because there probably won't be more than 62 machines (it could be 360 or whatever). Each machine is responsible for the service in its part of the circle.

    Write: long_url -> hash(long_url) % 62 -> put the long_url on the specific machine according to the hash value -> generate the short_url on this machine -> return the short_url

    Read: short_url request -> get the sharding key (the first character of the short_url) -> search on the corresponding machine based on the sharding key -> return the long_url
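
    A sketch of both paths with the sharding key as the first character of the short code (62 logical shards; generateOnShard and lookupOnShard are hypothetical per-machine calls):

    public class ShardedRouter {
        static final String ELEMENTS =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

        // Write path: pick a shard from the long URL, generate the code on that
        // machine, and prepend the shard's character so reads can route directly.
        public String encode(String longUrl) {
            int shard = Math.floorMod(longUrl.hashCode(), 62);
            String code = generateOnShard(shard, longUrl); // local auto-increment id, base62
            return ELEMENTS.charAt(shard) + code;
        }

        // Read path: the first character alone tells us which machine to ask.
        public String decode(String shortCode) {
            int shard = ELEMENTS.indexOf(shortCode.charAt(0));
            return lookupOnShard(shard, shortCode.substring(1));
        }

        private String generateOnShard(int shard, String longUrl) { return "000001"; } // stub
        private String lookupOnShard(int shard, String code) { return null; }          // stub
    }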

    Each time we add a new machine, move half of the range of the busiest machine to the new machine.

    More Optimization

    Put the Chinese DB in China and the American DB in the United States. Use geographical information as the sharding key, e.g. 0 for Chinese websites and 1 for American websites.


  • 0
    Z

    how about this algorithm to encode and decode?
    https://discuss.leetcode.com/topic/81637/two-solutions-and-thoughts

    can you combine it with your solution?
    this algorithm doesn't use a global unique id
    Looking forward to your answer!


  • 0
    Z

    @zpng
    Stefan prefers to solve the problem with less code. TBH, I would use his code to do so in an interview. As he wrote in the solution,

    code = ''.join(random.choice(Codec.alphabet) for _ in range(6))
    

    you could see he uses the "random" module to generate the short URL. Apparently, it will have many more conflicts compared with using a hash function. Of course, you could easily figure out a solution for Python by looking at "hashlib". But this is not the purpose of such a short coding interview, and I guess you should not rely on third-party modules during a coding interview. The story is different, though, when you are doing a system design.


  • 0
    C

    Thanks for your post with so many details. You are such a kind guy.
    It helps me a lot.

    Just one tiny thing: a day is 86,400s on my planet. Are you sure you come from Earth?
    ε-(´∀`; )


  • 0
    Z

    @cdrover Yep... u r right. Fixed.


  • 0

    Nice discussion, thanks for translating. I just really dislike the nonsense function names "base62ToBase10" and "base10ToBase62", that would be a red flag for me.

    Apparently, it will have many more conflicts compared with using a hash function.

    What do you mean? I think random will have fewer conflicts (unless you're doing something not discussed above).


  • 0
    Z

    Oh, hey Stefan. I am a big fan of yours.

    So 6 characters could represent 62^6 ≈ 57 billion.
    When we get close to that number, we get more conflicts. The question is which one has more conflicts: random or hash functions?

    The truth is, I was probably thinking about the goodness of hash functions like MD5 and SHA-256 when I said "Apparently, it will have many more conflicts compared with using a hash function." That's my bad; I totally forgot the output is limited to 6 characters...

    So back to the question. You think random will have fewer conflicts, but why?
    I would vote for the hash function, because it totally depends on what strategy you are using. We could use linear probing, double hashing, or a different random seed or key (like using long_url + timestamp as the key). But, TBH, I don't know which one will have fewer conflicts.


  • 1

    @ztlevi The discussion above described the hashing approach as "We could take the first 6 chars of the converted string". If we use the hex string, then clearly it's worse, because that only gives 16^6 possibilities. If instead we use base 62 and take the first 6 chars of that, then the first char will have a strong bias, because powers of 2 aren't powers of 62. And bias is bad: it promotes collisions.

    Here's an experiment, sending the numbers 0 to 999999 through md5 and base62:

     0 was the first base62 digit 0 times
     1 was the first base62 digit 130573 times
     2 was the first base62 digit 130120 times
     3 was the first base62 digit 129498 times
     4 was the first base62 digit 130070 times
     5 was the first base62 digit 130136 times
     6 was the first base62 digit 131015 times
     7 was the first base62 digit 104560 times
     8 was the first base62 digit 2097 times
     9 was the first base62 digit 2079 times
    10 was the first base62 digit 2145 times
    11 was the first base62 digit 2094 times
    12 was the first base62 digit 2115 times
    13 was the first base62 digit 2124 times
    14 was the first base62 digit 2172 times
    15 was the first base62 digit 2160 times
    16 was the first base62 digit 2124 times
    17 was the first base62 digit 2124 times
    18 was the first base62 digit 2206 times
    19 was the first base62 digit 2112 times
    20 was the first base62 digit 2095 times
    21 was the first base62 digit 2252 times
    22 was the first base62 digit 2135 times
    23 was the first base62 digit 2136 times
    24 was the first base62 digit 2151 times
    25 was the first base62 digit 2155 times
    26 was the first base62 digit 2067 times
    27 was the first base62 digit 2116 times
    28 was the first base62 digit 2089 times
    29 was the first base62 digit 2127 times
    30 was the first base62 digit 2172 times
    31 was the first base62 digit 2088 times
    32 was the first base62 digit 2089 times
    33 was the first base62 digit 2030 times
    34 was the first base62 digit 2087 times
    35 was the first base62 digit 2117 times
    36 was the first base62 digit 2152 times
    37 was the first base62 digit 2130 times
    38 was the first base62 digit 2121 times
    39 was the first base62 digit 2106 times
    40 was the first base62 digit 2094 times
    41 was the first base62 digit 2059 times
    42 was the first base62 digit 2120 times
    43 was the first base62 digit 2180 times
    44 was the first base62 digit 2078 times
    45 was the first base62 digit 2127 times
    46 was the first base62 digit 2035 times
    47 was the first base62 digit 2123 times
    48 was the first base62 digit 2141 times
    49 was the first base62 digit 2020 times
    50 was the first base62 digit 2199 times
    51 was the first base62 digit 2027 times
    52 was the first base62 digit 2056 times
    53 was the first base62 digit 2085 times
    54 was the first base62 digit 2125 times
    55 was the first base62 digit 2028 times
    56 was the first base62 digit 2090 times
    57 was the first base62 digit 2048 times
    58 was the first base62 digit 2098 times
    59 was the first base62 digit 2093 times
    60 was the first base62 digit 2131 times
    61 was the first base62 digit 2104 times
    

    The code:

    import hashlib
    from collections import Counter
    
    ctr = Counter()
    for i in range(10**6):
        md5 = hashlib.md5(str(i).encode()).digest()
        x = int.from_bytes(md5, 'big')
        while x >= 62:
            x //= 62
        ctr[x] += 1
    
    for x in range(62):
        print('%2d' % x, 'was the first base62 digit', ctr[x], 'times')
    

    You could use the last six digits instead of the first six. That's what I meant by "unless you're doing something not discussed above". That's very good, although it still has a tiny bias. Well, pseudo-random probably also has such a tiny bias, but that then depends on the quality of the random number generator; it's not caused by us.


  • 0
    Z

    Got it! I guess these popular hash functions are built upon powers of 2. That's the tricky thing.
    Thanks!

    BTW, why does 0 appear 0 times as the first base62 digit?


  • 0

    @ztlevi Well, I did it like that "base10ToBase62" (sigh), so there are no leading zeros, and thus you'd only have a zero digit if the whole number were zero. If you use a fixed length (just replace my while-loop with x //= 62 ** 21), then you get:

     0 was the first base62 digit 128829 times
     1 was the first base62 digit 128468 times
     2 was the first base62 digit 127948 times
     3 was the first base62 digit 127436 times
     4 was the first base62 digit 127967 times
     5 was the first base62 digit 128054 times
     6 was the first base62 digit 128858 times
     7 was the first base62 digit 102440 times
     8 was the first base62 digit 0 times
     9 was the first base62 digit 0 times
    10 was the first base62 digit 0 times
    11 was the first base62 digit 0 times
    12 was the first base62 digit 0 times
    13 was the first base62 digit 0 times
    14 was the first base62 digit 0 times
    15 was the first base62 digit 0 times
    16 was the first base62 digit 0 times
    17 was the first base62 digit 0 times
    18 was the first base62 digit 0 times
    19 was the first base62 digit 0 times
    20 was the first base62 digit 0 times
    21 was the first base62 digit 0 times
    22 was the first base62 digit 0 times
    23 was the first base62 digit 0 times
    24 was the first base62 digit 0 times
    25 was the first base62 digit 0 times
    26 was the first base62 digit 0 times
    27 was the first base62 digit 0 times
    28 was the first base62 digit 0 times
    29 was the first base62 digit 0 times
    30 was the first base62 digit 0 times
    31 was the first base62 digit 0 times
    32 was the first base62 digit 0 times
    33 was the first base62 digit 0 times
    34 was the first base62 digit 0 times
    35 was the first base62 digit 0 times
    36 was the first base62 digit 0 times
    37 was the first base62 digit 0 times
    38 was the first base62 digit 0 times
    39 was the first base62 digit 0 times
    40 was the first base62 digit 0 times
    41 was the first base62 digit 0 times
    42 was the first base62 digit 0 times
    43 was the first base62 digit 0 times
    44 was the first base62 digit 0 times
    45 was the first base62 digit 0 times
    46 was the first base62 digit 0 times
    47 was the first base62 digit 0 times
    48 was the first base62 digit 0 times
    49 was the first base62 digit 0 times
    50 was the first base62 digit 0 times
    51 was the first base62 digit 0 times
    52 was the first base62 digit 0 times
    53 was the first base62 digit 0 times
    54 was the first base62 digit 0 times
    55 was the first base62 digit 0 times
    56 was the first base62 digit 0 times
    57 was the first base62 digit 0 times
    58 was the first base62 digit 0 times
    59 was the first base62 digit 0 times
    60 was the first base62 digit 0 times
    61 was the first base62 digit 0 times
    

  • 0
    S

    hey,

    What's the use of ltos and stol? Seems like you never use them.
    Thanks


  • 0

    @slipperass long to short, short to long

