Suggestion on extra questions


  • 9
    A

    This is a great problem! It reminds us that a programming interview is not only a coding interview :)

    I'd like to suggest a couple of extra questions:

    • Keep URLs forever or prune them, pros/cons? How do we do pruning?
    • What API would you provide to a third-party developer?

  • 18

    Thanks, these are great questions! I've added them to the description page.

    These are interesting questions we encounter day to day as software engineers. Please feel free to add your thoughts!

    Keep URLs forever or prune them, pros/cons? How do we do pruning?

    Obviously keeping URLs forever is preferable, and storage should be cheap. To give a sense of how many URLs we can store on a single machine, let's do some estimation:

    Assume each row in the database consists of an ID (4 bytes), a long URL (2090 bytes), and a short URL identifier (6 bytes), so each row is about 2100 bytes. That means 10 million URLs take about 21 GB. Assuming a machine has ~100 GB of storage, that's ~50 million URLs per machine. That's a lot of URLs :)
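
    The estimate above can be checked with a quick back-of-the-envelope script (the row sizes are the assumptions stated in the post):

```python
# Back-of-the-envelope storage estimate for the row layout described above.
ID_BYTES = 4           # integer primary key
LONG_URL_BYTES = 2090  # assumed worst-case long URL
SHORT_ID_BYTES = 6     # 6-character short identifier

row_bytes = ID_BYTES + LONG_URL_BYTES + SHORT_ID_BYTES  # 2100 bytes per row

ten_million_gb = 10_000_000 * row_bytes / 10**9   # GB for 10M URLs
machine_capacity = int(100 * 10**9 / row_bytes)   # URLs per ~100 GB machine

print(f"{ten_million_gb:.1f} GB per 10M URLs")     # 21.0 GB per 10M URLs
print(f"~{machine_capacity:,} URLs per machine")   # ~47,619,047 URLs per machine
```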

    If we run out of storage, we can delete the inactive URLs (e.g., URLs whose last-accessed timestamp is more than 30 days old). Querying this in a database with >50 million rows should be quick, as long as the last_updated_timestamp column is indexed correctly. Does this sound right?
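
    As a minimal sketch of that pruning query (using SQLite here; the table and column names are illustrative):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE urls (
    id INTEGER PRIMARY KEY,
    short_id TEXT,
    long_url TEXT,
    last_updated_timestamp INTEGER)""")
# Index the timestamp so the range scan below doesn't touch every row.
conn.execute("CREATE INDEX idx_last_updated ON urls(last_updated_timestamp)")

now = int(time.time())
day = 86400
rows = [(1, "abc123", "https://example.com/a", now - 40 * day),  # stale
        (2, "def456", "https://example.com/b", now - 5 * day)]   # fresh
conn.executemany("INSERT INTO urls VALUES (?, ?, ?, ?)", rows)

# Prune everything not accessed in the last 30 days.
cutoff = now - 30 * day
conn.execute("DELETE FROM urls WHERE last_updated_timestamp < ?", (cutoff,))
remaining = conn.execute("SELECT short_id FROM urls").fetchall()
print(remaining)  # [('def456',)]
```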

    Or, if you have a lot of memory, you could put as many URLs as possible in an in-memory store such as Redis or Memcached. URLs would then be pruned automatically via LRU eviction or a configured expiry time. This is probably unrealistic in the real world because memory is so much more expensive than disk storage.
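
    A toy version of that eviction behavior, mimicking what Redis does with maxmemory plus an LRU policy (this is an in-process sketch, not the real Redis API):

```python
from collections import OrderedDict

class LRUUrlCache:
    """Toy in-memory URL store with LRU eviction. Illustrative only:
    a real deployment would rely on Redis/Memcached eviction settings."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # short_id -> long_url, oldest first

    def put(self, short_id, long_url):
        if short_id in self.store:
            self.store.move_to_end(short_id)
        self.store[short_id] = long_url
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used

    def get(self, short_id):
        if short_id not in self.store:
            return None
        self.store.move_to_end(short_id)  # mark as recently used
        return self.store[short_id]

cache = LRUUrlCache(capacity=2)
cache.put("a", "https://example.com/1")
cache.put("b", "https://example.com/2")
cache.get("a")                            # touch "a"; "b" is now LRU
cache.put("c", "https://example.com/3")   # evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # https://example.com/1
```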

    What API would you provide to a third-party developer?

    I think developers would be interested in how to programmatically create short URLs and get URL stats such as total hits, referrals (where did the click come from?), last accessed time, etc. Any other ideas?
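
    One way such an API could look; the endpoint paths and field names below are made up for illustration, and the sketch keeps everything in-process rather than over HTTP:

```python
# Hypothetical third-party API surface (names are invented):
#   POST /api/v1/shorten           {"long_url": ...}  -> {"short_id": ...}
#   GET  /api/v1/stats/<short_id>                     -> usage statistics
import time

_db = {}  # short_id -> record (stands in for the real datastore)

def shorten(long_url, short_id):
    _db[short_id] = {"long_url": long_url, "hits": 0,
                     "referrers": {}, "last_accessed": None}
    return {"short_id": short_id}

def record_hit(short_id, referrer):
    rec = _db[short_id]
    rec["hits"] += 1
    rec["referrers"][referrer] = rec["referrers"].get(referrer, 0) + 1
    rec["last_accessed"] = time.time()

def stats(short_id):
    rec = _db[short_id]
    return {"hits": rec["hits"], "referrers": rec["referrers"],
            "last_accessed": rec["last_accessed"]}

shorten("https://example.com/some/long/path", "abc123")
record_hit("abc123", "twitter.com")
record_hit("abc123", "twitter.com")
print(stats("abc123")["hits"])  # 2
```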


  • 4
    A

    Thanks for sharing the thoughts, they're really helpful!

    Forgive me if I'm asking a stupid question here :) And this might be a bit off topic. I have little experience with database design, as I just switched roles from UI engineer to full-stack, so my question is: how do you index a timestamp column? I looked at some related topics like creating-an-index-on-a-timestamp-to-optimize-query, and it sounds like the timestamp needs to be encoded to an integer first for the index and then decoded back to a date format. You mentioned we could delete old (> 30 days) URLs when we run out of storage, but what if no rows qualify (e.g., all are < 15 days old)? And by the assumption that the index increments as time does, should we change the strategy to delete the ones with bigger indexes?

    I'd appreciate any help here.


  • 5

    Forgive me if I'm asking a stupid question here :) And this might be a bit off topic. I have little experience with database design, as I just switched roles from UI engineer to full-stack

    Welcome @angelvivienne! There are no stupid questions :) The reason we're all here is to learn more every day, and the only way to improve is to ask more questions and discuss. Congrats on your new role!

    my question is: how do you index a timestamp column? I looked at some related topics like creating-an-index-on-a-timestamp-to-optimize-query, and it sounds like the timestamp needs to be encoded to an integer first for the index and then decoded back to a date format.

    You can store the timestamp as an integer, but you don't have to. At least in MySQL, indexing is not limited to integer columns; you can index a DATETIME or TIMESTAMP column directly.
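
    A quick demonstration of indexing a date/time column directly (SQLite here for portability; the table and index names are illustrative). EXPLAIN QUERY PLAN confirms the range query uses the index rather than a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, last_accessed TEXT)")
# Index created directly on the date/time column, no integer encoding needed.
conn.execute("CREATE INDEX idx_last ON urls(last_accessed)")
conn.execute("""INSERT INTO urls VALUES
    (1, '2023-01-01 00:00:00'),
    (2, '2023-06-01 00:00:00')""")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM urls WHERE last_accessed < '2023-03-01'"
).fetchall()
print(plan)  # the plan mentions idx_last, i.e., an index search, not a scan
```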

    You mentioned we could delete old (> 30 days) URLs when we run out of storage, but what if no rows qualify (e.g., all are < 15 days old)? And by the assumption that the index increments as time does, should we change the strategy to delete the ones with bigger indexes?

    We could just change the expiry time to 15 days or less (if that's allowed). If not, your strategy works fine as well (I assume you meant deleting the ones with smaller indexes rather than bigger ones). Since auto-increment IDs grow over time, smaller IDs correspond to older rows, so prioritize pruning the ones with smaller IDs and older last_updated_timestamp values.
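
    That fallback can be sketched as "delete the N oldest rows by ID when no row is older than the cutoff" (SQLite again; names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, short_id TEXT)")
conn.executemany("INSERT INTO urls VALUES (?, ?)",
                 [(i, f"s{i}") for i in range(1, 6)])  # ids 1..5, 1 is oldest

# Free space for 2 new rows by removing the 2 oldest entries
# (smallest id == oldest, since ids increase over time).
conn.execute("""DELETE FROM urls WHERE id IN
                (SELECT id FROM urls ORDER BY id ASC LIMIT 2)""")
remaining = [r[0] for r in conn.execute("SELECT id FROM urls ORDER BY id")]
print(remaining)  # [3, 4, 5]
```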


  • 0
    A

    @1337c0d3r oh thanks for the immediate reply and kind words!


  • 0
    Z

    What if we do want to prune records by frequency or by age? How do we reclaim the sequence/ID/identifier?

    One possible way: store reclaimed IDs in a separate table, and whenever a new encode request comes in, check the reclaimed table first.

    By the way, do tinyurl or goo.gl keep their records forever?
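
    The reclaimed-ID idea above can be sketched as a free list consulted before minting a fresh sequence value (the list stands in for the separate "reclaimed" table):

```python
class IdAllocator:
    """Toy allocator: reuse ids reclaimed from pruned URLs before
    minting new ones from the sequence."""
    def __init__(self):
        self.next_id = 1
        self.reclaimed = []  # stand-in for the separate "reclaimed" table

    def allocate(self):
        if self.reclaimed:
            return self.reclaimed.pop()  # reuse a pruned id first
        fresh, self.next_id = self.next_id, self.next_id + 1
        return fresh

    def reclaim(self, id_):
        self.reclaimed.append(id_)  # called when a URL is pruned

alloc = IdAllocator()
a, b, c = alloc.allocate(), alloc.allocate(), alloc.allocate()  # 1, 2, 3
alloc.reclaim(b)         # URL 2 was pruned
print(alloc.allocate())  # 2  (reused before minting 4)
print(alloc.allocate())  # 4
```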


  • 1
    G

    @zuozuo a frequently accessed one will have an updated last_updated_timestamp, won't it?


  • 0
    J

    I don't understand why it mentions a-z, A-Z. To my understanding, URLs are case insensitive.


  • 0
    K

    @jlz domain names are case insensitive, but the resource path / file name isn't
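
    For context, a-z, A-Z, 0-9 usually comes up because short identifiers are base-62 encodings of a numeric sequence ID, and treating case as significant doubles the letter alphabet. A minimal sketch (the alphabet ordering is one common convention, not the only one):

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z. With 6 characters this gives
# 62**6 ≈ 5.7e10 distinct short ids.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode(n):
    """Encode a non-negative integer id as a base-62 string."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, r = divmod(n, 62)
        out.append(ALPHABET[r])
    return "".join(reversed(out))

print(encode(125))  # '21'  (125 = 2*62 + 1)
```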


  • 1
    K

    @kylelam It depends on the web server's configuration. You can make the resource path case sensitive or insensitive.

    For example:
    https://developer.mozilla.org/en-US/docs/WEb

    The /docs/ segment is case sensitive, while the /web page slug is case insensitive.


  • 0
    B

    @jlz I think it was included to see whether the applicant applies critical thinking to the case sensitivity of the URL. I'd imagine there are many "right" answers to the question of case, as long as you consider it in your solution in some way.

