Design Pastebin


  • 12

    I was asked this in an Amazon interview for an SDE-2 position.

    What is Pastebin?
    The Pastebin web service enables users to store plain text over the network and generates unique URLs to access the uploaded data. It is also used to share data quickly, since a user only needs to pass the URL along to let other users see it.

    Functional Requirements:

    1. Users should be able to upload plain text, called "pastes".
    2. Upon uploading a paste, the user should be provided a unique URL to access it.
    3. Users should only be able to upload text.
    4. URLs should expire after a certain period of time if the user does not provide an expiration time.

    Non-Functional Requirements:

    1. The system should be highly reliable, i.e., no data loss.
    2. The system should be highly available, i.e., users should be able to access their pastes most of the time.
    3. Latency to fetch a user's paste should be minimal.

    Database Design:

    1. User table
      a. name
      b. email
      c. created_date
      d. last_login
      e. userId - PK

    2. Paste
      a. paste_name
      b. url
      c. content
      d. expiration_time
      e. created_date
      f. userId - FK
      g. pasteId - PK
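As a sketch, the two tables above could map to Go structs like the following (the field types are assumptions, not part of the original design):

```go
package main

import (
	"fmt"
	"time"
)

// User mirrors the User table; UserID is the primary key.
type User struct {
	UserID      int64 // PK
	Name        string
	Email       string
	CreatedDate time.Time
	LastLogin   time.Time
}

// Paste mirrors the Paste table; PasteID is the primary key
// and UserID is a foreign key referencing User.UserID.
type Paste struct {
	PasteID        int64 // PK
	PasteName      string
	URL            string
	Content        string
	ExpirationTime time.Time
	CreatedDate    time.Time
	UserID         int64 // FK -> User.UserID
}

func main() {
	u := User{UserID: 1, Name: "alice", Email: "alice@example.com", CreatedDate: time.Now()}
	p := Paste{PasteID: 1, PasteName: "hello", Content: "hello world", UserID: u.UserID}
	fmt.Println(p.PasteName, "owned by", u.Name)
}
```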

    System APIs

    These can be implemented as RESTful web services:

    1. createPaste
    2. updatePaste
    3. getPaste
    4. deletePaste
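One way to pin down the four operations is as a service interface; the parameter and return types below are illustrative assumptions, and the in-memory implementation is just a toy to show the round trip:

```go
package main

import "fmt"

// PasteService sketches the four operations listed above.
type PasteService interface {
	CreatePaste(userID int64, content string, expireSeconds int64) (url string, err error)
	UpdatePaste(userID, pasteID int64, content string) error
	GetPaste(url string) (content string, err error)
	DeletePaste(userID, pasteID int64) error
}

// memService is a toy in-memory implementation for illustration only.
type memService struct {
	pastes map[string]string // url -> content
	nextID int64
}

func newMemService() *memService {
	return &memService{pastes: make(map[string]string)}
}

func (s *memService) CreatePaste(userID int64, content string, expireSeconds int64) (string, error) {
	s.nextID++
	// A hypothetical URL scheme; a real system would use the key
	// generation approach discussed later in the thread.
	url := fmt.Sprintf("https://paste.example.com/%d", s.nextID)
	s.pastes[url] = content
	return url, nil
}

func (s *memService) GetPaste(url string) (string, error) {
	c, ok := s.pastes[url]
	if !ok {
		return "", fmt.Errorf("paste not found")
	}
	return c, nil
}

func (s *memService) UpdatePaste(userID, pasteID int64, content string) error { return nil }
func (s *memService) DeletePaste(userID, pasteID int64) error                 { return nil }

var _ PasteService = (*memService)(nil) // compile-time interface check

func main() {
	svc := newMemService()
	url, _ := svc.CreatePaste(1, "hello", 3600)
	content, _ := svc.GetPaste(url)
	fmt.Println(url, "->", content)
}
```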

    High Level Design

    At a high level, the system needs an application server that can serve read and write requests. The application server will store paste data in block storage, and all the metadata related to pastes and users will be stored in a database. Cache servers and load balancers can be added to improve the performance and scalability of the system.

    [Image: High Level Design]


  • 2

    @varyani_dinesh

    How about any additional functional/scalability requirements such as:

    • Uniqueness and scalability of URL generation - how do we guarantee uniqueness at massive scale, say, to accommodate thousands (or more) of simultaneous pasting requests per second? A single server generating URLs most likely would not be sufficient. But if we distribute the generation across multiple servers, what schemes should we use to ensure URL uniqueness?
    • Storage of the content - Interviewers are (or should be) more interested in how candidates would design scalable solutions for the content storage rather than just using an existing relational/nosql/object database.
    • Content retrieval - In a distributed design, how do we distribute the content retrieval based on a URL?
    • For any solutions with distributed nature, how do we scale up if more users join? How do we handle any of the compute/storage server failures in those solutions?

  • 0

    @flyweight2017 Good questions to add. Please see my inline comments below -

    • Uniqueness and scalability of URL generation - how do we guarantee the uniqueness at massive scale, say, to accommodate thousands (or more) of simultaneous pasting requests per second? A single server to generate URL most likely would not be sufficient. But if we distribute the generation across multiple servers, what schemes should we use to ensure the URL uniqueness? -
      Please see the basic high-level view of the architecture below. To generate unique URLs, we can use a Key Generation Service that creates a unique key for each URL and stores it in a key DB. The Key Generation Service makes sure all the keys inserted in the key DB are unique. To implement this, it can use two tables: one for keys that have not been used yet and one for all the used keys. As soon as the Key Generation Service hands keys to an application server, it moves them to the used-keys table. To make this process fast, the Key Generation Service can keep some keys in memory so that whenever an application server needs them, it can provide them quickly. Since keys are moved to the used-keys table the moment they are loaded into memory, we can make sure each server gets unique keys.
    • Storage of the content - Interviewers are (or should be) more interested in how candidates would design scalable solutions for the content storage rather than just using an existing relational/nosql/object database. -
      Please see the basic high-level view of the architecture below. We can use block storage to store the pastes; block storage is a good fit when we need to store a large volume of data.
    • Content retrieval - In a distributed design, how do we distribute the content retrieval based on a URL? -
      Upon receiving a read request for a paste, the application service layer contacts the datastore. The datastore searches for the key and, if it is found, returns the paste's contents by first looking into the paste cache and then falling back to paste storage. Otherwise, an error code is returned.
    • For any solutions with distributed nature, how do we scale up if more users join? How do we handle any of the compute/storage server failures in those solutions? -
      We can use database partitioning and consistent hashing techniques to scale up the architecture.
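The Key Generation Service described above can be sketched as follows; this is a minimal in-memory stand-in for the two key tables, not a production design (the pre-generated keys and type names are assumptions):

```go
package main

import (
	"fmt"
	"sync"
)

// KeyGenService hands out pre-generated unique keys to application
// servers. A key moves from the "unused" pool to the "used" set the
// moment it is handed out, so two servers can never receive the same
// key. The slices/maps here stand in for the key DB tables.
type KeyGenService struct {
	mu     sync.Mutex
	unused []string            // keys not handed out yet
	used   map[string]struct{} // keys already handed out
}

func NewKeyGenService(pregenerated []string) *KeyGenService {
	return &KeyGenService{
		unused: pregenerated,
		used:   make(map[string]struct{}),
	}
}

// NextKey pops one unused key and records it as used, atomically.
func (k *KeyGenService) NextKey() (string, error) {
	k.mu.Lock()
	defer k.mu.Unlock()
	if len(k.unused) == 0 {
		return "", fmt.Errorf("key pool exhausted")
	}
	key := k.unused[len(k.unused)-1]
	k.unused = k.unused[:len(k.unused)-1]
	k.used[key] = struct{}{}
	return key, nil
}

func main() {
	kgs := NewKeyGenService([]string{"aZ3kP9", "Qw8rT2", "Mn5xY7"})
	a, _ := kgs.NextKey()
	b, _ := kgs.NextKey()
	fmt.Println(a, b) // every caller gets a distinct key
}
```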
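The read path (cache first, then paste storage) can be sketched like this; the `Store` interface and map-backed stores are assumptions standing in for the real cache and block storage:

```go
package main

import "fmt"

// Store abstracts anything we can look a paste up in.
type Store interface {
	Get(key string) (string, bool)
}

// mapStore is a toy in-memory Store for illustration.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, bool) { v, ok := m[key]; return v, ok }

// ReadPaste checks the cache first and falls back to the backing
// paste storage, mirroring the read path described above. A missing
// key yields an error, mirroring the error code in the text.
func ReadPaste(cache, storage Store, key string) (string, error) {
	if content, ok := cache.Get(key); ok {
		return content, nil // cache hit: no storage round trip
	}
	if content, ok := storage.Get(key); ok {
		return content, nil // cache miss: fetch from paste storage
	}
	return "", fmt.Errorf("paste %q not found", key)
}

func main() {
	cache := mapStore{"hot": "cached paste"}
	storage := mapStore{"hot": "cached paste", "cold": "stored paste"}
	c, _ := ReadPaste(cache, storage, "cold")
	fmt.Println(c)
}
```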
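Consistent hashing, mentioned in the last point, can be sketched as a minimal hash ring (real implementations also add virtual nodes per server for better balance; the server names here are made up):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a minimal consistent-hash ring: each server owns a point
// on the ring, and a key is served by the first server clockwise
// from the key's hash. Adding or removing a server only remaps the
// keys on the neighboring arc, which is what makes scaling cheap.
type Ring struct {
	points  []uint32          // sorted hash points
	servers map[uint32]string // point -> server name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(servers []string) *Ring {
	r := &Ring{servers: make(map[uint32]string)}
	for _, s := range servers {
		p := hash32(s)
		r.points = append(r.points, p)
		r.servers[p] = s
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// ServerFor returns the server responsible for a key.
func (r *Ring) ServerFor(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.servers[r.points[i]]
}

func main() {
	ring := NewRing([]string{"storage-1", "storage-2", "storage-3"})
	fmt.Println(ring.ServerFor("aZ3kP9"))
}
```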

    [Image: High Level Design]


  • 0

    I don't understand why there need to be two tables in the database. It seems that a key-value storage is sufficient -- the key is the hash of the text and the value contains the text, the upload time, and the expiration period.

    What I have in mind is that we need a web server, which accepts the uploaded text and saves it into the above key-value storage. We also need a daemon that scans over the storage pass after pass. For each item, if it is expired (current_time >= upload_time + expiration_period), the daemon removes the item; otherwise, the daemon creates a timer to wake it up at the expiration time to remove the item.

    package main
    
    import (
    	"sync"
    	"time"
    )
    
    // Value is the stored item: paste content plus its expiration time.
    type Value struct {
    	Content    string
    	ExpireTime time.Time
    }
    
    // Storage abstracts the key-value store described above; a concrete
    // implementation is assumed and omitted here.
    type Storage interface {
    	Delete(key string)
    	NewScanner() Scanner
    }
    
    type Scanner interface {
    	Done() bool
    	Next()
    	Key() string
    	Value() Value
    }
    
    // storage must be set to a concrete implementation before Run is called.
    var storage Storage
    
    type TimeoutHandler struct {
    	mu         sync.Mutex     // guards actionList across timer goroutines
    	actionList map[string]int // keys with a pending expiration timer
    	nextPass   chan int
    }
    
    func NewTimeoutHandler() *TimeoutHandler {
    	return &TimeoutHandler{
    		actionList: make(map[string]int),
    		nextPass:   make(chan int, 100),
    	}
    }
    
    func (th *TimeoutHandler) RemoveItemWhenExpired(key string, value Value) {
    	th.mu.Lock()
    	th.actionList[key] = 1
    	th.mu.Unlock()
    	// Fires immediately if the item is already expired.
    	t := time.NewTimer(time.Until(value.ExpireTime))
    
    	go func(key string) {
    		<-t.C
    		storage.Delete(key)
    		th.mu.Lock()
    		delete(th.actionList, key)
    		empty := len(th.actionList) == 0
    		th.mu.Unlock()
    		if empty {
    			th.nextPass <- 1 // trigger the scan of the next pass
    		}
    	}(key)
    }
    
    func (th *TimeoutHandler) Run() {
    	for {
    		for s := storage.NewScanner(); !s.Done(); s.Next() {
    			th.RemoveItemWhenExpired(s.Key(), s.Value())
    		}
    		<-th.nextPass
    	}
    }
    
    func main() {
    	if storage == nil {
    		return // plug in a concrete Storage implementation first
    	}
    	th := NewTimeoutHandler()
    	th.Run()
    }
    
