Describe how you would implement a forum such as Reddit? Be sure to describe all relevant models and their relationships to each other.
Where: I was asked this problem for an interview for BuildZoom.
First of all, I would like to identify the goal and requirements. I want to create a system that will hold many posts that anybody can access if they go to the system's URL. A user can create many posts while other users can view and can optionally choose to either upvote/downvote. The posts with the most upvotes will appear on the "front page". Posts should be able to handle comments from other users and comments can have replies.
User (id, email, username, name, etc.)
Comment (id, text, post_id, user_id)
Post (id, user_id, url)
Picture (id, post_id, url)
Vote (id, post_id, comment_id, count)
Then I would try to explain the relationships between the classes/models:
User has many posts.
Post has many comments. Post belongs to user.
Post has many pictures.
Comment has many comments (replies). self relationship
Comment belongs to Post. Comment belongs to User through Post.
Post has one vote.
Comment has one vote.
Vote belongs to either Post or Comment.
Webpage will have a feed of posts created by different users. As a user you have the ability to create a post, interact with other posts, have notifications about replies to your comments, access a history page of your comments / posts.
To store everything I would set up a relational database in the backend. Traffic can be load balanced on different servers based on location. As we scale, we can add more servers to more locations and this can be achieved quite easily nowadays due to cloud computing services such as AWS.
Followup question: How would you improve page load performance? For example, a post might have many hundreds of comments or even a single comment might have hundreds of replies or we could have a chain of replies on a single comment..
My answer: I would implement some form of lazy-loading in which i only load upon scroll or by clicking "See more comments" etc. There will be a cap on the number of replies shown for each comment for example, 3 comments, before I force a user to click the "See more comments" text.
Other topics you could expand on depending on the interviewer:
- other features your service could offer (i.e sorting by Hot, New etc.)
- more in-depth into databases (i didn't know what else to say beyond relational which I don't think Reddit actually uses)
- platforms service will be offered on (web browser, mobile, etc.)
@hubqwerty it is a very nice question!
Data model looks okay but with one correction
Comment can have nested comment so to identify them you have parent_comment_id
Comment (id, text, post_id, user_id, parent_comment_id)
And to keep simple for vote, it can be
Another suggest i want to make is always have time_created and time_updated row whenever you are updating. it helps for various purpose - mainly analytics and timeline
Few more things to consider here.
Vote is updated very frequently, so we can not update table for every vote - very expensive to update and reads are locked. So we should make it "eventual consistent" - meaning every vote from each server can be updated into a cache (LRU), then we aggregate total and update table every few minutes, and another cache "read comment cache" which keeps total votes for a comment.
For reading vote for a comment, read from comment cache or read from db.
To load all comments for a post - load parent comments, then lazy load child comments if required.
To quickly calculate the best comment/top comment (sum of all upvotes of nested comment) - you can add another field into comment table - root_comment_id
Comment (id, text, post_id, user_id, parent_comment_id, root_comment_id)
Now you can compute total votes for all comments for a root_comment and display the top result. And then lazy load child comment if needed
SQL vs No SQL - I think no sql seems better choice since schema is simple and not require transactional, and mainly we need read focused
Additionally I would do the following..
I would keep the Vote functionality as async call, so browser doesn't need to wait for the server response when the user up/down votes. If there are some miss out on the votes occasionally, that doesn't change the overall trend of the post.
Will pre calculate the top trending posts and have it in cache. This will significantly reduce cost to find the top post when the user asks for. Just make sure to update the cache regularly to keep more recent always. Since the nature of the data changes quiet quickly, cache has to be used in caution.