Follow up questions discussion


  • 0
    A
    1. Imagine you are given a real file system, how will you search files? DFS or BFS ?

    The answer depends on the tree structure. If the branching factor (n) and depth (d) are high, then BFS will take up a lot of memory O(d^n). For DFS, the space complexity is generally the height of the tree - O(d).

    1. If the file content is very large (GB level), how will you modify your solution?
    2. If you can only read the file by 1kb each time, how will you modify your solution?
    3. What is the time complexity of your modified solution? What is the most time consuming part and memory consuming part of it? How to optimize?
    4. How to make sure the duplicated files you find are not false positive?

    Can't answer follow up questions. I would love to know your answers.


  • 3
    Y

    Just want to share my humble opinions for discussion:
    If anyone has a better solution, I would appreciate it if you'd like to correct and enlighten me:-)
    Question 2:
    In real-world file system, we usually store large file in multiple "chunks" (in GFS, one chunk is 64 MB),so we have meta data recording the file size,file name and index of different chunks along with each chunk's checkSum (the xor for the content).
    So when we upload a file, we record the meta data as mentioned above.
    When we need to check for duplicates, we could simply check the meta data:
    1.Check if files are of the same size;
    2.if step 1 passes, compare the first chunk's checkSum
    3.if step 2 passes, check the second checkSum
    ...
    and so on.
    There might be false positive duplicates, because two different files might share the same checkSum.

    Question 3:
    In the way mentioned above, we could read the meta data instead of the entire file, and compare the information KB by KB.

    Question 5:
    Using checkSum, we could quickly and accurately find out the non-duplicated files. But to totally avoid getting the false positive, we need to compare the content chunk by chunk when we find two "duplicates" using checkSum.


  • 2
    S

    In preparing for my Dropbox interview, I came across this problem and really wanted to find the ideas behind the follow up questions (as these were the questions that the interviewer was most interested in, not the code itself). Since this is the only post with the follow up discussion, i'll comment here! @yujun gave a great solution above and I just wanted to add a bit more to help future interviewees.

    To find duplicate files, given input of String array is quite easy. Loop through each String and keep a HashMap of Strings to Set/Collection of Strings: mapping the contents of each file to a set of paths with filename concatenated.

    For me, instead of given a list of paths, I was given a Directory and asked to return List of List of duplicate files for all under it. I chose to represent a Directory like:

    class Directory{
         List<Directory> subDirectories;
         List<File> files;
    }
    

    Given a directory, you are asked how you can find duplicate files given very large files. The idea here is that you cannot store contents in memory, so you need to store the file contents in disk. So you can hash each file content and store the hash as a metadata field for each file. Then as you perform your search, store the hash instead the file's contents in memory. So the idea is you can do a DFS through the root directory and create a HashMap<String, Set<String>> mapping each hash to the Set of filepaths + filenames that correspond to that hash's content.

    (Note: You can choose BFS / DFS to traverse the Path. I chose DFS as it is more memory efficient and quicker to code up.)

    Follow Up: This is great, but it requires you to compute the hash for every single file once, which can be expensive for large files. Is there anyway you can avoid computing the hash for a file?

    One approach is to also maintain a metadata field for each file's size on disk. Then you can take a 2 pass approach:

    1. DFS to map each size to a set of paths that have that size
    2. For each size, if there are more than 2 files there, compute hash of every file, if any files with the same size have the same hash, then they are identical files.

    This way, you only compute hashes if you have multiple files with the same size. So when you do a DFS, you can create a HashMap<Integer, Set<String>>, mapping each file's size to the list of file paths that have that size. Loop through each String in each set, get its hash, check if it exists in your set, if so, add it to your List<String> res otherwise add it into the set. In between each key (switching file sizes), you can add your res to your List<List<String>>.

    Will update with some code soon~~


  • 1
    M

    Hi @simonzhu91 ,

    In your interview, were you asked to code for the follow up questions, or just the basic question where content fits memory?


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.