C++ clean solution, answers to follow-up questions


  • 27
    E

    Idea is simple:
    Use a hash map from file content to a vector of file names, then output every group that contains more than one file (the duplicates).
    ...

    vector<vector<string>> findDuplicate(vector<string>& paths) {
        unordered_map<string, vector<string>> files;   // content -> names of files with that content
        vector<vector<string>> result;

        for (const auto& path : paths) {
            stringstream ss(path);
            string root;
            string s;
            getline(ss, root, ' ');                    // first token is the directory
            while (getline(ss, s, ' ')) {              // remaining tokens are "name(content)"
                string fileName = root + '/' + s.substr(0, s.find('('));
                string fileContent = s.substr(s.find('(') + 1, s.find(')') - s.find('(') - 1);
                files[fileContent].push_back(fileName);
            }
        }

        for (const auto& file : files) {
            if (file.second.size() > 1)                // content shared by two or more files
                result.push_back(file.second);
        }

        return result;
    }
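
    A hypothetical call, assuming input strings in the "dir name1(content1) name2(content2) ..." format that the parsing above expects (the example data is illustrative only):

    vector<string> paths = {
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root 4.txt(efgh)"
    };
    // Expected groups: {root/a/1.txt, root/c/3.txt} and {root/a/2.txt, root/4.txt}.
    // Group order and order within a group are unspecified (unordered_map iteration order).
    vector<vector<string>> dups = findDuplicate(paths);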
    

    ...
    Follow up questions:

    1. Imagine you are given a real file system, how will you search files? DFS or BFS?
    In general, BFS will use more memory than DFS. However, BFS can take advantage of the locality of files inside directories, and will therefore probably be faster.
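
    As an illustration only, a minimal depth-first listing using C++17 std::filesystem (recursive_directory_iterator effectively performs a DFS walk); this is a sketch, not part of the original answer:

    #include <filesystem>
    #include <vector>
    namespace fs = std::filesystem;

    // Collect every regular file under `root`, visiting directories depth-first.
    std::vector<fs::path> listFiles(const fs::path& root) {
        std::vector<fs::path> files;
        for (const auto& entry : fs::recursive_directory_iterator(root)) {
            if (entry.is_regular_file())
                files.push_back(entry.path());
        }
        return files;
    }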

    2. If the file content is very large (GB level), how will you modify your solution?
    In a real-life solution we would not hash the entire file content, since that is not practical. Instead, we would first group all the files by size; files with different sizes are guaranteed to be different. We would then hash a small part of the files that have equal sizes (using MD5, for example), and only if the hashes match would we compare the files byte by byte.
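
    A rough sketch of that staged filtering, assuming C++17 std::filesystem; partialHash() is a hypothetical stand-in for an MD5/SHA digest of the first few KB, which the standard library does not provide:

    #include <cstdint>
    #include <filesystem>
    #include <map>
    #include <string>
    #include <vector>
    namespace fs = std::filesystem;

    // Hypothetical: cheap digest (e.g. MD5) over only the first few KB of a file.
    std::string partialHash(const fs::path& p);

    // Stage 1: bucket files by size - files of different sizes cannot be duplicates.
    // Stage 2: within each size bucket, sub-bucket by partialHash().
    // Stage 3: files that still collide are confirmed byte by byte (see the sketch further below).
    std::map<std::uintmax_t, std::vector<fs::path>> groupBySize(const std::vector<fs::path>& files) {
        std::map<std::uintmax_t, std::vector<fs::path>> bySize;
        for (const auto& f : files)
            bySize[fs::file_size(f)].push_back(f);
        return bySize;
    }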

    3. If you can only read the file by 1kb each time, how will you modify your solution?
    This doesn't change the solution much. We can build the hash from the 1 KB chunks incrementally, and read the entire file only if a full byte-by-byte comparison is required.
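
    A sketch of feeding 1 KB chunks into a running hash; updateHash() is hypothetical (a real version would wrap an incremental digest API such as OpenSSL's MD5_Update):

    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <string>

    // Hypothetical incremental digest update; `state` stands in for real hash state.
    void updateHash(std::uint64_t& state, const char* data, std::size_t len);

    std::uint64_t hashInChunks(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::uint64_t state = 0;
        char buf[1024];                            // read at most 1 KB per call
        while (in.good()) {
            in.read(buf, sizeof(buf));             // may read fewer bytes at EOF
            std::streamsize n = in.gcount();       // bytes actually read
            if (n > 0)
                updateHash(state, buf, static_cast<std::size_t>(n));
        }
        return state;
    }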

    What is the time complexity of your modified solution? What is the most time consuming part and memory consuming part of it? How to optimize?
    The time complexity is O(n^2 * k), since in the worst case we might need to compare every file with all the others; here n is the number of files and k is the file size.

    How to make sure the duplicated files you find are not false positive?
    We apply several filters in sequence: file size, hash, and finally a byte-by-byte comparison.
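
    A minimal sketch of that final byte-by-byte check (C++14 four-iterator std::equal over stream iterators; assumes both files can be opened):

    #include <algorithm>
    #include <fstream>
    #include <iterator>
    #include <string>

    // Last filter against hash collisions: compare the two files byte by byte.
    bool sameContent(const std::string& a, const std::string& b) {
        std::ifstream fa(a, std::ios::binary), fb(b, std::ios::binary);
        return std::equal(std::istreambuf_iterator<char>(fa), std::istreambuf_iterator<char>(),
                          std::istreambuf_iterator<char>(fb), std::istreambuf_iterator<char>());
    }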


  • 2
    X

    Thanks for your follow up solution


  • 0
    S

    @elisemory what is n in complexity O(n^2 * k)?


  • 0

    Why is the time complexity O(n^2 * k)?


  • 1
    E

    n is the number of files.
    In the worst case we have to compare all files with each other, so we need O(n^2) comparisons.


  • 0
    A

    Thanks for answering the follow up questions


  • 0
    G

    Explanation for the time complexity:
    Suppose you have n files and the average length of each file is k.
    The worst case is that all the files have the same size and their contents are nearly identical. In that case, you need to compare every file with all the other files.

    File A: "AAAAAAAAAAA"
    File B: "AAAAAAAAAAB"
    File C: "AAAAAAAAAAC"
    File D: "AAAAAAAAAAD"

    Here you need to:
    compare A with B, C, D;
    compare B with C, D;
    compare C with D.
    That is n(n-1)/2 = O(n^2) comparisons, and each comparison looks at up to k characters.
    In conclusion, the time complexity is O(n^2 * k).


  • 2
    G

    Can anybody explain the BFS/DFS question a little bit? Why does BFS use more memory than DFS?


  • 0
    N

    @elisemory
    Doesn't a look-up in a hash map take O(1)? Wouldn't that make the total time complexity O(nk)?


  • 0
    B

    @nkstnkst You are right - a look-up in a hash map indeed takes O(1); but in the code, for each path in the input, we are extracting the substrings and inserting them into the hash map. This is what makes it an O(n^2) solution. Hope this is useful.

    Thanks.


  • 0
    N

    @BatCoder Thank you for your reply but I'm still confused. Let's make sure we are on the same page. The author said that time complexity is O(n^2 * k) where n is the number of files and k is the file size, and I disagree with this. Here's my logic.

    For each path, determining the substring takes O(k). Inserting the substring in the hashMap (which consists of look-up and push_back) takes O(1). And we have n paths. So in total the program takes O(nk). Am I wrong?


  • 0
    E

    @nkstnkst The O(n^2) refers to the modified solution in the follow-up questions. In the worst case we need to compare every file with all the other files, hence the O(n^2).


  • 0
    N

    @elisemory Oh I didn't realize it's not about the original question. You are right!

