- Imagine you are given a real file system, how will you search files? DFS or BFS ?
The answer depends on the tree structure. If the branching factor (n) and depth (d) are high, then BFS will take up a lot of memory O(d^n). For DFS, the space complexity is generally the height of the tree - O(d).
- If the file content is very large (GB level), how will you modify your solution?
- If you can only read the file by 1kb each time, how will you modify your solution?
- What is the time complexity of your modified solution? What is the most time consuming part and memory consuming part of it? How to optimize?
- How to make sure the duplicated files you find are not false positive?
Can't answer follow up questions. I would love to know your answers.
Just want to share my humble opinions for discussion:
If anyone has a better solution, I would appreciate it if you'd like to correct and enlighten me:-)
In real-world file system, we usually store large file in multiple "chunks" (in GFS, one chunk is 64 MB),so we have meta data recording the file size,file name and index of different chunks along with each chunk's checkSum (the xor for the content).
So when we upload a file, we record the meta data as mentioned above.
When we need to check for duplicates, we could simply check the meta data:
1.Check if files are of the same size;
2.if step 1 passes, compare the first chunk's checkSum
3.if step 2 passes, check the second checkSum
and so on.
There might be false positive duplicates, because two different files might share the same checkSum.
In the way mentioned above, we could read the meta data instead of the entire file, and compare the information KB by KB.
Using checkSum, we could quickly and accurately find out the non-duplicated files. But to totally avoid getting the false positive, we need to compare the content chunk by chunk when we find two "duplicates" using checkSum.