Thanks for writing this awesome answer and for your responses to the follow-up questions.
I just wanted to add a few points that you missed:
MD5 is definitely one way to hash a file; another option is SHA-256, which is more collision-resistant. Reference
Also, to answer the "What is the most time consuming part and memory consuming part of it? How to optimize?" part:
Comparing the files (by size, by hash, and eventually byte by byte) is the most time-consuming part.
Generating a hash for every file is the most memory-consuming part.
Following the above procedure optimizes it: we compare files by size first; only when the sizes match do we generate and compare hashes; and only when the hashes match do we compare byte by byte.
Also, choosing a more suitable hashing algorithm can further reduce time and memory use.
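To make the size, then hash, then byte-by-byte order concrete, here is a minimal Python sketch. It is only one way to do it, not the original answer's code: the names `file_hash` and `find_duplicates`, the 64 KiB chunk size, and the choice of SHA-256 are all my own assumptions for illustration.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=1 << 16):
    """Hash a file in fixed-size chunks (chunk size is an arbitrary choice)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Step 1: group files by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Step 2: hash only the files that share a size with another file.
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_hash(path)].append(path)
        for candidates in by_hash.values():
            # Step 3: confirm hash matches with a byte-by-byte comparison.
            while len(candidates) > 1:
                first, rest = candidates[0], candidates[1:]
                same = [first] + [
                    p for p in rest if filecmp.cmp(first, p, shallow=False)
                ]
                if len(same) > 1:
                    groups.append(same)
                candidates = [p for p in rest if p not in same]
    return groups
```

Calling `find_duplicates("/some/dir")` (a placeholder path) returns one list per set of identical files. Reading each file in chunks keeps the memory used while hashing roughly constant regardless of file size, so in this sketch the memory cost comes mainly from holding the path lists and digests.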