First non repeating word in a file? File size can be 100GB.


  • 0
    E

    Given a text file of size say 100GB? Task is to find first non repeating word in this file?
    constraints: You can traverse the file only once.


  • 0
    N

    The Second Edition of the 20-volume Oxford English Dictionary contains full entries for 171,476 words in current use and 47,156 obsolete words. To this may be added around 9,500 derivative words included as subentries. ref(https://en.oxforddictionaries.com/explore/how-many-words-are-there-in-the-english-language)

    With that said. we are talking about roughly 230K words. which is not that big. So if you just store each word you come across in the file in the dictionary and maintain their count. And once you finished parsing. loop through dict where the count is 1.
    Now as you have to find the first nonrepeating word. you have to use ordered dict (in python) to maintain word's order in the file.

    from collections import OrderedDict
    wordsCount = OrderedDict({})
    
    with open(fname) as f:
        content = f.readlines()
        for word in content.split(" "):
            if word not in wordsCount:
                wordsCount[word] = 1
            else:
                wordsCount[word] += 1
    
    for word, count in wordsCount.items():
        if count == 1:
            print(word)
            break
    

    this is going to take forever, But you can split file and do this on the individual file and combine the result.
    you can split file in linux/ubuntu with split -l 500 myfile myfile_split_ which generates myfile_split_aa , myfile_split_ab , etc. and then launch multi processes to speedup the process
    something like this

    from collections import OrderedDict
    from multiprocessing import Process
    import os
    
    fileWordCounts = []
    files = [] # 10000 split files 
    
    def countWords(fname):
        print('module name:', __name__)
        print('parent process:', os.getppid())
        print('process id:', os.getpid())
        
        wordsCount = OrderedDict({})    
        with open(fname) as f:
            content = f.readlines()
            for word in content.split(" "):
                if word not in wordsCount:
                    wordsCount[word] = 1
                else:
                    wordsCount[word] += 1
        fileWordCounts.append(wordsCount)
            
    i = 0
    while i < len(files):
        with Pool(10) as p:
            p.map(countWords, files[i:10])
        i+= 10
    
    #Now you will have fileWordCounts which will list of word count dictionaries.
    #combine the result and look for the first word with count 1
    

  • 0
    E

    @nshah thanks for suggesting a solution, but its clearly provided that you can read a file once and there are billions of words which are just random words not necessarily dictionary words


  • 1
    E

    @essar Can you use multiple machines ? Using Trie can also save a lot of space.


Log in to reply
 

Looks like your connection to LeetCode Discuss was lost, please wait while we try to reconnect.