# From a given string of words, get the top n occuring words

• Suppose you are given a book,
get the top n occurring words from it.

• ``````public class Solution {
private final int maxCapacity;
private final List<String> topWords;
private final HashMap<String, WordInfo> wordsInfo = new HashMap<>();
private final Comparator<String> frecuencyComparator = (s1, s2) -> Integer.valueOf(wordsInfo.get(s2).frecuency)
.compareTo(wordsInfo.get(s1).frecuency);

public Solution(int maxCapacity) {
this.maxCapacity = maxCapacity;
topWords = new ArrayList<>(maxCapacity);
}

// If we have the list of words we can call this method
public List<String> getTopNList(List<String> words) {
for (String word : words) {
}
}

// If we don't have the list of words we have to read the stream and call
// this method for each word
String lowerCaseWord = word.trim().toLowerCase();
if (topWords.isEmpty()) {
return;
}
if (wordInfo.isTopList) {
Collections.sort(topWords, frecuencyComparator);
return;
}
WordInfo lastWordInfo = wordsInfo.get(topWords.get(topWords.size() - 1));
if (wordInfo.frecuency > lastWordInfo.frecuency) {
Collections.sort(topWords, frecuencyComparator);
if (topWords.size() > maxCapacity) {
wordsInfo.get(topWords.remove(topWords.size() - 1)).isTopList = false;
}
} else if (topWords.size() < maxCapacity) {
}
}

private void addTopWord(String word, WordInfo wordInfo) {
wordInfo.isTopList = true;
}

WordInfo wordInfo = wordsInfo.get(word);
if (wordInfo == null) {
wordInfo = new WordInfo();
wordsInfo.put(word, wordInfo);
} else {
wordInfo.frecuency++;
}
return wordInfo;
}

private class WordInfo {
boolean isTopList;
int frecuency = 1;
}
}``````

• This solution works for medium file size with around 5000 lines .I hope that works..Just replace 12 with whatever value of n you want..!!

`````` *
*/

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

/**
* @author arpit2408
*
*/
public class NHighestOccurChar {

/**
* @param args
*/
public static void main(String[] args) {
ArrayList<String> words=new ArrayList<String>();

try {
for (String line : Files.readAllLines(Paths.get("src/Randomwords.txt"))) {
for (String part : line.split("\\s+")) {
}
}
HighesOccuringChar(12,words);
}
catch (IOException e) {
e.printStackTrace();
}
}
public static void HighesOccuringChar(int n,ArrayList<String> words )
{
HashMap<Integer,String> frequnencyofwords=new HashMap<>();
int counter=0;
for(String s:words)
{
counter=0;
for(String sub:words)
if(s.equals(sub))
counter++;
frequnencyofwords.put(counter,s);
}
Set<Integer> keys=frequnencyofwords.keySet();
ArrayList<Integer> sortedkeys=new ArrayList<Integer>(keys);
Collections.sort(sortedkeys);
for(int i=0;i<n;i++)
{
System.out.println(frequnencyofwords.get(sortedkeys.get(sortedkeys.size()-1-i)));
}

}

}

``````

• @_l33t
Here is a possible solution for reference.

1. Scan the input strings in a given book.
2. Parser the words by a split of " " and remove all the marks.
3. Put the analyzed word into a HashMap. Key is word and the value is occurrence.
4. Walk through HashMap entrySet. Copy the values into a new int array.
5. Use quick sort partition method, get the n-th largest occurrence position in the new value array.
6. Quick sort the n-th largest array in desc order.
7. Put the n-th largest occurrences into a new LinkedHashMap<Integer, List<String>/handle duplicate occurs case/> named topNMap.
8. Go through the occurrence HashMap again. If the occurrence in topNMap, then put the work into topNMap.
9. Go through topNMap, output the top N words.

1. Handle the occurrence duplicated situation. For example, both "book", "sort" and "word" have the same occurrence of 10. The frequnencyofwords can only store one of them and remove the rest. So the solution is not accurate.
2. Sorting the whole list of occurrence is not the best solution. You can rank N first. Then sort only those N items.

• I think we can use HashMap and Priority Queue to solve this problem. My below solution for Top Hot Queries seems to work here in this case too.

Time Complexity - O(N Log(K)) + O(K) => ~ O(N Log(K))
Space Complexity - O(N) + O(K) => ~ O(N)

Will really appreciate if you guys have some other better solution.

``````package Amazon;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopHotQueries {
class Query {
String query;
int count;

public Query(String query, int count) {
this.query = query;
this.count = count;
}
}

private Map<String, Query> map;
private PriorityQueue<Query> pq;
private static final int LIMIT = 3;

public TopHotQueries() {
map = new HashMap<>();
pq = new PriorityQueue<>(LIMIT, new Comparator<Query>() {
@Override
public int compare(Query q1, Query q2) {
return q2.count - q1.count;
}
});
}

private Query insertToMao(String queryStr) {
Query query;

if (map.containsKey(queryStr)) {
query = map.get(queryStr);
} else {
query = new Query(queryStr, 0);
}

query.count = ++query.count;
map.put(queryStr, query);

return query;
}

private List<Query> getTopK() {
List<Query> queryList = new ArrayList<>();
PriorityQueue<Query> hotPQ = pq;

int counter = 0;
while(!hotPQ.isEmpty()) {

++counter;

if(counter == LIMIT) {
break;
}
}

return queryList;
}

private void insertToPQ(Query prevQuery, Query curQuery) {
if (null != prevQuery) {
pq.remove(prevQuery);
}

if (pq.size() == LIMIT) {
if (pq.peek().count < curQuery.count) {
pq.poll();
}
}

}

public void insert(String queryStr) {
Query prevQuery = map.get(queryStr);

Query curQuery = insertToMao(queryStr);
insertToPQ(prevQuery, curQuery);
}

public static void main(String[] args) {
// TODO Auto-generated method stub
TopHotQueries thq = new TopHotQueries();
thq.insert("tablet");
thq.insert("kindle");
thq.insert("mac");
thq.insert("earpod");
thq.insert("books");
thq.insert("tablet");
thq.insert("earpod");
thq.insert("kindle2");
thq.insert("tablet");
thq.insert("tablet1");
thq.insert("tablet2");
thq.insert("tablet");
thq.insert("mac");
thq.insert("earpod");
thq.insert("mac");

for(Query top : thq.getTopK()) {
System.out.println(top.query + " : " + top.count);
}
}

}

``````

• You can do this in couple of ways, I'll suggest 2 here that I think they are different on what have been suggested so far.

Idea1:
Use max heap only. We're going to have an Entry:

``````Entry {
String word
int freq
}

override equals(String other) {
return word.equals(other)
}

override hashcode() {
return word.hashcode()
}
``````

And we're going to build a max heap with a custom comparator that compare the frequency of entries.

The process would be :
2- If word in heap, then increase its frequency, else add it to heap with freq = 1
3- After reading all words, remove first N elements from heap

Time Complexity would be O(K Log K) where K is the number of words

Idea 2:
Build a frequency hashtable, then do counting sort. The reason counting sort would be good here is that on average the maximum frequency that a word could happen in a book is what? a 1000 times? 2000? 5000? Obviously we should study the data (the words) more close to see if counting sort won't eat up lots of memory.
2- If word in hashtable then increase its frequency, else insert it with freq = 1
3- Do one pass on the hashtable and get the max freq
4- Build an array with this freq
5- Do counting sort
6- Get the largest N words

Time complexity would be O(K) where K is the number of words.
Space complexity will be dominated by the array created for counting sort.

• I will divide this question into 2 sub questions. 1th, get words frequency. 2th, find the n largest elements.

1th can be done using hashmap.

2th is classic, maybe quickselect?

• C#

``````

public List<Result> GetTopNWords(int n, string s)
{
Dictionary<string, int> dictResults = new Dictionary<string, int>();
string[] split = s.Split(' ');
for (int i = 0;i<split.Length;i++)
{
if(!dictResults .ContainsKey(s[i]))
{
}
else
{
dictResults [s[i]] +=1;
}
}

int[] frequencies = new int[dictResults.Values.Count];
dictResults.Values.CopyTo(frequencies, 0);
string[] words = new string[dictResults.Values.Count];
dictResults.Keys.CopyTo(words, 0);

Array.Sort(frequencies, words);
Dictionary<string, int> topNWords = new Dictionary<string, int>();

for(int i = 0;i<n;i++)
{