
I have some code for an assignment where the main marks are awarded based on efficiency. The program uses a given word generator which generates random words and adds them to a user-made data structure. You also have to provide three methods: one to add an item, one to remove an item, and one to count how many times a word is present in the structure. Here is my code:

public class WordStoreImp implements WordStore {

    public class Node {
        public String data;
        public Node next;

        public Node(String addingWord, Node theNext) {
            data = addingWord;
            next = theNext;
        }
    }

    private Node[] array;
    int usedAmount = 0;

    public WordStoreImp(int n) {
        array = new Node[n];
    }

    public void add(String word) {
        int position = hashFunction(word);
        array[position] = new Node(word, array[position]);
    }

    public int count(String word) {
        int number = 0;
        int position = hashFunction(word);
        Node temp = array[position];
        while (temp != null) {
            if (temp.data.equals(word)) {
                number++;
            }
            temp = temp.next;
        }
        return number;
    }

    public int hashFunction(String a) {
        int sum = 1;
        for (int i = 0; i < a.length(); i++) {
            char b = a.charAt(i);
            int value = (int) b;
            sum *= value;
        }
        sum = sum % array.length;
        if (sum < 0) {
            return -sum;
        }
        return sum;
    }

    public void addthings(String word, int n) {
        for (int i = 0; i < n; i++) {
            add(word);
        }
    }

    public int printSize() {
        int count = 0;
        for (int i = 0; i < array.length; i++) {
            if (array[i] != null) {
                count++;
            }
        }
        System.out.println(count);
        return count;
    }

    public static void main(String[] args) {
        WordStoreImp a = new WordStoreImp(100);
        a.addthings("abc", 100);
        System.out.println(a.count("abc"));
        System.out.println(a.count("abc"));
    }
}

So far I have written two of the methods, add and count. While my add method is very quick, the count method is much slower and takes a lot more time to compute. Is there a more efficient way to write this method, or can I change things here to make it quicker?

Also, we aren't allowed to use any built-in Java methods or data structures, such as HashMap.

Comments:

  • "based on efficiency" Efficiency of what? Are you looking for improvement of memory usage, speed, logarithmic complexity, etc.? Please clarify. – Mast, Dec 10, 2017 at 20:29
  • I'm looking for speed efficiency, for it to be as quick as possible. – Dec 10, 2017 at 21:28
  • If you're not against using something besides an in-house HashMap, you could look into a Trie. It should be faster than the LinkedList backing (which should also contain a count instead of duplicates). Do note that just increasing the size of the HashMap may also help with lowering the collision rate. – phflack, Dec 11, 2017 at 18:24
  • You haven't satisfied the requirement to provide a method to remove an item. – Dec 13, 2017 at 6:08
  • The String class has a .hashCode() method. Are you allowed to use that? – Dec 13, 2017 at 6:11

2 Answers


There are many ideas to speed up hashtables, or hash maps. Some ideas appear in an article about hashtables in PHP, and a comment describing hashtables in Ruby. I present some ideas here, but these ideas might or might not speed up your program, because I have not tested them.

Add a counter for each word. If the word "the" appears 20 times, your code adds 20 entries for "the". The idea is to have one entry for "the", with the counter set to 20. This would be like using Java's HashMap<String, Integer> to map each word to its counter.

For this idea, you would add an int count; field to class Node. To add n copies of word, you would search for the Node with word and add n to its count. If the Node didn't exist, you would insert a new Node and set its count. Your table would have fewer nodes, so searches might be faster. Also, there would be only one node for each word, so you could stop searching as soon as you found the node.
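A minimal sketch of this idea (it assumes Node gains an int count field and a matching three-argument constructor, neither of which is in the original class):

public void add(String word, int n) {
    int position = hashFunction(word);
    for (Node temp = array[position]; temp != null; temp = temp.next) {
        if (temp.data.equals(word)) {
            temp.count += n; // word already present: just bump its counter
            return;
        }
    }
    // first occurrence: one node per word, created with its initial count
    array[position] = new Node(word, array[position], n);
}

public int count(String word) {
    int position = hashFunction(word);
    for (Node temp = array[position]; temp != null; temp = temp.next) {
        if (temp.data.equals(word)) {
            return temp.count; // at most one node per word, so stop here
        }
    }
    return 0;
}

With this change, count() makes at most one pass over a much shorter chain, and addthings(word, n) becomes a single add(word, n) call instead of n separate inserts.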

Change the hash function. Your hash function multiplies together the character codes in a string. If a string has one even-numbered character (like 'l', 'n', 'r', 't'), then the product is a multiple of 2. If the string has two such characters, then the product is a multiple of 4, and so on. With 100 buckets, your hash might put most words in the multiple-of-4 buckets, some words in the other even buckets, and almost no words in the odd buckets. Then your program would search for most words in the multiple-of-4 buckets, but those buckets would have long lists of nodes.
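You can observe the clustering directly. This small check (hypothetical; it only calls the public hashFunction from the question) shows that any word containing a character with an even code lands in an even-numbered bucket, because 100 is itself even:

WordStoreImp store = new WordStoreImp(100);
String[] samples = { "the", "and", "cat", "dog", "list" };
for (String s : samples) {
    // 't' = 116, 'n' = 110, 'd' = 100 are even, so each product of
    // character codes is even, and so is its bucket modulo 100
    System.out.println(s + " -> bucket " + store.hashFunction(s));
}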

A better hash function might use a rotation (also known as a barrel shift) to spread the bits, with addition to mix in each character.

int hash = 0;
for (int i = 0; i < a.length(); i++) {
    hash = Integer.rotateLeft(hash, 5);
    hash += (int) a.charAt(i);
}

Use power-of-2 buckets. Division a / b and modulus a % b tend to be slow when b isn't a constant. Your hash function computes sum = sum % array.length. If the length were a power of 2, you could use sum = sum & (array.length - 1) to get the modulus. For this idea, your constructor WordStoreImp(n) would round n up to a power of 2, with code like

if (n < 16)
    n = 16;
else
    n = Integer.highestOneBit(n - 1) << 1;

For example, if n was 100, then the highest 1-bit in 99 is 64, and 64 shifted left by 1 is 128. Because 128 is 2 to the 7th power, hash & 127 extracts the low 7 bits of hash as an integer from 0 to 127.
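Combined with the rotation hash above, the whole hash function might then look like this (a sketch; it assumes the constructor has already rounded array.length up to a power of 2):

public int hashFunction(String a) {
    int hash = 0;
    for (int i = 0; i < a.length(); i++) {
        hash = Integer.rotateLeft(hash, 5); // spread earlier bits around
        hash += (int) a.charAt(i);          // mix in the next character
    }
    return hash & (array.length - 1); // cheap modulus, never negative
}

Note that the & also removes the need for the if (sum < 0) check, because the mask clears the sign bit.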

Switch from separate chaining to open addressing. Your hash table uses a separate chain (a linked list) in each bucket. You allocate memory with new Node, but memory allocation might be slow. The idea is to switch to open addressing and remove the new Node calls.

With open addressing, each bucket holds at most one word. You would allocate the buckets with

words = new String[n];
counts = new int[n];

This sets each word to null and each count to zero. A bucket is empty when its word is null. To search for a word, you start at its hash bucket; if that bucket holds a different word, you check the following buckets until you find the correct word or reach an empty bucket.

pos = hashFunction(word);
while (words[pos] != null && !words[pos].equals(word))
    pos = (pos + 1) & mask; // where mask = words.length - 1

At this point, counts[pos] is the correct count for this word. If the bucket is empty, then the count is still zero. If you are adding the word, then you would set words[pos] = word and increase the count.

In open addressing, a bucket with a word can never become empty; you can only remove a word by setting its count to zero, and leaving the word in the bucket. To be fast, at least half of the buckets must be empty. You would keep a count of the number of words in the table. When the count reaches n/2, you would allocate a new table with twice as many buckets, and move all the words to the new table (skipping words where the count is zero).
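A sketch of that growth step, assuming hypothetical fields words, counts, used, and mask (none of these names appear in the original code, and hashFunction is assumed to mask with the new table size):

private void grow() {
    String[] oldWords = words;
    int[] oldCounts = counts;
    words = new String[oldWords.length * 2];
    counts = new int[oldWords.length * 2];
    mask = words.length - 1;
    used = 0;
    for (int i = 0; i < oldWords.length; i++) {
        if (oldWords[i] != null && oldCounts[i] > 0) { // skip empty and removed buckets
            int pos = hashFunction(oldWords[i]);
            while (words[pos] != null)
                pos = (pos + 1) & mask; // linear probing in the new table
            words[pos] = oldWords[i];
            counts[pos] = oldCounts[i];
            used++;
        }
    }
}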


So, you need a data structure to perform three operations:

• adding strings
• removing strings
• counting strings

Currently your addthings() and count() operations both have O(n) complexity: adding 100 words takes 100 times as long as adding one word, and counting a word that occurs 100 times takes 100 times as long as counting a word with one occurrence.

To improve the performance:

• Consider whether you really need to store the exact same word 100 times.

For further improvement:

• What happens when two different words result in the same hash code? (See the sketch below.)
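For instance, with the multiplicative hash from the question, any two anagrams collide, because multiplication is commutative; only the equals() check inside count() keeps them apart. A small demonstration (using the class from the question):

WordStoreImp store = new WordStoreImp(100);
System.out.println(store.hashFunction("listen")); // same bucket...
System.out.println(store.hashFunction("silent")); // ...as this one
store.add("listen");
System.out.println(store.count("silent")); // prints 0: equals() rejects the anagram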
