Thursday, December 03, 2009

Word pair matching challenge

In trolling the net I found a problem that I thought would be fun. It was posted by CitrusByte and is as follows: Given a dictionary, output all word pairs where both words share all their letters except the last two, which are distinct and reversed.

My solution is a bit long to post inline, but go check it out on github.

My first attempt kept all the words around while the first letter remained the same. It ran in 4 minutes using a ~450k words. This last implementation runs in 5 sec on the same list. Yea! I was surprised too. Just goes to show that you don't always understand the real nature of your data until you test it.

The reason for the speed up it drawn on my board in pictures and diagrams, but I'll try to explain it. When the list of words we can check against is sorted, and our list of possible matches is sorted, when we have passed a given word (represented by the current word being greater than the given word) there is no possibility of a match existing so we can remove the word from the list.

If we are only considering words of the same length in a given list and we remove all words that we have passed, then the word we are looking for should be on top if it is in the list at all. Remember this is using a sorted list of matches. We have to resort the list each time we add an item, but resorting the list each time isn't too bad because the lists stay very small.

No comments: