I’ve started a new website.
Most of you probably know me as a guy who rants about stuff on the web. But some of you may be aware of my real job as asoftware developer.
Most of the time I work on large, mature web products: The code base typically has over a million lines of code, with tens of thousands of code commits dating back 10 to 15 years. I’m just one member of a dozen or so teams working on the product, with each team consisting of a product manager, a team lead, a scrum master, a few software engineers, and some quality assurance engineers — perhaps 50 to 120 people in total across all teams. And that’s just to develop the code. We also have a team to manage the build-and-deploy pipeline, and an Ops team to keep all the servers running.
Funny thing, as a consequence of this efficient specialization, despite working on web applications almost every day, it’s actually been many years since I built a brand new web site up from scratch. It’s been even longer since I set up a public web server that was accessible on the internet. I understand the basic concepts and the steps involved in building and launching a website, but I haven’t actually done it lately.[1]I’m not counting simple learning demo projects — the kind where you follow a page of instructions to generate a site and publish it to the cloud.
I decided it was time to remedy that.
Meanwhile. the New York Times had been spamming my Twitter timeline with ads for its Spelling Bee puzzle. For those who don’t know, a Spelling Bee puzzle looks like this:
Your goal is to spell words using only the 7 letters in the puzzle and you must include the center letter. For example, using the above letters, you can spell words like “guilty” and “liturgy.” You’re allowed to use a letter more than once, so you can also spell “utility” and “guiltily.” On the other hand, words must be at least 4 letters long, so you can’t spell “it” or “rut.” And for various reasons, Spelling Bee limits the words it will accept. Strange words like “gulgul” (A plant from India) and “grrrl” (A reference to 1990s “riot grrrl” culture) are disallowed, as are a lot of offensive words and proper nouns.
Spelling Bee words are scored as follows:
- 4-letter words get one point
- longer words get one point per letter
- if a word uses all seven puzzle letters, it gets a double score
So “gull” gets 1 point for being only 4 letters long, “guilty” gets 6 points, “utility” gets 7 points, “guiltily” gets 8 points, and “liturgy” gets 14 because it uses all the letters.
I was intrigued by this puzzle because it seemed like I could probably write a program to find words for it. Eventually, I had an “Aha!” moment and realized a Spelling Bee word generator would be a fun thing to share with others on a website.[2]I know it should be Wordle, but Wordle was only just beginning to hit big and I had seen Spelling Bee first. Also, a website to help with Wordle would be more complicated.
So I set off in three directions at once. First, I began reading books and documentation that explained how to build websites using Microsoft’s ASP.NET Core package, and second, I began to read up on hosting websites on the Microsoft Azure cloud. These are important personal learning goals for me, but I won’t bore you with all the details.
The third thing I did was to try to come up with a simple algorithm for generating lists of candidate words. As I said, I don’t know which words Spelling Bee will accept because I don’t have access to their dictionary. I would need to provide one of my own.
Unfortunately, I don’t have access to a high-quality modern dictionary in computer-readable form. I imagine companies like Merriam-Webster would be willing to license me a copy of their dictionary for a price, but my guess is that price would be way too high for a hobby project like mine. My compromise was to create three different sized dictionaries built up by combining publicly-available word lists:
- Limited dictionary – 146,513 words
- List of Stemmed and Lementized English words from Kaggle
- List of 10000 Words from Kaggle
- These words all come from sources that have been carefully curated by language experts, so they are likely to be mostly valid English words.
- Standard dictionary – 719,972 words
- All the words in the Limited dictionary
- Words extracted from the English languages Wiktionary
- This adds a huge dump of words that people have defined in Wiktionary over the years. It’s not exactly curated, but most of the words should be legitimate.
- Enhanced dictionary – 837,983 words
- All the words in the Standard dictionary
- List of 370k English words corpus from Kaggle
- List of 479k English Words from Kaggle
- The words in this collection were automatically extracted from text samples by data mining algorithms, and may of them are probably junk.
One problem with these dictionaries is that they all seem to have a lot more words than Spelling Bee will accept. I don’t have a simple way of dealing with that, so I’ll leave it up to the users to figure out which words to try.
As for the search algorithm, My first thought was that there aren’t that many possible combinations of 7 letters. The code to generate all possible permutations of 4, 5, 6, and 7 letters[3]By my calculations there are 11640 of them should be very fast. Then it would be a matter of just looking up each of the word candidates in a sorted dictionary. Easy peasy, right?
Or do you see the problem? I had missed the rule that said letters could be used more than once. But if I started generating permutations of letters that I could use over and over, there was no limit to how many candidate words I could generate. I had turned a nice puzzle into an unbounded problem. Oops. Let’s try this again…
I’ll start with the dictionary of words, and check each one against the puzzle to see if the word could be formed from those letters. This bounds the problem to the size of the dictionary. That’s a lot more practical.
I wrote some code, tested it, and it worked — put in the puzzle parameters, get out a list of words. I came up with a way to load a dictionary into a data structure that was quick to search. It could find all matching words in the 800,000-word dictionary in less than a millisecond.[4]For those who are curious, I loaded the dictionary into a tree. The root node had 26 children, one for each letter from ‘a’ to ‘z’, representing the first letter of a word in the dictionary. Each of those children had up to 26 children, representing the 2nd letter of a word. Starting at the 4th level, nodes also had a flag indicating that the end of a word had been reached. That didn’t mean the tree stopped there: There could be a flag at level 4 for … Continue reading The problem was, the data structure took up a lot of memory, and I’m trying to spend as little as I can on cloud resources. Ultimately, I fell back on scanning the dictionary file every time, but caching the results, so someone else asking for the same puzzle would get a fast response.
I’ve finally got a website site up and running. I plan to do more things with it than just the Spelling Bee word list generator, so I gave it a fairly general name that captures where I want to go with it: WindyTHINK. I plan to do more with it in the future, but not all of what I do will be obvious to users, some will be tinkering with the back end just for fun. But I do have some ideas for other pages I could build. (And feel free to offer suggestions.) For now, however, there’s just the Spelling Bee word generator.
Footnotes
↑1 | I’m not counting simple learning demo projects — the kind where you follow a page of instructions to generate a site and publish it to the cloud. |
---|---|
↑2 | I know it should be Wordle, but Wordle was only just beginning to hit big and I had seen Spelling Bee first. Also, a website to help with Wordle would be more complicated. |
↑3 | By my calculations there are 11640 of them |
↑4 | For those who are curious, I loaded the dictionary into a tree. The root node had 26 children, one for each letter from ‘a’ to ‘z’, representing the first letter of a word in the dictionary. Each of those children had up to 26 children, representing the 2nd letter of a word. Starting at the 4th level, nodes also had a flag indicating that the end of a word had been reached. That didn’t mean the tree stopped there: There could be a flag at level 4 for “read” and then another flag three levels below that for “reading.” A search of the tree consisted of a depth-first traversal, following only edges for letters in the puzzle. Upon reaching an end-of-word flag, the word would be checked for the presence of the required letter and then emitted as a result. I realize this is overkill because the puzzles only change once a day, so optimizing for changing puzzles isn’t much of a win. But it was more fun. |
Leave a ReplyCancel reply