In this assignment, you will create a program that talks like Shakespeare! You’ll do this using a natural language processing technique called “N-gram Language Modeling.”
Please download shakespeare.txt, which contains the complete works of William Shakespeare. Save it to your program’s directory, next to your .go file. Recall that to read the contents of a file into a string, you can use the following code:
1 2 3 4 5 6 7 | contents, err := ioutil.ReadFile("shakespeare.txt") if err != nil { fmt.Fprintf(os.Stderr, "Error reading file %s: %s\n", "shakespeare.txt", err.Error()) os.Exit(1) } words := strings.Fields(string(contents)) |
In this code, contents
holds a slice of bytes (representing the individual characters in the file), which is then converted into a string and passed to the strings.Fields
function. This function, in the strings
package, consumes a string and produces a slice of strings, each representing one word of the original string. By default, it separates words on whitespace.
Once you have a slice of all the words Shakespeare ever wrote, it is time to analyze the text. The goal is to figure out, for each word, which words usually follow it. More precisely, given some word (like “thou”), you should know how many times any other word (“hast”, “dost”, “knowest”) occurred immediately after it.
To store this information, you will use Go maps. At the top level, you will have an analysis, which is a map[string]wordcount
. The key of the map is a word; the value is a wordcount
, a type you will write shortly that stores statistics on the number of times different words appear. For example, analysis["the"]
would contain a wordcount
counting the number of times each word in the text appeared immediately following the word “the.”
A wordcount
is a struct, containing two values:
1 2 3 4 | type wordcount struct { counts map[string]int total int } |
The idea here is that the counts
map stores the actual statistics: “prince” occurred 3 times, “king” occurred 5 times, and so on. The total
stores the total sum of all ints in the counts
map. This allows you to easily compute, say, the fraction of times “prince” occurs in some context: just divide counts["prince"]
by total
.
Write a method func (w wordcount) observeWord(word string) wordcount
, which causes the wordcount to update its statistics in response to observing a certain word, and returns the new wordcount. In particular, the total
should go up by 1, as should the value of counts[word]
(where word
is the argument, the word that was observed). Remember that when a new wordcount
is created, the counts
map will be nil
; check for this possibility, and if it is nil
, make(...)
the map before trying to add to it.
In your main
function, after you obtain the words
slice, loop through it and build up your analysis
, which (again) is a map[string]wordcount
. Every time you observe two words in a row, word1
and then word2
, call observeWord(word2)
on the wordcount
associated with word1
, storing the result of observeWord
(the updated wordcount
) back in the analysis map. At the end of your processing, if I want to find out how many times the phrase “thou hast” occurs in Shakespeare, I should be able to write analysis["thou"].counts["hast"]
.
Now, it’s time to generate the text.
Create a method (w wordcount) generateRandomWord() string
for wordcount
s that picks a random word based on the statistics it’s storing. That is, if the wordcount
has observed “prince” 5 times and “princess” 4 times, it should generate “prince” with probability 5/9 and “princess” with probability 4/9.
One way to do this is (as we discussed in class) to generate a random integer between 0 and total-1
. Use rand.Intn(total)
to do this. Then, loop through the entries of the counts
map, subtracting the stored number from your random integer. When the integer goes below 0, whichever word you’re currently processing is your random word. Here is a visualization of the process:
total observed: 26 random number: 16 the 10 -10 = 6 sad 3 -3 = 3 happy 5 -5 = -2, so "happy" is the word what 8
Prompt the user for a word to begin Shakespeare’s new opus with:
1 2 3 | var startingWord string fmt.Print("Enter a starting word: ") fmt.Scanln(&startingWord) |
Then, use analysis[startingWord].generateRandomWord()
to generate the next word. Use that word to generate the next word using the same strategy. Keep doing this, in a loop, at least 100 times. (You can decide how long you want Shakespeare’s new text to be.) Print each word as you generate it, putting spaces between them.
Please submit your code in a file named YourNameShakespeare.go
. Leave a Google Classroom comment describing how you tested your code.