Shakespearean robots

Assignment overview

In this assignment, you will create a program that talks like Shakespeare! You’ll do this using a natural language processing technique called “N-gram Language Modeling.”

Reading the file

Please download shakespeare.txt, which contains the complete works of William Shakespeare. Save it to your program’s directory, next to your .go file. Recall that to read the contents of a file into a string, you can use the following code:

contents, err := ioutil.ReadFile("shakespeare.txt")
if err != nil {
    fmt.Fprintf(os.Stderr, "Error reading file %s: %s\n", "shakespeare.txt", err.Error())
    os.Exit(1)
}

words := strings.Fields(string(contents))

In this code, contents holds a slice of bytes (representing the individual characters in the file), which is then converted into a string and passed to the strings.Fields function. This function, in the strings package, consumes a string and produces a slice of strings, each representing one word of the original string. By default, it separates words on whitespace.

Analyzing the text

Once you have a slice of all the words Shakespeare ever wrote, it is time to analyze the text. The goal is to figure out, for each word, which words usually follow it. More precisely, given some word (like “thou”), you should know how many times any other word (“hast”, “dost”, “knowest”) occurred immediately after it.

To store this information, you will use Go maps. At the top level, you will have an analysis, which is a map[string]wordcount. The key of the map is a word; the value is a wordcount, a type you will write shortly that stores statistics on the number of times different words appear. For example, analysis["the"] would contain a wordcount counting the number of times each word in the text appeared immediately following the word “the.”

A wordcount is a struct, containing two values:

type wordcount struct {
  counts map[string]int
  total  int
}

The idea here is that the counts map stores the actual statistics: “prince” occurred 3 times, “king” occurred 5 times, and so on. The total stores the total sum of all ints in the counts map. This allows you to easily compute, say, the fraction of times “prince” occurs in some context: just divide counts["prince"] by total.

1A method for observing words

Write a method func (w wordcount) observeWord(word string) wordcount, which causes the wordcount to update its statistics in response to observing a certain word, and returns the new wordcount. In particular, the total should go up by 1, as should the value of counts[word] (where word is the argument, the word that was observed). Remember that when a new wordcount is created, the counts map will be nil; check for this possibility, and if it is nil, make(...) the map before trying to add to it.

2Build the analysis

In your main function, after you obtain the words slice, loop through it and build up your analysis, which (again) is a map[string]wordcount. Every time you observe two words in a row, word1 and then word2, call observeWord(word2) on the wordcount associated with word1, storing the result of observeWord (the updated wordcount) back in the analysis map. At the end of your processing, if I want to find out how many times the phrase “thou hast” occurs in Shakespeare, I should be able to write analysis["thou"].counts["hast"].

Generating new text

Now, it’s time to generate the text.

3A method for choosing a random word

Create a method (w wordcount) generateRandomWord() string for wordcounts that picks a random word based on the statistics it’s storing. That is, if the wordcount has observed “prince” 5 times and “princess” 4 times, it should generate “prince” with probability 5/9 and “princess” with probability 4/9.

One way to do this is (as we discussed in class) to generate a random integer between 0 and total-1. Use rand.Intn(total) to do this. Then, loop through the entries of the counts map, subtracting the stored number from your random integer. When the integer goes below 0, whichever word you’re currently processing is your random word. Here is a visualization of the process:

total observed: 26
random number:  16

the   10        -10  = 6
sad    3        -3   = 3
happy  5        -5   = -2, so "happy" is the word
what   8

4Generate the random text

Prompt the user for a word to begin Shakespeare’s new opus with:

var startingWord string
fmt.Print("Enter a starting word: ")
fmt.Scanln(&startingWord)

Then, use analysis[startingWord].generateRandomWord() to generate the next word. Use that word to generate the next word using the same strategy. Keep doing this, in a loop, at least 100 times. (You can decide how long you want Shakespeare’s new text to be.) Print each word as you generate it, putting spaces between them.

Submission

Please submit your code in a file named YourNameShakespeare.go. Leave a Google Classroom comment describing how you tested your code.