A data analysis toolkit

Assignment overview

In this homework, you will be developing and using a toolkit for data analysis.

In class, we wrote two functions, extract and where, that allow you to write nice-sounding, English-like expressions for analyzing data, for instance

1
2
3
4
5
6
7
; Extract the "WORD" field from all entries with Part Of Speech
; equal to "pron".
(map
  (extract "WORD")
  (filter
     (where "PART OF SPEECH" string=? "pron")
     DATASET))

This pattern, of mapping an extract function over a filtered version of a dataset, is so common, that it can be useful to define another function that does exactly that:

1
2
3
4
5
6
; from : (Listof DataEntry) (DataEntry -> Type1) (DataEntry -> Bool) -> (Listof Type1)
; Takes in a function, a predicate, and a list, and performs
; the function on everything satisfying the predicate in the list
; (deleting everything else).
(define (from dataset mapping-func filtering-func)
   (map mapping-func (filter filtering-func dataset)))

With this function, we can write things like

1
2
3
(from DATASET
  (extract "WORD")
  (where "PART OF SPEECH" string=? "pron"))

which is starting to sound very English-like! Note that this function looks cool, but is quite simple. We are simply mapping the (extract "WORD") function over the list of data entries, filtering out anything that’s not a pronoun using the (where "PART OF SPEECH" string=? "pron") predicate.

Improved where clauses

We will now focus on allowing more complex “where” clauses in our toolkit.

1Combining predicates

The where function we wrote in class allows you to check that one field satisfies a condition. Ideally, we would be able to combine predicates to check multiple conditions. For example, we might want to find words with frequency greater than 100,000 that are nouns. We can write (where "PART OF SPEECH" string=? "n") to make a noun-checking predicate, and (where "FREQUENCY" > 100000) to make a frequency-checking predicate. Now we need some way to combine them, to say, both of these things must be true. In other cases, we might want to check that either one condition or the other must be true.

Write the functions where-and and where-or, which each take in two existing where clauses (i.e., predicates), and produce a single predicate.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
; where-and : (Type1 → Boolean) (Type1 → Boolean) → (Type1 → Boolean)
; Produces a predicate that is true only when both of the argument
; predicates are satisfied.

; Examples:
; (where-and (where "RANK" < 100) (where "PART OF SPEECH" string=? "n"))
;     Says #true for data-entries that are nouns with rank less than 100.
;
; (where-and even? divisible-by-three?)
;     Says #true for numbers that even and divisible by three.

(define (where-and clause1 clause2)
  ...)
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
; where-or : (Type1 → Boolean) (Type1 → Boolean) → (Type1 → Boolean)
; Produces a predicate that is true when either of the argument
; predicates are satisfied.

; Example:
; (where-or even? odd?)
;     Says #true for any integer
;
; (where-or (where "RANK" < 10) (where "FREQUENCY" = 800))
;    Says #true for any words that have frequency 800 OR rank < 10

(define (where-or clause1 clause2)
  ...)

2Checking a transformation of a field

Write the function where-transformed, which allows you to apply some transformation function to a field before comparing it to a value. This is clearest with examples:

1
(where-transformed string-length "WORD" > 5)

would be a predicate that checks if the string-length of the WORD field is greater than 5. Similarly,

1
(where-map (lambda (x) (string-ith x 0)) "WORD" equal? "a")

would check if the WORD field begins with an "a". (You can check the vanilla where function’s code as a starting point.)

3A do-nothing where clause

Sometimes, we don’t want to use a where clause–we want to get all rows from the database. Write a function all-rows that can be used as follows:

1
(from DATASET (extract "WORD") all-rows)

This command would extract the “WORD” field of every row. In other words, all-rows is basically a where clause that says “Yes” to every row, so all of them are included in the answer.

4Comparing two fields

Sometimes, we wish to compare two fields of a dataset. For example, if we have a dataset of children where each row has a “favorite adult” field and a “mom” field, and we want to see all the children whose favorite adult is their mom.

We might like to write something like

1
2
3
(from DATASET
  (extract "CHILD NAME")
  (where-compare "FAVORITE ADULT" equal? "MOM"))

Write the where-compare function.

Analysis

Now it’s time to put your toolkit to use.

5Analyzing a dataset

Choose one dataset, linked below, download it, and write functions to answer the corresponding question about it. In addition, write and answer one more question about your dataset.

Hints:

Datasets

  • Academy Awards Demographics
    Write a function that takes in a category (“Best Actor”, etc.) and calculates the average age of a winner in that category.

  • McDonald’s Reviews
    Write a function that takes in a city and returns the average length of a review from someone in that city.

  • Blockbuster Database
    Write a function that calculates the average amount of money made by a movie with a given rating (“G”, “PG”, etc.)

  • Police-Involved Fatalities Since May 2013
    Of all fatalities in which the deceased was potentially unarmed (i.e., “Unarmed?” column is not “No”), in what percent of them was the deceased white and in what percent non-white?