Introduction

Over Christmas 2017, my wife, Jill, resolved that our family (to include babysitters) would read our children over 1000 new books in 2018. More specifically, 29 December 2017 - 28 December 2018.

A goal like this takes dedication, love, and of course, a Data Science-y approach.

Special thanks to Ally, Zia, and Bookbuddy for help in accomplishing this goal!

Also, special thanks to drob and juliasilge who privde the tidytext package and the entire R open-source community I rely on to do this analysis.

Since I use this page as a way to log my ideas and teach others (and future me) how I did this, I’ll include the code.

Exploratory Data Analysis

Libraries

The libraries I will use for this analysis are the following:

library(tidyverse)
library(lubridate) 
library(janitor)
library(stringr)
library(tidytext)
library(forcats)
library(zoo)
library(kableExtra)
library(DT)

Read in Data

Lets read in the data and get a feel for what information we have. Since my wife, our babysitter, and I have scanned the books in bookbuddy, I need to upload all the data and join it together. We have the data in many different spreadsheets to I’ll have to read them all in separately. I had to change some of the columns to character to make joining these dataframes go more smooth later on.

book = read_csv("data/children_book/BookBuddy.csv") %>% mutate(ISBN = as.numeric(ISBN))
book1 = read_csv("data/children_book/BookBuddy (1).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book2 = read_csv("data/children_book/BookBuddy (2).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book3 = read_csv("data/children_book/BookBuddy (3).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book4 = read_csv("data/children_book/BookBuddy (4).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book5 = read_csv("data/children_book/BookBuddy (5).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book6 = read_csv("data/children_book/BookBuddy (6).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book7 = read_csv("data/children_book/BookBuddy (7).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book8 = read_csv("data/children_book/BookBuddy (8).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book9 = read_csv("data/children_book/BookBuddy (9).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book10 = read_csv("data/children_book/BookBuddy (10).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book11 = read_csv("data/children_book/BookBuddy (11).csv")  %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book12 = read_csv("data/children_book/BookBuddy (12).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>%  mutate(`Date Added` = as.character(`Date Added`))
book13 = read_csv("data/children_book/BookBuddy (13).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book14 = read_csv("data/children_book/BookBuddy (14).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))
book15 = read_csv("data/children_book/BookBuddy (15).csv") %>% mutate(ISBN = as.numeric(ISBN)) %>% mutate(OCLC = as.character(OCLC)) %>% mutate(`Date Published` = mdy(`Date Published`)) %>% mutate(`Date Added` = as.character(`Date Added`))

Join and Clean Data

Now that we’ve read it in, I need to do a few things to support future analysis.

  1. Many of the column names have spaces which does not support data munging. I’ll use the clean_names command from the janitor package to clean this up.

  2. I have a lot of duplicate books. I’ll keep only the rows that have distinct Title and Author. I used the distinct command because some of us scanned the same books. We do not want to account for them twice.

  3. I need to convert the DatePublished and DateAdded to characters (don’t worry, I’ll bring it back later.)

  4. It seemed like every Berenstain Bear book had a little bit different Author. Lets clean that up a little too.

  5. There were some DateAdded dates that do not exist. For a while, when one of our babysitters was saving the data, this information was not being saved. We assigned the book the most recent data from when we caught the error (as seen in the last line below). This is not a reasonable assumption, but its the best we can do.

  6. Lastly, I’ll only use the following columns: Title, Author, Publisher, YearPublished, Genre, Summary, DateAdded, DatePublished, and GoogleVolumeId. The rest of the columns contain blank information.

booksread =
  book %>% bind_rows(book1) %>% bind_rows(book2) %>% bind_rows(book3)  %>%
  bind_rows(book4) %>% bind_rows(book5) %>% bind_rows(book6) %>% 
  bind_rows(book7) %>% bind_rows(book8) %>% bind_rows(book9) %>% 
  bind_rows(book10) %>%  bind_rows(book11) %>% bind_rows(book12) %>% 
  bind_rows(book13) %>% bind_rows(book14) %>% bind_rows(book15) %>%
  clean_names(case = "upper_camel") %>% 
  distinct(Title, Author, .keep_all = TRUE) %>% 
  mutate(DatePublished = ymd(str_sub(DatePublished,1,10)))%>%
  mutate(DateAdded = ymd(str_sub(DateAdded,1,10))) %>%
  select(Title, Author, Publisher, YearPublished, Genre, Summary, NumberOfPages, DateAdded, DatePublished,GoogleVolumeId) %>%
  mutate(Author = ifelse(grepl("Bere", Author)==TRUE, "Berenstein", Author)) %>%
  mutate(DateAdded = na.locf(.$DateAdded, fromlast = TRUE))

Here is the data for you to explore:

booksread %>% 
  select(-Summary) %>% 
  datatable()

Analysis

To answer the primary question: How many books did we read?

nrow(booksread)
## [1] 1070

What is the span of the dates of our our books?

booksread$DateAdded %>% min
## [1] "2017-12-29"
booksread$DateAdded %>% max
## [1] "2018-12-18"

Top Authors

What other interesting things can we find? Who were our top authors?

booksread %>%
  group_by(Author) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>% 
  filter(n>5) %>%
  kable()
Author n
Berenstein 72
Mercer Mayer 17
Dr. Seuss 15
James Dean 14
Victoria Kann 14
Cynthia Rylant 11
Jane O’Connor 9
Gail Gibbons 6
Kate Banks 6
Rosemary Wells 6
booksread %>%
  group_by(Author) %>%
  summarise(n = n()) %>%
  filter(n>5) %>%
  ggplot(aes(x=fct_reorder(Author, desc(n)), y = n)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Read Authors", x = "Authors", y = "Number of Books") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

I’m a big fan of the Berenstain Bears, Dr. Seuss, and Pete The Cat (James Dean), but not so much Fancy Nancy (Thanks Jane O’Connor)….

Timeline

booksread %>%
  group_by(DateAdded) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  mutate(BooksByDate = cumsum(n)) %>%
  ggplot(aes(x=DateAdded, y = BooksByDate)) +
  geom_line() + 
  geom_point() +
  labs(title = "Books Read in 2018", x = "Date", y = "Total Books Read")

There are a few things to note from the data above.

  1. We were more vigilant at scanning books as we read them at the beginning – vs scanning them all at once at the end.

  2. We had a couple droughts of book reading – in particular August was a slow period.

  3. The big jump in December was a data collection error mentioned above. These were likely spread across the previous 3 months.

booksread %>%
  mutate(month = month(DateAdded)) %>%
  group_by(month) %>%
  summarise(BooksInMonth = n()) %>%
  kable()
month BooksInMonth
1 98
2 59
3 90
4 95
5 58
6 90
7 69
8 22
9 109
10 23
11 225
12 132

A Book by… God?

And just as a side note, I happen to see we had a book authored by ‘God’. Lets investigate.

booksread %>%
  filter(Author=="God") %>%
  select(Title, Author) %>%
  kable()
Title Author
The Lord’s prayer God

Okay then.

Publication Date

When were our books published?

ggplot(data = booksread, aes(YearPublished)) +
  geom_bar() +
  labs(title = "Books By Year", y = "Total Books", x = "Year Published")

It appears we’ve read some classics… as well as some of the more ‘pop culture’ children’s books.

Length of Books

I can hear you now…“So you’ve read over 1000 books this year. Its nice that they are all picture books…”"

ggplot(data = booksread, aes(NumberOfPages)) +
  geom_histogram() +
  ggtitle("Books By Number of Pages") + ylab("Total Books")

What?! We’ve read books over 100 pages to our kids…

booksread %>%
  filter(NumberOfPages>100) %>%
  arrange(desc(NumberOfPages)) %>%
  select(Title, Author, NumberOfPages) %>%
  kable()
Title Author NumberOfPages
Disney’s Once Upon a Princess Various, 592
Brick City Warren Elsmore 256
The Curious Kid’s Science Book Asia Citro 224
Good Night Stories for Rebel Girls Elena Favilli, Francesca Cavallo 212
LEGO Adventure Book, Vol. 1 Megan H. Rothrock 200
The LEGO Ideas Book Daniel Lipkowitz 198
Lots of Look & Finds Disney Princess Author 193
Little Critter’s Read-it-yourself Storybook Mercer Mayer 192
Pinkalicious: The Pinkamazing Storybook Collection Victoria Kann 192
Space Encyclopedia Christine Pulliam, Patricia Daniels 191
My Friend the Moon André Dahan 160
Big Book of the Berenstain Bears Berenstein 160
Yoga games for children Danielle Bersma 146
Yoga Games for Children Danielle Bersma, Marjoke Visscher 146
The LEGO Build-It Book, Vol. 1 Nathanael Kuipers, Mattia Zamboni 132

Book Genres

What types of books do we read our kids?

It appears a pretty interesting array of genres. And some uniquely named ones if you search around.

booksread %>%
  filter(!is.na(Genre)) %>%
  group_by(Genre) %>%
  summarise(n = n()) %>%
  arrange(desc(n)) %>%
  datatable()

Text Analysis

So, that was some interesting descriptive analysis. Lets look at the undertones our children are absorbing from what we are reading.

I’d really like to know what type of words / sentiment we are reading our children – but until I can figure out a way to get the words from all the books, I’ll have to rely on the Summary provided by BookBuddy.

Here’s an example of what we’ll be looking at:

booksread %>%
  filter(Title=="Mama Cat Has Three Kittens") %>%
  select(Summary)
## # A tibble: 1 x 1
##   Summary                                                                                                                                             
##   <chr>                                                                                                                                               
## 1 Some kittens march to the beat of a different drummer.Mama Cat has three kittens, Fluffy, Skinny, and Boris. Where Mama Cat leads, Fluffy and Skinn~

Preparation

So, lets prepare the data. I’d like to bring along some information while I tokenize the summaries (Titles, Authors, and Genres).

text_df = data_frame(line = 1:length(booksread$Summary), text = as.character(booksread$Summary), book = booksread$Title, Genre = booksread$Genre, Author = booksread$Author)

cleanedbooks = text_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

cleanedbooks %>% datatable()

In this process, I’d like to note several things:

  1. I removed “stop words” – words such as ‘a’, ‘an’, ‘the’ and many other words that do not contain sentiment.

  2. In order to classify the text, I’ll rely on the NRC data base. Using this database, I’ll assign emotion to many words to determine the sentiment of what we’ve been reading our children.

  3. I’ll also like to translate NRCs emotions into Positive, Negative, and Neutral classifications.

  4. I also recognize that some of the words are double counted using this technique. For example the word “joyful” is classified 3 times (as joy, positive, and trust). I will make the assumption that words like this are stronger and therefore these multi-counts will account for the added emphasis.

nrcbooks = cleanedbooks %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  mutate(sent = ifelse(sentiment=="joy"|sentiment=="positive"|sentiment=="suprise"|sentiment=="trust","Positive Words",
                       ifelse(sentiment=="disgust"|sentiment=="fear"|sentiment=="negative"|sentiment=="sadness","Negative Words","Neutral Words"))) %>%
  mutate(sent = as.factor(sent))

In the table below, you can see an example of the words in each book and the assigned sentiment to each word.

nrcbooks %>% 
  head(15) %>%
  kable()
line book Genre Author word sentiment sent
1 And Then Comes Summer Juvenile Fiction Tom Brenner joyful joy Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner joyful positive Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner joyful trust Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner sun anticipation Neutral Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner sun joy Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner sun positive Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner sun surprise Neutral Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner sun trust Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner tribute positive Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner anticipation anticipation Neutral Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner yawn negative Negative Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner cheerful joy Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner cheerful positive Positive Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner cheerful surprise Neutral Words
1 And Then Comes Summer Juvenile Fiction Tom Brenner jump joy Positive Words

Total Sentiment

Lets look at the sentiment my children are receiving in their books (as told by the summaries)

ggplot(data = nrcbooks, aes(sent, color = sentiment, fill = sentiment)) +
  geom_bar() +
  ggtitle("Type of Words in my Children's Books") + xlab("Sentiment") + ylab("Total Count")

It looks like, as far as the summaries indicate, that our kids are hearing positive words.

Sentiment By Authors

Lets see which authors provide more sentiment.

nrcbooks %>%
  mutate(sent = case_when(sent == "Positive Words" ~ 1,
                          sent == "Neutral Words" ~ 0,
                          sent == "Negative Words" ~ -1)) %>%
  group_by(Author) %>%
  summarise(NetSentiment = sum(sent)/sum(abs(sent)), TotalSent = n()) %>%
  datatable()

Lets take a look visually.

nrcbooks %>%
  group_by(Author) %>%
  add_tally() %>%
  filter(n>160) %>%
ggplot(aes(sent, color = sentiment, fill = sentiment)) +
  geom_bar() +
  facet_wrap("Author", scales = "free_y") +
  ggtitle("Type of Words my Children Listen to in Books") + xlab(" ") + ylab("Total Count")  +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Sentiment by Genre

Lets see if there’s a genre that has any sentiment sway. Of note, I find it funny that there is a ‘Bears’ genre.

nrcbooks %>%
  group_by(Genre) %>%
  add_tally() %>%
  filter(!is.na(Genre)) %>%
  filter(n>100) %>%
ggplot(aes(sent, fill = sent)) +
  geom_bar() +
  facet_wrap("Genre", scales = "free_y") +
  ggtitle("Sentiment by Genre") + xlab("Sentiment")

Happiest and Saddest Book

My wife remembers being in tears in a couple of books she read our kids. I’ll filter for ‘sad’ words to make this determination.

nrcbooks %>%
  filter(sentiment == "sadness") %>%
  group_by(book, Author) %>% 
  summarise(n = n()) %>%
  filter(n>5) %>%
  arrange(desc(n)) %>%
  kable()
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
book Author n
Snap! Hazel Hutchins 8
Have You Seen My New Blue Socks? Eve Bunting 7
The Day the Crayons Came Home Drew Daywalt 7
The Legend of the Poinsettia Tomie De Paola 7
The Night Pirates Peter Harris 7
A Fine Dessert Emily Jenkins 6
Black Cat & White Cat Claire Garralon 6
Boo Hoo Bird Jeremy Tankard 6
Butterfly Tree Sandra Markle 6
Hedgehog, Pig, and the Sweet Little Friend Lena Anderson 6
Snowflakes Fall Patricia MacLachlan 6

Thought it doesn’t show up here, “Big Cat, Little Cat” is a very sad book (according to my wife).

Lets see if we can find it.

nrcbooks %>%
  filter(grepl("Cat",book)==TRUE) %>%
  filter(sentiment == "sadness") %>%
  group_by(book, Author) %>% 
  summarise(n = n()) %>%
  filter(n>1) %>%
  arrange(desc(n)) %>%
  kable()
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
book Author n
Black Cat & White Cat Claire Garralon 6
Hero Cat Eileen Spinelli 5
Drat that Fat Cat! Pat Thomson 4
KokoCat, Inside and Out Lynda Graham-Barber 4
Pete the Cat and His Magic Sunglasses James Dean, Kimberly Dean 4
Pete the Cat and the Bad Banana James Dean 4
Big Cat, Little Cat Elisha Cooper 2
Fat Cat Margaret Read MacDonald 2
Gracie the Lighthouse Cat Ruth Brown 2
King of the Cats Paul Galdone 2
Millions of Cats Wanda Gág 2
Pete the Cat Saves Christmas Eric Litwin 2
Scaredy-Cat, Splat! Rob Scotton 2
The Cat Who Walked Across France Kate Banks 2
The Pig Is in the Pantry, the Cat Is on the Shelf Shirley Mozelle 2

There it is, 7th on the list. Thought it doesn’t register as being all that sad, it reduced my wife to tears… and again as she told me the story.

Spoiler alert, the Black Cat dies in the end.

Here’s the summary, maybe you’ll cry too.

booksread %>%
filter(Title=="Big Cat, Little Cat") %>%
  select(Summary)
## # A tibble: 1 x 1
##   Summary                                                                                                                                             
##   <chr>                                                                                                                                               
## 1 There was a cat who lived alone.Until the daya new cat came . . . And so a story of friendship begins, following two cats through their days, month~

And if you are a glutton for punishment, you can listen to it read here:

Video

How about happy books? I’ll define this as positive words.

nrcbooks %>%
  mutate(sent = case_when(sent == "Positive Words" ~ 1,
                          sent == "Neutral Words" ~ 0,
                          sent == "Negative Words" ~ -1)) %>%
  group_by(book, Author) %>% 
  summarise(n = n()) %>%
  filter(n>80) %>%
  arrange(desc(n)) %>%
  kable()
## `summarise()` has grouped output by 'book'. You can override using the `.groups` argument.
book Author n
The Day the Crayons Came Home Drew Daywalt 135
Templeton Gets His Wish Greg Pizzoli 117
Fraidyzoo Thyra Heder 98
KokoCat, Inside and Out Lynda Graham-Barber 95
Bloom Doreen Cronin 92
Fireboat Maira Kalman 89
Being Frank Donna W. Earnhardt 87
The Legend of the Poinsettia Tomie De Paola 87
Boo Hoo Bird Jeremy Tankard 83
A Fine Dessert Emily Jenkins 82
The Pajama Zoo Parade Agnes Green 81

In the chart below, you can see that there appears to be a relationship between longer books and more positive sentiment.

nrcbooks %>%
  mutate(sent = case_when(sent == "Positive Words" ~ 1,
                          sent == "Neutral Words" ~ 0,
                          sent == "Negative Words" ~ -1)) %>%
  group_by(book) %>%
  summarise(NetSentiment = sum(sent)/sum(abs(sent)), TotalSent = n()) %>%
  left_join(booksread, by = c("book"="Title")) %>%
  ggplot(aes(x = NumberOfPages, y = NetSentiment)) +
  geom_point() +
  labs(title = "Sentiment By Page Length", x = "Number of Pages", y = "Net Sentiment")
## Warning: Removed 26 rows containing missing values (geom_point).