Suppose you have several text files that each contain some data about a common set of people, words, countries, companies, or any other sampling unit. This post covers a two-step process: first merge these files into a single R data frame, and then write that data frame to a MySQL database. It would be easy enough to write the resulting data frame to disk as a text file or .RData file, but there are several reasons why one might want to store a dataset in a relational database instead, including security, performance, ease of collaboration, and memory efficiency.
Read text files into R
Several text files (.csv format) exist in my local directory. The only column name shared across all files is 'Concept', which contains a set of nouns; each file covers a different subset of these nouns. Each file also has its own set of named columns corresponding to different psycholinguistic variables of interest. I want to create a single data frame containing only the nouns that appear in every file, together with all of the variables of interest from all of the files.
# Get the names of all .csv files in the local directory
f = list.files(pattern='\\.csv', full.names=T)
if(length(f) < 1) stop('No files to read')

# Read each file into a data frame, and place them into a list
df.list = lapply(f, function(x) {read.csv(file=x, header=T)})
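Before merging, it can be worth checking how much overlap there actually is in the 'Concept' column across files. This is just a hypothetical sanity check on the list of data frames, not part of the workflow itself:

# How many 'Concept' values appear in every file?
shared = Reduce(intersect, lapply(df.list, function(d) d$Concept))
length(shared)

# How many rows does each file contribute?
sapply(df.list, nrow)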
Merge data frames
Use the merge_recurse() function from the reshape package to merge the data frames recursively on a single shared identifying column: the column, present in every data frame, that contains the sampling unit (here, 'Concept').
library(reshape)
full_df = merge_recurse(df.list)

# Obviously there are several scenarios when you wouldn't want to do this,
# but I will remove all non-complete rows (rows with missing data)
df = full_df[complete.cases(full_df), ]

# Get a snapshot of the contents of the new data frame (sanity check)
head(df)
##      Concept Familiarity Pain Smell Color Taste Sound Grasp Motion Length
## 5     anchor        6.92 4.36  2.46  2.58  1.54  2.91  2.00   2.19      6
## 7  apartment        7.64 2.32  3.52  3.67  2.16  2.95  1.38   1.32      9
## 9      apron        7.33 1.46  1.90  5.00  1.50  1.16  6.52   1.52      5
## 14       axe        6.90 7.39  2.10  2.91  1.55  3.68  6.62   3.63      3
## 15       bag        7.45 2.75  1.86  3.87  1.63  2.48  7.04   2.38      3
## 16   bagpipe        6.39 2.16  1.77  5.04  1.57  6.50  3.48   2.16      7
##    SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 5     7.41    606   744    NLT               13
## 7    83.04    612   687    NLT               17
## 9     2.67    661   700    NLT               13
## 14    4.88    567   668    NLT               14
## 15   94.04    489   667    NLT               12
## 16    0.29    765   792    NLT                8
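If you'd rather not depend on the reshape package, a similar recursive merge can be sketched in base R. This assumes, as described above, that 'Concept' is the only column the files share, so each pairwise merge keeps only the rows present in both data frames:

# Base R alternative: fold merge() over the list of data frames,
# joining on the shared 'Concept' column at each step
full_df_base = Reduce(function(x, y) merge(x, y, by = 'Concept'), df.list)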
Write the new data frame to a MySQL database table
First, I created a free account with the Amazon Relational Database Service (RDS).
Then I used the free OS X database management program Sequel Pro to access my new RDS instance. Within Sequel Pro I created a new MySQL database that I named 'DB1'. Next I will populate a new table in DB1 with the data from the R data frame created above.
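For reference, the connection code below reads its credentials from a local file ('private.txt'). Given the read.delim() call, that file would be tab-separated with a header row and a single row of values; the column names match what the code uses (dbUser, dbPass, dbHost), and the values shown here are placeholders:

dbUser       dbPass        dbHost
myusername   mypassword    mydb-instance.xxxxxxxxxx.us-east-1.rds.amazonaws.com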
library(DBI)
library(RMySQL)

# Any potentially sensitive information (e.g., passwords, usernames) is
# stored in a local text file to prevent me having to type them in the code.
priv = read.delim('private.txt', header=TRUE, stringsAsFactors = FALSE)

# Connect to the MySQL database via a DBI (database interface)
con <- dbConnect(RMySQL::MySQL(),
                 dbname = 'DB1',
                 username = priv$dbUser,
                 password = priv$dbPass,
                 host = priv$dbHost)

# If a table with the same name exists in this db, remove it
if (dbExistsTable(con, 'word_ratings')){
  dbRemoveTable(con, 'word_ratings')
}
## [1] TRUE
# Write the R data frame to a new table called 'word_ratings'
dbWriteTable(con, 'word_ratings', value = df)
## [1] TRUE
# Error checking
if (!dbExistsTable(con, 'word_ratings')){
  stop('dbWriteTable failed to create a table')
}
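One thing to note: by default dbWriteTable also stores the data frame's row names, which is where the row_names column in the query output below comes from. If you don't want that column, a variation like this should work (an untested sketch; overwrite = TRUE replaces the table written above):

# Rewrite the table without storing R's row names as a column
dbWriteTable(con, 'word_ratings', value = df, overwrite = TRUE, row.names = FALSE)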
Query the new database
Done. Let's query the new database to double-check the contents. For example, select short words that refer to very familiar objects (the familiarity measure ranges from 1 to 8).
dbGetQuery(con, 'SELECT * FROM word_ratings WHERE Familiarity > 7.6 AND Length < 5')
##    row_names Concept Familiarity Pain Smell Color Taste Sound Grasp Motion
## 1         17    ball        7.68 4.48  1.96  4.96  1.68  3.50  6.30   6.21
## 2         54    book        7.78 2.67  3.21  3.65  1.64  2.21  7.26   2.50
## 3         60    bowl        7.62 2.58  1.90  3.92  2.38  1.59  7.04   1.38
## 4         61     box        7.66 2.38  1.70  2.60  1.42  1.74  5.36   2.30
## 5         80     bus        7.71 5.35  3.86  5.96  1.79  6.00  1.32   6.64
## 6         99     car        7.64 6.35  3.59  6.14  1.21  6.35  1.44   6.59
## 7        281    lime        7.62 2.88  6.04  6.76  5.80  1.46  7.42   1.65
## 8        295    menu        7.69 2.09  1.96  3.50  1.91  1.58  6.78   1.41
## 9        345     pen        7.68 3.05  1.88  3.90  1.48  2.16  7.62   4.82
## 10       354     pie        7.61 1.82  6.43  4.64  7.50  1.88  6.05   1.63
## 11       398    rock        7.65 6.00  1.70  1.80  1.48  3.17  6.35   1.80
## 12       437    sink        7.70 2.13  2.64  2.91  1.24  3.74  1.75   1.65
##    Length SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 1       4  104.96    501   656    NLT                9
## 2       4  176.98    512   621    NLT               16
## 3       4   21.45    511   786    NLT               14
## 4       3   89.75    528   625    NLT               13
## 5       3   74.18    519   611    NLT               16
## 6       3  483.06    472   636    NLT               15
## 7       4    3.29    535   785     LT               12
## 8       4    9.96    551   799    NLT                8
## 9       3   24.73    511   583    NLT               15
## 10      3   28.75    509   667    NLT               10
## 11      4   86.16    530   630    NLT                9
## 12      4   16.92    568   719    NLT               16
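Any other SQL against the new table works the same way through dbGetQuery(). For instance, a quick row count (just a sketch of a further check you might run):

# How many rows made it into the table?
dbGetQuery(con, 'SELECT COUNT(*) FROM word_ratings')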
# Don't forget to close the connection
dbDisconnect(con)
## [1] TRUE
In an upcoming post I'll read this data back into R to perform some statistical analyses and create some data visualizations.
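As a preview, pulling the table back into R is straightforward with DBI; the sketch below assumes the same credentials file and database as above are still available:

# Re-connect and read the table back into an R data frame
con <- dbConnect(RMySQL::MySQL(), dbname = 'DB1',
                 username = priv$dbUser, password = priv$dbPass,
                 host = priv$dbHost)
df_from_db <- dbReadTable(con, 'word_ratings')
dbDisconnect(con)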
See these posts for more information about interfacing R with MySQL databases:
- http://blog.rstudio.org/2015/01/09/rmysql-0-1-0/
- http://www.jason-french.com/blog/2014/07/03/using-r-with-mysql-databases/
- http://moderndata.plot.ly/connecting-r-to-mysql-mariadb/