Text files to R to MySQL

Suppose you have several text files that each contain some data about a common set of people, words, countries, companies, or any other sampling unit. This post covers a two-step process: first merge these files into a single R data frame, then write that data frame to a MySQL database. It would be easy enough to write the resulting data frame to disk as a text file or .RData file, but there are several reasons why one might want to store a dataset in a relational database instead, including security, performance, ease of collaboration, and memory efficiency.
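
For comparison, the disk-based alternative mentioned above would only take a line or two. This is just a sketch; df and the file names stand in for the merged data frame built later in the post.

# Illustrative only: plain-text and .RData copies of the merged data frame
write.csv(df, file = 'word_ratings.csv', row.names = FALSE)
save(df, file = 'word_ratings.RData')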

Read text files into R

Several text files (.csv format) exist in my local directory. Across files there is only one shared column name ('Concept'), which contains a set of nouns. Each file contains a different set of these nouns. Each file also has a unique set of named columns corresponding to unique psycholinguistic variables of interest. I want to create a data frame that consists of only those nouns that are shared across all files, but that contains all variables of interest across all files.

# Get the names of all .csv files in the local directory
f = list.files(pattern='\\.csv$', full.names=TRUE)
if(length(f) < 1) stop('No files to read')

# Read each file into a data frame, and place the data frames in a list
df.list = lapply(f, function(x) {read.csv(file=x, header=TRUE)})
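
As an extra sanity check (not part of the original workflow), you could confirm that every file actually contains the shared 'Concept' column before attempting the merge:

# Optional check: every data frame should contain the shared 'Concept' column
if (!all(sapply(df.list, function(d) 'Concept' %in% names(d)))) {
  stop("At least one file is missing the 'Concept' column")
}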

Merge data frames

Use the reshape package to merge the data frames recursively on the single shared identifying column, 'Concept', which contains the sampling unit. Any row whose noun is missing from one of the files will contain NAs after the merge, so removing incomplete rows afterwards leaves only the nouns common to all files.

library(reshape)
full_df = merge_recurse(df.list)

# There are several scenarios in which you wouldn't want to do this,
#  but here I remove all incomplete rows (rows with missing data)
df = full_df[complete.cases(full_df), ] 

# Get a snapshot of the contents of the new data frame (sanity check) 
head(df)
##      Concept Familiarity Pain Smell Color Taste Sound Grasp Motion Length
## 5     anchor        6.92 4.36  2.46  2.58  1.54  2.91  2.00   2.19      6
## 7  apartment        7.64 2.32  3.52  3.67  2.16  2.95  1.38   1.32      9
## 9      apron        7.33 1.46  1.90  5.00  1.50  1.16  6.52   1.52      5
## 14       axe        6.90 7.39  2.10  2.91  1.55  3.68  6.62   3.63      3
## 15       bag        7.45 2.75  1.86  3.87  1.63  2.48  7.04   2.38      3
## 16   bagpipe        6.39 2.16  1.77  5.04  1.57  6.50  3.48   2.16      7
##    SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 5     7.41    606   744    NLT               13
## 7    83.04    612   687    NLT               17
## 9     2.67    661   700    NLT               13
## 14    4.88    567   668    NLT               14
## 15   94.04    489   667    NLT               12
## 16    0.29    765   792    NLT                8
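
As an aside, a similar result could be reached with a base R inner join, which keeps only the shared nouns directly and skips the complete.cases() step (assuming the individual files have no missing values of their own). This is a sketch of the alternative, not what was run above:

# Base R sketch: an inner join on 'Concept' keeps only the nouns shared by all files
inner_df = Reduce(function(x, y) merge(x, y, by = 'Concept'), df.list)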

Write the new data frame to a MySQL database table

First, I created a free account at Amazon Relational Database Service (RDS).

Then I used Sequel Pro, a free OS X database management program, to access my new RDS instance. Within Sequel Pro I created a new MySQL database that I named 'DB1'. Next I will populate a new table in DB1 with the data from the R data frame created above.

library(DBI)
library(RMySQL)

# Any potentially sensitive information (e.g., passwords, usernames) is
# stored in a local text file so that it doesn't have to be typed into the code.
priv = read.delim('private.txt', header=TRUE, stringsAsFactors = FALSE)
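# For reference (an assumption on my part, since the file isn't shown in the
# post): private.txt is a one-row, tab-delimited file whose header names match
# the fields used below, e.g.
#   dbUser    dbPass         dbHost
#   my_user   my_password    <your-rds-endpoint>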

# Connect to the MySQL database via a DBI (database interface)  
con <- dbConnect(RMySQL::MySQL(),
    dbname = 'DB1',
    username = priv$dbUser,
    password = priv$dbPass,
    host = priv$dbHost
)

# If a table with the same name exists in this db, remove it
if (dbExistsTable(con, 'word_ratings')){
  dbRemoveTable(con, 'word_ratings')
}
## [1] TRUE
# Write the R data frame to a new table called 'word_ratings'
dbWriteTable(con, 'word_ratings', value = df)
## [1] TRUE
# Error checking
if (!dbExistsTable(con, 'word_ratings')){
  stop('dbWriteTable failed to create a table')
}
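
As a side note, depending on your version of RMySQL, the drop-then-write sequence above can be collapsed into a single call, and setting row.names = FALSE skips the extra row_names column that appears in the query output below. A sketch of that alternative (not run here):

# Alternative (not run): overwrite any existing table in one call and
# suppress the automatic row_names column
# dbWriteTable(con, 'word_ratings', value = df, overwrite = TRUE, row.names = FALSE)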

Query the new database

Done. Let's query the new database to double-check its contents. For example, select short words that refer to very familiar objects (the Familiarity measure ranges from 1 to 8).

dbGetQuery(con, 'SELECT * FROM word_ratings WHERE Familiarity > 7.6 AND Length < 5')
##    row_names Concept Familiarity Pain Smell Color Taste Sound Grasp Motion
## 1         17    ball        7.68 4.48  1.96  4.96  1.68  3.50  6.30   6.21
## 2         54    book        7.78 2.67  3.21  3.65  1.64  2.21  7.26   2.50
## 3         60    bowl        7.62 2.58  1.90  3.92  2.38  1.59  7.04   1.38
## 4         61     box        7.66 2.38  1.70  2.60  1.42  1.74  5.36   2.30
## 5         80     bus        7.71 5.35  3.86  5.96  1.79  6.00  1.32   6.64
## 6         99     car        7.64 6.35  3.59  6.14  1.21  6.35  1.44   6.59
## 7        281    lime        7.62 2.88  6.04  6.76  5.80  1.46  7.42   1.65
## 8        295    menu        7.69 2.09  1.96  3.50  1.91  1.58  6.78   1.41
## 9        345     pen        7.68 3.05  1.88  3.90  1.48  2.16  7.62   4.82
## 10       354     pie        7.61 1.82  6.43  4.64  7.50  1.88  6.05   1.63
## 11       398    rock        7.65 6.00  1.70  1.80  1.48  3.17  6.35   1.80
## 12       437    sink        7.70 2.13  2.64  2.91  1.24  3.74  1.75   1.65
##    Length SUBTLWF LDT_RT SD_RT domain Num_Feats_No_Tax
## 1       4  104.96    501   656    NLT                9
## 2       4  176.98    512   621    NLT               16
## 3       4   21.45    511   786    NLT               14
## 4       3   89.75    528   625    NLT               13
## 5       3   74.18    519   611    NLT               16
## 6       3  483.06    472   636    NLT               15
## 7       4    3.29    535   785     LT               12
## 8       4    9.96    551   799    NLT                8
## 9       3   24.73    511   583    NLT               15
## 10      3   28.75    509   667    NLT               10
## 11      4   86.16    530   630    NLT                9
## 12      4   16.92    568   719    NLT               16
# Don't forget to close the connection
dbDisconnect(con)
## [1] TRUE

In an upcoming post I'll read this data back into R to perform some statistical analyses and create some data visualizations.

See these posts for more information about interfacing R and MySQL databases: