Reading and Writing From Files
In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse. Often, the data you are using will come in a file format such as CSV. You will need to read that file into a data frame before you can continue your analysis. The examples below should help you do so!
Most of readr’s functions are concerned with turning flat files into data frames:
- read_csv() reads comma delimited files.
- read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place).
- read_tsv() reads tab delimited files.
- read_delim() reads in files with any delimiter.
- read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().
- read_table() reads a common variation of fixed width files where columns are separated by white space.
- read_log() reads Apache style log files.
These functions all have similar syntax: once you’ve mastered one, you can use the others with ease.
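As a minimal sketch of that shared syntax, the snippet below reads the same small table three ways. The inline data and the variable names are made up for illustration, and the strings are treated as literal data because they contain newlines.

```r
library(readr)

# The same made-up table, written with three different delimiters.
csv_text  <- "x,y\n1,a\n2,b\n"
tsv_text  <- "x\ty\n1\ta\n2\tb\n"
pipe_text <- "x|y\n1|a\n2|b\n"

from_csv   <- read_csv(csv_text)
from_tsv   <- read_tsv(tsv_text)
from_delim <- read_delim(pipe_text, delim = "|")

# Each call returns a data frame with the same column names and values.
same_names <- identical(names(from_csv), names(from_tsv)) &&
  identical(names(from_csv), names(from_delim))
```

Only the function name (or the delim argument) changes between the calls; the rest of the call looks the same.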
Base R’s ‘read.table’ follows a similar pattern. Its syntax is as follows:
mydata <- read.table(file, header=TRUE, sep="\t")
Breaking down this line of code we have:
- read.table: the function that reads the file into a data frame.
- file: specifies the file that you’d like to read. If the file is in a different directory than your current working directory, specify the path to the file.
- header=TRUE: if our data has a header row, we set header to TRUE so that the first row is used as the column names rather than as data.
- sep="\t": tells the function how our data is separated. In this case, \t tells the function that values are separated by a tab.
- mydata <-: the result of running the read.table function is stored in the variable mydata.
Various other arguments can be specified within the () as well, e.g.:
- dec = ".": tells the function that decimals in the file are written with the `.` character.
- fill = TRUE: if rows have unequal length, the missing fields are filled in as blank fields instead of causing an error.
- row.names = 1: the first column of the file contains the row names.
Taking a look at your data and the arguments available, you can make your input more or less specific.
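To see these arguments working together, here is a small sketch that uses only base R. The data is made up, and the text argument is used so the example runs without a file on disk; with a real file you would pass its path as the first argument instead.

```r
# Made-up, tab-separated data: a header row, row labels in the first column,
# and a last row that is one field short.
raw_text <- "id\theight\tweight\na\t1.5\t10\nb\t2.25\t20\nc\t3.0"

mydata <- read.table(
  text = raw_text,  # inline data instead of a file path
  header = TRUE,    # first row holds the column names
  sep = "\t",       # values are separated by tabs
  dec = ".",        # decimals are written with a period
  fill = TRUE,      # pad the short last row with a blank field
  row.names = 1     # first column holds the row names
)

mydata
```

Because the blank field added by fill = TRUE lands in a numeric column, it is read as a missing value.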
Learning Through Example: Reading a CSV
The data used for this example will be retrieved from a publicly shared Google Drive. A package called googledrive allows us to download files that are available on the drive directly. As a disclaimer, in most situations you would first save the file you are reading in a folder. When reading the file, you would then specify the path to that folder or set your working directory to be that folder.
Reading a File From the File System:
Example: We will be reading a CSV file that contains information about births in the US.
For this example, you can directly download the file using this line of code:
library(googledrive)
drive_deauth()
downloaded_file <- drive_download("https://docs.google.com/spreadsheets/d/1uBDFnfJR1WPw608ARNdEpJb1aJpmC5vFjxqqzzt_BaM/edit#gid=438060498", type = "csv")
For the rest of this chapter we’ll focus on read_csv(). Not only are csv files one of the most common forms of data storage, but once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr. The first example uses base R’s read.csv(), which takes the same header and sep arguments as read.table:
births <- read.csv('US_births_2000-2014_SSA.csv', header=TRUE, sep=",")
births
Comparing this to the example above, we know that:
- the first argument is the file name
- the second argument specifies that the file contains a header
- the last argument specifies that values within the file are separated by commas
As an added reference, here is a way to read this file from a folder in your file system:
births <- "~/Downloads/US_births_2000-2014_SSA.csv"  # path to the file
mydata <- read.csv(births, header=TRUE, sep=",")
mydata
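Reading by path can be tried without downloading anything. The sketch below writes a tiny table to a temporary file and reads it back with read.csv. The column names and values are placeholders invented for illustration, not real birth counts.

```r
# A tiny made-up table; the values are placeholders.
toy <- data.frame(year = c(2000L, 2001L), births = c(10L, 12L))

# Write it to a temporary CSV file, then read it back by path.
toy_path <- tempfile(fileext = ".csv")
write.csv(toy, toy_path, row.names = FALSE)

toy_back <- read.csv(toy_path, header = TRUE, sep = ",")

# file.exists() is a quick first check when a path-based read fails.
path_ok <- file.exists(toy_path)
```

If read.csv cannot find a file, checking the path with file.exists() or the current folder with getwd() usually shows whether the path or the working directory is the problem.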
When you read a file with readr’s read_csv() instead, it prints out a column specification that gives the name and type of each column.
Reading an Inline CSV
Example: We will be reading an inline csv and understanding ways of modifying the read function
This is useful for experimenting with readr and for creating reproducible examples to share with others:
read_csv("a,b,c
1,2,3
4,5,6")
In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behavior:
Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.
read_csv("The first line of metadata
x,y,z
1,2,3", skip = 1)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
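The two options can also be combined. The following sketch assumes a made-up input with one metadata line followed by one comment line; the text of both lines is invented for illustration.

```r
library(readr)

# Made-up input: one metadata line, one comment line, then the real table.
raw <- "Exported by a hypothetical tool\n# units are arbitrary\nx,y,z\n1,2,3\n4,5,6"

# skip = 1 drops the metadata line; comment = "#" drops the comment line.
cleaned <- read_csv(raw, skip = 1, comment = "#")

names(cleaned)
```

Using comment is convenient when the number of leading comment lines varies from file to file, since skip needs an exact count.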
Defining Column Names:
The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:
read_csv("1,2,3\n4,5,6", col_names = FALSE)
("\n" is a convenient shortcut for adding a new line.)
Alternatively you can pass col_names a character vector which will be used as the column names:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
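A related case is a file that has a header row you would rather replace. One possible approach, sketched here with made-up data and names, is to skip the existing header and supply your own names in the same call.

```r
library(readr)

# Made-up data whose header row uses unhelpful names.
raw <- "V1,V2,V3\n1,2,3\n4,5,6"

# skip = 1 discards the original header; col_names supplies the replacements.
renamed <- read_csv(raw, skip = 1, col_names = c("x", "y", "z"))

names(renamed)
```

Without skip = 1, the original header row would be read as the first row of data, which would also turn every column into text.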
Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:
read_csv("a,b,c\n1,2,.", na = ".")
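Since na accepts more than one value, a file that marks missing data in several ways can be handled in one call. In this sketch the data and the two markers ("." and "-999") are assumptions chosen for illustration.

```r
library(readr)

# Made-up data that marks missing values in two different ways.
raw <- "a,b,c\n1,.,3\n4,5,-999"

# Passing a character vector to na turns both markers into NA.
with_na <- read_csv(raw, na = c(".", "-999"))

with_na
```

Converting the markers to NA while reading lets the affected columns be parsed as numbers rather than text.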
This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv() and fixed width files with read_fwf(). To read in more challenging files, you’ll need to learn more about how readr parses each column, turning them into R vectors.