Reading and Writing From Files

In this chapter, you’ll learn how to load flat files in R with the readr package, which is part of the core tidyverse. Often, the data you need is stored in a file format such as CSV. Your task is to read that file into a data frame so you can continue your analysis. The examples below should help you do so!

library(tidyverse)

Readr Syntax

Most of readr’s functions are concerned with turning flat files into data frames:

  • read_csv() reads comma delimited files
  • read_csv2() reads semicolon separated files (common in countries where , is used as the decimal place)
  • read_tsv() reads tab delimited files
  • read_delim() reads in files with any delimiter.
  • read_fwf() reads fixed width files. You can specify fields either by their widths with fwf_widths() or their position with fwf_positions().
  • read_table() reads a common variation of fixed width files where columns are separated by white space.
  • read_log() reads Apache style log files.

These functions all have similar syntax: once you’ve mastered one, you can use the others with ease.
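To illustrate how similar the functions are, here is a minimal sketch that reads the same made-up values once with read_tsv() and once with read_delim(). The inline data and the variable names tsv_data and pipe_data are assumptions chosen for illustration only.

```r
library(readr)

# Tab-separated values supplied inline; a string that contains a
# newline is treated as literal data rather than as a file path
tsv_data <- read_tsv("x\ty\n1\t2\n3\t4")

# The same values separated by "|"; read_delim() takes the delimiter explicitly
pipe_data <- read_delim("x|y\n1|2\n3|4", delim = "|")

# Both calls produce a table with the same shape
identical(dim(tsv_data), dim(pipe_data))
```

Only the function name and the delim argument change between the two calls; the rest of the call looks the same.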

Base R offers a comparable function, read.table. Its syntax is as follows:
mydata <- read.table(file, header=TRUE, sep="\t")

Breaking down this line of code we have:

  • read.table: the function that reads the data into a table (a data frame)
  • file: specifies the file that you’d like to read. If the file is in a different directory than your current working directory, specify the path to the file
  • header=TRUE: if the first line of the file contains column names, we set header to TRUE so that this line is used for the column names rather than read as data
  • sep="\t": tells the function how our data is separated. In this case, \t tells the function that values are separated by tabs
  • mydata <-: the result of running read.table is stored in the variable mydata

Various other arguments can be specified within the () as well. E.g.

  • dec = ".": tells R that decimals in the file are written with the `.` character
  • fill = TRUE: if rows have unequal length, blank fields are added to the shorter rows instead of stopping with an error
  • row.names = 1: uses the first column of the file as the row names

Taking a look at your data and the arguments available, you can make your input more or less specific.
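As a rough sketch of how these arguments combine, the example below passes a small, made-up tab-separated string to read.table through its text argument, so no file is needed. The data and the name raw are hypothetical; the last row is deliberately missing its final field.

```r
# Made-up tab-separated data: the first column holds row labels,
# and the last row is missing its final field
raw <- "id\tscore\tweight\na\t1.5\t10\nb\t2.5\t20\nc\t3.5"

mydata <- read.table(
  text = raw,       # read from a string instead of a file
  header = TRUE,    # the first line holds the column names
  sep = "\t",       # values are separated by tabs
  dec = ".",        # decimals are written with "."
  fill = TRUE,      # pad the short last row instead of raising an error
  row.names = 1     # use the first column (id) as the row names
)
mydata
```

Because weight is a numeric column, the field that was padded in for row c shows up as NA.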

Learning Through Example: Reading a CSV

The data used for this example will be retrieved from a publicly shared Google Drive. A library called googledrive allows us to download files from the drive directly. Note that in most situations you would instead save the file in a folder on your computer. When reading it, you would either specify the path to that folder or set your working directory to that folder.
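The two options for locating a saved file can be sketched as follows. To keep the example self-contained it first writes a tiny, made-up CSV into a temporary folder; the file name example.csv and its contents are assumptions for illustration only.

```r
# Write a tiny, made-up CSV into a temporary folder
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example.csv")
writeLines(c("name,count", "a,1", "b,2"), csv_path)

# Option 1: pass the full path to the reading function
from_path <- read.csv(csv_path, header = TRUE, sep = ",")

# Option 2: make the folder the working directory,
# then refer to the file by its bare name
old_wd <- setwd(data_dir)
from_wd <- read.csv("example.csv", header = TRUE, sep = ",")
setwd(old_wd)  # restore the previous working directory

# Both approaches read the same data
identical(from_path, from_wd)
```

Passing the full path is usually the less surprising choice, since it does not change the working directory for the rest of your session.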

Reading a File From the File System:

Example: We will read a file that contains information about births in the US using the read.csv() function.

For this example, you can download the file directly using these lines of code:

library(googledrive)
drive_deauth()
downloaded_file <- drive_download("https://docs.google.com/spreadsheets/d/1uBDFnfJR1WPw608ARNdEpJb1aJpmC5vFjxqqzzt_BaM/edit#gid=438060498", type = "csv")

For the rest of this chapter we’ll focus on read_csv(). Not only are csv files one of the most common forms of data storage, but once you understand read_csv(), you can easily apply your knowledge to all the other functions in readr. The first example below uses base R’s read.csv(), which takes similar arguments:

births <- read.csv('US_births_2000-2014_SSA.csv', header=TRUE, sep=",")
births

Comparing this to the example above, we know that:

  • the first argument is the file name
  • the second argument specifies that the file contains a header
  • the last argument specifies that values within the file are separated by commas

As an added reference, here is a way to read this file from a folder in your file system:

births_path <- "~/Downloads/US_births_2000-2014_SSA.csv"
mydata <- read.csv(births_path, header=TRUE, sep=",")
mydata

When you run readr’s read_csv() instead of read.csv(), it prints out a column specification that gives the name and type of each column.
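A minimal sketch of inspecting that column specification, using made-up inline data rather than the births file (the column names id and label are hypothetical):

```r
library(readr)

# Made-up inline data with one numeric and one text column
df <- read_csv("id,label\n1,a\n2,b")

# spec() returns the column specification readr used for this data frame
spec(df)
```

Here readr guesses a numeric type for id and a character type for label, and spec() lets you look at that guess again after the data has been read.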

Reading an Inline CSV

Example: We will read an inline CSV and look at ways of modifying the read function.

This is useful for experimenting with readr and for creating reproducible examples to share with others:

read_csv("a,b,c
1,2,3
4,5,6")

Skipping Metadata

In both cases read_csv() uses the first line of the data for the column names, which is a very common convention. There are two cases where you might want to tweak this behavior:

Sometimes there are a few lines of metadata at the top of the file. You can use skip = n to skip the first n lines; or use comment = "#" to drop all lines that start with (e.g.) #.

read_csv("The first line of metadata
x,y,z
1,2,3", skip = 1)

 read_csv("# A comment I want to skip 
         x,y,z 
         1,2,3", comment = "#")

Defining Column Names:

The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them sequentially from X1 to Xn:

 read_csv("1,2,3\n4,5,6", col_names = FALSE)


("\n" is a convenient shortcut for adding a new line.)

Alternatively you can pass col_names a character vector which will be used as the column names:

 read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

NA Values

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file:

 read_csv("a,b,c\n1,2,.", na = ".")
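
Because na accepts more than one value, a file that marks missing data in several ways can be handled in a single call. The sketch below assumes made-up inline data in which both "." and the word "missing" stand for missing values.

```r
library(readr)

# Both "." and "missing" are treated as missing values
df <- read_csv("a,b,c\n1,.,3\n4,5,missing", na = c(".", "missing"))

# TRUE marks the cells that were read as NA
is.na(df)
```

Since the placeholder strings are converted to NA before the column types are guessed, columns b and c are still read as numbers.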


This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. You can also easily adapt what you’ve learned to read tab separated files with read_tsv() and fixed width files with read_fwf(). To read in more challenging files, you’ll need to learn more about how readr parses each column, turning them into R vectors.