
Understanding Data Frames

Understanding Data-frames Further

Subsets, slicing and accessing data-frames

Subsetting is a useful process to extract specific information from a larger dataset.

One Dimensional Object:

To subset a one-dimensional object (like a vector or factor), write the object's name followed by square brackets containing the indices you want.

Example: I am going to create a character vector (RB) that holds a set of colour names, then subset some of its values based on their indices.

RB <- c("Red","Orange","Yellow","Green","Blue","Purple")
RB[c(1,3)]

  • RB[c(1,3)] returns the first and third elements of the vector. To access elements in a data structure, write the name of the data structure followed by a pair of square brackets []. The numerical indices of the elements to extract go inside the square brackets; here c(1,3) combines the indices 1 and 3 into a single vector.
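Square brackets accept more than a list of positions. The following is a minimal sketch of a few other common index forms, reusing the RB vector from above; the variable names on the left are only illustrative.

```r
RB <- c("Red", "Orange", "Yellow", "Green", "Blue", "Purple")

first_and_third <- RB[c(1, 3)]   # positions 1 and 3
two_to_four     <- RB[2:4]       # a run of consecutive positions
without_first   <- RB[-1]        # a negative index drops that position
only_blue       <- RB[RB == "Blue"]  # a logical test keeps matching elements
```

Note that R counts positions from 1, not 0.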

Two Dimensional Object:

The syntax of subsetting a two dimensional object (like matrices or data frames):

Example: Let’s use the data frame we explored in a previous section, birthwt.

bw_sum$smoke
bw_sum$birthweight[1:3]

  • bw_sum$smoke: the $ is used to extract the smoke column from our bw_sum data frame. The contents of the column are printed as a vector.
  • bw_sum$birthweight[1:3]: again, a $ is used to extract the birthweight column of the data frame. Because the column is a vector, adding the index [1:3] returns only its first three values, that is, the birthweight for the first three rows.
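Data frames can also be subset with square brackets, using one index for rows and one for columns, separated by a comma. The sketch below uses bw_demo, a hypothetical stand-in for bw_sum with invented values, so that it runs on its own.

```r
# Hypothetical stand-in for bw_sum; the values are invented for illustration
bw_demo <- data.frame(
  smoke = c("yes", "no", "no", "yes"),
  birthweight = c(2500, 3200, 3100, 2800)
)

first_row   <- bw_demo[1, ]                   # row 1, all columns
weights     <- bw_demo[, "birthweight"]       # all rows, one column (a vector)
first_three <- bw_demo[1:3, "birthweight"]    # same values as bw_demo$birthweight[1:3]
smokers     <- bw_demo[bw_demo$smoke == "yes", ]  # rows where smoke is "yes"
```

Leaving the row or column position empty means "all rows" or "all columns".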

Missing values

Missing values, or NAs (“not available”), can make comparisons quite tricky. NA represents an unknown value. Missing values are “contagious”: the result of almost any operation involving an unknown value is also unknown. For example, comparing NA with a number, or even with another NA, returns NA rather than TRUE or FALSE.

Almost any calculation that involves an NA will also return NA.
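A few short expressions make this behaviour concrete. This is a minimal sketch using only base R; the vector x and the name is_missing are illustrative.

```r
NA > 5      # NA: we cannot know whether an unknown value exceeds 5
10 == NA    # NA
NA + 10     # NA
NA == NA    # NA: two unknown values may or may not be equal

# To test whether a value is missing, use is.na() instead of ==
x <- c(1, NA, 3)
is_missing <- is.na(x)   # FALSE TRUE FALSE
```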

Luckily, there is a solution! You can remove all NA values from your data.
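As one possible illustration in base R, missing values can be dropped from a vector with is.na(), and rows containing them can be dropped from a data frame with na.omit(). The objects x and demo below are invented for this sketch.

```r
x <- c(1, NA, 3)
x_complete <- x[!is.na(x)]      # keep only the non-missing values: 1 3

# Invented example data frame with one missing value in column x
demo <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
demo_complete <- na.omit(demo)  # drop every row that contains an NA
nrow(demo_complete)             # 2
```

Removing rows discards information, so it is worth checking how many rows are lost before continuing.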

The filter() function only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

Example: we are creating a tibble (a simplified version of a data frame) containing the values 1, NA, and 3. Using the filter function, let’s find all values that are greater than 1.

library(dplyr)  # provides tibble() and filter()
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)


The result is a single row, with the value 3. As the function doesn’t know whether NA is greater than 1 or not, it excludes that row for good measure!
If you want to keep missing values along with the rows where the condition is TRUE, you can add a second condition, is.na(x), joined with the “or” operator |:

filter(df, is.na(x) | x>1)

Removing NAs explicitly: You may see the na.rm argument used in some functions, such as mean() and sum(). Setting na.rm = TRUE tells the function to ignore any NA values when computing its result; the data itself is not changed.
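A minimal sketch of na.rm in action, using base R functions and an invented vector x:

```r
x <- c(1, NA, 3)

mean_with_na    <- mean(x)                 # NA, because one value is unknown
mean_without_na <- mean(x, na.rm = TRUE)   # 2, the mean of 1 and 3
total           <- sum(x, na.rm = TRUE)    # 4

length(x)   # still 3: na.rm did not remove anything from x itself
```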