3  Subset and Reorganize Columns

3.1 Introduction

A subset is a part of a larger dataset, i.e. selection of rows and columns of a data table.

Functions from the Tidyverse packages (dplyr and tidyr) will be used in this tutorial. As example data a lipidomics dataset is being used here.

library(here)
library(tidyverse)

mydata <- read_csv(file = here("data/Testdata_wide.csv"))

3.2 Select columns

Subset by selecting specific columns using dplyr::select(). Use ticks ` ` when column names contain white spaces. To select a consecutive range of columns use the operator :, and to refer to the last column use last_col().

mysubset1 <- mydata |> select(DataFileName, SPLType, `CE 14:0`:`CE 22:6`) 
mysubset2 <- mydata |> select(DataFileName, SPLType, `TG 48:0`:last_col()) 

3.3 Remove columns

To remove specific columns, add the ! operator in front of the column name(s)

mysubset <- mydata |> 
  select(!VialPosition, !(`CE 14:0`:`CE 24:4`), !`TG 48:0`:last_col()) 

3.4 Select columns with specific text in their names

To select columns with a specific string (text) in column names use the dplyr functions contains(), starts_with(), and ends_with():

mysubset <- mydata |> select(DataFileName, contains("ISTD")) 
mysubset <- mydata |> select(DataFileName, starts_with("Cer"))
mysubset <- mydata |> select(DataFileName, ends_with(":4")) 
mysubset <- mydata |> select(DataFileName, contains("ISTD") & starts_with("PC"))

# Using regular expressions
mysubset <- mydata |> select(DataFileName, matches("ISTD|^PC")) 

3.5 Select columns using text patterns (regular expressions)

To selecting column contain specific text patterns use matches(). This function uses regular expressions to match column names. A few key elements in regular expression strings:

  • To search for different strings: | (OR operator)
  • Starts with: ^ Ends with: $
  • Escape character: \\. Symbols as :|\/[] etc have a function in regular expression, if you want search for them you need add \\ in front
  • Selections [ ] are used to indicate any of the characters is allowed. Use - to indicate ranges, and
    ^ to exclude any of the characters. Examples: [abcd], [a-d], [1234], [1-4], [^SI]
  • Dot . indicates any character, \\d any digit, \\D any non-digit, \\s any whitespace, \\S any non-whitespace
  • Plus + means any, and {n} a defined number n of the preceding characters or symbols.
    Examples: [1-9]+, [1-9]{6}, \\d{6}, \\s+

You can also use select with other functions returning TRUE/FALSE by using where():

mysubset <- mydata |> select(DataFileName, where(is.numeric)) 

3.6 Reorder columns

Use relocate() with the same helper functions as used with select(), e.g., contains():

mydata2 <- mydata |> relocate(SPLType, DataFileName) 
mydata2 <- mydata |> relocate(contains("ISTD"), .before = `CE 14:0`) 
mydata2 <- mydata |> relocate(contains("ISTD"), .after = last_col())

Here an example to sort column names alphabetically

mydata2 <- mydata |> relocate(order(colnames(mydata)))

You can also reorder when using select() function. Use everything() to add remaining columns

mydata2 <- mydata |> select(SPLType, DataFileName, contains("ISTD"), everything())