3 Subset and Reorganize Columns
3.1 Introduction
A subset is a part of a larger dataset, i.e. selection of rows and columns of a data table.
Functions from the Tidyverse packages (dplyr and tidyr) will be used in this tutorial. As example data a lipidomics dataset is being used here.
3.2 Select columns
Subset by selecting specific columns using dplyr::select(). Use ticks ` ` when column names contain white spaces. To select a consecutive range of columns use the operator :, and to refer to the last column use last_col().
3.3 Remove columns
To remove specific columns, add the ! operator in front of the column name(s)
3.4 Select columns with specific text in their names
To select columns with a specific string (text) in column names use the dplyr functions contains(), starts_with(), and ends_with():
mysubset <- mydata |> select(DataFileName, contains("ISTD"))
mysubset <- mydata |> select(DataFileName, starts_with("Cer"))
mysubset <- mydata |> select(DataFileName, ends_with(":4"))
mysubset <- mydata |> select(DataFileName, contains("ISTD") & starts_with("PC"))
# Using regular expressions
mysubset <- mydata |> select(DataFileName, matches("ISTD|^PC")) 3.5 Select columns using text patterns (regular expressions)
To selecting column contain specific text patterns use matches(). This function uses regular expressions to match column names. A few key elements in regular expression strings:
- To search for different strings:
|(OR operator) - Starts with:
^Ends with:$ - Escape character:
\\. Symbols as :|\/[] etc have a function in regular expression, if you want search for them you need add\\in front - Selections
[ ]are used to indicate any of the characters is allowed. Use-to indicate ranges, and^to exclude any of the characters. Examples:[abcd],[a-d],[1234],[1-4],[^SI] - Dot
.indicates any character,\\dany digit,\\Dany non-digit,\\sany whitespace,\\Sany non-whitespace - Plus
+means any, and{n}a defined number n of the preceding characters or symbols.
Examples:[1-9]+,[1-9]{6},\\d{6},\\s+
You can also use select with other functions returning TRUE/FALSE by using where():
mysubset <- mydata |> select(DataFileName, where(is.numeric)) 3.6 Reorder columns
Use relocate() with the same helper functions as used with select(), e.g., contains():
Here an example to sort column names alphabetically
You can also reorder when using select() function. Use everything() to add remaining columns
mydata2 <- mydata |> select(SPLType, DataFileName, contains("ISTD"), everything())