3 Subset and Reorganize Columns
3.1 Introduction
A subset is a part of a larger dataset, i.e. selection of rows and columns of a data table.
Functions from the Tidyverse packages (dplyr
and tidyr
) will be used in this tutorial. As example data a lipidomics dataset is being used here.
3.2 Select columns
Subset by selecting specific columns using dplyr::select()
. Use ticks ` `
when column names contain white spaces. To select a consecutive range of columns use the operator :
, and to refer to the last column use last_col()
.
3.3 Remove columns
To remove specific columns, add the !
operator in front of the column name(s)
3.4 Select columns with specific text in their names
To select columns with a specific string (text) in column names use the dplyr
functions contains()
, starts_with()
, and ends_with()
:
mysubset <- mydata |> select(DataFileName, contains("ISTD"))
mysubset <- mydata |> select(DataFileName, starts_with("Cer"))
mysubset <- mydata |> select(DataFileName, ends_with(":4"))
mysubset <- mydata |> select(DataFileName, contains("ISTD") & starts_with("PC"))
# Using regular expressions
mysubset <- mydata |> select(DataFileName, matches("ISTD|^PC"))
3.5 Select columns using text patterns (regular expressions)
To selecting column contain specific text patterns use matches()
. This function uses regular expressions to match column names. A few key elements in regular expression strings:
- To search for different strings:
|
(OR operator) - Starts with:
^
Ends with:$
- Escape character:
\\
. Symbols as :|\/[] etc have a function in regular expression, if you want search for them you need add\\
in front - Selections
[ ]
are used to indicate any of the characters is allowed. Use-
to indicate ranges, and^
to exclude any of the characters. Examples:[abcd]
,[a-d]
,[1234]
,[1-4]
,[^SI]
- Dot
.
indicates any character,\\d
any digit,\\D
any non-digit,\\s
any whitespace,\\S
any non-whitespace - Plus
+
means any, and{
n}
a defined number n of the preceding characters or symbols.
Examples:[1-9]+
,[1-9]{6}
,\\d{6}
,\\s+
You can also use select with other functions returning TRUE/FALSE by using where()
:
mysubset <- mydata |> select(DataFileName, where(is.numeric))
3.6 Reorder columns
Use relocate()
with the same helper functions as used with select()
, e.g., contains():
Here an example to sort column names alphabetically
You can also reorder when using select()
function. Use everything()
to add remaining columns
mydata2 <- mydata |> select(SPLType, DataFileName, contains("ISTD"), everything())