2 Import data into R

2.1 Introduction

Importing data is often the first step in a data analysis project. Quality and integrity of imported data is key for the subsequent analysis - Quality In Quality Out. A good overview of data import in R helps to ensure integrity, completeness and efficient import of data .

2.2 Libraries

library(here)
library(tidyverse)
library(readxl)

2.3 Read CSV files

# Read a CSV file
mydata <- read_csv(file = here("data/Testdata_wide.csv"), 
                   col_names = TRUE, 
                   trim_ws = TRUE,
                   na = c("", "NA", "ND"))

The console output summarizes the data types assigned to the different columns

#> Rows: 215 Columns: 430
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr    (3): DataFileName, SPLType, VialPosition
#> dbl  (426): CE 14:0, CE 15:0, CE 16:0, CE 16:1, CE 16:2, CE 17:0, CE 17:1, C...
#> dttm   (1): AcqTimeStamp
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Output

A data frame (tibble)

Parameters

col_names = TRUE means the first row defines column names (default is TRUE)
trim_ws = TRUE removes white-spaces before and after text (default is TRUE)
na defines which text values are to be treated as missing values (NA). See also Section 2.5.2

2.4 Read EXCEL sheets

# Read a sheet from an EXCEL xlsx file
mydata <- read_xlsx(path = here::here("data/Testdata_wide.xlsx"),
                            sheet = "Sheet1",
                            trim_ws = TRUE, 
                            na = c("", "NA", "ND"), )

Output

A data frame (tibble)

Parameters

sheet defines the Excel sheet to be imported
For other parameters, see Section 2.3

2.5 Remarks

Verify the integrity of imported data - do not just assume all is fine

2.5.1 Date File Formats

Use structured and consistent data files as input data
First row should contain column names.
Avoid rows before and after the table (e.g. descriptions or notes)
Rows should contain observations (e.g. samples) and columns the variables (e.g. compounds)

2.5.2 Missing Values

read_csv() or read_xlsx() assign empty/missing values as NA (Not Available).
Data files may contain specific text values for missing values (e.g.,“ND”,”LOD”, “hemolytic”). By default, read_csv() and read_xlsx() assign columns with such values as text (character) instead as numeric - even if there is just one such text value. You can use the na parameter to set text values to be interpreted as NA (see example)
Column that contain only missing values will be assigned the data type logical

2.5.3 Data Types

Note: read_csv() or read_xlsx() will by default guess the data type of each column.
Typical data types are numbers (dbl), text (chr), logical (lgl) and factor (fct)
If a column contains only one cell with text, the full column will be defined as text.
Columns that contains only missing values will be assigned the data type logical
You can define the column types already at import (see ?read_csv).

2.6 More Readings

(“11 Data Import | r for Data Science,” n.d.)