2  Import data into R

2.1 Introduction

Importing data is often the first step in a data analysis project. Quality and integrity of imported data is key for the subsequent analysis - Quality In Quality Out. A good overview of data import in R helps to ensure integrity, completeness and efficient import of data .

2.2 Libraries

2.3 Read CSV files

# Read a CSV file
mydata <- read_csv(file = here("data/Testdata_wide.csv"), 
                   col_names = TRUE, 
                   trim_ws = TRUE,
                   na = c("", "NA", "ND"))

The console output summarizes the data types assigned to the different columns

#> Rows: 215 Columns: 430
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr    (3): DataFileName, SPLType, VialPosition
#> dbl  (426): CE 14:0, CE 15:0, CE 16:0, CE 16:1, CE 16:2, CE 17:0, CE 17:1, C...
#> dttm   (1): AcqTimeStamp
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Output

A data frame (tibble)

Parameters

  • col_names = TRUE means the first row defines column names (default is TRUE)
  • trim_ws = TRUE removes white-spaces before and after text (default is TRUE)
  • na defines which text values are to be treated as missing values (NA). See also Section 2.5.2

2.4 Read EXCEL sheets

# Read a sheet from an EXCEL xlsx file
mydata <- read_xlsx(path = here::here("data/Testdata_wide.xlsx"),
                            sheet = "Sheet1",
                            trim_ws = TRUE, 
                            na = c("", "NA", "ND"), )

Output

A data frame (tibble)

Parameters

  • sheet defines the Excel sheet to be imported
  • For other parameters, see Section 2.3

2.5 Remarks

Verify the integrity of imported data - do not just assume all is fine

2.5.1 Date File Formats

  • Use structured and consistent data files as input data
  • First row should contain column names.
  • Avoid rows before and after the table (e.g. descriptions or notes)
  • Rows should contain observations (e.g. samples) and columns the variables (e.g. compounds)

2.5.2 Missing Values

  • read_csv() or read_xlsx() assign empty/missing values as NA (Not Available).
  • Data files may contain specific text values for missing values (e.g.,“ND”,”LOD”, “hemolytic”). By default, read_csv() and read_xlsx() assign columns with such values as text (character) instead as numeric - even if there is just one such text value. You can use the na parameter to set text values to be interpreted as NA (see example)
  • Column that contain only missing values will be assigned the data type logical

2.5.3 Data Types

  • Note: read_csv() or read_xlsx() will by default guess the data type of each column.
  • Typical data types are numbers (dbl), text (chr), logical (lgl) and factor (fct)
  • If a column contains only one cell with text, the full column will be defined as text.
  • Columns that contains only missing values will be assigned the data type logical
  • You can define the column types already at import (see ?read_csv).

2.6 More Readings

  1. (“11 Data Import | r for Data Science,” n.d.)