Exploring the data.table package

Using data.table R package

Under the hood, many function of operation included in the rbms package are done using the data.table package. The data.table package offer fast and powerful operation on large data set with a very compact syntax.

In this session, you will:

read csv file with data.table
merge data set with one or multiple key
rename column names
extract basic statistics

You will need to load the following R packages and the data found in the bms_workshop_data folder that can be downloaded here. Once you have downloaded the data, unzip the folder and add the data in your R project directory, your current R working directory, or set your working directly accordingly setwd().

library(data.table)

## data.table 1.14.8 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com

All data are stored in rds format, this format is highly efficient for storing R object, but you could also have them in any other format. To load .rds data, we use the function readRDS()

b_count_sub <- readRDS("bms_workshop_data/work_count.rds")
m_visit_sub <- readRDS("bms_workshop_data/work_visit.rds")
m_site_sub <- readRDS("bms_workshop_data/work_site.rds")

## to read csv into a data.table use fread()

bcz_class <- data.table::fread("bms_workshop_data/GEnS_v3_classification.csv")

b_count_sub

##        bms_id transect_id_serial visit_date year month day       species_name
##     1:  FRBMS            FRBMS.1 2009-05-22 2009     5  22    Maniola jurtina
##     2:  FRBMS            FRBMS.1 2009-08-01 2009     8   1    Maniola jurtina
##     3:  FRBMS            FRBMS.1 2009-08-01 2009     8   1 Polyommatus icarus
##     4:  FRBMS            FRBMS.1 2009-09-19 2009     9  19 Polyommatus icarus
##     5:  FRBMS            FRBMS.1 2010-07-02 2010     7   2    Maniola jurtina
##    ---                                                                       
## 59598:  UKBMS          UKBMS.300 2018-07-03 2018     7   3    Maniola jurtina
## 59599:  UKBMS          UKBMS.300 2018-07-10 2018     7  10    Maniola jurtina
## 59600:  UKBMS          UKBMS.300 2018-07-16 2018     7  16    Maniola jurtina
## 59601:  UKBMS          UKBMS.300 2018-07-16 2018     7  16 Polyommatus icarus
## 59602:  UKBMS          UKBMS.300 2018-07-23 2018     7  23    Maniola jurtina
##        count
##     1:     5
##     2:     5
##     3:     2
##     4:     6
##     5:    20
##    ---      
## 59598:    26
## 59599:    16
## 59600:     5
## 59601:     1
## 59602:     2

When you look at a data.table object, you can will see the first and last five row, with all column names.

You can access specific column the same way as you would do with a data.frame with the $ sign or a vector of names such as c(“col1”, “col2, …,”col7”).
You can subset the data.table with an logical argument of indices in the first position after the square bracket (e.g my_data_table[ col1 <= 453, ])
You can create new column with the “:=” symbol and a function (e.g. my_data_table[col1 <= 453, new_col := mean(col1)])
You can compute operation by group using a “by” argument in third position within the square brackets (e.g. my_data_table[col1 <= 453, new_col := mean(col1), by = col3])

Extract columns

my_new_dt <- b_count_sub[ , .(bms_id, transect_id_serial, year, visit_date)]
my_new_dt

##        bms_id transect_id_serial year visit_date
##     1:  FRBMS            FRBMS.1 2009 2009-05-22
##     2:  FRBMS            FRBMS.1 2009 2009-08-01
##     3:  FRBMS            FRBMS.1 2009 2009-08-01
##     4:  FRBMS            FRBMS.1 2009 2009-09-19
##     5:  FRBMS            FRBMS.1 2010 2010-07-02
##    ---                                          
## 59598:  UKBMS          UKBMS.300 2018 2018-07-03
## 59599:  UKBMS          UKBMS.300 2018 2018-07-10
## 59600:  UKBMS          UKBMS.300 2018 2018-07-16
## 59601:  UKBMS          UKBMS.300 2018 2018-07-16
## 59602:  UKBMS          UKBMS.300 2018 2018-07-23

Extract unique

unique(my_new_dt$bms_id)

## [1] "FRBMS" "NLBMS" "UKBMS"

unique(my_new_dt[, .(bms_id, year)])

##     bms_id year
##  1:  FRBMS 2009
##  2:  FRBMS 2010
##  3:  FRBMS 2011
##  4:  FRBMS 2012
##  5:  FRBMS 2013
##  6:  FRBMS 2014
##  7:  FRBMS 2015
##  8:  FRBMS 2017
##  9:  FRBMS 2018
## 10:  FRBMS 2016
## 11:  FRBMS 2008
## 12:  NLBMS 2008
## 13:  NLBMS 2009
## 14:  NLBMS 2010
## 15:  NLBMS 2011
## 16:  NLBMS 2012
## 17:  NLBMS 2013
## 18:  NLBMS 2014
## 19:  NLBMS 2015
## 20:  NLBMS 2016
## 21:  NLBMS 2017
## 22:  NLBMS 2018
## 23:  UKBMS 2008
## 24:  UKBMS 2009
## 25:  UKBMS 2010
## 26:  UKBMS 2011
## 27:  UKBMS 2012
## 28:  UKBMS 2013
## 29:  UKBMS 2014
## 30:  UKBMS 2015
## 31:  UKBMS 2016
## 32:  UKBMS 2017
## 33:  UKBMS 2018
##     bms_id year

Subset

my_new_dt[bms_id == "NLBMS", ]

##        bms_id transect_id_serial year visit_date
##     1:  NLBMS            NLBMS.1 2008 2008-05-23
##     2:  NLBMS            NLBMS.1 2008 2008-06-07
##     3:  NLBMS            NLBMS.1 2008 2008-07-26
##     4:  NLBMS            NLBMS.1 2008 2008-07-26
##     5:  NLBMS            NLBMS.1 2008 2008-08-15
##    ---                                          
## 26200:  NLBMS          NLBMS.300 2016 2016-07-03
## 26201:  NLBMS          NLBMS.300 2016 2016-07-25
## 26202:  NLBMS          NLBMS.300 2017 2017-06-26
## 26203:  NLBMS          NLBMS.300 2017 2017-07-14
## 26204:  NLBMS          NLBMS.300 2018 2018-06-15

Work with date

my_new_dt[ , month := month(visit_date)][ , c("day", "year") := .(mday(visit_date), year(visit_date))]
my_new_dt

##        bms_id transect_id_serial year visit_date month day
##     1:  FRBMS            FRBMS.1 2009 2009-05-22     5  22
##     2:  FRBMS            FRBMS.1 2009 2009-08-01     8   1
##     3:  FRBMS            FRBMS.1 2009 2009-08-01     8   1
##     4:  FRBMS            FRBMS.1 2009 2009-09-19     9  19
##     5:  FRBMS            FRBMS.1 2010 2010-07-02     7   2
##    ---                                                    
## 59598:  UKBMS          UKBMS.300 2018 2018-07-03     7   3
## 59599:  UKBMS          UKBMS.300 2018 2018-07-10     7  10
## 59600:  UKBMS          UKBMS.300 2018 2018-07-16     7  16
## 59601:  UKBMS          UKBMS.300 2018 2018-07-16     7  16
## 59602:  UKBMS          UKBMS.300 2018 2018-07-23     7  23

Count object

my_new_dt[ , .N, by = bms_id]

##    bms_id     N
## 1:  FRBMS  1074
## 2:  NLBMS 26204
## 3:  UKBMS 32324

rename column

## change names
setnames(b_count_sub, "transect_id_serial", "SITE_ID")
setnames(b_count_sub, c("species_name", "bms_id"), c("SPECIES", "BMS_ID"))
names(b_count_sub) <- toupper(names(b_count_sub))
b_count_sub

##        BMS_ID   SITE_ID VISIT_DATE YEAR MONTH DAY            SPECIES COUNT
##     1:  FRBMS   FRBMS.1 2009-05-22 2009     5  22    Maniola jurtina     5
##     2:  FRBMS   FRBMS.1 2009-08-01 2009     8   1    Maniola jurtina     5
##     3:  FRBMS   FRBMS.1 2009-08-01 2009     8   1 Polyommatus icarus     2
##     4:  FRBMS   FRBMS.1 2009-09-19 2009     9  19 Polyommatus icarus     6
##     5:  FRBMS   FRBMS.1 2010-07-02 2010     7   2    Maniola jurtina    20
##    ---                                                                    
## 59598:  UKBMS UKBMS.300 2018-07-03 2018     7   3    Maniola jurtina    26
## 59599:  UKBMS UKBMS.300 2018-07-10 2018     7  10    Maniola jurtina    16
## 59600:  UKBMS UKBMS.300 2018-07-16 2018     7  16    Maniola jurtina     5
## 59601:  UKBMS UKBMS.300 2018-07-16 2018     7  16 Polyommatus icarus     1
## 59602:  UKBMS UKBMS.300 2018-07-23 2018     7  23    Maniola jurtina     2

Merge data sets

## merge data set
setnames(m_site_sub, "transect_id_serial", "SITE_ID")

setkey(b_count_sub, SITE_ID)
setkey(m_site_sub, SITE_ID)

merge(b_count_sub, m_site_sub[bms_id == "FRBMS" , .(SITE_ID, transect_lon_1km, transect_lat_1km)])

##       SITE_ID BMS_ID VISIT_DATE YEAR MONTH DAY            SPECIES COUNT
##    1: FRBMS.1  FRBMS 2009-05-22 2009     5  22    Maniola jurtina     5
##    2: FRBMS.1  FRBMS 2009-08-01 2009     8   1    Maniola jurtina     5
##    3: FRBMS.1  FRBMS 2009-08-01 2009     8   1 Polyommatus icarus     2
##    4: FRBMS.1  FRBMS 2009-09-19 2009     9  19 Polyommatus icarus     6
##    5: FRBMS.1  FRBMS 2010-07-02 2010     7   2    Maniola jurtina    20
##   ---                                                                  
## 1070: FRBMS.9  FRBMS 2014-07-18 2014     7  18    Maniola jurtina    30
## 1071: FRBMS.9  FRBMS 2014-08-21 2014     8  21    Maniola jurtina    33
## 1072: FRBMS.9  FRBMS 2014-08-21 2014     8  21 Polyommatus icarus     6
## 1073: FRBMS.9  FRBMS 2014-09-16 2014     9  16    Maniola jurtina     2
## 1074: FRBMS.9  FRBMS 2014-09-16 2014     9  16 Polyommatus icarus    18
##       transect_lon_1km transect_lat_1km
##    1:          3953000          2967000
##    2:          3953000          2967000
##    3:          3953000          2967000
##    4:          3953000          2967000
##    5:          3953000          2967000
##   ---                                  
## 1070:          3737000          2835000
## 1071:          3737000          2835000
## 1072:          3737000          2835000
## 1073:          3737000          2835000
## 1074:          3737000          2835000

merge(b_count_sub, m_site_sub[bms_id == "FRBMS" , .(SITE_ID, transect_lon_1km, transect_lat_1km)], all.x = TRUE)

##         SITE_ID BMS_ID VISIT_DATE YEAR MONTH DAY            SPECIES COUNT
##     1:  FRBMS.1  FRBMS 2009-05-22 2009     5  22    Maniola jurtina     5
##     2:  FRBMS.1  FRBMS 2009-08-01 2009     8   1    Maniola jurtina     5
##     3:  FRBMS.1  FRBMS 2009-08-01 2009     8   1 Polyommatus icarus     2
##     4:  FRBMS.1  FRBMS 2009-09-19 2009     9  19 Polyommatus icarus     6
##     5:  FRBMS.1  FRBMS 2010-07-02 2010     7   2    Maniola jurtina    20
##    ---                                                                   
## 59598: UKBMS.99  UKBMS 2018-08-28 2018     8  28 Polyommatus icarus     8
## 59599: UKBMS.99  UKBMS 2018-09-02 2018     9   2    Maniola jurtina    25
## 59600: UKBMS.99  UKBMS 2018-09-02 2018     9   2 Polyommatus icarus     7
## 59601: UKBMS.99  UKBMS 2018-10-01 2018    10   1    Maniola jurtina    14
## 59602: UKBMS.99  UKBMS 2018-10-01 2018    10   1 Polyommatus icarus     2
##        transect_lon_1km transect_lat_1km
##     1:          3953000          2967000
##     2:          3953000          2967000
##     3:          3953000          2967000
##     4:          3953000          2967000
##     5:          3953000          2967000
##    ---                                  
## 59598:               NA               NA
## 59599:               NA               NA
## 59600:               NA               NA
## 59601:               NA               NA
## 59602:               NA               NA

data_m <- merge(b_count_sub, m_site_sub[ , .(SITE_ID, transect_lon_1km, transect_lat_1km)], all.x = TRUE)
data_m

##         SITE_ID BMS_ID VISIT_DATE YEAR MONTH DAY            SPECIES COUNT
##     1:  FRBMS.1  FRBMS 2009-05-22 2009     5  22    Maniola jurtina     5
##     2:  FRBMS.1  FRBMS 2009-08-01 2009     8   1    Maniola jurtina     5
##     3:  FRBMS.1  FRBMS 2009-08-01 2009     8   1 Polyommatus icarus     2
##     4:  FRBMS.1  FRBMS 2009-09-19 2009     9  19 Polyommatus icarus     6
##     5:  FRBMS.1  FRBMS 2010-07-02 2010     7   2    Maniola jurtina    20
##    ---                                                                   
## 59598: UKBMS.99  UKBMS 2018-08-28 2018     8  28 Polyommatus icarus     8
## 59599: UKBMS.99  UKBMS 2018-09-02 2018     9   2    Maniola jurtina    25
## 59600: UKBMS.99  UKBMS 2018-09-02 2018     9   2 Polyommatus icarus     7
## 59601: UKBMS.99  UKBMS 2018-10-01 2018    10   1    Maniola jurtina    14
## 59602: UKBMS.99  UKBMS 2018-10-01 2018    10   1 Polyommatus icarus     2
##        transect_lon_1km transect_lat_1km
##     1:          3953000          2967000
##     2:          3953000          2967000
##     3:          3953000          2967000
##     4:          3953000          2967000
##     5:          3953000          2967000
##    ---                                  
## 59598:          3604000          3177000
## 59599:          3604000          3177000
## 59600:          3604000          3177000
## 59601:          3604000          3177000
## 59602:          3604000          3177000

There is many way to manipulate tabular data sets in R, dplyr and data.table are especially efficient with large data set, data.table can manipulate millions of row with ease. Learn one of these tool will help you seriously in every R task and analysis as 80% of the work is to organize the data to make them fit for the analysis or build a graph.