R - Data wrangling utilizing the most common functions from tidyverse package collection

Welcome. In this post, I am going to show you how to wrangle a dataset using the dplyr and tidyr packages from the tidyverse package collection.

Last updated on Nov 4, 2021 14 min read R

Image credit: Pott L.P.

Data wrangling utilizing the most common functions from tidyverse package collection.

Loading the package

library(tidyverse)

The data come from the data collection and remote sensing class repository on GitHub: https://github.com/luanpott10/Class

You will need the raw file: https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv

Loading the data

data <- read.csv("https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv")
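In the outputs of this post the first column is named `ï..n` rather than `n`. That pattern usually means the file starts with a UTF-8 byte-order mark (BOM) that `read.csv()` folded into the first column name. A minimal sketch, assuming that is the cause here: passing `fileEncoding = "UTF-8-BOM"` asks R to drop the mark while reading. The temporary file and its two columns are made up for illustration; the rest of the post keeps the `ï..n` name so the printed outputs still match.

```r
# Hypothetical two-column CSV written with a leading UTF-8 byte-order mark
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("n,class\n1,soybean\n2,corn\n")), tmp)

# "UTF-8-BOM" tells R to strip the mark before parsing the header
crops <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(crops)
```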

A rapid view of the data

Functions that can be used

`glimpse()`

`head()`

`tail()`

`summary()`

data |> glimpse()
## Rows: 100
## Columns: 9
## $ ï..n      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude  <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class     <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI   <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI   <dbl> -3.8113625, -1.8287302, -2.0509982, -2.5976772, -2.8033531, ~
## $ b2_GCVI   <dbl> -6.3380713, -4.8378959, -2.7416265, -4.2464180, -4.3012776, ~
## $ b3_GCVI   <dbl> 1.2918025, 2.0158091, -1.1819744, 0.8886908, -0.2094595, 2.6~
## $ b4_GCVI   <dbl> -4.631691, -2.661626, -2.707749, -3.352147, -3.434427, -7.33~
data |> head(5)
##   ï..n  latitude longitude   class    b0_GCVI   b1_GCVI   b2_GCVI    b3_GCVI
## 1    1 -28.59329 -52.64978 soybean -1.4882439 -3.811363 -6.338071  1.2918025
## 2    2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896  2.0158091
## 3    3 -28.84481 -53.46126 soybean  0.1690087 -2.050998 -2.741627 -1.1819744
## 4    4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418  0.8886908
## 5    5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
data |> tail(5)
##     ï..n  latitude longitude class  b0_GCVI    b1_GCVI  b2_GCVI   b3_GCVI
## 96    96 -28.53374 -52.84692  corn 5.628807  1.4232943 5.189684 -3.280593
## 97    97 -27.67757 -54.75618  corn 7.803896  5.7830925 7.469813 -1.889733
## 98    98 -28.32241 -51.33973  corn 3.854038 -0.4112832 2.829001 -2.718110
## 99    99 -28.13302 -51.36585  corn 2.423372 -1.6455027 1.057501 -2.884991
## 100  100 -28.05637 -54.54361  corn 7.691086  4.8666124 7.726856 -2.407301
##       b4_GCVI
## 96   2.666456
## 97   5.124163
## 98   0.711688
## 99  -1.349620
## 100  4.228607
data |> summary()
##       ï..n           latitude        longitude         class          
##  Min.   :  1.00   Min.   :-33.67   Min.   :-55.86   Length:100        
##  1st Qu.: 25.75   1st Qu.:-28.90   1st Qu.:-53.62   Class :character  
##  Median : 50.50   Median :-28.47   Median :-52.83   Mode  :character  
##  Mean   : 50.50   Mean   :-28.83   Mean   :-53.03                     
##  3rd Qu.: 75.25   3rd Qu.:-27.96   3rd Qu.:-52.18                     
##  Max.   :100.00   Max.   :-27.22   Max.   :-50.48                     
##     b0_GCVI           b1_GCVI           b2_GCVI           b3_GCVI        
##  Min.   :-10.400   Min.   :-7.3446   Min.   :-19.354   Min.   :-6.16632  
##  1st Qu.: -1.373   1st Qu.:-3.0439   1st Qu.: -5.601   1st Qu.:-1.93480  
##  Median :  0.650   Median :-1.7779   Median : -2.768   Median :-0.07191  
##  Mean   :  1.181   Mean   :-0.9485   Mean   : -1.731   Mean   :-0.04668  
##  3rd Qu.:  4.529   3rd Qu.: 1.1793   3rd Qu.:  3.622   3rd Qu.: 1.53019  
##  Max.   : 13.737   Max.   : 6.5656   Max.   : 17.014   Max.   : 6.54275  
##     b4_GCVI      
##  Min.   :-7.332  
##  1st Qu.:-4.117  
##  Median :-2.728  
##  Mean   :-1.386  
##  3rd Qu.: 2.069  
##  Max.   : 5.124

Select

The `select` function selects columns of a data frame.

Here are some examples of `select` that pick columns in different ways: by position, by name, by exclusion, by name pattern, and by type.

data |> select(1) |> head(6)
##   ï..n
## 1    1
## 2    2
## 3    3
## 4    4
## 5    5
## 6    6
data |> select(last_col()) |> head(6)
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
data |> select(c(class,b0_GCVI,b1_GCVI,b2_GCVI,b3_GCVI,b4_GCVI)) |> head(6)
##     class    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1 soybean -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2 soybean -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3 soybean  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4 soybean -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5 soybean -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6 soybean -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
data |> select(-c(ï..n,latitude,longitude)) |> head(6)
##     class    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1 soybean -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2 soybean -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3 soybean  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4 soybean -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5 soybean -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6 soybean -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
data |> select(starts_with("b")) |> head(6)
##      b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1 -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2 -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4 -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5 -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6 -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
data |> select(ends_with("GCVI")) |> head(6)
##      b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1 -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2 -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4 -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5 -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6 -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
data |> select(where(is.numeric)) |> head(6)
##   ï..n  latitude longitude    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1    1 -28.59329 -52.64978 -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2    2 -30.87459 -51.72551 -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3    3 -28.84481 -53.46126  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4    4 -30.65571 -55.11848 -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5    5 -28.85978 -53.61491 -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6    6 -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
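A few more selection helpers are worth knowing. This is a small sketch on a made-up data frame (`toy` and its values are assumptions that only mimic the column layout of `data`), showing a column range, `contains()`, `matches()`, and renaming while selecting.

```r
library(dplyr)

# Hypothetical toy data frame with the same kind of columns as `data`
toy <- data.frame(
  n = 1:3,
  latitude = c(-28.6, -30.9, -28.8),
  class = c("soybean", "soybean", "corn"),
  b0_GCVI = c(-1.5, -1.1, 0.2),
  b1_GCVI = c(-3.8, -1.8, -2.1)
)

by_range <- toy |> select(latitude:b0_GCVI)    # consecutive columns
by_text  <- toy |> select(contains("GCVI"))    # name contains a string
by_regex <- toy |> select(matches("^b[0-9]_")) # name matches a regular expression
renamed  <- toy |> select(id = n, class)       # select and rename in one step
```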

Arrange

The `arrange` function orders the rows of a data frame by the values of the selected columns.

When we want to see the rows sorted by a given column, we use `arrange`. The default order is ascending.

data |> arrange(b4_GCVI) |> head(6)
##   ï..n  latitude longitude   class    b0_GCVI   b1_GCVI    b2_GCVI  b3_GCVI
## 1    6 -28.41732 -55.03171 soybean  -5.154241 -7.344623 -10.879841 2.600306
## 2   44 -27.94593 -52.35423 soybean  -6.849492 -6.901615 -13.810637 4.647782
## 3   11 -29.02970 -54.92061 soybean -10.400090 -6.194132 -19.353834 6.542745
## 4   31 -28.20905 -51.63171 soybean  -3.497307 -5.756842  -9.003927 1.384130
## 5   18 -27.71189 -52.56109 soybean  -4.217597 -5.241977 -10.647408 3.294080
## 6   14 -32.11317 -53.16475 soybean  -4.723501 -4.550052 -11.359418 2.981974
##     b4_GCVI
## 1 -7.331731
## 2 -7.022087
## 3 -6.669860
## 4 -6.664257
## 5 -5.579582
## 6 -5.544666
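To sort from highest to lowest, wrap the column in `desc()`; several columns can be given to break ties. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  class = c("soybean", "corn", "soybean", "corn"),
  b4_GCVI = c(-4.6, 2.7, -2.7, 5.1)
)

# Highest b4_GCVI first
highest_first <- toy |> arrange(desc(b4_GCVI))

# Alphabetical by class, then highest b4_GCVI first within each class
by_class_then_value <- toy |> arrange(class, desc(b4_GCVI))
```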

Relocate

The `relocate` function changes column positions.

In a machine learning dataset the label is usually the last column, and `relocate` can move it there. Here `relocate(-class)` moves every column except `class` to the front, which leaves `class` last.

data |> relocate(-class) |> head(6)
##   ï..n  latitude longitude    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI   b4_GCVI
## 1    1 -28.59329 -52.64978 -1.4882439 -3.811363  -6.338071  1.2918025 -4.631691
## 2    2 -30.87459 -51.72551 -1.0790943 -1.828730  -4.837896  2.0158091 -2.661626
## 3    3 -28.84481 -53.46126  0.1690087 -2.050998  -2.741627 -1.1819744 -2.707749
## 4    4 -30.65571 -55.11848 -0.9242190 -2.597677  -4.246418  0.8886908 -3.352147
## 5    5 -28.85978 -53.61491 -0.6719911 -2.803353  -4.301278 -0.2094595 -3.434427
## 6    6 -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841  2.6003060 -7.331731
##     class
## 1 soybean
## 2 soybean
## 3 soybean
## 4 soybean
## 5 soybean
## 6 soybean
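The same result can be written more explicitly with the `.after` and `.before` arguments of `relocate`. A short sketch on a made-up data frame (`toy` is an assumption for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  n = 1:2,
  class = c("soybean", "corn"),
  b0_GCVI = c(-1.5, 5.6),
  b1_GCVI = c(-3.8, 1.4)
)

# Move the label to the end
label_last <- toy |> relocate(class, .after = last_col())

# Move the label in front of a named column
label_first <- toy |> relocate(class, .before = n)
```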

Filter

`filter` function is used to subset a data frame, retaining all rows that satisfy your conditions.

The most common operators that are useful to build the conditions are:

`==`, `>`, `>=`

`&`, `|`, `!`, `xor()`

`is.na()`

`between()`, `near()`

We can also put more than one condition in the same `filter` call; conditions separated by commas are combined with AND.

data |> filter(class == "soybean", b4_GCVI > -4) |> head(6)
##   ï..n  latitude longitude   class    b0_GCVI   b1_GCVI   b2_GCVI    b3_GCVI
## 1    2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896  2.0158091
## 2    3 -28.84481 -53.46126 soybean  0.1690087 -2.050998 -2.741627 -1.1819744
## 3    4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418  0.8886908
## 4    5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## 5    9 -30.80493 -55.27339 soybean -0.9067734 -3.203305 -4.744385  0.5329000
## 6   13 -31.31976 -53.99590 soybean -2.4681277 -2.352393 -7.690125  3.9750807
##     b4_GCVI
## 1 -2.661626
## 2 -2.707749
## 3 -3.352147
## 4 -3.434427
## 5 -3.771788
## 6 -1.823997
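The other operators from the list work the same way. Here is a small sketch on a made-up data frame (`toy`, its values, and the `wheat` class are assumptions for illustration) using `between()`, `%in%`, `|`, and `is.na()`. Note that `filter` drops rows where the condition evaluates to `NA`.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- data.frame(
  class = c("soybean", "soybean", "corn", "corn", "wheat"),
  b4_GCVI = c(-4.6, -2.7, 2.7, NA, 0.5)
)

in_range  <- toy |> filter(between(b4_GCVI, -3, 3))           # -3 <= b4_GCVI <= 3
two_crops <- toy |> filter(class %in% c("soybean", "corn"))   # membership test
either    <- toy |> filter(class == "wheat" | b4_GCVI < -4)   # logical OR
complete  <- toy |> filter(!is.na(b4_GCVI))                   # drop missing values
```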

Rename

`rename` changes the names of individual variables.

Column names in datasets should be short, intuitive and complete, so we often need to rename columns. The syntax is `new_name = old_name`.

data |> rename(n = ï..n) |> head(6)
##   n  latitude longitude   class    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.4882439 -3.811363  -6.338071  1.2918025
## 2 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730  -4.837896  2.0158091
## 3 3 -28.84481 -53.46126 soybean  0.1690087 -2.050998  -2.741627 -1.1819744
## 4 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677  -4.246418  0.8886908
## 5 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353  -4.301278 -0.2094595
## 6 6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841  2.6003060
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731

Mutate

`mutate` creates new variables or transforms existing ones.

`across` makes it easy to apply the same transformation to multiple columns.

`across` is generally used inside `summarise()` and `mutate()`.

In this example we round the double columns, except latitude and longitude, to 2 decimal places.

data |> 
  mutate(across(where(is.double) & !c(latitude, longitude), ~ round(.x,2))) |> head(6)
##   ï..n  latitude longitude   class b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1    1 -28.59329 -52.64978 soybean   -1.49   -3.81   -6.34    1.29   -4.63
## 2    2 -30.87459 -51.72551 soybean   -1.08   -1.83   -4.84    2.02   -2.66
## 3    3 -28.84481 -53.46126 soybean    0.17   -2.05   -2.74   -1.18   -2.71
## 4    4 -30.65571 -55.11848 soybean   -0.92   -2.60   -4.25    0.89   -3.35
## 5    5 -28.85978 -53.61491 soybean   -0.67   -2.80   -4.30   -0.21   -3.43
## 6    6 -28.41732 -55.03171 soybean   -5.15   -7.34  -10.88    2.60   -7.33
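`mutate` can also add new columns computed from existing ones. A minimal sketch on a made-up data frame (`toy` and the new column names `diff_b0_b4` and `positive_b0` are assumptions for illustration):

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  class = c("soybean", "corn"),
  b0_GCVI = c(-1.49, 5.63),
  b4_GCVI = c(-4.63, 2.67)
)

out <- toy |>
  mutate(
    diff_b0_b4  = b0_GCVI - b4_GCVI,  # new column from two existing ones
    positive_b0 = b0_GCVI > 0,        # new logical flag
    class       = toupper(class)      # overwrite an existing column
  )
```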

Recode

`recode` replaces the values of a numeric vector, character vector, or factor according to simple `old = new` specifications.

For more complicated criteria, use `case_when()`.

data |> mutate(specie=recode(class,
                             soybean="Glycine max",
                             corn="Zea mays")) |> head(6)
##   ï..n  latitude longitude   class    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI
## 1    1 -28.59329 -52.64978 soybean -1.4882439 -3.811363  -6.338071  1.2918025
## 2    2 -30.87459 -51.72551 soybean -1.0790943 -1.828730  -4.837896  2.0158091
## 3    3 -28.84481 -53.46126 soybean  0.1690087 -2.050998  -2.741627 -1.1819744
## 4    4 -30.65571 -55.11848 soybean -0.9242190 -2.597677  -4.246418  0.8886908
## 5    5 -28.85978 -53.61491 soybean -0.6719911 -2.803353  -4.301278 -0.2094595
## 6    6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841  2.6003060
##     b4_GCVI      specie
## 1 -4.631691 Glycine max
## 2 -2.661626 Glycine max
## 3 -2.707749 Glycine max
## 4 -3.352147 Glycine max
## 5 -3.434427 Glycine max
## 6 -7.331731 Glycine max
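For comparison, here is how `case_when()` handles the same recoding plus a condition on a numeric column, which `recode` cannot express. This is a sketch on a made-up data frame; `toy`, the `wheat` class, and the cut-offs -2 and 2 are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  class = c("soybean", "corn", "wheat"),
  b4_GCVI = c(-4.6, 2.7, 0.5)
)

out <- toy |>
  mutate(
    # Same mapping as the recode example; unmatched classes become NA
    species = case_when(
      class == "soybean" ~ "Glycine max",
      class == "corn"    ~ "Zea mays",
      TRUE               ~ NA_character_
    ),
    # Conditions are checked in order; the first TRUE one wins
    level = case_when(
      b4_GCVI < -2 ~ "low",
      b4_GCVI < 2  ~ "medium",
      TRUE         ~ "high"
    )
  )
```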

Summarise

`summarise` creates a new data frame.

It reduces a data frame to summary statistics: one row per group, or a single row when the data are not grouped.

The most common functions used inside `summarise` are:

`mean()`, `median()`

`sd()`, `IQR()`, `mad()`

`min()`, `max()`, `quantile()`

`first()`, `last()`, `nth()`

`n()`, `n_distinct()`

`any()`, `all()`

We can also compute more than one summary in the same `summarise` call.

data |> summarise(n = n(), min = min(b4_GCVI), max = max(b4_GCVI), mean = mean(b4_GCVI))
##     n       min      max      mean
## 1 100 -7.331731 5.124163 -1.386388
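`across` also works inside `summarise`, which avoids repeating the same call for every column. A minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration); with a named list of functions, the output columns are named `column_function`.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  b0_GCVI = c(1, 2, 3, 4),
  b4_GCVI = c(-2, 0, 2, 4)
)

# Mean and standard deviation of every column whose name ends with "GCVI"
out <- toy |>
  summarise(across(ends_with("GCVI"), list(mean = mean, sd = sd)))
```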

Group by

`group_by` takes an existing data frame and converts it into a grouped data frame where operations are performed by group. `ungroup()` removes the grouping.

Generally, `group_by` is used before `summarise` to generate summaries by group.

data |> group_by(class) |> summarise(n = n(), min = min(b4_GCVI), max = max(b4_GCVI), mean = mean(b4_GCVI))
## # A tibble: 2 x 5
##   class       n   min   max  mean
##   <chr>   <int> <dbl> <dbl> <dbl>
## 1 corn       50 -3.33  5.12  1.34
## 2 soybean    50 -7.33 -1.29 -4.11
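`group_by` also combines with `mutate`, which keeps every row and adds a per-group value to each one. A small sketch on a made-up data frame (`toy` and the column names `class_mean` and `centered` are assumptions for illustration), finishing with `ungroup()` so later steps are not grouped by accident.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  class = c("soybean", "soybean", "corn", "corn"),
  b4_GCVI = c(-4, -2, 1, 3)
)

out <- toy |>
  group_by(class) |>
  mutate(
    class_mean = mean(b4_GCVI),        # mean of the row's own class
    centered   = b4_GCVI - class_mean  # deviation from that class mean
  ) |>
  ungroup()
```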

Pull

`pull` selects a column in a data frame and transforms it into a vector.

During data wrangling we can use `pull` to extract a single column as a vector, for example to pass it to a function that does not accept data frames.

data |> pull(b4_GCVI) |> head(6)
## [1] -4.631691 -2.661626 -2.707749 -3.352147 -3.434427 -7.331731

Join

The `*_join` functions join two data frames together.

When joining tables that share a key, as in relational databases, the `by` argument is the most important one: it names the key column(s) used to match rows.

The types of join are:

`inner_join` : only rows with matching keys in both x and y;

`left_join` : all rows in x, adding matching columns from y;

`right_join` : all rows in y, adding matching columns from x;

`full_join` : all rows in x with matching columns in y, then the rows of y that don’t match x.

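You can see the differences in the example below, where `data_1` holds rows 1 to 60 and `data_2` holds rows 51 to 100, so only rows 51 to 60 are present in both.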

data |> select(ï..n,latitude,longitude,class,b0_GCVI) -> data_1
data |> select(ï..n,latitude,longitude,class,b1_GCVI) -> data_2

data_1 <- data_1[1:60,]
data_2 <- data_2[51:100,]

data_x <- inner_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 10
## Columns: 6
## $ ï..n      <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
## $ latitude  <dbl> -27.93454, -28.12877, -27.89983, -28.43043, -28.58020, -27.7~
## $ longitude <dbl> -52.08487, -51.32797, -54.50413, -51.63524, -51.75654, -51.7~
## $ class     <chr> "corn", "corn", "corn", "corn", "corn", "corn", "corn", "cor~
## $ b0_GCVI   <dbl> 4.5186558, 1.2942828, 7.8575535, -2.8303745, -0.2717497, 5.2~
## $ b1_GCVI   <dbl> 0.8883052, -2.2323170, 5.9253740, -5.7838035, -3.2403061, 1.~
data_x <- left_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 60
## Columns: 6
## $ ï..n      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude  <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class     <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI   <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
data_x <- right_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 50
## Columns: 6
## $ ï..n      <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, ~
## $ latitude  <dbl> -27.93454, -28.12877, -27.89983, -28.43043, -28.58020, -27.7~
## $ longitude <dbl> -52.08487, -51.32797, -54.50413, -51.63524, -51.75654, -51.7~
## $ class     <chr> "corn", "corn", "corn", "corn", "corn", "corn", "corn", "cor~
## $ b0_GCVI   <dbl> 4.5186558, 1.2942828, 7.8575535, -2.8303745, -0.2717497, 5.2~
## $ b1_GCVI   <dbl> 0.8883052, -2.2323170, 5.9253740, -5.7838035, -3.2403061, 1.~
data_x <- full_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 100
## Columns: 6
## $ ï..n      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude  <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class     <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI   <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
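When the key has a different name in each table, `by` accepts a named vector of the form `c("name_in_x" = "name_in_y")`. A minimal sketch on two made-up tables (`fields`, `indices`, and their values are assumptions for illustration). It also uses `anti_join`, a dplyr function not covered above, which returns the rows of x that have no match in y.

```r
library(dplyr)

# Hypothetical tables: the key is called `id` in one and `n` in the other
fields  <- data.frame(id = 1:4, class = c("soybean", "soybean", "corn", "corn"))
indices <- data.frame(n = 3:6, b0_GCVI = c(4.5, 1.3, 7.9, -2.8))

inner     <- inner_join(fields, indices, by = c("id" = "n"))  # ids 3 and 4 only
left      <- left_join(fields, indices, by = c("id" = "n"))   # all 4 ids of fields
full      <- full_join(fields, indices, by = c("id" = "n"))   # ids 1 to 6
unmatched <- anti_join(fields, indices, by = c("id" = "n"))   # ids 1 and 2

c(inner = nrow(inner), left = nrow(left), full = nrow(full), unmatched = nrow(unmatched))
```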

Unite

The `unite` function pastes the values of two or more columns into one.

When we want a single merged column, for example as a combined identifier, we can use `unite`. The source columns are removed from the result.

data |> unite ("n_class",ï..n,class,sep="_") -> data_united
data_united |> head(6)
##     n_class  latitude longitude    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI
## 1 1_soybean -28.59329 -52.64978 -1.4882439 -3.811363  -6.338071  1.2918025
## 2 2_soybean -30.87459 -51.72551 -1.0790943 -1.828730  -4.837896  2.0158091
## 3 3_soybean -28.84481 -53.46126  0.1690087 -2.050998  -2.741627 -1.1819744
## 4 4_soybean -30.65571 -55.11848 -0.9242190 -2.597677  -4.246418  0.8886908
## 5 5_soybean -28.85978 -53.61491 -0.6719911 -2.803353  -4.301278 -0.2094595
## 6 6_soybean -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841  2.6003060
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731

Separate

The `separate` function splits a character column into multiple columns, using a regular expression or numeric positions.

When we want to split one column into two or more columns, we can use `separate`.

data_united |> separate(n_class,c("ï..n","class"),sep="_") |> head(6)
##   ï..n   class  latitude longitude    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI
## 1    1 soybean -28.59329 -52.64978 -1.4882439 -3.811363  -6.338071  1.2918025
## 2    2 soybean -30.87459 -51.72551 -1.0790943 -1.828730  -4.837896  2.0158091
## 3    3 soybean -28.84481 -53.46126  0.1690087 -2.050998  -2.741627 -1.1819744
## 4    4 soybean -30.65571 -55.11848 -0.9242190 -2.597677  -4.246418  0.8886908
## 5    5 soybean -28.85978 -53.61491 -0.6719911 -2.803353  -4.301278 -0.2094595
## 6    6 soybean -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841  2.6003060
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
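One detail to watch: the pieces produced by `separate` are character columns by default, so the recovered row number is text rather than an integer. A short sketch, on a made-up data frame (`toy` is an assumption for illustration), of the `convert = TRUE` argument, which asks tidyr to guess a better type for each new column.

```r
library(tidyr)

# Hypothetical united column
toy <- data.frame(n_class = c("1_soybean", "2_corn"))

# Default: both new columns are character
as_text <- toy |> separate(n_class, c("n", "class"), sep = "_")

# convert = TRUE: `n` becomes integer, `class` stays character
as_typed <- toy |> separate(n_class, c("n", "class"), sep = "_", convert = TRUE)

c(default = class(as_text$n), converted = class(as_typed$n))
```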

Gather

The `gather` function gathers columns into key-value pairs.

When we want to stack several columns into two new columns, one holding the original column names and the other holding the respective values, we can use `gather`.

data |> gather(key="Feature",value="Value",b0_GCVI:b4_GCVI) -> data_gathered
data_gathered |> head(6)
##   ï..n  latitude longitude   class Feature      Value
## 1    1 -28.59329 -52.64978 soybean b0_GCVI -1.4882439
## 2    2 -30.87459 -51.72551 soybean b0_GCVI -1.0790943
## 3    3 -28.84481 -53.46126 soybean b0_GCVI  0.1690087
## 4    4 -30.65571 -55.11848 soybean b0_GCVI -0.9242190
## 5    5 -28.85978 -53.61491 soybean b0_GCVI -0.6719911
## 6    6 -28.41732 -55.03171 soybean b0_GCVI -5.1542411

Spread

The `spread` function spreads a key-value pair across multiple columns.

When we need to turn a pair of key-value columns back into one column per key, we can use `spread`. It is the inverse of `gather`.

data_gathered |> spread(key = Feature, value = Value) |> head(6)
##   ï..n  latitude longitude   class    b0_GCVI   b1_GCVI    b2_GCVI    b3_GCVI
## 1    1 -28.59329 -52.64978 soybean -1.4882439 -3.811363  -6.338071  1.2918025
## 2    2 -30.87459 -51.72551 soybean -1.0790943 -1.828730  -4.837896  2.0158091
## 3    3 -28.84481 -53.46126 soybean  0.1690087 -2.050998  -2.741627 -1.1819744
## 4    4 -30.65571 -55.11848 soybean -0.9242190 -2.597677  -4.246418  0.8886908
## 5    5 -28.85978 -53.61491 soybean -0.6719911 -2.803353  -4.301278 -0.2094595
## 6    6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841  2.6003060
##     b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
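Since tidyr 1.0.0, `gather` and `spread` are superseded by `pivot_longer` and `pivot_wider`; the older functions still work, but new code is usually written with the pivot functions. A minimal sketch of the same round trip on a made-up data frame (`toy` and its values are assumptions for illustration):

```r
library(tidyr)

# Hypothetical toy data in wide format
toy <- data.frame(
  n = 1:2,
  class = c("soybean", "corn"),
  b0_GCVI = c(-1.5, 5.6),
  b1_GCVI = c(-3.8, 1.4)
)

# Wide to long: counterpart of gather()
long <- toy |>
  pivot_longer(cols = b0_GCVI:b1_GCVI, names_to = "Feature", values_to = "Value")

# Long to wide: counterpart of spread()
wide <- long |>
  pivot_wider(names_from = Feature, values_from = Value)
```

Both functions return a tibble, and `pivot_longer` orders the long rows by original row rather than by column, so the row order differs from what `gather` produces.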

Final

These are the most common functions for data wrangling with the tidyverse packages.

