Posts | Data Science in Agriculture

R - Data wrangling utilizing the most common functions from tidyverse package collection

Thu, 14 Oct 2021 00:00:00 +0000

Data wrangling utilizing the most common functions from tidyverse package collection.

Loading the package

library(tidyverse)

The data (field data collection and remote sensing) are available on GitHub: https://github.com/luanpott10/Class

You will need the raw file: https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv

Loading the data

data <- read.csv("https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv")
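The first column is printed as `ï..n` in the outputs of this post. That pattern usually appears when a CSV file starts with a byte order mark (BOM); this cause is an assumption here, not something confirmed for this file. A minimal sketch with a toy file (the temporary file and its contents are made up for illustration) shows how `fileEncoding = "UTF-8-BOM"` drops the mark while reading:

```r
# Toy CSV that starts with the three UTF-8 BOM bytes (hypothetical content).
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)), charToRaw("n,class\n1,soybean\n2,corn\n")),
  tmp
)

# "UTF-8-BOM" asks R to remove the BOM before the header is parsed.
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(toy)
```

The rest of the post keeps the column name exactly as it was read, so the code that follows still refers to `ï..n`.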

A quick look at the data

Functions that can be used

`glimpse()`

`head()`

`tail()`

`summary()`

data |> glimpse()
## Rows: 100
## Columns: 9
## $ ï..n <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI <dbl> -3.8113625, -1.8287302, -2.0509982, -2.5976772, -2.8033531, ~
## $ b2_GCVI <dbl> -6.3380713, -4.8378959, -2.7416265, -4.2464180, -4.3012776, ~
## $ b3_GCVI <dbl> 1.2918025, 2.0158091, -1.1819744, 0.8886908, -0.2094595, 2.6~
## $ b4_GCVI <dbl> -4.631691, -2.661626, -2.707749, -3.352147, -3.434427, -7.33~
data |> head(5)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3 -28.84481 -53.46126 soybean 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
data |> tail(5)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 96 96 -28.53374 -52.84692 corn 5.628807 1.4232943 5.189684 -3.280593
## 97 97 -27.67757 -54.75618 corn 7.803896 5.7830925 7.469813 -1.889733
## 98 98 -28.32241 -51.33973 corn 3.854038 -0.4112832 2.829001 -2.718110
## 99 99 -28.13302 -51.36585 corn 2.423372 -1.6455027 1.057501 -2.884991
## 100 100 -28.05637 -54.54361 corn 7.691086 4.8666124 7.726856 -2.407301
## b4_GCVI
## 96 2.666456
## 97 5.124163
## 98 0.711688
## 99 -1.349620
## 100 4.228607
data |> summary()
## ï..n latitude longitude class
## Min. : 1.00 Min. :-33.67 Min. :-55.86 Length:100
## 1st Qu.: 25.75 1st Qu.:-28.90 1st Qu.:-53.62 Class :character
## Median : 50.50 Median :-28.47 Median :-52.83 Mode :character
## Mean : 50.50 Mean :-28.83 Mean :-53.03
## 3rd Qu.: 75.25 3rd Qu.:-27.96 3rd Qu.:-52.18
## Max. :100.00 Max. :-27.22 Max. :-50.48
## b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## Min. :-10.400 Min. :-7.3446 Min. :-19.354 Min. :-6.16632
## 1st Qu.: -1.373 1st Qu.:-3.0439 1st Qu.: -5.601 1st Qu.:-1.93480
## Median : 0.650 Median :-1.7779 Median : -2.768 Median :-0.07191
## Mean : 1.181 Mean :-0.9485 Mean : -1.731 Mean :-0.04668
## 3rd Qu.: 4.529 3rd Qu.: 1.1793 3rd Qu.: 3.622 3rd Qu.: 1.53019
## Max. : 13.737 Max. : 6.5656 Max. : 17.014 Max. : 6.54275
## b4_GCVI
## Min. :-7.332
## 1st Qu.:-4.117
## Median :-2.728
## Mean :-1.386
## 3rd Qu.: 2.069
## Max. : 5.124

Select

The `select` function selects columns in a data frame.

Here are some examples of `select` choosing columns in different ways: by position, by name, by exclusion, by name pattern and by type.

data |> select(1) |> head(6)
## ï..n
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
data |> select(last_col()) |> head(6)
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
data |> select(c(class,b0_GCVI,b1_GCVI,b2_GCVI,b3_GCVI,b4_GCVI)) |> head(6)
## class b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 soybean -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 soybean -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 soybean 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 soybean -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 soybean -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 soybean -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731
data |> select(-c(ï..n,latitude,longitude)) |> head(6)
## class b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 soybean -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 soybean -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 soybean 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 soybean -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 soybean -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 soybean -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731
data |> select(starts_with("b")) |> head(6)
## b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731
data |> select(ends_with("GCVI")) |> head(6)
## b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731
data |> select(where(is.numeric)) |> head(6)
## ï..n latitude longitude b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 1 -28.59329 -52.64978 -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 2 -30.87459 -51.72551 -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 3 -28.84481 -53.46126 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 4 -30.65571 -55.11848 -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 5 -28.85978 -53.61491 -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 6 -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731

Arrange

The `arrange` function orders the rows of a data frame by the values of selected columns.

When we want to see the rows ordered by a given column, we use the `arrange` function. The default order is ascending.

data |> arrange(b4_GCVI) |> head(6)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 6 -28.41732 -55.03171 soybean -5.154241 -7.344623 -10.879841 2.600306
## 2 44 -27.94593 -52.35423 soybean -6.849492 -6.901615 -13.810637 4.647782
## 3 11 -29.02970 -54.92061 soybean -10.400090 -6.194132 -19.353834 6.542745
## 4 31 -28.20905 -51.63171 soybean -3.497307 -5.756842 -9.003927 1.384130
## 5 18 -27.71189 -52.56109 soybean -4.217597 -5.241977 -10.647408 3.294080
## 6 14 -32.11317 -53.16475 soybean -4.723501 -4.550052 -11.359418 2.981974
## b4_GCVI
## 1 -7.331731
## 2 -7.022087
## 3 -6.669860
## 4 -6.664257
## 5 -5.579582
## 6 -5.544666
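The example above sorts in ascending order by one column. A small sketch, using a made-up toy data frame rather than the crop data, shows descending order with `desc()` and sorting by more than one column:

```r
library(dplyr)

# Toy data (values are illustrative only).
toy <- data.frame(
  class   = c("corn", "soybean", "corn", "soybean"),
  b4_GCVI = c(2.7, -4.6, 5.1, -2.7)
)

# Sort by class first, then by b4_GCVI from highest to lowest within each class.
sorted <- toy |> arrange(class, desc(b4_GCVI))
sorted
```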

Relocate

The `relocate` function changes column positions.

Machine learning workflows often expect the label in the last column, and `relocate` can move it there. In the example, `relocate(-class)` moves every column except `class` to the front, which leaves `class` last.

data |> relocate(-class) |> head(6)
## ï..n latitude longitude b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 1 -28.59329 -52.64978 -1.4882439 -3.811363 -6.338071 1.2918025 -4.631691
## 2 2 -30.87459 -51.72551 -1.0790943 -1.828730 -4.837896 2.0158091 -2.661626
## 3 3 -28.84481 -53.46126 0.1690087 -2.050998 -2.741627 -1.1819744 -2.707749
## 4 4 -30.65571 -55.11848 -0.9242190 -2.597677 -4.246418 0.8886908 -3.352147
## 5 5 -28.85978 -53.61491 -0.6719911 -2.803353 -4.301278 -0.2094595 -3.434427
## 6 6 -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841 2.6003060 -7.331731
## class
## 1 soybean
## 2 soybean
## 3 soybean
## 4 soybean
## 5 soybean
## 6 soybean
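The same result can be written more explicitly with the `.after` argument. This is a sketch on a made-up toy data frame; the column names mirror the crop data but the values are illustrative:

```r
library(dplyr)

# Toy data (illustrative values).
toy <- data.frame(
  n       = 1:2,
  class   = c("soybean", "corn"),
  b4_GCVI = c(-4.6, 2.7)
)

# Move the label column after whatever column is currently last.
label_last <- toy |> relocate(class, .after = last_col())
names(label_last)
```

Naming the column that moves, instead of all the columns that stay, can be easier to read when the data frame has many columns.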

Filter

`filter` function is used to subset a data frame, retaining all rows that satisfy your conditions.

The most common operators that are useful to build the conditions are:

`==`, `>`, `>=`

`&`, `|`, `!`, `xor()`

`is.na()`

`between()`, `near()`

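We can also put more than one condition in the same `filter` call. Conditions separated by commas are combined with AND, so a row is kept only if it satisfies all of them.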

data |> filter(class == "soybean", b4_GCVI > -4) |> head(6)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896 2.0158091
## 2 3 -28.84481 -53.46126 soybean 0.1690087 -2.050998 -2.741627 -1.1819744
## 3 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418 0.8886908
## 4 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## 5 9 -30.80493 -55.27339 soybean -0.9067734 -3.203305 -4.744385 0.5329000
## 6 13 -31.31976 -53.99590 soybean -2.4681277 -2.352393 -7.690125 3.9750807
## b4_GCVI
## 1 -2.661626
## 2 -2.707749
## 3 -3.352147
## 4 -3.434427
## 5 -3.771788
## 6 -1.823997
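The other operators listed above work the same way. A short sketch on a made-up toy data frame (the values and thresholds are illustrative) shows `between()`, `|`, `is.na()` and `near()`:

```r
library(dplyr)

# Toy data with one missing value (illustrative only).
toy <- data.frame(
  class   = c("soybean", "soybean", "corn", "corn", "corn"),
  b4_GCVI = c(-4.6, -2.7, 2.7, NA, 5.1)
)

# between() includes both limits; rows where the condition is NA are dropped.
mid_range <- toy |> filter(between(b4_GCVI, -3, 3))

# | keeps rows that satisfy at least one of the conditions.
corn_or_low <- toy |> filter(class == "corn" | b4_GCVI < -4)

# is.na() keeps only the rows with a missing value.
missing_b4 <- toy |> filter(is.na(b4_GCVI))

# near() compares floating point numbers with a small tolerance.
close_to <- toy |> filter(near(b4_GCVI, 2.7))
```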

Rename

`rename` changes the names of individual variables.

Column names in datasets should be short, intuitive and complete, so we often need to rename columns. The syntax is `new_name = old_name`.

data |> rename(n = ï..n) |> head(6)
## n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3 -28.84481 -53.46126 soybean 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## 6 6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841 2.6003060
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731

Mutate

`mutate` is utilized to create or transform variables.

`across` makes it easy to apply the same transformation to multiple columns.

`across` is generally used inside the `summarise()` and `mutate()` functions.

In this example we round the double variables to 2 decimals, except latitude and longitude.

data |>
mutate(across(where(is.double) & !c(latitude, longitude), ~ round(.x,2))) |> head(6)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI b4_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.49 -3.81 -6.34 1.29 -4.63
## 2 2 -30.87459 -51.72551 soybean -1.08 -1.83 -4.84 2.02 -2.66
## 3 3 -28.84481 -53.46126 soybean 0.17 -2.05 -2.74 -1.18 -2.71
## 4 4 -30.65571 -55.11848 soybean -0.92 -2.60 -4.25 0.89 -3.35
## 5 5 -28.85978 -53.61491 soybean -0.67 -2.80 -4.30 -0.21 -3.43
## 6 6 -28.41732 -55.03171 soybean -5.15 -7.34 -10.88 2.60 -7.33
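`across` works the same way inside `summarise()`. As a sketch on a made-up toy data frame (illustrative values), the mean of every column ending in "GCVI" can be computed in one call:

```r
library(dplyr)

# Toy data (illustrative values).
toy <- data.frame(
  class   = c("soybean", "soybean", "corn", "corn"),
  b3_GCVI = c(1.29, 2.02, -3.28, -1.89),
  b4_GCVI = c(-4.63, -2.66, 2.67, 5.12)
)

# One mean per selected column; the result keeps the original column names.
means <- toy |> summarise(across(ends_with("GCVI"), mean))
means
```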

Recode

`recode` replaces the values of a numeric vector, character vector, or factor according to simple `old = new` specifications.

For more complicated criteria, use `case_when()`.

data |> mutate(specie=recode(class,
soybean="Glycine max",
corn="Zea mays")) |> head(6)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3 -28.84481 -53.46126 soybean 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## 6 6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841 2.6003060
## b4_GCVI specie
## 1 -4.631691 Glycine max
## 2 -2.661626 Glycine max
## 3 -2.707749 Glycine max
## 4 -3.352147 Glycine max
## 5 -3.434427 Glycine max
## 6 -7.331731 Glycine max
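For criteria that depend on more than one column, `case_when()` can be sketched as follows. The toy data, the threshold of 2 and the group labels are all made up for illustration:

```r
library(dplyr)

# Toy data (illustrative values).
toy <- data.frame(
  class   = c("soybean", "corn", "corn"),
  b4_GCVI = c(-4.6, 0.7, 5.1)
)

# Conditions are checked in order; the first one that is TRUE sets the value.
labelled <- toy |>
  mutate(group = case_when(
    class == "soybean"             ~ "soybean",
    class == "corn" & b4_GCVI > 2  ~ "corn, high b4_GCVI",
    TRUE                           ~ "corn, low b4_GCVI"
  ))
labelled
```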

Summarise

`summarise` creates a new data frame with one row per group (or a single row when the data are not grouped).

The objective of the function is to reduce a data frame to summary statistics.

The most common functions used inside `summarise` are:

`mean()`, `median()`

`sd()`, `IQR()`, `mad()`

`min()`, `max()`, `quantile()`

`first()`, `last()`, `nth()`

`n()`, `n_distinct()`

`any()`, `all()`

We can also use more than one function in the same `summarise` call.

data |> summarise(n = n(), min = min(b4_GCVI), max = max(b4_GCVI), mean = mean(b4_GCVI))
## n min max mean
## 1 100 -7.331731 5.124163 -1.386388
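Some of the other functions from the list can be combined in the same way. This is a sketch on a made-up toy data frame, not the crop data:

```r
library(dplyr)

# Toy data (illustrative values).
toy <- data.frame(
  class   = c("soybean", "soybean", "corn", "corn", "corn"),
  b4_GCVI = c(-4.6, -2.7, 2.7, 0.7, 5.1)
)

stats <- toy |>
  summarise(
    n_classes    = n_distinct(class),        # number of different classes
    median       = median(b4_GCVI),
    sd           = sd(b4_GCVI),
    q25          = quantile(b4_GCVI, 0.25),  # first quartile
    first        = first(b4_GCVI),           # value in the first row
    any_positive = any(b4_GCVI > 0)          # TRUE if at least one value is > 0
  )
stats
```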

Group by

`group_by` takes an existing data frame and converts it into a grouped data frame where operations are performed by group. `ungroup()` removes the grouping.

Generally the `group_by` function is used before the `summarise` function to generate summaries by group.

data |> group_by(class) |> summarise(n = n(), min = min(b4_GCVI), max = max(b4_GCVI), mean = mean(b4_GCVI))
## # A tibble: 2 x 5
## class n min max mean
## <chr> <int> <dbl> <dbl> <dbl>
## 1 corn 50 -3.33 5.12 1.34
## 2 soybean 50 -7.33 -1.29 -4.11
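`group_by` also works with `mutate`, and in that case `ungroup()` is useful so that later steps are not computed by group by accident. A minimal sketch on a made-up toy data frame (the column `b4_centred` is a hypothetical name):

```r
library(dplyr)

# Toy data (illustrative values).
toy <- data.frame(
  class   = c("soybean", "soybean", "corn", "corn"),
  b4_GCVI = c(-4.6, -2.7, 2.7, 5.1)
)

# Subtract the class mean from each value, then drop the grouping.
centred <- toy |>
  group_by(class) |>
  mutate(b4_centred = b4_GCVI - mean(b4_GCVI)) |>
  ungroup()
centred
```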

Pull

`pull` selects a column in a data frame and transforms it into a vector.

When we are dealing with data wrangling, we can use the `pull` function to extract a column as a vector.

data |> pull(b4_GCVI) |> head(6)
## [1] -4.631691 -2.661626 -2.707749 -3.352147 -3.434427 -7.331731

Join

The `join` functions combine two data frames.

The `by` argument names the key columns (the foreign key, in relational database terms) used to match rows between the two tables, so it is the most important argument to get right.

The types of join are:

`inner_join` : only rows with matching keys in both x and y;

`left_join` : all rows in x, adding matching columns from y;

`right_join` : all rows in y, adding matching columns from x;

`full_join` : all rows in x with matching columns in y, then the rows of y that don’t match x.

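You can see the differences in the example below, where `data_1` holds rows 1 to 60 and `data_2` holds rows 51 to 100, so only rows 51 to 60 are present in both.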

data |> select(ï..n,latitude,longitude,class,b0_GCVI) -> data_1
data |> select(ï..n,latitude,longitude,class,b1_GCVI) -> data_2
data_1 <- data_1[1:60,]
data_2 <- data_2[51:100,]
data_x <- inner_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 10
## Columns: 6
## $ ï..n <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60
## $ latitude <dbl> -27.93454, -28.12877, -27.89983, -28.43043, -28.58020, -27.7~
## $ longitude <dbl> -52.08487, -51.32797, -54.50413, -51.63524, -51.75654, -51.7~
## $ class <chr> "corn", "corn", "corn", "corn", "corn", "corn", "corn", "cor~
## $ b0_GCVI <dbl> 4.5186558, 1.2942828, 7.8575535, -2.8303745, -0.2717497, 5.2~
## $ b1_GCVI <dbl> 0.8883052, -2.2323170, 5.9253740, -5.7838035, -3.2403061, 1.~
data_x <- left_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 60
## Columns: 6
## $ ï..n <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
data_x <- right_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 50
## Columns: 6
## $ ï..n <int> 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, ~
## $ latitude <dbl> -27.93454, -28.12877, -27.89983, -28.43043, -28.58020, -27.7~
## $ longitude <dbl> -52.08487, -51.32797, -54.50413, -51.63524, -51.75654, -51.7~
## $ class <chr> "corn", "corn", "corn", "corn", "corn", "corn", "corn", "cor~
## $ b0_GCVI <dbl> 4.5186558, 1.2942828, 7.8575535, -2.8303745, -0.2717497, 5.2~
## $ b1_GCVI <dbl> 0.8883052, -2.2323170, 5.9253740, -5.7838035, -3.2403061, 1.~
data_x <- full_join(data_1,data_2, by = c("ï..n","latitude","longitude","class"))
data_x |> glimpse()
## Rows: 100
## Columns: 6
## $ ï..n <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1~
## $ latitude <dbl> -28.59329, -30.87459, -28.84481, -30.65571, -28.85978, -28.4~
## $ longitude <dbl> -52.64978, -51.72551, -53.46126, -55.11848, -53.61491, -55.0~
## $ class <chr> "soybean", "soybean", "soybean", "soybean", "soybean", "soyb~
## $ b0_GCVI <dbl> -1.48824394, -1.07909429, 0.16900872, -0.92421901, -0.671991~
## $ b1_GCVI <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

Unite

`unite` function unites the values of two columns into one.

When we would like to unite two columns to use as a merged column we can use `unite` function.

data |> unite ("n_class",ï..n,class,sep="_") -> data_united
data_united |> head(6)
## n_class latitude longitude b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1_soybean -28.59329 -52.64978 -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2_soybean -30.87459 -51.72551 -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3_soybean -28.84481 -53.46126 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4_soybean -30.65571 -55.11848 -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5_soybean -28.85978 -53.61491 -0.6719911 -2.803353 -4.301278 -0.2094595
## 6 6_soybean -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841 2.6003060
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731

Separate

`separate` function separates a character column into multiple columns with a regular expression or numeric locations.

When we would like to split one column into two or more columns, we can use the `separate` function.

data_united |> separate(n_class,c("ï..n","class"),sep="_") |> head(6)
## ï..n class latitude longitude b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1 soybean -28.59329 -52.64978 -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2 soybean -30.87459 -51.72551 -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3 soybean -28.84481 -53.46126 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4 soybean -30.65571 -55.11848 -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5 soybean -28.85978 -53.61491 -0.6719911 -2.803353 -4.301278 -0.2094595
## 6 6 soybean -28.41732 -55.03171 -5.1542411 -7.344623 -10.879841 2.6003060
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
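By default the new columns created by `separate` are character columns, even when they hold numbers. A sketch on a made-up toy data frame shows the `convert = TRUE` argument, which converts the new columns to a more suitable type:

```r
library(tidyr)

# Toy united column (illustrative values).
toy_united <- data.frame(n_class = c("1_soybean", "2_soybean", "96_corn"))

# convert = TRUE turns "1", "2", "96" back into integers.
parts <- toy_united |>
  separate(n_class, c("n", "class"), sep = "_", convert = TRUE)
str(parts)
```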

Gather

`gather` function gathers columns into key-value pairs.

When we would like to gather several columns into two new columns, one holding the original column names and the other holding the respective values, we can use the `gather` function.

data |> gather(key="Feature",value="Value",b0_GCVI:b4_GCVI) -> data_gathered
data_gathered |> head(6)
## ï..n latitude longitude class Feature Value
## 1 1 -28.59329 -52.64978 soybean b0_GCVI -1.4882439
## 2 2 -30.87459 -51.72551 soybean b0_GCVI -1.0790943
## 3 3 -28.84481 -53.46126 soybean b0_GCVI 0.1690087
## 4 4 -30.65571 -55.11848 soybean b0_GCVI -0.9242190
## 5 5 -28.85978 -53.61491 soybean b0_GCVI -0.6719911
## 6 6 -28.41732 -55.03171 soybean b0_GCVI -5.1542411

Spread

`spread` function spreads a key-value pair across multiple columns.

When we need to go back from a key-value pair of columns to one column per key, we can use the `spread` function.

data_gathered %>% spread(key = Feature,value = Value) |> head(6)
## ï..n latitude longitude class b0_GCVI b1_GCVI b2_GCVI b3_GCVI
## 1 1 -28.59329 -52.64978 soybean -1.4882439 -3.811363 -6.338071 1.2918025
## 2 2 -30.87459 -51.72551 soybean -1.0790943 -1.828730 -4.837896 2.0158091
## 3 3 -28.84481 -53.46126 soybean 0.1690087 -2.050998 -2.741627 -1.1819744
## 4 4 -30.65571 -55.11848 soybean -0.9242190 -2.597677 -4.246418 0.8886908
## 5 5 -28.85978 -53.61491 soybean -0.6719911 -2.803353 -4.301278 -0.2094595
## 6 6 -28.41732 -55.03171 soybean -5.1542411 -7.344623 -10.879841 2.6003060
## b4_GCVI
## 1 -4.631691
## 2 -2.661626
## 3 -2.707749
## 4 -3.352147
## 5 -3.434427
## 6 -7.331731
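In tidyr 1.0.0 and later, `gather` and `spread` are superseded by `pivot_longer` and `pivot_wider`. The older functions still work, but the same round trip can be sketched with the newer ones on a made-up toy data frame:

```r
library(tidyr)

# Toy data (illustrative values).
toy <- data.frame(
  n       = 1:2,
  class   = c("soybean", "corn"),
  b0_GCVI = c(-1.49, 5.63),
  b1_GCVI = c(-3.81, 1.42)
)

# Long format: one row per (observation, feature) pair.
long <- toy |>
  pivot_longer(b0_GCVI:b1_GCVI, names_to = "Feature", values_to = "Value")

# Back to wide format: one column per feature.
wide <- long |>
  pivot_wider(names_from = Feature, values_from = Value)
wide
```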

Final

These are the most common functions for data wrangling with the tidyverse package collection.

R - Data visualization, linear regression and logistic regression

Thu, 23 Sep 2021 00:00:00 +0000

Data visualization, linear regression and logistic regression

Crop data collection - ground truth and remote sensing

Loading the packages

library(tidyverse)
library(tidymodels)
library(sf)
library(geobr)

URL link of data from data collection and remote sensing on GitHub https://github.com/luanpott10/Class

You will need the raw file https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv

Loading the data

data <- read.csv("https://raw.githubusercontent.com/luanpott10/Class/main/data_crops.csv")

Plot the data on the map

cities_RS <- read_municipality(code_muni = "RS", year= 2020)
ggplot()+
geom_sf(data=cities_RS)+
geom_point(data=data,aes(x=longitude,y=latitude,fill=class),shape=22,size=2)+
labs(x= "Longitude", y = "Latitude")+
scale_fill_manual(values = c("#eded0c", "#49a345"))+
theme(legend.position = c(0.17, 0.2),
panel.border = element_rect(color="Black", fill = NA),
panel.background = element_rect(fill = "#f2f2f2"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.x = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.x = element_text(family = "serif",
colour = "#000000",size = 18.0),
axis.title.y = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.y = element_text(family = "serif",
colour = "#000000",size = 18.0),
legend.title = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.text = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.background = element_rect(fill="#f2f2f2",
linetype="dashed",
colour ="#f2f2f2"))

Plot the data variables - continuous x continuous

ggplot(data=data,aes(x=b3_GCVI,y=b4_GCVI))+
geom_point(aes(fill=class),shape=22,size=2)+
scale_fill_manual(values = c("#eded0c", "#49a345"))+
stat_smooth(formula = y~x, method="lm", se=FALSE,color="black", linetype='dashed')+
theme(panel.border = element_rect(color="Black", fill = NA),
panel.background = element_rect(fill = "#f2f2f2"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.x = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.x = element_text(family = "serif",
colour = "#000000",size = 18.0),
axis.title.y = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.y = element_text(family = "serif",
colour = "#000000",size = 18.0),
legend.title = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.text = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.background = element_rect(fill="#f2f2f2",
linetype="dashed",
colour ="#f2f2f2"))

Linear regression

lm_fit_x <- lm(b3_GCVI ~ b4_GCVI, data = data)
summary(lm_fit_x)
##
## Call:
## lm(formula = b3_GCVI ~ b4_GCVI, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7924 -0.9983 -0.0018 0.9525 3.9586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73701 0.15933 -4.626 1.14e-05 ***
## b4_GCVI -0.49793 0.04352 -11.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 98 degrees of freedom
## Multiple R-squared: 0.5718, Adjusted R-squared: 0.5675
## F-statistic: 130.9 on 1 and 98 DF, p-value: < 2.2e-16

Linear regression by tidymodels workflow

Creating a parsnip specification for a linear regression model

lm_model <- linear_reg() |>
set_engine('lm') |>
set_mode('regression')

Fitting the model supplying a formula expression and the data

lm_fit <- lm_model %>%
fit(b3_GCVI ~ b4_GCVI, data = data)

Summary of the model

lm_fit |>
pluck("fit") |>
summary()
##
## Call:
## stats::lm(formula = b3_GCVI ~ b4_GCVI, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7924 -0.9983 -0.0018 0.9525 3.9586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73701 0.15933 -4.626 1.14e-05 ***
## b4_GCVI -0.49793 0.04352 -11.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 98 degrees of freedom
## Multiple R-squared: 0.5718, Adjusted R-squared: 0.5675
## F-statistic: 130.9 on 1 and 98 DF, p-value: < 2.2e-16
# You can also use
lm_fit$fit |> summary()
##
## Call:
## stats::lm(formula = b3_GCVI ~ b4_GCVI, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7924 -0.9983 -0.0018 0.9525 3.9586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.73701 0.15933 -4.626 1.14e-05 ***
## b4_GCVI -0.49793 0.04352 -11.440 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 98 degrees of freedom
## Multiple R-squared: 0.5718, Adjusted R-squared: 0.5675
## F-statistic: 130.9 on 1 and 98 DF, p-value: < 2.2e-16

Parameter estimates of the lm object

tidy(lm_fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.737 0.159 -4.63 1.14e- 5
## 2 b4_GCVI -0.498 0.0435 -11.4 9.39e-20

Extract the model statistics

glance(lm_fit)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.572 0.567 1.47 131. 9.39e-20 1 -180. 365. 373.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
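The coefficient table can be read as the equation of the fitted line, b3_GCVI = intercept + slope * b4_GCVI. As a small worked sketch, the coefficients below are copied by hand from the printed summary (so they are rounded), and `predict_b3` is a hypothetical helper name:

```r
# Coefficients copied from the summary output (rounded as printed).
intercept <- -0.73701
slope     <- -0.49793

# Predicted b3_GCVI for a given b4_GCVI, following the fitted line.
predict_b3 <- function(b4_GCVI) intercept + slope * b4_GCVI

predict_b3(c(-4, 0, 4))
```

The negative slope means that b3_GCVI decreases by about 0.5 for each unit increase in b4_GCVI. With the fitted object itself, `predict()` gives the same kind of result without copying numbers by hand.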

Plot the data variables - categorical x continuous

data <- data |> mutate(target = case_when(class == "corn" ~ 0,
class == "soybean" ~ 1))
ggplot(data=data, aes(y=target,x=b4_GCVI))+
geom_point(aes(fill=as.factor(target)),shape=22,size=2) +
scale_fill_manual(values = c("#eded0c", "#49a345"))+
scale_y_continuous(breaks=c(0,1),
labels=c("0","1"),
limits=c(0,1))+
stat_smooth(formula = y~x, method="glm", se=FALSE, method.args = list(family=binomial),
color="black", linetype='dashed')+
labs(x= "b4_GCVI", y = "Class",fill="Class")+
theme(legend.position = c(0.17, 0.2),
panel.border = element_rect(color="Black", fill = NA),
panel.background = element_rect(fill = "#f2f2f2"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.title.x = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.x = element_text(family = "serif",
colour = "#000000",size = 18.0),
axis.title.y = element_text(family = "serif",
colour = "#000000",size = 22.0),
axis.text.y = element_text(family = "serif",
colour = "#000000",size = 18.0),
legend.title = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.text = element_text(family = "serif",
colour = '#000000',size = 12.0),
legend.background = element_rect(fill="#f2f2f2",
linetype="dashed",
colour ="#f2f2f2"))

Logistic regression

lg_fit_x <- glm(as.factor(class) ~ b4_GCVI, family="binomial", data=data)
summary(lg_fit_x)
##
## Call:
## glm(formula = as.factor(class) ~ b4_GCVI, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.91207 -0.04298 0.01134 0.32227 1.86651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5734 1.1155 -3.203 0.00136 **
## b4_GCVI -1.5677 0.3778 -4.150 3.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 42.379 on 98 degrees of freedom
## AIC: 46.379
##
## Number of Fisher Scoring iterations: 7

Logistic regression by tidymodels workflow

Creating a parsnip specification for a logistic regression model

lg_model <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")

Fitting the model supplying a formula expression and the data

lg_fit <- lg_model |>
fit(as.factor(class) ~ b4_GCVI, data = data)

Summary of the model

lg_fit |>
pluck("fit") |>
summary()
##
## Call:
## stats::glm(formula = as.factor(class) ~ b4_GCVI, family = stats::binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.91207 -0.04298 0.01134 0.32227 1.86651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5734 1.1155 -3.203 0.00136 **
## b4_GCVI -1.5677 0.3778 -4.150 3.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 42.379 on 98 degrees of freedom
## AIC: 46.379
##
## Number of Fisher Scoring iterations: 7
# You can also use
lg_fit$fit |> summary()
##
## Call:
## stats::glm(formula = as.factor(class) ~ b4_GCVI, family = stats::binomial,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.91207 -0.04298 0.01134 0.32227 1.86651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.5734 1.1155 -3.203 0.00136 **
## b4_GCVI -1.5677 0.3778 -4.150 3.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 138.629 on 99 degrees of freedom
## Residual deviance: 42.379 on 98 degrees of freedom
## AIC: 46.379
##
## Number of Fisher Scoring iterations: 7

Parameter estimates of the glm object

tidy(lg_fit)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -3.57 1.12 -3.20 0.00136
## 2 b4_GCVI -1.57 0.378 -4.15 0.0000333

Extract the model statistics

glance(lg_fit)
## # A tibble: 1 x 8
## null.deviance df.null logLik AIC BIC deviance df.residual nobs
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 139. 99 -21.2 46.4 51.6 42.4 98 100
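The logistic coefficients are on the log-odds scale. Since `as.factor(class)` orders the levels alphabetically (corn, then soybean), the model describes the probability of soybean. A small worked sketch, with the coefficients copied by hand from the printed summary and `prob_soybean` as a hypothetical helper name:

```r
# Coefficients copied from the summary output (rounded as printed).
intercept <- -3.5734
slope     <- -1.5677

# plogis() is the inverse logit: it maps log-odds to a probability.
prob_soybean <- function(b4_GCVI) plogis(intercept + slope * b4_GCVI)

# Value of b4_GCVI where the predicted probability is exactly 0.5.
boundary <- -intercept / slope

prob_soybean(c(-4, boundary, 2))
```

Values of b4_GCVI below the boundary give a probability of soybean above 0.5, and values above it point to corn, which agrees with the negative sign of the slope.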

R - Cultivated crop maps in Rio Grande do Sul, Brazil

Wed, 22 Sep 2021 00:00:00 +0000

Generating cultivated crop maps

Crop plantation area from IBGE

Loading the packages

library(readxl)
library(tidyverse)
library(geobr)
library(patchwork)

URL link of data from IBGE on GitHub https://github.com/luanpott10/Class

You will need the raw file https://raw.githubusercontent.com/luanpott10/Class/main/tabela1612_1.csv

Load the data and data wrangling

The file downloaded from IBGE has metadata in the first 4 lines that are not of interest to us, so we skip them with `skip = 4`.

There is also extra information in the last rows, so we keep only the 497 municipalities with `data[1:497,]`.

Furthermore, for rows (cities) with no planted area, IBGE uses “-” or “…” instead of 0, so we replace those values with 0 using the `case_when` function.

data <- read_csv('https://raw.githubusercontent.com/luanpott10/Class/main/tabela1612_1.csv', skip = 4)
data <- data[1:497,]
colnames(data) <- c("code","city","rice","corn","soybean")
data <- data |>
mutate(rice_crop = case_when(rice == "..." | rice == "-" ~ 0,
TRUE ~ as.double(rice))) |>
mutate(corn_crop = case_when(corn == "..." | corn == "-" ~ 0,
TRUE ~ as.double(corn))) |>
mutate(soybean_crop = case_when(soybean == "..." | soybean == "-" ~ 0,
TRUE ~ as.double(soybean)))
data <- data |> select(code,city,rice_crop,corn_crop,soybean_crop)
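Because `case_when` evaluates `as.double()` on the whole column, including the "-" and "..." entries, the code above may print "NAs introduced by coercion" warnings even though the final result is correct. A possible alternative, sketched on a made-up toy data frame with a hypothetical helper `to_area`, replaces the placeholders before converting and handles several columns at once with `across`:

```r
library(dplyr)

# Toy table in the same style as the IBGE file (illustrative values).
toy_ibge <- data.frame(
  city    = c("A", "B", "C"),
  rice    = c("120", "-", "..."),
  soybean = c("...", "35000", "7")
)

# Replace the placeholders by "0" first, then convert to numeric.
to_area <- function(x) {
  x[x %in% c("-", "...")] <- "0"
  as.double(x)
}

# .names creates rice_crop and soybean_crop next to the original columns.
cleaned <- toy_ibge |>
  mutate(across(c(rice, soybean), to_area, .names = "{.col}_crop"))
cleaned
```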

Municipality boundaries of Rio Grande do Sul state from the geobr package

cities_RS <- read_municipality(code_muni = "RS", year= 2020)

Plotting the cities with ggplot

ggplot()+
geom_sf(data=cities_RS)

Joining the IBGE data and the sf object

cities_RS$code_muni <- as.character(cities_RS$code_muni)
data_x <- left_join(cities_RS,data,by= c("code_muni"="code"))
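Two details of this join are worth a small sketch: the key columns have different names in the two tables, so `by` is written as `c("name_in_x" = "name_in_y")`, and the keys must have the same type, which is why `code_muni` is converted to character first. The toy tables and municipality codes below are made up for illustration and use plain data frames instead of an sf object:

```r
library(dplyr)

# Toy tables (hypothetical codes and values).
toy_cities <- data.frame(code_muni = c(1001, 1002, 1003), name = c("A", "B", "C"))
toy_ibge   <- data.frame(code = c("1001", "1003"), soybean_crop = c(500, 0))

# Make both keys character before joining.
toy_cities$code_muni <- as.character(toy_cities$code_muni)

# Left join keeps every city; cities without a match get NA.
joined <- left_join(toy_cities, toy_ibge, by = c("code_muni" = "code"))
joined
```

Cities that end up with `NA` after the join are the ones the maps draw with the `na.value` colour.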

Palettes

pal_soybean <- c('#252525','#ccece6','#99d8c9','#66c2a4','#41ae76','#238b45','#006d2c','#00441b')
pal_corn <- c('#252525','#f7ffa8','#EFFD5F','#FCE205','#FCD12A','#FFC30B','#F9A602','#c48302')
pal_rice <- c('#252525','#caebfc','#9ecae1','#6baed6','#4292c6','#2171b5','#084594','#082954')

Soybean map

(soybean_map <-
  ggplot() +
  geom_sf(data = data_x, aes(fill = soybean_crop)) +
  theme_minimal() +
  scale_fill_gradientn(colours = pal_soybean,
                       limits = c(0, 150000),
                       na.value = '#252525') +
  labs(x = "Longitude", y = "Latitude", title = "Soybean") +
  guides(fill = guide_colorbar(title = "Crop area (ha)", barwidth = 1, barheight = 4,
                               frame.colour = "black")) +
  theme(legend.position = c(0.17, 0.2),
        panel.border = element_rect(color = "Black", fill = NA),
        panel.background = element_rect(fill = "#f2f2f2"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.x = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.x = element_text(family = "serif", colour = "#000000", size = 18.0),
        axis.title.y = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.y = element_text(family = "serif", colour = "#000000", size = 18.0),
        legend.title = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.text = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.background = element_rect(fill = "#f2f2f2",
                                         linetype = "dashed",
                                         colour = "#f2f2f2")))
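One detail of `limits` is worth keeping in mind: by default, a continuous ggplot2 scale treats values outside the limits as missing, so they are drawn with the `na.value` colour. The sketch below reproduces that default behaviour with `scales::censor` on a few invented areas (the numbers are assumptions, chosen only to show the effect).

```r
library(scales)

# Invented crop areas in hectares, for illustration only
areas <- c(0, 80000, 150000, 200000, NA)

# Values outside the limits become NA, as in the default scale behaviour
censored <- censor(areas, range = c(0, 150000))
censored
```

If a municipality could exceed the upper limit, one option is to pass `oob = scales::squish` to `scale_fill_gradientn`, which maps such values to the nearest limit instead of turning them into `NA`.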

Corn map

(corn_map <-
  ggplot() +
  geom_sf(data = data_x, aes(fill = corn_crop)) +
  theme_minimal() +
  scale_fill_gradientn(colours = pal_corn,
                       limits = c(0, 15000),
                       na.value = '#252525') +
  labs(x = "Longitude", y = "Latitude", title = "Corn") +
  guides(fill = guide_colorbar(title = "Crop area (ha)", barwidth = 1, barheight = 4,
                               frame.colour = "black")) +
  theme(legend.position = c(0.17, 0.2),
        panel.border = element_rect(color = "Black", fill = NA),
        panel.background = element_rect(fill = "#f2f2f2"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.x = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.x = element_text(family = "serif", colour = "#000000", size = 18.0),
        axis.title.y = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.y = element_text(family = "serif", colour = "#000000", size = 18.0),
        legend.title = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.text = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.background = element_rect(fill = "#f2f2f2",
                                         linetype = "dashed",
                                         colour = "#f2f2f2")))

Rice map

(rice_map <-
  ggplot() +
  geom_sf(data = data_x, aes(fill = rice_crop)) +
  theme_minimal() +
  scale_fill_gradientn(colours = pal_rice,
                       limits = c(0, 75000),
                       na.value = '#252525') +
  labs(x = "Longitude", y = "Latitude", title = "Rice") +
  guides(fill = guide_colorbar(title = "Crop area (ha)", barwidth = 1, barheight = 4,
                               frame.colour = "black")) +
  theme(legend.position = c(0.17, 0.2),
        panel.border = element_rect(color = "Black", fill = NA),
        panel.background = element_rect(fill = "#f2f2f2"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.title.x = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.x = element_text(family = "serif", colour = "#000000", size = 18.0),
        axis.title.y = element_text(family = "serif", colour = "#000000", size = 22.0),
        axis.text.y = element_text(family = "serif", colour = "#000000", size = 18.0),
        legend.title = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.text = element_text(family = "serif", colour = '#000000', size = 12.0),
        legend.background = element_rect(fill = "#f2f2f2",
                                         linetype = "dashed",
                                         colour = "#f2f2f2")))

Crop maps

soybean_map + corn_map + rice_map
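Since the three maps differ only in the column, palette, upper limit and title, the shared part could be wrapped in a small function. The sketch below is one possible way to do it, not the code used for the maps above: `crop_map()` and the two-square `toy` object are hypothetical names and invented data, and the long `theme()` call is left out to keep the example short.

```r
library(ggplot2)
library(sf)

# Hypothetical helper: builds one choropleth from a column name, palette and limit
crop_map <- function(data, column, palette, upper, title) {
  ggplot() +
    geom_sf(data = data, aes(fill = .data[[column]])) +
    theme_minimal() +
    scale_fill_gradientn(colours = palette,
                         limits = c(0, upper),
                         na.value = "#252525") +
    labs(x = "Longitude", y = "Latitude", title = title) +
    guides(fill = guide_colorbar(title = "Crop area (ha)"))
}

# Invented geometry: two unit squares standing in for municipalities
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}
toy <- st_sf(soybean_crop = c(1000, 50000),
             geometry = st_sfc(square(0), square(1)))

pal <- c("#252525", "#ccece6", "#66c2a4", "#00441b")
toy_map <- crop_map(toy, "soybean_crop", pal, 150000, "Soybean")
toy_map
```

With the real data, a call such as `crop_map(data_x, "corn_crop", pal_corn, 15000, "Corn")` would follow the same pattern, and the shared `theme()` settings could be added once inside the function.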


R - Data visualization, linear regression and logistic regression
