GIS Machine Learning With R: An Overview

Stephen Chege-Tierra Insights
Published in Towards AI
6 min read · May 1, 2024


Created by the author with DALL·E 3

R has become well suited to GIS, and especially to GIS machine learning, because it offers top-notch libraries for geospatial computation. R simplifies even the most complex geospatial machine learning tasks.

In this piece, we look at tips and tricks for running common GIS machine learning algorithms, whatever your level of expertise, whether you are a complete beginner or a seasoned geospatial machine learning practitioner.

We will cover several machine learning algorithms, such as decision trees, random forest, k-nearest neighbors, and naïve Bayes, and show how to call their libraries and execute the code in RStudio.

RStudio and GIS

In a previous article, I wrote about GIS and R, giving a brief background of R and its impact on GIS while touching on machine learning; you can read it here. In this article, I will briefly discuss R and GIS before we go deeper into machine learning.

R is open-source software best known for statistics and statistical computing, while Python is more of a general-purpose programming language used for a wide range of tasks; as a result, geospatial professionals, statisticians, and data analysts often prefer R for its robust statistical features.

Ross Ihaka and Robert Gentleman set out to develop a platform that would be more flexible, extensible, and user-friendly. Their goal was a programming language combining the power and flexibility of modern programming languages with the strengths of conventional statistical software, and the result was R.

R and Machine Learning

Machine learning is the field of computer science focused on creating algorithms that can learn. Typical machine learning tasks include concept learning, function learning (often called “predictive modeling”), clustering, and the discovery of predictive patterns. These tasks are learned from available data, such as observations, instructions, or past experience.

By incorporating experience into its tasks, machine learning aims to improve performance over time. The ultimate objective is for learning to become so effective that it is automatic, requiring little or no human intervention.

Advantages of Using R for Machine Learning

1. R offers clear and illustrative code - R can be easier to work with than Python if, for instance, you are just starting a machine learning project and need to explain the work you are doing, because R expresses statistical operations on data in fewer lines of code.

2. The R language is ideal for visualizing data - R provides an excellent environment for prototyping and inspecting machine learning models, and you can produce beautiful, interactive charts of model outputs from RStudio.

3. Robust library tools - The R language offers excellent tools and library packages for machine learning projects. These packages cover the full workflow, from pre-processing through modeling to post-processing, and in many statistical areas they are more specialized and comprehensive than their Python counterparts, which helps explain why R remains a popular choice for machine learning applications.

4. Integration with GIS and spatial analysis - R integrates smoothly with spatial data through packages such as sf, raster, and sp for machine learning applications linked to GIS. These packages let users process, analyze, and visualize spatial data alongside their machine learning tasks, which makes R a great option for geographic data science (see the short sketch after this list).

5. Reproducibility and documentation - Because R analyses are script-based, they are easy to repeat. Writing code in R scripts makes it simple to share code, document analyses, and reproduce results, which promotes transparency and collaboration in machine learning projects.

6. Community support - R has a dedicated community that provides help on GitHub, Stack Overflow, and other collaboration platforms whenever you need support or want to cross-check your work.
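
Here is a minimal sketch of points 2 and 4 in practice: reading a vector layer with sf and mapping one of its attributes with ggplot2. The file path and the land_cover_type column are placeholders for illustration, not data from this article.

# Read a vector layer and map one of its attributes
# (sketch only; "example.shp" and land_cover_type are placeholders)
library(sf)
library(ggplot2)

parcels <- st_read("example.shp")
ggplot(parcels) +
  geom_sf(aes(fill = land_cover_type)) + # colour features by an attribute column
  theme_minimal()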

Types of machine learning with R

Load machine learning libraries.


# Load required libraries
library(sf)     # for working with vector spatial data
library(raster) # for raster manipulation
library(rpart)  # decision trees
library(class)  # k-nearest neighbors
library(e1071)  # naive Bayes

1. Decision Tree and R

A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

Code snippet

# Load required libraries
library(sf)
library(raster)
library(rpart)

# Load spatial data (replace 'example.shp' with your file path)
spatial_data <- st_read("example.shp")

# Load raster data (replace 'example.tif' with your file path)
raster_data <- raster("example.tif")

# Extract raster values at spatial data points
spatial_data$raster_values <- extract(raster_data, spatial_data)

# Split data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(spatial_data), 0.7*nrow(spatial_data))
train_data <- spatial_data[train_indices, ]
test_data <- spatial_data[-train_indices, ]

# Specify the target variable and predictor variables
target_variable <- "land_cover_type"
predictor_variables <- c("predictor_var1", "predictor_var2", ...) # Add your predictor variables

# Build the decision tree model
# (drop the sf geometry column and keep the target plus the predictors)
train_df <- st_drop_geometry(train_data)[, c(target_variable, predictor_variables, "raster_values")]
test_df <- st_drop_geometry(test_data)[, c(target_variable, predictor_variables, "raster_values")]
train_df[[target_variable]] <- as.factor(train_df[[target_variable]])

decision_tree <- rpart(as.formula(paste(target_variable, "~ .")),
                       data = train_df, method = "class")

# Make predictions on the testing set
predictions <- predict(decision_tree, newdata = test_df, type = "class")

# Evaluate model performance (e.g., accuracy)
accuracy <- sum(predictions == test_df[[target_variable]]) / length(predictions)
print(paste("Accuracy:", accuracy))
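
The introduction also mentions random forest, which is essentially an ensemble of many decision trees. Here is a minimal sketch using the randomForest package, reusing the train_df and test_df data frames built above; the package choice and the ntree value are assumptions for illustration, not part of the original workflow.

# Random forest variant of the same classification workflow (a sketch)
library(randomForest)

# randomForest() treats a factor response as a classification problem
train_df[[target_variable]] <- as.factor(train_df[[target_variable]])

rf_model <- randomForest(as.formula(paste(target_variable, "~ .")),
                         data = train_df, ntree = 500)

rf_predictions <- predict(rf_model, newdata = test_df)
rf_accuracy <- sum(rf_predictions == test_df[[target_variable]]) / length(rf_predictions)
print(paste("Random forest accuracy:", rf_accuracy))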

2. K-nearest neighbor

The k-nearest-neighbors algorithm, or simply k-NN, is a classification method that assigns a data point to a group based on the groups of the data points nearest to it.

Code snippet

# Load required libraries
library(sf) # for working with spatial data
library(raster) # for raster computation
library(class) # for k-nearest neighbor

# Load spatial data (replace 'example.shp' with your file path)
spatial_data <- st_read("example.shp")

# Load raster data (replace 'example.tif' with your file path)
raster_data <- raster("example.tif")

# Extract raster values at spatial data points
spatial_data$raster_values <- extract(raster_data, spatial_data)

# Split data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(spatial_data), 0.7*nrow(spatial_data))
train_data <- spatial_data[train_indices, ]
test_data <- spatial_data[-train_indices, ]

# Specify the target variable and predictor variables
target_variable <- "land_cover_type"
predictor_variables <- c("predictor_var1", "predictor_var2", ...) # Add your predictor variables

# Train k-nearest neighbors classifier
# (knn() needs numeric predictor matrices, so drop the sf geometry column)
train_df <- st_drop_geometry(train_data)
test_df <- st_drop_geometry(test_data)

k <- 5 # Number of neighbors
knn_model <- knn(train = train_df[, c(predictor_variables, "raster_values")],
                 test = test_df[, c(predictor_variables, "raster_values")],
                 cl = as.factor(train_df[[target_variable]]),
                 k = k)

# Evaluate model performance (e.g., accuracy)
accuracy <- sum(knn_model == test_df[[target_variable]]) / length(knn_model)
print(paste("Accuracy:", accuracy))

3. Naive Bayes

Naive Bayes methods are a family of supervised learning algorithms based on Bayes’ theorem together with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
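
In symbols, for features x1, …, xn and a class y, the classifier scores each class as P(y) × P(x1 | y) × … × P(xn | y) and predicts the class with the highest score; this is simply a restatement of the definition above, not anything specific to the R implementation.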

Code Snippet

# Load required libraries
library(sf) # for working with spatial data
library(raster) # for raster manipulation
library(e1071) # for Naive Bayes classifier

# Load spatial data (replace 'example.shp' with your file path)
spatial_data <- st_read("example.shp")

# Load raster data (replace 'example.tif' with your file path)
raster_data <- raster("example.tif")

# Extract raster values at spatial data points
spatial_data$raster_values <- extract(raster_data, spatial_data)

# Split data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(spatial_data), 0.7*nrow(spatial_data))
train_data <- spatial_data[train_indices, ]
test_data <- spatial_data[-train_indices, ]

# Specify the target variable and predictor variables
target_variable <- "land_cover_type"
predictor_variables <- c("predictor_var1", "predictor_var2", ...) # Add your predictor variables

# Train Naive Bayes classifier
# (drop the sf geometry column and keep the target plus the predictors)
train_df <- st_drop_geometry(train_data)[, c(target_variable, predictor_variables, "raster_values")]
test_df <- st_drop_geometry(test_data)[, c(target_variable, predictor_variables, "raster_values")]
train_df[[target_variable]] <- as.factor(train_df[[target_variable]])

nb_model <- naiveBayes(as.formula(paste(target_variable, "~ .")),
                       data = train_df)

# Make predictions on the testing set
predictions <- predict(nb_model, newdata = test_df)

# Evaluate model performance (e.g., accuracy)
accuracy <- sum(predictions == test_df[[target_variable]]) / length(predictions)
print(paste("Accuracy:", accuracy))

4. Linear regression

Linear regression predicts the value of one variable from the value of one or more other variables. The dependent variable is the one you want to forecast; the independent variables are the ones you use to forecast it. Note that, unlike the classifiers above, linear regression needs a continuous target variable rather than a class label.

Code Snippet

# Load required libraries
library(sf) # for working with spatial data
library(raster) # for raster manipulation

# Load spatial data (replace 'example' with your file path)
spatial_data <- st_read("example.shp")

# Load raster data (replace 'example' with your file path)
raster_data <- raster("example.tif")

# Extract raster values at spatial data points
spatial_data$raster_values <- extract(raster_data, spatial_data)

# Split data into training and testing sets
set.seed(123) # for reproducibility
train_indices <- sample(1:nrow(spatial_data), 0.7*nrow(spatial_data))
train_data <- spatial_data[train_indices, ]
test_data <- spatial_data[-train_indices, ]

# Specify the target variable and predictor variables
# (linear regression needs a continuous, numeric target rather than a class label)
target_variable <- "target_value" # Replace with your continuous variable
predictor_variables <- c("predictor_var1", "predictor_var2", ...) # Add your predictor variables

# Train linear regression model
# (drop the sf geometry column and keep the target plus the predictors)
train_df <- st_drop_geometry(train_data)[, c(target_variable, predictor_variables, "raster_values")]
test_df <- st_drop_geometry(test_data)[, c(target_variable, predictor_variables, "raster_values")]
lm_model <- lm(as.formula(paste(target_variable, "~", paste(predictor_variables, collapse = " + "))),
               data = train_df)

# Make predictions on the testing set
predictions <- predict(lm_model, newdata = test_df)

# Evaluate model performance (e.g., RMSE)
rmse <- sqrt(mean((predictions - test_df[[target_variable]])^2))
print(paste("RMSE:", rmse))
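
To bring the result back into a GIS context, the predicted values can be attached to the spatial features and mapped. A minimal sketch, assuming test_data is still an sf object with its geometry intact:

# Attach predictions to the spatial features and draw a quick map
# (plot.sf colours the features by the chosen attribute column)
test_data$predicted <- predictions
plot(test_data["predicted"])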

In conclusion, machine learning with R and GIS empowers users to extract valuable insights from spatial data, make informed decisions, and address complex spatial challenges across domains such as environmental science, urban planning, agriculture, and public health. Through continuous innovation and community-driven development, R remains at the forefront of spatial data science, driving advances in both research and applied work. I will look at Python and GIS machine learning in the next article.
