16 Conclusion

Prerequisites

Like the introduction, this concluding chapter contains few code chunks. But its prerequisites are demanding. It assumes that you have:

  • read through and attempted the exercises in all the chapters of Part I (Foundations);
  • considered how you can use geocomputation to solve real-world problems, at work and beyond, after engaging with Part III (Applications).

16.1 Introduction

The aim of this chapter is to synthesize the contents, with reference to recurring themes/concepts, and to inspire future directions of application and development. Section 16.2 discusses the wide range of options for handling geographic data in R. Choice is a key feature of open source software; the section provides guidance on choosing between the various options. Section 16.3 describes gaps in the book’s contents and explains why some areas of research were deliberately omitted, while others were emphasized. This discussion leads to the question (which is answered in Section 16.5): having read this book, where to go next? Section 16.6 returns to the wider issues raised in Chapter 1. In it we consider geocomputation as part of a wider ‘open source approach’ that ensures methods are publicly accessible, reproducible and supported by collaborative communities. This final section of the book also provides some pointers on how to get involved.

16.2 Package choice

A characteristic of R is that there are often multiple ways to achieve the same result. The code chunk below illustrates this by using three functions, covered in Chapters 3 and 5, to combine the 16 regions of New Zealand into a single geometry:

library(spData)
nz_u1 = sf::st_union(nz)
nz_u2 = aggregate(nz["Population"], list(rep(1, nrow(nz))), sum)
nz_u3 = dplyr::summarise(nz, t = sum(Population))
identical(nz_u1, nz_u2$geometry)
#> [1] TRUE
identical(nz_u1, nz_u3$geom)
#> [1] TRUE

Although the classes, attributes and column names of the resulting objects nz_u1 to nz_u3 differ, their geometries are identical, as verified with the base R function identical(). Which to use? It depends: the first only processes the geometry data contained in nz, so it is faster, while the other options perform attribute operations, which may be useful for subsequent steps.

The wider point is that there are often multiple options to choose from when working with geographic data in R, even within a single package. The range of options grows further when more R packages are considered: you could achieve the same result using the older sp package, for example. We recommend using sf and the other packages showcased in this book, for reasons outlined in Chapter 2, but it’s worth being aware of alternatives and being able to justify your choice of software.

A common (and sometimes controversial) choice is between tidyverse and base R approaches. We cover both and encourage you to try both before deciding which is more appropriate for different tasks. The following code chunk, described in Chapter 3, shows how attribute data subsetting works in each approach, using the base R operator [ and the select() function from the tidyverse package dplyr. The syntax differs but the results are (in essence) the same:

library(dplyr)                          # attach tidyverse package
nz_name1 = nz["Name"]                   # base R approach
nz_name2 = nz |> select(Name)          # tidyverse approach
identical(nz_name1$Name, nz_name2$Name) # check results
#> [1] TRUE

Again the question arises: which to use? Again the answer is: it depends. Each approach has advantages: the pipe syntax is popular and appealing to some, while base R is more stable, and is well known to others. Choosing between them is therefore largely a matter of preference. However, if you do choose to use tidyverse functions to handle geographic data, beware of a number of pitfalls (see the supplementary article tidyverse-pitfalls on the website that supports this book).
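One such pitfall, sketched below under the assumption that sf, dplyr and spData are installed, is sf’s ‘sticky’ geometry: dplyr’s select() keeps the geometry column even when it is not selected, which can surprise users expecting a plain data frame:

```r
library(sf)      # provides the select() method for sf objects
library(dplyr)
library(spData)  # provides the nz dataset

nz_name = nz |> select(Name)
names(nz_name)           # the geometry column ("geom") is retained

nz_tbl = nz |>
  st_drop_geometry() |>  # drop geometry explicitly if a plain table is wanted
  select(Name)
names(nz_tbl)            # "Name" only
```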

While commonly needed operators/functions were covered in depth — such as the base R [ subsetting operator and the dplyr function filter() — there are many other functions for working with geographic data, from other packages, that have not been mentioned. Chapter 1 mentions 20+ influential packages for working with geographic data, and only a handful of these are demonstrated in subsequent chapters. There are hundreds more. As of mid-2022, there are about 200 packages mentioned in the Spatial Task View; more packages and countless functions for geographic data are developed each year, making it impractical to cover them all in a single book.

The rate of evolution in R’s spatial ecosystem may seem overwhelming, but there are strategies to deal with the wide range of options. Our advice is to learn one approach in depth while maintaining a general understanding of the breadth of options available. This advice applies as much to solving geographic problems in R (Section 16.5 covers developments in other languages) as it does to other fields of knowledge and application.

Of course, some packages perform much better than others, making package selection an important decision. From this diversity, we have focused on packages that are future-proof (they will work long into the future), high performance (relative to other R packages) and complementary. But there is still overlap in the packages we have used, as illustrated by the diversity of packages for making maps, for example (see Chapter 9).

Package overlap is not necessarily a bad thing. It can increase resilience, performance (partly driven by friendly competition and mutual learning between developers) and choice, a key feature of open source software. In this context the decision to use a particular approach, such as the sf/tidyverse/raster ecosystem advocated in this book, should be made with knowledge of alternatives. The sp/rgdal/rgeos ecosystem that sf is designed to supersede, for example, can do many of the things covered in this book and, due to its age, is built on by many other packages. Although best known for point pattern analysis, the spatstat package also supports raster and other vector geometries (Baddeley and Turner 2005). At the time of writing (October 2018), 69 packages depend on it, making it more than a package: spatstat is an alternative R-spatial ecosystem.

It is also worth being aware of promising alternatives that are under development. The package stars, for example, provides a new class system for working with spatiotemporal data. If you are interested in this topic, you can check for updates on the package’s source code and the broader SpatioTemporal Task View. The same principle applies to other domains: it is important to justify software choices and to review software decisions based on up-to-date information.

16.3 Gaps and overlaps

There are a number of gaps in, and some overlaps between, the topics covered in this book. We have been selective, emphasizing some topics while omitting others. We have tried to emphasize topics that are most commonly needed in real-world applications such as geographic data operations, projections, data read/write and visualization. These topics appear repeatedly in the chapters, a substantial area of overlap designed to consolidate these essential skills for geocomputation.

On the other hand, we have omitted topics that are less commonly used, or which are covered in-depth elsewhere. Statistical topics including point pattern analysis, spatial interpolation (kriging) and spatial epidemiology, for example, are only mentioned with reference to other topics such as the machine learning techniques covered in Chapter 12 (if at all). There is already excellent material on these methods, including statistically orientated chapters in Bivand, Pebesma, and Gómez-Rubio (2013) and a book on point pattern analysis by Baddeley, Rubak, and Turner (2015). Other topics which received limited attention were remote sensing and using R alongside (rather than as a bridge to) dedicated GIS software. There are many resources on these topics, including Wegmann, Leutner, and Dech (2016) and the GIS-related teaching materials available from Marburg University.

Instead of covering spatial statistical modeling and inference techniques, we focused on machine learning (see Chapters 12 and 15). Again, the reason was that there are already excellent resources on these topics, especially with ecological use cases, including A. Zuur et al. (2009), A. F. Zuur et al. (2017), the freely available teaching material and code on Geostatistics & Open-source Statistical Computing by David Rossiter, hosted at css.cornell.edu/faculty/dgr2, and the R for Geographic Data Science project by Stefano De Sabbata at the University of Leicester. There are also excellent resources on spatial statistics using Bayesian modeling, a powerful framework for modeling and uncertainty estimation (Blangiardo and Cameletti 2015; Krainski et al. 2018).

We have largely omitted geocomputation on ‘big data’, by which we mean datasets that do not fit on consumer hardware or which cannot realistically be processed on a single CPU. This decision is justified by two facts: the majority of geographic datasets needed for common research or policy applications do fit on consumer hardware, even if that may mean increasing the amount of RAM on your computer (or temporarily ‘renting’ compute power on platforms such as GitHub Codespaces); and learning to solve problems on small datasets is a prerequisite to solving problems on huge datasets. Analysis of ‘big data’ often involves extracting a small amount of data from a database for a specific statistical analysis. Spatial databases, covered in Chapter 10, can help with the analysis of datasets that do not fit in memory. ‘Earth observation cloud back-ends’ can be accessed from R with the openeo package, as described on the openeo.org website. We omitted detailed coverage of geographic analysis of big data with software such as Apache Sedona because the hardware and time costs of setting up such systems are high, given their relatively niche use cases.
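The underlying principle — let the database do the filtering so that only a small subset travels to R — can be sketched with an in-memory SQLite database standing in for a large remote database (the DBI and RSQLite packages, and the table and column names, are assumptions for illustration; for spatial data, sf::st_read() accepts a DBI connection and a query in the same way):

```r
library(DBI)

# In-memory database standing in for a large remote (spatial) database
con = dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "world", data.frame(
  name = c("A", "B", "C"),
  pop  = c(5e6, 2e7, 8e7)
))

# The WHERE clause runs in the database; only the matching rows reach R
big = dbGetQuery(con, "SELECT name, pop FROM world WHERE pop > 10000000")
dbDisconnect(con)
big
```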

16.4 Getting help?

Geocomputation is a large field, and it is highly likely that you will encounter challenges preventing you from achieving an outcome you are aiming for. In many cases you may just ‘get stuck’ at a particular point in a data analysis workflow, with a cryptic error message or an unexpected result giving few clues as to what is going on. This section provides pointers to help you overcome such problems: by clearly defining the problem, searching for existing knowledge on solutions and, if those approaches do not solve the problem, through the art of asking good questions.

When you get stuck at a particular point, it is worth first taking a step back and working out which approach is most likely to solve the issue. Trying each of the following steps, in order (or skipping steps if you have already tried them), provides a structured approach to problem solving:

  1. Define exactly what you are trying to achieve, starting from first principles (and often a sketch, as outlined below)
  2. Diagnose exactly where in your code the unexpected results arise, by running and exploring the outputs of individual lines of code and their individual components (you can run individual parts of a complex command by selecting them with a cursor and pressing Ctrl+Enter in RStudio, for example)
  3. Read the documentation of the function that has been diagnosed as the ‘point of failure’ in the previous step. Simply understanding the required inputs to functions, and running the examples that are often provided at the bottom of help pages, can help solve a surprisingly large proportion of issues (run the command ?terra::rast and scroll down to the examples that are worth reproducing when getting started with the function, for example)
  4. If reading R’s inbuilt documentation, as outlined in the previous step, does not help solve the problem, it is probably time to do a broader search online to see if others have written about the issue you’re seeing. See the list of places to search for help below
  5. If all the previous steps above fail, and you cannot find a solution from your online searches, it may be time to compose a question with a reproducible example and post it in an appropriate place
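Steps 2 and 3 can be illustrated with a deliberately simple base R example — the NA returned by mean() stands in for any ‘unexpected result’, and is resolved by reading the function’s documentation:

```r
# Step 2: diagnose by running an individual part of a larger pipeline
x = c(1, 2, NA)
mean(x)                # unexpected result: NA

# Step 3: read the documentation of the 'point of failure';
# ?mean reveals the na.rm argument and runnable examples
mean(x, na.rm = TRUE)  # the fix: ignore missing values
example("mean")        # run the examples from the help page
```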

Steps 1 to 3 outlined above are fairly self-explanatory but, due to the vastness of the internet and multitude of search options, it is worth considering effective search strategies before deciding to compose a question.

16.4.1 Searching for solutions online

A logical place to start for many issues is search engines. ‘Googling it’ can in some cases result in the discovery of blog posts, forum messages and other online content about the precise issue you’re having. Simply typing in a clear description of the problem/question is a valid approach here but it is important to be specific (e.g. with reference to function and package names and input dataset sources if the problem is dataset specific). You can also make online searches more effective by including additional detail:

  • Use quote marks to maximise the chances that ‘hits’ relate to the exact issue you’re having, by reducing the number of results returned
  • Set time restraints: for example, returning only content created within the last year can be useful when searching for help on an evolving package
  • Make use of additional search engine features, for example restricting searches to content hosted on CRAN with site:r-project.org

16.4.2 Places to search for (and ask for) help

  • R’s Special Interest Group on Geographic data email list (R-SIG-GEO)
  • The GIS Stackexchange website at gis.stackexchange.com
  • The large and general purpose programming Q&A site stackoverflow.com
  • Online forums associated with a particular entity, such as the RStudio Community, the rOpenSci Discuss web forum and forums associated with particular software tools such as the Stan forum
  • Software development platforms such as GitHub, which hosts issue trackers for the majority of R-spatial packages and also, increasingly, inbuilt discussion pages such as that created to encourage discussion (not just bug reporting) around the sfnetworks package (see luukvdmeer/sfnetworks/discussions)

16.4.3 How to ask a good question with a reproducible example

In terms of asking a good question, a clearly stated question supported by an accessible and fully reproducible example is key. It is also helpful, after showing the code that ‘did not work’ from the user’s perspective, to explain what you would like to see. A very useful tool for creating reproducible examples is the reprex package.
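A minimal reproducible example, of the kind reprex::reprex() can render with its outputs for posting, might look like the following (the data and the join question are invented for illustration):

```r
# Small, self-contained data that reproduces the problem
df1 = data.frame(id = c("a", "b", "c"), x = 1:3)
df2 = data.frame(ID = c("a", "b"), y = 4:5)

# What I tried: I expected three rows (one per row of df1) but got two
merge(df1, df2, by.x = "id", by.y = "ID")

# What I would like to see: all rows of df1 kept, with NA where df2 has no match
merge(df1, df2, by.x = "id", by.y = "ID", all.x = TRUE)
```

Wrapping code like this with reprex::reprex() produces a rendered version, including outputs, ready to paste into a forum post or issue tracker.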

16.4.4 Defining and sketching the problem

The best starting point when developing a new geocomputational methodology or approach is often a pen and paper (or equivalent digital sketching tools such as Excalidraw and tldraw, which allow collaborative sketching and rapid sharing of ideas): during the most creative early stages of methodological development work, software of any kind can slow down your thoughts and direct your thinking away from important abstract ideas. Framing the question mathematically is also highly recommended, with reference to a minimal example for which you can sketch ‘before and after’ versions numerically. If you have the skills and the problem warrants it, describing the approach algebraically can in some cases help develop effective implementations.

16.6 The open source approach

This is a technical book so it makes sense for the next steps, outlined in the previous section, to also be technical. However, there are wider issues worth considering in this final section, which returns to our definition of geocomputation. One of the elements of the term introduced in Chapter 1 was that geographic methods should have a positive impact. Of course, how to define and measure ‘positive’ is a subjective, philosophical question, beyond the scope of this book. Regardless of your worldview, consideration of the impacts of geocomputational work is a useful exercise: the potential for positive impacts can provide a powerful motivation for future learning and, conversely, new methods can open-up many possible fields of application. These considerations lead to the conclusion that geocomputation is part of a wider ‘open source approach’.

Section 1.1 presented other terms that mean roughly the same thing as geocomputation, including geographic data science (GDS) and ‘GIScience’. Both capture the essence of working with geographic data, but geocomputation has advantages: it concisely captures the ‘computational’ way of working with geographic data advocated in this book — implemented in code and therefore encouraging reproducibility — and builds on desirable ingredients of its early definition (Openshaw and Abrahart 2000):

  • The creative use of geographic data
  • Application to real-world problems
  • Building ‘scientific’ tools
  • Reproducibility

We added the final ingredient: reproducibility was barely mentioned in early work on geocomputation, yet a strong case can be made for it being a vital component of the first two ingredients. Reproducibility

  • encourages creativity by shifting the focus away from the basics (which are readily available through shared code) and towards applications;
  • discourages people from ‘reinventing the wheel’: there is no need to re-do what others have done if their methods can be used by others; and
  • makes research more conducive to real world applications, by enabling anyone in any sector to apply your methods in new areas.

If reproducibility is the defining asset of geocomputation (or command-line GIS) it is worth considering what makes it reproducible. This brings us to the ‘open source approach’, which has three main components:

  • A command-line interface (CLI), encouraging scripts recording geographic work to be shared and reproduced
  • Open source software, which can be inspected and potentially improved by anyone in the world
  • An active developer community, which collaborates and self-organizes to build complementary and modular tools

Like the term geocomputation, the open source approach is more than a technical entity. It is a community composed of people interacting daily with shared aims: to produce high performance tools, free from commercial or legal restrictions, that are accessible for anyone to use. The open source approach to working with geographic data has advantages that transcend the technicalities of how the software works, encouraging learning, collaboration and an efficient division of labor.

There are many ways to engage in this community, especially with the emergence of code hosting sites, such as GitHub, which encourage communication and collaboration. A good place to start is simply browsing through some of the source code, ‘issues’ and ‘commits’ in a geographic package of interest. A quick glance at the r-spatial/sf GitHub repository, which hosts the code underlying the sf package, shows that 40+ people have contributed to the codebase and documentation. Dozens more people have contributed by asking questions and by contributing to ‘upstream’ packages that sf uses. More than 600 issues have been closed on its issue tracker, representing a huge amount of work to make sf faster, more stable and user-friendly. This example, from just one package out of dozens, shows the scale of the intellectual operation underway to make R a highly effective and continuously evolving language for geocomputation.

It is instructive to watch the incessant development activity happen in public fora such as GitHub, but it is even more rewarding to become an active participant. This is one of the greatest features of the open source approach: it encourages people to get involved. This book itself is a result of the open source approach: it was motivated by the amazing developments in R’s geographic capabilities over the last two decades, but made practically possible by dialogue and code sharing on platforms for collaboration. We hope that in addition to disseminating useful methods for working with geographic data, this book inspires you to take a more open source approach. Whether it’s raising a constructive issue alerting developers to problems in their package; making the work done by you and the organizations you work for open; or simply helping other people by passing on the knowledge you’ve learned, getting involved can be a rewarding experience.