This tutorial provides an introductory overview on how to integrate the R programming language into Geomatica in order to leverage the many statistical, graphical, and geospatial packages available in R. In this tutorial, you will learn how to install an R interface (rpy2) to communicate with Python, how to operate and manage data across Python and R, and how to call basic R functionality. The aim of this tutorial is to demonstrate how Geomatica, Python, and R can be combined to build powerful geospatial applications.
The R programming language was first released in 1993, and since then the number of available packages it offers has grown considerably. The source of its considerable growth came from its capability to easily generate and integrate user-created content. There now exists a large international community across academia and industry that contributes to extending its functionality. As a result, R now provides its users access to an enormous amount of statistical, mathematical, graphical, and other types of tools. This tutorial will teach you the basics on how to access those tools to build your own powerful geospatial applications.
- Geomatica 2016 or later installed
- Python 101 (basic understanding of python coding concepts)
- try, except, finally statements
- How to access modules and classes
- How to call a function
- R (most recent version)
- R 101 (basic understanding of R coding concepts)
- Importing R packages
- Knowledgeable of R data structures
- Knowledgeable of basic R functionality
- Some experience programming with Geomatica (recommended)
- Click Here to download the data required for this tutorial
To demonstrate how Geomatica, Python, and R can be combined to build powerful geospatial applications, this tutorial will demonstrate how to process Sentinel-2 imagery with R functions accessed from Python. More specifically, Sentinel-2 imagery will be accessed in Python using Geomatica’s Python API and NumPy, then prepared and processed in R using rpy2. The rpy2 interface enables users to embed R within Python that provides the capability to access and execute R functionality.
1. Configure & install rpy2 (Windows only)
The first step is to prepare your system for the installation of rpy2. This involves setting two new environment variables, R_HOME and R_USER. Those environment variables are needed to ensure rpy2 can be successfully installed.
Begin by opening the Environment Variables panel in Windows, as seen below.
The first environment variable to set is the R_USER user variable. Create a new user variable called R_USER and specify the location of your Windows’ user folder.
The second environment variable to set is the R_HOME system variable. Create a new system variable called R_HOME and specify the location of R’s installation folder.
The environment variables are now configured and we can continue with the installation of rpy2.
To install rpy2 you must first download the installation binaries for Windows from the Unofficial Windows Binaries for Python Extension Packages site. Select the appropriate binary package for your operating system that’s compiled for Python 2.7. After the download, open CMD as administrator and navigate to the folder location of the rpy2 binaries.
From the folder location in CMD, use the following pip command pip install [package_name] to install rpy2. After the installation run Python in CMD, and then use the following command import rpy2.robjects to ensure rpy2 was properly installed.
The rpy2 package is successfully installed if there are no warning messages after running the import line.
2. Import necessary modules
The first step to writing your Geomatica, Python, and R script is to import the required libraries from each respective source.
In the code above, all the necessary libraries and modules are imported to be able to access the required functionality. On line 1 we import Python’s OS package to access operating system related functionality. On line 2 we import the numpy module to be able to properly read and structure imagery as large arrays. On line 3 the datasource class is added to provide read and write functionality for geospatial datasets. From line 4 to 7 the rpy2 modules are added for the required rpy2 functionality.
On line 4, the robjects module is imported to provide an interface between R and Python. This interface allows users to create R data structures and directly access R functionality from Python. Line 5 imports the rlike functionality that allows users to create collection type data structures that are useful for creating R data frames. On Line 6 the packages module is added to enable users to import R packages into the Python environment. Finally, on line 7 the numpy2ri module is imported to convert numpy objects to rpy2 objects.
3. Configure R environment & import packages
The next step is to configure environment variables and import the desired R packages.
The first environment variable on line 10 defines the maximum amount of virtual memory that can be used by R. The default embedded R memory limit might be insufficient for managing and processing large numpy arrays. The memory limit for R can be increased using robjects.r to modify the memory limit.
Note: ensure you provide a suitable memory limit for your system.
On line 13 the automatic conversion from numpy to rpy2 objects is activated by simply setting the numpy2ri object to active.
Note: this functionality has changed in newer versions of rpy2.
Starting on line 16 R packages are imported using the importr function. Each R package is imported and assigned to a Python object. The functionality of the R packages can then be accessed using the Python object. The first imported package is the base package that provides access to all of R’s basic functionality. The Lattice and grDevices packages provide functionality for graphical programming.
4. Open the dataset and store as numpy array
The purpose of the fourth step is to open the Sentinel-2 dataset, access the content, and store the pixel values as a numpy array.
On line 21 and 22, working_dir is assigned the path of the working directory and image_fn is assigned the name of the dataset on disk. Provide the location of your working directory on line 21.
On line 24 a context manager is established for ds.open_dataset(image_fn, ds.eAM_READ) using the with statement. This line instantiates an object of class Dataset by using the method open_dataset and passing the arguments image_fn and ds.eAM_READ. The image_fn argument provides the path and filename of the dataset to access, and the ds.eAM_READ argument specifies to open the dataset as read only.
Continuing with line 25 to 27, there are a number of different objects instantiated and variables defined. Starting on line 25 an object of class BasicReader is instantiated with the Dataset object as an argument. Using the BasicReader object, on line 26 the number of bands is assigned to a variable using the method chans_count.
Finally on line 27, the read_raster method is used to read the raster values and store them as a numpy array. The first two arguments indicate where to start reading the raster values, also known as the upper-left offset. Both values are set to 0 for these arguments. Therefore, there is no offset and the raster values are read from the upper-leftmost pixel. The last two arguments define the width and height of the raster. In this case, we set them equal to the width and height of the input file by setting the image_rdr.width and image_rdr.height attributes.
5. Creating an R Data Frame from a numpy array
The following section demonstrates how to convert a numpy array to an R Data Frame. There can be a number of required steps prior to the conversion to ensure the process is successful.
The first step is to convert the data type of the numpy array to a data type accepted by R. The R programming language does not support unsigned data types. Unfortunately, rasters are often stored as unsigned data types and must be converted prior to R. On line 34 the numpy array storing the raster values is copied and casted from unsigned 16bit to signed 32bit using the numpy function astype.
Next a 2-dimensional list is created and populated with the raster values of the numpy array to be further manipulated for the R Data Frame. From line 37 to 39 a 2-dimensional list is declared and iteratively populated. At each iteration two things occur: 1) the first dimension receives a string indicating the band number (band read at iteration), and 2) the second dimension receives the raster values of that band. The raster values are then prepared using a numpy and an rpy2 function. The flatten numpy function is first applied to collapse the matrix into a single dimension. Afterwards, the IntVector rpy2 function is applied to create an R Vector object.
The next step is to create an rpy2 Ordered Dictionary from the 2-dimensional list and then create the R Data Frame. Before creating the R Data Frame it is necessary to ensure the raster values are stored in one of the accepted object types. The accepted object types can either be an R Vector object or any Python object with the method iteritems(), like a dictionary. In this example, the dictionary format is chosen because the dictionary keys are transferred to the Data Frame’s column headers.
On line 42 the Ordered Dictionary is created using the rpy2 OrdDict function and passing the 2-dimensional list as an argument. The OrdDict function creates an rpy2 Ordered Dictionary object where the keys are indexed following the sequence in the 2-dimensional list.
The last step on line 45 is the creation of the R Data Frame object. The Data Frame is created using the Data Frame rpy2 function.
6. Processing with R
The last section provides an example on how to access the content of the R Data Frame, analyze its content, create Python objects of R tools, and how to generate and plot graphics.
The first example demonstrates how to access and analyze the content of the Data Frame. On line 48 the Data Frame is accessed and descriptive statistics are computed within a single line. The content of the first band (Band_1) is accessed using the rx2 rpy2 method. The rx2 method provides an approach to indexing and subsetting that is similar to the R language. Passing the column header as argument, retrieves the data for that column.
Next the R summary function is accessed directly using the base Python object previously initialized above. The summary function is used to compute descriptive statistics. Afterwards, the R Garbage Collector is ran to release memory in the embedded instance of R. On line 49 R’s Garbage Collector is called using the base object. It is recommended to run R’s Garbage Collector after memory demanding operations.
The second example demonstrates how to produce and plot graphics using R packages. In this example, histograms are sequentially generated for each band.
Starting on line 52, a Formula object is instantiated using the robjects.r instance of R. The Formula object is later used to define the type of graph to generate.
Next on line 55 and 56 Python objects are instantiated for the required graphics and plotting R functions. The first instantiated Python object is the R plot function. This is a generic R function for plotting R objects. The second object is the histogram function from the Lattice R package. This function provides the capability to draw and generate histogram objects.
The next block of code is the generation and plotting of the histograms. On line 59 an R graphics device for Windows is initialized using the x11 function from the grDevices package.
From line 62 to 66 a histogram for each band is sequentially generated and plotted. Starting on line 62 a for loop is defined to iterate across the number of columns available in the Data Frame. On line 63 the histograms are generated and plotted to the initialized graphics device. Within this line there are a number of different R functions and parameters involved in generating the histograms.
The formula function is used to define the type of graph to generate, see graph types. The histogram function from the Lattice package is used to generate the histograms. There are a number of different parameters that can be modified to customize the histograms, see documentation. Finally, the plot function is used to plot the generated histograms.
On line 66 the built-in Python function raw_input is used to create a pause between each iteration. The function prompts the user to press enter in order to continue to the next histogram. The last operation on line 69 terminates the initialized graphics device.