This tutorial will demonstrate how developers can launch multiple concurrent (parallel) Geomatica processes from a single script in order to improve overall throughput for a batch process.
Multiprocessing is a capability built into Python. It allows a Python program or script to launch multiple jobs in parallel. This can significantly improve processing times for some Geomatica algorithms when processing a batch.
Note: Many of Geomatica’s algorithms are designed for multi-threaded processing. Using multiprocessing on these algorithms may yield limited performance gains and, in some cases, may even slow down processing.
Prerequisites
- Geomatica 2015 or later installed
- Python 101 (basic understanding of Python coding concepts)
  - How to access modules, classes, attributes and methods
  - How to call a function
  - How to build a function
- Some experience programming with Geomatica (recommended)
  - In particular, EASI programming
- Some experience with multiprocessing concepts
- Click Here to download the data required for this tutorial
To demonstrate parallel processing with Geomatica, we will construct a program that runs an unsupervised fuzzy k-means classification on three Landsat-8 images simultaneously. We will also limit the number of jobs that can run concurrently. This simple management method automatically starts the next job in the queue as soon as one of the workers becomes available.
1. Import necessary modules
The first step is to import the native and Geomatica modules we require for performing multiprocessing with Geomatica.
In the above code, we import several native modules and one Geomatica module. On lines 1 and 2 we import the modules os and fnmatch, which are used to walk the file system and match file names. On lines 3 and 4 we import the modules calendar and time, which are used to track how long it takes to process the entire batch. On line 5 we import multiprocessing, which is the key module required for parallelizing Geomatica’s algorithms. On line 6 we write from pci.fuzclus import fuzclus, which imports the Geomatica algorithm fuzclus, responsible for performing a fuzzy k-means unsupervised classification.
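For reference, the imports described above can be sketched as follows. This is a reconstruction from the description rather than the tutorial's verbatim listing, so its line numbers differ from those cited in the text. The try/except guard is an addition, assumed here only so that the sketch also loads on a machine without Geomatica.

```python
import os               # directory traversal and path handling
import fnmatch          # wildcard matching of file names
import calendar         # converts a UTC time tuple to seconds since the epoch
import time             # supplies the current time
import multiprocessing  # runs worker functions in separate processes

try:
    # Geomatica's fuzzy k-means clustering algorithm.
    from pci.fuzclus import fuzclus
except ImportError:
    # Assumption for this sketch only: fall back to None when the pci
    # package is not installed, so the remaining imports still succeed.
    fuzclus = None
```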
2. Create generator to search input directory for valid files
In this step, we will create a simple generator that searches the input directory for all valid input files and prepares them for processing.
On line 9 we define a function called get_batch, which takes the in_dir and img_filter variables. The in_dir variable points to the input directory that is searched recursively for input files, and the img_filter variable is a wildcard string that determines which files are valid. Lines 11 to 13 loop through the directory and its subdirectories (if any exist) and use fnmatch to find all files that match the img_filter string.
Inside the loop, the yield statement returns each valid input file on the fly, so get_batch behaves as a generator.
This function will be called later on in the program.
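A generator of the kind described above might look like the following minimal sketch. The names get_batch, in_dir and img_filter come from the text; the loop variable names and the exact layout are assumptions, so the line numbers will not match those cited above.

```python
import fnmatch
import os


def get_batch(in_dir, img_filter):
    """Yield the full path of every file under in_dir matching img_filter."""
    # os.walk visits in_dir and every subdirectory below it.
    for root, _dirs, files in os.walk(in_dir):
        # fnmatch.filter keeps only the names matching the wildcard pattern.
        for name in fnmatch.filter(files, img_filter):
            yield os.path.join(root, name)
```

Because get_batch is a generator, no directory is scanned until something iterates over it, for example with `for img in get_batch(in_dir, img_filter):`.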
3. Create worker function
The worker function is, in and of itself, nothing special. It is the function that contains the process we want parallelized. In the function, we make a call to the Geomatica algorithm we want to use and pass it the necessary input parameters.
On line 16 of the code block above, we define a function called worker(current_job, in_file) with two arguments. The first argument, current_job, keeps track of which process is currently being run, and the second argument, in_file, specifies the input file on which the classification will be performed.
On line 18 we simply print the current job number so we know which jobs are currently active.
Line 19 makes a call to the fuzclus() Geomatica algorithm, which takes 4 keyword arguments:
- fili=in_file – points to the input file that the classification will be run on
- dbic = [1, 2, 3, 4] – specifies the input channels the classification should use for clustering
- backval= – defines the no data value to omit during processing
- dboc= – specifies the channel number that should receive the output classification map
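The shape of the worker can be sketched without Geomatica installed. In the sketch below, classify_stub is a hypothetical stand-in for fuzclus so that the example runs anywhere; it is not part of the pci package. The backval and dboc values are not given in the text, so they are left out here rather than guessed.

```python
def classify_stub(fili, dbic):
    """Hypothetical stand-in for fuzclus; it only reports what it was given."""
    return "{} using channels {}".format(fili, dbic)


def worker(current_job, in_file):
    """Run one classification job; this is the function run in parallel."""
    # Report which job has started so active jobs are visible in the console.
    print("Processing job: " + str(current_job))
    # In the tutorial this is the fuzclus() call, which additionally receives
    # the backval and dboc keyword arguments described above.
    return classify_stub(fili=in_file, dbic=[1, 2, 3, 4])
```

The worker is defined at module level, which allows the multiprocessing module to locate it by name from each child process.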
4. Create main program block and setup multiprocessing
When using the multiprocessing module to parallelize Python processes, it is vital that you use the if __name__ == '__main__': condition, especially on Windows. Furthermore, the instantiation and use of the multiprocessing object(s) must be held within that condition. Windows starts each child process by importing the main module again, so unguarded code that creates a pool would run again in every child. For more information on why if __name__ == '__main__': is required on Windows, please read the multiprocessing guidelines for Windows.
On line 23 we create the condition if __name__ == '__main__':, which tells Python to only run the code in this block if the file is launched as the main program. If the Python file is imported, the code in this block will not be executed.
On line 25, define the working_dir variable and set it to the root directory where you unzipped the data for this tutorial. Lines 26 and 27 are used to define the variables psh_dir and file_filter, respectively. The psh_dir points to the specific folder with the input files and the file_filter contains the string used to find valid input files.
On line 30 we create a variable called parallel_proc, which is used to define how many processes should be run in parallel.
On line 32, we create a variable that stores the starting time in seconds since the epoch, which is used later to calculate the time required to process the batch.
On line 34, we call the get_batch generator and store the resulting generator object in the in_files variable. No files are searched for at this point; each valid input file is produced only when the generator is iterated.
On line 36 we create a variable i and assign it the value 1, i = 1, which is used as a starting point to identify which process in the batch is running.
On line 37 we create an object called pool from the multiprocessing.Pool class, setting the keyword argument processes equal to parallel_proc. This limits the number of parallel processes to the value defined by parallel_proc, in this case 3.
Note: If we do not define the processes argument, the multiprocessing.Pool object defaults to one worker process per CPU reported by the operating system, which counts logical cores.
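The timer and pool setup from this step can be sketched on their own. The helper name seconds_since_epoch is hypothetical, and pairing calendar.timegm with time.gmtime is an assumption about how the two imported modules are combined. No jobs are submitted here, so the pool is closed straight away.

```python
import calendar
import multiprocessing
import time


def seconds_since_epoch():
    """Return the current UTC time as whole seconds since the epoch."""
    return calendar.timegm(time.gmtime())


if __name__ == "__main__":
    parallel_proc = 3  # maximum number of jobs running at once
    start_time = seconds_since_epoch()

    # Creating the pool starts parallel_proc idle worker processes.
    pool = multiprocessing.Pool(processes=parallel_proc)
    print("Pool created with {} workers".format(parallel_proc))

    # Nothing is submitted in this sketch, so shut the pool down cleanly.
    pool.close()
    pool.join()
    print("Elapsed: {} s".format(seconds_since_epoch() - start_time))
```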
5. Run Geomatica process in parallel
In this section, we will create a for-loop to acquire all images in the batch and use the pool.apply_async() function to run them in parallel (up to three jobs at a time).
On line 38 we create the for-loop, which receives each img item yielded by the in_files generator. Inside the for-loop, on line 39, we call pool.apply_async(worker, args=(i, img, )). The first argument is the worker function (the function we wish to parallelize) and the second, args=(i, img, ), holds the arguments we want to pass to the worker function.
On line 41, we increment variable i by 1, which as stated before is used to keep track of the current process.
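On lines 43 and 44, we exit the for-loop and then close and join the pool object. Closing the pool prevents any further jobs from being submitted, and joining it waits for all queued jobs to finish.
Lastly, on lines 46 and 47 we calculate the total time required to process the batch and print the result to the terminal.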
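The dispatch loop described in this step can be tried without Geomatica using the sketch below. The worker here is a hypothetical stand-in that sleeps briefly instead of classifying an image, and the file names are made up for illustration. Storing the objects returned by apply_async is an addition to the loop described above.

```python
import calendar
import multiprocessing
import time


def worker(current_job, in_file):
    """Hypothetical stand-in for the classification worker."""
    print("Processing job: " + str(current_job))
    time.sleep(0.1)  # pretend to do some work
    return in_file.upper()


if __name__ == "__main__":
    # Made-up file names; the tutorial obtains these from get_batch().
    in_files = ["scene_a.pix", "scene_b.pix", "scene_c.pix", "scene_d.pix"]
    parallel_proc = 3
    start_time = calendar.timegm(time.gmtime())

    pool = multiprocessing.Pool(processes=parallel_proc)
    results = []
    i = 1
    for img in in_files:
        # Queue the job; an idle pool process picks it up when one is free.
        results.append(pool.apply_async(worker, args=(i, img)))
        i += 1

    pool.close()  # no more jobs will be submitted
    pool.join()   # wait for every queued job to finish

    # get() returns each job's result and re-raises any exception it hit.
    print([r.get() for r in results])
    total = calendar.timegm(time.gmtime()) - start_time
    print("Total time: {} s".format(total))
```

Keeping the AsyncResult objects is a deliberate choice: without a get() call, an exception raised inside a worker goes unnoticed and the batch appears to finish normally.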
Try it Yourself!
See for yourself how multiprocessing can improve performance. In the data package you downloaded for this tutorial, you will find two Python files called serial_processing.py and multi_processing.py. Both scripts run the same workflow on the same images, but the serial_processing.py script runs the processes in sequence, whereas the multi_processing.py script runs them in parallel.
Change the working_dir variable in the two files to point to the root folder where you extracted the tutorial data and run one script after the other. See the performance difference for yourself!
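To get a feel for the serial baseline before running the real scripts, a stand-in can be timed as below. This is not the content of serial_processing.py, which is not reproduced in this tutorial; the worker and the run_serial helper are hypothetical, and the sleep merely simulates processing time.

```python
import time


def worker(current_job, in_file):
    """Hypothetical stand-in job that takes a fixed amount of time."""
    time.sleep(0.05)
    return current_job


def run_serial(in_files):
    """Run every job one after another and return (job ids, elapsed seconds)."""
    start = time.time()
    done = []
    i = 1
    for img in in_files:
        done.append(worker(i, img))
        i += 1
    return done, time.time() - start


if __name__ == "__main__":
    jobs, elapsed = run_serial(["a.pix", "b.pix", "c.pix"])
    print("Ran jobs {} in {:.2f} s".format(jobs, elapsed))
```

In the serial case the elapsed time grows with the sum of the individual job times, which is the cost that running jobs in a pool aims to reduce.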