Take me to pcigeomatics.com

PCI Geomatics Help Center

How can we help you today?

Optimized Image Processing by Tiles

PCI Geomatics -

Overview

This tutorial is an extension to the Process Rasters by Iterating over Tiles with Geomatica Python tutorial and communicates some advanced concepts for rules and methods to optimize processing of tiles. Some concepts communicated here include: Conditions for when a file should be processed by tiles, determining optimal tile dimensions based on input and output file layouts, and creating a cap for how much memory should be used.

Data Package

  • Click Here to download the data required for this tutorial

Watch this Tutorial

< Embed Video Tutorial Here >

Benefits

Processing by tiles is a very important concept when creating custom processes that may be used on large files. When building a custom solution, you may not have control over the file size or tile dimensions of the input file, which means that hardcoding the tile dimensions can degrade performance or overwhelm memory resources. This tutorial will communicate and demonstrate some general rules to consider when processing by tiles.

Prerequisites

  • Geomatica 2016 or later installed
  • Python 101 (basic understanding of python coding concepts)
    • How to access modules, classes, attributes and methods
    • How to build and call a function
    • How to use with statements
    • For-loops
  • Work through and understand the concepts in the Process Rasters by Iterating over Tiles with Geomatica Python tutorial (highly recommended)
  • Some experience programming with Geomatica (recommended)
  • Some experience with numpy (recommended)

Key Concepts

  1. Processing by tiles will generally degrade performance, so you want to make sure to use the largest tile size possible, based on an appropriate memory limit
  2. Use optimal tile dimensions. When possible avoid creating tiles in the X or Y dimension that are smaller than the tile dimensions in your input and output files
    1. If your processing tile dimensions are smaller than the tile dimensions of the input or output file, you will require multiple reads and or writes to a specific tile, which may significantly degrade performance
    2. For example, if my input file tile dimensions are 1024x1 (striped) and my output file is 512x512, your tile dimensions for processing should not be smaller than 1024x512
  3. Concept 2 (directly above) should only be broken if the optimal tile dimension would require too much memory
    1. For example, if the input is a large striped file with dimensions 1,000,000x1, the output file is 1024x1024 and you have 4 channels at 16-bits per pixel, the optimal tile size of 1,000,000x1024 would require approximately 7.63GB of memory, which for some systems may be too much
      1. Bytes: 1,000,000x1024x4x2 = 8,192,000,000 Bytes
      2. KB: (1,000,000x1024x4x2)/1024 = 8,000,000 KB
  • MB: (1,000,000x1024x4x2)/10242 = 7812.5 MB
  1. GB: (1,000,000x1024x4x2)/10243 = 7.63 GB

Tutorial

This tutorial is an extension to the intermediate level tutorial Process Rasters by Iterating over Tiles with Geomatica Python , which demonstrates how to process an image by subsets or tiles. In this tutorial we will look at the same basic code used in that tutorial, but work with more challenging input imagery (raw Landsat-8 which is striped) in order to demonstrate some key concepts for optimizing throughput when processing with tiles.

Furthermore, this tutorial will break the code up into two main functions, one function called main(), which is responsible for running the primary program, and a second function called get_tile_size(), which is responsible for computing the optimal tile size to be used for processing. The focus of this tutorial will be based on what the get_tile_size() function is doing. The main() function will only be referenced when it relates to the get_tile_size() function.

 

Note: If you are confused about the main() function, you should work through the Process Rasters by Iterating over Tiles with Geomatica Python tutorial first, as the code in that tutorial is very similar to the code in the main() function

Figure 2 - When using tile dimensions that are not optimized for the data, considerable more time is required for processing (in this case, almost 15x more)

1. Import Necessary Modules

At the top of our python file, we will import the modules required for this tutorial.

1
2
3
4
5
import os
import calendar
import time
import numpy as np
from pci.api import gobs, datasource as ds

 

2. Setup input/output files and processing parameters

This next block of code will live at the bottom of the file and is used to setup the input/output files and processing parameters. If this file is run as the primary process (not imported by another file), this block will be initialized.

 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
if __name__ == '__main__':
    # set input output files and processing parameters for workflow
    working_dir = r'D:\tutorials\optimized_tile_processing'  # set this path to the dir where you extracted the tutorial data
    input_file = os.path.join(working_dir, 'LC81950282015242LGN00\LC81950282015242LGN00_MTL.txt-MS')
    output_file = os.path.join(working_dir, 'landsat8_sum.tif')
    output_tile_size = 'TILED512'
    memory_limit = 32   # in MegaBytes (MB)
    overwrite = 'yes'

    if overwrite == 'yes':
        try:
            os.remove(output_file)
        except OSError:
            pass

    main(input_file, output_file, output_tile_size, memory_limit)  # run the main process

 

On line 111, if __name__ == ‘__main__’: tells python to only run the block of code below it if this python file is run as the main process.

On line 113, create a string variable called working_dir, and set it to the path where you unzipped the tutorial data. Lines 114 and 115, are set to the input file and the output file, respectively. Line 116 sets the tiling scheme that will be used when creating the output .tif file, which is set to 515x512 pixels. On line 117, we define the memory limit as 32MB, which is quite small, but sufficient for this tutorial. Line 118 to 124, creates a simple rule for whether to overwrite the output file. And finally, on line 126, we call the main() function and pass the input/output file and processing variables as arguments.

3. Create main() function to run primary process

The main() function will hold the primary code for processing by tiles.

 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
def main(in_file, out_file, out_tile_size, mem_lim):
    """
    This function contains the main process for this workflow, which will process an image by tiles, using
    optimal tile dimensions, computed by the get_tile_dimensions() function.
    :param in_file: path to input file for processing
    :param out_file: path and name of output file for writing
    :param out_tile_size: tiling method
    :param mem_lim: user defined memory limit for determining optimal tile size
    :return: none
    """
    # open two datasets using the with-statement, one to read from and one to write to
    with ds.open_dataset(in_file) as read_dataset, ds.new_dataset(out_file, 'TIFF', out_tile_size) as write_dataset:
        in_chans = [2, 3]
        reader_l8 = ds.BasicReader(read_dataset, in_chans)
        in_coord_sys = reader_l8.crs
        in_geocode = reader_l8.geocoding

        in_raster = reader_l8.read_raster(0, 0, reader_l8.width, reader_l8.height)

        # create the writer object
        writer = ds.BasicWriter(write_dataset)

        # create output file
        out_chan = 1
        writer_info = gobs.RasterInfo(reader_l8.width, reader_l8.height, out_chan, gobs.DT_32R)
        writer.create(writer_info)

        # get and set the optimal tile size
        tile_sz_x, tile_sz_y = get_tile_size(reader_l8, writer, mem_lim, len(in_chans), out_chan)
        reader_l8.tile_size = (tile_sz_x, tile_sz_y)

        i = 1

        start_time = calendar.timegm(time.gmtime())

        # iterate through each tile, add the blue channel pixel values with the green channel and write to output
        for tile in reader_l8:
            print 'creating tile: {0} with dimensions: {1}'.format(i, reader_l8.tile_size)

            # store raster channels as numpy arrays and convert to 16bit Unsigned Integers
            blue_chan = in_raster.data[:, :, 0]
            blue_chan_32r = blue_chan.astype(np.float32)
            green_chan = in_raster.data[:, :, 1]
            green_chan_32r = green_chan.astype(np.float32)

            # Perform addition between blue and green channels and store results in sum_array
            sum_array = blue_chan_32r + green_chan_32r

            # wrap array in a raster object (this doesn't make a copy of the data)
            sum_raster = gobs.array_to_raster(sum_array)

            # write the pixels from sum_raster, the input coordinate system and geocoding information to output file
            writer.write_raster(sum_raster)
            writer.crs = in_coord_sys
            writer.geocoding = in_geocode

            i += 1

        proc_time = calendar.timegm(time.gmtime()) - start_time

        print '\nProcessing took {0} seconds!'.format(proc_time)

 

Below the import statements and above the if __name__ == ‘__main__’: code blocks, create a new function called main() with 4 arguments def main(in_file, out_file, out_tile_size, mem_lim):

Copy the code from above and put it into that function.

On line 78, we make a call to another function that we will create momentarily, called get_tile_size(), which will return the optimal tile_sz_x and tile_sz_y based on the tiling schemes of the input and output data, as well as the user defined memory limit.

Note: for more information on the code in the main() function, refer to tutorial Process Rasters by Iterating over Tiles with Geomatica Python.

4. Create get_tile_size() function

Above the main() function and below the import statements, create a new function that will compute the optimal tile size based on the tiling scheme of the input and output files.

  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
def get_tile_size(in_reader, out_writer, mem_lim, num_in_chans, num_out_chans):
    """
    This function determines the optimal tile dimensions for processing a file based on the tile dimensions
    of the input file and output file
    :param in_reader: the reader object that contains the tiling information for the input file
    :param out_writer: the writer object that contains the tiling information for the output file
    :param mem_lim: user defined memory limit in MB
    :param num_in_chans: number of input channels that are being read in by the in_reader object
    :param num_out_chans: number of output channels that are created by the out_writer object
    :return: tile_size_x, tile_size_y
    """
    in_reader.set_optimal_tile_size()
    out_writer.set_optimal_tile_size()

    max_tile_x = max([in_reader.tile_size[0], out_writer.tile_size[0]])
    max_tile_y = max([in_reader.tile_size[1], out_writer.tile_size[1]])

    in_bytes = in_reader.datatype.bits_per_pixel / 8
    out_bytes = out_writer.datatype.bits_per_pixel / 8

    mem_req_bytes = max_tile_x * max_tile_y * ((num_in_chans * in_bytes) + (num_out_chans * out_bytes))
    mem_req_megabytes = mem_req_bytes / 1048576.0

 

On line 8 def get_tile_size(in_reader, out_writer, mem_lim, num_in_chans, num_out_chans): defines the function that will be responsible for computing the optimal tile size in the x and y directions. The parameters passed to this function are as follows:

  • in_reader: the input reader object that contains tiling information about the input file to be processed
  • out_writer: the utput writer object that contains tiling information about the output file that will be created
  • mem_lim: the user defined memory limit. This sets the upper limit of how much memory can be used, which will define the maximum tile size
  • num_in_chans: how many channels will be read in during processing, this also impacts memory usage
  • num_out_chans: how many channels will be written to the output file, this also impacts memory usage

Lines 19 and 20 get the tile dimensions in the input file and output file by calling in_reader.set_optimal_tile_size() and out_writer.set_optimal_tile_size() methods. Lines 22 and 23 gets the largest tile size in each direction between the input and output file. This is important because we want to make sure the processing tile size we define is equal to or larger than the largest size in each direction, as long as it meets our memory limit.

Lines 25 and 26 we get the bit-depth of the input and output channels in bytes through in_reader.datatype.bits_per_pixel and out_writer.datatype.bits_per_pixel. This information will be stored in the in_bytes and out_bytes variables and used to calculate the memory requirements for the optimal tile size.

Figure 3: When possible, the processing tile size should be the maximum size in each direction of the input and output file tile scheme

5. Calculate the memory required to use the optimal tile size

Using the information about the input/output tile size in each direction, the number of channels and the bit depth of each channel, we can calculate the amount of memory required for processing the tiles.

28
29
mem_req_bytes = max_tile_x * max_tile_y * ((num_in_chans * in_bytes) + (num_out_chans * out_bytes))
    mem_req_megabytes = mem_req_bytes / 1048576.0

 

Line 28 calculates the memory required for the optimal tile size in bytes. Line 29 converts the memory requirement value from byte to megabytes (MB), so that we can compare it to the user entered memory limit.

6. Check that the memory required for optimal tile size is less than the memory limit

Use an if-condition statement to qualify that the memory required by the tile size is below the memory limit set by the user. In this case we are using 32MB.

 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
if mem_lim > mem_req_megabytes:
        print '\n$$$$$$$$$ Optimal tile size meets memory limit $$$$$$$$$'
        print 'The memory limit of: {0} is greater than the ' \
              'memory required for optimal tiling of: {1}\n'.format(mem_lim, mem_req_megabytes)

        tile_size_x = max_tile_x
        tile_size_y = max_tile_y
    else:
        # you can be more creative in creating a dynamic solution to lowering the tile size to meet your memory lim
        print '\n$$$$$$$$$ Optimal tile size exceeds memory limit $$$$$$$$$'
        print 'The memory limit of: {0} is less than the ' \
              'memory required for optimal tiling of: {1}\n'.format(mem_lim, mem_req_megabytes)

        tile_size_x = 512
        tile_size_y = 512

    return tile_size_x, tile_size_y

 

On line 31, we check to see if the mem_lim variable is greater than the mem_req_megabytes variable. If it is true, then we simply set the tile_size_x and tile_size_y to the max_tile_x and Max_tile_y on lines 36 and 37.

Note: Further optimizations should be done here. If the memory limit is much bigger than the memory required by the optimal tile size, you should create some rules to further increase the tile size. As a general rule, you should always try to use the largest tile size possible.

On line 38, we use an else: condition to handle the case where the optimal tile size would require more memory than the user defined memory limit. In this case, we just simply set the tile_size_x and tile_size_y to 512.

Note: hardcoding a tile size is not recommended, but was done to keep this tutorial relatively simple. You should create a set of rules to define the largest tile size possible that will stay under the memory limit.

Finally, on line 47, we return the tile_size_x and tile_size_y to be used in the main() function for processing.

Try it!

Seeing is believing! Try running the code you created or the process_rasters_by_tiles_optimizations.py file provided with the tutorial data with the memory_limit variable set to 32 megabytes. Take note in the terminal how long it took to run the process. Then set the memory_limit variable to 16 megabytes and rerun the process. The second time through the memory required for the optimal tile will be greater than the memory limit and the tile size will be defaulted to 512x512, which is not optimal. Notice how the second process took much longer to finish.

Have more questions? Submit a request

Comments

Powered by Zendesk