High-Throughput Raster Processing

Introduction

{rosettaPTF} is designed for efficient batch processing of soil hydraulic parameters. The package supports multiple input formats and offers options for scaling analysis to large datasets.

The core implementation uses the rosetta-soil Python module, which provides vectorized computation over multiple samples. This vignette demonstrates key parameters for controlling performance and handling large-scale analyses.

Processing Point Data

The run_rosetta() function accepts point data as a data.frame, where each row represents a single observation. Let’s use the sample soil property dataset included with the package.

data("MUKEY_PROP")

# View the structure of the sample data
str(MUKEY_PROP)

# Run rosetta on the sample property data
system.time({
  res_points <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")])
})

head(res_points)

Processing Continuously Varying Raster Data

Spatial soil property predictions are commonly stored as raster grids. The run_rosetta() function can process SpatRaster objects directly, computing predictions for every cell in the grid.

Creating a Raster Stack from Sample Data

We’ll use the sample spatial dataset (MUKEY_WCS) to create a continuous raster surface by interpolating soil properties.

# Convert MUKEY_WCS matrix to SpatRaster
data("MUKEY_WCS", package = "rosettaPTF")
data("MUKEY_PROP", package = "rosettaPTF")

r_template <- terra::rast(MUKEY_WCS, crs = "EPSG:5070")
terra::ext(r_template) <- c(-1365495, -1358925, 2869245, 2873655)
names(r_template) <- "mukey"

levels(r_template) <- MUKEY_PROP[, c("mukey",
  "sandtotal_r", "silttotal_r", "claytotal_r",
  "dbthirdbar_r")]

r_input <- terra::catalyze(r_template)
plot(r_input)

Running Rosetta on Raster Data

Pass the SpatRaster object to run_rosetta(). The output is a multi-layer raster containing mean and standard deviation for each predicted parameter.

# Process the raster stack
system.time({
  r_output <- run_rosetta(r_input)
})

# Inspect the layers
names(r_output)

# Plot predicted Ksat (log10 cm/day)
plot(r_output[["log10_Ksat_mean"]], main = "Predicted Ksat")

Scaling to Large Datasets

Parallel Processing with Multiple Cores

For large rasters, the cores argument enables block-wise processing across multiple CPU cores. This parameter controls how the raster is divided and processed in parallel.

# Divide the raster into blocks and process each block on separate cores
r_output_parallel <- run_rosetta(r_input, cores = 2)

Note that with small rasters, as in this example, parallel processing may be significantly slower than sequential processing.

Key Parameters for Scaling

cores: Number of CPU cores to use for parallel processing. Set to 1 (default) for sequential processing, or 2+ for parallel block-wise processing. Useful for rasters that are memory-intensive or computationally demanding.
Input format: SpatRaster objects are preferred for spatial workflows. The function handles memory-efficient extraction and output reconstruction automatically.

Batch Processing Workflows

For extremely large regions or high-resolution grids, consider processing tiles or regions sequentially:

# Example: process raster in regional tiles
tiles <- terra::getTileExtents(r_input, 125)

results <- terra::merge(terra::sprc(apply(tiles, 1, function(x) {
  terra::window(r_input) <- x
  run_rosetta(r_input)
})))

Advanced Options

Controlling Parameter Estimation Scale

By default, parameters like alpha, npar, and Ksat are returned on a logarithmic (log10) scale. Alternative scales are available via the estimate_type argument:

# Linear scale estimates
res_linear <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")],
                          estimate_type = "arith")

# Geometric mean (recommended for log-transformed parameters)
res_geo <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")],
                       estimate_type = "geo")

Bootstrap Ensemble

By default, predictions include the full 1,000-member bootstrap ensemble for uncertainty quantification. The output includes both mean and standard deviation for each parameter.

# The output includes uncertainty estimates
head(res_points[, c("log10_Ksat_mean", "log10_Ksat_sd")])

Summary

Key considerations for high-throughput analysis:

Use SpatRaster objects as input for spatial workflows; the function handles data conversion automatically.
Set cores > 1 to enable parallel block-wise processing for large rasters.
Choose estimate_type based on your application requirements.
For extremely large regions, consider tiling or regional batch processing workflows.