High-Throughput Raster Processing

Introduction

{rosettaPTF} is designed for efficient batch processing of soil hydraulic parameters. The package supports multiple input formats and offers options for scaling analysis to large datasets.

The core implementation uses the rosetta-soil Python module, which provides vectorized computation over multiple samples. This vignette demonstrates key parameters for controlling performance and handling large-scale analyses.

Processing Point Data

The run_rosetta() function accepts point data as a data.frame, where each row represents a single observation. Let’s use the sample soil property dataset included with the package.

data("MUKEY_PROP")

# View the structure of the sample data
str(MUKEY_PROP)

# Run rosetta on the sample property data
system.time({
  res_points <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")])
})

head(res_points)

Processing Continuously Varying Raster Data

Spatial soil property predictions are commonly stored as raster grids. The run_rosetta() function can process SpatRaster objects directly, computing predictions for every cell in the grid.

Creating a Raster Stack from Sample Data

We’ll use the sample spatial dataset (MUKEY_WCS) to create a continuous raster surface by interpolating soil properties.

# Convert MUKEY_WCS matrix to SpatRaster
data("MUKEY_WCS", package = "rosettaPTF")
data("MUKEY_PROP", package = "rosettaPTF")

r_template <- terra::rast(MUKEY_WCS, crs = "EPSG:5070")
terra::ext(r_template) <- c(-1365495, -1358925, 2869245, 2873655)
names(r_template) <- "mukey"

levels(r_template) <- MUKEY_PROP[, c("mukey",
  "sandtotal_r", "silttotal_r", "claytotal_r",
  "dbthirdbar_r")]

r_input <- terra::catalyze(r_template)
plot(r_input)

Running Rosetta on Raster Data

Pass the SpatRaster object to run_rosetta(). The output is a multi-layer raster containing mean and standard deviation for each predicted parameter.

# Process the raster stack
system.time({
  r_output <- run_rosetta(r_input)
})

# Inspect the layers
names(r_output)

# Plot predicted Ksat (log10 cm/day)
plot(r_output[["log10_Ksat_mean"]], main = "Predicted Ksat")

Scaling to Large Datasets

Parallel Processing with Multiple Cores

For large rasters, the cores argument enables block-wise processing across multiple CPU cores. This parameter controls how the raster is divided and processed in parallel.

# Divide the raster into blocks and process each block on separate cores
r_output_parallel <- run_rosetta(r_input, cores = 2)

Note that with small rasters, as in this example, parallel processing may be significantly slower than sequential processing.

Key Parameters for Scaling

Batch Processing Workflows

For extremely large regions or high-resolution grids, consider processing tiles or regions sequentially:

# Example: process raster in regional tiles
tiles <- terra::getTileExtents(r_input, 125)

results <- terra::merge(terra::sprc(apply(tiles, 1, function(x) {
  terra::window(r_input) <- x
  run_rosetta(r_input)
})))

Advanced Options

Controlling Parameter Estimation Scale

By default, parameters like alpha, npar, and Ksat are returned on a logarithmic (log10) scale. Alternative scales are available via the estimate_type argument:

# Linear scale estimates
res_linear <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")],
                          estimate_type = "arith")

# Geometric mean (recommended for log-transformed parameters)
res_geo <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")],
                       estimate_type = "geo")

Bootstrap Ensemble

By default, predictions include the full 1,000-member bootstrap ensemble for uncertainty quantification. The output includes both mean and standard deviation for each parameter.

# The output includes uncertainty estimates
head(res_points[, c("log10_Ksat_mean", "log10_Ksat_sd")])

Summary

Key considerations for high-throughput analysis:

  1. Use SpatRaster objects as input for spatial workflows; the function handles data conversion automatically.
  2. Set cores > 1 to enable parallel block-wise processing for large rasters.
  3. Choose estimate_type based on your application requirements.
  4. For extremely large regions, consider tiling or regional batch processing workflows.