--- title: "High-Throughput Raster Processing" knit: litedown:::knit vignette: > %\VignetteIndexEntry{High-Throughput Raster Processing} %\VignetteEngine{litedown::vignette} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} library(rosettaPTF) library(terra) EVAL <- rosettaPTF::rosetta_module_available() litedown::reactor( eval = EVAL, collapse = TRUE, fig.width = 8, fig.align = 'center' ) ``` ## Introduction `{rosettaPTF}` is designed for efficient batch processing of soil hydraulic parameters. The package supports multiple input formats and offers options for scaling analysis to large datasets. The core implementation uses the `rosetta-soil` Python module, which provides vectorized computation over multiple samples. This vignette demonstrates key parameters for controlling performance and handling large-scale analyses. ## Processing Point Data The `run_rosetta()` function accepts point data as a `data.frame`, where each row represents a single observation. Let's use the sample soil property dataset included with the package. ```{r point_data} data("MUKEY_PROP") # View the structure of the sample data str(MUKEY_PROP) # Run rosetta on the sample property data system.time({ res_points <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")]) }) head(res_points) ``` ## Processing Continuously Varying Raster Data Spatial soil property predictions are commonly stored as raster grids. The `run_rosetta()` function can process `SpatRaster` objects directly, computing predictions for every cell in the grid. ### Creating a Raster Stack from Sample Data We'll use the sample spatial dataset (MUKEY_WCS) to create a continuous raster surface by interpolating soil properties. ```{r raster_setup} # Convert MUKEY_WCS matrix to SpatRaster data("MUKEY_WCS", package = "rosettaPTF") data("MUKEY_PROP", package = "rosettaPTF") r_template <- terra::rast(MUKEY_WCS, crs = "EPSG:5070") terra::ext(r_template) <- c(-1365495, -1358925, 2869245, 2873655) names(r_template) <- "mukey" levels(r_template) <- MUKEY_PROP[, c("mukey", "sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")] r_input <- terra::catalyze(r_template) plot(r_input) ``` ### Running Rosetta on Raster Data Pass the `SpatRaster` object to `run_rosetta()`. The output is a multi-layer raster containing mean and standard deviation for each predicted parameter. ```{r raster_processing} # Process the raster stack system.time({ r_output <- run_rosetta(r_input) }) # Inspect the layers names(r_output) # Plot predicted Ksat (log10 cm/day) plot(r_output[["log10_Ksat_mean"]], main = "Predicted Ksat") ``` ## Scaling to Large Datasets ### Parallel Processing with Multiple Cores For large rasters, the `cores` argument enables block-wise processing across multiple CPU cores. This parameter controls how the raster is divided and processed in parallel. ```{r parallel_raster, eval=FALSE} # Divide the raster into blocks and process each block on separate cores r_output_parallel <- run_rosetta(r_input, cores = 2) ``` Note that with small rasters, as in this example, parallel processing may be significantly slower than sequential processing. ### Key Parameters for Scaling - **`cores`**: Number of CPU cores to use for parallel processing. Set to 1 (default) for sequential processing, or 2+ for parallel block-wise processing. Useful for rasters that are memory-intensive or computationally demanding. - **Input format**: `SpatRaster` objects are preferred for spatial workflows. The function handles memory-efficient extraction and output reconstruction automatically. ### Batch Processing Workflows For extremely large regions or high-resolution grids, consider processing tiles or regions sequentially: ```{r batch_example, eval=FALSE} # Example: process raster in regional tiles tiles <- terra::getTileExtents(r_input, 125) results <- terra::merge(terra::sprc(apply(tiles, 1, function(x) { terra::window(r_input) <- x run_rosetta(r_input) }))) ``` ## Advanced Options ### Controlling Parameter Estimation Scale By default, parameters like `alpha`, `npar`, and `Ksat` are returned on a logarithmic (log10) scale. Alternative scales are available via the `estimate_type` argument: ```{r estimate_scales} # Linear scale estimates res_linear <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")], estimate_type = "arith") # Geometric mean (recommended for log-transformed parameters) res_geo <- run_rosetta(MUKEY_PROP[, c("sandtotal_r", "silttotal_r", "claytotal_r", "dbthirdbar_r")], estimate_type = "geo") ``` ### Bootstrap Ensemble By default, predictions include the full 1,000-member bootstrap ensemble for uncertainty quantification. The output includes both mean and standard deviation for each parameter. ```{r uncertainty} # The output includes uncertainty estimates head(res_points[, c("log10_Ksat_mean", "log10_Ksat_sd")]) ``` ## Summary Key considerations for high-throughput analysis: 1. Use `SpatRaster` objects as input for spatial workflows; the function handles data conversion automatically. 2. Set `cores > 1` to enable parallel block-wise processing for large rasters. 3. Choose `estimate_type` based on your application requirements. 4. For extremely large regions, consider tiling or regional batch processing workflows.