kernel_image_resize

A pure-Triton Hub kernel that fuses the resize + rescale + normalize preprocessing pipeline run by ~150 transformers fast image processors (TorchvisionBackend: resize → fold(rescale, normalize)) into a single GPU pass. It takes raw CHW uint8 images and returns the normalized (N, C, out_h, out_w) float tensor with no intermediate full-resolution float buffer.
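The fold(rescale, normalize) step can be illustrated in a few lines of plain Python. This is a minimal sketch of the arithmetic only, not the kernel's code: the helper name fold_rescale_normalize is hypothetical, and it assumes the usual definition normalized = (x * rescale_factor - mean) / std.

```python
def fold_rescale_normalize(mean, std, rescale_factor=1 / 255):
    """Hypothetical helper: per-channel (scale, bias) such that
    x * scale + bias == (x * rescale_factor - mean) / std."""
    scale = [rescale_factor / s for s in std]
    bias = [-m / s for m, s in zip(mean, std)]
    return scale, bias


scale, bias = fold_rescale_normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

pixel = 255  # a raw uint8 value
two_step = (pixel * (1 / 255) - 0.5) / 0.5  # rescale, then normalize
one_step = pixel * scale[0] + bias[0]       # single fused multiply-add
```

Both expressions agree up to floating-point rounding, which is why the two steps can be applied as one multiply-add per output pixel.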

On a ragged SigLIP-so400m batch (A100, N=32, inputs 384–1024², out 384², bicubic+antialias), the default backend runs in 1.29 ms/iter, versus 3.90 ms for the fast processor (~3× faster) and 2.89 ms for torchvision's own per-image loop (~2.2× faster), at parity ≤1e-4 vs torchvision-float.

It ships as a kernels universal build variant (no compiled extension, just Triton), so it loads on any CUDA PyTorch build via get_kernel.

Usage

import torch
from kernels import get_kernel

kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)

# a list of different-H×W uint8 CHW images (the ragged case torchvision loops over)
images = [torch.randint(0, 256, (3, h, w), dtype=torch.uint8, device="cuda")
          for h, w in [(640, 480), (800, 600), (384, 1024)]]

pixel_values = kir.resize_normalize(
    images,
    size=384,                      # int (square), (H, W), or {"height", "width"}
    image_mean=[0.5, 0.5, 0.5],
    image_std=[0.5, 0.5, 0.5],
    rescale_factor=1 / 255,
    resample="bicubic",            # or "bilinear", or a PIL resample int
    antialias=True,                # match the ViT/CLIP/SigLIP default
)
# -> (3, 3, 384, 384) float32, ready for the model

trust_remote_code=True is required because this is a personal namespace (not the trusted kernels-community org). revision="main" loads the current code; tag a v1.0.0 release if you want version=1 loading instead.

resize_normalize accepts a stacked (N, C, H, W) tensor or a ragged list of CHW tensors. resize_normalize_ragged is the same kernel, list-only.

With a transformers processor

There is no use_kernels=True hook for image processors — that machinery swaps nn.Module layer forwards inside the model, not processor code. Use the kernel directly with the processor's config (see example_transformers.py for a runnable version):

from kernels import get_kernel
kir = get_kernel("Molbap/kernel_image_resize", revision="main", trust_remote_code=True)
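# PIL resample codes: 0 = NEAREST, 2 = BILINEAR, 3 = BICUBIC.
# Note that NEAREST (0) is mapped to bilinear here; other codes raise KeyError.
_PIL_RESAMPLE = {0: "bilinear", 2: "bilinear", 3: "bicubic"}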

def preprocess_with_kernel(processor, images):
    size = processor.size  # must be fixed {"height", "width"}; no crop/pad
    return kir.resize_normalize(
        images, (size["height"], size["width"]),
        processor.image_mean, processor.image_std,
        rescale_factor=float(processor.rescale_factor),
        resample=_PIL_RESAMPLE[int(processor.resample)],
        antialias=bool(getattr(processor, "antialias", True)),
    )
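
Since the wrapper above only covers processors with a fixed height/width and no crop or pad, it can help to test for that before calling it. The following is a rough sketch under stated assumptions: the function name simple_wrapper_applies is hypothetical, processor.size is assumed to behave like a plain dict, and do_pad / do_center_crop are assumed to be the relevant config flags (they are read with getattr, so a missing flag counts as False).

```python
from types import SimpleNamespace


def simple_wrapper_applies(processor):
    """Hypothetical check: fixed {"height", "width"} size, no pad, no crop."""
    size = dict(getattr(processor, "size", None) or {})  # assumes a dict-like size
    has_fixed_size = size.get("height") is not None and size.get("width") is not None
    pads = bool(getattr(processor, "do_pad", False))
    crops = bool(getattr(processor, "do_center_crop", False))
    return has_fixed_size and not pads and not crops


# Stand-in config objects for illustration (not real processors)
siglip_like = SimpleNamespace(size={"height": 384, "width": 384})
clip_like = SimpleNamespace(size={"shortest_edge": 224}, do_center_crop=True)

ok_siglip = simple_wrapper_applies(siglip_like)
ok_clip = simple_wrapper_applies(clip_like)
```

A processor that fails this check may still be covered by the crop_size and resize_mode arguments described further down.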

Backends

  • backend="separable" (default): two-pass uint8 kernel doing taps+taps loads per output pixel, which is torchvision's own separable algorithm. Fastest (~3× the fast processor on the batch above); parity ≤1e-4 vs torchvision-float. The float intermediate makes it more accurate than, but not bit-identical to, torchvision's fixed-point uint8 intermediate.
  • backend="fused": a single 2D launch doing taps×taps loads per output pixel. Same parity; kept as the reference path, but ~9× slower than separable. The 2D float load count is the reason a separable pass wins (see benchmarks/benchmark.py).
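
The taps+taps versus taps×taps difference can be made concrete with a back-of-the-envelope count. This is only a sketch: aa_taps is a hypothetical helper, and its formula (support = 2 × scale for bicubic when downsampling, window width = 2 × ceil(support) + 1) is my reading of how PyTorch sizes the antialias window, stated here as an assumption rather than taken from this kernel. The count says nothing about measured timings.

```python
import math


def aa_taps(in_size, out_size, interp_size=4):
    """Assumed upper bound on the antialias window width along one axis.
    interp_size=4 for bicubic, 2 for bilinear."""
    scale = in_size / out_size
    support = (interp_size * 0.5) * scale if scale >= 1.0 else interp_size * 0.5
    return math.ceil(support) * 2 + 1


def loads_per_output_pixel(taps_h, taps_w):
    """Load counts following the taps+taps vs taps×taps description."""
    return {"separable": taps_h + taps_w, "fused": taps_h * taps_w}


taps = aa_taps(1024, 384)                    # largest input in the benchmark batch
counts = loads_per_output_pixel(taps, taps)  # square resize, same taps on both axes
```

For a 1024 to 384 bicubic downscale this gives a window of 13 taps, so 26 loads for the separable path against 169 for the fused one.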

Parity notes

The resampling weights match PyTorch aten UpSampleKernel. Antialiased bicubic uses the PIL cubic coefficient a=-0.5; non-antialiased bicubic uses Keys a=-0.75. The antialias renormalize-truncate window applies on every axis, including upsampling dims.
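
To make the two coefficients and the renormalize-truncate window concrete, here is a minimal pure-Python sketch. The function names cubic and aa_weights are hypothetical, and the window arithmetic is modeled on my reading of the aten antialias weight computation; treat it as an illustration under that assumption, not as this kernel's code.

```python
def cubic(x, a):
    """Keys cubic convolution kernel with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return ((a + 2.0) * x - (a + 3.0)) * x * x + 1.0
    if x < 2.0:
        return (((x - 5.0) * x + 8.0) * x - 4.0) * a
    return 0.0


def aa_weights(out_index, in_size, out_size, a=-0.5):
    """Antialiased weights for one output index along one axis.
    Returns (first_input_index, normalized_weights)."""
    scale = in_size / out_size
    # The window widens with the scale when downsampling and keeps
    # radius 2 when upsampling; it is applied in both cases.
    support = 2.0 * scale if scale >= 1.0 else 2.0
    inv = 1.0 / scale if scale >= 1.0 else 1.0
    center = scale * (out_index + 0.5)
    lo = max(0, int(center - support + 0.5))        # truncate at the left border
    hi = min(in_size, int(center + support + 0.5))  # truncate at the right border
    raw = [cubic((j - center + 0.5) * inv, a) for j in range(lo, hi)]
    total = sum(raw)
    return lo, [w / total for w in raw]             # renormalize to sum to 1


first, weights = aa_weights(100, 1024, 384)       # downsampling axis
first_up, weights_up = aa_weights(10, 100, 200)   # upsampling axis, same routine
edge_first, edge_weights = aa_weights(0, 1024, 384)  # window clipped at the border
```

The non-antialiased path would instead evaluate cubic(..., a=-0.75) at the four neighbours of the source coordinate, with no window scaling.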

Center crop / shortest-edge

Pass crop_size to resize then center-crop in one pass (the crop is folded into the output-coordinate mapping, no extra buffer). resize_mode="shortest_edge" does aspect-preserving resize (short side = size) then crop — the CLIP / DINOv2 pipeline.

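# CLIP/DINOv2-style: resize shortest edge to 256, center-crop 224
# mean/std below are the ImageNet statistics; substitute your processor's values
mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
pv = kir.resize_normalize(images, 256, mean, std, resample="bicubic", antialias=True,
                          crop_size=224, resize_mode="shortest_edge")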

example_transformers.py derives all of this from a processor's config automatically.
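
Folding the crop into the output-coordinate mapping can be sketched with plain index arithmetic. All three helpers are hypothetical, and the rounding choices (truncating the long side, floor-dividing the crop offset, half-pixel centers) are assumptions for illustration; the kernel's exact conventions may differ.

```python
def shortest_edge_size(in_h, in_w, size):
    """Aspect-preserving target size with the short side equal to `size`."""
    short, long = min(in_h, in_w), max(in_h, in_w)
    new_long = int(size * long / short)  # assumed rounding: truncate
    return (size, new_long) if in_h <= in_w else (new_long, size)


def center_crop_offsets(resized_h, resized_w, crop_h, crop_w):
    """Top/left offset of the crop window inside the (virtual) resized image."""
    return (resized_h - crop_h) // 2, (resized_w - crop_w) // 2


def source_coord(out_index, offset, in_size, resized_size):
    """Map a cropped output index straight to an input coordinate.
    The crop is only an index shift, so no resized buffer is needed."""
    return (out_index + offset + 0.5) * in_size / resized_size - 0.5


resized_h, resized_w = shortest_edge_size(480, 640, 256)
top, left = center_crop_offsets(resized_h, resized_w, 224, 224)
src_y = source_coord(0, top, 480, resized_h)  # input row sampled by output row 0
```

For a 480×640 input this yields a virtual 256×341 resize and a crop offset of (16, 58), so output row 0 samples around input row 30.44.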

Scope

Resize (+ optional center crop) + rescale + normalize. It does not pad — padding processors (many detection models) run a different pipeline. The fused backend is resize-only; crop is handled by the separable backend.
