MAISI-US
— ultrasound to CT synthesis · Diffusion models · ControlNet · Rectified Flow Diffusion
ControlNet-conditioned 3D latent diffusion adapting NVIDIA's MAISI foundation model for ultrasound to CT synthesis using paired cadaver US-CT ROIs, VAE latent pre-encoding, and a multi-metric validation suite (PSNR, SSIM, LPIPS, GOA).
Problem
Spinal surgery navigation traditionally relies on intraoperative cone-beam CT — accurate, but slow and dose-heavy. Ultrasound is fast and radiation-free, but surgeons reason in CT-space: bones, screw trajectories, vertebral landmarks. The translation from a sweep of ultrasound frames to a usable CT volume in real time is the missing link.
We frame this as a cross-modality volumetric synthesis problem: given a registered ultrasound volume, produce a CT volume that supports real-time pose regression for surgical guidance, evaluated against a paired ground-truth CT. To our knowledge, this is the first LDM-based, ControlNet-conditioned, 3D volumetric, spine-specific US-to-CT synthesis pipeline.
Approach
We adapt MAISI — NVIDIA's 3D latent diffusion foundation model — on the maisi3d-rflow variant (Rectified Flow scheduler, ~33× faster inference than DDPM) and inject ultrasound conditioning through a 3D ControlNet branch with single-channel US input (conditioning_embedding_in_channels: 1). The VAE encoder and diffusion U-Net stay frozen; only the ControlNet learns to route ultrasound features into the latent denoising trajectory.
A few details that turned out to matter more than expected:
- HU normalization. MAISI expects CT volumes rescaled to [0, 1]. Naive min-max normalization produced black spots in synthesis — windowing CT to the observed HU range before scaling fixed reconstruction artifacts that masked otherwise-clean outputs.
- Patch geometry. The U-shaped backbone requires dimensions divisible by 32 (training) or 128 (inference). We train and infer on 96×96×48 ROI patches at 1 mm spacing, with z resampled where needed and sliding-window inference at high overlap.
- Latent pre-encoding. CT targets are VAE-encoded once and stored as latent embeddings before ControlNet training — this cuts GPU memory and epoch time by avoiding repeated encoder passes.
- Orientation alignment. Bounding-box axes must match MAISI's expected RAS layout; CT and CBCT ROIs use different transpose orders to align with the pretrained latent space.
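The HU windowing and patch-geometry points above can be illustrated with a small NumPy-only sketch. The function names, the example window bounds, and the end-of-axis zero padding are assumptions for illustration. The actual pipeline relies on MONAI transforms for intensity scaling and divisible padding, and its window comes from the observed HU range of the data.

```python
import numpy as np


def window_and_scale(ct_hu, hu_min, hu_max):
    """Clip a CT volume to [hu_min, hu_max] and rescale linearly to [0, 1]."""
    if hu_max <= hu_min:
        raise ValueError("hu_max must be greater than hu_min")
    clipped = np.clip(ct_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)


def pad_to_divisible(volume, k):
    """Zero-pad the end of each axis so its length is a multiple of k."""
    pad = [(0, (-dim) % k) for dim in volume.shape]
    return np.pad(volume, pad, mode="constant", constant_values=0)


# Hypothetical values: this window is illustrative, not the one used in training.
ct = np.array([[[-2000.0, -1000.0], [0.0, 3000.0]]])
scaled = window_and_scale(ct, hu_min=-1000.0, hu_max=1000.0)

# A 96x96x48 patch padded so every axis is divisible by 32.
patch = pad_to_divisible(np.zeros((96, 96, 48), dtype=np.float32), k=32)
print(scaled.min(), scaled.max(), patch.shape)
```

With a fixed window, an outlier voxel (metal, air outside the field of view) saturates at 0 or 1 instead of compressing the whole intensity range, which is the failure mode that per-volume min-max scaling is prone to.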
Pipeline
The implementation lives in a private MONAI tutorial fork (tutorials_maisi) structured as end-to-end train/validate scripts:
- paired US-CT ROIs datalist builder (cbct_* to vnn_* suffix matching)
- MONAI transforms (RAS orientation, intensity scaling, divisible padding)
- VAE latent cache for CT targets
- multi-GPU ControlNet training (torchrun · L1 loss · TensorBoard / W&B)
- infer_controlnet (sliding-window · Rectified Flow denoising)
- similarity_scoring + orientation_robustness_validation
Dataset
We collected an in-house cadaveric dataset of paired US and CBCT volumes: Clarius and Alpinion linear/curvilinear probes for ultrasound, Medtronic O-arm cone-beam CT for ground truth, NDI optical tracking for probe pose, and a Universal Robots UR3 to script reproducible sweep trajectories. Calibration and registration produced a paired US-CBCT corpus where every ultrasound voxel maps to a CT-space coordinate. We supplement with the NMDID-Spine-NifTI dataset and Sonogym-simulated ultrasound for the SynthUS track.
Without a tightly registered paired corpus, the ControlNet just learns priors rather than the ultrasound-to-CT mapping. The dataset work was as much of the project as the modeling.
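Pairing is also where a missing or mismatched volume can slip into training unnoticed. The sketch below shows one way suffix-based pairing like the datalist builder's cbct_* to vnn_* matching might work. It assumes two files belong together when the text after the prefix is identical. The file names, the function name, and the "image" / "condition" keys are hypothetical, not taken from the repository.

```python
from pathlib import PurePosixPath


def build_datalist(filenames):
    """Pair cbct_* and vnn_* files that share the same suffix after the prefix."""
    cbct, vnn = {}, {}
    for name in filenames:
        base = PurePosixPath(name).name
        if base.startswith("cbct_"):
            cbct[base[len("cbct_"):]] = name
        elif base.startswith("vnn_"):
            vnn[base[len("vnn_"):]] = name
    # Keep only suffixes present on both sides; sort for a reproducible order.
    shared = sorted(cbct.keys() & vnn.keys())
    return [{"image": cbct[s], "condition": vnn[s]} for s in shared]


# Hypothetical file names for illustration.
files = [
    "rois/cbct_spec01_L3.nii.gz",
    "rois/vnn_spec01_L3.nii.gz",
    "rois/cbct_spec01_L4.nii.gz",  # no vnn_ partner, so it is dropped
]
pairs = build_datalist(files)
print(pairs)
```

Dropping unpaired files silently is a design choice for the sketch; a stricter builder could raise or log every suffix that appears on only one side.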
Results
Synthesized volumes are scored against the paired ground-truth CT with the multi-metric suite (PSNR, SSIM, LPIPS, GOA) to quantify synthesis error and epistemic uncertainty.
The synthesized volumes preserve vertebral structure faithfully enough to support 3D-2D registration into intraoperative imaging — the next milestone is closing the loop with the registration pipeline.
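Of the metrics in the validation suite, PSNR is the simplest to state: 10 * log10(data_range^2 / MSE). The sketch below is a minimal NumPy version, assuming both volumes are already scaled to [0, 1] so that data_range is 1.0. The function name and the toy volumes are illustrative; this is not the project's similarity_scoring code.

```python
import numpy as np


def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two same-shaped volumes."""
    diff = pred.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical volumes
    return 10.0 * np.log10(data_range ** 2 / mse)


# Toy volumes: a constant error of 0.1 gives an MSE of 0.01.
target = np.zeros((4, 4, 4))
pred = np.full((4, 4, 4), 0.1)
print(round(psnr(pred, target), 2))
```

PSNR only measures voxel-wise error, which is why it is paired with structural and perceptual metrics (SSIM, LPIPS) in the suite.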
Links
Paper in preparation. Training and evaluation code is in a private lab repository (tutorials_maisi). Reach out if you'd like to discuss the approach, evaluation protocol, or dataset curation.