
How a Small Team Scales Planet-Sized AI Weather Models

Jordan Daubinet, Marvin Gabler · September 3, 2025 · 7 min read

Training frontier AI models typically requires massive engineering organizations with specialized teams for data, research, infrastructure, and operations. We took a different approach: building unified systems where each engineer owns the full stack from data ingestion to production serving.

Our EPT-2 models process planetary-scale weather data: 100+ TB hourly throughput, petabytes of historical observations, distributed training across 1,000+ H100 GPUs, and real-time inference serving energy markets globally. This post shares the system architecture and engineering decisions that make this scale achievable.

The patterns described here emerged from practical constraints. Small teams cannot afford architectural complexity or operational overhead. Every system must be debuggable by a single engineer, recoverable without extensive coordination, and performant enough to beat models trained on significantly larger infrastructure. These requirements shaped our technical choices in ways that may be useful for others building high-performance AI systems.

Our Weather Models as Ultra-High Resolution Sparse Video

To understand the scale of our EPT-2 models, consider them as sparse video generation systems operating at unprecedented resolution. While state-of-the-art video models like OpenAI's Sora typically work with up to 1080p resolution and 3 RGB channels, our models generate outputs exceeding 4000×2000 resolution with up to 100 channels representing distinct physical variables: temperature, pressure, geopotential height, vorticity, specific humidity, wind components, and dozens more.
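
To put those numbers in perspective, here is a back-of-the-envelope comparison of per-frame memory footprints (a rough calculation assuming BF16 storage; the exact grid dimensions are illustrative):

```python
# Rough per-frame sizes, assuming 2 bytes per value (BF16).
BYTES_PER_VALUE = 2

def frame_bytes(height: int, width: int, channels: int) -> int:
    """Raw size of one dense frame in bytes."""
    return height * width * channels * BYTES_PER_VALUE

video_1080p = frame_bytes(1080, 1920, 3)    # ~12.4 MB per RGB frame
weather = frame_bytes(2000, 4000, 100)      # ~1.6 GB per atmospheric state

print(f"1080p RGB frame: {video_1080p / 1e6:.1f} MB")
print(f"weather frame:   {weather / 1e9:.2f} GB "
      f"({weather / video_1080p:.0f}x larger)")
```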

Unlike conventional video models that process dense pixel grids, our models must encode atmospheric states from sparse observations scattered irregularly across space and time. Each "frame" we generate reconstructs the complete global atmospheric state from tens of thousands of weather stations, radars, satellite measurements, and proprietary sources, filling in millions of grid points through learned physical relationships. This sparse-to-dense generation at planetary scale represents a unique technical achievement that current approaches in weather AI cannot match.

The temporal evolution must respect physical conservation laws while maintaining numerical stability across hundreds of forecast steps. We are currently the only lab demonstrating this combination of resolution, channel depth, and physical consistency in production weather systems, while also outperforming all publicly known numerical and AI weather models on accuracy metrics.

For more details on our model architecture and approach, see our research page.

Why Scale Matters

Weather systems exhibit global spatiotemporal coupling: a typhoon in the Pacific affects jet stream patterns over North America within days. Capturing these long-range dependencies requires models that process planetary observations simultaneously. The target operating envelope for production weather AI includes global coverage, short- to long-term forecast horizons (up to 60 days), and hourly temporal resolution, all with delivery deadlines tied to energy market gates. Missing a forecast window by minutes can cost millions in stranded generation capacity.

Data Systems: From Observations to Training-Ready Shards

Weather observations arrive from heterogeneous sources: satellite imagery (GOES, Meteosat, Himawari), ground station networks (SYNOP, METAR), radar reflectivity (NEXRAD), and many more. Each source has distinct formats, coordinate systems, temporal sampling, and variable naming conventions.

Think of this as assembling video frames from scattered, time-shifted camera feeds. Our ingestion pipeline performs real-time standardization: unit conversions (e.g. Kelvin to Celsius), regridding (lat/lon to standardized projections), variable harmonization (e.g. specific humidity vs relative humidity), and temporal alignment (handling irregular observation windows). These sparse observations are encoded into dense atmospheric representations, similar to how neural video codecs reconstruct full frames from keypoints. Raw observations are transformed into spatiotemporal chunks aligned with training stride patterns, then sharded by region, time, and variable type for optimal I/O access patterns.
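
A minimal sketch of one such standardization step, assuming simple bilinear regridding with SciPy (the function and grid names are illustrative, not our production pipeline):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def standardize_temperature(field_k, src_lat, src_lon, dst_lat, dst_lon):
    """Convert a 2D temperature field from Kelvin to Celsius, then
    regrid it onto a standardized target lat/lon grid."""
    field_c = field_k - 273.15  # unit conversion: Kelvin -> Celsius

    # Bilinear regridding from the source grid to the target grid.
    interp = RegularGridInterpolator((src_lat, src_lon), field_c,
                                     bounds_error=False, fill_value=np.nan)
    lat2d, lon2d = np.meshgrid(dst_lat, dst_lon, indexing="ij")
    points = np.stack([lat2d.ravel(), lon2d.ravel()], axis=-1)
    return interp(points).reshape(len(dst_lat), len(dst_lon))
```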

The data path follows: cold storage → training cache → host memory → device memory. Our infrastructure manages petabytes of weather observations accumulated over years, achieving 100 TB+ hourly processing with 5-8 GB/s data inflow and 5000+ IOPS per node. This scale requires careful orchestration of storage tiers, network bandwidth allocation, and compute scheduling to ensure data availability matches training demands.
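
The tiering logic itself can be sketched simply; the paths and object-store client below are placeholders for illustration, not our actual storage stack:

```python
from pathlib import Path

CACHE_DIR = Path("/mnt/training-cache")  # hypothetical local NVMe cache

def load_shard(shard_key: str, object_store) -> bytes:
    """Read a shard from the local training cache, falling back to
    cold object storage and populating the cache on a miss."""
    cached = CACHE_DIR / shard_key
    if cached.exists():
        return cached.read_bytes()          # hot path: local NVMe
    data = object_store.get(shard_key)      # cold path: object storage
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(data)                # populate cache for reuse
    return data
```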

Distributed Training at Frontier Scale (1000+ H100s)

Compute Scale: We have trained and operated models on more than 1,000 NVIDIA H100 GPUs. Input throughput frequently becomes the limiting factor at this scale rather than matrix FLOPs; the sections below explain how we keep the model fed.

Infrastructure Partners: Not many of the datacenters we tested could handle the networking throughput we required at this scale. Our partnerships with cloud providers and storage solutions like Vast Data have been instrumental in achieving these performance targets. For details on how we scaled our infrastructure, see our case study with Crusoe Cloud and Vast Data.

8x H100 server node used for EPT-2 training.

We mainly use two parallelism modes based on model size and cluster topology:

Distributed Data Parallel (DDP): Each GPU maintains a complete model replica but processes different data shards. Gradients are synchronized via all-reduce across all ranks after each backward pass. We use DDP for models that fit within single-node GPU memory (typically <40B parameters).
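
A minimal DDP setup sketch, assuming a torchrun launch (build_model and dataset are placeholder names, not our actual code):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Typical DDP initialization under torchrun.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().to(local_rank)        # full replica on every GPU
model = DDP(model, device_ids=[local_rank])

# Each rank sees a disjoint shard of the data.
sampler = torch.utils.data.DistributedSampler(dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)
```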

Fully Sharded Data Parallel (FSDP): Model parameters, gradients, and optimizer states are partitioned across multiple GPUs. Forward passes gather required parameters just-in-time; backward passes use reduce-scatter for gradient synchronization. FSDP enables training of models exceeding single-node memory capacity.
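
A corresponding FSDP sketch under the same assumptions, using a size-based auto-wrap policy as one illustrative sharding choice:

```python
import functools
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Shard parameters, gradients, and optimizer state across ranks.
# build_model is the same placeholder as in the DDP sketch above.
wrap_policy = functools.partial(size_based_auto_wrap_policy,
                                min_num_params=100_000_000)
model = FSDP(build_model(),
             auto_wrap_policy=wrap_policy,
             device_id=torch.cuda.current_device())
```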

We tune global batch size and gradient accumulation steps to balance training stability against cluster utilization. Mixed precision (BF16 compute, FP32 master weights) reduces memory footprint while maintaining numerical stability. Activation checkpointing trades compute for memory by recomputing intermediate activations during backward passes.
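
A sketch of how these levers combine in a PyTorch training loop (model, loader, and optimizer are assumed defined; the accumulation count is illustrative):

```python
import torch
import torch.nn.functional as F

ACCUM_STEPS = 4  # illustrative; tuned per run for stability vs utilization

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(loader):
    # BF16 compute under autocast; the optimizer keeps FP32 master weights.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        pred = model(x.cuda(non_blocking=True))
        loss = F.mse_loss(pred, y.cuda(non_blocking=True))
    # Scale so accumulated gradients match one large-batch step.
    (loss / ACCUM_STEPS).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)

# Inside the model, memory-heavy blocks can be wrapped with
# torch.utils.checkpoint.checkpoint(...) to recompute activations on backward.
```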

Training these models requires learning temporal dynamics across hundreds of timesteps while maintaining consistent physics. Each training sample represents a complete "weather state" and its target, with the model learning to predict future frames from past atmospheric states. A single training run processes thousands of terabytes of weather data, cycling through years of global atmospheric observations to learn robust physical patterns.

Distributed Data Parallel: replicate model, shard data.

Fully Sharded Data Parallel: shard parameters, gradients, and optimizer state.

Model Lead Roberto and CEO Marvin in a Nebius datacenter where we train our models.

Input Pipeline & Dataloaders (Why We Beat Frontier Throughput)

Weather data presents unique challenges for ML pipelines: irregular spatiotemporal grids, heterogeneous variable types, and massive file counts that break traditional dataloaders. Our custom architecture addresses these through:

Prefetch Design: Multi-level prefetch queues (depth 4-8) with backpressure control prevent memory overflow while maintaining continuous GPU feeding. Each prefetch thread manages its own memory pool to avoid allocation contention.
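
A minimal sketch of the pattern, with decode_and_pack, shards, and train_step as placeholders:

```python
import queue
import threading

PREFETCH_DEPTH = 6  # within the 4-8 range mentioned above

def prefetch_worker(shard_iter, out_q: queue.Queue):
    """Read and decode shards ahead of the training loop. The bounded
    queue provides backpressure: put() blocks when the consumer lags."""
    for shard in shard_iter:
        batch = decode_and_pack(shard)   # placeholder for the decode step
        out_q.put(batch)                 # blocks when the queue is full
    out_q.put(None)                      # sentinel: end of stream

batches = queue.Queue(maxsize=PREFETCH_DEPTH)
threading.Thread(target=prefetch_worker, args=(shards, batches),
                 daemon=True).start()

while (batch := batches.get()) is not None:
    train_step(batch)   # GPU stays fed while workers read ahead
```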

Optimized Processing Layer: Performance-critical transformations (grid interpolation, variable normalization, data packing) eliminate Python GIL bottlenecks and enable true parallel processing across CPU cores.
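
The production version of this layer is native code; as one illustrative way to get GIL-free, multi-core processing from Python, here is a Numba-compiled normalization kernel:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, nogil=True, cache=True)
def normalize_channels(x, mean, std):
    """Per-channel standardization compiled to native code: runs without
    the GIL and parallelizes across CPU cores. x has shape (C, H, W)."""
    out = np.empty_like(x)
    for c in prange(x.shape[0]):
        out[c] = (x[c] - mean[c]) / std[c]
    return out
```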

Zero-Copy Architecture: Data moves directly from storage to GPU without intermediate copies. Memory-mapped files, pinned buffers, and CUDA streams orchestrate this pipeline, overlapping I/O with compute at every stage.
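
A condensed sketch of that path in PyTorch terms (file name and shard shape are illustrative; the single staging copy into pinned memory is what enables asynchronous DMA to the device):

```python
import numpy as np
import torch

# Memory-map a shard: pages are faulted in on demand rather than read()
# into an intermediate buffer. Shape and filename are illustrative.
arr = np.memmap("shard_000.bin", dtype=np.float32, mode="c",
                shape=(100, 2000, 4000))

stream = torch.cuda.Stream()
pinned = torch.empty(arr.shape, dtype=torch.float32, pin_memory=True)

with torch.cuda.stream(stream):
    pinned.copy_(torch.from_numpy(arr))            # stage into page-locked host RAM
    device = pinned.to("cuda", non_blocking=True)  # async DMA, overlaps compute
```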

Throughput Results: In internal benchmarks, our custom dataloaders and end-to-end pipeline outperform variants from frontier labs such as DeepSeek under equivalent hardware and batch settings.

Reliability, Reproducibility, and Safety Rails

Training runs use deterministic random seeds and checkpoint schedules that enable roll-forward/roll-back operations. We built our own custom scheduling system on top of Slurm on Kubernetes, implementing automatic job requeuing on node preemption. Our automated benchmarking and monitoring infrastructure allows models to scale linearly from 8 to 1000+ GPUs by simply changing a configuration flag, ensuring consistent performance across cluster sizes.
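
The roll-forward pattern looks roughly like this (a simplified sketch; our scheduler automates the requeue-and-resume step):

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int, rank: int = 0):
    """Deterministic per-rank seeding so restarted jobs replay identically."""
    random.seed(seed + rank)
    np.random.seed(seed + rank)
    torch.manual_seed(seed + rank)

def save_checkpoint(path, model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, path)

def resume(path, model, optimizer) -> int:
    """Roll forward from the latest checkpoint after a requeue."""
    if not os.path.exists(path):
        return 0
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]
```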

Monitoring spans multiple layers: training loss curves and gradient norms for convergence tracking, dataloader queue depth and I/O wait times for pipeline health, GPU utilization and NCCL step timings for distributed efficiency, and benchmarking across validation performance in time and space.

Inference: Meeting Market Deadlines

The EPT-2 family includes four specialized models: EPT-2 (deterministic forecasts), EPT-2 Early (low-latency availability for time-critical applications), EPT-2e (ensemble forecasts for probabilistic prediction), and EPT-2 RR (rapid refresh with hourly global updates). Each model follows the same pipeline phases: ingest → compute → export, but with different schedule orchestration and computational requirements.

During inference, our models generate future "frames" of global weather, predicting how atmospheric patterns evolve hour by hour. Unlike video generation that can hallucinate plausible but incorrect details, weather forecasting demands physical accuracy at every pixel across 100 channels. A single inconsistent prediction can cascade into forecast failure.
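
Schematically, the compute phase is an autoregressive rollout; a minimal sketch (the production system additionally batches ensemble members and streams results to export):

```python
import torch

@torch.no_grad()
def rollout(model, state: torch.Tensor, n_steps: int) -> list[torch.Tensor]:
    """Autoregressive forecast: each predicted frame becomes the input
    for the next step. state has shape (1, C, H, W); names illustrative."""
    frames = []
    for _ in range(n_steps):
        state = model(state)      # predict the next atmospheric state
        frames.append(state.clone())
    return frames
```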

Models are warm-loaded into GPU memory to eliminate cold-start latency. Our production pipelines use Rust for high-performance model serving; its memory safety and zero-cost abstractions keep the data preprocessing stages fast without sacrificing reliability.

EPT-2 inference pipeline alongside other models running on our infrastructure.

Evaluation: What We Measure and Why It Matters

Our evaluation uses Root Mean Square Error (RMSE) against weather station observations worldwide through the WeatherReal dataset. We apply consistent evaluation masks and temporal periods across all model comparisons, with no post-hoc selection of favorable time windows or regions.
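
The metric itself is straightforward; a minimal sketch, assuming forecasts already interpolated to station locations and a validity mask:

```python
import numpy as np

def station_rmse(pred: np.ndarray, obs: np.ndarray, mask: np.ndarray) -> float:
    """RMSE over valid station observations only. pred/obs are forecasts
    interpolated to station locations; mask flags usable observations."""
    err = pred[mask] - obs[mask]
    return float(np.sqrt(np.mean(err ** 2)))
```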

10m wind speed RMSE vs forecast lead time: prediction error increases with forecast horizon for the EPT-2 family vs baseline models.

2m temperature RMSE vs forecast lead time: lower RMSE indicates better performance.

Ground station validation skill scores: global validation results show consistent performance improvements across geographical regions.

For technical details on the EPT-2 model architecture and comprehensive benchmark results, see our paper: EPT-2 Technical Report.

Tagged
AI Infrastructure · Scaling · Weather Models · Engineering · Performance

Want to talk to the team behind the writing?

Book a demo to see EPT-2 and Athena in production, or read the open papers behind the work.