StationBench Data: EPT-2 Achieves Top AI Forecast Accuracy

Written by: Olivier Lam, Physical AI Team, Jua.ai AG

Key Takeaways for Energy and Weather Teams

  • EPT-2 delivers competitive or superior accuracy versus ECMWF HRES on 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation through 10-day lead times, verified on more than 10,000 real ground stations.

  • Physics-constrained modeling prevents the physically impossible outputs that plague generic deep-learning weather models, especially on extreme events.

  • EPT-2e outperforms the 50-member ECMWF ENS mean on both RMSE and CRPS at many lead times while using only 10 ensemble members.

  • Native any-Δt forecasting removes the compounding error of step-by-step rollouts, and up to 24 daily updates replace the slow refresh cycles of traditional NWP and other AI models.

  • Jua for Energy, powered by EPT-2 and the AI agent Athena, is the production-grade platform to evaluate first, so you can see these results on your own data in a live benchmark.

How Physics-Constrained Forecasting Changes Model Reliability

Most AI weather models use standard deep-learning architectures on atmospheric data and learn statistical correlations from historical records without enforcing conservation laws of mass, momentum, and energy. They can output fields that look plausible in aggregate yet remain physically impossible, a hallucination problem structurally similar to the one that affects large language models on symbolic tasks. These failures become more visible on extremes and out-of-distribution events.

Physics-constrained foundation models follow a different path. The Earth Physics Transformer (EPT) family, documented in arXiv:2507.09703, is a general spatiotemporal transformer that learns the governing physics of complex systems directly from observational data and encodes conservation laws in a latent representation that integrates forward in time. The architecture cannot produce outputs that violate those laws, because the constraint lives inside the model rather than in a post-processing filter. The same architecture that learns atmospheric dynamics already predicts plasma behaviour inside a tokamak, where the domain changes but the physics remains constant.

This structural constraint allows EPT to stay reliable when the atmosphere moves into states outside the training distribution, which is the regime where generic deep-learning models often degrade. A 2026 study published in Science Advances found that physics-based models such as HRES outperform generic deep-learning models including GraphCast, Pangu-Weather, and FuXi on record-breaking extremes. Physics-constrained models do not share this failure mode.

Accuracy by Timescale: From Nowcast to 10+ Days

StationBench is Jua’s open-source verification framework that evaluates forecast skill against more than 10,000 real ground-truth stations globally, with no post-processing or station fine-tuning applied to any model. The methodology appears in arXiv:2507.09703.
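As a rough illustration of the station-level scoring described above, the sketch below computes RMSE per lead time from forecast and observation pairs. The function name, the record layout, and the wind values are made up for illustration; this is not the StationBench code or its API.

```python
import math
from collections import defaultdict


def rmse_by_lead_time(records):
    """Group (lead_hours, forecast, observed) records by lead time and return RMSE per group.

    Hypothetical helper for illustration only, not part of StationBench.
    """
    squared_errors = defaultdict(list)
    for lead_hours, forecast, observed in records:
        squared_errors[lead_hours].append((forecast - observed) ** 2)
    return {
        lead: math.sqrt(sum(errs) / len(errs))
        for lead, errs in sorted(squared_errors.items())
    }


# Made-up 10 m wind speeds in m/s: (lead time in hours, forecast, station observation)
records = [
    (24, 5.0, 4.0),
    (24, 7.0, 8.0),
    (48, 6.0, 3.0),
    (48, 9.0, 5.0),
]
scores = rmse_by_lead_time(records)
print(scores)
```

A lower value at a given lead time means the forecast sat closer to what the stations measured.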

Across a range of lead times, EPT-2 records competitive RMSE compared to ECMWF HRES on key variables that drive energy P&L. For 10 m wind speed, EPT-2 outperforms HRES at many lead times. For 100 m wind speed, the hub-height variable that determines turbine output, the margin remains consistent across the range and directly affects wind-portfolio dispatch. For 2 m temperature, EPT-2 maintains competitive RMSE that matters for gas demand and load forecasting. For surface solar radiation (SSRD), EPT-2 wins by default against Microsoft Aurora, which publishes no SSRD output, and performs well versus HRES across the horizon.

EPT-2e, the ensemble variant, beats the 50-member ECMWF ENS mean on both RMSE and CRPS at many lead times, as documented in arXiv:2410.15076. CRPS measures probabilistic forecast skill, so a lower CRPS means the ensemble’s probability distributions are better calibrated against observed outcomes. EPT-2e achieves this with 10 published members against the ENS’s 50, which highlights the quality of the underlying physics representation rather than ensemble size alone.
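For readers who have not worked with CRPS, a minimal sketch of the standard empirical estimator for a finite ensemble follows: mean absolute error of the members against the observation, minus half the mean absolute difference between member pairs. The function name and sample values are illustrative, and the exact estimator used in the cited reports is not stated here, so treat this as an assumption.

```python
def crps_ensemble(members, observed):
    """Empirical CRPS for one observation: E|X - y| - 0.5 * E|X - X'|.

    Illustrative helper; lower is better, and for a single member it
    reduces to the absolute error.
    """
    m = len(members)
    # Average distance between each member and the observed value
    term_obs = sum(abs(x - observed) for x in members) / m
    # Half the average distance between all ordered member pairs
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


# Made-up 2 m temperature example in degrees Celsius
print(crps_ensemble([3.0], 5.0))       # single member: plain absolute error
print(crps_ensemble([4.0, 6.0], 5.0))  # two members bracketing the observation
```

The spread term is what rewards an ensemble for bracketing the observation instead of clustering on one side of it.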

EPT-2 is trained to forecast at native any-Δt, which means it predicts arbitrary lead times rather than fixed 6-hour increments. Aurora and most peer models roll forward in 6-hour steps and compound error at each step. EPT-2 avoids that rolling process and removes this source of drift.
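The compounding argument can be made concrete with a deliberately simple toy. It assumes every model call adds the same small relative error, which real models will not satisfy (a direct long-lead prediction has its own error growth), so the numbers only show why chained calls accumulate drift. Both function names are hypothetical.

```python
def rollout_calls(lead_hours, step_hours=6):
    """Number of chained model calls a fixed-step rollout needs to reach a lead time."""
    return -(-lead_hours // step_hours)  # ceiling division


def compounded_relative_error(per_call_error, n_calls):
    """Relative error after n chained calls under the toy assumption of a
    constant multiplicative error per call."""
    return (1 + per_call_error) ** n_calls - 1


calls_to_day_10 = rollout_calls(240)  # 6-hour steps out to 240 h
print(calls_to_day_10)
print(compounded_relative_error(0.01, calls_to_day_10))  # chained rollout
print(compounded_relative_error(0.01, 1))                # one direct call
```

Under this toy assumption, a 1% error per call grows to nearly 49% over the 40 chained calls needed to reach day 10.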

Extreme-Event Performance on Heat, Cold, and Wind Records

The 2026 Science Advances study tested GraphCast, Pangu-Weather, and FuXi against ECMWF HRES on approximately 160,000 heat, 33,000 cold, and 53,000 wind record-breaking events from 2020, identified using ERA5 reanalysis against 1979–2017 baselines. The authors found that generic AI models can underpredict temperatures during heat records and overpredict during cold records, with error magnitude increasing as the record breaks by a larger margin. For some record-breaking heat events, AI-based forecasts have underperformed HRES.
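To make "record-breaking" and "margin" concrete, the sketch below flags a value as a record when it exceeds the extreme of a baseline sample and reports by how much. This is a simplified reading of the idea; the study's exact definition (grid cell, calendar window, thresholds) is not given in this article, and the function name and temperatures are invented.

```python
def record_margin(value, baseline_values, kind="heat"):
    """Return how far `value` beats the baseline record, or None if it sets no record.

    kind="heat" compares against the baseline maximum, kind="cold" against
    the baseline minimum. Illustrative helper only.
    """
    if kind == "cold":
        margin = min(baseline_values) - value
    else:
        margin = value - max(baseline_values)
    return margin if margin > 0 else None


baseline_max_temps = [38.0, 39.5, 40.0]  # made-up baseline values, degrees Celsius
print(record_margin(41.5, baseline_max_temps))                  # heat record by 1.5
print(record_margin(39.0, baseline_max_temps))                  # no record
print(record_margin(-12.0, [-10.0, -8.0], kind="cold"))         # cold record by 2.0
```

The study's finding is that forecast error for generic AI models tends to grow with this margin.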

The same study notes that the three evaluated models are deterministic and simulate only one future state. Probabilistic models that generate multiple possible futures are expected to capture record-breaking extremes more effectively. EPT-2e fits this description as a physics-constrained probabilistic model whose ensemble spread reflects genuine atmospheric uncertainty rather than statistical noise. The physics constraint supports stable behaviour when the atmosphere enters rare regimes outside the training distribution, which is where generic deep-learning models often fail.

Physics Constraints and the Hallucination Problem

LLMs hallucinate because nothing constrains them beyond the symbolic surface, so token sequences that look plausible can be factually wrong. A standard transformer applied naively to atmospheric data produces the same class of error and can output fields that appear statistically smooth but violate basic physics. EPT introduces the constraint at the representation level and encodes conservation laws in the latent space that the model integrates forward. The architecture cannot generate a wind field that violates momentum conservation or a temperature profile that violates energy balance.

This structure provides a concrete answer to the objection that AI weather models cannot be trusted. The objection remains accurate for generic deep-learning models and does not apply to physics-constrained foundation models. Supporting evidence comes from StationBench results against more than 10,000 ground stations, published in the technical reports arXiv:2507.09703 and arXiv:2410.15076.

Head-to-Head Benchmark Results Across Key Variables

The table below summarises EPT-2 performance versus ECMWF HRES, Microsoft Aurora, and Google DeepMind GraphCast across four energy-relevant variables and lead-time bands. The pattern stays consistent: EPT-2 shows strong performance versus HRES across all four variables, while Aurora omits the solar radiation variable that drives PV dispatch decisions and GraphCast publishes no 100 m wind data despite its relevance for turbine output. All figures are drawn from the StationBench methodology described earlier, as reported in arXiv:2507.09703 and arXiv:2410.15076. “Loses to” denotes higher RMSE on StationBench, and “No output” denotes that the model publishes no forecast for that variable.

| Variable / Lead Time | EPT-2 | ECMWF HRES | Microsoft Aurora | GraphCast |
| --- | --- | --- | --- | --- |
| 10 m wind, 0–240 h | Strong performance at many lead times | Benchmark | Loses to EPT-2 across full range | Loses to EPT-1.5 on European wind |
| 100 m wind, 0–240 h | Strong performance at many lead times | Benchmark | Loses to EPT-2 across full range | No published 100 m output |
| 2 m temperature, 0–240 h | Strong performance at many lead times | Benchmark | Loses to EPT-2 up to ~130 h; competitive beyond | Loses to EPT-1.5 on European temperature |
| Surface solar radiation, 0–240 h | Strong performance vs. HRES at many lead times | Benchmark | No output | No published SSRD output |

See these benchmark results on your own region and variables in a live session.

Ensemble Skill and Probabilistic Depth with EPT-2e

The ensemble performance mentioned earlier matters because CRPS measures the full probabilistic skill of an ensemble, so a lower score means the forecast distribution is better calibrated against observed outcomes. This metric aligns with how energy risk desks price optionality around weather uncertainty and evaluate hedging strategies. Beyond this accuracy advantage, EPT-2e updates four times per day, and no AI peer such as Aurora, GraphCast, or ECMWF AIFS ships a productised ensemble equivalent. The ECMWF ENS remains the gold standard for probabilistic NWP, and EPT-2e now surpasses its mean on the metrics that matter most to energy risk teams.

Operational Refresh, Latency, and Dissemination

Traditional NWP runs at 2–4 cycles per day because a single ECMWF simulation consumes approximately 8,400 kWh and costs €1,000–€20,000 on HPC infrastructure, which has created a hard compute ceiling on update frequency for forty years. EPT-2 breaks this ceiling, since a single inference runs on a single GPU in minutes at approximately 0.25 kWh and $0.20–$15. EPT-2 RR, the rapid-refresh variant, updates up to 24 times per day and keeps traders close to the latest atmospheric state.
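The energy figures quoted above imply a large per-run ratio, and the arithmetic is easy to check. The sketch below uses only the numbers in the paragraph; the daily totals additionally assume that every run costs the quoted per-run energy, including rapid-refresh runs, which the article does not state.

```python
NWP_KWH_PER_RUN = 8_400   # quoted energy for a single ECMWF simulation
EPT2_KWH_PER_RUN = 0.25   # quoted energy for a single EPT-2 inference

# Per-run energy ratio between the two approaches
energy_ratio = NWP_KWH_PER_RUN / EPT2_KWH_PER_RUN

# Daily totals, assuming the per-run figures hold for every cycle
nwp_daily_kwh = 4 * NWP_KWH_PER_RUN     # upper end of 2-4 cycles per day
ept2_daily_kwh = 24 * EPT2_KWH_PER_RUN  # 24 rapid-refresh updates per day

print(f"per-run ratio: {energy_ratio:,.0f}x")
print(f"NWP at 4 cycles/day: {nwp_daily_kwh:,} kWh")
print(f"EPT-2 at 24 updates/day: {ept2_daily_kwh} kWh")
```

On these figures, 24 daily EPT-2 updates would use about 6 kWh, against 33,600 kWh for four NWP cycles.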

EPT-2 HRRR delivers high-resolution coverage at up to 5 km natively over Europe at the same cadence, which supports site-level decisions. Athena, the AI agent instrumented with the Jua for Energy tool surface, resolves typical forecast queries in approximately 90 seconds. A typical Jua run completes about 2.5 hours ahead of competing operational runs at the same cycle. For a 1 GW wind portfolio, a four-percentage-point accuracy gain from this refresh cadence translates to roughly €1.5 M in annual savings under standard hedging and imbalance-cost structures.

How to Run Your Own 5-Minute Benchmark

The live benchmark on the Jua platform puts more than 25 models on a single surface, including 10 proprietary EPT-family models and 15 third-party NWP and AI models such as ECMWF HRES, Aurora, and GraphCast. A user selects a region, a variable, and a time window, and the platform returns a head-to-head accuracy comparison in seconds. Backtests against years of historical forecasts run in about 5 minutes via Athena.

You can run this benchmark at athena.jua.ai and see how EPT-2 and EPT-2e behave on the exact locations and assets that matter to your desk.

Conclusion: Turning Physics-Constrained Accuracy into P&L

StationBench results confirm what the arXiv technical reports established, which is that EPT-2 delivers state-of-the-art performance in atmospheric prediction with strong results versus ECMWF HRES on variables and lead times that drive an energy P&L. EPT-2e performs well versus the ECMWF ENS mean on RMSE and CRPS and brings probabilistic depth with fewer members. Physics constraints are not a marketing claim, because they form the structural reason EPT-2 remains reliable where generic deep-learning models fail.

Jua for Energy, built on EPT and Athena, is the first production-grade platform that combines this accuracy with 24-runs-per-day refresh, natural-language analysis, and a live benchmarking surface. The numbers speak, and the next step is running them on your own region so you can see the impact on your portfolio.

Schedule a live benchmark session to see EPT-2 on your assets.

Frequently Asked Questions

What makes physics-constrained AI weather forecasting more accurate than generic deep-learning models?

Generic deep-learning weather models learn statistical correlations from historical atmospheric data but are not structurally constrained by the conservation laws of mass, momentum, and energy that govern physical reality. They can produce outputs that are statistically smooth yet physically impossible and can underperform on extreme events that fall outside their training distribution. Physics-constrained foundation models such as EPT encode conservation laws in the latent representation that the model integrates forward in time, so the architecture cannot generate outputs that violate those laws.

This structural constraint supports EPT-2 accuracy at long lead times and on record-breaking events where generic models can degrade. Supporting evidence comes from StationBench verification against more than 10,000 real ground stations, with no post-processing applied, published in technical reports on arXiv.

How does EPT-2 compare to ECMWF HRES and Microsoft Aurora in practice?

EPT-2 shows strong performance on StationBench compared to ECMWF HRES, which has held the benchmark for forty years, across lead times from 0 to 240 hours and on the key variables that drive energy P&L: 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation. Against Microsoft Aurora, EPT-2 records lower RMSE on 10 m wind and 100 m wind across the 0–240 hour range and on 2 m temperature up to roughly 130 hours, beyond which Aurora is competitive. On surface solar radiation, EPT-2 wins by default because Aurora publishes no SSRD output.

Beyond raw accuracy, EPT-2 forecasts at native any-Δt rather than rolling forward in fixed 6-hour steps as Aurora does, which means EPT-2 does not compound step error. EPT-2 inference also runs approximately 25% faster than Aurora. Jua for Energy does not replace ECMWF, because serious customers keep their ECMWF subscription and run Jua for Energy alongside it. The platform instead replaces the manual plumbing and fragmented tooling around the incumbent feed.

What is StationBench and why does it matter for evaluating AI weather forecasting accuracy?

StationBench is Jua’s open-source verification framework that evaluates forecast skill against more than 10,000 real ground-truth observation stations globally. Unlike reanalysis-based benchmarks, which compare model outputs to gridded reconstructions of the atmosphere, StationBench measures accuracy against actual physical measurements at specific locations. No post-processing or station fine-tuning is applied to any model being evaluated, which keeps the comparison fair and reproducible.
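Comparing a gridded forecast to a point measurement requires some way of picking the model value for each station. One common choice is the nearest grid point by great-circle distance, sketched below. How StationBench actually matches stations to model output is not described in this article, so the approach, function names, and coordinates here are all assumptions.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (spherical Earth, radius 6371 km)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_grid_value(station, grid_points):
    """Return the forecast value at the grid point closest to the station.

    station: (lat, lon); grid_points: list of (lat, lon, value).
    Hypothetical helper, not the StationBench matching procedure.
    """
    lat, lon = station
    nearest = min(grid_points, key=lambda g: haversine_km(lat, lon, g[0], g[1]))
    return nearest[2]


# Made-up grid of 10 m wind forecasts (m/s) around a made-up station location
grid = [(47.0, 8.0, 3.1), (47.5, 8.5, 4.2), (48.0, 9.0, 5.0)]
print(nearest_grid_value((47.4, 8.5), grid))
```

The selected value is then compared with the station's measurement, which is why model resolution matters for site-level scores.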

This methodology matters because energy trading decisions such as wind-ramp positioning, solar-generation dispatch, and gas-demand hedging depend on point-level accuracy at specific sites rather than on gridded skill scores averaged across large domains. StationBench results form the basis for all EPT-2 and EPT-2e benchmark claims published in arXiv:2507.09703 and arXiv:2410.15076.

How does Jua for Energy handle ensemble forecasting and probabilistic risk?

EPT-2e is the ensemble variant of EPT-2 and beats the 50-member ECMWF ENS mean on both RMSE and CRPS at many lead times, verified on StationBench. CRPS measures the full probabilistic skill of an ensemble, so a lower score means the forecast distribution is better calibrated against observed outcomes and aligns with how risk desks price optionality around weather uncertainty. EPT-2e updates four times per day, and no AI peer such as Aurora, GraphCast, or ECMWF AIFS ships a productised ensemble equivalent.

For energy traders who need to position around forecast uncertainty rather than a single deterministic trajectory, EPT-2e provides the probabilistic depth of the ECMWF ENS with superior skill scores and the physics-constrained reliability that generic ensemble models lack on extreme events.

How quickly can a trading team or meteorologist evaluate Jua for Energy against their current provider?

A trading team or meteorologist can complete an initial evaluation in approximately 5 minutes. The live benchmarking surface on the Jua platform puts more than 25 models on a single screen. A meteorologist or quant developer selects a region, a variable, and a current provider, and the platform returns a head-to-head accuracy comparison in seconds. Full backtests against years of historical forecasts run in about 5 minutes via Athena, the AI agent instrumented with the Jua for Energy tool surface.

The Python SDK installs with pip install jua, and the REST API exposes all models through a single schema with Apache Arrow support for large payloads. Teams that prefer programmatic evaluation can stand up an integration in days rather than the quarter it often takes elsewhere, and the benchmark then becomes the deal trigger because the numbers remain visible and reproducible.
