Accuracy of AI Weather Models: The Operational Challenge

Olivier Lam·May 10, 2026

AI Weather Model Accuracy: How EPT-2 Outperforms Traditional

Written by: Olivier Lam, Physical AI Team, Jua.ai AG | Last updated: June 24, 2026

Key Takeaways for Energy Trading Teams

Traditional NWP systems like ECMWF HRES remain accurate but are computationally expensive and update infrequently, which limits their value for intraday trading.
Most AI weather models miss record-breaking extremes because they lack explicit physical constraints and regress toward their training distribution.
EPT-2, Jua’s physics foundation model, outperforms ECMWF HRES on every lead time and energy-relevant variable while updating up to 24 times daily at about 0.25 kWh per simulation.
The Jua platform provides a unified benchmarking surface for 25+ models, so traders can run live head-to-head comparisons on their own regions and variables in seconds.
Run a live benchmark to compare EPT-2 with your current forecast provider and quantify P&L impact before the next model run.

The Problem: High-Stakes Model Choice for Energy Portfolios

Traditional NWP, led by ECMWF’s High-Resolution forecast system (HRES), has set the accuracy benchmark for more than forty years. Its authority is earned: physics-based solvers decompose the atmosphere into three-dimensional grid cells and integrate conservation equations for mass, momentum, and energy forward in time. The limitation is computational. A single NWP simulation consumes approximately 8,400 kWh and costs €1,000–€20,000 on high-performance computing infrastructure. That cost ceiling limits global forecast updates to two to four runs per day, which the energy industry has treated as a structural constraint.

AI weather models break this cost barrier. Inference on a single GPU takes minutes at roughly 0.25 kWh per simulation, which enables update frequencies that NWP cannot match. The accuracy case for AI models remains contested, and skepticism among meteorologists and quant developers rests on published evidence, not conservatism. A model that is cheaper but systematically wrong on cold snaps, wind ramps, and heat extremes creates a worse operational outcome than a slower, more expensive model that gets those events right.

For a 1 GW wind portfolio, a four-percentage-point improvement in forecast accuracy translates to approximately €1.5 million per year in reduced hedging and imbalance costs. A 1 GW solar portfolio at the same accuracy gain saves approximately €3 million per year. The financial stakes of model selection are not marginal, and the ability of AI models to capture the extremes that drive those costs becomes operationally critical.
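The portfolio figures above can be turned into a quick back-of-the-envelope estimate. The sketch below assumes savings scale linearly with installed capacity and with the size of the accuracy gain; the linear scaling in accuracy gain and the function and constant names are illustrative assumptions, not a Jua pricing model.

```python
# Per-GW yearly savings quoted in the article for a four-percentage-point
# accuracy gain. Linear scaling in the accuracy gain is an assumption.
SAVINGS_PER_GW_EUR = {"wind": 1_500_000, "solar": 3_000_000}
REFERENCE_GAIN_PP = 4.0


def annual_savings_eur(capacity_gw, gain_pp=REFERENCE_GAIN_PP):
    """Estimate yearly savings in EUR for a portfolio given as {technology: GW}."""
    total = 0.0
    for technology, gw in capacity_gw.items():
        per_gw = SAVINGS_PER_GW_EUR[technology]
        total += per_gw * gw * (gain_pp / REFERENCE_GAIN_PP)
    return total


if __name__ == "__main__":
    portfolio = {"wind": 2.0, "solar": 0.5}  # hypothetical mixed portfolio in GW
    print(f"{annual_savings_eur(portfolio):,.0f} EUR per year")
```

A real estimate would replace the per-GW constants with figures derived from the desk's own imbalance and hedging costs.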

Quantify the impact on your book by running a live benchmark on your regions and variables against 25+ models on the Jua platform.

AI Weather Models and Extremes: What the Evidence Shows

The most rigorous published evidence on AI performance on extremes comes from a 2024 study in Science Advances. The study evaluated GraphCast, Pangu-Weather, FuXi, and their operational variants against ECMWF HRES on record-breaking heat, cold, and wind events in 2018 and 2020. These events exceeded the 1979–2017 ERA5 training maxima used by all three AI models. HRES produced lower RMSE (root mean square error, the average magnitude of forecast error in the units of the predicted variable) on these out-of-distribution events across nearly all lead times. AI models systematically underestimated both the frequency and intensity of record-breaking events, with forecast bias growing nearly linearly as record exceedance increased. HRES showed far smaller bias and no such linear degradation.
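For readers who want to reproduce the two metrics named above on their own data, a minimal sketch of RMSE and mean bias follows. The function names and the sample temperatures are illustrative and are not taken from the study.

```python
import math


def rmse(forecasts, observations):
    """Root mean square error, in the units of the forecast variable."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(forecasts, observations)]
    return math.sqrt(sum(squared) / len(squared))


def mean_bias(forecasts, observations):
    """Mean of forecast minus observation; negative means underprediction."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(f - o for f, o in zip(forecasts, observations)) / len(forecasts)


if __name__ == "__main__":
    # Hypothetical heat-record case: forecasts fall further short as the
    # observed temperature climbs.
    forecasts = [38.0, 39.0, 40.0]
    observations = [40.0, 42.0, 44.0]
    print("RMSE:", rmse(forecasts, observations))
    print("bias:", mean_bias(forecasts, observations))
```

Reporting bias alongside RMSE matters on extremes, because RMSE alone does not show whether a model errs consistently in one direction.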

The mechanism is structural. Purely data-driven models without explicit physical constraints behave as interpolators within their training distribution. When an event exceeds the historical maximum in the training data, the model has no physical basis for extrapolation and regresses toward the interior of the distribution. The 2024 Science Advances analysis documented that AI models underpredict hot records and overpredict cold records, which creates an implicit soft cap near the most extreme values observed in training data.
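The soft-cap mechanism can be illustrated with a deliberately simple toy model. The sketch below is not how GraphCast, Pangu-Weather, or FuXi work internally; it assumes a hypothetical forecaster that is exact inside its training range and recovers only a fixed fraction (`shrink`) of any excess above the training maximum. Under that assumption, bias grows linearly with record exceedance.

```python
def soft_capped_forecast(truth, training_max, shrink=0.25):
    """Toy forecast that shrinks any excess above the training maximum.

    `shrink` is a hypothetical parameter: 1.0 would mean perfect
    extrapolation, 0.0 a hard cap at the training maximum.
    """
    if truth <= training_max:
        return truth  # inside the training range the toy model is exact
    return training_max + shrink * (truth - training_max)


def forecast_bias(truth, training_max, shrink=0.25):
    """Forecast minus truth; negative values mean the record is underpredicted."""
    return soft_capped_forecast(truth, training_max, shrink) - truth


if __name__ == "__main__":
    training_max = 40.0  # hypothetical hottest value seen in training, in °C
    for truth in (39.0, 41.0, 42.0, 44.0):
        exceedance = max(0.0, truth - training_max)
        print(f"exceedance={exceedance:.1f}  bias={forecast_bias(truth, training_max):.2f}")
```

Doubling the exceedance doubles the bias in this toy, which is the linear pattern the paragraph describes.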

A benchmark study on the WEATHER-5K dataset, which covers 5,672 global stations with hourly observations over a 10-year span, confirmed this pattern at scale. Nearly all academic AI forecasting models underperform ECMWF HRES on extreme-event prediction, with particularly large gaps on wind speed and wind direction at the 99.5th percentile using the Symmetric Extremal Dependence Index (SEDI). The study identified four core limitations: inadequate data scale and diversity for global generalization, reliance on aggregate metrics that hide extreme-event failure modes, disconnect from operational forecasting requirements, and absence of explicit physical constraints such as pressure–wind relationships.
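SEDI is computed from a 2×2 contingency table of event forecasts versus event observations. The sketch below uses the standard definition in terms of hit rate H and false alarm rate F; the function name and the sample counts are illustrative, not values from the WEATHER-5K study.

```python
import math


def sedi(hits, misses, false_alarms, correct_negatives):
    """Symmetric Extremal Dependence Index from contingency-table counts.

    Returns 1 for a perfect forecast and 0 for a forecast with no skill.
    """
    if hits + misses == 0 or false_alarms + correct_negatives == 0:
        raise ValueError("need at least one observed event and one non-event")
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_negatives)
    # The logarithms are only defined when both rates lie strictly in (0, 1).
    if not (0.0 < hit_rate < 1.0 and 0.0 < false_alarm_rate < 1.0):
        raise ValueError("hit rate and false alarm rate must lie strictly between 0 and 1")
    log_f, log_h = math.log(false_alarm_rate), math.log(hit_rate)
    log_1f, log_1h = math.log(1.0 - false_alarm_rate), math.log(1.0 - hit_rate)
    return (log_f - log_h - log_1f + log_1h) / (log_f + log_h + log_1f + log_1h)


if __name__ == "__main__":
    # Hypothetical counts for a rare wind-speed event above a high percentile.
    print(sedi(hits=8, misses=2, false_alarms=10, correct_negatives=980))
```

Because SEDI depends only on the two rates, it stays informative for rare events where accuracy-style scores are dominated by correct negatives.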

EPT-2, Jua’s physics foundation model, addresses the structural cause rather than the symptom. EPT-2 is a spatiotemporal transformer trained on observational physics, learning the governing conservation laws of mass, momentum, and energy directly from data in a latent representation that is integrated forward in time. Outputs are physically constrained by construction. The architecture cannot produce the soft-cap behavior documented in purely data-driven models because the representation respects the physical laws that govern extreme events. Benchmark results appear in the technical report arXiv:2507.09703.

Numerical Models vs AI Models: Current Accuracy Landscape

Against most AI models in 2026, numerical models still win on extremes, while performance on average conditions across all lead times looks more competitive. For EPT-2 specifically, the picture changes: EPT-2 outperforms ECMWF HRES on every lead time and on every variable that drives energy P&L, as documented in arXiv:2507.09703. EPT-2e, the ensemble variant, beats the 50-member ECMWF ENS mean on both RMSE and CRPS (Continuous Ranked Probability Score, a measure of probabilistic forecast accuracy where lower is better) at virtually every lead time.
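CRPS for an ensemble can be estimated directly from the member values. The sketch below uses the common empirical estimator, the mean absolute error of the members minus half the mean absolute difference between member pairs. It is a generic illustration with made-up numbers, not the evaluation code behind the EPT-2 report.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble for one observation (lower is better)."""
    n = len(members)
    if n == 0:
        raise ValueError("ensemble must contain at least one member")
    # Average distance between each member and the observation.
    error_term = sum(abs(m - observation) for m in members) / n
    # Average distance between all member pairs, which rewards useful spread.
    spread_term = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return error_term - spread_term


if __name__ == "__main__":
    # Hypothetical 100 m wind speeds in m/s from a five-member ensemble.
    members = [7.5, 8.0, 8.4, 9.1, 10.0]
    print(ensemble_crps(members, observation=8.6))
```

With a single member the spread term vanishes and the score reduces to the absolute error, so CRPS lets deterministic and ensemble forecasts be compared on one scale.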

Model	RMSE vs HRES (10 m wind, 100 m wind, 2 m temperature, SSRD [surface solar radiation downwards], 0–240 h)	Ensemble CRPS vs ECMWF ENS mean
EPT-2 (Jua)	Outperforms HRES on all four variables across full 0–240 h range	EPT-2e beats 50-member ENS mean at virtually every lead time
ECMWF HRES	Benchmark, 40-year NWP reference	ENS mean, gold standard for probabilistic NWP
Microsoft Aurora	Loses to EPT-2 on 10 m wind and 100 m wind across full range; loses on 2 m temp up to about 130 h; no SSRD output	No productized ensemble equivalent
GraphCast (Google DeepMind)	Underperforms HRES on record-breaking extremes; competitive on average conditions at medium range	No productized ensemble equivalent

Aurora and GraphCast are research outputs from large AI labs, not productized platforms with operational refresh schedules, ensemble variants, or transparent hindcast benchmarking. The comparison above is model to model. At the platform level, the distinction is sharper: Jua is a foundation-model and agent company, and Aurora and GraphCast run as guests on the Jua platform’s benchmarking surface of 25+ models.

Operational Accuracy in 2026: Cadence, Cost, and Latency

Accuracy only matters when paired with operational cadence. For intraday energy trading, a model that is slightly more accurate but updates twice daily is less useful than a model of comparable accuracy that updates 24 times daily, because the twice-daily forecast can be many hours stale when a trade is placed. The operational comparison below uses figures from arXiv:2507.09703 and the Jua infrastructure specification.

Model	Daily forecast updates	Inference energy per simulation
EPT-2 / EPT-2 RR (Jua)	Up to 24× per day (EPT-2 RR); 4× per day (EPT-2e)	About 0.25 kWh on a single GPU
ECMWF HRES / ENS	2–4× per day	About 8,400 kWh on HPC
Microsoft Aurora / GraphCast	Typically 4× per day (research cadence; no productized operational schedule)	Similar order of magnitude to EPT-2 for inference
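The cadence column translates into forecast staleness. The sketch below assumes runs are evenly spaced through the day and ignores delivery latency, both simplifications; the function names are illustrative.

```python
def run_interval_hours(runs_per_day):
    """Hours between consecutive runs, assuming evenly spaced runs."""
    if runs_per_day <= 0:
        raise ValueError("runs_per_day must be positive")
    return 24.0 / runs_per_day


def mean_forecast_age_hours(runs_per_day):
    """Average age of the latest run at a random moment of the day.

    With evenly spaced runs and zero delivery latency, the age is uniform
    between 0 and the run interval, so its mean is half the interval.
    """
    return run_interval_hours(runs_per_day) / 2.0


if __name__ == "__main__":
    for label, runs in (("24x per day", 24), ("4x per day", 4), ("2x per day", 2)):
        print(f"{label}: mean age {mean_forecast_age_hours(runs):.1f} h, "
              f"worst case {run_interval_hours(runs):.1f} h")
```

Real delivery latency adds to these figures, so they are lower bounds on how old the newest available forecast is.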

The energy-cost asymmetry between EPT-2 and traditional NWP exceeds four orders of magnitude: about 8,400 kWh per HPC simulation versus about 0.25 kWh per EPT-2 inference run, a factor of roughly 33,600. EPT-2 was trained on 8 × H100 GPUs over 10 days. Microsoft Aurora required 32 × A100 GPUs over 18 days, so EPT-2 used a quarter as many GPUs and a substantially shorter training cycle. At inference, EPT-2 runs approximately 25 percent faster than Aurora. A typical Jua run completes about 2.5 hours ahead of competing operational runs at the same cycle, so traders see the updated forecast before the market re-prices.
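The ratios in this paragraph follow from the quoted figures. The sketch below reproduces the arithmetic; the GPU-day comparison is a rough illustration only, since H100 and A100 devices differ in throughput and the two models differ in size.

```python
import math

NWP_KWH_PER_RUN = 8_400.0  # article's figure for one NWP simulation on HPC
EPT2_KWH_PER_RUN = 0.25    # article's figure for one EPT-2 inference run


def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the ratio between two positive quantities."""
    return math.log10(larger / smaller)


def gpu_days(gpu_count, days):
    """Raw GPU-days, not adjusted for hardware generation."""
    return gpu_count * days


if __name__ == "__main__":
    ratio = NWP_KWH_PER_RUN / EPT2_KWH_PER_RUN
    print(f"energy ratio: {ratio:,.0f}x "
          f"({orders_of_magnitude(NWP_KWH_PER_RUN, EPT2_KWH_PER_RUN):.2f} orders of magnitude)")
    ept2, aurora = gpu_days(8, 10), gpu_days(32, 18)
    print(f"training: EPT-2 {ept2} GPU-days, Aurora {aurora} GPU-days "
          f"({aurora / ept2:.1f}x)")
```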

These performance advantages stem from EPT’s architectural foundation. EPT is a general physics foundation model, not a narrow weather model. The architecture learns the governing dynamics of any continuous, conservation-law-constrained physical system. The atmosphere is the first physical system EPT has been fine-tuned for, and energy trading is the first market Athena has been instrumented for. This mirrors the relationship between a horizontal AI platform and a flagship vertical product. Physically impossible outputs are eliminated by construction, not by post-processing, because the representation itself respects conservation laws.

See a head-to-head comparison of EPT-2 against your current forecast provider in under five minutes.

How Jua for Energy Becomes a Daily Trading Workspace

Jua is a foundation-model and agent company, and Jua for Energy is the first applied product built on EPT and Athena. The Jua platform exposes more than 25 models through a single schema and a single API. These models include 10 proprietary AI models from the EPT family and 15 third-party NWP and AI models such as ECMWF HRES, ECMWF ENS, ECMWF AIFS, NOAA GFS, Microsoft Aurora, and GFS GraphCast. Customers including Axpo, TotalEnergies, Statkraft, EnBW, EDF, and Hydro-Québec run daily trading and dispatch decisions on the platform.

The operational surface compresses the manual trading workflow into one integrated environment. Day-Ahead and Intraday briefings auto-refresh on every new model run, which removes the morning prep routine. Power forecasts for solar, wind onshore, wind offshore, load, and residual load across Germany, Great Britain, France, the Netherlands, and Belgium update with actual-generation data every 15 minutes, which provides continuous validation. When models disagree on a key variable, divergence alerts fire immediately. When a model revises its own output, correction alerts do the same, so traders see the moments when forecast uncertainty spikes. Athena ties this together by resolving natural-language queries for briefings, benchmarks, backtests, and custom widgets in about 90 seconds.
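The idea behind a divergence alert can be sketched in a few lines. This is a hypothetical illustration of the concept, not Jua's implementation: the function name, the model labels, and the threshold are all assumptions, and a production rule would likely account for lead time and typical model spread.

```python
from typing import Dict, Optional


def divergence_alert(forecasts: Dict[str, float], threshold: float) -> Optional[dict]:
    """Return alert details when the spread between models exceeds `threshold`.

    `forecasts` maps a model label to its value for one variable, location,
    and valid time. Returns None when the models broadly agree.
    """
    if len(forecasts) < 2:
        return None  # a single model cannot diverge from itself
    low_model = min(forecasts, key=forecasts.get)
    high_model = max(forecasts, key=forecasts.get)
    spread = forecasts[high_model] - forecasts[low_model]
    if spread <= threshold:
        return None
    return {"spread": spread, "low": low_model, "high": high_model}


if __name__ == "__main__":
    # Hypothetical 100 m wind speeds in m/s for one grid point and valid time.
    wind_100m = {"model_a": 8.0, "model_b": 12.5, "model_c": 9.0}
    print(divergence_alert(wind_100m, threshold=3.0))
```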

Quant developers access the same surface through the Python SDK (pip install jua) and a REST API with Apache Arrow support for large payloads. Hindcast data is available for multi-year backtesting across Jua and third-party models, which lets teams validate strategies before committing capital.

The savings from the four-percentage-point accuracy gain discussed earlier scale linearly with capacity across multi-GW portfolios, so even modest improvements move material P&L. For most customers, the deciding factor is the live benchmark on the Jua platform: any region, any variable, any time window, with results in seconds.

Test the workflow on your highest-stakes region before the next model run lands.

Frequently Asked Questions

How do I know EPT-2 benchmarks are trustworthy?

EPT-2 benchmarks appear in technical reports published on arXiv (EPT-2: 2507.09703; EPT-1.5: 2410.15076) and are evaluated against more than 10,000 real ground stations using Jua’s open-source StationBench methodology, with no post-processing or station fine-tuning. The evaluation is external and reproducible. The Jua platform’s live benchmarking surface also lets any prospective customer run a head-to-head comparison on their own region and variable in under five minutes. The benchmark is a live computation that the customer controls, not a static vendor graphic.

Will AI models ever replace ECMWF HRES for energy trading?

Jua for Energy does not aim to replace ECMWF HRES, and most serious customers keep their ECMWF subscription running alongside the Jua platform. ECMWF AIFS, ECMWF’s own AI model, runs natively on the Jua platform. Jua for Energy replaces the plumbing around the ECMWF feed: the in-house GRIB pipeline, manual benchmarking, morning-briefing routine, and dashboard stitching. The 7–9 a.m. manual prep window compresses into a single workspace where every model, including ECMWF, GFS, AIFS, Aurora, and EPT, appears on the same screen with one schema and one API. Whether any AI model fully replaces NWP incumbents for extreme-event forecasting depends on continued progress in physics-constrained architectures. EPT-2’s current benchmark results represent the state of the art as of mid-2026.

What happens if an AI model produces physically impossible outputs?

Purely data-driven AI models without physical constraints can produce outputs that violate conservation laws, which is the atmospheric equivalent of an LLM hallucination. As explained in the extremes section, EPT’s architecture respects conservation laws by construction, which prevents the physically impossible outputs that generic transformers can produce when applied naively to physics problems. This architectural choice is the key distinction between EPT and research-grade AI weather models like GraphCast or Aurora and underpins EPT-2’s performance on record-breaking extremes.

How quickly can I integrate Jua forecasts with my existing ECMWF pipeline?

Integration typically takes days rather than a quarter. The Jua platform exposes a REST API with Apache Arrow support for large payloads and a Python SDK installable via pip install jua from PyPI. All 25+ models on the platform, including ECMWF HRES, ENS, AIFS, NOAA GFS, DWD ICON, Aurora, and GraphCast, are accessible through a unified schema, so switching or comparing models does not require re-engineering existing pipelines. Hindcast data is available across multiple Jua and third-party models for backtesting. ENTSO-E grid data integrates directly for European power-market workflows. Documentation is at docs.jua.ai and the developer dashboard at developer.jua.ai.

Conclusion: Four Tests for Any AI Weather Solution

Four criteria determine whether an AI weather solution is operationally trustworthy for energy trading in 2026. First, RMSE and CRPS performance on extremes, not just average conditions, because the events that move energy prices sit in the tail of the distribution. Second, explicit physics constraints that prevent physically impossible outputs and support extrapolation beyond the training distribution. Third, operational cadence and inference cost, since a model that updates 24 times daily at 0.25 kWh per run behaves very differently from one that updates twice daily at 8,400 kWh. Fourth, transparent hindcast benchmarking that any customer can reproduce on their own region and variable, rather than vendor-provided graphics.

As of mid-2026, EPT-2 is the only model that meets all four criteria while powering a productized platform. The performance advantage documented earlier holds across all energy-relevant variables: 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation across the full 0–240 hour range. Its ensemble variant, EPT-2e, beats the 50-member ECMWF ENS mean on RMSE and CRPS at virtually every lead time. These results are documented in the technical reports arXiv:2507.09703 and arXiv:2410.15076, which satisfies the transparency criterion.

EPT-2 RR updates up to 24 times daily at approximately 0.25 kWh per simulation, and Athena resolves natural-language queries in about 90 seconds and backtests in about five minutes. Jua is a foundation-model and agent company, and Jua for Energy is the first applied product. The architecture learns physics, and the domain remains a variable.
