AI Weather Forecasting Accuracy 2026: EPT-2 Benchmark

Olivier Lam·June 4, 2026

StationBench Data: EPT-2 Achieves Top AI Forecast Accuracy

Written by: Olivier Lam, Physical AI Team, Jua.ai AG | Last updated: July 4, 2026

Key Takeaways for Energy Desks

AI weather models like EPT-2 now beat traditional NWP on standard accuracy metrics for routine forecasts across all lead times up to 240 hours.
Physics-based models such as ECMWF HRES still hold an edge when predicting record-breaking extreme weather events.
EPT-2 leads the 2026 leaderboard on wind, temperature, and solar radiation variables that drive energy trading profitability.
Operational advantages of AI models include far lower compute costs (roughly 0.25 kWh per EPT-2 inference versus about 8,400 kWh per traditional NWP simulation), up to 24 daily updates, and faster dissemination than traditional NWP.
Energy traders see the strongest results with a hybrid stack that pairs EPT-2 accuracy with ECMWF extremes coverage; book a demo with Jua to benchmark your current provider.

2026 Model Leaderboard: EPT-2 at the Top

The 2026 leaderboard rests on two technical reports published on arXiv: arXiv:2507.09703 (EPT-2) and arXiv:2410.15076 (EPT-1.5). EPT-2 is the current global state of the art in atmospheric prediction, benchmarked against more than 10,000 real ground stations on open-source StationBench with no post-processing or station fine-tuning. EPT-2e, the ensemble variant, beats the 50-member ECMWF ENS mean on both RMSE and CRPS at virtually every lead time, using 30 members instead of 50.
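For readers less familiar with the two scores named above, here is a minimal sketch of how RMSE and an ensemble CRPS are commonly computed against station observations. The function names and sample values are illustrative assumptions, not StationBench code or EPT-2 results; the CRPS uses the standard empirical estimator (mean absolute error of the members minus half the mean absolute difference between members).

```python
import math


def rmse(forecasts, observations):
    """Root-mean-square error over paired forecast/observation values."""
    if len(forecasts) != len(observations) or not forecasts:
        raise ValueError("need equal-length, non-empty sequences")
    squared = [(f - o) ** 2 for f, o in zip(forecasts, observations)]
    return math.sqrt(sum(squared) / len(squared))


def ensemble_crps(members, observation):
    """Empirical CRPS for one ensemble forecast and one observation.

    CRPS = mean|x_i - y| - 0.5 * mean|x_i - x_j|, lower is better.
    With a single member this reduces to the absolute error.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must have at least one member")
    error_term = sum(abs(x - observation) for x in members) / m
    spread_term = sum(abs(a - b) for a in members for b in members) / (m * m)
    return error_term - 0.5 * spread_term


# Placeholder numbers for illustration only (e.g. wind speed in m/s).
station_obs = [5.0, 7.0, 6.0]
deterministic_forecast = [5.5, 6.0, 6.5]
print("RMSE:", rmse(deterministic_forecast, station_obs))
print("CRPS:", ensemble_crps([4.0, 5.0, 6.0], 5.0))
```

RMSE scores a single deterministic forecast, while CRPS rewards an ensemble for being both close to the observation and appropriately spread, which is why ensemble comparisons usually report both.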

The Jua platform hosts more than 25 models at once: 10 proprietary AI models from the EPT family plus 15 third-party NWP and AI models, including ECMWF HRES, ECMWF ENS, ECMWF AIFS, NOAA GFS, GFS GraphCast, Microsoft Aurora, DWD ICON Global, and ICON-EU. This multi-model infrastructure enables direct performance comparison across generations. EPT-1.5 already outperforms GraphCast, FuXi, Pangu-Weather, and ECMWF HRES on European wind and temperature, and EPT-2 extends that lead across the full global variable set.

The live 25-model benchmarking surface on the Jua platform returns a head-to-head accuracy comparison in seconds on any region and variable your team selects. Book a demo to run your own benchmark against your current forecast provider.

Routine Accuracy vs Extreme Events: Where Each Model Wins

The 2026 accuracy picture splits cleanly along one axis: routine forecasts versus record-breaking extremes.

On routine metrics, EPT-2 leads the AI models, outperforming both traditional NWP and competing AI architectures on standard error measures. Desks that care about day-to-day P&L see the largest gains here.

On record-breaking extremes, a 2025 study published in Science Advances found that ECMWF HRES outperforms AI models including Pangu-Weather on record-breaking temperature extremes across nearly all lead times. The study constructed a benchmark of heat, cold, and wind records in 2020 that exceeded 1979–2017 training maxima. ECMWF HRES consistently outperformed GraphCast, Pangu-Weather, and FuXi on those tail events, with the performance gap largest at short lead times. AI models systematically underestimate both the frequency and intensity of record-breaking events, with forecast bias growing nearly linearly with the magnitude of record exceedance. As lead author Dr. Zhongwei Zhang of KIT stated: “The greater the exceedance of the record of their training data, the larger the underestimation.”

A 2026 study in Geophysical Research Letters by Davis et al. evaluated Aurora, GraphCast, Pangu-Weather, FourCastNetV2, and FourCastNet against ECMWF HRES for U.S. West Coast atmospheric river detection across 152 daily forecast cycles. HRES maintained superior detection skill for the first four forecast days. Aurora achieved the lowest atmospheric river detection performance of all models tested despite recording the best variable-specific RMSE, which illustrates the disconnect between standard error metrics and extreme-event prediction skill.
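The disconnect is easier to see once detection is scored as a yes/no problem instead of an average error. The sketch below uses three standard categorical verification scores (probability of detection, false alarm ratio, critical success index). Whether Davis et al. used exactly these scores is not stated here, and the counts are placeholders, not results from the study.

```python
def detection_scores(hits, misses, false_alarms):
    """Categorical skill scores from a forecast/observation contingency table.

    hits:         event forecast and observed
    misses:       event observed but not forecast
    false_alarms: event forecast but not observed
    """
    observed_events = hits + misses
    forecast_events = hits + false_alarms
    all_events = hits + misses + false_alarms
    nan = float("nan")
    return {
        # Fraction of observed events that were forecast.
        "pod": hits / observed_events if observed_events else nan,
        # Fraction of forecast events that did not occur.
        "far": false_alarms / forecast_events if forecast_events else nan,
        # Hits relative to all forecast or observed events.
        "csi": hits / all_events if all_events else nan,
    }


# Placeholder counts for illustration only.
print(detection_scores(hits=8, misses=2, false_alarms=2))
```

A model can post a low field-wide RMSE by smoothing its output and still miss the sharp features that define an event, which lowers its hit count and therefore its detection scores.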

EPT-2 uses a physics-constrained architecture, a spatiotemporal transformer that learns conservation laws for mass, momentum, and energy directly from observational data. The outputs are physically constrained by construction, not by post-processing. This design explains why EPT-2 leads on standard metrics while the hybrid recommendation below remains valid for tail-event risk management.

GraphCast, ECMWF, and EPT-2: Head-to-Head Metrics for Energy Variables

These architectural advantages translate directly into measurable performance gains. The table below compares EPT-2 and EPT-2e against ECMWF HRES, ECMWF ENS, Microsoft Aurora, and GFS GraphCast on the four variables that drive energy P&L, across the 0–240 hour lead-time range. All EPT-2 figures are sourced from arXiv:2507.09703; EPT-1.5 figures from arXiv:2410.15076. ECMWF HRES is the 40-year NWP benchmark. Aurora and GraphCast figures reflect published research outputs.

Model	10 m & 100 m Wind (RMSE vs HRES, 0–240 h)	2 m Temperature (RMSE vs HRES, 0–240 h)	Surface Solar Radiation (SSRD)
EPT-2 (arXiv:2507.09703)	Beats HRES on every lead time across full 0–240 h range	Beats HRES on every lead time across full 0–240 h range	Beats HRES; Aurora has no SSRD output
EPT-2e (arXiv:2507.09703)	Beats ECMWF ENS mean (RMSE & CRPS) at virtually every lead time	Beats ECMWF ENS mean (RMSE & CRPS) at virtually every lead time	Ensemble probabilistic skill exceeds ENS mean
ECMWF HRES / ENS	40-year NWP benchmark; retains advantage on record-breaking extremes (Science Advances 2026)	Outperforms AI models on tail events; competitive on routine metrics	Operational SSRD output available
Aurora / GraphCast (Davis et al. 2026)	EPT-2 beats Aurora on 10 m and 100 m wind across full 0–240 h range; Aurora rolls forward in fixed 6-hour steps, compounding error	EPT-2 beats Aurora on 2 m temperature up to ~130 h; GraphCast competitive on standard metrics but underperforms on extremes	Aurora has no SSRD output; EPT-2 wins by default

Aurora’s strong variable-specific RMSE does not translate into accurate extreme-event prediction, as the Davis et al. atmospheric river study confirmed. EPT-2 is trained to predict at arbitrary time steps (native any-Δt forecasting) rather than rolling forward in fixed 6-hour increments, which avoids the error compounding that affects Aurora and most peers.
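A short sketch of why step size matters: a model that advances in fixed increments has to feed its own output back in many times to reach a long lead time, and each pass inherits the error of the previous one. The function below only counts model applications. It assumes a model queried directly at the target lead time needs a single pass, and it is an illustration, not Aurora or EPT-2 code.

```python
import math


def model_applications(lead_time_h, step_h=None):
    """Number of forward passes needed to reach a given lead time.

    step_h=None stands for a model queried directly at the target
    lead time (assumed here to need one pass).
    """
    if lead_time_h <= 0:
        raise ValueError("lead_time_h must be positive")
    if step_h is None:
        return 1
    if step_h <= 0:
        raise ValueError("step_h must be positive")
    return math.ceil(lead_time_h / step_h)


for lead in (24, 130, 240):
    fixed = model_applications(lead, step_h=6)
    direct = model_applications(lead)
    print(f"{lead:>3} h: fixed 6 h steps = {fixed} passes, direct = {direct} pass")
```

At the 240-hour horizon a fixed 6-hour rollout chains 40 passes, so any per-step error has 40 opportunities to accumulate.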

See these numbers applied to your own portfolio region. Book a demo and run a head-to-head benchmark against your current provider on the Jua platform.

Operational Metrics: Frequency, Latency, and Cost

Operational cadence and inference cost show the sharpest gap between AI and traditional NWP.

A single traditional NWP simulation consumes approximately 8,400 kWh and costs €1,000–€20,000 to run on HPC infrastructure, taking one to two hours per cycle. That compute ceiling caps update frequency at two to four global forecasts per 24 hours, a hard constraint the energy industry has operated under for forty years. A single EPT-2 inference runs on a single GPU in minutes, at approximately 0.25 kWh and $0.20–$15 per simulation. The energy gap alone exceeds four orders of magnitude (8,400 kWh versus 0.25 kWh, a ratio of about 33,600), and the per-run cost gap spans roughly two to five orders of magnitude depending on which ends of the quoted ranges are compared.
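The arithmetic behind that comparison can be checked in a few lines. This is a back-of-the-envelope sketch using only the figures quoted above; treating euros and dollars at parity is a simplifying assumption made here, not something stated in the source.

```python
import math

# Figures quoted in the paragraph above.
NWP_ENERGY_KWH = 8400.0
EPT2_ENERGY_KWH = 0.25
NWP_COST_RANGE = (1000.0, 20000.0)  # EUR per simulation
EPT2_COST_RANGE = (0.20, 15.0)      # USD per simulation


def orders_of_magnitude(ratio):
    """Base-10 logarithm of a ratio, i.e. how many powers of ten apart."""
    return math.log10(ratio)


energy_ratio = NWP_ENERGY_KWH / EPT2_ENERGY_KWH

# Assumption: EUR and USD treated at parity for a rough comparison.
cost_ratio_low = NWP_COST_RANGE[0] / EPT2_COST_RANGE[1]   # cheapest NWP vs priciest EPT-2
cost_ratio_high = NWP_COST_RANGE[1] / EPT2_COST_RANGE[0]  # priciest NWP vs cheapest EPT-2

print(f"energy ratio: {energy_ratio:,.0f}x "
      f"({orders_of_magnitude(energy_ratio):.1f} orders of magnitude)")
print(f"cost ratio:   {cost_ratio_low:,.0f}x to {cost_ratio_high:,.0f}x "
      f"({orders_of_magnitude(cost_ratio_low):.1f} to "
      f"{orders_of_magnitude(cost_ratio_high):.1f} orders of magnitude)")
```

The energy ratio is a single number because both energy figures are point estimates, while the cost ratio is a range because both costs are quoted as ranges.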

EPT-2 RR (rapid refresh) updates up to 24 times per day, providing the highest-frequency operational coverage available. For ensemble forecasting, EPT-2e updates 4 times per day, while EPT-2 HRRR delivers high-resolution coverage at up to 5 km native resolution over Europe for regional applications. A typical Jua run completes approximately 2.5 hours ahead of competing operational runs at the same cycle, which creates a dissemination advantage when the trade window is measured in minutes.

As of 2026, inference speed is no longer a meaningful differentiator among AI weather providers, and the value has shifted to accuracy, data depth, and decision tooling. EPT-2 leads on all three.

EPT-2 was trained on 8 × H100 GPUs over 10 days. Microsoft Aurora required 32 × A100 GPUs over 18 days: four times as many GPUs and a training cycle nearly twice as long. Jua’s inference runs approximately 25% faster than Aurora’s.
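One way to put the two training runs on a common footing is GPU-days, sketched below from the figures above. H100 and A100 are different hardware generations, so this is a rough budget comparison under that caveat, not a like-for-like measure of compute.

```python
def gpu_days(gpu_count, days):
    """Training budget as GPUs multiplied by wall-clock days."""
    return gpu_count * days


ept2_budget = gpu_days(8, 10)     # 8 x H100 for 10 days
aurora_budget = gpu_days(32, 18)  # 32 x A100 for 18 days

print("EPT-2 GPU-days: ", ept2_budget)
print("Aurora GPU-days:", aurora_budget)
print("ratio:", aurora_budget / ept2_budget)
```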

Hybrid Forecasting: How to Combine EPT-2 and ECMWF

Jua for Energy does not replace ECMWF. The ECMWF two-week outlook remains the definitive reference point for traders repricing risk around heating demand, renewable output, and system tightness. ECMWF AIFS, ECMWF’s own AI model, runs natively on the Jua platform alongside EPT-2 under a unified schema and a single API.

The Science Advances study cited earlier is explicit: AI models cannot currently replace classical numerical forecasts for record-breaking extremes, and parallel use of both approaches is recommended, especially for early warning systems where underestimation of extremes can delay or prevent warnings. EPT-2’s physics-constrained foundation-model design narrows that gap relative to pure data-driven peers, but a hybrid stack remains the sound recommendation for any desk managing tail-event exposure.

Aurora’s limitations also matter for the hybrid decision. Aurora has no SSRD output, which makes it structurally incomplete for solar-generation forecasting. Aurora also rolls forward in fixed 6-hour steps, compounding error at longer lead times, whereas EPT-2 forecasts at arbitrary lead times natively. These differences determine which variables and horizons a model can be trusted on.

Jua for Energy: Athena and 90-Second Briefings

Jua operates as a foundation model and agent company, and Jua for Energy is the first applied product. The relationship mirrors Anthropic and Claude Code, a horizontal AI platform with a flagship vertical product. EPT is a general physics foundation model, and Athena is an AI agent instrumented with the Jua for Energy tool surface.

Athena turns a natural-language question into a briefing, a benchmark, a backtest, or a custom widget. A typical query resolves in approximately 90 seconds, and a backtest completes in approximately 5 minutes. A trader asking “what is the 100 m wind forecast spread across models for northern Germany tonight?” receives the answer, the underlying widget, and the model delta without opening a terminal or downloading a GRIB file.
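For a sense of what a "spread across models" answer contains, here is a minimal sketch that summarizes a set of per-model forecasts for one variable, location, and time. The model labels and values are hypothetical placeholders, and the function is not part of the Jua SDK or of Athena.

```python
import statistics


def model_spread(forecasts):
    """Summarize disagreement between models for a single forecast target.

    forecasts: mapping of model label -> forecast value (same units).
    """
    values = list(forecasts.values())
    if len(values) < 2:
        raise ValueError("need at least two models to compute a spread")
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical 100 m wind speeds in m/s, for illustration only.
example = {"model_a": 8.0, "model_b": 10.0, "model_c": 12.0}
print(model_spread(example))
```

The range shows the worst-case disagreement, while the standard deviation is less sensitive to a single outlying model.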

The market-sizing economics are straightforward. A 1 GW wind portfolio that gains four percentage points of forecast accuracy saves approximately €1.5 M per year under typical hedging and penalty structures, while a 1 GW solar portfolio at the same accuracy gain saves approximately €3 M per year. Customers operating multi-GW portfolios scale these economics linearly.
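The linear scaling can be written out directly. This sketch uses only the per-GW figures quoted above, which apply to a four-percentage-point accuracy gain. It scales with capacity only, since the source gives no rule for other accuracy gains, and the portfolio in the example is hypothetical.

```python
# Approximate annual savings per GW at a four-percentage-point accuracy gain,
# as quoted in the paragraph above.
SAVINGS_PER_GW_EUR = {"wind": 1.5e6, "solar": 3.0e6}


def annual_savings_eur(capacity_gw, technology):
    """Linear scaling of the quoted per-GW savings with portfolio size."""
    if capacity_gw < 0:
        raise ValueError("capacity_gw must be non-negative")
    return capacity_gw * SAVINGS_PER_GW_EUR[technology]


# Hypothetical portfolio: 3 GW of wind and 2 GW of solar.
portfolio = {"wind": 3.0, "solar": 2.0}
total = sum(annual_savings_eur(gw, tech) for tech, gw in portfolio.items())
print(f"estimated annual savings: EUR {total:,.0f}")
```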

Jua for Energy is used by Axpo, TotalEnergies, Statkraft, EnBW, EDF, and Hydro-Québec across utilities, physical trading houses, and quantitative funds on four continents. Quant teams install the SDK with pip install jua and pipe EPT-2 forecasts directly into their own systematic models via the REST API, with Apache Arrow support for large payloads and hindcast data available for backtesting.

The 7–9 a.m. manual prep routine (downloading GRIB files, processing them through brittle pipelines, and waiting for the meteorologist’s briefing) compresses into a single workspace refreshed up to 24 times a day, where every model sits on the same screen with one schema. You act before the market does. Book a demo to see Athena and EPT-2 applied to your desk’s variables and regions.

Frequently Asked Questions

What is the most accurate weather forecasting AI in 2026?

EPT-2, from Jua’s Earth Physics Transformer family, is the current global state of the art in atmospheric prediction for the variables that drive energy P&L. It outperforms ECMWF HRES on every lead time from 0 to 240 hours on 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation. EPT-2e, the ensemble variant, beats the 50-member ECMWF ENS mean on both RMSE and CRPS at virtually every lead time. EPT-2 also outperforms Microsoft Aurora on 10 m and 100 m wind across the full 0–240 hour range and on 2 m temperature up to roughly 130 hours, and wins by default on surface solar radiation because Aurora produces no SSRD output. These results are documented in technical reports on arXiv (2507.09703 for EPT-2; 2410.15076 for EPT-1.5) and validated against more than 10,000 real ground stations on open-source StationBench with no post-processing.

What are the disadvantages of AI in weather forecasting for extreme events?

The primary limitation of current AI weather models on extreme events is systematic underestimation of both the intensity and frequency of record-breaking heat, cold, and wind events. A 2026 peer-reviewed study in Science Advances, led by Dr. Zhongwei Zhang at the Karlsruhe Institute of Technology, constructed a benchmark of over 248,000 record-breaking events in 2020 and found that GraphCast, Pangu-Weather, and FuXi all underperformed ECMWF HRES on tail events across nearly all lead times. The degree of underestimation grows nearly linearly with the magnitude of record exceedance, so the more extreme the event relative to training data, the larger the AI model’s error. A separate 2026 study in Geophysical Research Letters found that Aurora achieved the lowest atmospheric river detection performance of all models tested despite recording the best variable-specific RMSE, which shows that standard error metrics do not capture extreme-event skill. The recommended operational approach is parallel use of physics-based and AI models, with physics-constrained AI architectures such as EPT-2 narrowing the gap relative to pure data-driven peers.

How accurate is AI weather prediction compared to traditional models at 10-day lead times?

At 10-day lead times, AI models generally match or exceed traditional NWP on standard metrics for routine variables. EPT-2 leads that AI tier, beating HRES on every lead time out to 240 hours on the four key energy variables. The exception remains record-breaking extremes. A 2025 study published in Science Advances found that ECMWF HRES outperforms AI models including Pangu-Weather on record-breaking temperature extremes across nearly all lead times. The performance gap between HRES and AI models narrows beyond 5 days but remains favorable to HRES for tail events across seasons and climate zones. At 10-day lead times for routine energy-trading variables such as wind generation, solar output, and temperature-driven demand, EPT-2 is the most accurate operational model available. For tail-event risk management at any lead time, running ECMWF alongside EPT-2 remains the operationally sound recommendation.

Conclusion: 2026 Leaderboard and Physical-Economy Roadmap

The 2026 leaderboard on routine metrics is clear. EPT-2 leads ECMWF HRES on the four key energy variables across the full 0–240 hour range, leads Microsoft Aurora on wind across the same range and on 2 m temperature up to roughly 130 hours, and EPT-2e leads the ECMWF ENS mean on RMSE and CRPS at virtually every lead time. Physics-based models retain a measurable advantage on record-breaking extremes, so a hybrid stack that runs EPT-2 alongside ECMWF remains the correct configuration for any desk managing tail-event exposure.

The operational advantages compound the accuracy story, with high-frequency updates, dramatically lower inference costs than traditional NWP, a meaningful dissemination lead, and Athena resolving natural-language queries into benchmarks and backtests in minutes. The Jua platform exposes all 25+ models through a single schema, with hindcast data available for backtesting and pip install jua for programmatic access.

Jua operates as a foundation model and agent company. The atmosphere is the first physical system EPT has been fine-tuned for, and energy trading is the first market Athena has been instrumented for. The roadmap extends to plasma fusion, aerospace, materials, and other physical-economy domains where the same architecture applies. Customers buying Jua for Energy today are buying the first surface of a platform that will expand outward from there.

See EPT-2 head-to-head against your current forecast provider on your own region and variables. Book a demo to run a live comparison on your own data.

