
AI Weather Intelligence Accuracy: EPT-2 vs. NWP Models

Olivier Lam·May 15, 2026

AI Weather Intelligence Accuracy: Jua EPT-2 Beats ECMWF

Written by: Olivier Lam, Physical AI Team, Jua.ai AG | Last updated: July 12, 2026

Key Takeaways for Energy Trading Teams

Traditional NWP models update only 2–4 times daily with 6–12 hour data lags, so traders react to weather after prices move.
EPT-2 delivers up to 24 daily updates at a fraction of NWP compute cost and outperforms ECMWF HRES and ENS on key energy variables across all lead times.
Athena AI agent unifies 25+ models in one workspace and turns natural-language questions into briefings, benchmarks, and backtests in under 90 seconds.
Physics-constrained architecture enforces conservation laws and prevents hallucinations, keeping trading outputs reliable even during extreme events.
Energy traders can evaluate Jua for Energy instantly by running live head-to-head benchmarks on their own regions and variables at athena.jua.ai, or schedule a personalized demo to see the platform in action.

Update Frequency Limits in Traditional NWP

The ECMWF supercomputer runs its full NWP algorithm twice a day, with smaller supplementary runs in between. The energy industry receives roughly four global forecasts per 24-hour period, with weather data that is 6–12 hours old by the time traders receive production forecasts. During those gaps, traders work from stale numbers and react to weather only after it appears in the price.

Infrastructure economics drive this limitation. A single NWP simulation requires many hours on supercomputers, consuming about 8,400 kWh and costing €1,000–€20,000 per run. That cost ceiling has constrained update frequency for forty years.

EPT-2 Rapid Updates and Fresh Inputs

EPT-2 updates 4 times per day, and EPT-2 RR, the rapid-refresh model, updates up to 24 times per day. This cadence is viable because a single EPT-2 inference runs on one GPU in minutes at roughly 0.25 kWh and $0.20–$15 per simulation, which is about four orders of magnitude cheaper than an equivalent NWP run. EPT-2e, the ensemble model, beats the 50-member ECMWF ENS mean on both RMSE and CRPS at virtually every lead time, as documented in the EPT-2 technical report (arXiv 2507.09703).
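The energy side of that comparison can be checked with a few lines of arithmetic. The sketch below uses only the per-run energy figures quoted in this article. It makes two simplifying assumptions: that all four daily NWP runs are full-cost runs, and that a rapid-refresh run uses the same energy as the quoted EPT-2 inference. Costs are left out because the article quotes them in different currencies, and the variable names are illustrative.

```python
import math

# Per-run energy figures quoted in this article (kWh).
NWP_KWH_PER_RUN = 8_400.0
EPT2_KWH_PER_RUN = 0.25

# Ratio of energy per run, and its size in orders of magnitude.
energy_ratio = NWP_KWH_PER_RUN / EPT2_KWH_PER_RUN
orders_of_magnitude = math.log10(energy_ratio)

# Daily energy at the update cadences described in the text.
# Assumption: every run in a day costs the same as the quoted per-run figure.
nwp_daily_kwh = 4 * NWP_KWH_PER_RUN        # roughly four NWP forecasts per day
ept2_rr_daily_kwh = 24 * EPT2_KWH_PER_RUN  # up to 24 rapid-refresh runs per day

print(f"energy ratio per run: {energy_ratio:,.0f}x")
print(f"orders of magnitude: {orders_of_magnitude:.1f}")
print(f"daily energy, NWP at 4 runs: {nwp_daily_kwh:,.0f} kWh")
print(f"daily energy, EPT-2 RR at 24 runs: {ept2_rr_daily_kwh:,.0f} kWh")
```

On these figures the per-run energy gap is between four and five orders of magnitude, so even 24 runs per day use far less energy than a single NWP run.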

Jua’s rapid-refresh integration keeps weather inputs only 1.5–2.5 hours old and makes forecasts available approximately 1.5 hours after each run, compared with the 6–12 hour lag of traditional NWP. Frequent, low-cost runs turn weather into a near-real-time trading signal rather than a delayed indicator.

Why Fragmented Weather Workflows Slow Traders

The standard energy-trading workflow stitches together raw GRIB files from ECMWF and GFS, brittle in-house pipelines, internal meteorology input, and multiple vendor dashboards. Traders jump between spreadsheets, terminals, and web tools to build a coherent view of the day. By the time that view exists, the market has often already moved.

Data-driven weather prediction models now rival traditional NWP skill at a fraction of the computational cost. Most of these models arrive as raw outputs without workflow tooling, so trading desks carry the integration burden and lose time that should go into risk and opportunity analysis.

Athena: One Workspace for 25+ Models

Athena is Jua’s AI agent, wired into the Jua for Energy tool surface. A trader types a natural-language request such as “show the 100 m wind forecast spread across models for northern Germany tonight.” Athena then plans the task, calls tools, evaluates intermediate outputs, and returns a briefing, benchmark, backtest, or custom widget.

Athena turns raw physics predictions from EPT-2 into actionable trading analysis by reading market context and modeling participant behavior. Typical queries resolve in about 90 seconds, and backtests complete in roughly 5 minutes. The Jua platform exposes more than 25 models, including 10 proprietary EPT models and 15 third-party NWP and AI models, through a single schema and API, which removes the multi-vendor dashboard problem.

See Athena in action by running a live benchmark on your own region and variables at athena.jua.ai. Most comparisons complete in under 5 minutes.

Why Benchmarking AI Weather Models Is Hard

Meteorologists evaluating AI weather models usually receive vendor graphics instead of tools to run their own head-to-head benchmarks. They can rarely test their own region and variable directly. Accuracy claims without peer-reviewed evidence and transparent validation methods deserve skepticism.

A 2024 study in Science Advances evaluated AI models on record-breaking extremes using latitude-weighted RMSE, forecast bias, precision-recall curves, and binary correlation metrics. That kind of multi-metric, open methodology sets a high bar that simple vendor charts do not meet.
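For readers unfamiliar with latitude-weighted RMSE, here is a minimal sketch. It assumes the common cos(latitude) area weighting on a regular latitude-longitude grid. The exact definition used in the study may differ, and the function name and toy data are illustrative only.

```python
import numpy as np

def latitude_weighted_rmse(forecast, truth, lats_deg):
    """RMSE over a (lat, lon) grid with cos(latitude) area weights."""
    weights = np.cos(np.deg2rad(lats_deg))
    weights = weights / weights.mean()      # normalize so the weights average to 1
    sq_err = (forecast - truth) ** 2        # shape (lat, lon)
    weighted = sq_err * weights[:, None]    # broadcast weights across longitude
    return float(np.sqrt(weighted.mean()))

# Toy grid: three latitude rows, four longitudes, error only at the equator.
lats = np.array([-60.0, 0.0, 60.0])
truth = np.zeros((3, 4))
forecast = np.zeros((3, 4))
forecast[1, :] = 2.0

weighted_rmse = latitude_weighted_rmse(forecast, truth, lats)
unweighted_rmse = float(np.sqrt(((forecast - truth) ** 2).mean()))
print(weighted_rmse, unweighted_rmse)
```

The weighting matters because grid cells shrink toward the poles. Here the equatorial error counts for more than it would in a plain average over grid cells.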

StationBench: Live, Transparent Evaluation

EPT-2 is evaluated on open-source StationBench against more than 10,000 real ground stations, with no post-processing or station fine-tuning. This approach compares model output directly to what instruments measured, instead of relying on gridded reanalysis that can hide errors through spatial smoothing.

EPT-2 outperforms ECMWF HRES at every lead time across the full 0–240 hour range on 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation. On the Jua platform, prospects select their own region, variable, and time window, then receive a head-to-head result in under 5 minutes. The evaluation surface lets the numbers speak for themselves.
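A station-level comparison of this kind reduces to scoring each model against the same observations at each lead time. The sketch below shows that idea with plain NumPy. It is not the StationBench API. The array layout, function name, and toy numbers are assumptions made for illustration.

```python
import numpy as np

def rmse_by_lead_time(forecast, observed):
    """RMSE per lead time for arrays shaped (station, lead_time).

    NaN in `observed` marks a missing station report and is ignored.
    """
    sq_err = (forecast - observed) ** 2
    return np.sqrt(np.nanmean(sq_err, axis=0))

# Toy data: two stations, two lead times, one missing observation.
obs = np.array([[5.0, 6.0],
                [3.0, np.nan]])
model_a = np.array([[6.0, 6.0],
                    [4.0, 9.0]])
model_b = np.array([[5.0, 8.0],
                    [6.0, 9.0]])

rmse_a = rmse_by_lead_time(model_a, obs)
rmse_b = rmse_by_lead_time(model_b, obs)
a_wins = rmse_a < rmse_b   # True where model A has the lower error
print(rmse_a, rmse_b, a_wins)
```

Scoring against raw station values, with missing reports masked rather than interpolated, keeps the comparison tied to what instruments actually recorded.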

Compute Cost as a Constraint on Forecasting

A single traditional NWP simulation consumes about 8,400 kWh of compute and costs €1,000–€20,000 on HPC infrastructure. Traditional NWP systems need several hours on supercomputers with hundreds of nodes to produce a single 10-day global forecast. These economics cap update frequency at two to four runs per day and restrict who can afford regional models, deep ensembles, or scenario analyses.

GPU-Efficient EPT-2 Inference and Ensembles

A single EPT-2 inference runs on one GPU in minutes at roughly 0.25 kWh and $0.20–$15, which is about four orders of magnitude cheaper than traditional NWP. This efficiency starts at training. EPT-2 was trained on 8 × H100 GPUs over 10 days, while Microsoft Aurora used 32 × A100 GPUs over 18 days, so EPT-2 required far fewer GPUs and a shorter training cycle.

Lower training and inference cost makes frequent runs and deep ensembles practical. EPT-2e delivers 30 ensemble members that beat the ECMWF ENS mean on RMSE and CRPS at virtually every lead time, without requiring HPC infrastructure. That capability turns ensemble forecasting into a standard tool for trading desks rather than a luxury.
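CRPS is the less familiar of the two scores, so a minimal sketch may help. It uses the standard empirical estimator for a finite ensemble, CRPS = E|X - y| - 0.5 E|X - X'|, for a single observation. The function name and toy values are illustrative, and this is not necessarily the exact estimator used in the EPT-2 evaluation.

```python
import numpy as np

def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble for one observation (lower is better)."""
    members = np.asarray(members, dtype=float)
    # Average distance between each member and the observation.
    term_obs = np.mean(np.abs(members - observation))
    # Average distance between all pairs of members (the ensemble spread).
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return float(term_obs - 0.5 * term_spread)

crps_three = ensemble_crps([1.0, 2.0, 3.0], 2.0)
crps_single = ensemble_crps([2.0], 3.0)   # one member: CRPS equals absolute error
print(crps_three, crps_single)
```

CRPS rewards ensembles that are both close to the observation and honestly spread. A tight ensemble far from the truth scores worse than a wider one that brackets it.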

From Raw Model Output to Trading-Ready Insight

Quant teams that license AI weather research outputs usually receive raw model files. They then build ingestion pipelines, ensemble logic, benchmarking harnesses, and hindcast access on their own. That engineering work consumes capacity that could focus on alpha research.

Operational AI models currently produce output every 6 hours, which limits usefulness for fast-moving systems such as sharp fronts, and they rarely ship with a productized analyst layer. Teams end up reinventing similar tooling across firms.

Athena Briefings, Alerts, and Backtests

Athena converts natural-language objectives into concrete deliverables. Day-Ahead and Intraday briefings auto-refresh on every new model run and summarize model consensus across 25+ models, changes since the previous run, convergence tracking, market spread, and price implications.

Divergence alerts trigger when models disagree on a key variable, and correction alerts trigger when a model revises its own output. The Python SDK installs via pip install jua, and the REST API exposes all models through a single schema with Apache Arrow support for large payloads. Hindcast data across Jua and third-party models supports robust backtesting. Integrations that often take a quarter elsewhere typically stand up in days.
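The two alert types can be sketched in a few lines. This is a hypothetical illustration of the logic described above, not Athena's implementation. The function names, the use of standard deviation as the spread measure, and the thresholds are all assumptions.

```python
from statistics import pstdev

def divergence_alert(forecasts_by_model, threshold):
    """Flag when the spread across models exceeds a threshold."""
    spread = pstdev(list(forecasts_by_model.values()))
    return spread > threshold, spread

def correction_alert(previous_run, latest_run, threshold):
    """Flag when one model revises its own forecast by more than a threshold."""
    revision = latest_run - previous_run
    return abs(revision) > threshold, revision

# Toy 100 m wind forecasts (m/s) for one location and valid time.
wind_100m = {"ECMWF HRES": 8.0, "GFS": 9.0, "EPT-2": 13.0}
diverged, spread = divergence_alert(wind_100m, threshold=2.0)

# The same model, previous run versus latest run.
corrected, revision = correction_alert(previous_run=8.0, latest_run=10.5, threshold=2.0)
print(diverged, round(spread, 2), corrected, revision)
```

In practice the thresholds would depend on the variable and on the desk's risk tolerance.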

Why Traders Question AI Reliability

Physics models that hallucinate or violate conservation laws are unsafe for trading. Deeper analyses of AI NWP predictions reveal systematic artifacts and instabilities, including violations of physics conservation and underdispersive ensembles. Unconstrained baseline AI weather models show significant drift and variation over time in conservation metrics, while physics-constrained models maintain nearly constant conservation.
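One simple way to look for this kind of drift is to track a quantity that should stay nearly constant over a forecast. The sketch below uses the area-weighted global mean of surface pressure as a proxy for total atmospheric mass. The choice of metric, the function names, and the toy data are assumptions, not the method used in the analyses cited above.

```python
import numpy as np

def global_mean(field, lats_deg):
    """Area-weighted global mean of a (time, lat, lon) field."""
    weights = np.cos(np.deg2rad(lats_deg))
    zonal_mean = field.mean(axis=2)                        # shape (time, lat)
    return (zonal_mean * weights).sum(axis=1) / weights.sum()

def relative_drift(field, lats_deg):
    """Relative change of the global mean versus the first time step."""
    series = global_mean(field, lats_deg)
    return (series - series[0]) / series[0]

# Toy surface pressure (hPa): constant for two steps, then 1% too high.
lats = np.array([-45.0, 0.0, 45.0])
pressure = np.full((3, 3, 2), 1000.0)
pressure[2] = 1010.0

drift = relative_drift(pressure, lats)
print(drift)
```

A model that conserves mass keeps this series close to zero at every lead time. A steadily growing value is the kind of drift described above.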

This pattern justifies skepticism. AI weather models that claim accuracy without peer-reviewed evidence and physics-grounded architecture warrant close scrutiny before traders rely on them.

Physics-Constrained EPT Architecture

EPT is a spatiotemporal transformer foundation model trained on observational physics. It learns governing laws such as mass, momentum, and energy conservation directly from data, in a latent representation that evolves forward in time. Outputs remain physically constrained by design.

An LLM operates unconstrained on the symbolic surface, while a physics model constrains behavior at the representation level. In mature domains like weather forecasting, embedding physical constraints turns errors into expected, bounded deviations that scientific workflows can absorb, instead of undetected false beliefs. Physics-constrained models begin to outperform unconstrained baselines on state variables such as temperature, geopotential height, and mean sea level pressure after about 96 hours of lead time, with larger gains in mid-latitudes and near the poles. EPT-2’s results appear in the EPT-2 technical report (arXiv 2507.09703).

Extreme Events and Physics Constraints

A 2024 Science Advances study found that AI models systematically underestimate both the frequency and intensity of record-breaking events. They underpredict hot records, overpredict cold records, and show forecast errors that grow nearly linearly with greater record exceedance. Researchers at the University of Geneva and Karlsruhe Institute of Technology reported that AI weather models struggle with events outside their training data, because they learn from historical patterns that do not cover unprecedented extremes.

Physics constraints matter most in these regimes. In extratropical cyclone case studies, physics-constrained models produce stronger temperature gradients, more accurate low placement, and better conversion of potential to kinetic energy in the vertical structure than both the IFS and unconstrained AI baselines. EPT’s architecture learns conservation laws from observational data rather than interpolating from pattern libraries, which matters when the event has no close historical analog. A 2026 Rice University study found that AI models deviated from gradient wind balance near storm centers and systematically overestimated inner-core size in stronger cyclones. Physical consistency in wind fields remains a key differentiator that EPT addresses by construction.
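Gradient wind balance can be checked directly from a forecast's wind and pressure fields. For cyclonic flow in the Northern Hemisphere, the balance is V²/r + fV = (1/ρ) ∂p/∂r. The sketch below solves it for V and measures how far a forecast wind departs from it. The constant air density, the function names, and the toy values are assumptions for illustration, and this is not the diagnostic used in the study.

```python
import math

OMEGA = 7.2921e-5  # Earth's rotation rate in rad/s

def coriolis(lat_deg):
    """Coriolis parameter f at a given latitude."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))

def gradient_wind_speed(r_m, dp_dr, lat_deg, rho=1.15):
    """Balanced tangential wind (m/s) at radius r_m for a Northern Hemisphere cyclone.

    dp_dr is the radial pressure gradient in Pa/m; rho is an assumed air density.
    """
    half_fr = 0.5 * coriolis(lat_deg) * r_m
    return -half_fr + math.sqrt(half_fr ** 2 + r_m * dp_dr / rho)

def balance_residual(v, r_m, dp_dr, lat_deg, rho=1.15):
    """Left side minus right side of the balance; zero means balanced flow."""
    return v ** 2 / r_m + coriolis(lat_deg) * v - dp_dr / rho

# Toy case: 50 km from the center, 20 hPa per 100 km gradient, 20 degrees north.
v_balanced = gradient_wind_speed(r_m=50_000.0, dp_dr=0.02, lat_deg=20.0)
residual_balanced = balance_residual(v_balanced, 50_000.0, 0.02, 20.0)
residual_weak = balance_residual(0.8 * v_balanced, 50_000.0, 0.02, 20.0)
print(v_balanced, residual_balanced, residual_weak)
```

A nonzero residual near the storm center means the forecast wind and pressure fields are not consistent with each other.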

Running Jua Alongside ECMWF

Jua for Energy complements ECMWF rather than replacing it. Serious customers keep their ECMWF subscription and run Jua for Energy alongside it. ECMWF AIFS, ECMWF’s own AI model, runs on the Jua platform in the same workspace as EPT.

Volue positions its AI-Weather Rapid Updates, powered by Jua, as a complementary second signal to ECMWF. Traders use this setup to spot divergences in volatile periods and to identify under- or overproduction risks earlier. Jua replaces the plumbing around the incumbent feed: in-house GRIB pipelines, manual benchmarking, morning-briefing analysts, and dashboard stitching. The 7–9 a.m. manual prep routine compresses into a single workspace, refreshed on the cadence of the underlying physics, where ECMWF, GFS, AIFS, Aurora, and EPT share one screen, schema, and API.

Side-by-Side Comparison: Jua, Legacy NWP, and Research AI

Capability	Jua for Energy (EPT family + Athena)	Legacy NWP (ECMWF HRES / ENS)	Research-Grade AI (Aurora / GraphCast)
Deterministic accuracy (0–240 h, 10 m wind, 100 m wind, 2 m temp, SSRD)	EPT-2 beats ECMWF HRES on every lead time and on all four main energy-relevant variables	The 40-year benchmark; gold standard for deterministic NWP	Aurora loses to EPT-2 on 10 m and 100 m wind across full range; Aurora has no SSRD output
Ensemble (probabilistic) forecasting	EPT-2e: 30 members, beats ECMWF ENS mean on RMSE and CRPS at virtually every lead time	ENS: 50 members, gold standard for probabilistic NWP	No productized ensemble equivalent
Update frequency	Up to 24×/day (EPT-2 RR); EPT-2 4×/day; 15-min refresh for actual generation	2–4×/day; 6–12 hour lag	Typically 4×/day research cadence; no productized operational schedule
Inference cost and energy	~0.25 kWh, ~$0.20–$15 per simulation, minutes on a single GPU	~8,400 kWh, €1,000–€20,000, 1–2 hours on HPC	Similar order of magnitude to Jua for inference; EPT-2 is ~25% faster than Aurora
Productized agent (natural-language analyst)	Athena: briefings, benchmarks, backtests, widget generation (~90 s per query)	None	None
Transparent benchmarking surface	25+ models on one platform; any region, any variable; results in under 5 minutes	Available to members; no productized cross-vendor benchmarking	No productized benchmarking surface
API / SDK	REST + Apache Arrow; `pip install jua` on PyPI; single schema across all 25+ models	GRIB files via MARS; member access	Research code / limited API; no unified schema

Verify these comparisons on your own data by running a head-to-head benchmark at athena.jua.ai.

Frequently Asked Questions

What is a physics-constrained AI weather model?

A physics-constrained AI weather model is a foundation model trained to learn conservation laws of the atmosphere, such as mass, momentum, and energy, directly from observational data. It does not simply fit statistical patterns to historical records without physical grounding. EPT is a spatiotemporal transformer that integrates a latent physical representation forward in time, so the architecture cannot produce outputs that violate conservation laws in the way a generic transformer might.

Unconstrained AI weather models show drift and instability in conservation metrics over time, while physics-constrained models maintain consistency by design. For energy trading, this means physics-constrained outputs avoid thermodynamically impossible wind ramps or temperature inversions that look plausible numerically but lack any physical basis.

How is EPT-2 accuracy evaluated, and what does StationBench measure?

EPT-2 is evaluated on open-source StationBench against more than 10,000 real ground stations, with no post-processing or station fine-tuning applied. StationBench measures RMSE and CRPS at the station level, comparing model output directly to what instruments recorded. This approach is stricter than gridded reanalysis comparisons, which can mask errors through spatial smoothing.

EPT-2 and EPT-2e performance against ECMWF HRES and ENS across 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation over 0–240 hours is documented in the EPT-2 technical report (arXiv 2507.09703). StationBench provides the transparent, reproducible framework behind those results.

What are the integration requirements for connecting Jua for Energy to existing systems?

Integration uses standard tools and requires no proprietary infrastructure. The Python SDK installs via pip install jua from PyPI and provides forecast access, hindcast and backtesting, and weather-parameter standardization across all models. The REST API exposes more than 25 models through a single schema, with Apache Arrow support for large payloads suitable for continental, multi-variable, multi-model backtests.

Hindcast data is available across multiple Jua and third-party models. A direct ENTSO-E integration supplies European grid data, including actual generation, capacity, and PSR classifications. API documentation lives at query.jua.ai/docs, and the developer dashboard is at developer.jua.ai. Most quant teams complete integrations in days rather than quarters.

Does Jua for Energy replace an ECMWF subscription?

Jua for Energy is designed to run alongside ECMWF, not replace it. ECMWF HRES and ENS remain first-class models on the Jua platform in the same workspace as EPT, and ECMWF AIFS also runs there. Jua replaces the surrounding plumbing: in-house GRIB pipelines, manual benchmarking, morning-briefing analysts, and dashboard stitching across vendors.

The 7–9 a.m. manual prep routine becomes a single workspace where every model shares one schema and one API, refreshed up to 24 times per day. Traders keep their trusted ECMWF signals and gain faster, richer context from EPT and Athena.

Which teams use Jua for Energy, and what does each team gain?

Three roles engage with the platform in most customer organizations:

Meteorologists use the live benchmarking surface with 25+ models to evaluate quality rigorously and brief the trading desk. EPT-2 and EPT-2e are documented in technical reports on arXiv (2507.09703 and 2410.15076), which gives meteorologists the evidence they need to champion the platform internally.
Quant developers use the Python SDK and REST API to feed Jua forecasts and hindcasts into systematic trading models. The single schema across models, Apache Arrow support, and broad hindcast availability are the features that win quant teams.
Traders use Day-Ahead and Intraday briefings, divergence and correction alerts, and Athena’s natural-language layer to act before the market. A 1 GW wind portfolio that gains four percentage points of forecast accuracy saves about €1.5 million per year, and a 1 GW solar portfolio at the same gain saves roughly €3 million per year.
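The savings figures in the last item imply a value per percentage point of accuracy, which makes it easier to size a different portfolio. The sketch below does that division and then scales linearly with capacity and accuracy gain. The linear scaling is an assumption made here for illustration, not a figure from Jua, and the function name is hypothetical.

```python
# Figures quoted above: annual savings for a 1 GW portfolio at +4 percentage points.
WIND_SAVINGS_EUR = 1_500_000
SOLAR_SAVINGS_EUR = 3_000_000
GAIN_POINTS = 4

# Implied annual value of one percentage point per GW.
wind_per_point = WIND_SAVINGS_EUR / GAIN_POINTS
solar_per_point = SOLAR_SAVINGS_EUR / GAIN_POINTS

def estimated_savings(per_point_eur, capacity_gw, gain_points):
    """Rough estimate assuming savings scale linearly (an assumption)."""
    return per_point_eur * capacity_gw * gain_points

print(f"wind: EUR {wind_per_point:,.0f} per point per GW")
print(f"solar: EUR {solar_per_point:,.0f} per point per GW")
print(f"2 GW wind at +2 points: EUR {estimated_savings(wind_per_point, 2, 2):,.0f}")
```

Real savings depend on market prices and imbalance rules, so the output is only a first-order estimate.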

Conclusion: How to Judge AI Weather for Energy Trading

The core issue is simple. Stale or uncertain weather signals, assembled from fragmented workflows and unvalidated models, cost energy traders and utilities millions in missed windows and imbalance costs. Weather is now the single biggest unpriced variable in energy markets, and traders who forecast it better win.

The six problems discussed above map to three evaluation dimensions: technical capability, operational integration, and trust. Technical capability covers update frequency and compute economics. Operational integration covers workflow consolidation and transparent benchmarking. Trust covers validation methodology and physics grounding.

Validation methodology. Accuracy claims should rest on peer-reviewed benchmarks against real ground-station observations, with no post-processing. EPT-2 meets this bar on StationBench, with results published at arXiv 2507.09703.
Physics grounding. The model architecture should enforce conservation laws rather than fit patterns that can break under out-of-distribution conditions.
Update frequency. The platform should refresh at intraday cadence that matches how energy markets trade, not at twice-daily intervals.
Workflow integration. A single schema, a single API, and an agent that turns natural-language objectives into deliverables should replace fragmented stacks.
Transparent benchmarking. Prospects should be able to run head-to-head comparisons on their own region and variable against current providers in minutes.
Compute economics. Inference cost must support the update frequency and ensemble depth the market requires, without HPC infrastructure.

Jua serves major utilities across four continents, including some of Europe’s largest energy companies, as well as commodity traders and hedge funds. Axpo, TotalEnergies, Statkraft, EnBW, EDF, and Hydro-Québec are among them. In most cases, the decisive moment is the live benchmark, when the prospect runs a comparison on their own region and variable and the conversation shifts from “is this real?” to “how fast can we procure?”

Jua is a foundation model and agent company, and Jua for Energy is the first applied product. The architecture learns physics, and the domain remains a variable.

Run benchmarks on your own region and variables on the Jua platform. See your forecasts in less than 5 minutes, head-to-head against 25+ models, at athena.jua.ai.
