Written by: Olivier Lam, Physical AI Team, Jua.ai AG
Key Takeaways for European Power Desks
- European day-ahead forecast accuracy in 2026 hinges on low MAE values and balancing MAPE of 8–15%. A 4–8% CRPS reduction versus the ECMWF ENS baseline is a meaningful improvement.
- Renewables now supply nearly half of EU electricity, which amplifies price volatility and makes traditional NWP updates at 2–4 runs per day structurally stale for intraday and day-ahead trading.
- Physics-based foundation models like Jua’s EPT-2 deliver physically constrained forecasts at single-GPU cost (~0.25 kWh). This cost profile enables up to 24 daily refreshes and materially lower inference expense.
- Jua’s unified platform surfaces more than 25 models through one API and pairs them with the Athena agent. Together they convert raw output into briefings, benchmarks, and backtests in about 90 seconds.
- Traders evaluating new solutions should run live benchmarks on their highest-stakes regions. Schedule a Jua benchmark on your own data to test EPT-2 accuracy and workflow speed.
The Accuracy Challenge Facing European Power Traders
European energy market forecast accuracy has deteriorated measurably since 2023. Renewables accounted for 47.5% of gross EU electricity consumption in 2024, up 2.1 percentage points from 2023. That higher share compounds the forecasting problem. Wind and solar generation behave as continuous, multi-scale physical processes, and their variability propagates directly into price uncertainty. Median imbalance prices have increased across European regions while the daily price spread has widened strongly over the last four years, with extreme prices occurring more frequently due to higher renewable penetration.
Dunkelflaute events, which are prolonged periods of low wind and low solar irradiance, concentrate this risk. A single multi-day Dunkelflaute across Germany, France, and the Benelux can push day-ahead prices above 300 €/MWh while simultaneously degrading the accuracy of every NWP-derived generation forecast in the affected zone. Cross-border market coupling amplifies the effect. After the introduction of PICASSO, the European platform for exchanging automatic frequency restoration reserves, the day-ahead-to-imbalance spread narrowed in Belgium and the Netherlands, while France showed widening spreads into mid-2025, driven by reduced nuclear availability and higher intraday volatility. Coupling that narrows the spread in one zone can widen it in another, which leaves any forecast stack calibrated on pre-coupling data structurally stale.
For utilities, traders, and quant teams, the consequence stays the same. The numbers they trade on are wrong more often, and the cost of being wrong is higher. That reality raises an immediate need to define what “accurate enough” means in 2026 conditions.
Defining Good Forecast Accuracy in 2026 European Markets
In 2026, the benchmark for day-ahead electricity price forecasting in liquid European hubs centers on low MAE values under normal volatility conditions. Balancing market MAPE typically runs between 8% and 15% and widens sharply during renewable ramp events. For probabilistic forecasts, CRPS has become the preferred skill metric. A reduction of 4–8% relative to the ECMWF ENS baseline counts as material for quant teams and meteorologists evaluating new models.
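To make these thresholds concrete, the following is a minimal Python sketch of MAE, MAPE, and the relative reduction of a score against a baseline. All sample prices and CRPS values are invented for illustration and are not Jua or ECMWF results.

```python
def mae(actual, forecast):
    """Mean absolute error, in the units of the series (for example EUR/MWh)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Undefined if any actual is zero."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def relative_reduction(baseline_score, candidate_score):
    """Percentage reduction of a lower-is-better score relative to a baseline."""
    return 100.0 * (baseline_score - candidate_score) / baseline_score


# Illustrative numbers only (assumed, not measured).
actual_prices = [80.0, 100.0, 120.0, 100.0]
forecast_prices = [72.0, 110.0, 108.0, 110.0]

price_mae = mae(actual_prices, forecast_prices)
price_mape = mape(actual_prices, forecast_prices)

# Hypothetical CRPS values for a baseline ensemble and a candidate model.
crps_gain = relative_reduction(baseline_score=2.50, candidate_score=2.35)

print(f"MAE {price_mae:.1f} EUR/MWh, MAPE {price_mape:.1f}%, CRPS reduction {crps_gain:.1f}%")
```

MAPE divides by the actual value, so it becomes unstable when prices approach zero or turn negative. That is one reason to report it alongside MAE, not on its own.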
Most researchers in European imbalance price forecasting now adopt probabilistic rather than deterministic methods because probabilistic forecasts quantify uncertainty, which is especially valuable under the high-volatility conditions created by variable renewables. A deterministic point forecast that is accurate on average can still produce catastrophic P&L outcomes on tail events such as Dunkelflaute price spikes, wind ramps, and nuclear outages that define a year’s trading result. CRPS captures both calibration and sharpness in a single number, while MAE and MAPE do not.
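The difference between a point forecast and an ensemble shows up directly in the CRPS arithmetic. The sketch below uses the standard empirical CRPS estimator for a finite ensemble; the member values are invented for illustration.

```python
def ensemble_crps(members, observation):
    """Empirical CRPS of a finite ensemble against one observation.

    CRPS = mean|x_i - y| - (1 / (2 m^2)) * sum_i sum_j |x_i - x_j|
    The first term rewards accuracy. The second term subtracts half the mean
    pairwise distance between members, so CRPS credits ensemble spread.
    """
    m = len(members)
    accuracy_term = sum(abs(x - observation) for x in members) / m
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return accuracy_term - spread_term


observed = 10.0

# A single-member "ensemble" reduces CRPS to the absolute error.
point_forecast = ensemble_crps([12.0], observed)

# A sharp ensemble centred on the outcome scores best.
sharp_ensemble = ensemble_crps([8.0, 12.0], observed)

# A very wide ensemble is penalised for lack of sharpness.
wide_ensemble = ensemble_crps([0.0, 20.0], observed)

print(point_forecast, sharp_ensemble, wide_ensemble)
```

In this toy case the sharp ensemble scores 1.0, the point forecast 2.0, and the wide ensemble 5.0. Lower is better, and the score is in the same units as the forecast variable.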
Physics-Based Models Plus Agents as the New Forecasting Stack
Physics-based foundation models paired with an AI agent now form the structural response to deteriorating forecast accuracy. This category did not exist at scale three years ago. It now defines a distinct approach to weather and power forecasting.
The category has three defining traits. First, the underlying model learns the governing physics of the atmosphere, including mass, momentum, and energy conservation, directly from observational data. This learning produces outputs that are physically constrained by construction rather than statistically plausible by coincidence. Second, the model runs at inference costs that permit update frequencies that traditional numerical weather prediction cannot match. Third, an agent layer translates raw forecast output into briefings, benchmarks, backtests, and alerts without requiring a human analyst in the loop.
Jua operates as a foundation model and agent company. Its EPT (Earth Physics Transformer) family is a general spatiotemporal transformer foundation model for physical systems. The architecture is domain-agnostic, while the current product is fine-tuned for atmospheric prediction. Athena is Jua’s AI agent, currently instrumented with the Jua for Energy tool surface. Jua for Energy is the first applied product built on EPT and Athena and is used by Axpo, TotalEnergies, Statkraft, EnBW, EDF, and Hydro-Québec, among others. The relationship mirrors the way Anthropic relates to Claude Code, with a horizontal AI platform and a flagship vertical product.
Five-Minute Day-Ahead Workflow on a Trading Desk
At 5:45 a.m., before the day-ahead auction opens, a power trader at a mid-size European utility opens the Jua platform. The overnight EPT-2 run has already completed and is disseminated approximately 2.5 hours ahead of competing operational runs at the same cycle. The Day-Ahead briefing is live and shows model consensus across more than 25 models, model delta since the previous run, convergence tracking as lead time shortens, and price implications for the markets the desk trades.
The trader sees that EPT-2e and ECMWF ENS have diverged by 1.8 GW on German wind generation for the 18:00–22:00 window. A divergence alert fired at 04:12. The trader asks Athena, “What is the 100 m wind forecast spread across models for northern Germany tonight?” The answer, a ranked comparison with the underlying widget, resolves in approximately 90 seconds. The desk positions before the auction. The 7–9 a.m. manual prep routine disappears.
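A divergence check of this kind can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the threshold, and the hourly values are hypothetical and do not describe how Jua's alerts are implemented.

```python
from dataclasses import dataclass


@dataclass
class Divergence:
    hour: int       # delivery hour, local time
    gap_gw: float   # absolute difference between the two forecasts, in GW


def find_divergences(model_a, model_b, threshold_gw):
    """Return the hours where two generation forecasts differ by at least threshold_gw.

    model_a and model_b map delivery hour -> forecast wind generation in GW.
    Only hours present in both forecasts are compared.
    """
    shared_hours = sorted(set(model_a) & set(model_b))
    hits = []
    for hour in shared_hours:
        gap = abs(model_a[hour] - model_b[hour])
        if gap >= threshold_gw:
            hits.append(Divergence(hour=hour, gap_gw=round(gap, 3)))
    return hits


# Hypothetical German wind forecasts for the 18:00-22:00 window, in GW.
forecast_a = {18: 21.0, 19: 20.5, 20: 19.0, 21: 18.5, 22: 18.0}
forecast_b = {18: 20.6, 19: 19.5, 20: 17.2, 21: 17.0, 22: 17.5}

alerts = find_divergences(forecast_a, forecast_b, threshold_gw=1.5)
largest = max(alerts, key=lambda d: d.gap_gw)
print(f"{len(alerts)} hours flagged, largest gap {largest.gap_gw} GW at {largest.hour}:00")
```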
See this 90-second workflow on your own trading region
Pain Point 1: Infrequent Model Updates and the Case for Rapid Refresh
A single traditional NWP simulation consumes approximately 8,400 kWh and costs €1,000–€20,000 to run on HPC infrastructure. That cost profile caps update frequency at two to four runs per day and has constrained the energy industry for forty years. Between runs, traders stare at stale numbers. In Europe’s weather-driven energy markets, traders are turning to AI and machine-learning tools designed not to predict temperatures and precipitation, but to forecast the forecast, which signals how stale the underlying signal has become.
EPT-2 RR, Jua’s rapid-refresh model variant, updates up to 24 times per day. A single EPT-2 inference runs on a single GPU in minutes at approximately 0.25 kWh and $0.20–$15, which is roughly four orders of magnitude cheaper than the NWP equivalent. That low inference cost enables the 24-daily-refresh cadence. The trade-off is that rapid-refresh runs use shorter assimilation windows, so they are most valuable for intraday positioning. Multi-day strategic planning still relies primarily on the full EPT-2 and EPT-2e runs.
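The arithmetic behind that comparison is easy to check. The sketch below uses only the figures quoted above and, as a simplifying assumption, compares the euro and dollar cost ranges at parity.

```python
import math

# Figures quoted in the text.
NWP_ENERGY_KWH = 8400.0
EPT2_ENERGY_KWH = 0.25
NWP_COST_RANGE = (1000.0, 20000.0)   # EUR per simulation
EPT2_COST_RANGE = (0.20, 15.0)       # USD per inference

# Energy ratio and its order of magnitude.
energy_ratio = NWP_ENERGY_KWH / EPT2_ENERGY_KWH
energy_orders = math.log10(energy_ratio)

# Cost ratio bounds, treating EUR and USD at parity (an assumption).
cost_ratio_low = NWP_COST_RANGE[0] / EPT2_COST_RANGE[1]
cost_ratio_high = NWP_COST_RANGE[1] / EPT2_COST_RANGE[0]

# Daily energy budget: 24 rapid-refresh runs versus 4 NWP runs.
daily_energy_rr = 24 * EPT2_ENERGY_KWH
daily_energy_nwp = 4 * NWP_ENERGY_KWH

print(f"energy ratio {energy_ratio:,.0f}x (~10^{energy_orders:.1f})")
print(f"cost ratio between {cost_ratio_low:,.0f}x and {cost_ratio_high:,.0f}x")
print(f"daily energy: {daily_energy_rr} kWh vs {daily_energy_nwp:,.0f} kWh")
```

On energy the gap is about 33,600×. On cost the ratio depends on which ends of the two ranges are compared, so the orders-of-magnitude figure is best read as an upper-range estimate.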
Pain Point 2: Fragmented Tool Stacks and a Unified Schema Across 25+ Models
The standard energy-desk workflow assembles ECMWF grib files, GFS outputs, internal meteorology briefings, vendor dashboards, and a desk group chat into a view of the day that is outdated before completion. Each data source uses a different schema, a different update cadence, and a different access method. One or two engineers absorb the integration cost, which diverts time away from alpha research.
The Jua platform exposes more than 25 models through a single REST API with Apache Arrow support for large payloads. These models include 10 proprietary AI models from the EPT family plus 15 third-party NWP and AI models such as ECMWF HRES, ECMWF ENS, ECMWF AIFS, NOAA GFS, DWD ICON, Microsoft Aurora, and GFS GraphCast. The Python SDK installs with pip install jua. An integration that takes a quant team a quarter to build elsewhere stands up in days. The trade-off is that unified schema normalisation involves opinionated variable naming, so teams with highly customised internal ontologies may require a short mapping step.
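The mapping step mentioned above is usually a thin translation layer. The following Python sketch shows the idea; every variable name on both sides of the mapping is hypothetical and does not reflect Jua's actual schema or any particular internal ontology.

```python
# Hypothetical names on both sides: neither column is taken from a real schema.
UNIFIED_TO_INTERNAL = {
    "wind_speed_100m": "ws100",
    "air_temperature_2m": "t2m",
    "surface_solar_radiation": "ssrd",
}


def to_internal(record, mapping=UNIFIED_TO_INTERNAL):
    """Rename the keys of one forecast record from unified names to internal names.

    Raises KeyError on unmapped variables so that schema drift is caught
    at the boundary and not deep inside a trading model.
    """
    unknown = sorted(key for key in record if key not in mapping)
    if unknown:
        raise KeyError(f"no internal name for: {', '.join(unknown)}")
    return {mapping[key]: value for key, value in record.items()}


unified_record = {"wind_speed_100m": 7.5, "air_temperature_2m": 3.2}
internal_record = to_internal(unified_record)
print(internal_record)
```

Failing loudly on unknown names is a design choice. A silent pass-through would be more forgiving but would let a renamed variable reach downstream models unnoticed.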
Pain Point 3: Benchmarking Quality with a Live Five-Minute Comparison
Meteorologists evaluating AI weather models are often asked to accept vendor-provided accuracy graphics instead of running benchmarks themselves. ECMWF’s two-week outlook is the definitive reference point for traders repricing risk around heating demand, renewable output, and system tightness, so any credible accuracy claim must anchor to a head-to-head comparison against ECMWF, not against a vendor-selected baseline.
The Jua platform’s benchmarking surface puts all 25+ models on a single screen. A meteorologist selects any region, any variable, and any time window, and the platform returns a head-to-head accuracy comparison in under 30 seconds. EPT-2 is documented in a technical report at arXiv:2507.09703, and EPT-1.5 in one at arXiv:2410.15076. EPT-2 outperforms ECMWF HRES at every lead time across the full 0–240 hour range on all four benchmarked variables: 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation. EPT-2e beats the 50-member ECMWF ENS mean on both RMSE and CRPS at virtually every lead time.
Run a live head-to-head benchmark on your highest-stakes region
Pain Point 4: High Compute Cost and Single-GPU Inference
The compute ceiling on NWP arises from physics rather than incremental engineering. Solving differential equations across a three-dimensional planetary grid at 9 km resolution requires HPC infrastructure that costs €1,000–€20,000 per simulation run. EPT-2 was trained on 8 × H100 GPUs over 10 days, while Microsoft Aurora required 32 × A100 GPUs over 18 days.
At inference, EPT-2 runs on a single GPU in minutes at the cost profile described earlier, approximately 0.25 kWh and $0.20–$15 per run. This asymmetry, which reaches roughly four orders of magnitude versus NWP, makes 24 daily refreshes economically viable where traditional providers remain capped at two to four. The trade-off is that GPU-based inference requires stable cloud infrastructure. Jua runs on Nebius AI Cloud for high-throughput cluster stability.
Pain Point 5: From Raw Output to Athena Agent Briefings
Raw model output such as grib files, NetCDF arrays, and API JSON payloads does not equal a trading decision. Converting that output into a decision requires a meteorologist to downscale and interpret, an analyst to write the briefing, and a trader to synthesise the result. That chain takes hours and often breaks under time pressure. Jua’s agentic intelligence layer Athena turns raw physics predictions from EPT-2 into trading decisions by reading market context and modeling participant behavior.
A natural-language question such as “backtest a wind-ramp strategy on EPT-2e over the last two winters” resolves to a full backtest report in approximately 5 minutes. A briefing query resolves in approximately 90 seconds. Trading houses and quant desks describe Athena as “another headcount, for free.” The trade-off is that Athena’s outputs remain analytical rather than advisory, so trading and dispatch decisions stay with the customer.
Watch Athena answer a live forecast question on your portfolio
Comparing Jua, ECMWF, and AI Peers on Key Capabilities
| Capability | Jua for Energy (EPT family + Athena) | ECMWF HRES / ENS (NWP incumbent) | Aurora / GraphCast (AI peer) |
|---|---|---|---|
| Deterministic accuracy vs HRES (0–240 h, 10 m wind, 100 m wind, 2 m temp, SSRD) | EPT-2 beats HRES on every lead time and all four variables | The 40-year benchmark itself | Aurora loses to EPT-2 on 10 m and 100 m wind across full range, with no SSRD output |
| Probabilistic ensemble | EPT-2e beats ECMWF ENS mean on RMSE and CRPS at virtually every lead time | ENS: 50-member gold standard for probabilistic NWP | No productised ensemble equivalent |
| Update frequency | Up to 24×/day (EPT-2 RR), 4×/day (EPT-2e), 15-min actual generation | 2–4×/day | Typically 4×/day research, no productised operational schedule |
| Inference cost per simulation | Single-GPU inference at the low-cost profile described earlier | ~8,400 kWh, €1,000–€20,000 on HPC | Similar order of magnitude to Jua for inference |
| Agent / natural-language analyst | Athena for briefings, benchmarks, backtests, widgets, about 90 seconds per query | None | None |
| Cross-model benchmarking | 25+ models, any region and variable, result in under 30 seconds | Available to members, no productised cross-vendor surface | No productised benchmarking surface |
Jua for Energy does not replace ECMWF. It displaces the plumbing around ECMWF. ECMWF AIFS runs on the Jua platform as a guest model. Serious customers keep their ECMWF subscription and run Jua for Energy alongside it.
View this comparison table against your current provider in a live session
Risks and Due Diligence for Physics Models Plus Agents
Four risks apply to any physics-foundation-model-plus-agent deployment in European energy markets.
Data-quality dependence. EPT-2 relies on a multi-petabyte training corpus drawn from more than 120 distinct sources, including proprietary station coverage across more than 10,000 stations. Forecast quality in data-sparse regions such as offshore zones and high-altitude terrain degrades relative to data-dense continental areas. Teams should evaluate accuracy on their specific geography rather than on global averages.
Integration complexity. Unified schema normalisation across more than 25 models involves opinionated variable naming. Teams with highly customised internal ontologies should budget a short mapping step. The REST API and Python SDK are documented at docs.jua.ai, and the Apache Arrow payload format handles large continental backtests without choking.
Validation standards. European imbalance price forecasting faces rapidly increasing volatility because variable renewable generation and unexpected events can quickly switch the sign of system imbalance. Any model evaluation should include tail-event performance such as Dunkelflaute periods, wind ramps, and nuclear outages, not only mean-error metrics. EPT-2 benchmarks run against more than 10,000 real ground stations on open-source StationBench, with no post-processing or station fine-tuning.
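One way to separate tail-event performance from mean-error metrics is to score the same forecast twice, once on all hours and once only on hours above a price threshold. The Python sketch below illustrates this with invented prices; the 300 €/MWh threshold is an illustrative choice.

```python
def mean_abs_error(actual, forecast):
    """Mean absolute error over paired observations."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def overall_and_tail_mae(actual, forecast, tail_threshold):
    """Return (MAE on all hours, MAE on hours where the actual is >= tail_threshold)."""
    tail_pairs = [(a, f) for a, f in zip(actual, forecast) if a >= tail_threshold]
    if not tail_pairs:
        raise ValueError("no tail events in the evaluation window")
    tail_actual, tail_forecast = zip(*tail_pairs)
    return mean_abs_error(actual, forecast), mean_abs_error(tail_actual, tail_forecast)


# Invented day-ahead prices in EUR/MWh, with two spike hours.
actual_prices = [90.0, 110.0, 95.0, 320.0, 350.0, 105.0]
forecast_prices = [85.0, 115.0, 100.0, 260.0, 270.0, 100.0]

overall_mae, tail_mae = overall_and_tail_mae(actual_prices, forecast_prices, tail_threshold=300.0)
print(f"overall MAE {overall_mae:.1f}, tail MAE {tail_mae:.1f} EUR/MWh")
```

In this toy window the overall MAE is about 26.7 €/MWh while the tail MAE is 70 €/MWh. A mean-only report would hide that gap.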
Practical evaluation criteria. Start with a live benchmark on your highest-stakes region and variable, and compare MAE, RMSE, and CRPS against your current provider on the same time window. This head-to-head view confirms whether accuracy claims hold for your geography. After confirming accuracy, request hindcast access for backtesting so you can validate performance on historical tail events. Finally, verify dissemination time, because a forecast that arrives too late cannot inform your position. Jua runs complete approximately 2.5 hours ahead of competing operational runs at the same cycle. A 1 GW wind portfolio that gains four percentage points of forecast accuracy saves approximately €1.5 million per year, and a 1 GW solar portfolio at the same accuracy gain saves approximately €3 million per year.
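The dissemination check in the final step comes down to simple time arithmetic. The sketch below uses hypothetical availability times and an illustrative 12:00 decision deadline. Only the 2.5-hour offset is taken from the text.

```python
from datetime import datetime, timedelta


def usable_lead(available_at, decision_deadline):
    """Time between forecast availability and the decision deadline.

    A negative result means the forecast arrives too late to inform the position.
    """
    return decision_deadline - available_at


# Hypothetical timestamps for one trading day.
deadline = datetime(2026, 1, 15, 12, 0)                 # illustrative decision deadline
early_run = datetime(2026, 1, 15, 5, 30)                # assumed availability of the earlier run
later_run = early_run + timedelta(hours=2, minutes=30)  # a run arriving 2.5 hours later

early_lead = usable_lead(early_run, deadline)
later_lead = usable_lead(later_run, deadline)
print(f"earlier run: {early_lead}, later run: {later_lead}")
```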
Run your own due-diligence benchmark on the Jua platform
Frequently Asked Questions
How does the physics-foundation-model-plus-agent category differ from standard AI weather models?
A physics foundation model learns the governing conservation laws of a physical system, including mass, momentum, and energy, directly from observational data in a latent representation that is integrated forward in time. Outputs remain physically constrained by construction rather than statistically plausible by coincidence. An AI agent layer converts those outputs into briefings, benchmarks, backtests, and alerts without requiring a human analyst in the loop. Standard AI weather models such as Aurora and GraphCast function as research outputs. They produce raw forecast files without a productised ensemble, a refresh schedule, or an agent layer. The category distinction matters because the agent converts forecast accuracy into trading decisions at the speed the market requires.
What inputs power EPT-2, and what outputs does Jua for Energy provide?
EPT-2 is trained on more than 5 petabytes of weather and climate data from over 120 distinct sources, including geostationary and polar-orbiting satellites, surface station networks covering more than 10,000 stations, national radar networks, ocean buoys, ERA5 reanalysis, and operational ECMWF HRES initial-condition fields. Jua for Energy delivers weather forecasts across 25 variables, including wind at 11 height levels from 10 m to 200 m, surface solar radiation, temperature, precipitation, and cloud cover. It also delivers power forecasts for solar, wind onshore, wind offshore, total wind, total renewables, load, and residual load across Germany, Great Britain, France, the Netherlands, and Belgium. Forecasts are accessible via the Jua platform workspace, the REST API, and the Python SDK.
How should meteorologists or quant teams evaluate forecast quality before procurement?
Teams should run the live benchmark on the Jua platform. They select their highest-stakes region and variable, typically a wind-rich zone in their home market, and compare EPT-2 and EPT-2e against ECMWF HRES and their current provider on MAE, RMSE, and CRPS over a time window that includes at least one high-volatility event. The benchmark returns results in under 30 seconds. For backtesting, they request hindcast access via the SDK and run their own strategy evaluation. Athena can execute a standard backtest in approximately 5 minutes. Model performance is documented in technical reports at arXiv:2507.09703 (EPT-2) and arXiv:2410.15076 (EPT-1.5), and EPT-2 is benchmarked against more than 10,000 real ground stations with no post-processing.
What are the integration requirements for connecting Jua for Energy to existing trading or risk systems?
The Jua platform exposes a REST API at query.jua.ai/docs with Apache Arrow support for large payloads and a Python SDK installable via pip install jua. The unified schema covers all 25+ models on the platform, so switching or comparing models does not require re-engineering pipelines. ENTSO-E grid data, including actual generation, capacity, and PSR classifications, is integrated directly. Hindcast data is available across multiple Jua and third-party models for backtesting. Teams that have built comparable integrations elsewhere report that the Jua integration stands up in days rather than a quarter.
Which teams typically deploy Jua for Energy, and how do they use it?
Three hands-on roles account for most deployments, with senior decision-makers as a fourth audience during evaluation. Meteorologists at regulated utilities use the benchmarking surface and the weather forecast variables to evaluate model quality and brief the trading desk, while Athena replaces the manual morning briefing production step. Quant developers at trading houses and funds use the Python SDK and REST API to pipe forecast and hindcast data into systematic models, and they value the unified schema and Apache Arrow support. Traders at utilities, trading houses, and funds use the Day-Ahead and Intraday briefings, divergence and correction alerts, and Athena’s natural-language query layer to act before the market reprices. Senior decision-makers evaluate the platform on the live benchmark and on the portfolio economics, such as a 1 GW wind portfolio at four percentage points of accuracy gain saving approximately €1.5 million per year.
Conclusion: Applying Objective Criteria to European Forecast Accuracy
European energy market forecast accuracy has deteriorated because renewables now account for nearly half of EU electricity consumption, cross-border coupling has tightened the relationship between zone-level imbalances, and the NWP infrastructure the industry runs on updates only two to four times per day. The gap between the physics of the atmosphere and the speed of the market has widened, and the cost of that gap, measured in MAE, MAPE, CRPS, and ultimately in P&L, is quantifiable.
Evaluation criteria remain straightforward. Teams should measure MAE and CRPS on their specific geography and lead time rather than on global averages. They should require a productised ensemble with documented probabilistic skill against ECMWF ENS. They should verify update frequency against their intraday trade horizon and test the agent layer on a real briefing question before procurement. Running the live benchmark themselves provides the final check, because the numbers speak for themselves.
Jua operates as a foundation model and agent company. Jua for Energy is the first applied product, built on EPT and Athena and used by Axpo, TotalEnergies, Statkraft, EnBW, EDF, and Hydro-Québec. EPT-2’s documented performance advantage over ECMWF HRES, combined with EPT-2e’s probabilistic skill against ECMWF ENS, shows how a physics-learned architecture can serve multiple domains. The architecture learns physics, and the domain becomes a variable.
Run a head-to-head benchmark on your own region and variables in under 5 minutes