How To Compute, Interpret, and Benchmark Hourly CRPS

Written by: Olivier Lam, Physical AI Team, Jua.ai AG

Key Takeaways for Hourly CRPS

  • CRPS is the standard metric for evaluating hourly ensemble forecast skill, and lower values mean better probabilistic calibration and sharpness.
  • The finite-ensemble CRPS formula includes a spread term that rewards ensemble dispersion, which prevents degenerate collapsed-ensemble solutions and keeps comparisons meaningful.
  • Per-hour evaluation preserves diurnal structure and suits energy trading, where skill at specific hours carries very different economic weight.
  • Diurnal CRPS patterns vary by variable: solar peaks in early afternoon, wind in the morning boundary-layer transition, and load during the evening demand ramp.
  • Jua for Energy is the only productised ensemble that beats ECMWF ENS on CRPS at virtually every lead time. See how it compares on your portfolio.

Finite-Ensemble CRPS Formula in Practice

For a finite ensemble of M members {x_1, …, x_M} and observation y, the energy-form CRPS decomposes into:

CRPS(x, y) = (1/M) Σ_i |x_i − y| − (1/(2M²)) Σ_i Σ_j |x_i − x_j|

The second term is the ensemble spread term, referred to here as the spread penalty. It is subtracted, so it rewards dispersion among the members. Without it, the score reduces to the mean absolute error of the members, which is minimised by collapsing every member onto a single point, a degenerate solution that carries no uncertainty information. The almost-fair CRPS formulation introduces a fairness parameter ε, where ε = 0 yields fully fair CRPS and ε = 1/M recovers conventional CRPS. This distinction matters when working with small finite ensembles. For operational ensembles, the bias from conventional CRPS is often small. For small ensembles used during training, the almost-fair form is often preferred, and it is the loss function used for training AIFS-CRPS.
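The formula and the two ε limits above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name finite_ensemble_crps is hypothetical, and the (1 − ε) / (2M(M − 1)) weight on the spread term is an assumption chosen so that ε = 1/M reproduces the conventional formula and ε = 0 reproduces the fair one.

```python
import numpy as np


def finite_ensemble_crps(members, obs, eps=None):
    """Energy-form CRPS for one forecast case (illustrative sketch).

    members : 1-D sequence of M ensemble values
    obs     : scalar observation
    eps     : fairness parameter; None means 1/M (conventional CRPS),
              0.0 gives the fair CRPS. The weighting is an assumption
              chosen to match those two limits.
    """
    x = np.asarray(members, dtype=float)
    m = x.size
    if m < 2:
        raise ValueError("need at least two ensemble members")
    if eps is None:
        eps = 1.0 / m

    # First term: mean absolute error of the members.
    accuracy = np.abs(x - obs).mean()

    # Second term: sum of all pairwise absolute differences.
    pair_sum = np.abs(x[:, None] - x[None, :]).sum()
    spread = (1.0 - eps) * pair_sum / (2.0 * m * (m - 1))

    return float(accuracy - spread)


conventional = finite_ensemble_crps([1.0, 2.0, 3.0], obs=2.0)
fair = finite_ensemble_crps([1.0, 2.0, 3.0], obs=2.0, eps=0.0)
print(conventional, fair)
```

For this three-member toy case the conventional score is 2/9 and the fair score is 0, which shows how much the choice of ε can matter when M is small.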

Python CRPS Workflow Using the scores Library

This example shows a complete workflow that builds synthetic hourly ensemble data, computes CRPS for every lead time and initialisation with the scores library, and produces both a per-lead CRPS curve and a 24-hour diurnal profile. With real data, these outputs highlight which lead times and hours of the day carry the highest forecast uncertainty. Install dependencies with pip install scores xarray numpy pandas.

import numpy as np
import pandas as pd
import xarray as xr
import scores.probability as sp

# Synthetic hourly ensemble data
# dims: (lead_time, init_time, member, lat, lon)
rng = np.random.default_rng(42)
n_lead, n_init, n_member = 240, 30, 10
lats, lons = [52.5], [13.4]  # Berlin, single point for illustration
times = pd.date_range("2024-01-01", periods=n_init, freq="6h")
leads = np.arange(1, n_lead + 1)  # lead time in hours

fcst_data = rng.normal(
    loc=8.0,
    scale=2.0,
    size=(n_lead, n_init, n_member, len(lats), len(lons)),
)
obs_data = rng.normal(
    loc=8.0,
    scale=1.5,
    size=(n_lead, n_init, len(lats), len(lons)),
)

fcst = xr.DataArray(
    fcst_data,
    dims=["lead_time", "init_time", "member", "lat", "lon"],
    coords={
        "lead_time": leads,
        "init_time": times,
        "member": np.arange(n_member),
        "lat": lats,
        "lon": lons,
    },
)
obs = xr.DataArray(
    obs_data,
    dims=["lead_time", "init_time", "lat", "lon"],
    coords={"lead_time": leads, "init_time": times, "lat": lats, "lon": lons},
)

# CRPS for every (lead_time, init_time, lat, lon) case.
# scores takes the forecast first and averages over every dimension
# unless preserve_dims says otherwise.
crps_lead = sp.crps_for_ensemble(
    fcst,
    obs,
    ensemble_member_dim="member",
    preserve_dims=["lead_time", "init_time", "lat", "lon"],
)

# Per-lead-time CRPS (mean over init_time, lat, lon)
crps_by_lead = crps_lead.mean(dim=["init_time", "lat", "lon"])

# Diurnal CRPS profile
# valid_hour = (init hour + lead_time hours) mod 24, shape (init, lead)
valid_hour = (
    fcst["init_time"].dt.hour.values[:, None]
    + fcst["lead_time"].values[None, :]
) % 24

# Fix the axis order explicitly so it lines up with the mask below.
crps_vals = (
    crps_lead.mean(dim=["lat", "lon"])
    .transpose("lead_time", "init_time")
    .values
)  # (lead, init)

diurnal = {}
for h in range(24):
    mask = valid_hour.T == h  # (lead, init)
    vals = crps_vals[mask]
    diurnal[h] = float(np.nanmean(vals)) if vals.size else np.nan

diurnal_series = pd.Series(diurnal, name="CRPS").rename_axis("valid_hour_UTC")
print(diurnal_series.round(4))
print(f"\nMean CRPS across all leads: {float(crps_by_lead.mean()):.4f}")

The key design choice is computing CRPS at the finest available granularity, per lead time and per initialisation, before aggregating. The valid hour of each forecast is treated as a distinct analysis dimension rather than being collapsed early by lead-time aggregation. This approach preserves diurnal structure that aggregation would otherwise mask.

Run this benchmark on your own region and variables on the Jua platform, then start your live comparison against 25+ models in under 5 minutes.

Choosing Hourly vs Aggregated Evaluation

Two evaluation strategies apply to large hourly ensemble datasets. Per-hour evaluation retains the full temporal structure, because each (lead time, valid hour) cell is scored independently. This produces a matrix that exposes both lead-time degradation and diurnal patterns at the same time. Lead-time aggregation collapses the valid-hour dimension by averaging CRPS across all initialisation times at each lead. That approach produces a single curve that is easier to communicate but hides intraday skill variation.
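The two strategies can be contrasted with a small pandas sketch. The table below is synthetic, and the column names (init_time, lead_time, crps, valid_hour) are illustrative assumptions rather than a fixed schema; any long-format table with one CRPS value per forecast case would work the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format table: one synthetic CRPS value per forecast case.
rng = np.random.default_rng(0)
init_times = pd.date_range("2024-01-01", periods=8, freq="6h")
leads = np.arange(1, 49)  # lead time in hours

cases = pd.MultiIndex.from_product(
    [init_times, leads], names=["init_time", "lead_time"]
).to_frame(index=False)
cases["crps"] = rng.gamma(shape=2.0, scale=0.5, size=len(cases))
cases["valid_hour"] = (cases["init_time"].dt.hour + cases["lead_time"]) % 24

# Per-hour evaluation: one mean CRPS per (lead time, valid hour) cell.
crps_matrix = cases.pivot_table(
    index="lead_time", columns="valid_hour", values="crps", aggfunc="mean"
)

# Lead-time aggregation: one mean CRPS per lead time, valid hour collapsed.
crps_curve = cases.groupby("lead_time")["crps"].mean()

print(crps_matrix.shape, crps_curve.shape)
```

With four initialisations per day, each lead time only ever verifies at four valid hours, so most cells of the matrix are empty. That sparsity is worth checking on real data before reading diurnal patterns off individual cells.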

For energy trading applications, per-hour evaluation is the correct default. Solar generation forecasts at 13:00 UTC carry different economic weight than those at 03:00 UTC, and aggregating across both obscures the skill that matters at peak demand. CRPS-based assessment of uncertainty quantification in power-system applications uses day-ahead solar, wind, and load forecasts evaluated at the hourly level, not collapsed to daily means. Aggregation is appropriate for summary reporting and cross-model comparison tables where a single number per lead band is required.

Reading Diurnal CRPS Patterns

CRPS varies across the 24-hour cycle rather than remaining flat. Probabilistic solar and wind forecasts generated via Bayesian Model Averaging for energy applications show that CRPS-based uncertainty quantification changes systematically with the diurnal cycle. Solar CRPS peaks in the early afternoon when irradiance variance is highest and cloud-cover uncertainty is largest. Wind CRPS typically peaks during the morning boundary-layer transition (06:00–10:00 local time), when daytime mixing breaks down the nocturnal low-level jet and ensemble spread widens. Load CRPS peaks during the early-evening ramp, when demand uncertainty compounds weather uncertainty.

A flat diurnal CRPS profile indicates the ensemble performs uniformly across the day, which is desirable but rare. A profile with sharp peaks at specific hours often indicates the ensemble underestimates spread at those hours, and this behaviour translates directly to underpriced risk in day-ahead and intraday positions. Traders should weight CRPS improvements at peak-demand hours more heavily than improvements at off-peak hours when they compare ensemble providers.
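One simple way to apply that weighting is an hour-weighted mean of the diurnal CRPS profile. The sketch below is illustrative only: the function name, both provider profiles, and the fourfold weight on hours 17 to 20 are made-up assumptions, and real weights would come from the economics of the portfolio.

```python
import numpy as np


def weighted_diurnal_crps(crps_by_hour, weights_by_hour):
    """Weighted mean of a 24-value diurnal CRPS profile (hypothetical helper)."""
    crps = np.asarray(crps_by_hour, dtype=float)
    w = np.asarray(weights_by_hour, dtype=float)
    if crps.shape != (24,) or w.shape != (24,):
        raise ValueError("expected 24 hourly values and 24 hourly weights")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    return float(np.sum(w * crps) / np.sum(w))


# Two made-up provider profiles: A is better off-peak, B is better at the peak.
profile_a = np.full(24, 1.0)
profile_a[17:21] = 1.6
profile_b = np.full(24, 1.1)
profile_b[17:21] = 1.2

# Illustrative weights: hours 17-20 count four times as much as other hours.
weights = np.ones(24)
weights[17:21] = 4.0

flat_a = weighted_diurnal_crps(profile_a, np.ones(24))
flat_b = weighted_diurnal_crps(profile_b, np.ones(24))
peak_a = weighted_diurnal_crps(profile_a, weights)
peak_b = weighted_diurnal_crps(profile_b, weights)
print(flat_a, flat_b, peak_a, peak_b)
```

In this toy case provider A has the lower unweighted mean, while provider B has the lower peak-weighted score, so the ranking flips once peak hours are emphasised.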

EPT-2e vs ECMWF ENS CRPS Benchmarks

EPT-2e, the ensemble variant of Jua’s Earth Physics Transformer foundation model, beats the 50-member ECMWF ENS mean on CRPS at virtually every lead time across the 0–240 hour range, as documented in arXiv:2507.09703. The table below summarises the direction of the difference at short, medium, and extended lead times. The consistent sign indicates systematic rather than lead-time-specific skill improvement. Lower values indicate better probabilistic skill.

Lead Time (h) | EPT-2e CRPS (m s⁻¹) | ECMWF ENS Mean CRPS (m s⁻¹) | Difference (EPT-2e − ENS)
24 | lower than ENS mean | Reference | EPT-2e superior
72 | lower than ENS mean | Reference | EPT-2e superior
120 | lower than ENS mean | Reference | EPT-2e superior
240 | lower than ENS mean | Reference | EPT-2e superior

The full numeric CRPS surface across all lead times and variables appears in arXiv:2507.09703. EPT-2e achieves this performance with 30 members against the ECMWF ENS’s 50 members and updates 4x/day. This result reflects the quality of the underlying EPT physics foundation model rather than ensemble size alone. The evaluation methodology uses StationBench, benchmarked against more than 10,000 real ground stations with no post-processing or station fine-tuning.

Request a live EPT-2e benchmark for your portfolio.

CRPSS Skill Score Calculation

The Continuous Ranked Probability Skill Score (CRPSS) normalises CRPS against a reference forecast to produce a dimensionless skill measure:

CRPSS = 1 − (CRPS_model / CRPS_reference)

CRPSS = 1 indicates a perfect forecast. CRPSS = 0 indicates no improvement over the reference. CRPSS < 0 indicates the model is worse than the reference. Because CRPSS is a relative measure, the choice of reference determines what improvement means in practice. A climatological ensemble, persistence, and a competing operational model are all valid references, but each sets a different skill floor. For energy trading, the natural reference is the ECMWF ENS mean, which is the incumbent probabilistic standard. A CRPSS of 0.05 against ECMWF ENS at 72-hour lead time for 100 m wind is economically meaningful. For a 1 GW wind portfolio, a four-percentage-point accuracy gain translates to approximately €1.5 M per year in reduced hedging and imbalance costs.
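The skill-score formula is a one-liner once mean CRPS values are available for the model and the reference over the same forecast cases. The sketch below uses a hypothetical helper name and made-up CRPS values per lead time; it returns NaN where the reference CRPS is zero, which is one possible convention among several.

```python
import numpy as np


def crpss(crps_model, crps_reference):
    """CRPSS = 1 - CRPS_model / CRPS_reference, element-wise (illustrative).

    Both inputs should be mean CRPS over the same forecast cases.
    Entries with a non-positive reference CRPS are returned as NaN.
    """
    model = np.asarray(crps_model, dtype=float)
    ref = np.asarray(crps_reference, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(ref > 0, 1.0 - model / ref, np.nan)


# Made-up mean CRPS values at four lead times (24, 72, 120, 240 h).
model_crps = np.array([0.76, 0.95, 1.30, 1.80])
reference_crps = np.array([0.80, 1.00, 1.35, 1.75])

skill = crpss(model_crps, reference_crps)
print(np.round(skill, 3))
```

Averaging CRPS over cases first and taking the ratio afterwards keeps the score stable. Averaging per-case skill scores instead lets cases with a near-zero reference CRPS dominate the result.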

For quantile-based models, CRPS is approximated by empirical integration over the predicted quantile levels. This approach allows CRPSS to be computed even when the full forecast CDF is not available, which is relevant when benchmarking vendor outputs that expose quantiles rather than raw ensemble members.
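One common way to do this integration uses the identity that CRPS equals twice the integral of the pinball (quantile) loss over all quantile levels. The sketch below applies the trapezoidal rule over whatever levels the vendor provides. The function name is hypothetical, and because the integration stops at the outermost levels it ignores the tails and slightly underestimates the true CRPS.

```python
from statistics import NormalDist

import numpy as np


def crps_from_quantiles(levels, quantiles, obs):
    """Approximate CRPS from predicted quantiles (illustrative sketch).

    levels    : strictly increasing quantile levels in (0, 1)
    quantiles : predicted values at those levels
    obs       : scalar observation
    """
    tau = np.asarray(levels, dtype=float)
    q = np.asarray(quantiles, dtype=float)
    if tau.ndim != 1 or tau.shape != q.shape or tau.size < 2:
        raise ValueError("levels and quantiles must be 1-D and equally long")
    if np.any(np.diff(tau) <= 0):
        raise ValueError("levels must be strictly increasing")

    # Pinball loss of each predicted quantile against the observation.
    diff = obs - q
    pinball = np.maximum(tau * diff, (tau - 1.0) * diff)

    # Trapezoidal rule over the available levels, then CRPS = 2 * integral.
    integral = np.sum(0.5 * (pinball[1:] + pinball[:-1]) * np.diff(tau))
    return float(2.0 * integral)


# Check against the closed-form CRPS of a standard normal forecast at obs = 0.
levels = np.linspace(0.01, 0.99, 99)
quantiles = np.array([NormalDist(0.0, 1.0).inv_cdf(p) for p in levels])
approx = crps_from_quantiles(levels, quantiles, obs=0.0)
exact = 2.0 * NormalDist().pdf(0.0) - 1.0 / np.sqrt(np.pi)
print(round(approx, 4), round(exact, 4))
```

The approximation degrades when only a handful of quantiles are exposed, so model and reference should be scored on the same set of levels before forming a CRPSS.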

Energy-Trading Use Case

CRPS and CRPSS translate directly into trading decisions. Lower CRPS at 24–72 hour lead times means tighter, better-calibrated probability intervals around day-ahead and intraday positions. Traders using ensemble forecasts with superior CRPS can size positions more confidently, set tighter stop-loss thresholds on weather-driven spread trades, and price weather-derivative exposures with less model uncertainty embedded in the premium.

Three specific use cases illustrate how CRPS improvements at different lead times unlock distinct trading strategies. For short-lead wind positions at 6–18 hour horizons, low CRPS enables wind-ramp detection, because the ensemble identifies the probability mass of a ramp event before the market prices it in. For intraday solar rebalancing, a diurnal CRPS profile that remains flat through the afternoon peak signals reliable calibration during the highest-uncertainty hours. This condition allows traders to hold intraday positions through cloud-cover transitions with more confidence. At longer horizons of 5–10 days, temperature ensemble CRPS determines how far out a gas trader can hedge demand exposure with statistical confidence, which links forecast skill directly to hedging strategy.

Given EPT-2e’s consistent CRPS advantage over ECMWF ENS, traders can use the Jua platform to turn that skill into P&L. The platform provides a live 25-model benchmarking surface, including ECMWF ENS, ECMWF HRES, ECMWF AIFS, Microsoft Aurora, GFS GraphCast, and DWD ICON, on any region and variable, and it returns results in seconds. Hindcast data is available across multiple Jua and third-party models for backtesting CRPSS against years of historical forecasts. Athena, Jua’s AI agent instrumented with the Jua for Energy tool surface, answers natural-language CRPS verification queries such as “what is the CRPSS of EPT-2e versus ECMWF ENS on 100 m wind for northern Germany over the last two winters?” and returns a full backtest report in approximately 5 minutes.

Test EPT-2e against your current provider on live trading scenarios.

Frequently Asked Questions

Why does the finite-ensemble CRPS formula include a spread penalty, and when does it matter most?

The spread penalty, the double-sum term (1/(2M²)) Σ_i Σ_j |x_i − x_j|, is subtracted from the mean absolute error of the members, so it rewards ensembles that carry realistic dispersion. Without it, the score would reduce to the mean absolute error of the members, which can be minimised by collapsing all members onto a single value. A collapsed, effectively deterministic forecast would then score at least as well on average as an honest ensemble, which would make the metric useless for comparing probabilistic systems. The finite-ensemble bias of the conventional formula matters most when ensemble size is small. For very small ensembles, such as M = 2, the almost-fair CRPS formulation with a small fairness parameter ε is preferred. For operational ensembles with M ≥ 10, the bias introduced by the conventional formula is small enough to ignore in most energy applications. EPT-2e and the 50-member ECMWF ENS both meet that threshold, so comparing raw CRPS values between them is valid. Where a residual size effect is a concern, scoring both ensembles with the fair form (ε = 0) removes it.

What diurnal hours show the highest CRPS for wind and solar ensemble forecasts in energy applications?

For solar variables, CRPS peaks in the early-to-mid afternoon, roughly 12:00–15:00 local time, when irradiance is highest and cloud-cover uncertainty is largest. A well-calibrated solar ensemble should show elevated but controlled CRPS at these hours. An overconfident ensemble will show artificially low spread and then verify poorly against observations. For wind variables, CRPS typically peaks during the morning boundary-layer transition, 06:00–10:00 local time, when daytime mixing breaks down the nocturnal low-level jet and ensemble spread widens rapidly. For load forecasts, the evening demand ramp, 17:00–20:00 local time, is the highest-uncertainty period because demand uncertainty and weather uncertainty compound. Traders should inspect the diurnal CRPS profile of any ensemble they use and weight accuracy improvements at these peak hours more heavily than improvements at off-peak hours when they make procurement decisions.

How should CRPSS be interpreted when benchmarking AI ensembles against ECMWF ENS in energy trading?

CRPSS against ECMWF ENS as the reference is the correct benchmark for energy trading because ECMWF ENS is the incumbent probabilistic standard the industry prices against. A positive CRPSS means the model under evaluation is more skillful than ECMWF ENS at that lead time and variable. Even a CRPSS of 0.03–0.05 at 72-hour lead time for wind or temperature is economically significant, in line with the cost reductions described earlier. For a 1 GW solar portfolio, the same gain translates to approximately €3 M per year because intraday volatility is higher. CRPSS should be computed separately for each variable and lead-time band relevant to the trading horizon, because aggregating across all lead times can mask skill at the specific horizons that drive P&L. Negative CRPSS at short lead times, 0–6 hours, is common for global ensembles that lack rapid-refresh initialisation. This behaviour is expected and does not invalidate the ensemble at medium-range horizons.

What is the most efficient way to compute CRPS on large hourly ensemble datasets without running out of memory?

The energy-form CRPS formula requires computing pairwise absolute differences across ensemble members, which scales as O(M²) per grid point. For large datasets such as continental grids at hourly cadence over multiple years, three strategies reduce memory pressure. First, compute CRPS point-wise in physical space before any spatial aggregation, which avoids holding intermediate arrays for the full spatial field in memory at once. Second, chunk the xarray dataset along the initialisation-time dimension and process each chunk independently, writing results to disk before loading the next. Third, for M ≥ 10, the sorted-member form of CRPS, which scales as O(M log M) via sorting rather than O(M²) via double summation, is mathematically equivalent and substantially faster. Whether a given library uses the sorted form internally varies, so check the implementation you rely on before assuming it. For diurnal profiling specifically, computing valid_hour as (init_hour + lead_time) mod 24 and grouping by that coordinate before averaging is more memory-efficient than constructing a full (lead_time × init_time × valid_hour) tensor and then reducing.
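The second and third strategies can be sketched together in NumPy. The sorted form relies on the identity Σ_i Σ_j |x_i − x_j| = 2 Σ_k (2k − M − 1) x_(k), where x_(k) is the k-th smallest member. All function names below are hypothetical, and the chunk loop works on in-memory arrays purely for illustration; a real pipeline would read each chunk lazily from disk.

```python
import numpy as np


def crps_pairwise(members, obs):
    """Conventional CRPS via the O(M^2) double sum; members on the last axis."""
    x = np.asarray(members, dtype=float)
    y = np.asarray(obs, dtype=float)
    m = x.shape[-1]
    accuracy = np.abs(x - y[..., None]).mean(axis=-1)
    pair_sum = np.abs(x[..., :, None] - x[..., None, :]).sum(axis=(-2, -1))
    return accuracy - pair_sum / (2 * m**2)


def crps_sorted(members, obs):
    """Same score via sorting, O(M log M) in time and O(M) in memory per case."""
    x = np.sort(np.asarray(members, dtype=float), axis=-1)
    y = np.asarray(obs, dtype=float)
    m = x.shape[-1]
    accuracy = np.abs(x - y[..., None]).mean(axis=-1)
    k = np.arange(1, m + 1)
    spread = ((2 * k - m - 1) * x).sum(axis=-1) / m**2
    return accuracy - spread


def crps_in_chunks(members, obs, chunk=64):
    """Score a (cases, members) array in blocks along the leading axis."""
    x = np.asarray(members, dtype=float)
    y = np.asarray(obs, dtype=float)
    out = np.empty(y.shape[0])
    for start in range(0, y.shape[0], chunk):
        block = slice(start, start + chunk)
        out[block] = crps_sorted(x[block], y[block])
    return out


rng = np.random.default_rng(1)
ens = rng.normal(size=(200, 30))   # 200 synthetic cases, 30 members
truth = rng.normal(size=200)
print(np.allclose(crps_pairwise(ens, truth), crps_sorted(ens, truth)))
```

The pairwise version materialises an M × M block per case, while the sorted version only needs the M sorted members, which is where most of the memory saving comes from.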
