Atmospheric Model Skill Scores: HSS, ACC & BSS Guide 2026

Name: Athena
Brand: Jua

Written by: Olivier Lam, Physical AI Team, Jua.ai AG

Key Takeaways for Energy and Weather Teams

Atmospheric model skill scores measure how much a forecast improves on baselines like climatology or persistence. Positive values show better performance than the reference.
Core metrics include RMSE for error size, CRPS for probabilistic accuracy, and skill scores such as HSS, ACC, BSS, and RPSS that range from negative values to a perfect score of 1.
Good skill scores depend on use case. Anomaly correlations above 0.6 are skillful for medium-range forecasts, while energy traders typically need scores above 0.2–0.3 for profitable decisions.
EPT-2 and its ensemble version EPT-2e outperform ECMWF HRES and GFS on RMSE and CRPS for wind, temperature, and solar variables across most lead times.
Book a demo with Jua to benchmark forecasts on your region and variables and see results in under 5 minutes.

How forecasters calculate atmospheric skill scores

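The Heidke Skill Score (HSS) evaluates categorical forecasts by comparing the proportion of correct predictions against the proportion expected by random chance. The formula is HSS = (Po – Pe) / (1 – Pe), where Po is the observed proportion of correct forecasts and Pe is the proportion expected to be correct by chance, computed from the marginal forecast and observation frequencies. HSS ranges from -1 to 1, where 0 indicates no skill relative to random chance, 1 indicates perfect forecasts, and negative values indicate forecasts worse than chance.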
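The HSS formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming a square contingency table with forecast categories in rows and observed categories in columns. The function name `heidke_skill_score` and the counts are hypothetical and not taken from any Jua tooling.

```python
def heidke_skill_score(table):
    """HSS = (Po - Pe) / (1 - Pe) for a square contingency table.

    table[i][j] is the number of cases with forecast category i
    and observed category j (an assumed layout for this sketch).
    """
    n = sum(sum(row) for row in table)
    if n == 0:
        raise ValueError("contingency table is empty")
    k = len(table)
    # Po: fraction of cases on the diagonal (correct forecasts).
    po = sum(table[i][i] for i in range(k)) / n
    # Pe: agreement expected by chance from the marginal totals.
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    pe = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    if pe == 1.0:
        raise ValueError("HSS is undefined when chance agreement equals 1")
    return (po - pe) / (1.0 - pe)


# Illustrative counts only: "wind above threshold" yes/no over 100 cases.
example_table = [
    [30, 10],  # forecast yes: 30 hits, 10 false alarms
    [5, 55],   # forecast no: 5 misses, 55 correct negatives
]
hss_example = heidke_skill_score(example_table)
print(f"HSS = {hss_example:.3f}")
```

With these counts Po is 0.85 and Pe is 0.53, so the score is 0.32 / 0.47, or roughly 0.68.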

The Anomaly Correlation Coefficient (ACC) measures linear agreement between forecast and observed anomalies relative to climatology. ACC = Σ[(Fi – Fc)(Oi – Oc)] / √[Σ(Fi – Fc)² × Σ(Oi – Oc)²], where Fi and Oi are forecast and observed values, and Fc and Oc are climatological means. Values near 1 indicate strong linear agreement, 0 indicates no relationship, and negative values indicate anticorrelation.
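As a rough sketch of the ACC formula above, the snippet below subtracts a scalar climatological mean from each series and correlates the anomalies. The function name `anomaly_correlation` and the sample temperatures are illustrative assumptions. Operational verification usually uses a climatology that varies by grid point and day of year.

```python
import math


def anomaly_correlation(forecasts, observations, forecast_clim, observed_clim):
    """ACC between forecast and observed anomalies about climatological means."""
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have equal length")
    f_anom = [f - forecast_clim for f in forecasts]
    o_anom = [o - observed_clim for o in observations]
    numerator = sum(fa * oa for fa, oa in zip(f_anom, o_anom))
    denominator = math.sqrt(
        sum(fa ** 2 for fa in f_anom) * sum(oa ** 2 for oa in o_anom)
    )
    if denominator == 0:
        raise ValueError("ACC is undefined when all anomalies are zero")
    return numerator / denominator


# Illustrative 2 m temperatures in degrees C with an assumed climatology of 10.
acc_example = anomaly_correlation(
    forecasts=[11.0, 9.0, 12.0],
    observations=[11.0, 11.0, 12.0],
    forecast_clim=10.0,
    observed_clim=10.0,
)
print(f"ACC = {acc_example:.3f}")
```

Here the forecast anomalies are 1, -1, 2 and the observed anomalies are 1, 1, 2, which gives 4 / 6, or about 0.67.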

The Brier Skill Score (BSS) evaluates probabilistic forecasts against climatological frequency. BSS = 1 – (BS / BSref), where BS is the Brier Score of the forecast and BSref is the Brier Score of the reference forecast. BSS is positive when a probabilistic forecast outperforms the climatological reference forecast, zero when it shows no improvement over climatology, and negative when it performs worse than climatology.
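The BSS definition above can be illustrated as follows. This sketch assumes binary outcomes coded as 0 or 1 and, when no reference probability is supplied, uses the event frequency of the verification sample itself as climatology. That is a simplification, since a long-term climatology is normally preferred. The function names are hypothetical.

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between forecast probability and 0/1 outcome."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def brier_skill_score(probabilities, outcomes, reference_probability=None):
    """BSS = 1 - BS / BSref against a constant reference probability."""
    if reference_probability is None:
        # Assumption: use the sample event frequency as climatology.
        reference_probability = sum(outcomes) / len(outcomes)
    bs = brier_score(probabilities, outcomes)
    bs_ref = brier_score([reference_probability] * len(outcomes), outcomes)
    if bs_ref == 0:
        raise ValueError("BSS is undefined when the reference Brier Score is 0")
    return 1.0 - bs / bs_ref


# Illustrative probabilities for an event that occurred in cases 1 and 3.
bss_example = brier_skill_score([0.9, 0.1, 0.8, 0.2], [1, 0, 1, 0])
print(f"BSS = {bss_example:.2f}")
```

In this example BS is 0.025 and BSref is 0.25, so the BSS is 0.9. Issuing the climatological probability of 0.5 every time would score exactly 0.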

The Ranked Probability Skill Score (RPSS) extends the Brier Score to multi-category probabilistic forecasts. RPSS = 1 – (RPS / RPSref), where RPS is the Ranked Probability Score and RPSref is the reference RPS. The Mean Squared Error skill score compares forecast error against a baseline: MSE skill score = 1 – (MSEforecast / MSEbaseline).
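A minimal sketch of the RPSS calculation follows, assuming ordered categories indexed from 0 and the unnormalized form of the RPS (some references divide by the number of categories minus one, which cancels in the skill score). The function names and the three-category example are illustrative assumptions.

```python
from itertools import accumulate


def ranked_probability_score(probabilities, observed_category):
    """RPS for one forecast: squared distance between cumulative distributions."""
    cumulative_forecast = list(accumulate(probabilities))
    # The observed cumulative distribution steps from 0 to 1 at the observed category.
    cumulative_observed = [
        1.0 if index >= observed_category else 0.0
        for index in range(len(probabilities))
    ]
    return sum(
        (cf - co) ** 2 for cf, co in zip(cumulative_forecast, cumulative_observed)
    )


def ranked_probability_skill_score(forecasts, references, observed_categories):
    """RPSS = 1 - mean(RPS) / mean(RPSref) over all cases."""
    n = len(observed_categories)
    if n == 0 or len(forecasts) != n or len(references) != n:
        raise ValueError("need equal-length, non-empty inputs")
    rps = sum(
        ranked_probability_score(p, o) for p, o in zip(forecasts, observed_categories)
    ) / n
    rps_ref = sum(
        ranked_probability_score(p, o) for p, o in zip(references, observed_categories)
    ) / n
    if rps_ref == 0:
        raise ValueError("RPSS is undefined when the reference RPS is 0")
    return 1.0 - rps / rps_ref


# Illustrative below/near/above-normal forecasts against equal-odds climatology.
climatology = [1 / 3, 1 / 3, 1 / 3]
rpss_example = ranked_probability_skill_score(
    forecasts=[[0.1, 0.3, 0.6], [0.6, 0.3, 0.1]],
    references=[climatology, climatology],
    observed_categories=[2, 0],
)
print(f"RPSS = {rpss_example:.3f}")
```

Both forecasts score an RPS of 0.17 against a climatological RPS of 5/9, giving an RPSS of about 0.69.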

Consider a worked example using 10-meter wind speed forecasts. A station observes 8.2 m/s, the forecast predicted 7.8 m/s, and the climatological average is 6.5 m/s. The forecast error is 0.4 m/s and the climatological error is 1.7 m/s, so the squared errors are 0.16 and 2.89 m²/s². The MSE skill score = 1 – (0.16 / 2.89) = 0.94, which indicates a 94% reduction in squared error relative to climatology. In practice, the MSE terms are averaged over many forecast-observation pairs rather than taken from a single case.
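The worked example can be reproduced with a short function that also handles more than one forecast-observation pair. The names `mse` and `mse_skill_score` are hypothetical helpers for this sketch.

```python
def mse(forecasts, observations):
    """Mean squared error over paired forecasts and observations."""
    if len(forecasts) != len(observations) or not observations:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(observations)


def mse_skill_score(forecasts, baseline, observations):
    """MSE skill score = 1 - MSEforecast / MSEbaseline."""
    mse_baseline = mse(baseline, observations)
    if mse_baseline == 0:
        raise ValueError("skill score is undefined for a perfect baseline")
    return 1.0 - mse(forecasts, observations) / mse_baseline


# Single case from the text: observed 8.2, forecast 7.8, climatology 6.5 (m/s).
single_case_skill = mse_skill_score([7.8], [6.5], [8.2])
print(f"MSE skill score = {single_case_skill:.2f}")
```

Passing full series instead of single values gives the averaged score that would normally be reported.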

How to judge whether a skill score is “good”

Skill score interpretation depends on the variable, lead time, and reference baseline. For anomaly correlation or correlation-based metrics, values near 1 indicate strong linear agreement between forecast and observations, 0 indicates no linear relationship, and negative values indicate anticorrelation. Anomaly correlation coefficients above 0.6 are considered skillful for medium-range forecasts, while values above 0.8 indicate high skill.

For wind energy applications, RMSE skill scores above 0.3 represent meaningful improvement over persistence forecasts at 24–48 hour lead times. Solar irradiance forecasts typically achieve skill scores of 0.4–0.7 against climatology at day-ahead horizons. Temperature forecasts often maintain skill scores above 0.5 out to 7–10 days in mid-latitudes.
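The article does not define the RMSE skill score explicitly. The sketch below assumes the common form 1 – RMSEforecast / RMSEreference, with persistence built by carrying the observation forward by a fixed number of time steps. The function names, the six-value series, and the two-step lead are illustrative assumptions.

```python
import math


def rmse(forecasts, observations):
    """Root mean squared error over paired values."""
    return math.sqrt(
        sum((f - o) ** 2 for f, o in zip(forecasts, observations)) / len(observations)
    )


def rmse_skill_vs_persistence(forecasts, observations, lead_steps):
    """RMSE skill of forecasts against a persistence reference.

    forecasts[i] is assumed valid at the same time as observations[i].
    Persistence for time i is the observation at i - lead_steps.
    """
    if len(forecasts) != len(observations):
        raise ValueError("forecasts and observations must have equal length")
    if not 0 < lead_steps < len(observations):
        raise ValueError("lead_steps must be between 1 and len(observations) - 1")
    valid_observations = observations[lead_steps:]
    valid_forecasts = forecasts[lead_steps:]
    persistence = observations[:-lead_steps]
    reference_rmse = rmse(persistence, valid_observations)
    if reference_rmse == 0:
        raise ValueError("skill is undefined when persistence is perfect")
    return 1.0 - rmse(valid_forecasts, valid_observations) / reference_rmse


# Illustrative wind speeds in m/s at equally spaced times.
observed = [5.0, 6.0, 8.0, 7.0, 9.0, 10.0]
predicted = [5.0, 6.0, 7.5, 7.5, 8.5, 10.5]
persistence_skill = rmse_skill_vs_persistence(predicted, observed, lead_steps=2)
print(f"RMSE skill vs persistence = {persistence_skill:.3f}")
```

In this toy series the forecast RMSE is 0.5 m/s and the persistence RMSE is the square root of 5, so the skill is about 0.78.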

The threshold for operational utility varies by application. Energy traders usually require skill scores above 0.2–0.3 for profitable positioning, while grid operators need higher thresholds of 0.4–0.6 for reliable reserve management. Book a demo to evaluate skill scores on your specific region and variables.

ECMWF vs GFS accuracy for operational decisions

ECMWF HRES consistently outperforms NOAA GFS across most variables and lead times. ECMWF maintains anomaly correlation coefficients above 0.6 for 500 hPa geopotential height out to longer lead times, while GFS drops below this threshold sooner. For surface variables critical to energy trading, ECMWF HRES achieves higher RMSE skill scores than GFS for 10-meter wind speed and 2-meter temperature at multi-day lead times.

The accuracy gap reflects differences in model resolution, data assimilation, and physics parameterizations. ECMWF operates at 9 km resolution, which enables better representation of topographic effects on wind patterns. ECMWF uses a 4D-Var data assimilation system that incorporates observations more effectively than the GFS hybrid ensemble-variational approach.

Despite ECMWF’s accuracy advantage, operational considerations may favor GFS in some scenarios. GFS provides free global forecasts with multiple daily updates, while ECMWF HRES has a varying update schedule and restricts access to members. For applications that require frequent updates, operational cadence can outweigh the accuracy differential. Both models serve as benchmarks for evaluating AI-based alternatives.

How EPT-2 performs on RMSE and CRPS

Jua’s EPT-2 outperforms both ECMWF HRES and NOAA GFS across key energy-relevant variables. EPT-2 beats ECMWF HRES at every lead time for 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation. The ensemble variant EPT-2e achieves superior probabilistic skill, beating the ECMWF ENS mean on both RMSE and CRPS at virtually every lead time.

The following table summarizes how EPT-2 and EPT-2e perform against traditional models across the variables most critical to energy trading:

Model	10m Wind RMSE (m/s)	100m Wind RMSE (m/s)	2m Temp RMSE (K)	CRPS Skill vs Climatology
EPT-2	lowest	lowest	lowest	highest
EPT-2e	lowest	lowest	lowest	highest
ECMWF HRES	baseline	baseline	baseline	N/A
ECMWF ENS	baseline	baseline	baseline	baseline
NOAA GFS	higher	higher	higher	lower

ECMWF reports CRPS skill scores for its ensemble (ENS) forecasts rather than for the deterministic HRES, so the table lists N/A for HRES. For a single deterministic forecast, CRPS reduces to the mean absolute error.

EPT-2 updates four times per day, and the EPT2-HRRR variant forecasts at ~5 km resolution over Europe. This combination of frequent updates and fine spatial resolution provides higher temporal and spatial granularity than traditional numerical weather prediction. The model architecture learns governing physics directly from observational data, producing outputs that respect conservation laws for mass, momentum, and energy. Unlike previous AI weather models that achieved lower RMSE than ECMWF operational IFS on reanalysis data, EPT-2 maintains skill on real-time operational forecasts against live observations.

Microsoft Aurora and Google DeepMind GraphCast represent the previous generation of AI weather models. EPT-2 outperforms Aurora on 10 m wind, 100 m wind, and 2 m temperature across the full 0–240 hour range. Aurora lacks surface solar radiation output entirely, which limits its utility for solar energy applications. GraphCast operates at 25 km resolution with 6-hour fixed time steps, while EPT-2 provides native any-Δt forecasting at arbitrary lead times without rolling forward in fixed increments.

Higher skill scores translate directly to operational value in energy trading. A 1 GW wind portfolio that gains four percentage points of forecast accuracy saves approximately €1.5 million annually through reduced imbalance costs and improved hedging strategies. Solar portfolios achieve even greater savings, with a 1 GW installation saving roughly €3 million per year at equivalent accuracy improvements. Energy traders increasingly seek proprietary forecasts to differentiate their market view, and they move beyond consensus models when weather volatility increases market uncertainty.

Jua is a foundation model and agent company building EPT, a general physics foundation model, and Athena, an AI agent that operates inside it. Jua for Energy applies both technologies to atmospheric forecasting and energy trading, delivering highly accurate forecasts in production. Run benchmarks on your own region and variables on the Jua platform. See your forecasts in less than 5 minutes, head-to-head against 25+ models.

Frequently Asked Questions

How EPT-2 differs from other AI weather models

EPT-2 is built on a spatiotemporal transformer foundation model that learns governing physics directly from observational data. Unlike language models that operate on discrete tokens, EPT learns continuous conservation laws for mass, momentum, and energy that constrain atmospheric behavior. The architecture produces native any-Δt forecasting at arbitrary lead times, while competitors like Aurora roll forward in fixed 6-hour steps that compound error. EPT-2 also refreshes four times daily, while competitors update less frequently, and its ensemble variant EPT-2e adds probabilistic forecasts, so the system delivers both deterministic and probabilistic skill that surpass traditional numerical weather prediction.

How skill scores connect to trading performance

Skill scores quantify forecast improvement over baseline methods, and that improvement directly affects energy trading profitability. Higher skill scores reduce imbalance costs, improve hedging effectiveness, and enable better positioning around weather-driven price movements. A four percentage point improvement in wind forecast accuracy typically saves €1.5 million annually for a 1 GW portfolio, while solar portfolios achieve €3 million in savings at similar accuracy gains. As noted earlier, traders typically need skill scores above 0.2–0.3 before these gains become profitable.

Why ensemble forecasts often achieve higher skill

Ensemble forecasts sample uncertainty by running multiple model simulations with slightly different initial conditions or physics parameters. This approach captures forecast uncertainty and provides probabilistic information that single deterministic runs cannot deliver. EPT-2e with 30 members consistently outperforms the 50-member ECMWF ENS mean on both RMSE and CRPS metrics. Ensemble means typically reduce random error through averaging, while ensemble spread quantifies forecast confidence. Energy traders use ensemble information to assess risk and size positions appropriately when model uncertainty is high.
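To make the ensemble quantities concrete, here is a sketch of the ensemble mean, the ensemble spread, and a standard sample estimator of CRPS for one forecast case. The estimator is mean |x_i – y| minus half the mean |x_i – x_j| over all member pairs. It is a common textbook form and is not claimed to be the method used for the EPT-2e or ECMWF ENS results above. The three-member ensemble is illustrative only.

```python
import statistics


def ensemble_crps(members, observation):
    """Sample CRPS estimate for one ensemble forecast and one observation."""
    m = len(members)
    if m == 0:
        raise ValueError("ensemble has no members")
    # Mean absolute error of the individual members.
    accuracy_term = sum(abs(x - observation) for x in members) / m
    # Half the mean absolute difference between all member pairs.
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return accuracy_term - spread_term


# Illustrative 100 m wind ensemble in m/s with a verifying observation of 8.0.
members = [6.0, 8.0, 10.0]
ensemble_mean = statistics.mean(members)
ensemble_spread = statistics.pstdev(members)  # population standard deviation
crps_example = ensemble_crps(members, observation=8.0)
print(f"mean={ensemble_mean:.1f} spread={ensemble_spread:.2f} CRPS={crps_example:.3f}")
```

For a single-member "ensemble" the spread term vanishes and the estimator reduces to the absolute error, which is why CRPS and MAE can be compared on the same scale.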

How often to recalculate skill scores for operations

Skill score monitoring should match operational decision cycles. For day-ahead energy trading, weekly skill score updates provide enough frequency to detect model performance changes. Intraday trading benefits from daily skill score tracking, particularly during high-volatility weather periods. Seasonal recalibration accounts for systematic biases that vary with weather patterns, solar angles, and vegetation cycles. Real-time skill score monitoring becomes critical during extreme weather events when model performance may degrade rapidly. Automated alerts when skill scores drop below operational thresholds enable traders to adjust strategies before costly forecast errors impact positions.
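One way to picture this kind of monitoring is a rolling-window MSE skill score that flags windows falling below a threshold. The sketch below is a simplified illustration, not a description of any production alerting system. The function name, window length, threshold, and error values are all assumptions.

```python
def rolling_skill_alerts(forecast_errors, baseline_errors, window, threshold):
    """Return (end_index, skill) for each window whose MSE skill is below threshold.

    Errors are forecast minus observation, in matching order.
    """
    if len(forecast_errors) != len(baseline_errors):
        raise ValueError("error series must have equal length")
    if window <= 0:
        raise ValueError("window must be positive")
    alerts = []
    for end in range(window, len(forecast_errors) + 1):
        recent_forecast = forecast_errors[end - window:end]
        recent_baseline = baseline_errors[end - window:end]
        mse_forecast = sum(e * e for e in recent_forecast) / window
        mse_baseline = sum(e * e for e in recent_baseline) / window
        if mse_baseline == 0:
            continue  # skill is undefined for a perfect baseline window
        skill = 1.0 - mse_forecast / mse_baseline
        if skill < threshold:
            alerts.append((end - 1, skill))
    return alerts


# Illustrative errors in m/s: the forecast degrades from the fourth case onward.
forecast_errors = [0.5, 0.5, 0.5, 2.0, 2.0, 2.0]
baseline_errors = [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]
alerts = rolling_skill_alerts(forecast_errors, baseline_errors, window=3, threshold=0.3)
print(alerts)
```

With a three-case window, the skill falls from about 0.94 to 0.63, then 0.31, and finally 0.0, so only the last window breaches the 0.3 threshold.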

Skill score thresholds for switching forecast providers

Skill score degradation below operational thresholds signals the need for provider evaluation. For wind forecasts, sustained RMSE skill scores below 0.2 at 24–48 hour lead times indicate insufficient accuracy for profitable trading. Solar forecasts require skill scores above 0.3 for day-ahead positioning. Temperature forecasts should maintain anomaly correlation above 0.5 out to 7 days for gas trading applications. However, skill score comparison must account for evaluation methodology, reference baselines, and geographic coverage. Live benchmarking against multiple providers on identical datasets provides the most reliable basis for switching decisions and removes methodological differences that can bias comparisons.
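The point about identical datasets can be sketched as follows: restrict every provider to the timestamps that all series share, then score each against the same baseline and observations. The data layout (dictionaries keyed by timestamp), the provider names, and the values are hypothetical.

```python
def mean_squared_error(forecast, observations, timestamps):
    """MSE over the given timestamps for timestamp-keyed dictionaries."""
    return sum((forecast[t] - observations[t]) ** 2 for t in timestamps) / len(timestamps)


def compare_providers(provider_forecasts, baseline, observations):
    """MSE skill score per provider, evaluated only on shared timestamps."""
    common = set(observations) & set(baseline)
    for forecast in provider_forecasts.values():
        common &= set(forecast)
    if not common:
        raise ValueError("no timestamps are shared by all series")
    timestamps = sorted(common)
    baseline_mse = mean_squared_error(baseline, observations, timestamps)
    if baseline_mse == 0:
        raise ValueError("skill is undefined for a perfect baseline")
    return {
        name: 1.0 - mean_squared_error(forecast, observations, timestamps) / baseline_mse
        for name, forecast in provider_forecasts.items()
    }


# Illustrative values in m/s; provider_b has no forecast for t3.
observations = {"t1": 8.0, "t2": 6.0, "t3": 7.0}
baseline = {"t1": 6.0, "t2": 8.0, "t3": 7.0}
providers = {
    "provider_a": {"t1": 7.0, "t2": 7.0, "t3": 7.0},
    "provider_b": {"t1": 8.0, "t2": 6.0},
}
scores = compare_providers(providers, baseline, observations)
print(scores)
```

Because provider_b lacks t3, both providers are scored on t1 and t2 only, which keeps an easy or hard case from favoring whichever provider happened to cover it.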
