{"id":544,"date":"2026-06-08T05:16:09","date_gmt":"2026-06-08T05:16:09","guid":{"rendered":"https:\/\/jua.ai\/articles\/arxiv-ai-weather-papers-2026\/"},"modified":"2026-06-08T05:16:09","modified_gmt":"2026-06-08T05:16:09","slug":"arxiv-ai-weather-papers-2026","status":"publish","type":"post","link":"https:\/\/jua.ai\/articles\/arxiv-ai-weather-papers-2026\/","title":{"rendered":"arXiv AI Weather Papers 2024\u20132026: Key Breakthroughs"},"content":{"rendered":"<p><em>Written by: Olivier Lam, Physical AI Team, Jua.ai AG<\/em><\/p>\n<h2 id=\"key-takeaways\">What 2024\u20132026 AI Weather Research Means for Traders<\/h2>\n<ul>\n<li>Three themes dominate 2024\u20132026 arXiv AI weather research: large foundation models, physics-constrained architectures, and probabilistic ensembles.<\/li>\n<li>Microsoft Aurora, Aardvark Weather, and NeuralGCM deliver strong benchmarks but lack native high-resolution, high-frequency, or energy-specific outputs.<\/li>\n<li>Jua\u2019s EPT-2 and EPT-1.5 beat Aurora, GraphCast, and ECMWF HRES on energy-critical variables while training faster and supporting native any-\u0394t forecasting.<\/li>\n<li>EPT-2e\u2019s 10-member ensemble outperforms the 50-member ECMWF ENS on RMSE and CRPS at nearly every lead time, with four daily updates and far lower compute cost.<\/li>\n<li><a href=\"https:\/\/meetings-eu1.hubspot.com\/guett\/energy-trading?uuid=d780665f-ff71-439c-addf-c80e49af0627\" target=\"_blank\"><strong>See EPT-2 benchmarked live<\/strong><\/a> against your current forecast provider in under five minutes.<\/li>\n<\/ul>\n<h2>Core AI Foundation Models Reshaping Weather Forecasting<\/h2>\n<p>The 2024\u20132026 period introduced several large-scale foundation models that treat atmospheric prediction as a general sequence-learning problem over geophysical data.<\/p>\n<p><a href=\"https:\/\/terramosaic.org\/news\/top-journal-earth-environment-foundation-models-rolling-updates\" target=\"_blank\" rel=\"noindex nofollow\">Microsoft 
Aurora<\/a>, published in Nature in May 2025, was pretrained on more than one million hours of geophysical data spanning ERA5, CAMS, MERRA-2, and other sources. It outperforms several operational systems across air quality forecasting, wave prediction, tropical cyclone tracking, and 10-day high-resolution weather forecasting at substantially lower computational cost than traditional NWP. Aurora required 32 A100 GPUs over 18 days of training, which highlights the heavy infrastructure needed to reach this performance level.<\/p>\n<p><a href=\"https:\/\/terramosaic.org\/news\/top-journal-earth-environment-foundation-models-rolling-updates\" target=\"_blank\" rel=\"noindex nofollow\">Aardvark Weather<\/a>, published in Nature in March 2025, takes a distinct approach. It trains end-to-end directly on raw observations from satellites, buoys, and radiosondes, bypassing conventional NWP pipelines entirely and delivering global-to-local forecasts up to 10 days ahead.<\/p>\n<p><a href=\"https:\/\/terramosaic.org\/news\/top-journal-earth-environment-foundation-models-rolling-updates\" target=\"_blank\" rel=\"noindex nofollow\">NeuralGCM<\/a>, published in Nature in 2024, combines a differentiable dynamical core with machine-learning parameterizations. It matches or exceeds strong baselines on 1- to 15-day weather prediction and reproduces realistic multi-decadal climate statistics when trained on ERA5. This behavior makes it a hybrid system rather than a purely data-driven model.<\/p>\n<p>All three models mark a shift from task-specific systems to general-purpose architectures. 
None, however, ships a productized ensemble, an operational refresh schedule above four runs per day, or a native 5 km resolution capability, which limits direct use in high-frequency energy trading.<\/p>\n<h2>Physics-Constrained Models That Learn Conservation Laws<\/h2>\n<p>The most consequential development in 2024\u20132026 arXiv AI weather papers is the rise of physics-constrained foundation models that learn conservation laws such as mass, momentum, and energy directly from observational data rather than imposing them as hard symbolic constraints. This learned-constraint approach lets the model discover conservation relationships that hold in real atmospheric data without relying solely on hand-coded physics equations.<\/p>\n<p>Jua&#8217;s Earth Physics Transformer (EPT) family demonstrates this approach at scale. EPT is a general spatiotemporal transformer foundation model. The architecture learns the governing physics of complex systems in a latent representation that is integrated forward in time. The domain becomes a variable; data and fine-tuning define the specific physical system.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/html\/2507.09703v1\" target=\"_blank\" rel=\"noindex nofollow\">EPT-2 (arXiv:2507.09703)<\/a> achieves lower RMSE than Microsoft Aurora across the full 0\u2013240 hour forecast horizon for 10 m wind speed, and it outperforms Aurora on 2 m temperature up to roughly 130 hours. EPT-2 was pretrained on 8 H100 GPUs over 10 days, which is one quarter as many GPUs and a shorter training cycle than Aurora&#8217;s 32 A100 GPUs over 18 days, and it delivers approximately 25% faster inference. EPT-2 produces global forecasts for energy-relevant variables including 10 m wind speed, 100 m wind speed, 2 m temperature, and surface solar radiation (SSRD). 
Aurora has no SSRD output at all.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/html\/2410.15076v1\" target=\"_blank\" rel=\"noindex nofollow\">EPT-1.5 (arXiv:2410.15076)<\/a> outperforms GraphCast, FuXi, Pangu-Weather, and ECMWF IFS HRES on 10 m wind speed over Europe, achieving skill scores of 20.8 at 6-hour lead time, 17.0 at 1 day, and 13.9 at 5 days relative to HRES. For 100 m wind speed, which is closest to commercial turbine hub heights, EPT-1.5 achieves a skill score of 25.25 at 6-hour lead time. EPT-1.5 also delivers a skill score of 20.44 for surface solar radiation at 6-hour lead time, a capability absent from GraphCast, FuXi, and Pangu-Weather.<\/p>\n<p>EPT-2 does not roll forward in fixed 6-hour increments. It is trained to predict at arbitrary lead times, which enables native any-\u0394t forecasting. Aurora and most peers use a fixed 6-hour grid and roll forward in steps, which compounds error at longer lead times.<\/p>\n<h2>Probabilistic and Ensemble Systems for Trading Risk<\/h2>\n<p>Probabilistic forecasting, which quantifies uncertainty across possible atmospheric states rather than issuing a single deterministic trajectory, forms the second major methodological frontier in 2024\u20132026 arXiv AI weather papers.<\/p>\n<p>The <a href=\"https:\/\/arxiv.org\/html\/2605.06944v2\" target=\"_blank\" rel=\"noindex nofollow\">AIMIP Phase 1 intercomparison (arXiv:2605.06944v2, May 2026)<\/a> establishes a common AMIP-style evaluation protocol. It requires AI weather and climate models to simulate the atmosphere forced solely by historical monthly sea-surface temperatures from 1979\u20132024, with training restricted to ERA5 data from 1979\u20132014 and a 2015\u20132024 out-of-sample test period. Eight AIWCMs submitted results, including ACE2.1-ERA5 (Allen Institute for AI), NeuralGCM (Google Research), cBottle-1.3 (NVIDIA), and ArchesWeatherGen (INRIA). 
AIMIP Phase 1 finds that participating models simulate historical climate means, variability, and forcing responses at a level comparable to the physics-based GFDL-CM4 model, though several underestimate historical warming trends and show divergent behavior in the out-of-sample period and in +2 K and +4 K SST perturbation experiments.<\/p>\n<p><a href=\"https:\/\/arxiv.org\/html\/2507.09703v1\" target=\"_blank\" rel=\"noindex nofollow\">EPT-2e<\/a>, Jua&#8217;s ensemble variant, uses 10 members and achieves lower RMSE and CRPS than the ECMWF ENS mean, which uses 50 members, at virtually every lead time in the 0\u2013240 hour range, verified against global weather station observations. CRPS (Continuous Ranked Probability Score) measures the sharpness and calibration of a probabilistic forecast, and a lower value indicates higher skill. EPT-2e updates four times per day. No AI peer has shipped a comparable productized ensemble.<\/p>\n<p>ECMWF AIFS ENS became operational on July 1, 2025, representing the incumbent&#8217;s move into AI-based ensemble forecasting and signaling how rapidly the probabilistic AI weather space is maturing. Yet even as ensemble methods improve, one critical weakness remains consistent across all AI weather models.<\/p>\n<h2>Extreme-Event Performance and Tail-Risk Management<\/h2>\n<p>A consistent finding across 2024\u20132026 evaluation studies is that AI weather models, including the leading foundation models, underperform traditional NWP on record-breaking or out-of-distribution extreme events. 
The <a href=\"https:\/\/arxiv.org\/html\/2605.06944v2\" target=\"_blank\" rel=\"noindex nofollow\">AIMIP Phase 1 results<\/a> show that several AI models exhibit divergent behavior in the out-of-sample 2015\u20132024 period, particularly in response to novel SST forcing patterns.<\/p>\n<p>A <a href=\"https:\/\/rmets.onlinelibrary.wiley.com\/doi\/10.1002\/wea.70043\" target=\"_blank\" rel=\"noindex nofollow\">2026 Hong Kong Observatory evaluation<\/a> found that while Pangu and AIFS matched or exceeded ECMWF IFS skill for tropical cyclone genesis at 5\u20138 day lead times, higher-resolution AI models sometimes achieved strong anomaly correlation coefficients at the expense of forecast realism. For energy traders, the practical implication is clear. AI models should run alongside, not instead of, operational NWP for tail-risk scenarios. Jua for Energy does not replace ECMWF; it replaces the plumbing around it.<\/p>\n<h2>Trading Impact of AI Weather Accuracy<\/h2>\n<p>Forecast accuracy maps directly into P&amp;L. A 1 GW wind portfolio that gains four percentage points of forecast accuracy saves approximately \u20ac1.5 M per year under typical hedging and imbalance-cost structures. A 1 GW solar portfolio at the same accuracy gain saves approximately \u20ac3 M per year. Multi-GW portfolios scale these economics linearly.<\/p>\n<p>The four energy-critical variables, 10 m wind speed, 100 m wind speed, 2 m temperature, and surface solar radiation, are precisely where <a href=\"https:\/\/arxiv.org\/html\/2507.09703v1\" target=\"_blank\" rel=\"noindex nofollow\">EPT-2&#8217;s accuracy advantage<\/a> over traditional baselines translates into P&amp;L impact. EPT-2e beats the 50-member ECMWF ENS mean on both RMSE and CRPS at virtually every lead time. Both results are verified against more than 10,000 real ground stations on open-source StationBench, with no post-processing or station fine-tuning.<\/p>\n<table>\n<thead>\n<tr>\n<th>Model<\/th>\n<th>Deterministic Accuracy vs. 
ECMWF HRES<\/th>\n<th>Ensemble Skill<\/th>\n<th>Energy-Trading Relevance<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><a href=\"https:\/\/arxiv.org\/html\/2507.09703v1\" target=\"_blank\" rel=\"noindex nofollow\">EPT-2 (Jua)<\/a><\/td>\n<td>Beats HRES on 10 m wind, 100 m wind, 2 m temp, SSRD across 0\u2013240 h<\/td>\n<td>EPT-2e (10 members) beats 50-member ECMWF ENS mean on RMSE and CRPS at virtually every lead time<\/td>\n<td>Native 5 km resolution, up to 24 runs\/day (EPT-2 RR), SSRD output, 100 m wind at hub height<\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/terramosaic.org\/news\/top-journal-earth-environment-foundation-models-rolling-updates\" target=\"_blank\" rel=\"noindex nofollow\">Aurora (Microsoft)<\/a><\/td>\n<td>Loses to EPT-2 on 10 m wind, 100 m wind across full 0\u2013240 h range; loses on 2 m temp up to ~130 h<\/td>\n<td>No productized ensemble<\/td>\n<td>No SSRD output, fixed 6-hour roll-forward compounds error, no operational refresh schedule<\/td>\n<\/tr>\n<tr>\n<td>GraphCast (Google DeepMind)<\/td>\n<td><a href=\"https:\/\/arxiv.org\/html\/2410.15076v1\" target=\"_blank\" rel=\"noindex nofollow\">Loses to EPT-1.5 on European 10 m wind at all evaluated lead times<\/a><\/td>\n<td>No productized ensemble<\/td>\n<td>No SSRD output, research output without productized refresh<\/td>\n<\/tr>\n<tr>\n<td>ECMWF HRES<\/td>\n<td>40-year benchmark; beaten by EPT-2 on all four energy-critical variables across 0\u2013240 h<\/td>\n<td>N\/A (deterministic)<\/td>\n<td>Universal reference, ~8,400 kWh per simulation, 2\u20134 runs\/day, 9 km resolution<\/td>\n<\/tr>\n<tr>\n<td>ECMWF ENS<\/td>\n<td>N\/A (probabilistic)<\/td>\n<td>50-member gold standard; beaten by EPT-2e (10 members) on RMSE and CRPS at virtually every lead time<\/td>\n<td>Probabilistic benchmark, ~8,400 kWh per simulation, 2\u20134 runs\/day<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>On the Jua platform, all five models above run on a single benchmarking surface. 
A meteorologist or quant developer can select any region, any variable, and any time window and receive a head-to-head accuracy comparison in seconds. There are no separate subscriptions, no separate pipelines, one schema, and one API.<\/p>\n<p><a href=\"https:\/\/meetings-eu1.hubspot.com\/guett\/energy-trading?uuid=d780665f-ff71-439c-addf-c80e49af0627\" target=\"_blank\"><strong>Run live benchmarks on 25+ models<\/strong><\/a> and see results on your own region and variables in seconds.<\/p>\n<h2>Where to Track New arXiv AI Weather Papers<\/h2>\n<p>The most active arXiv categories for AI weather forecasting papers in 2024\u20132026 are:<\/p>\n<ul>\n<li><strong>cs.LG<\/strong> (Machine Learning), which covers foundation model architectures, training methodology, and benchmark evaluations<\/li>\n<li><strong>physics.ao-ph<\/strong> (Atmospheric and Oceanic Physics), which covers physics-constrained models, hybrid NWP\u2013AI systems, and conservation-law approaches<\/li>\n<li><strong>stat.ML<\/strong> (Statistics and Machine Learning), which covers probabilistic forecasting, ensemble calibration, and CRPS-based evaluations<\/li>\n<li><strong>cs.AI<\/strong> (Artificial Intelligence), which covers agent-based systems and multimodal weather reasoning<\/li>\n<\/ul>\n<p>Supplementary repositories include the ECMWF Open Data portal for operational benchmark data and the DKRZ S3 storage hosting AIMIP Phase 1 model outputs for intercomparison. Jua&#8217;s technical reports are indexed on arXiv under identifiers <a href=\"https:\/\/arxiv.org\/html\/2507.09703v1\" target=\"_blank\" rel=\"noindex nofollow\">2507.09703<\/a> (EPT-2) and <a href=\"https:\/\/arxiv.org\/html\/2410.15076v1\" target=\"_blank\" rel=\"noindex nofollow\">2410.15076<\/a> (EPT-1.5).<\/p>\n<h2>Run Live Benchmarks on the Jua Platform<\/h2>\n<p>The research landscape above already shapes daily workflows for energy traders, meteorologists, and quant developers. 
EPT-2 and EPT-2e are in production today, used by Axpo, TotalEnergies, Statkraft, EnBW, EDF, Hydro-Qu\u00e9bec, and quant funds across five continents.<\/p>\n<p>The Jua platform exposes more than 25 models, including 10 proprietary AI models from the EPT family plus 15 third-party NWP and AI models such as ECMWF HRES, ECMWF ENS, Aurora, and GraphCast, through a single REST API and Python SDK (<code>pip install jua<\/code>). Hindcast data is available for backtesting. Athena, Jua&#8217;s AI agent instrumented with the Jua for Energy tool surface, resolves a typical benchmark query in about 90 seconds and a full backtest in about 5 minutes.<\/p>\n<p>The numbers speak clearly. The live benchmark acts as the deal trigger. You pick your region, pick your variable, and run the comparison yourself.<\/p>\n<p><a href=\"https:\/\/meetings-eu1.hubspot.com\/guett\/energy-trading?uuid=d780665f-ff71-439c-addf-c80e49af0627\" target=\"_blank\"><strong>Compare EPT-2 to your current provider<\/strong><\/a> on your highest-stakes region and variable; the benchmark runs in under five minutes.<\/p>\n<h2>FAQ<\/h2>\n<h3>What distinguishes physics-constrained AI weather models from standard deep learning approaches?<\/h3>\n<p>Standard deep learning models applied naively to atmospheric data can produce outputs that violate the conservation laws governing real physical systems, such as mass, momentum, and energy. Physics-constrained models address this by embedding those constraints into the learning process itself. Jua&#8217;s EPT family is a spatiotemporal transformer foundation model that learns the governing physics of complex systems directly from observational data, in a latent representation that is integrated forward in time.<\/p>\n<p>The outputs are physically constrained by construction. This distinction separates EPT from a generic transformer applied to weather data, because the architecture avoids physically nonsensical outputs that an unconstrained model might generate. 
For energy trading, this means EPT-2 forecasts are safe to trade on, since they respect the same physical limits that govern real atmospheric behavior.<\/p>\n<h3>How does EPT-2e compare to ECMWF ENS for probabilistic energy forecasting?<\/h3>\n<p>EPT-2e, Jua&#8217;s ensemble variant, uses 10 members and achieves lower RMSE and CRPS than the ECMWF ENS mean, which uses 50 members, for 2 m temperature and 10 m wind speed across nearly the entire 0\u2013240 hour forecast range. CRPS, defined earlier in the ensemble section, measures both the sharpness and calibration of a probabilistic forecast, and a lower CRPS indicates a more skillful probabilistic prediction.<\/p>\n<p>EPT-2e updates four times per day. For energy traders who position around probabilistic spread, such as wind ramp uncertainty, solar generation tails, or temperature-driven demand spikes, EPT-2e provides ensemble skill that exceeds the operational gold standard with fewer members and at a fraction of the computational cost. A single EPT-2 inference runs at approximately 0.25 kWh and $0.20\u2013$15 on a single GPU, while the equivalent ECMWF simulation consumes approximately 8,400 kWh and costs \u20ac1,000\u2013\u20ac20,000 on HPC.<\/p>\n<h3>What are the practical limitations of AI weather models for extreme-event forecasting?<\/h3>\n<p>As discussed in the extreme-event section above, AI models struggle with out-of-distribution scenarios, meaning conditions that fall outside their training data. For energy traders, this reality means AI models should run alongside operational NWP, not replace it, for tail-risk scenarios such as cold snaps, heat domes, or major wind ramp events.<\/p>\n<p>Jua for Energy is designed for this workflow. The Jua platform runs ECMWF HRES, ECMWF ENS, EPT-2, EPT-2e, Aurora, GraphCast, and 19 other models on a single surface, with divergence alerts that fire the moment two models disagree on a key variable. 
The trader sees the disagreement as it opens, rather than after the market has already repriced.<\/p>\n<h3>How do I evaluate arXiv AI weather model claims before integrating a model into a trading workflow?<\/h3>\n<p>The most reliable evaluation methodology uses verification against real ground-station observations rather than against model-generated reanalysis fields. Jua benchmarks EPT-2 and EPT-2e against more than 10,000 real ground stations using open-source StationBench, with no post-processing or station fine-tuning, and publishes the results in technical reports on arXiv (2507.09703 for EPT-2; 2410.15076 for EPT-1.5).<\/p>\n<p>When you evaluate any arXiv AI weather paper, check whether the benchmark is against ERA5 reanalysis, which the model may have trained on, or against independent station observations. Check whether the evaluation covers energy-critical variables, such as 10 m wind, 100 m wind, 2 m temperature, and surface solar radiation, at the lead times that match your trade horizon. Check whether an ensemble variant exists and whether it reports CRPS alongside RMSE. On the Jua platform, any meteorologist or quant developer can run a live head-to-head benchmark on their own region and variable in under 30 seconds, without relying on vendor-provided graphics.<\/p>\n<h3>What is the difference between Jua as a company and Jua for Energy as a product?<\/h3>\n<p>Jua is a foundation model and agent company. EPT is a general physics foundation model that is domain-agnostic by architecture and applies to any continuous, conservation-law-constrained physical system. Athena is an AI agent that is domain-agnostic by design and instrumented with a specific tool surface for each deployment.<\/p>\n<p>Jua for Energy is the first applied product built on top of EPT and Athena, configured for the energy trading workflow. The relationship mirrors the one between Anthropic and Claude Code. 
Anthropic is a foundation-model company, and Claude Code is one product they ship. Jua is a foundation model and agent company, and Jua for Energy is one product Jua ships. The atmosphere is the first physical system EPT has been fine-tuned for, and energy trading is the first market Athena has been instrumented for. Both will expand to other physical-economy domains, such as plasma fusion, aerospace, materials, and fluids, as the same horizontal platform is applied to them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Explore top arXiv AI weather papers on foundation models, physics constraints &amp; ensembles. See how Jua&#8217;s EPT-2 outperforms Aurora, GraphCast &amp; ECMWF.<\/p>\n","protected":false},"author":103,"featured_media":543,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[10],"tags":[],"class_list":["post-544","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research"],"_links":{"self":[{"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/posts\/544","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/users\/103"}],"replies":[{"embeddable":true,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/comments?post=544"}],"version-history":[{"count":0,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/posts\/544\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/media\/543"}],"wp:attachment":[{"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/media?parent=544"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/categories?post=544"},{"taxonomy":"post_tag","embedda
ble":true,"href":"https:\/\/jua.ai\/articles\/wp-json\/wp\/v2\/tags?post=544"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}