How we built the winning real-time autonomous agent for power grid management in the L2RPN Challenge 2023.

La Javaness R&D
Mar 25, 2024

La Javaness was named the winner of this year’s Paris Region AI Challenge for Energy Transition, organized by the Paris Region and RTE, the French electricity transmission system operator. The challenge, the 6th iteration of the L2RPN challenge, focused on developing an autonomous agent for power grid management that guarantees a reliable and resilient electricity supply.

Among other challenges, grid control has to contend with ecological imperatives: a projected surge in renewable energy production in line with France’s commitments to the ecological objectives highlighted by ADEME¹, as well as constraints linked to electric mobility and new grid infrastructure projects reshaping the landscape². The latest IPCC report, released in 2023, highlights the urgency of societal transformations to curb global warming³. For the French power grid, RTE projects a 55% reduction in CO2 emissions by 2030, with the objective of carbon neutrality by 2050 in line with European objectives⁴.

Without any prior knowledge of electricity transmission systems, we tackled this challenge as a team of two and won the competition with an agent that manages 67% of the scenarios while making decisions within a few seconds: its maximum decision time is 10 to 45 times shorter than that of its direct competitors on the final leaderboard.

The code is available here and as part of the l2rpn baselines package.

Overview of the challenge

We tackled two primary objectives during this challenge:

🧑‍🔬 Leveraging state-of-the-art methods of optimization and reinforcement learning for power grid management through a deep understanding of the problem and a large exploration of potential technical solutions.

🧑‍🏭 Demonstrating the potential of a solution for industrialization in order to assist operators in their daily work.


A power grid is a complex network of interconnected components, as described in the table below. These entities work together to transport electricity smoothly from generation points to end-users.

Graph-oriented representation of the environment, plotted with Grid2Op. Each color represents a sub-area of the environment.

The environment is modeled in the Grid2Op simulator developed by RTE⁶. It has 540 elements, including 118 substations, 186 power lines, 99 loads and 66 generators of 5 different types (renewable or not). This makes for a large and complex simulated environment: far larger than in the first version of the challenge, but still very small compared to the actual French national power grid managed by RTE’s operators.

One of the main difficulties of this challenge lies in the size of the action space. At each time step, one can perform substation reconfigurations and/or line reconnections in each sub-area, making the grid topology highly modular, with roughly 10¹¹ possible reconfigurations at each time step. In addition, unexpected attacks (line disconnections) occur stochastically and can hit several zones at the same time; such simultaneous attacks make cascading failures of power lines more likely⁶.

Grid2Op is available on RTE’s GitHub.

Goals and scoring system

Aside from the environment itself, the objectives and scoring system made the challenge highly complex. The score had three components:

  • Operational Score (Op. Score). Measures the cost of operations on the grid combined with potential blackout costs (substantially higher).
  • Renewable Energy Score (Nres. Score). Measures the use of available renewable energy generators. This score is highly important in the context of the energy transition for a sustainable energy mix in 2035.
  • Assistant Score (Assist. Score). Assesses the ability of the developed agent to automatically raise a relevant alert in case of potential danger on the grid.

The calculated value of each score is between -100 and 100, and the total score is a weighted average of these scores.
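To make the aggregation concrete, here is a minimal sketch of a weighted average over the three scores. The official weights are not given in this article, so the default weights below are hypothetical placeholders, and the function name `total_score` is ours.

```python
from typing import Tuple


def total_score(op_score: float, nres_score: float, assist_score: float,
                weights: Tuple[float, float, float] = (0.5, 0.25, 0.25)) -> float:
    """Weighted average of the three challenge scores, each in [-100, 100].

    NOTE: the default weights are hypothetical; the real ones are set by the
    challenge organizers.
    """
    scores = (op_score, nres_score, assist_score)
    for s in scores:
        if not -100.0 <= s <= 100.0:
            raise ValueError(f"score out of range: {s}")
    if any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    # Normalizing by the weight sum keeps the total in [-100, 100].
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


print(total_score(80.0, 50.0, 20.0))  # 0.5*80 + 0.25*50 + 0.25*20 = 57.5
```

Because the weights are normalized and non-negative, the total inherits the [-100, 100] range of the individual scores.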

Moreover, the competition was divided into two tracks, each with its own leaderboard:

  1. The Main-Assistant track, which used the Grid2Op simulator and aimed at optimizing all of the 3 scores.
  2. The Sim2Real track, which used the same environment with a more realistic simulation, making grid management more complex. Its objective was to assess the robustness and adaptability of the agent.

In addition to the final score on both competition tracks, several other important criteria were considered when choosing the winning agent:

  • ⏩ The ability to make decisions quickly
  • 🎯 The frugality and efficiency of the solution, as well as its potential for future R&D
  • 🏭 The technical perspective for industrializing the solution

Our top-down modular approach


Our primary focus in developing a working agent was to leverage proven methods from past iterations of the challenge, ensuring that we could obtain positive scores within the time constraints of the competition. Our solution is therefore a combination of several modules directly inspired by existing methods, especially the curriculum agent⁵ and the convex optimization method for continuous control. The agent orchestrates these modules using simple business rules, which keeps the method flexible.

The global architecture of our final submitted agent is described in the following figure:

Agent architecture

Segmenting the development of the agent into separate submodules helped us achieve a highly competitive solution. The modules are described below:

  • Dynamic topology optimization. By leveraging greedy search-like algorithms and iterative simulations from a reduced action space, the Agent evaluates multiple topological configurations to identify the most efficient grid structure in real-time. This dynamic approach ensures that the grid can adapt to changing demand and supply conditions without compromising safety.
  • Continuous control optimization. By modeling the problem of Generator and Storage control as a convex minimization problem under constraints, the agent is able to balance generator curtailment, storage utilization and redispatching. This ensures that the grid operates at peak efficiency, minimizing waste and associated costs.
  • Alert Module. The Agent is designed with safety as a primary concern. It continuously monitors the grid’s state and proactively takes actions to prevent potential failures. To adapt the agent to the assistant track, we used a simple Alert module that simulates the effect of line attacks and raises an alert when a blackout is expected within an hour.
  • Power lines reconnection. This function performs a local search within each area of the power grid. For each area, it evaluates potential reconnections and selects the one that most improves the grid’s state based on a greedy approach. In other words, it chooses the best immediate action for each area without considering global optimizations.
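The greedy logic shared by the topology and reconnection modules, and the one-hour look-ahead of the alert module, can be sketched as follows. This is a minimal illustration, not the authors’ code: `simulate` and `simulate_attack` are hypothetical helpers standing in for Grid2Op’s forecasting (in Grid2Op, `obs.simulate(action)` returns a simulated observation, reward, done flag and info, from which the max of `rho` would be extracted), and the action identifiers in the usage example are made up.

```python
from typing import Callable, Hashable, List, Optional, Sequence, Tuple

# Hypothetical interface (assumption): simulate(action) returns
# (predicted max rho at the next step, True if a blackout is predicted).
# `None` stands for the do-nothing action.
Simulate = Callable[[Optional[Hashable]], Tuple[float, bool]]

STEP_MINUTES = 5
ALERT_HORIZON_STEPS = 60 // STEP_MINUTES  # one hour of 5-minute steps = 12


def greedy_best_action(candidates: Sequence[Hashable],
                       simulate: Simulate) -> Tuple[Optional[Hashable], float]:
    """Simulate every candidate and keep the one with the lowest max rho."""
    best_action: Optional[Hashable] = None
    best_rho, blackout = simulate(None)
    if blackout:
        best_rho = float("inf")  # doing nothing is not a viable fallback
    for action in candidates:
        rho, blackout = simulate(action)
        if blackout:  # discard actions predicted to end in a blackout
            continue
        if rho < best_rho:
            best_action, best_rho = action, rho
    return best_action, best_rho


def lines_to_alert(attackable_lines: Sequence[int],
                   simulate_attack: Callable[[int, int], bool]) -> List[int]:
    """Lines whose simulated loss leads to a blackout within one hour.

    simulate_attack(line_id, n_steps) is a hypothetical helper returning True
    if disconnecting line_id is predicted to cause a game over within n_steps.
    """
    return [line for line in attackable_lines
            if simulate_attack(line, ALERT_HORIZON_STEPS)]


# Usage with made-up simulation outcomes:
outcomes = {None: (1.05, False), "sub14_cfg3": (0.92, False), "sub68_cfg1": (0.88, True)}
print(greedy_best_action(["sub14_cfg3", "sub68_cfg1"], lambda a: outcomes[a]))
# ('sub14_cfg3', 0.92): the lower-rho candidate is rejected because it leads to a blackout
```

The same `greedy_best_action` routine could serve the reconnection module by calling it once per area with that area’s reconnection candidates, which matches the “best immediate action per area” behavior described above.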

Main business rules: High-level decision making

We mainly used the power line loading ratios (ρ), and especially the maximum loading over all lines at a given time step, to measure the state of the grid. We set two threshold values defining a safe state, a danger state, and an intermediate state. This constitutes the highest level of decision making, which remains very simple and hence offers flexibility. The simulation and forecasting functions implemented in the Grid2Op package were also very helpful to assess the quality of a candidate action before actually applying it to the grid.
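The two-threshold rule can be written in a few lines. In Grid2Op the per-line loading ratios are exposed as `obs.rho`; the threshold values below are hypothetical, since the tuned values are not reported here, and which modules are called in each state is left to the agent’s business rules.

```python
from enum import Enum
from typing import Sequence


class GridState(Enum):
    SAFE = "safe"
    INTERMEDIATE = "intermediate"
    DANGER = "danger"


# Hypothetical values (assumption); the tuned thresholds are not reported.
RHO_SAFE = 0.85
RHO_DANGER = 0.95


def grid_state(rho: Sequence[float], rho_safe: float = RHO_SAFE,
               rho_danger: float = RHO_DANGER) -> GridState:
    """Classify the grid from the loading ratio rho of every power line."""
    if not rho_safe < rho_danger:
        raise ValueError("rho_safe must be strictly lower than rho_danger")
    max_rho = max(rho)  # the most loaded line drives the decision
    if max_rho < rho_safe:
        return GridState.SAFE
    if max_rho >= rho_danger:
        return GridState.DANGER
    return GridState.INTERMEDIATE


print(grid_state([0.4, 0.9, 0.7]))  # GridState.INTERMEDIATE
```

Keeping this layer to a single scalar comparison is what makes the high-level policy easy to inspect and retune.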

Topological action space reduction and segmentation

One of the major challenges of this competition was to build reduced topological action spaces from the ≃ 70,000 available unitary substation reconfigurations, which represent 10¹¹ possible reconfigurations of the system at a given time step.

Distribution of possible reconfigurations by substation for each zone. The number of reconfigurations is on a logarithmic scale.

The action space was reduced through a costly but efficient exhaustive greedy search over the complete unitary action space across different scenarios. As this is inspired by the curriculum agent method, we use the same terminology and call the greedy search agent a Teacher⁶. We use three types of Teachers:

  • General Teacher: finds the best action when an overflow occurs on the grid.
  • Attacking Teacher: finds the best action in case of an attack on a power line.
  • N − 1 Teacher: finds the best actions to mitigate the effect of a potential attack on a power line.

Each teacher was run through 8 days of continuous simulation in order to build a large dataset of useful actions. The resulting dataset (containing both observations and chosen actions) is then processed to create different action subsets. The reduced action spaces are saved as arrays in static files, which are read and stored as attributes of the agent when it is instantiated.
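One simple way to turn teacher logs into per-zone action subsets is to keep the most frequently chosen actions in each zone. The sketch below assumes a hypothetical log format of `(zone_id, action_id)` pairs and a frequency-based top-k selection; the article does not specify the exact selection criterion, and JSON stands in for the static array files.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path
from typing import DefaultDict, Dict, Iterable, List, Tuple


def reduce_action_space(records: Iterable[Tuple[int, int]],
                        top_k: int) -> Dict[int, List[int]]:
    """Keep, for each zone, the top_k actions most often chosen by the teachers.

    records: hypothetical (zone_id, action_id) pairs, one per teacher decision.
    """
    counts: DefaultDict[int, Counter] = defaultdict(Counter)
    for zone, action_id in records:
        counts[zone][action_id] += 1
    return {zone: [a for a, _ in c.most_common(top_k)]
            for zone, c in sorted(counts.items())}


def save_action_spaces(spaces: Dict[int, List[int]], path: str) -> None:
    # JSON keys must be strings, so zone ids are stringified on disk.
    Path(path).write_text(json.dumps({str(z): ids for z, ids in spaces.items()}))


def load_action_spaces(path: str) -> Dict[int, List[int]]:
    raw = json.loads(Path(path).read_text())
    return {int(z): ids for z, ids in raw.items()}


records = [(1, 10), (1, 10), (1, 11), (1, 12), (1, 12), (1, 12), (3, 7)]
print(reduce_action_space(records, top_k=2))  # {1: [12, 10], 3: [7]}
```

Logs from the three teachers can simply be concatenated before counting, or counted separately to keep one subset per teacher type.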

Distribution of actions in the reduced action space by zone (left) and by substation (right; colors correspond to sub-zones)

We observe that the relatively balanced distribution of best actions lends itself to a segmentation of the action space by zone. In addition, the distribution of best actions by substation in the reduced space is not fully correlated with the distribution in the complete action space: some substations with many connections are less important in the reduced action space than other substations of interest. This is particularly the case for substation 14 in zone 1 and substation 68 in zone 3. These results argue for developing a multi-agent approach in which each zone is operated by a dedicated agent, somewhat imitating how a real power grid is managed.

Continuous control hyper-parameter tuning

Reduced topological action spaces provide a convenient and efficient way to handle most grid instabilities with costless actions. However, avoiding blackouts generally requires continuous control actions, namely redispatching, curtailment and storage control. Adapting grid injections to match the power demand of loads and to mitigate potential overflows can be formulated as a convex optimization problem under constraints:

Here E refers to the quantity of energy (respectively for curtailment, redispatch and storage), the p-values are tunable parameters of the cost function, and the last term of the cost function measures the load variation on power lines. The implementation is based on the Python solver CVXPY⁷ and builds directly on B. Donnot’s implementation in the l2rpn baselines package.

A crucial step in developing our agent was tuning the hyper-parameters of the optimization modules. Tuning all of them is computationally very expensive, so most parameters were fixed after a few small manual adjustments. The main parameters to tune are the state thresholds ρ_danger and ρ_safe, as well as the penalty values of the optimization problem. The penalties are especially important since they weight each type of action and thus set the balance between redispatching, curtailment and storage. In particular, we want to limit the use of curtailment to maximize the use of renewable energy.
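Why the penalty values set the balance between action types can be seen on a deliberately simplified toy problem: split a required adjustment of D MW between curtailment, redispatch and storage by minimizing Σᵢ pᵢ xᵢ² subject to Σᵢ xᵢ = D. This is not the authors’ CVXPY formulation (it ignores bounds, signs and the line-flow term); it only illustrates the trade-off. Setting the gradient of the Lagrangian to zero gives 2 pᵢ xᵢ = λ, so each share is proportional to 1/pᵢ, and the constraint gives xᵢ = D (1/pᵢ) / Σⱼ (1/pⱼ).

```python
from typing import Dict


def split_adjustment(demand_mw: float, penalties: Dict[str, float]) -> Dict[str, float]:
    """Closed-form minimizer of sum_i p_i * x_i**2 subject to sum_i x_i == demand_mw.

    Toy model (assumption): no bounds, no line-flow term.
    """
    if not penalties or any(p <= 0 for p in penalties.values()):
        raise ValueError("penalties must be non-empty and strictly positive")
    inv_total = sum(1.0 / p for p in penalties.values())
    # Each share is proportional to 1 / p_i (from the KKT conditions).
    return {name: demand_mw * (1.0 / p) / inv_total for name, p in penalties.items()}


def cost(x: Dict[str, float], penalties: Dict[str, float]) -> float:
    return sum(penalties[k] * x[k] ** 2 for k in x)


print(split_adjustment(30.0, {"curtailment": 2.0, "redispatch": 1.0, "storage": 1.0}))
# {'curtailment': 6.0, 'redispatch': 12.0, 'storage': 12.0}
print(split_adjustment(30.0, {"curtailment": 4.0, "redispatch": 1.0, "storage": 1.0}))
# doubling the curtailment penalty cuts curtailment to about 3.33 MW
```

In the real problem the same mechanism applies: raising the curtailment penalty shifts effort towards redispatching and storage, which is how tuning favors renewable energy.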

The final decision algorithm of our agent is quite straightforward: it calls specific modules according to simple high-level decision rules.

Exploration of deep learning based methods

While the agent we developed achieves acceptable results and provides a working solution for the particular setup of the challenge, we wanted to explore neural network-based solutions through Deep Reinforcement Learning. This could help circumvent potential limitations of a rule-based model that relies on time-consuming greedy search and optimization modules.

Greedy search, while reasonably usable in a reduced action space, has several major limitations, particularly when considering scaling the methods on a larger grid. Firstly, it is highly time-consuming since it requires simulating the effect of every action within the reduced action space. Secondly, it chooses actions based on very short term considerations: each action is tested to simulate the state of the grid at the next time step, i.e. 5 minutes later, while an algorithmic assistant to the operator would need to make decisions based on longer-term considerations. For all these reasons, innovative deep reinforcement learning methods should be considered in order to improve current topological action decision making.

Using our “Expert agent”, we propose a training pipeline to efficiently replace the greedy search modules over the different action spaces with neural network-based policies. This pipeline, described in the figure below, consists of supervised learning through imitation learning algorithms⁹ followed by reinforcement learning with the PPO algorithm¹⁰ ¹¹. For the alert module, preliminary work has already shown the method to be effective, improving the agents’ ability to predict upcoming blackouts on the system.

Proposed imitation learning and reinforcement learning framework, inspired by curriculum agent framework

Results and agent’s behavior

Our agent achieves very good results with stable performance across datasets. In particular, its final score on the Sim2Real track (more realistic, hence harder) is relatively close to its score on the Main-Assistant track, which showcases the robustness of our agent. Compared with the other competitors, there is a large gap between the top 3 and the rest, which can probably be explained by differences in approach: the top 3 on the leaderboard have final scores ranging from 60.83 (our agent) to 64.96, while the participant in 4th position scored 10.45.

One of the main assets of our method is its computational efficiency. Ranked third on the final leaderboard, our agent achieves scores roughly similar to those of the top two challengers while being faster overall to simulate all the scenarios and, above all, significantly faster in decision making.

La Javaness’s agent is significantly faster than its two competitors in maximum decision time, which is probably the most important time measure for decision making on power grids. In particular, our agent is almost 10 times faster than Competitor #2’s agent and 45 times faster than Competitor #1’s agent. Since the time step between two observations, and thus between two decisions, is 5 minutes (300 seconds), taking more than 15 seconds to decide can lead to obsolete actions, as the state of a real power grid evolves continuously, especially in dangerous situations. Keeping the maximum decision time low ensures that the action is still relevant when it is actually performed, and that the gap between the input observation and the state of the grid at the moment of the action is small enough to preserve the quality of the decision.
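Measuring the maximum (rather than mean) decision time only requires timing each call around the agent’s decision. In this sketch `act` is a hypothetical single-argument callable; a real Grid2Op agent’s `act` method takes extra arguments and would need to be wrapped, e.g. in a lambda. The 300-second step comes from the text above.

```python
import time
from typing import Any, Callable, Iterable, List, Tuple

STEP_SECONDS = 300  # 5 minutes between two observations, hence two decisions


def time_decisions(act: Callable[[Any], Any],
                   observations: Iterable[Any]) -> Tuple[float, float, List[Any]]:
    """Return (max decision time, mean decision time, actions) in seconds."""
    durations: List[float] = []
    actions: List[Any] = []
    for obs in observations:
        start = time.perf_counter()
        actions.append(act(obs))
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("no observations to time")
    return max(durations), sum(durations) / len(durations), actions


def step_budget_used(decision_seconds: float) -> float:
    """Fraction of the 5-minute step consumed by a decision."""
    return decision_seconds / STEP_SECONDS


print(step_budget_used(15.0))  # 0.05: 15 seconds is 5% of a 300-second step
```

Reporting the maximum matters because a single slow decision in a dangerous situation is exactly the case where the observation goes stale.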

Aside from computational efficiency and average scores, a statistical analysis of our agent’s behavior shows large improvements over the Do Nothing baseline across all scenarios, with cost-efficient decisions:

❌ Most of the time, doing nothing on the grid is preferable (here, 9 out of 10 time steps).

🌍 Topology actions represent 98% of all actions: these actions are costless and help avoid curtailment of renewable energy.

1️⃣ Most of the dangerous situations can be solved in 1 or 2 consecutive actions.
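Statistics like these can be computed from a per-step log of what the agent did. The log format below (one label per time step, `"none"` for doing nothing) is a hypothetical assumption, and runs of consecutive non-idle steps serve as a simple proxy for the number of actions needed to resolve a dangerous situation.

```python
from collections import Counter
from itertools import groupby
from typing import Dict, List, Union

Stats = Dict[str, Union[float, Dict[int, int]]]


def behavior_stats(log: List[str]) -> Stats:
    """log: hypothetical per-step labels such as "none", "topology", "redispatch"."""
    if not log:
        raise ValueError("empty log")
    acted = [kind for kind in log if kind != "none"]
    n_topology = sum(1 for kind in acted if kind == "topology")
    # Lengths of consecutive runs of steps where the agent acted.
    run_lengths = [len(list(group))
                   for is_action, group in groupby(log, key=lambda k: k != "none")
                   if is_action]
    return {
        "do_nothing_ratio": (len(log) - len(acted)) / len(log),
        "topology_share": n_topology / len(acted) if acted else 0.0,
        "run_length_counts": dict(Counter(run_lengths)),
    }


log = ["none"] * 6 + ["topology", "none", "topology", "redispatch"]
print(behavior_stats(log))
# do_nothing_ratio 0.7, topology_share 2/3, one 1-action run and one 2-action run
```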

The overall behavior of our agent is very similar to that of a human operator, which is a great asset for our solution and for the potential development of an AI-assistant tool. Indeed, this could ease supervised training with imitation learning, as proposed in our framework, and in later stages enable a smooth integration as a decision-support recommendation system for real power grids.

Agent’s behavior statistics

3 axes for future development

As Research & Development on decision-making algorithms for simulated power grids will continue at La Javaness in the coming months, we propose 3 main perspectives to improve the current solution:


💡 Engaging in the cutting-edge field of power grid management through AI and Deep Reinforcement Learning has been an invaluable experience for La Javaness’s team. It has not only showcased our adaptability and proficiency but also affirmed our commitment to collaborative research aimed at optimizing power grid management for a safer, greener, and more efficient transition towards renewable energy, a pursuit deeply ingrained in the core values of our company. Moreover, we firmly believe that the acceleration of AI development is intrinsically tied to collaboration between academic and industrial research, supported by open-source projects and dynamic competitions like the L2RPN challenge.

⚡ Our proudest achievement in this challenge lies in the remarkable progress we made in an incredibly short span of time, with a modest team of just two individuals. Without prior expertise in power grid management, we successfully developed a competitive agent that stands on par with the top performers. Significantly, our agent boasts decision-making speeds that outperform our closest competitors by factors of 10 and 45. This unparalleled efficiency positions our solution as not only ripe for further research and development but also poised for swift integration into industrial applications within reasonable timeframes.

⚙️ With our extensive expertise in AI and ML, coupled with a track record of delivering high-quality, industrial-grade code, La Javaness is well-positioned to spearhead the development of an ambitious assistant tool for power grid operators. This collaborative endeavor, leveraging RTE’s profound expertise in power grid management at an industrial scale, holds the potential to transform grid operations, enhancing efficiency, reliability, and sustainability. The prospect of combining expert data and knowledge in an intelligent tool to assist operators is a highly compelling project for our company. This vision embodies our commitment to advancing AI-driven solutions for a sustainable energy future.


The work done as a team of two during this intensive competition was made possible by the powerful open-source ecosystem provided by RTE with the Grid2Op simulator¹⁵. Grid2Op provides a comprehensive and powerful experimental platform to develop autonomous agents for power grid management, including smooth interfacing with the major Reinforcement Learning frameworks, as well as many features that support an algorithmic, AI-oriented perspective and make innovation possible for smart grids. Along with the simulator, we extensively used the L2RPN Baselines package, which helped us kickstart the project and quickly obtain results thanks to strong baseline implementations of agents from past challenges. Finally, we would like to extend special thanks to the challenge’s organizers, Antoine Marot, Benjamin Donnot, RTE’s team and the Paris Region for their support throughout the challenge.

About the author

After a successful internship project on speaker diarization, Jules SINTES joined La Javaness as a full-time data scientist in 2023. He was part of the winning team of the RTE and Paris Region 2023 L2RPN power grid management challenge. At La Javaness, Jules focuses on deep learning projects, with a special interest in sound and speech processing.

Note: The other half of the winning team is Van Tuan DANG, who joined La Javaness in 2020 after a PhD in machine learning. He is an active contributor to the open source community, and the author of one of the most popular French sentence-embedding models on Huggingface.


[1] Vidalenc, É., Bergey, J.-L., Quiniou, V., Marchal, D., & Combet, E. (2022). Quatre scénarios pour la transition écologique: L’exercice de prospective de l’ADEME Transition (s) 2050 1. Futuribles, (3), 5–21.

[2] Marot, A., Kelly, A., Naglic, M., Barbesant, V., Cremer, J., Stefanov, A., & Viebahn, J. (2022). Perspectives on future power system control centers for energy transition. Journal of Modern Power Systems and Clean Energy, 10(2), 328–344.

[3] Calvin, K., Dasgupta, D., Krinner, G., Mukherji, A., Thorne, P. W., Trisos, C., … Ha, M. (2023). IPCC, 2023: Climate Change 2023: Synthesis Report. Contribution of Working Groups I, II and III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change [Core Writing Team, H. Lee and J. Romero (eds.)]. IPCC, Geneva, Switzerland (P. Arias, M. Bustamante, I. Elgizouli, G. Flato, M. Howden, C. Méndez-Vallejo, … C. Péan, Eds.). doi:10.59327/IPCC/AR6-9789291691647

[4] RTE (2022). Future Energétique 2050 : les scénarios de mix de production à l’étude permettant d’atteindre la neutralité carbone à l’horizon 2050.

[5] Serré, G., Boguslawski, E., Donnot, B., Pavão, A., Guyon, I., & Marot, A. (2022). Reinforcement learning for Energies of the future and carbon neutrality: a Challenge Design. arXiv Preprint arXiv:2207.10330.

[6] Lehna, M., Viebahn, J., Marot, A., Tomforde, S., & Scholz, C. (2023). Managing power grids through topology actions: A comparative study between advanced rule-based and reinforcement learning agents. Energy and AI, 100276. doi:10.1016/j.egyai.2023.100276

[7] Diamond, S., & Boyd, S. (2016). CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 1–5.

[8] Agrawal, A., Verschueren, R., Diamond, S., & Boyd, S. (2018). A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1), 42–60.

[9] Gleave, A., Taufeeque, M., Rocamonde, J., Jenner, E., Wang, S. H., Toyer, S., … Russell, S. (2022). imitation: Clean Imitation Learning Implementations. arXiv [cs.LG].

[10] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1–8.

[11] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv Preprint arXiv:1707.06347.

[12] Marot, A., Donnot, B., Romero, C., Donon, B., Lerousseau, M., Veyrin-Forrer, L., & Guyon, I. (2020). Learning to run a power network challenge for training topology controllers. Electric Power Systems Research, 189, 106635.

[13] Fuxjäger, A. R., Kozak, K., Dorfer, M., Blies, P. M., & Wasserer, M. (2023). Reinforcement Learning Based Power Grid Day-Ahead Planning and AI-Assisted Control. arXiv Preprint arXiv:2302.07654.

[14] Kelly, A., O’Sullivan, A., de Mars, P., & Marot, A. (2020). Reinforcement learning for electricity network operation. arXiv Preprint arXiv:2003.07339.

[15] Donnot, B. (2020). Grid2op - A testbed platform to model sequential decision making in power systems. GitHub Repository.


