M2oE: Modular Mixture of Experts for Multi-Morphology Reinforcement Learning of Modular Robots

Chang Liu1,2, Qinchao Xu1, Satoshi Yagi1, Satoshi Yamamori1,3,
Yaonan Zhu2, Yusuke Iwasawa2, Kazuya Yoshida4, Jun Morimoto1,3

1 Graduate School of Informatics, Kyoto University, Kyoto, Japan

2 School of Engineering, The University of Tokyo, Tokyo, Japan

3 Dept. of Brain Robot Interface, Computational Neuroscience Labs, ATR, Kyoto, Japan

4 Dept. of Aerospace Engineering, Graduate School of Engineering, Tohoku University, Sendai, Japan

ICRA 2026 (Accepted)

Abstract

Modular robots offer a promising solution for building versatile and adaptable robotic systems. For instance, space exploration robots can be designed to reconfigure themselves to meet diverse task demands across varying environments. However, training such systems with reinforcement learning (RL) remains challenging due to the diversity of morphologies and the lack of simulation environments that support simultaneous multi-morphology learning. We present Modular Mixture of Experts (M2oE), a novel RL backbone network that mirrors the modular structure of robots to enable efficient and module-wise parallelizable policy learning for modular robots. In M2oE, a shared pool of experts, combined with an attention-based gating mechanism that dynamically selects experts based on inter-module correlations, enables both specialization and generalization. This structure supports training across multiple morphologies within a single framework, avoiding gradient conflicts and enhancing experience sharing across modules and morphologies. To support training, we also extend the Isaac Lab simulator with a multi-morphology extension that enables concurrent training across diverse robot configurations. Experiments on a space-exploration-inspired modular robot, Moonbot, demonstrate that M2oE significantly improves learning efficiency and achieves superior performance compared to both MLP and Transformer baselines.

Method Overview

M2oE architecture
Structure of Modular Mixture of Experts (M2oE). There are two main components: a module-wise shared pool of experts and an attention-based gating mechanism (M2oE-Gate) that extracts inter-module correlations to select experts for each module. The design of M2oE enables module-wise parallelism and automatic adaptation to morphologies with different numbers of modules. This figure illustrates the action inference process for the first module in a robot with M modules.
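
To make the two components concrete, the following is a minimal sketch, not the authors' implementation. It assumes each module provides an observation vector of the same size, that the gate is a single self-attention layer over modules followed by a softmax over experts, that each expert is a single linear layer, and that expert outputs are mixed densely (the text above does not say whether selection is dense or top-k). The class name `M2oESketch` and all weight names are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rng, rows, cols, scale=0.5):
    """Random weights; stands in for learned parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class M2oESketch:
    """Shared expert pool plus attention-based gate (illustrative assumptions only)."""

    def __init__(self, obs_dim, act_dim, num_experts, key_dim=4, seed=0):
        rng = random.Random(seed)
        self.key_dim = key_dim
        # Gate parameters: one self-attention layer, then a projection to expert logits.
        self.Wq = rand_matrix(rng, key_dim, obs_dim)
        self.Wk = rand_matrix(rng, key_dim, obs_dim)
        self.Wv = rand_matrix(rng, key_dim, obs_dim)
        self.Wg = rand_matrix(rng, num_experts, key_dim)
        # Expert pool shared by every module (linear experts for brevity).
        self.experts = [rand_matrix(rng, act_dim, obs_dim) for _ in range(num_experts)]

    def gate(self, module_obs):
        """Return one weight vector over experts per module."""
        q = [matvec(self.Wq, o) for o in module_obs]
        k = [matvec(self.Wk, o) for o in module_obs]
        v = [matvec(self.Wv, o) for o in module_obs]
        gates = []
        for qi in q:
            # Attention over all modules captures inter-module correlations.
            scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(self.key_dim) for kj in k]
            attn = softmax(scores)
            context = [sum(a * vj[d] for a, vj in zip(attn, v)) for d in range(self.key_dim)]
            gates.append(softmax(matvec(self.Wg, context)))
        return gates

    def act(self, module_obs):
        """Compute one action per module as a gate-weighted mix of expert outputs."""
        gates = self.gate(module_obs)
        actions = []
        for obs, g in zip(module_obs, gates):
            expert_out = [matvec(W, obs) for W in self.experts]
            act_dim = len(expert_out[0])
            actions.append([sum(gi * out[d] for gi, out in zip(g, expert_out)) for d in range(act_dim)])
        return actions, gates


# The same parameters handle robots with different numbers of modules.
policy = M2oESketch(obs_dim=6, act_dim=3, num_experts=4)
rng = random.Random(1)
for num_modules in (2, 4):
    obs = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(num_modules)]
    actions, gates = policy.act(obs)
    print(num_modules, "modules ->", len(actions), "actions of size", len(actions[0]))
```

Because no parameter depends on the number of modules, the loop over modules in `act` can run in parallel, and a morphology with more or fewer modules only changes the length of the input list.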

Experiment Platform - Moonbot

Moonbot morphologies
Three morphologies of the modular robot Moonbot: Minimal (left), Dragon (middle), and Tricycle (right). These diverse configurations pose challenges for reinforcement learning policies to generalize effectively across morphologies.
Multi-morphology extension
Multi-morphology extension in Isaac Lab for concurrent training across diverse robot configurations.

Results

Locomotion performance
Locomotion performance of multiple Moonbot morphologies trained with M2oE and the multi-morphology extension in Isaac Lab. All morphologies are controlled by the same M2oE policy and reliably track target velocity commands while maintaining stability and balance.
Learning curves
Learning curves of different methods on the locomotion task across three Moonbot morphologies. We report the average episodic return and the EMA of the tracking error (MAE with respect to the target linear and angular velocities) during training. Solid lines show the EMA-smoothed (span = 50) mean performance across three seeds, and shaded bands show the mean ±1 standard deviation. M2oE converges stably and efficiently on all morphologies. JointMLP outperforms M2oE in linear velocity tracking error on the Minimal morphology but fails to converge on the other morphologies.
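
The quantities plotted in these curves can be sketched as follows. This is an illustration under stated assumptions rather than the authors' plotting code: the smoothing factor is taken as alpha = 2 / (span + 1), each seed is smoothed before averaging, and the band uses the population standard deviation. The function names `mae`, `ema`, and `mean_and_band` are hypothetical.

```python
import math
import statistics


def mae(targets, measured):
    """Mean absolute error between target and measured velocities."""
    return sum(abs(t - m) for t, m in zip(targets, measured)) / len(targets)


def ema(values, span=50):
    """Exponential moving average; assumes alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    smoothed = []
    prev = None
    for v in values:
        # The first sample initializes the average; later samples are blended in.
        prev = v if prev is None else alpha * v + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def mean_and_band(seed_curves, span=50):
    """Smooth each seed, then return per-step (mean, mean - std, mean + std)."""
    smoothed = [ema(curve, span) for curve in seed_curves]
    means, lowers, uppers = [], [], []
    for step_values in zip(*smoothed):
        m = statistics.fmean(step_values)
        s = statistics.pstdev(step_values)  # population std (an assumption)
        means.append(m)
        lowers.append(m - s)
        uppers.append(m + s)
    return means, lowers, uppers


# Three synthetic "seeds" standing in for real training logs.
curves = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]
means, lowers, uppers = mean_and_band(curves)
print(means[-1], lowers[-1], uppers[-1])
```

Since the EMA is linear, smoothing each seed and then averaging gives the same mean curve as averaging first; only the width of the band depends on that order.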
Gate heatmap
Heatmap of expert utilization in M2oE across different morphologies. Each column represents a module, and each row represents an expert. The color intensity indicates the average gating value over test episodes.
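
A heatmap of this kind can be computed by averaging logged gate weights. The sketch below assumes the gate outputs were recorded as one list per time step, containing one weight vector over experts per module. The log format and the function name `expert_utilization` are hypothetical.

```python
def expert_utilization(gate_log):
    """Average gating value per (expert, module) over all logged steps.

    gate_log[t][m][e] is the weight of expert e for module m at step t.
    Returns a matrix with one row per expert and one column per module,
    matching the heatmap layout described in the caption.
    """
    num_steps = len(gate_log)
    num_modules = len(gate_log[0])
    num_experts = len(gate_log[0][0])
    util = [[0.0] * num_modules for _ in range(num_experts)]
    for step in gate_log:
        for m, weights in enumerate(step):
            for e, w in enumerate(weights):
                util[e][m] += w / num_steps
    return util


# Two logged steps for a robot with two modules and two experts.
log = [
    [[1.0, 0.0], [0.25, 0.75]],
    [[0.5, 0.5], [0.25, 0.75]],
]
for row in expert_utilization(log):
    print(row)
```

If the gate weights for a module sum to one at every step, each column of the resulting matrix also sums to one, so columns can be compared directly across morphologies.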
t-SNE query visualization
t-SNE visualization of gating query features across morphologies. Each point represents a transformer output feature (query vector) produced by the gating network during forward locomotion at a fixed velocity command. Distinct clusters indicate that the gating network learns morphology- and module-specific representations in the shared embedding space, enabling expert selection across heterogeneous body structures.