Abstract
Our method introduces a geometry-aware token merging algorithm that identifies and combines redundant spatial tokens based on local geometric similarity and attention patterns. This approach maintains the expressive power of the transformer while dramatically reducing computational overhead and memory requirements for 3D point cloud processing.
Key Contributions
- A geometry-aware token merging strategy for 3D point cloud transformers
- A dynamic token reduction algorithm based on a globally informed graph
- Significant computational efficiency gains while preserving geometric fidelity
Method Overview

a) For each Point Transformer layer, we compute globally informed energy scores, which are later used to calculate patch-level energy scores. b) These patch-level scores guide adaptive merging, retaining more information for high-energy patches. c) Each patch is divided into evenly sized bins, and destination tokens are randomly selected within these bins to enable spatially aware merging.
Insights
3D point cloud tokens are highly redundant!

Observation: After merging 90% of the tokens in each attention layer, the change in the principal component analysis (PCA) visualization of the feature representation (2nd image, 2nd row) is minimal compared to the original features (2nd image, 1st row). Most predictions remain unchanged after merging, with red indicating the areas where predictions differ (3rd image, 2nd row). This leads us to conclude that point cloud processing models carry high redundancy.
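The redundancy intuition can be probed numerically: merge tokens in groups of 10 (a 90% reduction), broadcast each group mean back to its members, and measure how close the reconstruction stays to the original features. Everything below (`redundancy_probe`, the grouping-by-order scheme) is an illustrative assumption, far cruder than the similarity-based merging the method actually uses.

```python
import numpy as np

def redundancy_probe(feats, group=10):
    """Mean cosine similarity between tokens and their group-merged stand-ins.

    A crude probe (assumption): tokens are grouped by index order rather
    than by feature similarity, then each group is replaced by its mean.
    """
    n = (len(feats) // group) * group         # drop the remainder tokens
    x = feats[:n].reshape(-1, group, feats.shape[1])
    merged = x.mean(axis=1, keepdims=True)    # 90% reduction for group=10
    recon = np.broadcast_to(merged, x.shape).reshape(n, -1)
    orig = feats[:n]
    cos = np.sum(orig * recon, axis=1) / (
        np.linalg.norm(orig, axis=1) * np.linalg.norm(recon, axis=1) + 1e-9)
    return cos.mean()
```

On highly redundant features (near-duplicate tokens plus small noise) the score stays close to 1, while on unstructured random features it drops sharply, mirroring why aggressive merging barely changes the PCA visualization above.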
Results

Computational Efficiency
GitMerge3D achieves up to a 21% reduction in computational cost while accuracy on point cloud tasks remains nearly unchanged.

Memory Optimization
Our token merging strategy reduces peak memory usage significantly during inference, enabling processing of larger point clouds on resource-constrained devices.

Feature Preservation
GitMerge3D maintains critical geometric features even with aggressive token reduction.

Illustration of ScanNet segmentation results with and without our merging method. As shown in the fourth column, the differences (highlighted in red) are limited to only a few points among hundreds of thousands.

We visualize the output of various token compression techniques after removing 80% of the tokens, comparing their visual quality degradation (or preservation) on the 3D object reconstruction task.
Acknowledgement
This work was supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 – 390740016, the DARPA ANSR program under award FA8750-23-2-0004, and the DARPA CODORD program under award HR00112590089. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Duy M. H. Nguyen. Tuan Anh Tran, Duy M. H. Nguyen, Michael Barz and Daniel Sonntag are also supported by the No-IDLE project (BMBF, 01IW23002), the MASTER project (EU, 101093079), and the Endowed Chair of Applied Artificial Intelligence, Oldenburg University.
Citation
@article{gitmerge3d2024,
  title={How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?},
  author={Tuan Anh Tran and Duy Minh Ho Nguyen and Hoai-Chau Tran and Michael Barz and Khoa D. Doan and Roger Wattenhofer and Vien Anh Ngo and Mathias Niepert and Daniel Sonntag and Paul Swoboda},
  journal={Conference/Journal Name},
  year={2024},
  volume={XX},
  pages={XXX-XXX}
}