How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?

Tuan Anh Tran1, Duy Minh Ho Nguyen2,1,3, Hoai-Chau Tran4, Michael Barz1, Khoa D. Doan4,
Roger Wattenhofer5, Vien Anh Ngo6, Mathias Niepert2,3, Daniel Sonntag1,7, Paul Swoboda8
1German Research Centre for Artificial Intelligence (DFKI)
2Max Planck Research School for Intelligent Systems (IMPRS-IS)
3University of Stuttgart 4College of Engineering and Computer Science, VinUniversity 5ETH Zurich 6VinRobotics, Hanoi, Vietnam
7University of Oldenburg 8Heinrich Heine University Düsseldorf
Correspondence to: Tuan Anh Tran <tuan.tran@dfki.de>, Paul Swoboda <paul.swoboda@hhu.de>.
Teaser figure: performance vs. computational efficiency comparison, showing higher computational efficiency with minimal performance loss.

Abstract

3D point cloud transformers face significant computational challenges due to the quadratic complexity of self-attention mechanisms and the large number of tokens required to represent dense point clouds. We present GitMerge3D, a novel token merging strategy specifically designed for 3D point cloud transformers that dynamically reduces token count while preserving critical geometric information.

Our method introduces a geometry-aware token merging algorithm that identifies and combines redundant spatial tokens based on local geometric similarity and attention patterns. This approach maintains the expressive power of the transformer while dramatically reducing computational overhead and memory requirements for 3D point cloud processing.
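To make the general idea concrete, the sketch below shows a generic similarity-based merging step in PyTorch, in the spirit of bipartite soft matching (ToMe). The function name, the alternating source/destination split, and the averaging rule are illustrative assumptions, not the paper's geometry-aware implementation.

import torch
import torch.nn.functional as F

def merge_tokens(x: torch.Tensor, r: int) -> torch.Tensor:
    # x: (N, C) token features; merge the r most redundant tokens.
    src, dst = x[0::2], x[1::2]                       # alternating bipartite split
    sim = F.normalize(src, dim=-1) @ F.normalize(dst, dim=-1).T
    score, match = sim.max(dim=-1)                    # best destination per source
    order = score.argsort(descending=True)
    merged, kept = order[:r], order[r:]

    # accumulate merged sources into their destinations, then average
    acc = dst.clone()
    cnt = torch.ones(dst.size(0), 1)
    acc.index_add_(0, match[merged], src[merged])
    cnt.index_add_(0, match[merged], torch.ones(r, 1))
    return torch.cat([src[kept], acc / cnt], dim=0)   # N - r tokens remain

For instance, merge_tokens(torch.randn(1024, 64), r=512) halves the token count of one layer; in GitMerge3D the decision of where and how much to merge is instead driven by geometry, as outlined under Method Overview below.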

Key Contributions

  • A geometry-aware token merging strategy for 3D point cloud transformers
  • A dynamic token reduction algorithm based on a globally informed graph
  • Significant computational efficiency gains while preserving geometric fidelity

Method Overview

GitMerge3D methodology diagram showing token merging pipeline

a) For each Point Transformer layer, we compute globally informed energy scores, which are then used to calculate patch-level energy scores. b) These patch-level scores guide adaptive merging, retaining more information for high-energy patches. c) Each patch is divided into evenly sized bins, and destination tokens are randomly selected within these bins to enable spatially aware merging.
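As a rough sketch of steps b) and c), the snippet below allocates a per-patch token budget from energy scores and picks one random destination token per bin. The energy scores are assumed to be given; how the paper derives them from the globally informed graph (step a) is not reproduced here, and both helper names are hypothetical.

import torch

def patch_budgets(energy, patch_ids, total_keep):
    # energy: (N,) per-token energy scores (assumed given; the paper derives
    # them from a globally informed graph). patch_ids: (N,) patch index per token.
    P = int(patch_ids.max()) + 1
    patch_energy = torch.zeros(P).index_add_(0, patch_ids, energy)
    patch_size = torch.bincount(patch_ids, minlength=P)
    # step b): high-energy patches keep proportionally more tokens
    keep = (patch_energy / patch_energy.sum() * total_keep).long()
    return torch.minimum(keep, patch_size)

def pick_destinations(n_tokens, n_bins):
    # step c): split a patch into evenly sized bins and pick one random
    # destination token per bin, keeping merge targets spatially spread out
    bin_size = n_tokens // n_bins
    return torch.arange(n_bins) * bin_size + torch.randint(0, bin_size, (n_bins,))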

Insights

3D point cloud tokens are highly redundant!

Visualization showing 3D point cloud redundancy comparison between original and merged predictions

Observation: after merging 90% of the tokens in each attention layer, the change in the principal component analysis (PCA) visualization of the feature representation (2nd image, 2nd row) is minimal compared to the original features (2nd image, 1st row). Most predictions also remain unchanged after merging; red marks the areas where predictions differ (3rd image, 2nd row). This leads us to conclude that point cloud processing models carry a high degree of token redundancy.
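This kind of redundancy check can be reproduced in a few lines: run the prediction head on the original and on the merged-then-broadcast features, count disagreeing predictions, and project both feature sets with the same PCA basis. The helper below is a hypothetical sketch; classifier stands in for the model's segmentation head.

import torch

@torch.no_grad()
def redundancy_check(feats, merged_feats, classifier):
    # feats, merged_feats: (N, C) per-point features, the latter taken after
    # merging and broadcast back to every point a merged token covers
    changed = (classifier(feats).argmax(-1)
               != classifier(merged_feats).argmax(-1)).float().mean()

    # project both feature sets onto the PCA basis of the original features
    mean = feats.mean(dim=0)
    _, _, V = torch.pca_lowrank(feats - mean, q=3)    # (C, 3) principal axes
    return changed.item(), (feats - mean) @ V, (merged_feats - mean) @ V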

Results

Computational efficiency comparison

Computational Efficiency

GitMerge3D achieves up to a 21% reduction in computational cost while maintaining accuracy with only minimal change across point cloud tasks.
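The gain comes largely from the quadratic self-attention term. A back-of-envelope estimate, which by assumption ignores projections and MLP layers, makes the scaling explicit:

def attention_macs(n_tokens: int, dim: int) -> int:
    # QK^T and the attention-weighted sum of V each cost ~n^2 * d MACs
    return 2 * n_tokens ** 2 * dim

full, merged = attention_macs(100_000, 64), attention_macs(10_000, 64)
print(f"attention cost after merging 90% of tokens: {merged / full:.1%} of original")
# -> 1.0%: the quadratic term shrinks with the square of the keep ratio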

Memory usage reduction

Memory Optimization

Our token merging strategy reduces peak memory usage significantly during inference, enabling processing of larger point clouds on resource-constrained devices.
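Peak memory is easy to verify on your own model with PyTorch's CUDA statistics. A minimal measurement harness, assuming a model that consumes a (B, N, 3) point tensor:

import torch

@torch.no_grad()
def peak_inference_memory_mib(model, points):
    # points: (B, N, 3) xyz coordinates; returns peak GPU memory in MiB
    torch.cuda.reset_peak_memory_stats()
    model(points.cuda())
    return torch.cuda.max_memory_allocated() / 2**20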

Geometric feature preservation visualization

Feature Preservation

GitMerge3D maintains critical geometric features even with aggressive token reduction.

ScanNet segmentation results comparison

Illustration of ScanNet segmentation results with and without our merging method. As shown in the fourth column, the differences - highlighted in red - are limited to only a few points among hundreds of thousands.

3D object reconstruction comparison

We visualize the outputs of various token compression techniques after removing 80% of the tokens, comparing how well each preserves visual quality on the 3D object reconstruction task.
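Reconstruction quality under token removal can also be quantified with the symmetric Chamfer distance between the output and ground-truth point clouds, a standard metric for this task (whether the paper reports exactly this metric is not shown here):

import torch

def chamfer_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # a: (N, 3), b: (M, 3); symmetric average nearest-neighbor distance
    d = torch.cdist(a, b)                                  # (N, M) pairwise L2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()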

Acknowledgement

This work was supported by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy - EXC 2075 – 390740016, the DARPA ANSR program under award FA8750-23-2-0004, the DARPA CODORD program under award HR00112590089. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Duy M. H. Nguyen. Tuan Anh Tran, Duy M. H. Nguyen, Michael Barz and Daniel Sonntag are also supported by the No-IDLE project (BMBF, 01IW23002), the MASTER project (EU, 101093079), and the Endowed Chair of Applied Artificial Intelligence, Oldenburg University.

Citation

@article{gitmerge3d2024,
    title={How Many Tokens Do 3D Point Cloud Transformer Architectures Really Need?},
    author={Tuan Anh Tran and Duy Minh Ho Nguyen and Hoai-Chau Tran and Michael Barz and Khoa D. Doan and Roger Wattenhofer and Vien Anh Ngo and Mathias Niepert and Daniel Sonntag and Paul Swoboda},
    journal={Conference/Journal Name},
    year={2024},
    volume={XX},
    pages={XXX-XXX}
}