Poster in Workshop: The 2nd Workshop on Reliable and Responsible Foundation Models
Learning Robust 3D Representation from CLIP via Dual Denoising
Shuqing Luo · Bowen Qu · Wei Gao
Keywords: [ zero-shot adversarial robustness ] [ 3D pre-training ]
In this paper, we explore a critical yet under-investigated challenge: whether pre-trained vision-language models like CLIP can be adapted for zero-shot adversarially robust point cloud recognition. Point clouds are a crucial data format for representing the 3D world, particularly in safety-critical applications, and the inherent vulnerability of deep models to adversarial attacks creates a pressing need for adversarially robust 3D recognition algorithms. Recent advances in vision-language pre-training have endowed point cloud recognition models with powerful zero-shot generalization capacity, leading to a new paradigm for large-scale 3D recognition. This is usually achieved via cross-modal distillation, a scalable approach to multi-modal-aware 3D learning. However, current methods rely primarily on direct alignment, mapping point cloud features into a shared multi-modal feature space, which provides no improvement in 3D robustness. This raises a critical question: can high-performing zero-shot 3D recognition and zero-shot 3D adversarial robustness be achieved simultaneously in large-scale 3D learning? Our answer is affirmative. We propose a novel distillation algorithm designed to learn robust 3D representations from CLIP, enhancing both zero-shot 3D recognition performance and zero-shot 3D adversarial robustness compared to baseline models. Our approach is built on two key components: robust 3D pre-training and parallel feature denoising. Together they enable robust, high-performing 3D zero-shot generalization without depending on adversarial training, which is often inefficient and prone to overfitting. Experiments indicate that our model achieves a 7% improvement on clean inputs and varying degrees of improvement on perturbed inputs, outperforming other models of similar scale on zero-shot 3D recognition benchmarks.
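As context for the direct-alignment distillation paradigm the abstract critiques (and for the zero-shot recognition setup both the baselines and the proposed method share), the sketch below shows a minimal PyTorch formulation. All names, shapes, the 0.5 modality weighting, and the temperature are illustrative assumptions about the standard setup, not the paper's dual-denoising method.

import torch
import torch.nn.functional as F

# Hypothetical shapes: B point clouds, D = CLIP embedding width (e.g., 512).
# point_feats:    output of a trainable point cloud encoder, (B, D)
# clip_img_feats: frozen CLIP image embeddings of paired renderings, (B, D)
# clip_txt_feats: frozen CLIP text embeddings of paired captions, (B, D)

def direct_alignment_loss(point_feats, clip_img_feats, clip_txt_feats):
    """Baseline cross-modal distillation: pull each point cloud feature
    toward the frozen CLIP image and text embeddings of the same object."""
    p = F.normalize(point_feats, dim=-1)
    i = F.normalize(clip_img_feats, dim=-1)
    t = F.normalize(clip_txt_feats, dim=-1)
    # Negative cosine similarity, averaged over both CLIP modalities.
    return -0.5 * ((p * i).sum(-1).mean() + (p * t).sum(-1).mean())

def zero_shot_logits(point_feats, class_txt_feats, temperature=0.01):
    """Zero-shot recognition: score each point cloud against CLIP text
    embeddings of C class prompts, e.g., 'a point cloud of a {class}'.
    Returns (B, C) logits; argmax over classes gives the prediction."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(class_txt_feats, dim=-1)
    return p @ t.t() / temperature

Because the CLIP encoders stay frozen, only the point encoder is trained under this loss; this is what makes the paradigm scalable, but also why, as the abstract notes, plain alignment inherits no adversarial robustness.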