

Poster

Stray Intrusive Outliers-Based Feature Selection on Intra-Class Asymmetric Instance Distribution or Multiple High-Density Clusters

Lixin Yuan · Yirui Wu · Wenxiao Zhang · Minglei Yuan · Jun Liu

East Exhibition Hall A-B #E-1700
Tue 15 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract: For data with intra-class Asymmetric instance Distribution or Multiple High-density Clusters (ADMHC), outliers are real instances with specific patterns relevant to data classification, and the class body is both necessary and difficult to identify. Previous Feature Selection (FS) methods score features based on all training instances and rarely target intra-class ADMHC. In this paper, we propose a supervised FS method, Stray Intrusive Outliers-based FS (SIOFS), for data classification with intra-class ADMHC. By focusing on Stray Intrusive Outliers (SIOs), SIOFS modifies the skewness coefficient and fuses the threshold in the 3$\sigma$ principle to identify the class body, scoring features based on the intrusion degree of SIOs. In addition, the refined density-mean center is proposed to reasonably represent the general characteristics of the class body. Mathematical formulations, proofs, and logical exposition ensure the rationality and universality of the settings in the proposed SIOFS method. Extensive experiments on 16 diverse benchmark datasets demonstrate the superiority of SIOFS over 12 state-of-the-art FS methods in terms of classification accuracy, normalized mutual information, and confusion matrix. The SIOFS source code is available at https://github.com/XXXly/2025-ICML-SIOFS
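To make the 3$\sigma$ ingredient of the abstract concrete, the sketch below shows only the standard 3$\sigma$ rule for flagging a class body on a single feature, together with the ordinary sample skewness; it does not reproduce SIOFS's modified skewness coefficient, fused threshold, intrusion-degree scoring, or refined density-mean center. The function name `class_body_mask`, the parameter `k`, and the gamma-distributed example data are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import skew

def class_body_mask(feature_values, k=3.0):
    """Flag which instances of one class fall inside the 'class body'
    under the classic k*sigma (here 3-sigma) rule on a single feature.

    Illustrative sketch only: SIOFS modifies the skewness coefficient
    and fuses it with this threshold, which is not reproduced here.
    """
    mu = feature_values.mean()
    sigma = feature_values.std(ddof=1)
    lower, upper = mu - k * sigma, mu + k * sigma
    return (feature_values >= lower) & (feature_values <= upper)

# Example: a right-skewed class; instances outside the 3-sigma band
# would be candidate stray outliers of this class.
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.0, size=200)
body = class_body_mask(x)
print("sample skewness:", skew(x))
print("instances outside the class body:", int((~body).sum()))
```

Under a skewed distribution such as this one, a symmetric 3$\sigma$ band is a poor fit on the long tail, which is the kind of situation that motivates adjusting the threshold with a skewness term as the abstract describes.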

Lay Summary:

In many real-world datasets, such as images or medical records, data within the same class can exhibit complex patterns, like uneven spreads or multiple dense clusters, making it hard to distinguish between classes. Some data points, called stray outliers, look more like another class (e.g., a resort image mistaken for a school, or handwritten digits 4 and 9 appearing similar). Traditional feature selection (FS) methods treat all data points equally, ignoring these critical outliers. This paper introduces a new FS method, SIOFS, which focuses on the stray outliers that intrude into other class bodies. SIOFS identifies the main characteristics of each class using a refined statistical approach, helping select the features that best separate classes. In tests on 16 diverse datasets, SIOFS outperformed 12 existing FS methods in accuracy and reliability. This advance is particularly useful for small or complex datasets where outliers and overlapping classes are common. The paper provides an interesting way to mine the patterns of tricky data, improving automated classification in fields like healthcare and image recognition.
