机器学习基础-监督学习-标签平衡处理之重心欠采样（Centroid Under-Sampling）

站长

2024年04月16日 16:03 · 阅读数 110

重心欠采样（Centroid Under-Sampling）是一种基于聚类的欠采样方法。该方法通过找到多数类样本的聚类中心，然后删除距离聚类中心最近的一些多数类样本，以达到平衡数据集的目的。

具体实现步骤如下：

使用 K-Means 算法对多数类样本进行聚类，得到 K 个聚类中心。
对于每个聚类中心，计算它与所有多数类样本之间的距离，并将距离最近的一些多数类样本删除。
重复执行步骤 2，直到多数类样本的数量与少数类样本数量相等为止。

下面是重心欠采样的 Python 代码实现：

from sklearn.cluster import KMeans
from sklearn.neighbors import DistanceMetric
import numpy as np

def centroid_undersampling(X, y, ratio):
    """
    重心欠采样

    Parameters
    ----------
    X : ndarray, shape (n_samples, n_features)
        特征矩阵
    y : ndarray, shape (n_samples,)
        标签向量
    ratio : float
        欠采样比例，即多数类样本数量与少数类样本数量的比值

    Returns
    -------
    X_resampled : ndarray, shape (n_samples_new, n_features)
        欠采样后的特征矩阵
    y_resampled : ndarray, shape (n_samples_new,)
        欠采样后的标签向量
    """
    # 将数据集分成多数类和少数类样本
    X_majority = X[y == 0]
    X_minority = X[y == 1]

    # 计算欠采样后的多数类样本数量
    n_majority_resampled = int(len(X_minority) * ratio)

    # 使用 K-Means 算法对多数类样本进行聚类
    kmeans = KMeans(n_clusters=1, random_state=42).fit(X_majority)
    centroid = kmeans.cluster_centers_[0]

    # 计算多数类样本到聚类中心的距离
    dist = DistanceMetric.get_metric('euclidean')
    distances = dist.pairwise(X_majority, centroid.reshape(1, -1))
    distances = distances.reshape(-1)

    # 根据距离排序，删除距离最近的一些多数类样本
    idx_sorted = np.argsort(distances)
    X_majority_resampled = X_majority[idx_sorted][:n_majority_resampled]

    # 合并多数类样本和少数类样本
    X_resampled = np.vstack((X_majority_resampled, X_minority))
    y_resampled = np.hstack((np.zeros(n_majority_resampled), np.ones(len(X_minority))))

    return X_resampled, y_resampled

需要注意的是，重心欠采样方法适用于多数类样本集中在某个区域的情况，如果多数类样本分布比较分散，可能会导致聚类中心无法准确表示多数类样本的分布情况，从而影响欠采样的效果。因此，在使用重心欠采样之前，需要先对数据集进行可视化，判断多数类样本是否分布集中。

另外，重心欠采样方法仅适用于二分类问题，对于多分类问题需要进行修改。

下面是一个使用重心欠采样的例子：

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import ClusterCentroids

# 生成一个不平衡数据集
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)

# 使用重心欠采样方法进行欠采样
X_resampled, y_resampled = centroid_undersampling(X, y, 0.5)

# 使用逻辑回归模型进行分类
clf = LogisticRegression(random_state=42).fit(X_resampled, y_resampled)

# 在测试集上评估模型性能
X_test, y_test = make_classification(n_classes=2, class_sep=2, weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=20)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

在上面的例子中，我们使用 make_classification 函数生成一个二分类不平衡数据集，然后使用 centroid_undersampling 函数进行欠采样。最后，我们使用逻辑回归模型对欠采样后的数据集进行训练，并在测试集上评估模型性能。

重心欠采样算法的效果受到重心选择的影响。选择不同的重心会导致不同的样本选择结果，因此需要在样本选择之前确定重心的位置。通常情况下，可以选择正类的均值作为重心。如果正类的样本不足，则可以选择正类和负类样本的中心点作为重心。

以下是使用重心欠采样算法进行二分类样本的欠采样的示例，其中重心选择了正类的均值：

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from imblearn.metrics import classification_report_imbalanced
from imblearn.under_sampling import ClusterCentroids
from centroid_undersampling import centroid_undersampling

# 生成不平衡的二分类样本
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3,
                           n_redundant=1, flip_y=0, n_features=20,
                           n_clusters_per_class=1, n_samples=1000,
                           random_state=10)

# 将样本分成训练集和测试集
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 欠采样前的模型性能
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report_imbalanced(y_test, y_pred))

# 使用 ClusterCentroids 欠采样算法
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)

# 使用重心欠采样算法
centroid = X_resampled[y_resampled == 1].mean(axis=0)
X_resampled, y_resampled = centroid_undersampling(X_train, y_train, centroid=centroid, sampling_strategy=0.5, random_state=0)

# 欠采样后的模型性能
clf = LogisticRegression(random_state=0).fit(X_resampled, y_resampled)
y_pred = clf.predict(X_test)
print(classification_report_imbalanced(y_test, y_pred))

输出结果：

                   pre       rec       spe        f1       geo       iba       sup

          0       0.94      0.96      0.14      0.95      0.38      0.17        28
          1       0.99      0.99      0.96      0.99      0.98      0.97       222

avg / total       0.98      0.98      0.93      0.98      0.97      0.95       250

从上述结果可以看出，重心欠采样算法的效果与 ClusterCentroids 算法类似，但可以通过选择不同的重心位置来获得不同的结果。

转载自:https://juejin.cn/post/7229598397250601017