Understanding the make_blobs function in sklearn.datasets (plain-language walkthrough with examples)

Site admin

April 9, 2024, 20:13

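While working through the multiclass classification .py file in Week 2 of 'Advanced Learning Algorithms', the second course of Andrew Ng's 2022 series, I ran into this piece of code: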

from sklearn.datasets import make_blobs

# make dataset for example
centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

For the centers parameter, the official documentation says:

centers : int or ndarray of shape (n_centers, n_features), default=None
    The number of centers to generate, or the fixed center locations.
    If n_samples is an int and centers is None, 3 centers are generated.
    If n_samples is array-like, centers must be
    either None or an array of length equal to the length of n_samples.

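Searching in both Chinese and English only turns up restatements of the same text:

centers: int or array of shape [n_centers, n_features], optional (default=None). The number of centers to generate, or the fixed center locations. If n_samples is an int and centers is None, 3 centers are generated. If n_samples is array-like, centers must be either None or an array whose length equals the length of n_samples.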

Even with that explanation, I still could not make sense of it!
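One way to read the docstring is to run each case it lists. The following is only a minimal sketch: the sample counts and the two-point centers list are made-up values for illustration, not taken from the course code.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Case 1: centers is an int, so that many centers are placed at random.
# n_features defaults to 2, so X has 2 columns.
X1, y1 = make_blobs(n_samples=12, centers=4, random_state=0)
print(X1.shape, np.unique(y1))

# Case 2: centers is a list of points, so those exact locations are used
# and the number of features is the length of each point.
X2, y2 = make_blobs(n_samples=12, centers=[[-5, 2], [5, -2]], random_state=0)
print(X2.shape, np.unique(y2))

# Case 3: n_samples is an int and centers is None, so 3 centers are generated.
X3, y3 = make_blobs(n_samples=12, centers=None, random_state=0)
print(np.unique(y3))

# Case 4: n_samples is array-like, one entry per cluster.
# centers must then be None or have the same length as n_samples.
X4, y4 = make_blobs(n_samples=[3, 5, 2], centers=None, random_state=0)
print(np.bincount(y4))  # number of samples in each cluster
```

In every case the labels in y are integers starting at 0, one per center.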

Example:

So let's modify the code and watch how the output changes:

centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
X_train, y_train = make_blobs(n_samples=20, centers=centers, cluster_std=1.0, random_state=30)
print(X_train)
print(y_train)

The output is:

[[-5.10069672  2.30379318]
 [ 6.11347211 -3.92116972]
 [ 0.994222    1.53252103]
 [-5.97071094  2.47055962]
 [-2.28564551 -1.46163252]
 [ 5.81050091 -3.04477837]
 [-4.86570341  0.89314453]
 [-0.42177445 -1.89250206]
 [ 4.29859758 -1.15091215]
 [-6.72596243  3.58509537]
 [-1.9033676   3.61689037]
 [ 1.98501786  0.29953473]
 [-6.26405266  3.52790535]
 [-2.76404783 -2.77518851]
 [-0.61615283 -1.23961492]
 [ 2.42550989  1.33524488]
 [-4.08389663 -1.06221829]
 [ 4.31077063 -2.85275686]
 [ 0.5769847   3.06448209]
 [ 3.89985619 -3.31564409]]
[0 3 2 0 1 3 0 1 3 0 2 2 0 1 1 2 1 3 2 3]

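Analysis:

The first row of the result, [-5.10069672 2.30379318], lies close to the predefined center [-5, 2] (samples scatter around their center with the standard deviation we set, 1.0), and its y label is 0. The second row, [6.11347211 -3.92116972], lies close to the predefined center [5, -2] for the same reason, and its y label is 3.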
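The "close to its center" observation can be checked numerically by subtracting from every sample the center named by its label. This is a rough sketch that reuses the centers from the course code with n_samples=2000; the variable name offsets is my own.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = np.array([[-5, 2], [-2, -2], [1, 2], [5, -2]])
X, y = make_blobs(n_samples=2000, centers=centers, cluster_std=1.0, random_state=30)

# centers[y] picks, for every sample, the center named by its label.
offsets = X - centers[y]

# Each coordinate of the offset should behave like noise with
# mean close to 0 and standard deviation close to cluster_std.
print(offsets.mean(axis=0))
print(offsets.std(axis=0))
```

If the labels did not correspond to the centers in this way, the offsets would not be centered on zero.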

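Inference:

In the centers list [[-5, 2], [-2, -2], [1, 2], [5, -2]], the index of each inner list is the value y takes for samples generated around it. The inner lists are not intervals; each one is a point given by the feature values x1, x2 (verified below). Take one of the random samples generated with standard deviation 1, for example the 5th row of the result, [-2.28564551 -1.46163252]: it is close to [-2, -2], and the corresponding y is 1, which is exactly the index of [-2, -2] in the centers list.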
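To see that the label really is the position in the list, one can reorder the centers and check that the labels follow. A small sketch under that assumption; reversed_centers is a name introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[-5, 2], [-2, -2], [1, 2], [5, -2]]
reversed_centers = centers[::-1]  # [5, -2] is now at index 0

X, y = make_blobs(n_samples=2000, centers=reversed_centers,
                  cluster_std=1.0, random_state=30)

# Average the samples of each label and compare with the center at that index.
for label, center in enumerate(reversed_centers):
    mean = X[y == label].mean(axis=0)
    print(label, center, np.round(mean, 2))
```

Each printed mean should land near the center on the same line, so after reversing the list the samples around [5, -2] carry label 0 rather than 3.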

Verification:

Suppose I now want to generate a sample set with 3 features (x1, x2, x3) and 5 classes.

centers = [[-5, 2, 1], [5, -2, 2], [10, 2, 3], [15, -2, 4], [20, 3, 6]]
X_train, y_train = make_blobs(n_samples=20, centers=centers, cluster_std=1.0, random_state=30)
print(X_train)
print(y_train)

Check the output:

[[ 5.76038508 -2.28564551  2.53836748]
 [ 7.0966324   3.61689037  4.42550989]
 [19.00816561  1.83671763  5.9786649 ]
 [ 9.33524488  2.98501786  1.29953473]
 [-4.52944038  1.89930328  1.30379318]
 [21.2418555   4.70774688  6.3231534 ]
 [ 8.89985619  0.68435591  3.81050091]
 [-6.26405266  3.52790535  0.02928906]
 [19.61298353  1.07269461  6.55075659]
 [ 8.95522163  1.31077063  2.14724314]
 [14.97080728 -0.60594402  3.60213256]
 [-6.72596243  3.58509537  1.13429659]
 [20.95435581  3.7827765   4.2067621 ]
 [ 4.23595217 -2.77518851  3.38384717]
 [ 2.91610337 -1.06221829  1.994222  ]
 [16.11347211 -3.92116972  3.29859758]
 [-6.10685547  3.57822555  1.10749794]
 [16.01912738 -0.1011187   3.64515036]
 [ 4.53252103 -2.4230153   3.06448209]
 [15.84908785 -0.94930021  3.46312554]]
[1 2 4 2 0 4 2 0 4 2 3 0 4 1 1 3 0 3 1 3]

Pick any row, for example row 13, [20.95435581 3.7827765 4.2067621]: it lies near [20, 3, 6] in centers, and its y label is the fifth class (index 4).
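The same conclusion can be read off the shapes and label counts instead of inspecting rows by eye. A minimal sketch reusing the 3-feature, 5-class centers from the verification above:

```python
import numpy as np
from sklearn.datasets import make_blobs

centers = [[-5, 2, 1], [5, -2, 2], [10, 2, 3], [15, -2, 4], [20, 3, 6]]
X, y = make_blobs(n_samples=20, centers=centers, cluster_std=1.0, random_state=30)

print(X.shape)         # (20, 3): one column per coordinate of a center
print(np.unique(y))    # [0 1 2 3 4]: one label per center
print(np.bincount(y))  # [4 4 4 4 4]: 20 samples split evenly over 5 centers

# Row 13 in the text is index 12 in zero-based Python indexing.
row = X[12]
print(row, "-> label", y[12], "-> center", centers[y[12]])
```

So the length of each inner list sets the number of features, and the number of inner lists sets the number of classes.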

(Please credit the source when reposting!)

Reposted from: https://juejin.cn/post/7142202565566922783