

Multimodal outlier detection


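Outlier detection: the training data contains outliers, defined as observations that lie far from the others. An outlier detection estimator therefore tries to fit the region where the training inliers are most concentrated, ignoring the deviant observations.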


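Outlier detection and novelty detection are both used for anomaly detection, where the goal is to detect abnormal or unusual observations. In novelty detection, the training data is assumed to be free of outliers, and the question is whether a new observation is an outlier (a novelty). Outlier detection is also known as unsupervised anomaly detection, and novelty detection as semi-supervised anomaly detection. In outlier detection, outliers/anomalies cannot form a dense cluster, because the available estimators assume that outliers/anomalies lie in low-density regions. In novelty detection, by contrast, novelties/anomalies can form a dense cluster as long as they lie in a low-density region of the training data, which is considered normal in this context.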
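The difference between the two settings can be sketched with LocalOutlierFactor, which supports both through its novelty parameter. This is a minimal illustration under assumed toy data (a Gaussian cluster plus two hand-placed distant points); the variable names are made up for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=0.3, size=(100, 2))

# Outlier detection: the training set itself is contaminated,
# and the same data is labeled with fit_predict.
X_contaminated = np.vstack([inliers, [[4.0, 4.0], [-4.0, 3.5]]])
lof_outlier = LocalOutlierFactor(n_neighbors=20)
train_labels = lof_outlier.fit_predict(X_contaminated)

# Novelty detection: fit on clean data only, then label unseen points.
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(inliers)
X_new = np.array([[0.0, 0.1], [4.0, 4.0]])
new_labels = lof_novelty.predict(X_new)

print("labels of the two injected training points:", train_labels[-2:])
print("labels of the new observations:", new_labels)
```

With novelty=True, predict should only be applied to new, unseen data, not to the training set.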

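scikit-learn provides a set of machine learning tools that can be used for both novelty and outlier detection. An estimator learns from the data in an unsupervised way with estimator.fit(X_train). New observations can then be sorted into inliers and outliers with estimator.predict(X_test): inliers are labeled 1 and outliers are labeled -1.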
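As a minimal sketch of this fit/predict convention, the snippet below trains an IsolationForest on an assumed toy Gaussian cluster and labels two hand-picked test points. The data and variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# Toy training data: one tight cluster around the origin (assumed for illustration).
X_train = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
# One point at the cluster center and one far away from it.
X_test = np.array([[0.0, 0.0], [6.0, 6.0]])

estimator = IsolationForest(random_state=0)
estimator.fit(X_train)
labels = estimator.predict(X_test)  # 1 = inlier, -1 = outlier

print(labels)
```

The point at the cluster center should come back as 1 and the distant point as -1.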







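sklearn.ensemble.IsolationForest and sklearn.neighbors.LocalOutlierFactor seem to perform reasonably well on multimodal data sets. The advantage of sklearn.neighbors.LocalOutlierFactor over the other estimators shows on the third data set, where the two modes have different densities. This advantage is explained by the local nature of LOF: it only compares the abnormality score of a sample with the scores of its neighbors.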
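The locality of LOF can be illustrated with a small sketch, assuming two toy Gaussian modes of very different density and two hand-placed probe points (all values here are made up for the example). One probe sits just outside the dense mode; the other sits at a similar spacing from its neighbors but inside the sparse mode.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
dense = rng.normal(loc=[-2.0, -2.0], scale=0.1, size=(100, 2))
sparse = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(100, 2))

# Probe 1: one unit away from the dense mode (far, relative to its neighbors).
# Probe 2: half a unit from the center of the sparse mode (ordinary there).
probes = np.array([[-2.0, -1.0], [2.0, 2.5]])
X = np.vstack([dense, sparse, probes])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)
scores = lof.negative_outlier_factor_  # lower means more abnormal

near_dense_label, in_sparse_label = labels[-2], labels[-1]
near_dense_score, in_sparse_score = scores[-2], scores[-1]
print("probe near dense mode:", near_dense_label, near_dense_score)
print("probe in sparse mode: ", in_sparse_label, in_sparse_score)
```

Because each score is relative to the density around a sample's own neighbors, the probe near the dense mode receives a much lower score than the probe in the sparse mode, even though both are at a comparable absolute distance from nearby points.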




# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
#         Albert Thomas <albert.thomas@telecom-paristech.fr>
# License: BSD 3 clause

import time

import numpy as np
import matplotlib
import matplotlib.pyplot as plt

from sklearn import svm
from sklearn.datasets import make_moons, make_blobs
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

print(__doc__)

matplotlib.rcParams['contour.negative_linestyle'] = 'solid'

# Example settings
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# define outlier/anomaly detection methods to be compared
anomaly_algorithms = [
    ("Robust covariance", EllipticEnvelope(contamination=outliers_fraction)),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf",
                                      gamma=0.1)),
    ("Isolation Forest", IsolationForest(contamination=outliers_fraction,
                                         random_state=42)),
    ("Local Outlier Factor", LocalOutlierFactor(
        n_neighbors=35, contamination=outliers_fraction))]

# Define datasets
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
    make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5,
               **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5],
               **blobs_params)[0],
    make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, .3],
               **blobs_params)[0],
    4. * (make_moons(n_samples=n_samples, noise=.05, random_state=0)[0] -
          np.array([0.5, 0.25])),
    14. * (np.random.RandomState(42).rand(n_samples, 2) - 0.5)]

# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 150),
                     np.linspace(-7, 7, 150))

plt.figure(figsize=(len(anomaly_algorithms) * 2 + 3, 12.5))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05,
                    hspace=.01)

plot_num = 1
rng = np.random.RandomState(42)

for i_dataset, X in enumerate(datasets):
    # Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6,
                                       size=(n_outliers, 2))], axis=0)

    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)

        # fit the data and tag outliers
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)

        # plot the levels lines and the points
        if name != "Local Outlier Factor":  # LOF does not implement predict
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')

        colors = np.array(['#377eb8', '#ff7f00'])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])

        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(.99, .01, ('%.2fs' % (t1 - t0)).lstrip('0'),
                 transform=plt.gca().transAxes, size=15,
                 horizontalalignment='right')
        plot_num += 1

plt.show()



