In a battle, only the people standing on the front line are actually fighting
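A support vector machine (SVM) is another tool for classification problems. Unlike logistic regression, in an SVM only the samples closest to the opposing class (the support vectors) determine the decision boundary, while samples far from the boundary play no part in determining it. Because of this mechanism, SVMs tend to be more robust: outliers that lie far from the boundary have almost no effect on it.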
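One way to see why distant samples do not matter is the hinge loss used by soft-margin SVMs, max(0, 1 - y(w·x + b)): any sample that sits on the correct side beyond the margin contributes exactly zero. The following is a minimal pure-Python sketch of that idea, not the notebook's own code; the weights `w`, bias `b`, and the sample points are hypothetical values chosen for illustration.

```python
def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for one sample with label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

# Hypothetical 2-D linear classifier whose decision boundary is x0 + x1 = 0.
w, b = (1.0, 1.0), 0.0

samples = [
    ((0.2, 0.3), +1),    # inside the margin: contributes to the loss
    ((5.0, 5.0), +1),    # far on the correct side: contributes nothing
    ((50.0, 50.0), +1),  # even farther away: still contributes nothing
    ((-0.1, -0.4), -1),  # inside the margin on the negative side
]

losses = [hinge_loss(w, b, x, y) for x, y in samples]
for (x, y), loss in zip(samples, losses):
    print(f"x={x}, y={y:+d}, hinge loss={loss:.2f}")
```

Moving the second or third point even farther from the boundary leaves the total loss unchanged, which is the sense in which only the "front line" samples shape the boundary. A point on the wrong side of the boundary would still contribute, so the robustness applies to far-away points that are correctly classified.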
This demo uses the US Adult income census dataset.
The data are described as follows:
There are two labels: >50K and <=50K.
import pandas as pd
import numpy as np
from sklearn import svm
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
dataframe = pd.read_csv('datasets/Adult/adult.data', sep=',', header=None)
dataframe.columns=["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
"occupation", "relationship", "race", "sex", "capital-gain", "capital-loss",
"hours-per-week", "native-country","salary"]
dataframe.head(3)
 | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
dataframe.workclass.unique()
As you can see, missing values in this raw data are represented by "?" (stored with a leading space, as " ?").
dataframe.shape
(32561, 15)
(dataframe==" ?").sum()
/Users/yaochenli/anaconda3/lib/python3.7/site-packages/pandas/core/ops.py:1649: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
result = method(y)
age 0
workclass 1836
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 1843
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours-per-week 0
native-country 583
salary 0
dtype: int64
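The FutureWarning above appears because the numeric columns are also compared against the string " ?". One possible way to avoid it is to run the comparison only on the non-numeric columns. The sketch below assumes a small hypothetical frame `toy` standing in for the Adult data; it is an illustration, not the code the notebook ran.

```python
import pandas as pd

# Hypothetical miniature of the Adult data. String values keep the leading
# space that results from splitting the file on "," alone.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " ?"],
    "occupation": [" Adm-clerical", " Exec-managerial", " ?", " Sales"],
})

# Compare only the non-numeric columns against the placeholder.
text_columns = toy.select_dtypes(exclude="number")
missing_counts = text_columns.eq(" ?").sum()
print(missing_counts)
```

Numeric columns such as `age` are simply left out of the result, which matches the zeros they show in the full count.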
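Compared with the sample size, there are not many missing values. Because every missing value is in a categorical variable rather than a numeric one, we fill them in according to each column's distribution.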
dataframe.workclass.value_counts()
Private 22696
Self-emp-not-inc 2541
Local-gov 2093
? 1836
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
Private clearly accounts for the majority here, so we fill the missing values with Private.
dataframe["workclass"] = dataframe["workclass"].replace(" ?", " Private")
dataframe.workclass.value_counts()
Private 24532
Self-emp-not-inc 2541
Local-gov 2093
State-gov 1298
Self-emp-inc 1116
Federal-gov 960
Without-pay 14
Never-worked 7
Name: workclass, dtype: int64
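The fill applied to workclass (replace the placeholder with the most frequent category) can be written as a small reusable helper. This is a sketch under assumptions: `fill_with_mode` is a hypothetical name that does not appear in the original notebook, and the example series is made up.

```python
import pandas as pd

def fill_with_mode(series, placeholder=" ?"):
    """Replace `placeholder` with the most frequent non-placeholder value."""
    observed = series[series != placeholder]
    most_common = observed.mode().iloc[0]  # first mode if there is a tie
    return series.replace(placeholder, most_common)

# Hypothetical column with the same leading-space convention as the data.
toy = pd.Series([" Private", " ?", " Private", " State-gov", " ?"])
filled = fill_with_mode(toy)
print(filled.value_counts())
```

A mode fill is only reasonable when one category clearly dominates, as it does for workclass; for a flatter distribution a separate category is probably the safer choice.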
dataframe.occupation.value_counts()
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
? 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
The distribution here is fairly even, so we replace "?" with a separate category, Other.
dataframe["occupation"] = dataframe["occupation"].replace(" ?", " Other")
dataframe.occupation.value_counts()
Prof-specialty 4140
Craft-repair 4099
Exec-managerial 4066
Adm-clerical 3770
Sales 3650
Other-service 3295
Machine-op-inspct 2002
Other 1843
Transport-moving 1597
Handlers-cleaners 1370
Farming-fishing 994
Tech-support 928
Protective-serv 649
Priv-house-serv 149
Armed-Forces 9
Name: occupation, dtype: int64
dataframe["native-country"].value_counts()
United-States 29170
Mexico 643
? 583
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
Greece 29
France 29
Ecuador 28
Ireland 24
Hong 20
Trinadad&Tobago 19
Cambodia 19
Thailand 18
Laos 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Hungary 13
Honduras 13
Scotland 12
Holand-Netherlands 1
Name: native-country, dtype: int64
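Samples from the United States make up the overwhelming majority here, so for now we fill the missing values with United-States.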
dataframe["native-country"] = dataframe["native-country"].replace(" ?", " United-States")
dataframe["native-country"].value_counts()
United-States 29753
Mexico 643
Philippines 198
Germany 137
Canada 121
Puerto-Rico 114
El-Salvador 106
India 100
Cuba 95
England 90
Jamaica 81
South 80
China 75
Italy 73
Dominican-Republic 70
Vietnam 67
Guatemala 64
Japan 62
Poland 60
Columbia 59
Taiwan 51
Haiti 44
Iran 43
Portugal 37
Nicaragua 34
Peru 31
France 29
Greece 29
Ecuador 28
Ireland 24
Hong 20
Cambodia 19
Trinadad&Tobago 19
Laos 18
Thailand 18
Yugoslavia 16
Outlying-US(Guam-USVI-etc) 14
Honduras 13
Hungary 13
Scotland 12
Holand-Netherlands 1
Name: native-country, dtype: int64
plt.figure(figsize=(8,5))
sns.color_palette("Set3")
# sns.set(style="whitegrid")
sns.countplot(x="salary", data=dataframe, palette="rocket")
plt.title("distribution of salary")
Text(0.5, 1.0, 'distribution of salary')
The plot shows that the class labels are not well balanced, but with an SVM this is not a big problem for us.
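If the imbalance did turn out to hurt the minority class, one common remedy is to weight classes inversely to their frequency. scikit-learn's `SVC` accepts `class_weight="balanced"`, which is documented as using `n_samples / (n_classes * count)` per class. The pure-Python sketch below only reproduces that formula on a made-up label list; `balanced_class_weights` is a hypothetical helper, the labels are not the real dataset counts, and nothing here is part of the original notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}

# Hypothetical, deliberately imbalanced labels (6 vs. 2).
labels = [" <=50K"] * 6 + [" >50K"] * 2
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label!r}: {weight:.3f}")
```

The rarer class receives the larger weight, so misclassifying one of its samples costs more during training.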
plt.figure(figsize