
Python Data Analysis and Machine Learning in Practice <3>: Pandas


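Contents

Reading data with pandas

Checking data types

Reading the first or last few rows

Indexing and computation in pandas

.loc[index]

Selecting by column name

Computation

Sorting

A pandas data preprocessing example

Case study: Titanic survival

Common pandas preprocessing methods

Custom functions in pandas

.apply()

The Series structure

Adding Series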


pandas makes data preprocessing convenient. Missing values in pandas are represented as NaN.

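Reading data with pandas

Checking data types

    import pandas

    # Raw string, so the backslashes in the Windows path are not read as escapes
    dia = pandas.read_csv(r"E:\cluster\seaborn-data\diamonds.csv")
    print(type(dia))        # the core structure: DataFrame
    print(dia.dtypes)       # string columns typically show up as object
    help(pandas.read_csv)   # help() prints the documentation itself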
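
The same calls can be tried without the file on disk. The sketch below is only an illustration: the three rows of `csv_text` are a made-up sample in the shape of diamonds.csv, read from memory through `io.StringIO`.

```python
import io

import pandas as pd

# Hypothetical three-row sample shaped like diamonds.csv (values are made up)
csv_text = """carat,cut,price
0.23,Ideal,326
0.21,Premium,326
0.29,Good,334
"""

# read_csv accepts any file-like object; the first line becomes the header
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))   # the DataFrame class
print(sample.dtypes)  # one dtype per column
```

Numeric columns are inferred as float or integer types, while the text column keeps a string-like dtype.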

Reading the first or last few rows

    print(dia.head())      # the first five rows by default
    print("--------------------------------------------------------------")
    print(dia.head(2))     # a specific number of rows
    print("--------------------------------------------------------------")
    print(dia.tail(4))     # the last four rows
    print("--------------------------------------------------------------")
    print(dia.columns)     # column names
    print(dia.shape)       # (a, b): a samples with b features each, i.e. a rows and b columns

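Indexing and computation in pandas

.loc[index]

Fetching a row takes a little more care in pandas: a bare integer such as dia[0] is looked up as a column name, so rows are selected with the .loc[label] indexer instead. With the default index the labels are simply 0, 1, 2, ..., so they coincide with the row positions.

    print(dia.loc[0])        # the row labeled 0 (the first row)
    print(dia.loc[53939])    # the last row; a label that does not exist raises KeyError
    # print(dia.loc[3:6])    # rows 3 through 6 (.loc slices include both endpoints)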
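
Because `.loc` works on labels rather than positions, the two stop agreeing once the rows are reordered. The sketch below uses a made-up three-row frame (not the diamonds data) to contrast `.loc` with the positional indexer `.iloc`.

```python
import pandas as pd

df = pd.DataFrame({"x": [3.0, 1.0, 2.0]})   # default labels 0, 1, 2
by_x = df.sort_values("x")                  # rows reordered, labels kept: 1, 2, 0

label_zero = by_x.loc[0, "x"]       # label 0: the row that was first before sorting
position_zero = by_x.iloc[0]["x"]   # position 0: the smallest x after sorting
print(label_zero, position_zero)

# .loc slices include the end label, .iloc slices exclude the end position
print(len(df.loc[0:1]), len(df.iloc[0:1]))
```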

Selecting by column name

By default, read_csv treats the first line of the CSV file as the column names, so a column can be accessed through the name found there.

    # Select one column
    col = dia["carat"]
    print(col)
    # Equivalent to the above
    # name = "carat"
    # print(dia[name])
    # Select two columns
    cols = ["carat", "color"]
    print(dia[cols])
    # print(dia[["carat", "color"]])
    # Find the columns whose unit is g (names ending in "g")
    # cols_name = dia.columns.tolist()    # column names as a list
    # print(cols_name)
    # gram_cols = []
    # for i in cols_name:
    #     if i.endswith("g"):             # unit is g
    #         gram_cols.append(i)
    # print(dia[gram_cols])
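
The commented-out search for gram columns can be written as a list comprehension. The `food` table below is hypothetical (the diamonds data has no such columns), with a "(g)" suffix assumed to mark columns measured in grams.

```python
import pandas as pd

# Hypothetical table; the "(g)" suffix marks columns measured in grams
food = pd.DataFrame({
    "name": ["butter", "cheese"],
    "protein_(g)": [0.85, 23.2],
    "fat_(g)": [81.1, 29.2],
    "energy_kcal": [717, 353],
})

# Keep the column names that end with the unit suffix
gram_cols = [c for c in food.columns if c.endswith("(g)")]
print(food[gram_cols])
```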

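Computation

    # When two columns have the same length, the operation is applied element by element
    xandy = dia["x"] * dia["y"]
    print(xandy.head(3))    # the first 3 rows
    # Divide every element by 1000
    x_ = dia["x"] / 1000
    print(dia.shape)
    # Add a column (note: the number of rows must match)
    dia["x_"] = x_
    print(dia.shape)
    # Maximum of a column
    print(dia["x"].max())
    # Divide every element of a column by a maximum (here y by the maximum of x)
    print((dia["y"] / dia["x"].max()).head(3))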
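
The numbers are easier to follow on a tiny frame. A minimal sketch with made-up values, showing the element-wise product and the division by a scalar maximum:

```python
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0, 8.0], "y": [1.0, 3.0, 5.0]})

# Column times column: each row is multiplied with the matching row
df["area"] = df["x"] * df["y"]

# Column divided by a scalar: the scalar is applied to every element
df["x_scaled"] = df["x"] / df["x"].max()

print(df)
```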

Sorting

    # Sort (ascending by default); inplace=True modifies dia itself instead of returning a new DataFrame
    dia.sort_values("x", inplace=True)
    print(dia["x"].tail(3))
    print("---------------")
    # Descending order
    dia.sort_values("x", inplace=True, ascending=False)
    print(dia["x"].head(3))
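
Sorting moves the rows but keeps their original index labels. The sketch below, on a made-up frame, uses the non-inplace form and then `reset_index(drop=True)` to renumber the rows, which is one way to keep labels and positions in step after a sort.

```python
import pandas as pd

df = pd.DataFrame({"x": [3.0, 1.0, 2.0]})

# Without inplace=True a new, sorted DataFrame is returned and df is untouched
sorted_df = df.sort_values("x", ascending=False)
print(sorted_df.index.tolist())    # the labels travel with their rows

# Throw away the old labels and number the rows from 0 again
renumbered = sorted_df.reset_index(drop=True)
print(renumbered.index.tolist())
```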

A pandas data preprocessing example

Case study: Titanic survival

    import numpy as np
    import pandas as pd

    titanic_survival = pd.read_csv("titanic.csv")
    print(titanic_survival.head())

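survived: the label of each record (used later in a classification task)
pclass: the passenger's cabin class
sex: the passenger's sex
age: the passenger's age
sibsp: the number of siblings and spouses the passenger had aboard
parch: (parents and children) the number of parents and children the passenger had aboard
fare: the ticket price
embarked: the port of embarkation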

    age = titanic_survival["age"]
    # print(age.loc[0:10])
    # Flag the missing values
    age_is_null = pd.isnull(age)
    # False where a value is present, True where it is missing
    print(age_is_null)
    # Keep only the missing entries
    age_is_true = age[age_is_null]
    print(age_is_true)
    # Number of missing values
    age_is_true_sum = len(age_is_true)
    print(age_is_true_sum)
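
The count can also be taken straight from the boolean mask, since True counts as 1 when summed. A small sketch on a made-up Series standing in for the age column:

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps
age = pd.Series([22.0, np.nan, 26.0, np.nan, 35.0])

mask = age.isnull()          # same result as pd.isnull(age)
n_missing = int(mask.sum())  # True counts as 1, False as 0
present = age[~mask]         # ~ inverts the mask: keep the known ages

print(n_missing)
print(present.tolist())
```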

 

 

Common pandas preprocessing methods

    # With missing values left untreated, the result is nan
    mean_age = sum(titanic_survival["age"]) / len(titanic_survival["age"])
    print(mean_age)

nan

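    # Fix: leave the missing values out of the calculation
    good_ages = titanic_survival["age"][~age_is_null]
    correct_mean_age = sum(good_ages) / len(good_ages)
    # titanic_survival["age"].mean() gives the mean directly, because it skips NaN by default.
    # In practice the gaps are usually filled with the mean, median or mode so that every sample is complete.
    print(correct_mean_age)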
29.69911764705882
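
The filling step mentioned in the comment can be sketched with `fillna`. The ages below are made up, and the median is only one possible fill value, chosen here for illustration.

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 26.0, np.nan, 36.0])

builtin_mean = sum(age) / len(age)   # a single NaN turns the whole sum into nan
pandas_mean = age.mean()             # skips NaN by default

# Replace each gap with the median of the known ages
filled = age.fillna(age.median())

print(builtin_mean, pandas_mean)
print(filled.tolist())
```
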
    # Mean fare for each class (the manual way)
    passenger_classes = [1, 2, 3]
    fares_by_class = {}
    for this_class in passenger_classes:
        pclass_rows = titanic_survival[titanic_survival["pclass"] == this_class]  # rows of this class
        pclass_fares = pclass_rows["fare"]      # the fare column
        fare_for_class = pclass_fares.mean()    # its mean
        fares_by_class[this_class] = fare_for_class
    print(fares_by_class)
{1: 84.1546875, 2: 20.662183152173913, 3: 13.675550101832993}
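    # A more concise pandas approach
    passenger_survival = titanic_survival.pivot_table(index="pclass", values="survived", aggfunc="mean")
    # index: the column to group by
    # values: the column whose values are aggregated
    # aggfunc: the aggregation, here the mean (np.mean also works, but recent pandas versions may warn about it)
    # pivot_table builds a pivot table, a summary of one quantity against others
    print(passenger_survival)    # survival rate per class
    # Mean age of the passengers in each class
    passenger_age = titanic_survival.pivot_table(index="pclass", values="age")  # aggfunc defaults to the mean
    print(passenger_age)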
        survived
pclass          
1       0.629630
2       0.472826
3       0.242363
              age
pclass           
1       38.233441
2       29.877630
3       25.140620
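
For a single grouping column, the same per-class means can be obtained with `groupby`. The sketch below compares the two on a made-up six-passenger frame, so the numbers are not those of the Titanic data.

```python
import pandas as pd

# Made-up passengers, only to compare the two calls
df = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3, 3],
    "survived": [1, 0, 1, 0, 0, 1],
    "fare": [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
})

by_pivot = df.pivot_table(index="pclass", values="survived", aggfunc="mean")
by_group = df.groupby("pclass")["survived"].mean()

# The manual loop over classes collapses to one groupby call
fare_by_class = df.groupby("pclass")["fare"].mean()

print(by_pivot)
print(by_group)
print(fare_by_class)
```

Note that `pivot_table` returns a DataFrame, while selecting one column after `groupby` returns a Series.
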
    # Relate one quantity to two others at once
    port_stats = titanic_survival.pivot_table(index="embarked", values=["fare", "survived"], aggfunc="sum")  # totals
    print(port_stats)
                fare  survived
embarked                      
C         10072.2962        93
Q          1022.2543        30
S         17439.3988       217
    # axis=1 (or axis="columns") drops every column that contains a missing value
    drop_na_columns = titanic_survival.dropna(axis=1)
    # axis=0 with subset drops the rows in which age or sex is missing
    new_titanic_survival = titanic_survival.dropna(axis=0, subset=["age", "sex"])
    print(new_titanic_survival)
     survived  pclass     sex   age  sibsp  parch     fare embarked   class  \
0           0       3    male  22.0      1      0   7.2500        S   Third   
1           1       1  female  38.0      1      0  71.2833        C   First   
2           1       3  female  26.0      0      0   7.9250        S   Third   
3           1       1  female  35.0      1      0  53.1000        S   First   
4           0       3    male  35.0      0      0   8.0500        S   Third   
..        ...     ...     ...   ...    ...    ...      ...      ...     ...   
885         0       3  female  39.0      0      5  29.1250        Q   Third   
886         0       2    male  27.0      0      0  13.0000        S  Second   
887         1       1  female  19.0      0      0  30.0000        S   First   
889         1       1    male  26.0      0      0  30.0000        C   First   
890         0       3    male  32.0      0      0   7.7500        Q   Third   

       who  adult_male deck  embark_town alive  alone  
0      man        True  NaN  Southampton    no  False  
1    woman       False    C    Cherbourg   yes  False  
2    woman       False  NaN  Southampton   yes   True  
3    woman       False    C  Southampton   yes  False  
4      man        True  NaN  Southampton    no   True  
..     ...         ...  ...          ...   ...    ...  
885  woman       False  NaN   Queenstown    no  False  
886    man        True  NaN  Southampton    no   True  
887  woman       False    B  Southampton   yes   True  
889    man        True    C    Cherbourg   yes   True  
890    man        True  NaN   Queenstown    no   True  

[714 rows x 15 columns]
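
The effect of `axis` and `subset` is easiest to see on a frame small enough to check by eye. A sketch with three made-up passengers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "sex": ["male", "female", "female"],
    "deck": [np.nan, "C", np.nan],
})

no_na_columns = df.dropna(axis=1)                   # drops the columns age and deck
rows_with_age = df.dropna(axis=0, subset=["age"])   # drops only the row without an age
complete_rows = df.dropna()                         # drops every row that has any gap

print(list(no_na_columns.columns))
print(rows_with_age.index.tolist())
print(len(complete_rows))
```

Restricting the check with `subset` matters here: without it no row of this frame survives, because each one has a gap somewhere.
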
    # Locate one specific value
    row_index_83_age = titanic_survival.loc[83, "age"]    # age of the sample labeled 83
    row_index_766_pclass = titanic_survival.loc[766, "pclass"]
    print(row_index_83_age)
    print(row_index_766_pclass)
28.0
1

Custom functions in pandas

.apply()

    # .apply() runs a custom function (handy when many steps are involved)
    def hundredth_row(column):    # the 100th item of a column
        hundredth_item = column.loc[99]
        return hundredth_item

    hundredth_row_values = titanic_survival.apply(hundredth_row)
    # print(hundredth_row_values)

    # Spell out the class
    def which_class(row):
        pclass = row["pclass"]
        if pd.isnull(pclass):
            return "Unknown"
        elif pclass == 1:
            return "First class"
        elif pclass == 2:
            return "Second class"
        elif pclass == 3:
            return "Third class"

    classes = titanic_survival.apply(which_class, axis=1)    # axis=1: call the function once per row
    # print(classes)

    # Number of missing values in each column
    def null_count(column):
        column_null = pd.isnull(column)
        null = column[column_null]
        return len(null)

    column_null_count = titanic_survival.apply(null_count)
    print(column_null_count)
    print("--------------")

    # Turn the continuous age into discrete labels
    def generate_age_label(row):
        age = row["age"]
        if pd.isnull(age):
            return "Unknown"
        elif age < 18:
            return "minor"
        else:
            return "adult"

    age_labels = titanic_survival.apply(generate_age_label, axis=1)
    print(age_labels)
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64
--------------
0        adult
1        adult
2        adult
3        adult
4        adult
        ...   
886      adult
887      adult
888    Unknown
889      adult
890      adult
Length: 891, dtype: object
    # Survival rate for each age group
    titanic_survival["age_labels"] = age_labels
    age_group_survival = titanic_survival.pivot_table(index="age_labels", values="survived")
    print(age_group_survival)
            survived
age_labels          
Unknown     0.293785
adult       0.381032
minor       0.539823
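
Calling a Python function once per row is flexible but slow on large tables. As a sketch of a vectorized alternative (not the method used above), `np.select` can assign the same three labels from boolean conditions. The ages are made up, and `label_one` is a hypothetical helper included only to check that both routes agree.

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 5.0, 18.0])

# Element-by-element version, same rules as the row function above
def label_one(a):
    if pd.isnull(a):
        return "Unknown"
    return "minor" if a < 18 else "adult"

via_apply = age.apply(label_one)

# Vectorized version: the first condition that is True picks the label
conditions = [age.isnull().to_numpy(), (age < 18).to_numpy()]
choices = ["Unknown", "minor"]
via_select = pd.Series(np.select(conditions, choices, default="adult"), index=age.index)

print(via_select.tolist())
```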

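The Series structure

Everything so far has used the pandas DataFrame structure (rows and columns), which is made up of a collection of Series.

A Series is a single row or a single column of a DataFrame.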

    import pandas as pd

    survival = pd.read_csv("titanic.csv")
    series_fare = survival["fare"]
    print(type(series_fare))
    print(series_fare[0:5])
    series_class = survival["class"]
    print(series_class[0:5])
<class 'pandas.core.series.Series'>
0     7.2500
1    71.2833
2     7.9250
3    53.1000
4     8.0500
Name: fare, dtype: float64
0    Third
1    First
2    Third
3    First
4    Third
Name: class, dtype: object
    from pandas import Series

    fares = series_fare.values
    print(type(fares))
    # A DataFrame holds Series and a Series holds an ndarray: pandas is built on top of NumPy
    class_ = series_class.values
    # print(class_)
    survival = Series(fares, index=class_)
    # An index should identify each sample uniquely; the class labels serve only as an illustration (not a great one)
    print(survival)
    # survival[["First", "Second"]]
    fiveten = survival.iloc[889:891]    # the last two rows, selected by position
    print(fiveten)
    original_index = survival.index.tolist()
    sorted_index = sorted(original_index)
    # survival.reindex(sorted_index) raises "cannot reindex from a duplicate axis" here,
    # because reindex needs unique index labels; sort_index() works with repeated labels
    sorted_by_index = survival.sort_index()
<class 'numpy.ndarray'>
Third      7.2500
First     71.2833
Third      7.9250
First     53.1000
Third      8.0500
           ...   
Second    13.0000
First     30.0000
Third     23.4500
First     30.0000
Third      7.7500
Length: 891, dtype: float64
First    30.00
Third     7.75
dtype: float64
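
The consequences of repeated index labels can be reproduced on three values. In the sketch below (made-up fares) a label lookup returns every matching row, `reindex` is expected to fail with a ValueError, and `sort_index` still works.

```python
import pandas as pd

fares = pd.Series([7.25, 71.28, 7.92], index=["Third", "First", "Third"])

# A repeated label selects all rows that carry it
print(fares["Third"])

# reindex requires unique labels in the Series being reindexed
reindex_failed = False
try:
    fares.reindex(sorted(fares.index))
except ValueError as err:
    reindex_failed = True
    print("reindex failed:", err)

# Sorting by label does not need unique labels
by_label = fares.sort_index()
print(by_label)
```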

A Series can also be sorted by value or by index label, with sort_values() and sort_index().

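Adding Series

    import numpy as np

    # Two Series with identical indexes are added position by position;
    # when the indexes differ, pandas matches the values up by label first
    print(np.add(survival, survival)[0:5])
    print(np.max(survival))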
Third     14.5000
First    142.5666
Third     15.8500
First    106.2000
Third     16.1000
dtype: float64
512.3292
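
When the indexes differ, the alignment by label becomes visible. A minimal sketch with two made-up Series, using `add` with `fill_value` as one way to avoid the NaN that appears for a label missing on one side:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Values are matched by label; "x" has no partner in b, so the result there is NaN
total = a + b

# Treat a missing partner as 0 instead
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```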

These are some of the basic pandas operations; others will come up in the later case studies.
