Work has been busy lately, so this Python material sat untouched for quite a while without any study or updates to these notes. Setting a goal here: finish roughly one exercise every two days and have this section done by early November.
- Exercise 1 - Getting to know your data (completed 2021-11-02)
- Exercise 2 - Filtering and sorting (completed 2021-11-02)
- Exercise 3 - Grouping (completed 2021-11-02)
- Exercise 4 - The Apply function (completed 2021-11-03)
- Exercise 5 - Merging (completed 2021-11-03)
- Exercise 6 - Statistics (completed 2021-11-03)
- Exercise 7 - Visualization (completed 2021-11-05)
- Exercise 8 - Creating DataFrames (completed 2021-11-04)
- Exercise 9 - Time series (completed 2021-11-05)
- Exercise 10 - Deleting data (completed 2021-11-04)
All ten exercises are now done. Some solutions are my own, and there were plenty of spots I couldn't work out; I plan to write a short summary of these ten exercises later.
My current take: if you are just starting out with Python data analysis, reading a very basic book first will likely make working through these exercises go twice as fast.
Exercise | Topic | Dataset |
---|---|---|
Exercise 1 - Getting to know your data | Explore the Chipotle fast-food data | chipotle.tsv |
Exercise 2 - Filtering and sorting | Explore the Euro 2012 data | Euro2012_stats.csv |
Exercise 3 - Grouping | Explore the alcohol consumption data | drinks.csv |
Exercise 4 - The Apply function | Explore the 1960-2014 US crime data | US_Crime_Rates_1960_2014.csv |
Exercise 5 - Merging | Explore fictitious name data | data built by hand in the exercise |
Exercise 6 - Statistics | Explore the wind speed data | wind.data |
Exercise 7 - Visualization | Explore the Titanic disaster data | train.csv |
Exercise 8 - Creating DataFrames | Explore the Pokemon data | data built by hand in the exercise |
Exercise 9 - Time series | Explore the Apple stock price data | Apple_stock.csv |
Exercise 10 - Deleting data | Explore the Iris flower data | iris.csv |
import os
os.getcwd()
'D:\\PythonFlie\\python\\pandas'
os.listdir( )
['.ipynb_checkpoints',
'pandas_exercise',
'Pandas基础命令速查表0922.ipynb',
'测试数据.csv',
'测试数据.xlsx',
'这十套练习,教你如何用Pandas做数据分析0929.ipynb']
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise")
print(os.getcwd()) # check that we are now in the expected directory
print(os.listdir()) # listing this path shows the data actually lives under the exercise_data folder, so change directory again
os.chdir("D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data")
print(os.getcwd()) # check that we are now in the expected directory
os.listdir() # list the files under this path
D:\PythonFlie\python\pandas\pandas_exercise
['exercise_data']
D:\PythonFlie\python\pandas\pandas_exercise\exercise_data
['Apple_stock.csv',
'cars.csv',
'chipotle.tsv',
'drinks.csv',
'Euro2012_stats.csv',
'iris.csv',
'second_cars_info.csv',
'train.csv',
'US_Crime_Rates_1960_2014.csv',
'wechart.csv',
'wind.data']
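Instead of repeatedly calling os.chdir, the path to a data file can also be assembled directly; a small sketch of that alternative (the folder names mirror the listing above, and the absolute prefix is omitted here):

```python
from pathlib import Path

# Build the data path once instead of changing the working directory.
# "pandas_exercise/exercise_data" matches the folders seen in the listing above.
base = Path("pandas_exercise") / "exercise_data"
csv_path = base / "drinks.csv"
print(csv_path.as_posix())  # pandas_exercise/exercise_data/drinks.csv
```

Keeping the working directory fixed and passing full paths to `pd.read_csv` avoids surprises when a notebook cell is re-run out of order.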
import pandas as pd
path1 = "D:\\PythonFlie\\python\\pandas\\pandas_exercise\\exercise_data\\chipotle.tsv" # chipotle.tsv
chipo = pd.read_csv(path1, sep = '\t')
chipo.head(10)
print(chipo.shape) # number of rows and columns in the dataset
print(chipo.shape[1]) # number of columns
(4622, 5)
5
chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
chipo.index
RangeIndex(start=0, stop=4622, step=1)
# take the item_name and quantity columns from chipo, group by item_name, and sum quantity
c = chipo[['item_name','quantity']].groupby(['item_name']).agg({'quantity':sum})
# sort by the quantity column in descending order
c.sort_values(['quantity'],ascending=False,inplace=True)
# inspect the top five
c.head()
# take the item_name column, drop duplicates, then count
chipo['item_name'].drop_duplicates().count()
50
# Reference answer
chipo['item_name'].nunique()
50
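The two counts above agree because, on a column without missing values, dropping duplicates and counting is the same as nunique(); a toy illustration (the item names below are made up):

```python
import pandas as pd

# drop_duplicates().count() and nunique() agree when there are no NaNs;
# nunique() is simply the more direct spelling.
s = pd.Series(["Chips", "Bowl", "Chips", "Salsa"])
print(s.drop_duplicates().count())  # 3
print(s.nunique())                  # 3
```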
# idea: take item_name, choice_description and order_id, then count order_id per group
chipo[['item_name',"choice_description","order_id"]].groupby(['item_name',"choice_description"]).aggregate({'order_id':"count"}).sort_values("order_id",ascending=False).head(1)
# Reference answer
chipo['choice_description'].value_counts().head()
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
chipo['quantity'].sum()
4972
# Reference answer
total_items_orders = chipo['quantity'].sum()
total_items_orders
4972
# this step is key: item_price must be converted from a "$x.xx" string to float before any later arithmetic
dollarizer = lambda x: float(x[1:])
chipo['item_price'] = chipo['item_price'].apply(dollarizer)
chipo["总价"] = chipo['quantity']*chipo['item_price']  # "总价" means total price per line item
chipo["总价"].sum()
39237.02
# Reference answer
chipo['sub_total'] = round(chipo['item_price'] * chipo['quantity'],2)
chipo['sub_total'].sum()
39237.02
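The lambda-based dollarizer above can also be done with vectorized string methods; a sketch on made-up prices (the values are illustrative, not from chipotle.tsv):

```python
import pandas as pd

# Strip the leading "$" and convert to float without a Python-level lambda.
prices = pd.Series(["$2.39", "$10.98", "$1.09"])
as_float = prices.str.replace("$", "", regex=False).astype(float)
print(as_float.sum())  # ≈ 14.46
```

The vectorized version avoids calling a Python function once per row, which matters on larger frames.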
chipo["order_id"].drop_duplicates().count()
1834
# Reference answer
chipo['order_id'].nunique()
1834
# note: this does not compute the same thing as the reference answer — it averages item_price within each order instead of summing the order totals first
chipo.groupby(["order_id"]).agg({"item_price":"mean"})
# Reference answer
chipo[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.394231188658654
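Why the two results differ can be seen on a tiny made-up frame: averaging per-line prices is not the same as averaging per-order totals.

```python
import pandas as pd

# Two orders: order 1 has two lines (10 + 20), order 2 has one line (5).
df = pd.DataFrame({"order_id": [1, 1, 2],
                   "sub_total": [10.0, 20.0, 5.0]})
# Mean order value: sum within each order first, then average the totals.
order_totals = df.groupby("order_id")["sub_total"].sum()
print(order_totals.mean())     # (30 + 5) / 2 = 17.5
# Averaging the raw lines gives a different, smaller number.
print(df["sub_total"].mean())  # (10 + 20 + 5) / 3 ≈ 11.67
```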
chipo['item_name'].nunique()
50
import numpy as np
import pandas as pd
euro12 = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Euro2012_stats.csv", sep = ',')
euro12.info() # 35 columns of Euro 2012 team statistics: goals, shooting accuracy, passing, defending (tackles, clearances, interceptions, clean sheets, saves), discipline (fouls, yellow/red cards), substitutions and players used — see the output below for the full list
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Team                        16 non-null     object
 1   Goals                       16 non-null     int64
 2   Shots on target             16 non-null     int64
 3   Shots off target            16 non-null     int64
 4   Shooting Accuracy           16 non-null     object
 5   % Goals-to-shots            16 non-null     object
 6   Total shots (inc. Blocked)  16 non-null     int64
 7   Hit Woodwork                16 non-null     int64
 8   Penalty goals               16 non-null     int64
 9   Penalties not scored        16 non-null     int64
 10  Headed goals                16 non-null     int64
 11  Passes                      16 non-null     int64
 12  Passes completed            16 non-null     int64
 13  Passing Accuracy            16 non-null     object
 14  Touches                     16 non-null     int64
 15  Crosses                     16 non-null     int64
 16  Dribbles                    16 non-null     int64
 17  Corners Taken               16 non-null     int64
 18  Tackles                     16 non-null     int64
 19  Clearances                  16 non-null     int64
 20  Interceptions               16 non-null     int64
 21  Clearances off line         15 non-null     float64
 22  Clean Sheets                16 non-null     int64
 23  Blocks                      16 non-null     int64
 24  Goals conceded              16 non-null     int64
 25  Saves made                  16 non-null     int64
 26  Saves-to-shots ratio        16 non-null     object
 27  Fouls Won                   16 non-null     int64
 28  Fouls Conceded              16 non-null     int64
 29  Offsides                    16 non-null     int64
 30  Yellow Cards                16 non-null     int64
 31  Red Cards                   16 non-null     int64
 32  Subs on                     16 non-null     int64
 33  Subs off                    16 non-null     int64
 34  Players Used                16 non-null     int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
euro12["Goals"]
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
# Reference answer
euro12.Goals
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64
euro12["Team"].count()
16
# Reference answer
euro12.shape[0]
16
euro12.shape[1]
35
# Reference answer
euro12.info()
discipline = euro12[["Team","Yellow Cards","Red Cards"]]
discipline
discipline.sort_values(by = ["Red Cards","Yellow Cards"])
# Reference answer
discipline.sort_values(['Red Cards', 'Yellow Cards'], ascending = False)
discipline["Yellow Cards"].mean()
7.4375
# Reference answer
round(discipline['Yellow Cards'].mean())
7
euro12[euro12["Goals"] > 6]
# Reference answer
euro12[euro12.Goals > 6]
euro12[euro12.Team.str.startswith('G')]
euro12.head(7)
# Reference answer
euro12.iloc[: , 0:7]
euro12.iloc[: , :-3]
euro12.loc[euro12.Team.isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
import pandas as pd
drinks = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\drinks.csv")
drinks
drinks.info()
#country: country name
#beer_servings: servings of beer consumed
#spirit_servings: servings of spirits (liquor) consumed
#wine_servings: servings of wine consumed
#total_litres_of_pure_alcohol: total litres of pure alcohol
#continent: continent
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 193 non-null object
1 beer_servings 193 non-null int64
2 spirit_servings 193 non-null int64
3 wine_servings 193 non-null int64
4 total_litres_of_pure_alcohol 193 non-null float64
5 continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
drinks.groupby("continent").aggregate({"beer_servings":"mean"}).sort_values(by = "beer_servings",ascending = False)
beer_servings
continent
EU 193.777778
SA 175.083333
OC 89.687500
AF 61.471698
AS 37.045455
# Reference answer
drinks.groupby('continent').beer_servings.mean()
continent
AF 61.471698
AS 37.045455
EU 193.777778
OC 89.687500
SA 175.083333
Name: beer_servings, dtype: float64
drinks.groupby("continent").describe()["wine_servings"]
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
# Reference answer
drinks.groupby('continent').wine_servings.describe()
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
drinks.groupby("continent").aggregate({"beer_servings":"mean","spirit_servings":"mean","wine_servings":"mean"})
beer_servings spirit_servings wine_servings
continent
AF 61.471698 16.339623 16.264151
AS 37.045455 60.840909 9.068182
EU 193.777778 132.555556 142.222222
OC 89.687500 58.437500 35.625000
SA 175.083333 114.750000 62.416667
# Reference answer
drinks.groupby('continent').mean()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333
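One caveat about the reference answer: on pandas 2.0 and later, groupby(...).mean() raises on non-numeric columns such as country, so numeric_only=True is needed. A minimal sketch on made-up rows:

```python
import pandas as pd

# On recent pandas, pass numeric_only=True so string columns are skipped
# instead of raising a TypeError.
df = pd.DataFrame({"continent": ["EU", "EU", "AF"],
                   "country": ["a", "b", "c"],
                   "beer_servings": [100, 200, 60]})
means = df.groupby("continent").mean(numeric_only=True)
print(means.loc["EU", "beer_servings"])  # 150.0
```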
drinks.groupby("continent").aggregate({"beer_servings":"quantile","spirit_servings":"quantile","wine_servings":"quantile"})
beer_servings spirit_servings wine_servings
continent
AF 32.0 3.0 2.0
AS 17.5 16.0 1.0
EU 219.0 122.0 128.0
OC 52.5 37.0 8.5
SA 162.5 108.5 12.0
# Reference answer
drinks.groupby('continent').median()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
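The two answers above agree because quantile() with no argument defaults to q=0.5, which is exactly the median; a quick check on the AF/AS/EU/OC/SA beer medians from the table:

```python
import pandas as pd

s = pd.Series([32.0, 17.5, 219.0, 52.5, 162.5])
# quantile() with no argument is the 50th percentile, identical to median().
print(s.quantile())  # 52.5
print(s.median())    # 52.5
```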
drinks.groupby("continent").describe()["spirit_servings"][["mean","max","min"]]
mean max min
continent
AF 16.339623 152.0 0.0
AS 60.840909 326.0 0.0
EU 132.555556 373.0 0.0
OC 58.437500 254.0 0.0
SA 114.750000 302.0 25.0
# Reference answer
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])
mean min max
continent
AF 16.339623 0 152
AS 60.840909 0 326
EU 132.555556 0 373
OC 58.437500 0 254
SA 114.750000 25 302
import pandas as pd
crime = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\US_Crime_Rates_1960_2014.csv")
crime.head()
#Year: year
#Population: population
#Total: total crimes
#Violent: violent crime
#Property: property crime
#Murder: murder
#Forcible_Rape: forcible rape
#Robbery: robbery
#Aggravated_assault: aggravated assault
#Burglary: burglary
#Larceny_Theft: larceny/theft
#Vehicle_Theft: vehicle theft
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     datetime64[ns]
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
crime.set_index("Year",inplace = True)
crime.head()
# Reference answer
crime = crime.set_index('Year', drop = True)
crime.head()
crime.drop(columns = ["Total"],inplace = True)
# Reference answer
del crime['Total']
crime.head()
crime.groupby("Year").sum().head()
I didn't quite understand this part at first: grouping by the yearly index changes nothing here, which is why the reference answer resamples the data into decades instead.
# Reference answer
# More about .resample:
# (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html)
# More about Offset Aliases:
# (http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases)
# Run the following code:
crimes = crime.resample('10AS').sum() # resample the time series per decade
# use resample to get the max of the "Population" column
population = crime['Population'].resample('10AS').max()
# update "Population"
crimes['Population'] = population
crimes
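To make the resample('10AS') step concrete: it groups the yearly rows into decade buckets. The same idea can be sketched without any time-series machinery by flooring the year to a multiple of 10 (toy numbers below, not the real crime data):

```python
import pandas as pd

# Toy yearly data spanning two decades.
df = pd.DataFrame({"Year": [1960, 1965, 1971, 1978],
                   "Murder": [10, 20, 30, 40]})
# Floor each year to its decade (1965 -> 1960, 1971 -> 1970) and sum per decade.
decade = (df["Year"] // 10) * 10
per_decade = df.groupby(decade)["Murder"].sum()
print(per_decade[1960])  # 10 + 20 = 30
print(per_decade[1970])  # 30 + 40 = 70
```

resample does the same bucketing on a DatetimeIndex, with the added ability to pick a different aggregation per column, as the reference answer does for Population.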
crime.idxmax(0)
Population 2014
Total 1991
Violent 1992
Property 1991
Murder 1991
Forcible_Rape 1992
Robbery 1991
Aggravated_assault 1993
Burglary 1980
Larceny_Theft 1991
Vehicle_Theft 1991
dtype: int64
import pandas as pd
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
print(data1)
print("----------")
print(data2)
print("----------")
print(data3)
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
----------
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
----------
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
# stack data1 and data2 vertically (note the repeated index values in the result)
all_data = pd.concat([data1,data2])
all_data
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
all_data_col = pd.concat([data1, data2], axis = 1)
all_data_col
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
all_data_col = pd.merge(data1, data2,left_index = True,right_index = True)
all_data_col
subject_id_x first_name_x last_name_x subject_id_y first_name_y last_name_y
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
data3
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
pd.merge(all_data,data3,on = "subject_id")
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
pd.merge(data1,data2,on = "subject_id")
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
Step 9. Find all rows that result from merging data1 and data2, keeping the non-matches as well (outer join)
pd.merge(data1,data2,how = "outer",on = "subject_id")
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
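A small sketch of how the how= argument controls which subject_ids survive a merge, using indicator=True to label where each row came from (made-up ids, not the exercise data):

```python
import pandas as pd

left = pd.DataFrame({"subject_id": ["4", "5", "6"]})
right = pd.DataFrame({"subject_id": ["5", "6", "7"]})
# indicator=True adds a _merge column marking each row as
# left_only, right_only, or both.
merged = pd.merge(left, right, how="outer", on="subject_id", indicator=True)
counts = merged["_merge"].value_counts()
print(counts["both"], counts["left_only"], counts["right_only"])  # 2 1 1
```

With how="inner" only the "both" rows would remain, which is why the earlier merge of data1 and data2 kept just subject_ids 4 and 5.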
import pandas as pd
import time
import datetime
import dateutil
data = pd.read_table(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\wind.data",sep = "\s+",parse_dates = [[0,1,2]])
data.head()
#specifying parse_dates in read_csv() can make reading the file noticeably slower
#infer_datetime_format=True can significantly reduce read_csv's date-parsing time
#keep_date_col=True/False controls whether the columns that were parsed into the date are kept: True keeps them, False drops them
#subtract 100 years from the parsed dates
data["Yr_Mo_Dy"] = data["Yr_Mo_Dy"].apply(lambda x :x - dateutil.relativedelta.relativedelta(years=100))
data.head()
# Reference answer
# run the following code
def fix_century(x):
year = x.year - 100 if x.year > 1989 else x.year
return datetime.date(year, x.month, x.day)
# apply the function fix_century on the column and replace the values to the right ones
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_century)
# data.info()
data.head()
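The century fix is needed because pandas (via Python's strptime rules) parses two-digit years with a pivot: values 00-68 land in the 2000s, so the 1961-1978 wind data comes out as 2061-2078. A minimal illustration of the pitfall:

```python
import pandas as pd

# A two-digit year below the strptime pivot is parsed into the wrong century:
# "61" becomes 2061, not 1961, hence the subtraction of 100 years above.
parsed = pd.to_datetime("61-1-1", format="%y-%m-%d")
print(parsed.year)  # 2061
fixed = parsed - pd.DateOffset(years=100)
print(fixed.year)   # 1961
```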
data.set_index("Yr_Mo_Dy",drop = True,inplace = True)
data.head()
# Reference answer
# transform Yr_Mo_Dy it to date type datetime64
data["Yr_Mo_Dy"] = pd.to_datetime(data["Yr_Mo_Dy"])
# set 'Yr_Mo_Dy' as the index
data = data.set_index('Yr_Mo_Dy')
data.head()
# data.info()
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6574 entries, 1961-01-01 to 1978-12-31
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   RPT     6568 non-null   float64
 1   VAL     6571 non-null   float64
 2   ROS     6572 non-null   float64
 3   KIL     6569 non-null   float64
 4   SHA     6572 non-null   float64
 5   BIR     6574 non-null   float64
 6   DUB     6571 non-null   float64
 7   CLA     6572 non-null   float64
 8   MUL     6571 non-null   float64
 9   CLO     6573 non-null   float64
 10  BEL     6574 non-null   float64
 11  MAL     6570 non-null   float64
dtypes: float64(12)
memory usage: 667.7 KB
# Reference answer
data.isnull().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
data.shape[0] - data.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
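The subtraction above is one of three equivalent ways to count non-missing values; a toy check:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan])
# Three equivalent ways to count non-missing entries.
print(len(s) - s.isnull().sum())  # 2
print(s.notna().sum())            # 2
print(s.count())                  # 2
```

Series.count() is the most direct spelling, since it skips NaN by definition.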
data.mean().mean()
10.227982360836924
loc_stats = data.aggregate(["min","max","mean","std"])
loc_stats
# Reference answer
loc_stats = pd.DataFrame()
loc_stats['min'] = data.min() # min
loc_stats['max'] = data.max() # max
loc_stats['mean'] = data.mean() # mean
loc_stats['std'] = data.std() # standard deviations
loc_stats
day_stats = data.aggregate(["min","max","mean","std"],axis=1)
day_stats.head()
min max mean std
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
# Reference answer
# create the dataframe
day_stats = pd.DataFrame()
# this time we determine axis equals to one so it gets each row.
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations
day_stats.head()
min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
# creates a new column 'date' and gets the values from the index
data['date'] = data.index
# creates a column for each value from date
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# gets all value from the month 1 and assign to janyary_winds
january_winds = data.query('month == 1')
# gets the mean from january_winds, using .loc to not print the mean of month, year and day
january_winds.loc[:,'RPT':"MAL"].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
data.query('month == 1 and day == 1')
data.query('day == 1')
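With a DatetimeIndex, the helper date/month/year/day columns above aren't strictly necessary: the index itself exposes those components. A sketch on toy wind readings (the station name and values are made up):

```python
import pandas as pd

idx = pd.to_datetime(["1961-01-01", "1961-01-15", "1961-02-01"])
df = pd.DataFrame({"RPT": [15.0, 14.0, 10.0]}, index=idx)
# Select January rows straight from the index — no extra columns needed.
january = df[df.index.month == 1]
print(january["RPT"].mean())  # (15 + 14) / 2 = 14.5
```

The same pattern handles "first day of each month" via `df[df.index.day == 1]`.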
import pandas as pd
titanic = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\train.csv")
titanic.head()
#PassengerId: passenger id
#Survived: survival flag (1 = survived, 0 = did not survive)
#Pclass: ticket class
#Name: passenger name
#Sex: sex
#Age: age
#SibSp: number of siblings/spouses aboard
#Parch: number of parents/children aboard
#Ticket: ticket number
#Fare: ticket fare
#Cabin: cabin number
#Embarked: port of embarkation
titanic.set_index("PassengerId",inplace = True)
titanic.head()
import matplotlib.pyplot as plt
male_ct = titanic[titanic["Sex"] == "male"].count()[1]
female_ct = titanic[titanic["Sex"] == "female"].count()[1]
x = [male_ct,female_ct]
plt.pie(x,
        labels = ["male","female"],
        explode = (0.1 , 0),
        startangle = 90,
        autopct = '%1.1f%%')
plt.axis('equal')
plt.title("Sex Proportion")
plt.tight_layout()
plt.show()
# Reference answer
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()
# put them into a list called proportions
proportions = [males, females]
# Create a pie chart
plt.pie(
    # using proportions
    proportions,
    # with the labels being the sex categories
    labels = ['Males', 'Females'],
    # with no shadows
    shadow = False,
    # with colors
    colors = ['blue','red'],
    # with one slice exploded out
    explode = (0.15 , 0),
    # with the start angle at 90 degrees
    startangle = 90,
    # with the percent listed as a fraction
    autopct = '%1.1f%%')
# View the plot drawn above
plt.axis('equal')
# Set labels
plt.title("Sex Proportion")
# View the plot
plt.tight_layout()
plt.show()
import seaborn as sns
# creates the plot using
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)
# set title
lm.set(title = 'Fare x Age')
# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
titanic.query("Survived == 1").Survived.count()
342
# Reference answer
titanic.Survived.sum()
342
fare = titanic["Fare"]
plt.hist(fare,
bins = 20)
plt.axis('tight')
plt.tight_layout()
plt.show()
# Reference answer
# sort the values from the top to the least value and slice the first 5 items
import numpy as np
df = titanic.Fare.sort_values(ascending = False)
df
# create bins interval using numpy
binsVal = np.arange(0,600,10)
binsVal
# create the plot
plt.hist(df, bins = binsVal)
# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')
# show the plot
plt.show()
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
"evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
"type": ['grass', 'fire', 'water', 'bug'],
"hp": [45, 39, 44, 45],
"pokedex": ['yes', 'no','yes','no']
}
pokemon = pd.DataFrame(raw_data)
pokemon.head()
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
pokemon['place'] = ['park','street','lake','forest']
pokemon
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
pokemon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 4 non-null object
1 type 4 non-null object
2 hp 4 non-null int64
3 evolution 4 non-null object
4 pokedex 4 non-null object
5 place 4 non-null object
dtypes: int64(1), object(5)
memory usage: 320.0+ bytes
# Reference answer
pokemon.dtypes
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
import pandas as pd
apple = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\Apple_stock.csv")
apple.head()
apple.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null object
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 463.1+ KB
apple["Date"] = apple["Date"].apply(lambda x : pd.to_datetime(x))
apple.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 8465 non-null datetime64[ns]
1 Open 8465 non-null float64
2 High 8465 non-null float64
3 Low 8465 non-null float64
4 Close 8465 non-null float64
5 Volume 8465 non-null int64
6 Adj Close 8465 non-null float64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 463.1 KB
apple.set_index("Date",drop = True,inplace = True)
apple.head()
# Reference answer
apple = apple.set_index('Date')
apple.head()
apple.groupby("Date").count().sort_values(by = "Date",ascending = False)
# Reference answer
apple.index.is_unique
True
apple.sort_values(by = "Date",ascending = True)
# Reference answer
apple.sort_index(ascending = True).head()
apple_month = apple.resample('BM')
apple_month.agg("mean")
(apple.index.max() - apple.index.min()).days
12261
apple_months = apple.resample('BM').mean()
len(apple_months.index)
404
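One note on the 'BM' (business month end) alias used above: newer pandas deprecates it in favor of 'BME'. The mechanics of monthly resampling can be sketched with the stable 'MS' (month start) alias on made-up prices:

```python
import pandas as pd

idx = pd.to_datetime(["2014-01-02", "2014-01-31", "2014-02-10"])
prices = pd.Series([100.0, 110.0, 120.0], index=idx)
# Bucket the series into calendar months (month-start labels)
# and average within each bucket.
monthly = prices.resample("MS").mean()
print(len(monthly))     # 2 months
print(monthly.iloc[0])  # (100 + 110) / 2 = 105.0
```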
# makes the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")
# changes the size of the graph
fig = appl_open.get_figure()
fig.set_size_inches(13.5, 9)
import pandas as pd
import numpy as np
iris = pd.read_csv(r"D:\PythonFlie\python\pandas\pandas_exercise\exercise_data\iris.csv")
iris.head()
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
#iris.columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'class']
iris.rename(columns={'5.1':'sepal_length','3.5':'sepal_width','1.4':'petal_length','0.2':'petal_width','Iris-setosa':'class'},inplace = True)
iris.head()
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
iris.isnull().sum()
5.1 0
3.5 0
1.4 0
0.2 0
Iris-setosa 0
dtype: int64
iris["petal_length"][9:19] = np.nan  # note: chained indexing like this can raise SettingWithCopyWarning; the reference answer's .iloc is safer
# Reference answer
iris.iloc[10:20,2:3] = np.nan
iris.head(20)
iris.fillna(1.0,inplace = True)
# Reference answer
iris.petal_length.fillna(1, inplace = True)
iris.head(20)
iris.drop(columns="class",inplace = True)
# Reference answer
del iris['class']
iris.head()
sepal_length sepal_width petal_length petal_width
0 4.9 3.0 1.4 0.2
1 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
iris[:3] = np.nan
iris.head(20)
# Reference answer
iris.iloc[0:3 ,:] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
iris.dropna(inplace=True)
# Reference answer
iris = iris.dropna(how='any')
iris.head()
sepal_length sepal_width petal_length petal_width
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
5 4.6 3.4 1.4 0.3
6 5.0 3.4 1.5 0.2
7 4.4 2.9 1.4 0.2
iris.reset_index()
# Reference answer
iris = iris.reset_index(drop = True)
iris.head()