Spark Projects in Practice, Step by Step (Iris Clustering with Spark MLlib; Flight Network Graph Analysis with Spark GraphX)

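Contents

I. Iris Clustering with MLlib

1.1 Project Background

1.1.1 Background

1.1.2 Data

1.2 Project Steps (Illustrated Walkthrough)

II. Flight Network Graph Analysis with GraphX

2.1 Project Background

2.1.1 Background

2.1.2 Data

2.2 Project Steps (Illustrated Walkthrough)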


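I. Iris Clustering with MLlib

1.1 Project Background

1.1.1 Background

The data file iris.txt describes iris flowers by their measured features. The Iris dataset contains 150 samples in 3 classes of 50 samples each. For this clustering experiment only the 4 numeric attributes are kept and the class label is discarded. The goal is to group the iris samples into clusters with the K-Means clustering algorithm from the MLlib library.

1.1.2 Data

The dataset is listed below (copy and paste it into a file named iris.txt):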

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.4,3.7,1.5,0.2,Iris-setosa
4.8,3.4,1.6,0.2,Iris-setosa
4.8,3.0,1.4,0.1,Iris-setosa
4.3,3.0,1.1,0.1,Iris-setosa
5.8,4.0,1.2,0.2,Iris-setosa
5.7,4.4,1.5,0.4,Iris-setosa
5.4,3.9,1.3,0.4,Iris-setosa
5.1,3.5,1.4,0.3,Iris-setosa
5.7,3.8,1.7,0.3,Iris-setosa
5.1,3.8,1.5,0.3,Iris-setosa
5.4,3.4,1.7,0.2,Iris-setosa
5.1,3.7,1.5,0.4,Iris-setosa
4.6,3.6,1.0,0.2,Iris-setosa
5.1,3.3,1.7,0.5,Iris-setosa
4.8,3.4,1.9,0.2,Iris-setosa
5.0,3.0,1.6,0.2,Iris-setosa
5.0,3.4,1.6,0.4,Iris-setosa
5.2,3.5,1.5,0.2,Iris-setosa
5.2,3.4,1.4,0.2,Iris-setosa
4.7,3.2,1.6,0.2,Iris-setosa
4.8,3.1,1.6,0.2,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa
5.2,4.1,1.5,0.1,Iris-setosa
5.5,4.2,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
5.0,3.2,1.2,0.2,Iris-setosa
5.5,3.5,1.3,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
4.4,3.0,1.3,0.2,Iris-setosa
5.1,3.4,1.5,0.2,Iris-setosa
5.0,3.5,1.3,0.3,Iris-setosa
4.5,2.3,1.3,0.3,Iris-setosa
4.4,3.2,1.3,0.2,Iris-setosa
5.0,3.5,1.6,0.6,Iris-setosa
5.1,3.8,1.9,0.4,Iris-setosa
4.8,3.0,1.4,0.3,Iris-setosa
5.1,3.8,1.6,0.2,Iris-setosa
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
5.5,2.3,4.0,1.3,Iris-versicolor
6.5,2.8,4.6,1.5,Iris-versicolor
5.7,2.8,4.5,1.3,Iris-versicolor
6.3,3.3,4.7,1.6,Iris-versicolor
4.9,2.4,3.3,1.0,Iris-versicolor
6.6,2.9,4.6,1.3,Iris-versicolor
5.2,2.7,3.9,1.4,Iris-versicolor
5.0,2.0,3.5,1.0,Iris-versicolor
5.9,3.0,4.2,1.5,Iris-versicolor
6.0,2.2,4.0,1.0,Iris-versicolor
6.1,2.9,4.7,1.4,Iris-versicolor
5.6,2.9,3.6,1.3,Iris-versicolor
6.7,3.1,4.4,1.4,Iris-versicolor
5.6,3.0,4.5,1.5,Iris-versicolor
5.8,2.7,4.1,1.0,Iris-versicolor
6.2,2.2,4.5,1.5,Iris-versicolor
5.6,2.5,3.9,1.1,Iris-versicolor
5.9,3.2,4.8,1.8,Iris-versicolor
6.1,2.8,4.0,1.3,Iris-versicolor
6.3,2.5,4.9,1.5,Iris-versicolor
6.1,2.8,4.7,1.2,Iris-versicolor
6.4,2.9,4.3,1.3,Iris-versicolor
6.6,3.0,4.4,1.4,Iris-versicolor
6.8,2.8,4.8,1.4,Iris-versicolor
6.7,3.0,5.0,1.7,Iris-versicolor
6.0,2.9,4.5,1.5,Iris-versicolor
5.7,2.6,3.5,1.0,Iris-versicolor
5.5,2.4,3.8,1.1,Iris-versicolor
5.5,2.4,3.7,1.0,Iris-versicolor
5.8,2.7,3.9,1.2,Iris-versicolor
6.0,2.7,5.1,1.6,Iris-versicolor
5.4,3.0,4.5,1.5,Iris-versicolor
6.0,3.4,4.5,1.6,Iris-versicolor
6.7,3.1,4.7,1.5,Iris-versicolor
6.3,2.3,4.4,1.3,Iris-versicolor
5.6,3.0,4.1,1.3,Iris-versicolor
5.5,2.5,4.0,1.3,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor
6.1,3.0,4.6,1.4,Iris-versicolor
5.8,2.6,4.0,1.2,Iris-versicolor
5.0,2.3,3.3,1.0,Iris-versicolor
5.6,2.7,4.2,1.3,Iris-versicolor
5.7,3.0,4.2,1.2,Iris-versicolor
5.7,2.9,4.2,1.3,Iris-versicolor
6.2,2.9,4.3,1.3,Iris-versicolor
5.1,2.5,3.0,1.1,Iris-versicolor
5.7,2.8,4.1,1.3,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
7.1,3.0,5.9,2.1,Iris-virginica
6.3,2.9,5.6,1.8,Iris-virginica
6.5,3.0,5.8,2.2,Iris-virginica
7.6,3.0,6.6,2.1,Iris-virginica
4.9,2.5,4.5,1.7,Iris-virginica
7.3,2.9,6.3,1.8,Iris-virginica
6.7,2.5,5.8,1.8,Iris-virginica
7.2,3.6,6.1,2.5,Iris-virginica
6.5,3.2,5.1,2.0,Iris-virginica
6.4,2.7,5.3,1.9,Iris-virginica
6.8,3.0,5.5,2.1,Iris-virginica
5.7,2.5,5.0,2.0,Iris-virginica
5.8,2.8,5.1,2.4,Iris-virginica
6.4,3.2,5.3,2.3,Iris-virginica
6.5,3.0,5.5,1.8,Iris-virginica
7.7,3.8,6.7,2.2,Iris-virginica
7.7,2.6,6.9,2.3,Iris-virginica
6.0,2.2,5.0,1.5,Iris-virginica
6.9,3.2,5.7,2.3,Iris-virginica
5.6,2.8,4.9,2.0,Iris-virginica
7.7,2.8,6.7,2.0,Iris-virginica
6.3,2.7,4.9,1.8,Iris-virginica
6.7,3.3,5.7,2.1,Iris-virginica
7.2,3.2,6.0,1.8,Iris-virginica
6.2,2.8,4.8,1.8,Iris-virginica
6.1,3.0,4.9,1.8,Iris-virginica
6.4,2.8,5.6,2.1,Iris-virginica
7.2,3.0,5.8,1.6,Iris-virginica
7.4,2.8,6.1,1.9,Iris-virginica
7.9,3.8,6.4,2.0,Iris-virginica
6.4,2.8,5.6,2.2,Iris-virginica
6.3,2.8,5.1,1.5,Iris-virginica
6.1,2.6,5.6,1.4,Iris-virginica
7.7,3.0,6.1,2.3,Iris-virginica
6.3,3.4,5.6,2.4,Iris-virginica
6.4,3.1,5.5,1.8,Iris-virginica
6.0,3.0,4.8,1.8,Iris-virginica
6.9,3.1,5.4,2.1,Iris-virginica
6.7,3.1,5.6,2.4,Iris-virginica
6.9,3.1,5.1,2.3,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
6.8,3.2,5.9,2.3,Iris-virginica
6.7,3.3,5.7,2.5,Iris-virginica
6.7,3.0,5.2,2.3,Iris-virginica
6.3,2.5,5.0,1.9,Iris-virginica
6.5,3.0,5.2,2.0,Iris-virginica
6.2,3.4,5.4,2.3,Iris-virginica

1.2 Project Steps (Illustrated Walkthrough)

1) Start the Spark shell from the command line

2) Import the required packages

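3) Read the file and load the data: the textFile(..) method that SparkContext provides reads the file, and the result is transformed into an RDD.

On the RDD, use the filter operator with a regular expression to drop the iris class label, then inspect the resulting data.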
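A minimal sketch of this loading step, meant to be pasted into spark-shell, where `sc` is the SparkContext the shell creates. Three rows from the listing stand in for the file so the snippet runs on its own. The regular expression and the commented-out file path are assumptions, not necessarily what the original screenshots showed.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// In the walkthrough the lines come from the file, for example:
//   val rawData = sc.textFile("file:///path/to/iris.txt")   // hypothetical path
// Three rows copied from the listing are used here instead.
val rawData: RDD[String] = sc.parallelize(Seq(
  "5.1,3.5,1.4,0.2,Iris-setosa",
  "7.0,3.2,4.7,1.4,Iris-versicolor",
  "6.3,3.3,6.0,2.5,Iris-virginica"))

// Assumed pattern: a field is kept only if it is a plain decimal number,
// which drops the trailing class label.
val numeric = "\\d+(\\.\\d+)?"

val trainingData: RDD[Vector] = rawData.map { line =>
  Vectors.dense(line.split(",").filter(_.matches(numeric)).map(_.toDouble))
}.cache()

// Inspect the parsed feature vectors.
trainingData.collect().foreach(println)
```

Caching the parsed RDD is a common choice here because K-Means makes several passes over the data.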

4) Cluster the dataset into 2 clusters with 5 iterations, training a K-Means model
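The training call might look like the sketch below, which uses the RDD-based `KMeans.train(data, k, maxIterations)` from MLlib with the values stated in this step (2 clusters, 5 iterations). Six rows from the listing are built in memory so the snippet is self-contained in spark-shell; the variable names are illustrative.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

// Six feature vectors taken from the listing (two per species);
// in the walkthrough this would be the full parsed RDD.
val trainingData = sc.parallelize(Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(4.9, 3.0, 1.4, 0.2),
  Vectors.dense(7.0, 3.2, 4.7, 1.4),
  Vectors.dense(6.4, 3.2, 4.5, 1.5),
  Vectors.dense(6.3, 3.3, 6.0, 2.5),
  Vectors.dense(5.8, 2.7, 5.1, 1.9))).cache()

val numClusters = 2    // k, as stated in this step
val numIterations = 5  // maximum number of iterations, as stated in this step

val model: KMeansModel = KMeans.train(trainingData, numClusters, numIterations)
```

Since the data covers three species, trying `numClusters = 3` is a natural variation to compare against.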

5) Print the model's cluster centers

6) Use the predict() method to determine which cluster each sample belongs to
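Steps 5 and 6 can be sketched as follows, assuming a model trained as described in step 4. The snippet retrains on a small in-memory sample so it runs on its own in spark-shell. Cluster indices depend on the random initialization, so no specific output is shown.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Small sample from the listing so the snippet is self-contained.
val trainingData = sc.parallelize(Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(4.9, 3.0, 1.4, 0.2),
  Vectors.dense(6.3, 3.3, 6.0, 2.5),
  Vectors.dense(5.8, 2.7, 5.1, 1.9))).cache()
val model = KMeans.train(trainingData, 2, 5)

// Step 5: one center (a 4-dimensional vector) per cluster.
model.clusterCenters.foreach(center => println(s"Center: $center"))

// Step 6: predict() returns the index of the nearest center for a vector.
val assignments = trainingData.collect().map(v => (v, model.predict(v)))
assignments.foreach { case (v, cluster) => println(s"$v -> cluster $cluster") }
```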

7) Evaluate the model with the sum of squared errors (a measure of clustering quality)

8) Test the model on a single data point
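A sketch of steps 7 and 8 for spark-shell. `computeCost` on the RDD-based `KMeansModel` returns the sum of squared distances from each point to its nearest center. The test point below is a hypothetical measurement chosen for illustration.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Small sample from the listing so the snippet is self-contained.
val trainingData = sc.parallelize(Seq(
  Vectors.dense(5.1, 3.5, 1.4, 0.2),
  Vectors.dense(4.9, 3.0, 1.4, 0.2),
  Vectors.dense(6.3, 3.3, 6.0, 2.5),
  Vectors.dense(5.8, 2.7, 5.1, 1.9))).cache()
val model = KMeans.train(trainingData, 2, 5)

// Step 7: within-set sum of squared errors; lower means tighter clusters.
val wssse = model.computeCost(trainingData)
println(s"Within Set Sum of Squared Errors = $wssse")

// Step 8: assign a single (hypothetical) sample to a cluster.
val testPoint = Vectors.dense(5.0, 3.4, 1.5, 0.2)
val cluster = model.predict(testPoint)
println(s"$testPoint belongs to cluster $cluster")
```

The error always shrinks as k grows, so it is most useful for comparing runs with the same k or for spotting the point where adding clusters stops helping.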

9) Exit

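II. Flight Network Graph Analysis with GraphX

2.1 Project Background

2.1.1 Background

Use GraphX to build a flight network graph, count the airports and routes in the graph, find the longest flight route, and find the busiest airport.

2.1.2 Data

The dataset is available here:

Download link: https://pan.baidu.com/s/1bW-mwDwN6sDm4s6KGCytKA
Extraction code: 21g4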

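2.2 Project Steps (Illustrated Walkthrough)

1) Import the packages

2) Load the CSV into RDDs, with each airport as a vertex and each flight distance as an edge. Initialize the vertex set airport: RDD[(VertexId, String)], where the vertex attribute is the airport name. Initialize the edge set lines: RDD[Edge], where the edge attribute is the flight distance.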
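The sketch below shows one way to build the vertex and edge sets in spark-shell. The column layout of the real CSV is not shown here, so the rows use a hypothetical schema (`originId,originName,destId,destName,distance`) with made-up airports. Adjust the column indices to match the actual file.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical rows: originId,originName,destId,destName,distance.
// With the real data this would start from sc.textFile(...) on the CSV.
val flights = sc.parallelize(Seq(
  "1,AAA,2,BBB,300",
  "2,BBB,3,CCC,1200",
  "1,AAA,3,CCC,900",
  "3,CCC,1,AAA,900")).map(_.split(","))

// Vertices: every airport that appears as an origin or a destination.
val airportPairs = flights.flatMap(f => Seq((f(0).toLong, f(1)), (f(2).toLong, f(3))))
val airports: RDD[(VertexId, String)] = airportPairs.distinct()

// Edges: one per distinct (origin, destination, distance) combination.
val routes = flights.map(f => Edge(f(0).toLong, f(2).toLong, f(4).toInt))
val lines: RDD[Edge[Int]] = routes.distinct()

val graph: Graph[String, Int] = Graph(airports, lines)
```

Deduplicating the edges means each edge represents a route and not an individual flight, which affects how degree counts should be read later.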

3) Analyze the graph: count the airports and routes in the flight network
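Counting airports and routes maps directly onto `numVertices` and `numEdges`. The snippet builds a tiny graph with made-up airports and distances so it runs on its own in spark-shell.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Made-up airports and distances, for illustration only.
val airports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "AAA"), (2L, "BBB"), (3L, "CCC")))
val lines: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 300), Edge(2L, 3L, 1200), Edge(1L, 3L, 900), Edge(3L, 1L, 900)))
val graph = Graph(airports, lines)

val airportCount = graph.numVertices  // number of airports
val routeCount = graph.numEdges       // number of routes
println(s"Airports: $airportCount, routes: $routeCount")
// prints "Airports: 3, routes: 4" for this sample graph
```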

4) Find the longest flight route

5) Find the busiest airport, i.e. the airport with the most arriving flights
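Steps 4 and 5 can be sketched with `triplets` and `inDegrees`, again on a tiny made-up graph so the snippet runs on its own in spark-shell. Note that `inDegrees` counts incoming edges, so it equals the number of arriving flights only if every flight is its own edge. With deduplicated routes it counts arriving routes.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Made-up airports and distances, for illustration only.
val airports: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "AAA"), (2L, "BBB"), (3L, "CCC")))
val lines: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 300), Edge(2L, 3L, 1200), Edge(1L, 3L, 900), Edge(3L, 1L, 900)))
val graph = Graph(airports, lines)

// Step 4: a triplet carries both endpoint names and the distance;
// copy the fields into a tuple before sorting.
val routeInfo = graph.triplets.map(t => (t.srcAttr, t.dstAttr, t.attr))
val longest = routeInfo.sortBy(_._3, ascending = false).first()
println(s"Longest route: ${longest._1} -> ${longest._2} (${longest._3})")
// prints "Longest route: BBB -> CCC (1200)" for this sample graph

// Step 5: join in-degrees with the airport names and take the largest.
val arrivals = graph.inDegrees.join(airports).map { case (_, (inDeg, name)) => (name, inDeg) }
val busiest = arrivals.sortBy(_._2, ascending = false).first()
println(s"Busiest airport: ${busiest._1} with ${busiest._2} arriving routes")
// prints "Busiest airport: CCC with 2 arriving routes" for this sample graph
```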
