
Data Cleaning and Data Statistics with Hive on Hadoop

Doing data statistics with HDFS + Hive:
  1. Collect the data into Hadoop HDFS
  2. Clean the data with an ETL (MapReduce) job
  3. Update the target metadata
  4. Attach the cleaned data to a Hive external table
Create the project

Add the Hadoop MapReduce client dependency in pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.xzdream.hive</groupId>
  <artifactId>xzdream-hive</artifactId>
  <version>1.0</version>

  <name>xzdream-hive</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <!-- Hadoop version -->
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
  </properties>

  <!-- Add the Cloudera (CDH) repository -->
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- Hadoop dependency -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
Write the program, clean the data, and save the result to HDFS
Prepare the data to be cleaned, rinse.txt (note the malformed first line, which the cleaning step should drop):

127.0.0.1 http://www.localhost.com a
192.168.1.1 http://www.xzdream.cn
192.168.2.3 http://blog.xzdream.cn
Put the file into HDFS:

hadoop$ ./hadoop fs -mkdir -p /hive/rinse
hadoop$ ./hadoop fs -put /Users/hadoop/data/rinse.txt /hive/rinse
Package the project into a jar (e.g. with mvn clean package) and submit it to Hadoop, passing the input directory and the output directory as arguments:

hadoop$ ./hadoop jar /Users/hadoop/libs/xzdream-hive-1.0.jar com.xzdream.hive.mapreduce.driver.LogETLDriver /hive/rinse /hive/rinse/day=20200606
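
The source of the job inside that jar is not shown in this article. Below is a minimal sketch of what LogETLDriver might look like, assuming the cleaning rule is simply "keep lines that consist of exactly an IP and a URL, and re-emit them tab-separated" (which matches the final table contents shown later). The class and package names come from the command above; everything else is an assumption:

// Hypothetical reconstruction -- the article does not include the ETL source.
package com.xzdream.hive.mapreduce.driver;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogETLDriver {

    // Keep only well-formed "ip url" records; re-emit them tab-separated so
    // they match the Hive table's FIELDS TERMINATED BY '\t'.
    public static class LogETLMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Text record = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length != 2) {
                return; // drop malformed rows, e.g. the three-field line in rinse.txt
            }
            record.set(fields[0] + "\t" + fields[1]);
            context.write(record, NullWritable.get());
        }
    }

    // Reducer deduplicates identical records; a reduce phase would also explain
    // why the output file seen below is part-r-00000 rather than part-m-00000.
    public static class LogETLReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "log-etl");
        job.setJarByClass(LogETLDriver.class);
        job.setMapperClass(LogETLMapper.class);
        job.setReducerClass(LogETLReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /hive/rinse
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /hive/rinse/day=20200606
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}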
View the cleaned data in the job's output directory:

hadoop$ ./hadoop fs -cat /hive/rinse/day=20200606/part-r-00000
20/06/06 17:33:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
192.168.1.1 http://www.xzdream.cn
192.168.2.3 http://blog.xzdream.cn
Change the Hive metastore database's character set (run in MySQL):

alter database hive_db character set latin1;
FLUSH PRIVILEGES;
Use Hive for the statistics.

1: Create a partitioned external table (external, so dropping the table leaves the HDFS data intact; the day partition column matches the day=20200606 directory layout used above):

create external table rinse(
  ip string,
  domain string
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/hive/rinse/access/clear';
Move the cleaned data under the table's location:

hadoop$ ./hadoop fs -mkdir -p /hive/rinse/access/clear/day=20200606/
hadoop$ ./hadoop fs -mv /hive/rinse/day=20200606/part-r-00000 /hive/rinse/access/clear/day=20200606/
Register the partition so Hive can see the data:

alter table rinse add if not exists partition(day='20200606');
hive (default)> select * from rinse;
OK
192.168.2.3 http://blog.xzdream.cn 20200606
192.168.1.1 http://www.xzdream.cn 20200606
Time taken: 0.115 seconds, Fetched: 2 row(s)
hive (default)>
Finally, count records per domain:

hive (default)> select count(*),domain from rinse group by domain;
1 http://blog.xzdream.cn
1 http://www.xzdream.cn

 
