
Study Notes on "Hadoop: The Definitive Guide" (1)


This post contains my notes on Section 3.5 of Hadoop: The Definitive Guide, mainly my implementations of the example programs in that section; some of the implementations have been modified.

1 Reading Data from Hadoop

First, create a text file test.txt to use for testing.

hadoop fs -mkdir /poems                              // create a directory named poems on the Hadoop cluster
hadoop fs -copyFromLocal test.txt /poems/test.txt    // upload the local test.txt to the Hadoop cluster
hadoop fs -ls /poems                                 // check that the upload succeeded

If the upload went through, you will see the file under the poems directory. That completes the preparation; next we read the test.txt file on the Hadoop cluster in two different ways.

1.1 Reading with a URL

Create a URLCat class as follows. The static block registers FsUrlStreamHandlerFactory so that java.net.URL can open hdfs:// URLs; this factory can be set at most once per JVM, which is why it is done in a static initializer.

package com.tuan.hadoopLearn.hdfs;

import java.net.URL;
import java.io.InputStream;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

public class URLCat {
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

After maven install, go to the directory that contains the project jar (usually somewhere under the local .m2 repository) and run the command below in cmd. Make sure both paths are correct: the fully qualified name of the URLCat class and the path on the Hadoop cluster.

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.URLCat hdfs:/poems/test.txt    // run the URLCat class packaged in the jar

You may hit a "could not find or load main class" error here. There are usually two causes: either the CLASSPATH environment variable is not configured (for Java 1.8, CLASSPATH should be set to ".;%JAVA_HOME%\lib\dt.jar;%JAVA_HOME%\lib\tools.jar"; note the leading ".", which stands for the current directory), or the class name on the command line is not fully qualified.
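If you are not sure what the fully qualified class name is, one quick check (a sketch, assuming the jar name produced by maven install) is to list the jar's contents from cmd and look for the class entry:

jar tf hadoopLearn-0.0.1-SNAPSHOT.jar | findstr URLCat

An entry such as com/tuan/hadoopLearn/hdfs/URLCat.class gives you the name to pass to hadoop jar: the same path with dots instead of slashes and without the .class suffix.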

Once it runs successfully, URLCat reads the test.txt file uploaded earlier and prints it to the console.

1.2 Reading with FileSystem

Similarly, write a FileSystemCat class:

package com.tuan.hadoopLearn.hdfs;

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) {
        String uri = args[0];
        Configuration conf = new Configuration();
        InputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Again after maven install, run the following command in cmd; the contents of test.txt are printed:

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.FileSystemCat hdfs:/poems/test.txt

1.3 Seeking to a Read Position

Replacing java.io's InputStream with Hadoop's FSDataInputStream makes it possible to read a file from a specified offset. Create a FileSystemDoubleCat class as follows:

package com.tuan.hadoopLearn.hdfs;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) {
        String uri = args[0];
        Configuration conf = new Configuration();
        FSDataInputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            System.out.println("\n------------------------------------------------------------------");
            in.seek(4);
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

Run the analogous cmd command below. In the second printout the first four characters are missing, because seek(4) was executed before the second copy.

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.FileSystemDoubleCat hdfs:/poems/test.txt
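Besides seek(), FSDataInputStream also supports positioned reads, which read from a given offset without moving the current file position. This is not part of the example above; the class below is only a sketch of that API (the class name PositionedReadCat is my own):

package com.tuan.hadoopLearn.hdfs;

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PositionedReadCat {
    public static void main(String[] args) {
        String uri = args[0];
        FSDataInputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
            in = fs.open(new Path(uri));
            byte[] buffer = new byte[16];
            // read up to 16 bytes starting at offset 4; the stream's own position is left untouched
            int n = in.read(4L, buffer, 0, buffer.length);
            if (n > 0) {
                System.out.println(new String(buffer, 0, n));
            }
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}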

2 Writing Data to Hadoop

Create a poem1.txt file locally; we will upload it to the Hadoop cluster from Java in a moment.

Create a FileCopyWithProgress class:

package com.tuan.hadoopLearn.hdfs;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = null;
        OutputStream out = null;
        Configuration conf = new Configuration();
        try {
            in = new BufferedInputStream(new FileInputStream(localSrc));
            FileSystem fs = FileSystem.get(URI.create(dst), conf);
            out = fs.create(new Path(dst), new Progressable() {
                @Override
                public void progress() { // callback invoked during the upload to show progress
                    System.out.print(".");
                }
            });
            IOUtils.copyBytes(in, out, 4096, true); // 'true' closes both streams when the copy finishes
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}

Run the following cmd commands to upload poem1.txt to the Hadoop cluster:

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.FileCopyWithProgress C:\Users\Lenovo\Desktop\poem1.txt hdfs:/poems/poem1.txt
hadoop fs -ls /poems    // check that the file was uploaded successfully
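Besides -ls, you can also print the uploaded file directly to double-check its contents:

hadoop fs -cat /poems/poem1.txt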

3 Querying File Information

3.1 Reading File Status

This differs a bit from the book's example: the book sets up a small test cluster, while here we test directly against the real cluster. Create a ShowFileStatusTest class:

package com.tuan.hadoopLearn.hdfs;

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileStatus;

public class ShowFileStatusTest {
    public static void main(String[] args) {
        String uri = args[0];
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
            FileStatus stat = fs.getFileStatus(new Path(uri));
            if (stat.isDir()) {
                System.out.println(uri + " is a directory");
            } else {
                System.out.println(uri + " is a file");
            }
            System.out.println("The path is " + stat.getPath());
            System.out.println("The length is " + stat.getLen());
            System.out.println("The modification time is " + stat.getModificationTime());
            System.out.println("The permission is " + stat.getPermission().toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Run the cmd commands:

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.ShowFileStatusTest hdfs:/poems             // read the status of the poems directory
hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.ShowFileStatusTest hdfs:/poems/poem1.txt   // read the status of poem1.txt

3.2 Listing Files in Directories

This time we list the files under two directories at once. Create another directory, jokes, and put a joke1.txt into it.

Write a ListStatus class:

package com.tuan.hadoopLearn.hdfs;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {
    public static void main(String[] args) {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            Path[] paths = new Path[args.length];
            for (int i = 0; i < args.length; i++) {
                paths[i] = new Path(args[i]);
            }
            FileStatus[] stats = fs.listStatus(paths);
            Path[] listPaths = FileUtil.stat2Paths(stats);
            for (Path p : listPaths) {
                System.out.println(p);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

The cmd command:

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.ListStatus hdfs:/poems hdfs:/jokes

3.3 Path Filtering

Create a RegexExcludedPathFilter class, which excludes the paths under a given directory that match a regular expression and lists the rest.

package com.tuan.hadoopLearn.hdfs;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludedPathFilter implements PathFilter {
    private final String reg;

    public RegexExcludedPathFilter(String reg) {
        this.reg = reg;
    }

    @Override
    public boolean accept(Path path) {
        return !path.toString().matches(reg);
    }

    public static void main(String[] args) {
        String uri = args[0];
        String reg = args[1];
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus[] stats = fs.globStatus(new Path(uri), new RegexExcludedPathFilter(reg));
            Path[] listPaths = FileUtil.stat2Paths(stats);
            for (Path p : listPaths) {
                System.out.println(p);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Run the cmd command below. I was stuck here for quite a while: the poems directory holds two files, poem1.txt and test.txt, and I tried to exclude test.txt with /poems/t.*, which never worked; changing the pattern to .*/poems/t.* did. The glob pattern and the PathFilter apparently see paths in different forms: the path handed to the PathFilter is the complete (fully qualified) path, so the regex has to match the whole thing. This is still only my guess, to be verified later (see the small check after the command).

hadoop jar hadoopLearn-0.0.1-SNAPSHOT.jar com.tuan.hadoopLearn.hdfs.RegexExcludedPathFilter /poems/* .*/poems/t.*
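The behaviour is easy to reproduce with plain Java, which supports the guess above: accept() is handed the fully qualified path, and String.matches() must cover the entire string, so the pattern needs a leading .* to absorb the scheme and authority. A minimal sketch (the host and port below are made up):

public class MatchesDemo {
    public static void main(String[] args) {
        // A fully qualified HDFS path, roughly what PathFilter.accept() sees
        String full = "hdfs://localhost:9000/poems/test.txt";
        System.out.println(full.matches("/poems/t.*"));   // false: matches() must match the whole string
        System.out.println(full.matches(".*/poems/t.*")); // true: the leading .* absorbs scheme and authority
    }
}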

 
