
Hadoop 2.6.0 Study Notes (2): Accessing HDFS


Lu Chunli's work notes. Who says programmers can't have a touch of artistic flair?



Accessing HDFS through the Hadoop shell and the Java API.

The cluster environment was already set up in "Work Notes: Building a Hadoop 2.6 Cluster"; the following walks through some HDFS operations.

1. Shell access to HDFS

HDFS is designed mainly for processing huge data sets, which means storing large numbers of files. HDFS splits these files into blocks and stores the blocks across different DataNodes. It provides a shell interface that hides the details of block storage; all of these operations are invoked through the bin/hadoop script.

Running the hadoop command without any arguments prints a description of all subcommands; the HDFS file operations live under hadoop fs (the other subcommands of the hadoop script are not covered here).

[hadoop@nnode ~]$ hadoop fs
Usage: hadoop fs [generic options]
        [-appendToFile <localsrc> ... <dst>]
        [-cat [-ignoreCrc] <src> ...]
        [-checksum <src> ...]
        [-chgrp [-R] GROUP PATH...]
        [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
        [-chown [-R] [OWNER][:[GROUP]] PATH...]
        [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
        [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-count [-q] [-h] <path> ...]
        [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
        [-createSnapshot <snapshotDir> [<snapshotName>]]
        [-deleteSnapshot <snapshotDir> <snapshotName>]
        [-df [-h] [<path> ...]]
        [-du [-s] [-h] <path> ...]
        [-expunge]
        [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
        [-getfacl [-R] <path>]
        [-getfattr [-R] {-n name | -d} [-e en] <path>]
        [-getmerge [-nl] <src> <localdst>]
        [-help [cmd ...]]
        [-ls [-d] [-h] [-R] [<path> ...]]
        [-mkdir [-p] <path> ...]
        [-moveFromLocal <localsrc> ... <dst>]
        [-moveToLocal <src> <localdst>]
        [-mv <src> ... <dst>]
        [-put [-f] [-p] [-l] <localsrc> ... <dst>]
        [-renameSnapshot <snapshotDir> <oldName> <newName>]
        [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
        [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
        [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
        [-setfattr {-n name [-v value] | -x name} <path>]
        [-setrep [-R] [-w] <rep> <path> ...]
        [-stat [format] <path> ...]
        [-tail [-f] <file>]
        [-test -[defsz] <path>]
        [-text [-ignoreCrc] <src> ...]
        [-touchz <path> ...]
        [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

Hadoop 2.6 recommends hdfs dfs for HDFS operations; the older hadoop dfs form is reported as deprecated in its favor (I have not worked with releases before 2.6, so I did not trace which version introduced this, but hadoop fs still works as well).

[hadoop@nnode ~]$ hdfs dfs
(The usage output is identical to the hadoop fs listing above, so it is not repeated here.)

For example:

[hadoop@nnode ~]$ hdfs dfs -ls -R /user/hadoop
-rw-r--r--   2 hadoop hadoop       2297 2015-06-29 14:44 /user/hadoop/20130913152700.txt.gz
-rw-r--r--   2 hadoop hadoop        211 2015-06-29 14:45 /user/hadoop/20130913160307.txt.gz
-rw-r--r--   2 hadoop hadoop   93046447 2015-07-18 18:01 /user/hadoop/apache-hive-1.2.0-bin.tar.gz
-rw-r--r--   2 hadoop hadoop    4139112 2015-06-28 22:54 /user/hadoop/httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r--   2 hadoop hadoop        240 2015-05-30 20:54 /user/hadoop/lucl.gz
-rw-r--r--   2 hadoop hadoop         63 2015-05-27 23:55 /user/hadoop/lucl.txt
-rw-r--r--   2 hadoop hadoop    9994248 2015-06-29 14:12 /user/hadoop/scalog.txt
-rw-r--r--   2 hadoop hadoop    2664495 2015-06-28 20:54 /user/hadoop/scalog.txt.gz
-rw-r--r--   3 hadoop hadoop   28026803 2015-06-24 21:16 /user/hadoop/test.txt.gz
-rw-r--r--   2 hadoop hadoop      28551 2015-05-27 23:54 /user/hadoop/zookeeper.out
[hadoop@nnode ~]$
# Here "." is the current directory; since I am operating as the hadoop user, it resolves to /user/hadoop.
# HDFS provides /user/{hadoop-user} by default, but you can also create your own directories under / with mkdir.
[hadoop@nnode ~]$ hdfs dfs -ls -R .
-rw-r--r--   2 hadoop hadoop       2297 2015-06-29 14:44 20130913152700.txt.gz
-rw-r--r--   2 hadoop hadoop        211 2015-06-29 14:45 20130913160307.txt.gz
-rw-r--r--   2 hadoop hadoop   93046447 2015-07-18 18:01 apache-hive-1.2.0-bin.tar.gz
-rw-r--r--   2 hadoop hadoop    4139112 2015-06-28 22:54 httpInterceptor_192.168.1.101_1_20130913160307.txt
-rw-r--r--   2 hadoop hadoop        240 2015-05-30 20:54 lucl.gz
-rw-r--r--   2 hadoop hadoop         63 2015-05-27 23:55 lucl.txt
-rw-r--r--   2 hadoop hadoop    9994248 2015-06-29 14:12 scalog.txt
-rw-r--r--   2 hadoop hadoop    2664495 2015-06-28 20:54 scalog.txt.gz
-rw-r--r--   3 hadoop hadoop   28026803 2015-06-24 21:16 test.txt.gz
-rw-r--r--   2 hadoop hadoop      28551 2015-05-27 23:54 zookeeper.out
[hadoop@nnode ~]$

If you are unsure how a particular hdfs command works, check its help text:

[hadoop@nnode ~]$ hdfs dfs -help ls
-ls [-d] [-h] [-R] [<path> ...] :
  List the contents that match the specified file pattern. If path is not
  specified, the contents of /user/<currentUser> will be listed. Directory entries
  are of the form:
        permissions - userId groupId sizeOfDirectory(in bytes) modificationDate(yyyy-MM-dd HH:mm) directoryName
  and file entries are of the form:
        permissions numberOfReplicas userId groupId sizeOfFile(in bytes) modificationDate(yyyy-MM-dd HH:mm) fileName
  -d  Directories are listed as plain files.
  -h  Formats the sizes of files in a human-readable fashion rather than a number of bytes.
  -R  Recursively list the contents of directories.
[hadoop@nnode ~]$

2. Java API access to HDFS

In Hadoop, the DataNodes store the data, while the NameNode records where that data is stored. Communication between the Hadoop components is built on RPC; the NameNode acts as the RPC server (dfs.namenode.rpc-address gives its host name and port), and Hadoop's FileSystem class is the abstract implementation of the RPC client.
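As a small illustration (my own sketch, not part of the original notes; the class name is just an example), the FileSystem client below connects to the NameNode RPC address used throughout these notes (nnode:8020) and asks for the block locations of one of the files listed earlier; the NameNode answers with the DataNode hosts that hold each block:

package com.invic.hdfs;

import java.net.URI;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Sketch only: FileSystem is the RPC client, the NameNode at nnode:8020 is the RPC server.
 * Asking for a file's block locations shows which DataNodes hold each block.
 */
public class MyBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // connects to the NameNode RPC endpoint as the hadoop user
        FileSystem fs = FileSystem.get(URI.create("hdfs://nnode:8020"), conf, "hadoop");

        FileStatus status = fs.getFileStatus(new Path("/user/hadoop/lucl.txt"));
        // the NameNode returns the metadata it keeps: block offset, length and DataNode hosts
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}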

a.) Reading HDFS data through java.net.URL

To make a Java program recognize Hadoop's hdfs:// URLs, an additional URLStreamHandlerFactory (FsUrlStreamHandlerFactory) has to be registered via URL.setURLStreamHandlerFactory(...).

This method can be called only once per JVM, so it is typically invoked in a static initializer block.
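If some other component in the same JVM may already have registered a stream handler factory, a second call to setURLStreamHandlerFactory throws a java.lang.Error. A defensive variant (my own sketch, not from the original notes; the helper class name is hypothetical) simply catches that case:

package com.invic.hdfs;

import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;

// Hypothetical helper: registers the hdfs:// URL handler at most once
// and records whether the registration actually succeeded.
public class HdfsUrlSetup {
    private static boolean registered = false;

    static {
        try {
            // only the first call in a JVM succeeds; a later call throws java.lang.Error
            URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
            registered = true;
        } catch (Error e) {
            // another component already installed a factory;
            // hdfs:// URLs may then not be resolvable through java.net.URL
        }
    }

    public static boolean isRegistered() {
        return registered;
    }
}

The complete example from the notes follows: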

package com.invic.hdfs;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

/**
 * @author lucl
 * Reads a specific file on HDFS through the java.net.URL API.
 */
public class MyHdfsOfJavaApi {

    static {
        /**
         * To make Java recognize Hadoop's hdfs:// URLs, an extra URLStreamHandlerFactory must be registered.
         * The JVM allows this call only once; if some other code has already set a factory,
         * this program will not be able to read data from Hadoop this way.
         */
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws IOException {
        String path = "hdfs://nnode:8020/user/hadoop/lucl.txt";
        InputStream in = new URL(path).openStream();
        OutputStream out = System.out;
        int buffer = 4096;
        boolean close = false;
        IOUtils.copyBytes(in, out, buffer, close);

        IOUtils.closeStream(in);
    }
}

b.) Accessing HDFS through Hadoop's FileSystem

Hadoop has an abstract notion of a file system, and HDFS is just one implementation of it. The Java abstract class org.apache.hadoop.fs.FileSystem defines the file-system interface in Hadoop.

java.lang.Object
  org.apache.hadoop.conf.Configured
      org.apache.hadoop.fs.FileSystem
          |--org.apache.hadoop.fs.FilterFileSystem
          |     |----org.apache.hadoop.fs.ChecksumFileSystem
          |           |----org.apache.hadoop.fs.LocalFileSystem
          |--org.apache.hadoop.fs.ftp.FTPFileSystem
          |--org.apache.hadoop.fs.s3native.NativeS3FileSystem
          |--org.apache.hadoop.fs.RawLocalFileSystem
          |--org.apache.hadoop.fs.viewfs.ViewFileSystem
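As a quick illustration (a sketch of my own, not part of the original notes), the concrete subclass returned by FileSystem.get depends on the URI scheme: file:// typically resolves to LocalFileSystem, while hdfs:// resolves to HDFS's DistributedFileSystem implementation (this assumes the nnode:8020 NameNode from these notes is reachable and hadoop-hdfs is on the classpath):

package com.invic.hdfs;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Illustrative only: print which FileSystem implementation backs each URI scheme.
public class MyFileSystemSchemes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        // typically org.apache.hadoop.fs.LocalFileSystem
        System.out.println("file:/// -> " + local.getClass().getName());

        FileSystem hdfs = FileSystem.get(URI.create("hdfs://nnode:8020/"), conf);
        // typically org.apache.hadoop.hdfs.DistributedFileSystem
        System.out.println("hdfs://  -> " + hdfs.getClass().getName());

        local.close();
        hdfs.close();
    }
}

The full example from the notes, which exercises most of the FileSystem API, follows: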
package com.invic.hdfs;

import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;
import java.util.Scanner;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

/**
 * @author lucl
 * Exercises the FileSystem API:
 *   FileSystem get(Configuration)            reads core-site.xml from the classpath; defaults to the local file system
 *   FileSystem get(URI, Configuration)       selects the file system to use via the URI
 *   FileSystem get(URI, Configuration, user) accesses the file system as the given user, important for security
 */
public class MyHdfsOfFS {
    private static String HOST = "hdfs://nnode";
    private static String PORT = "8020";

    private static String NAMENODE = HOST + ":" + PORT;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();

        String path = NAMENODE + "/user/";

        /**
         * Since this points at HDFS's /user directory, relative paths below
         * are resolved against the HDFS home directory of the given user.
         */
        String user = "hadoop";
        FileSystem fs = null;
        try {
            fs = FileSystem.get(URI.create(path), conf, user);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        if (null == fs) {
            return;
        }

        /**
         * Create a directory tree recursively.
         */
        boolean mkdirs = fs.mkdirs(new Path("invic/test/mvtech"));
        if (mkdirs) {
            System.out.println("Dir 'invic/test/mvtech' create success.");
        }

        /**
         * Check whether the directory exists.
         */
        boolean exists = fs.exists(new Path("invic/test/mvtech"));
        if (exists) {
            System.out.println("Dir 'invic/test/mvtech' exists.");
        }

        /**
         * FSDataInputStream supports random access (seek).
         * By default lucl.txt would be looked up as /user/Administrator/lucl.txt,
         * because I run this from Eclipse on Windows; since the get() call above
         * specifies a user, the lookup path becomes /user/<that user>/lucl.txt.
         */
        FSDataInputStream in = fs.open(new Path("lucl.txt"));

        OutputStream os = System.out;

        int buffSize = 4098;

        boolean close = false;

        IOUtils.copyBytes(in, os, buffSize, close);

        System.out.println("\r\nSeek back to the beginning and read the file again......");
        in.seek(0);
        IOUtils.copyBytes(in, os, buffSize, close);

        IOUtils.closeStream(in);

        /**
         * Create a file.
         */
        FSDataOutputStream create = fs.create(new Path("sample.txt"));
        create.write("This is my first sample file.".getBytes());
        create.flush();
        create.close();

        /**
         * Copy a local file to HDFS.
         */
        fs.copyFromLocalFile(new Path("F:\\Mvtech\\ftpfile\\cg-10086.com.csv"),
                new Path("cg-10086.com.csv"));

        /**
         * Append to a file.
         */
        FSDataOutputStream append = fs.append(new Path("sample.txt"));
        append.writeChars("\r\n");
        append.writeChars("New day, new World.");
        append.writeChars("\r\n");

        IOUtils.closeStream(append);

        /**
         * Using a Progressable callback.
         */
        FSDataOutputStream progress = fs.create(new Path("progress.txt"),
                new Progressable() {

            @Override
            public void progress() {
                System.out.println("write is in progress......");
            }
        });

        // Write keyboard input to HDFS.
        Scanner sc = new Scanner(System.in);
        System.out.print("Please type your enter : ");
        String name = sc.nextLine();
        while (!"quit".equals(name)) {
            if (null == name || "".equals(name.trim())) {
                // skip blank input and prompt again
                System.out.print("Please type your enter : ");
                name = sc.nextLine();
                continue;
            }
            progress.writeChars(name);

            System.out.print("Please type your enter : ");
            name = sc.nextLine();
        }
        IOUtils.closeStream(progress);

        /**
         * List files recursively.
         */
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(path), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            System.out.println(loc.getPath().getName() + "|" + loc.getLen() + "|"
                    + loc.getOwner());
        }

        /**
         * File or directory metadata: length, block size, replication, modification time, owner and permissions.
         */
        FileStatus status = fs.getFileStatus(new Path("lucl.txt"));
        System.out.println(status.getPath().getName() + "|" +
                status.getPath().getParent().getName() + "|" + status.getBlockSize() + "|"
                + status.getReplication() + "|" + status.getOwner());

        /**
         * listStatus lists the files in a directory; if the argument is a file,
         * it returns a FileStatus array of length 1.
         */
        fs.listStatus(new Path(path));
        fs.listStatus(new Path(path), new PathFilter() {

            @Override
            public boolean accept(Path tmpPath) {
                String tmpName = tmpPath.getName();
                if (tmpName.endsWith(".txt")) {
                    return true;
                }
                return false;
            }
        });

        // A group of paths can also be passed in; the results are merged into a single array.
        // fs.listStatus(Path [] files);
        FileStatus[] mergeStatus = fs.listStatus(new Path[] { new Path("lucl.txt"),
                new Path("progress.txt"), new Path("sample.txt") });
        Path[] listPaths = FileUtil.stat2Paths(mergeStatus);
        for (Path p : listPaths) {
            System.out.println(p);
        }

        /**
         * File pattern matching (globbing).
         */
        FileStatus[] patternStatus = fs.globStatus(new Path("*.txt"));
        for (FileStatus stat : patternStatus) {
            System.out.println(stat.getPath());
        }

        /**
         * Delete data.
         */
        boolean recursive = true;
        fs.delete(new Path("demo.txt"), recursive);

        fs.close();
    }
}

c.) Accessing the HDFS cluster

package com.invic.hdfs;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.log4j.Logger;

/**
 * @author lucl
 * Accesses HDFS through the (HA-enabled) Hadoop cluster configuration.
 */
public class MyClusterHdfs {
    public static void main(String[] args) throws IOException {
        System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");

        Logger logger = Logger.getLogger(MyClusterHdfs.class);

        Configuration conf = new Configuration();

        conf.set("fs.defaultFS", "hdfs://cluster");
        conf.set("dfs.nameservices", "cluster");
        conf.set("dfs.ha.namenodes.cluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.cluster.nn1", "nnode:8020");
        conf.set("dfs.namenode.rpc-address.cluster.nn2", "dnode1:8020");
        conf.set("dfs.client.failover.proxy.provider.cluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);

        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus loc = it.next();
            logger.info(loc.getPath().getName() + "|" + loc.getLen() + "|" + loc.getOwner());
        }

        /*
        for (int i = 0; i < 500; i++) {
            String str = "the sequence is " + i;
            logger.info(str);
        }
        */

        try {
            Thread.sleep(10);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        System.exit(0);
    }
}

Notes:

System.setProperty("hadoop.home.dir", "E:\\Hadoop\\hadoop-2.6.0\\hadoop-2.6.0\\");# 在main方法的第一行配置hadoop的home路径,否则在Windows下可能报错如下:15/07/19 22:05:54 DEBUG util.Shell: Failed to detect a valid hadoop home directoryjava.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.    at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:302)    at org.apache.hadoop.util.Shell.(Shell.java:327)    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)    at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:170)    at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)    at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)15/07/19 22:05:54 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary pathjava.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)    at org.apache.hadoop.util.Shell.(Shell.java:363)    at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:438)    at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:484)    at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:170)    at org.apache.hadoop.util.GenericOptionsParser.(GenericOptionsParser.java:153)    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)    at com.invic.mapreduce.wordcount.WordCounterTool.main(WordCounterTool.java:29)