HDFS Java API
Starting Point
There are two important Java classes that serve as the starting point for using the Java API.
Up to Hadoop Version 1.x
org.apache.hadoop.fs.FileSystem
Hadoop Version 2.x and later
org.apache.hadoop.fs.FileContext
The FileContext class has better handling of multiple file systems (a single FileContext can resolve paths on multiple file systems) and has a cleaner, more consistent interface. Having said that, FileSystem is still more widely used.
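As a minimal sketch, obtaining a FileContext and opening a file with it looks roughly like the following; the path reuses the sample file hdfs://localhost/user/hadoop/data/input.txt used later in this article.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileContextExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A single FileContext resolves the target file system from the path itself
        FileContext fc = FileContext.getFileContext(conf);
        FSDataInputStream in = fc.open(new Path("hdfs://localhost/user/hadoop/data/input.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}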
Reading Data from Hadoop URL (Hadoop Version 1.x)
The simplest way to read a file from the Hadoop file system is to use a java.net.URL object to open a stream and read the data. The URL is of the form hdfs://host-machine/full-path.
There is additional work involved, though, to make Java understand the hdfs:// scheme. The setURLStreamHandlerFactory() method of the URL class needs to be called with an instance of the FsUrlStreamHandlerFactory class. The setURLStreamHandlerFactory() method can only be called once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (maybe a third-party component) calls setURLStreamHandlerFactory(), you won't be able to use this approach for reading data from Hadoop.
Because of this limitation, this approach is not recommended; a more elegant approach to reading and writing data to HDFS is described below.
For completeness, the following code reads an HDFS file located at hdfs://localhost/user/hadoop/data/input.txt and writes it to standard output.
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import java.net.URL;
import java.io.InputStream;
public class ReadHDFSFile {
    // Register the handler for the hdfs:// scheme; this can only be done once per JVM
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL("hdfs://localhost/user/hadoop/data/input.txt").openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // 4096-byte buffer; do not close the streams
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Getting an Instance of the FileSystem Object
FileSystem is a general filesystem API. Several static factory methods are provided to retrieve the appropriate instance for the file system we want to use, such as the local file system or HDFS.
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user)
throws IOException
The Configuration object encapsulates the client's or server's configuration, which is set using configuration files (for example, core-site.xml) read from the classpath.
To get only the local file system, use:
public static LocalFileSystem getLocal(Configuration conf) throws IOException
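As an illustration, the following minimal sketch obtains FileSystem instances in three ways; the hdfs://localhost URI is the same placeholder used in the sample programs below.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
public class GetFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml etc. from the classpath
        // File system named by fs.defaultFS in the configuration
        FileSystem defaultFs = FileSystem.get(conf);
        // File system identified explicitly by a URI
        FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost"), conf);
        // Local file system only
        LocalFileSystem local = FileSystem.getLocal(conf);
    }
}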
Input and Output Streams
For reading data
org.apache.hadoop.fs.FSDataInputStream
Use the following methods to open an input stream:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
For writing data
org.apache.hadoop.fs.FSDataOutputStream
Use the following method to open an output stream:
public FSDataOutputStream create(Path f) throws IOException
FSDataInputStream
The default buffer size of FSDataInputStream is 4 KB.
FSDataInputStream also supports random access through the Seekable and PositionedReadable interfaces. However, seeking within a file is a relatively expensive operation and is not recommended; ideally, applications should be designed to read an entire file serially (as MapReduce does, for example).
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length)
        throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
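For illustration, the following minimal sketch reads the sample file twice, using seek() to return to the beginning; the path and fs.defaultFS value match the sample programs in this article.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
public class HDFSSeekRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/data/input.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false); // first pass over the file
            in.seek(0);                                     // jump back to the start
            IOUtils.copyBytes(in, System.out, 4096, false); // second pass
        } finally {
            IOUtils.closeStream(in);
        }
    }
}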
Sample Program to Read a File from HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/data/input.txt"));
        byte[] buff = new byte[2048];
        int bytesRead;
        // Write only the bytes actually read in each iteration
        while ((bytesRead = in.read(buff)) != -1) {
            System.out.write(buff, 0, bytesRead);
        }
        in.close();
    }
}
FSDataOutputStream
There are many overloaded versions of create() that return an FSDataOutputStream. The simplest just creates a new file at the given path:
public FSDataOutputStream create(Path f) throws IOException
Some functionality provided by overloaded methods:
- Force overwriting existing files
- Set replication factor of a file
- Buffer size to use while writing a file
- Block size of a file
- Permissions to be set while creating a file
There's also an overloaded method for passing a callback interface, Progressable, so the application can be notified of the progress of the data being written to the datanodes:
public interface Progressable {
    public void progress();
}
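As a rough sketch, the create(Path, Progressable) overload can be used as shown below; the path reuses the one from the write example later in this article, and the callback simply prints a dot each time progress() is called.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.Progressable;
public class HDFSProgressWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        // progress() is invoked periodically as data is written to the datanodes
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/data/hello.txt"),
                new Progressable() {
                    public void progress() {
                        System.out.print(".");
                    }
                });
        out.writeBytes("hello world");
        out.close();
    }
}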
As an alternative to creating a new file, an existing file can be appended to using:
public FSDataOutputStream append(Path f) throws IOException
Sample Program to Write to HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/data/hello.txt"));
        out.writeBytes("hello world");
        out.close();
    }
}
Summary
The HDFS Java API supports many more operations for working with the file system. Two commonly used ones are shown below.
Create Directory
public boolean mkdirs(Path f) throws IOException
Deleting Files and Directories
public boolean delete(Path f, boolean recursive) throws IOException
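As a brief sketch, these two methods can be exercised as follows; the directory path /user/hadoop/data/tmp is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSDirOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/hadoop/data/tmp"); // hypothetical directory
        boolean created = fs.mkdirs(dir);       // also creates missing parent directories
        boolean deleted = fs.delete(dir, true); // true = delete recursively
        System.out.println("created: " + created + ", deleted: " + deleted);
    }
}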