HDFS Java API
Starting Point
There are two important Java classes that serve as the starting point for using the Java API.
Up to Hadoop Version 1.x
org.apache.hadoop.fs.FileSystem
Hadoop Version 2.x and later
org.apache.hadoop.fs.FileContext
The FileContext class has better handling of multiple file systems (a single FileContext can resolve paths on multiple file systems) and has a cleaner, more consistent interface. Having said that, FileSystem is still more widely used.
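As a minimal sketch, obtaining a FileContext and opening a file with it looks roughly like the following; the path reuses the sample file hdfs://localhost/user/hadoop/data/input.txt used later in this article.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
public class FileContextExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // A single FileContext resolves the target file system from the path itself
        FileContext fc = FileContext.getFileContext(conf);
        FSDataInputStream in = fc.open(new Path("hdfs://localhost/user/hadoop/data/input.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}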
Reading Data from Hadoop URL (Hadoop Version 1.x)
The simplest way to read a file from the Hadoop file system is to use a java.net.URL object to open a stream and read the data. The URL is of the form hdfs://host-machine/full-path.
There is additional work involved, though, to make Java understand the hdfs:// scheme. The setURLStreamHandlerFactory() method of the URL class needs to be called with an instance of the FsUrlStreamHandlerFactory class. The setURLStreamHandlerFactory() method can only be called once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (maybe a third-party component) calls setURLStreamHandlerFactory(), you won't be able to use this approach for reading data from Hadoop.
Because of this limitation, this approach is not recommended; a more elegant approach to reading and writing data to HDFS is described below.
For completeness, the following code reads an HDFS file located at hdfs://localhost/user/hadoop/data/input.txt and writes it to standard output.
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import java.net.URL;
import java.io.InputStream;
public class ReadHDFSFile {
    // Register the handler for the hdfs:// scheme; this can only be done once per JVM
    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL("hdfs://localhost/user/hadoop/data/input.txt").openStream();
            IOUtils.copyBytes(in, System.out, 4096, false); // 4096-byte buffer; do not close the streams
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Getting an Instance of the FileSystem Object
FileSystem is a general filesystem API. Several static factory methods are provided to retrieve the appropriate instance for the file system we want to use, such as the local file system or HDFS.
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user)
throws IOException
The Configuration object encapsulates the client's or server's configuration, which is set using configuration files (for example, core-site.xml) read from the classpath.
To get only the local file system, use:
public static LocalFileSystem getLocal(Configuration conf) throws IOException
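As an illustration, the following minimal sketch obtains FileSystem instances in three ways; the hdfs://localhost URI is the same placeholder used in the sample programs below.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
public class GetFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml etc. from the classpath
        // File system named by fs.defaultFS in the configuration
        FileSystem defaultFs = FileSystem.get(conf);
        // File system identified explicitly by a URI
        FileSystem hdfs = FileSystem.get(new URI("hdfs://localhost"), conf);
        // Local file system only
        LocalFileSystem local = FileSystem.getLocal(conf);
    }
}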
Input and Output Streams
For reading data
org.apache.hadoop.fs.FSDataInputStream
Use the following methods to open an input stream:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
For writing data
org.apache.hadoop.fs.FSDataOutputStream
Use the following method to open an output stream:
public FSDataOutputStream create(Path f) throws IOException
FSDataInputStream
The default buffer size of FSDataInputStream is 4 KB.
FSDataInputStream also supports random access through the Seekable and PositionedReadable interfaces. However, seeking within a file is a relatively expensive operation and is not recommended; ideally, applications should be designed to read an entire file serially (as MapReduce does, for example).
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length)
        throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
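For illustration, the following minimal sketch reads the sample file twice, using seek() to return to the beginning; the path and fs.defaultFS value match the sample programs in this article.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.IOUtils;
public class HDFSSeekRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/data/input.txt"));
        try {
            IOUtils.copyBytes(in, System.out, 4096, false); // first pass over the file
            in.seek(0);                                     // jump back to the start
            IOUtils.copyBytes(in, System.out, 4096, false); // second pass
        } finally {
            IOUtils.closeStream(in);
        }
    }
}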
Sample Program to Read a File from HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path("/user/hadoop/data/input.txt"));
        byte[] buff = new byte[2048];
        int bytesRead;
        // Write only the bytes actually read in each iteration
        while ((bytesRead = in.read(buff)) != -1) {
            System.out.write(buff, 0, bytesRead);
        }
        in.close();
    }
}
FSDataOutputStream
There are many overloaded versions of create() that return an FSDataOutputStream. The simplest just creates a new file at the given path:
public FSDataOutputStream create(Path f) throws IOException
Some functionality provided by overloaded methods:
- Force overwriting existing files
- Set replication factor of a file
- Buffer size to use while writing a file
- Block size of a file
- Permissions to be set while creating a file
There's also an overloaded method for passing a callback interface, Progressable, so the application can be notified of the progress of the data being written to the datanodes:
public interface Progressable {
    public void progress();
}
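As a rough sketch, the create(Path, Progressable) overload can be used as shown below; the path reuses the one from the write example later in this article, and the callback simply prints a dot each time progress() is called.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.util.Progressable;
public class HDFSProgressWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        // progress() is invoked periodically as data is written to the datanodes
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/data/hello.txt"),
                new Progressable() {
                    public void progress() {
                        System.out.print(".");
                    }
                });
        out.writeBytes("hello world");
        out.close();
    }
}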
As an alternative to creating a new file, an existing file can be appended to using:
public FSDataOutputStream append(Path f) throws IOException
Sample Program to Write to HDFS
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/user/hadoop/data/hello.txt"));
        out.writeBytes("hello world");
        out.close();
    }
}
Summary
The HDFS Java API supports many more operations for working with the file system. Two commonly used ones are shown below.
Create Directory
public boolean mkdirs(Path f) throws IOException
Deleting Files and Directories
public boolean delete(Path f, boolean recursive) throws IOException
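As a brief sketch, these two methods can be exercised as follows; the directory path /user/hadoop/data/tmp is a hypothetical placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
public class HDFSDirOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost");
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/hadoop/data/tmp"); // hypothetical directory
        boolean created = fs.mkdirs(dir);       // also creates missing parent directories
        boolean deleted = fs.delete(dir, true); // true = delete recursively
        System.out.println("created: " + created + ", deleted: " + deleted);
    }
}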