What is Apache Hadoop?

Apache Hadoop is a free, open-source, Java-based software platform for storing and analyzing large datasets on clusters of machines. It keeps its data in the Hadoop Distributed File System (HDFS) and processes it with MapReduce. Hadoop is commonly used for machine learning and data mining workloads, and for spreading that work across many dedicated servers.

Many industries have adopted Apache Hadoop as a standard framework for storing and processing big data. Hadoop is designed to be deployed across a network of hundreds or even thousands of dedicated servers, all working together to handle the volume and variety of incoming data. A fully developed Hadoop platform also includes a collection of tools that extend the core Hadoop framework.

The image below gives an overview of the architecture. A detailed structure can be found here. Understanding the basic architecture helps you make sense of what you’re configuring.


So we begin our journey. If you run into problems along the way, refer to the common errors that people hit when installing Hadoop.

Let’s begin.

Step 1: Create a User for the Hadoop Environment

Hadoop should have its own dedicated user account on your system. To create one, open a terminal (Ctrl + Alt + T) and type the following command. You’ll also be prompted to create a password for the account. You are free to use any username and password you see fit; I’m adding a user named hadoop.

sudo adduser hadoop

Step 2: Install Java

The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment (JRE) and Java Development Kit (JDK). Use the following command to refresh your package index before starting a new installation:

sudo apt update

Install OpenJDK 8, a Java version that Hadoop 3.3 supports:

sudo apt install openjdk-8-jdk -y

Once installed, verify the installed version of Java with the following command:

java -version

The output should report an OpenJDK 1.8.0 version; the exact build number depends on your system.

Step 3: Install OpenSSH on Ubuntu

Install the OpenSSH server and client using the following command:

sudo apt install openssh-server openssh-client -y

Switch to the created user.

sudo su - hadoop

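Generate a public and private key pair. Press Enter at each prompt to accept the default file location and an empty passphrase, so that logging in later needs no password.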

$ ssh-keygen -t rsa

Add the generated public key from id_rsa.pub to authorized_keys.

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Change the permissions of the authorized_keys file.

$ chmod 640 ~/.ssh/authorized_keys
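
Running the append step more than once adds the same key to authorized_keys each time. A minimal sketch of a repeatable version follows; the helper name append_key_once is hypothetical, and the 640 mode simply mirrors the command above.

```shell
# Hypothetical helper: add a public key to an authorized_keys file only once.
append_key_once() {
  pub_file="$1"    # for example ~/.ssh/id_rsa.pub
  auth_file="$2"   # for example ~/.ssh/authorized_keys
  touch "$auth_file"
  # -x matches whole lines, -F treats the key as a literal string.
  if ! grep -qxF -- "$(cat "$pub_file")" "$auth_file"; then
    cat "$pub_file" >> "$auth_file"
  fi
  chmod 640 "$auth_file"
}

# Apply it to the key generated above, if that key exists.
if [ -f "$HOME/.ssh/id_rsa.pub" ]; then
  append_key_once "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh/authorized_keys"
fi
```

Calling the helper a second time leaves the file unchanged, so the step is safe to repeat.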

Verify that password-less SSH works.

$ ssh localhost

Step 4: Install Apache Hadoop

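Download a stable release of Hadoop. This tutorial uses version 3.3.2; to get the latest version, go to the official Apache Hadoop download page and adjust the version number in the commands below. Releases that are no longer hosted on downloads.apache.org are kept at archive.apache.org/dist/hadoop/common/.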

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz

Extract the downloaded file.

$ tar -xvzf hadoop-3.3.2.tar.gz

Rename the extracted directory to hadoop so that later paths stay short:

mv hadoop-3.3.2 hadoop

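Next, find the location of your Java installation, which you will need for the “JAVA_HOME” variable when configuring Hadoop. Resolving javac rather than java gives the JDK directory itself; with OpenJDK 8, resolving java yields its jre subdirectory instead.

dirname $(dirname $(readlink -f $(which javac)))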

Step 5: Configure Hadoop

Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node.

This setup, also called pseudo-distributed mode, runs each Hadoop daemon as a separate Java process on the same machine. A Hadoop environment is configured by editing a set of configuration files:

.bashrc, hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml

Apart from .bashrc, which lives in your home directory, these files are found in the etc/hadoop subdirectory of the newly created hadoop folder.
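
Before editing, it can help to confirm that the configuration files are where you expect them. The sketch below prints any that are missing; the helper name missing_conf_files is hypothetical, and the fallback path assumes Hadoop was extracted and renamed to hadoop in your home directory as in Step 4.

```shell
# Hypothetical helper: print the Hadoop config files missing from a directory.
missing_conf_files() {
  conf_dir="$1"
  for f in hadoop-env.sh core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml; do
    [ -f "$conf_dir/$f" ] || echo "$f"
  done
}

# Uses $HADOOP_HOME if it is already set, otherwise assumes ~/hadoop.
missing_conf_files "${HADOOP_HOME:-$HOME/hadoop}/etc/hadoop"
```

No output means all five files were found.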

Step 5a: Configure Hadoop Environment Variables (bashrc)

Edit file ~/.bashrc to configure the Hadoop environment variables.

$ nano ~/.bashrc

Add the following lines to the file. Save and close the file.

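export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# Adjust HADOOP_HOME if you extracted Hadoop somewhere other than the hadoop user's home directory.
export HADOOP_HOME=/home/hadoop/hadoop
# Put the hdfs command and the start-dfs.sh and start-yarn.sh scripts on the PATH.
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"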

Activate the environment variables.

$ source ~/.bashrc

Step 5b: Edit hadoop-env.sh File

The hadoop-env.sh file serves as a master file to configure YARN, HDFS, MapReduce, and Hadoop-related project settings.

When setting up a single node Hadoop cluster, you need to define which Java implementation is to be utilized. Use the previously created $HADOOP_HOME variable to access the hadoop-env.sh file:

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the $JAVA_HOME variable (i.e., remove the # sign) and add the full path to the OpenJDK installation on your system. If you have installed the same version as presented in the first part of this tutorial, add the following line:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The path needs to match the location of the Java installation on your system.

If you need help to locate the correct Java path, run the following command in your terminal window:

which javac

The output is the path to the javac executable, typically /usr/bin/javac, which is a symbolic link.

Resolve that link to find the OpenJDK directory with the following command:

readlink -f /usr/bin/javac

The part of the path that precedes /bin/javac is the value to assign to the $JAVA_HOME variable.
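
The two lookups can be combined into one step. This is a minimal sketch, assuming GNU readlink as shipped with Ubuntu; the helper name java_home_from is hypothetical.

```shell
# Hypothetical helper: derive a JAVA_HOME candidate from a resolved javac path.
java_home_from() {
  # Taking the parent directory twice strips the trailing /bin/javac.
  dirname "$(dirname "$1")"
}

if command -v javac >/dev/null 2>&1; then
  # Follow the symbolic links from the javac on the PATH to the real binary.
  java_home_from "$(readlink -f "$(command -v javac)")"
else
  echo "javac not found on PATH; install a JDK first" >&2
fi
```

The printed directory is the value to paste into the export JAVA_HOME line.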

Step 5c: Edit core-site.xml File

The core-site.xml file defines HDFS and Hadoop core properties.

To set up Hadoop in a pseudo-distributed mode, you need to specify the URL for your NameNode, and the temporary directory Hadoop uses for the map and reduce process.

Open the core-site.xml file in a text editor:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration to override the default values for the temporary directory and add your HDFS URL to replace the default local file system setting:
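
A minimal sketch of such a configuration is shown here as a shell snippet that writes the file to a scratch directory, so you can review it before copying it into $HADOOP_HOME/etc/hadoop. hadoop.tmp.dir and fs.defaultFS are Hadoop's property names for the temporary directory and the default file system; the values (a tmpdata folder in the home directory and hdfs://localhost:9000) and the OUT_DIR variable are assumptions for illustration.

```shell
# Scratch location for the generated file (OUT_DIR is a hypothetical variable).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
# Assumed temporary directory; pick any location the hadoop user can write to.
TMP_DATA="$HOME/tmpdata"

cat > "$OUT_DIR/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$TMP_DATA</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/core-site.xml"
```

Once the values look right, copy the file over the stock one in $HADOOP_HOME/etc/hadoop.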


This example uses values specific to the local system. You should use values that match your system’s requirements, and keep them consistent throughout the configuration process.

Step 5d: Edit hdfs-site.xml File

The properties in the hdfs-site.xml file govern where HDFS stores node metadata, the fsimage file, and the edit log. Configure the file by defining the NameNode and DataNode storage directories.

Additionally, the default dfs.replication value of 3 needs to be changed to 1 to match the single node setup.

Use the following command to open the hdfs-site.xml file for editing:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:

<configuration>        <property>

If necessary, create the specific directories you defined for the dfs.data.dir value.
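
A minimal sketch of an hdfs-site.xml matching this step is written to a scratch directory below for review. dfs.data.dir is the property named above and dfs.name.dir is its NameNode counterpart; newer Hadoop releases call them dfs.datanode.data.dir and dfs.namenode.name.dir and still accept the older names. The dfsdata location and the OUT_DIR and DFS_ROOT variables are assumptions for illustration.

```shell
# Scratch location for the generated file (OUT_DIR is a hypothetical variable).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
# Assumed storage root for NameNode and DataNode data.
DFS_ROOT="${DFS_ROOT:-$HOME/dfsdata}"

cat > "$OUT_DIR/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>$DFS_ROOT/namenode</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>$DFS_ROOT/datanode</value>
  </property>
  <property>
    <!-- One copy of each block is enough on a single node. -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/hdfs-site.xml"
```

With these assumed paths, the two storage directories would be created with mkdir -p "$DFS_ROOT/namenode" "$DFS_ROOT/datanode" before formatting the NameNode.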

Step 5e: Edit mapred-site.xml File

Use the following command to access the mapred-site.xml file and define MapReduce values:

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name value to yarn:
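
A minimal sketch of that setting is written to a scratch directory below for review. mapreduce.framework.name is Hadoop's property for selecting the execution framework; the OUT_DIR variable is an assumption for illustration.

```shell
# Scratch location for the generated file (OUT_DIR is a hypothetical variable).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

cat > "$OUT_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN instead of the local runner. -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/mapred-site.xml"
```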


Step 5f: Edit yarn-site.xml File

The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the Node Manager, Resource Manager, Containers, and Application Master.

Open the yarn-site.xml file in a text editor:

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:
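
A minimal sketch for a single node is written to a scratch directory below for review. It sets only the shuffle service that MapReduce on YARN needs; a fuller setup may add further properties. The OUT_DIR variable is an assumption for illustration.

```shell
# Scratch location for the generated file (OUT_DIR is a hypothetical variable).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

cat > "$OUT_DIR/yarn-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <!-- Auxiliary service that lets NodeManagers serve map output to reducers. -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/yarn-site.xml"
```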


Step 5g: Format HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time:

hdfs namenode -format

The shutdown notification signifies the end of the NameNode format process.

Step 6: Start Hadoop Cluster

Start the NameNode and DataNode.

$ start-dfs.sh

Start the YARN resource and node managers.

$ start-yarn.sh

Verify all the running components.

$ jps

The system takes a few moments to initiate the necessary nodes.

If everything is working as intended, the resulting list of running Java processes contains all the HDFS and YARN daemons: NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager.
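
Rather than reading the jps list by eye, the check can be scripted. This is a minimal sketch; the helper name missing_daemons is hypothetical, and the daemon names are the ones a single-node HDFS and YARN setup normally starts.

```shell
# Hypothetical helper: print each expected daemon that is absent from jps output.
missing_daemons() {
  jps_output="$1"
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -w matches whole words, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$jps_output" | grep -qw "$d" || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  missing_daemons "$(jps)"
fi
```

No output means every daemon is running; any name printed points to the service whose log deserves a look.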

Step 7: Access Hadoop UI from Browser

Use your preferred browser and navigate to your localhost URL or IP. The default port number 9870 gives you access to the Hadoop NameNode UI:

http://localhost:9870


The NameNode user interface provides a comprehensive overview of the entire cluster.

The default port 9864 is used to access individual DataNodes directly from your browser:

http://localhost:9864


The YARN Resource Manager is accessible on port 8088:

http://localhost:8088


The Resource Manager is an invaluable tool that allows you to monitor all running processes in your Hadoop cluster.
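
On a machine without a browser, the same three ports can be probed from the terminal. This is a minimal sketch, assuming curl is installed; the helper name check_ui and the HADOOP_HOST variable are hypothetical.

```shell
# Host to probe; defaults to the local machine.
HADOOP_HOST="${HADOOP_HOST:-localhost}"

# Hypothetical helper: report whether a web UI answers on the given port.
check_ui() {
  label="$1"
  url="http://$HADOOP_HOST:$2"
  # -s silences progress output; --max-time bounds the wait in seconds.
  if curl -s -o /dev/null --max-time 3 "$url"; then
    echo "$label reachable at $url"
  else
    echo "$label NOT reachable at $url"
  fi
}

check_ui "NameNode UI" 9870
check_ui "DataNode UI" 9864
check_ui "YARN Resource Manager" 8088
```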


In this tutorial, you’ve installed Hadoop in pseudo-distributed mode on a single node and verified it by checking the running daemons and the web interfaces. To learn how to write your own MapReduce programs, you might want to visit Apache Hadoop’s MapReduce tutorial, which walks through the code behind a WordCount example. When you’re ready to set up a multi-node cluster, see the Apache Hadoop Cluster Setup guide.