What is Apache Hadoop?
Apache Hadoop is an open-source, Java-based software platform for storing and analyzing large datasets on clusters of commodity hardware. It stores its data in the Hadoop Distributed File System (HDFS) and processes it using MapReduce. Hadoop is widely used for machine learning and data mining workloads, and it has become a standard framework for storing and processing big data across many industries. It is designed to scale from a single server to networks of hundreds or even thousands of machines, all of which work together to handle the massive volume and variety of incoming data. A fully developed Hadoop platform includes a collection of tools that extend the core framework. The image below gives an overview of the architecture; a detailed description is available in the official Hadoop documentation. Understanding the basic architecture will help you make sense of what you are configuring. If you run into trouble along the way, refer to the common errors people encounter when installing Hadoop. Let's begin.

Step 1: Create a user for the Hadoop environment

Hadoop should have its own dedicated user account on your system. To create one, open a terminal (Ctrl + Alt + T) and run the command below. You will be prompted to create a password for the account. You are free to use any username and password you see fit; this guide adds a user named hadoop:

$ sudo adduser hadoop

Step 2: Install Java

The Hadoop framework is written in Java, and its services require a compatible Java Runtime Environment (JRE) and Java Development Kit (JDK). Update your package index before starting a new installation:

$ sudo apt update

Then install OpenJDK 8, a Java version supported by Hadoop 3:
$ sudo apt install openjdk-8-jdk -y

Once installed, verify the Java version:

$ java -version

The output should report an OpenJDK 1.8.x runtime.

Step 3: Install OpenSSH on Ubuntu

Hadoop's startup scripts manage its daemons over SSH, even on a single node. Install the OpenSSH server and client:

$ sudo apt install openssh-server openssh-client -y

Switch to the newly created user:

$ su - hadoop
Enable passwordless SSH for the hadoop user. First, generate a public/private key pair:

$ ssh-keygen -t rsa

Add the generated public key to the list of authorized keys:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Restrict the permissions of the authorized_keys file:

$ chmod 640 ~/.ssh/authorized_keys

Verify that passwordless SSH is functional:

$ ssh localhost

Step 4: Install Apache Hadoop

Download a stable release of Hadoop. To find the latest version, go to the official Apache Hadoop download page:

$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz

Extract the downloaded archive:

$ tar -xvzf hadoop-3.3.2.tar.gz

You can also rename the extracted directory, as we will do here:

$ mv hadoop-3.3.2 hadoop

Now locate your Java installation so you can configure the JAVA_HOME environment variable for Hadoop:

$ dirname $(dirname $(readlink -f $(which java)))

Step 5: Configure Hadoop

Hadoop excels when deployed in a fully distributed mode on a large cluster of networked servers. However, if you are new to Hadoop and want to explore basic commands or test applications, you can configure Hadoop on a single node. This setup, also called pseudo-distributed mode, allows each Hadoop daemon to run as a single Java process. A Hadoop environment is configured by editing a set of configuration files: .bashrc, hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Apart from .bashrc, they can all be found under etc/hadoop in the newly created hadoop directory.

Step 5a: Configure Hadoop environment variables (.bashrc)

Open the shell startup file for editing:

$ nano ~/.bashrc

Add the following lines to the end of the file, then save and close it. These paths assume you installed OpenJDK 8 and renamed the extracted Hadoop directory to hadoop in this user's home directory; adjust them if your locations differ:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HOME/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Activate the new environment variables:

$ source ~/.bashrc

Step 5b: Edit the hadoop-env.sh file

The hadoop-env.sh file serves as a master file for configuring YARN, HDFS, MapReduce, and related Hadoop settings. When setting up a single-node Hadoop cluster, you need to define which Java implementation should be used.
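As an aside, the nested dirname/readlink command shown in Step 4 can look cryptic. It resolves the java symlink chain to the real binary and then strips the trailing /bin/java, leaving the Java home directory. A minimal sketch of that logic, using a hypothetical resolved path (the actual path on your machine may differ):

```shell
# Suppose `readlink -f $(which java)` resolved to this path
# (illustrative only; depends on your system):
resolved="/usr/lib/jvm/java-8-openjdk-amd64/bin/java"

# Each dirname strips one trailing path component:
echo "$(dirname "$resolved")"
# -> /usr/lib/jvm/java-8-openjdk-amd64/bin
echo "$(dirname "$(dirname "$resolved")")"
# -> /usr/lib/jvm/java-8-openjdk-amd64
```

The final value is exactly what belongs in the JAVA_HOME variable configured below.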
Use the previously created $HADOOP_HOME variable to open the file:

$ sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Uncomment the JAVA_HOME line (remove the leading # sign) and set it to your Java installation:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

The path needs to match the location of the Java installation on your system. If you need help locating the correct Java path, run the following command in your terminal window:

$ which javac

The resulting output provides the path to the Java binary directory. Use that path to find the OpenJDK directory:

$ readlink -f /usr/bin/javac

The portion of the path just before /bin/javac is what needs to be assigned to the JAVA_HOME variable.

Step 5c: Edit the core-site.xml file

The core-site.xml file defines HDFS and Hadoop core properties. To set up Hadoop in pseudo-distributed mode, you need to specify the URL for your NameNode and the temporary directory Hadoop uses for the map and reduce process. Open the file in a text editor:

$ sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following configuration to override the default values for the temporary directory and to replace the default local file system setting with your HDFS URL:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmpdata</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://127.0.0.1:9000</value>
  </property>
</configuration>

This example uses values specific to the local system. You should use values that match your own system's requirements, and the data needs to be consistent throughout the configuration process.

Step 5d: Edit the hdfs-site.xml file

The properties in the hdfs-site.xml file govern the locations for storing node metadata, the fsimage file, and the edit log file. Configure the file by defining the NameNode and DataNode storage directories.
In this hdfs-site.xml file, we set the directory paths for the NameNode and DataNode. Additionally, the default replication factor of 3 must be changed to 1 to match the single-node setup. Use the following command to open the file for editing:

$ sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following configuration to the file and, if needed, adjust the NameNode and DataNode directories to your custom locations:

<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/dfsdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/dfsdata/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

If necessary, create the specific directories you defined for the dfs.namenode.name.dir and dfs.datanode.data.dir values.

Step 5e: Edit the mapred-site.xml file

Use the following command to access the mapred-site.xml file and define MapReduce values:

$ sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following configuration to change the default MapReduce framework name to yarn:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Step 5f: Edit the yarn-site.xml file

The yarn-site.xml file is used to define settings relevant to YARN. It contains configurations for the NodeManager, ResourceManager, containers, and Application Master. Open the file in a text editor:

$ sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Append the following configuration to the file:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Step 5g: Format the HDFS NameNode

It is important to format the NameNode before starting Hadoop services for the first time:

$ hdfs namenode -format

A shutdown notification for the NameNode signifies the end of the format process.

Step 6: Start the Hadoop cluster

Start the NameNode and DataNode:

$ start-dfs.sh

Start the YARN resource and node managers:

$ start-yarn.sh

Verify that all components are running:

$ jps
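If all the daemons came up, the jps listing typically looks something like the following (the process IDs are illustrative and will differ on your machine):

```
11644 NameNode
11785 DataNode
12011 SecondaryNameNode
12254 ResourceManager
12390 NodeManager
12700 Jps
```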
The system takes a few moments to start the daemons. If everything is working as intended, the resulting list of running Java processes contains all of the HDFS and YARN daemons.

Step 7: Access the Hadoop UI from a browser

Use your preferred browser and navigate to your localhost URL or IP address. The default port 9870 gives you access to the Hadoop NameNode UI:

http://localhost:9870

The NameNode user interface provides a comprehensive overview of the entire cluster. The default port 9864 is used to access individual DataNodes directly from your browser:

http://localhost:9864

The YARN ResourceManager is accessible on port 8088:

http://localhost:8088

The ResourceManager is an invaluable tool for monitoring all running processes in your Hadoop cluster.

Conclusion

In this tutorial, you installed Hadoop in pseudo-distributed mode on a single node and verified the setup through the running daemon list and the web interfaces. To learn how to write your own MapReduce programs, you might want to visit Apache Hadoop's MapReduce tutorial, which walks through the code behind a complete example. When you are ready to set up a multi-node cluster, see the Apache Foundation's Hadoop Cluster Setup guide.
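As a final check, you can exercise the installation end to end by creating a directory in HDFS, uploading a file, and reading it back. The commands below are a sketch; the /user/hadoop path assumes you are logged in as the hadoop user created in Step 1:

```
$ hdfs dfs -mkdir -p /user/hadoop                                   # create a home directory in HDFS
$ hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /user/hadoop  # upload a local file
$ hdfs dfs -ls /user/hadoop                                         # list the uploaded file
$ hdfs dfs -cat /user/hadoop/core-site.xml                          # print its contents back
```

If the file round-trips successfully, HDFS is storing and serving data as expected.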