Introduction
Hadoop is one of the most widely adopted Big Data technologies for processing data across distributed systems. A simple way to start learning Hadoop is to run the Word Count example, which counts the number of occurrences of each word in a text file and shows how HDFS and MapReduce work together.
This blog walks you through installing Hadoop on an AWS EC2 Ubuntu server and running the Word Count program step by step.
Hadoop Cluster Setup Type
- Hadoop can be set up in different modes depending on how many machines are used and for what purpose. The most common Hadoop cluster setups are:
- Standalone Mode
- Pseudo-Distributed Mode
- Fully Distributed Mode
Standalone Mode [Single Node]
- Standalone mode runs Hadoop without any cluster services. It uses the local file system instead of HDFS.
- No Name Node or Data Node daemons
- No HDFS
- Used mainly for learning and testing MapReduce logic
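For example, once Hadoop is installed (Step 3 below), the bundled Word Count job runs in standalone mode straight against local files, with no daemons started. A minimal sketch, assuming HADOOP_HOME is set and the configuration is still the untouched default (so the local file system is used):
bash
# Standalone mode: with the default config, input and output are plain local dirs
mkdir -p ~/local-input
echo "hello hadoop hello" > ~/local-input/sample.txt
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar \
  wordcount ~/local-input ~/local-output
cat ~/local-output/part-r-00000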
Pseudo-Distributed Mode (Pseudo Cluster)
- In pseudo-distributed mode, all Hadoop daemons run on a single machine, but they behave as if they are running in a real cluster. This is why it is called a pseudo cluster.
- Services running in this mode:
- Name Node
- Data Node
- Resource Manager
- Node Manager
- Secondary Name Node
- It uses HDFS, MapReduce, and YARN. It simulates a real Hadoop cluster and is best for learning, practice, and development.
Fully Distributed Mode (Real Cluster)
- In fully distributed mode, Hadoop runs on multiple machines that form a real cluster. One or more master nodes (NameNode, ResourceManager) coordinate multiple worker nodes (DataNodes, NodeManagers).
- Used in production environments
- Provides high availability and fault tolerance
- The replication factor is usually set to 3
Comparison Table
| Mode | Machines | HDFS | Use Case |
|---|---|---|---|
| Standalone | 1 | No | Learning MapReduce |
| Pseudo-Distributed | 1 | Yes | Practice & Development |
| Fully Distributed | Multiple | Yes | Production |
Why Use Hadoop on AWS EC2?
- AWS EC2 gives us a cloud-based Linux server
- Hadoop needs a Linux environment
- We can practice real-world Big Data skills
- No need to buy our own hardware
Step 1: Launch an Ubuntu EC2 Instance
- First, create an Ubuntu EC2 instance from the AWS Console.
- Log in to the AWS Management Console, search for EC2, and create a new instance
- Name and tags: wordcount-hadoop-server
- AMI: Select Ubuntu
- Instance type: Select t3.medium
- Key pair: Create a key pair [hadoopkey]
- Security group: Allow SSH (22) and HTTP (80)
- Storage: 20 GB SSD
- Then launch the instance
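- After downloading the key, restrict its permissions first; SSH refuses private keys that are readable by other users (this assumes hadoopkey.pem was saved to your working directory):
bash
chmod 400 hadoopkey.pem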
- Connect to it using SSH:
bash
ssh -i hadoopkey.pem ubuntu@ec2-public-ip
Step 2: Install Java
- Next, install Java. Hadoop is written in Java, so a JDK is required to run it.
- First, update the Ubuntu package index:
bash
sudo apt update
- Install OpenJDK 8:
bash
sudo apt install openjdk-8-jdk -y
- Verify the Java version:
bash
java -version
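- It also helps to note the JDK install path now, since it is used for JAVA_HOME in a later step. A quick way to resolve it (on Ubuntu, OpenJDK 8 normally lives at /usr/lib/jvm/java-8-openjdk-amd64):
bash
# Prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java;
# JAVA_HOME is the /usr/lib/jvm/java-8-openjdk-amd64 part
readlink -f "$(which java)"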
Step 3: Install Hadoop
- Now download the Hadoop package and install it on the Ubuntu server.
- First, download Hadoop:
bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
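- Optionally, verify the integrity of the download; Apache publishes a .sha512 checksum file next to each release tarball:
bash
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz.sha512
sha512sum hadoop-3.3.6.tar.gz   # compare with the value in the .sha512 file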
- Extract the downloaded tar file:
bash
tar -xvzf hadoop-3.3.6.tar.gz
- Move the extracted Hadoop directory into /usr/local:
bash
sudo mv hadoop-3.3.6 /usr/local/hadoop
- Set environment variables in .bashrc:
- Open the .bashrc file:
bash
vi ~/.bashrc
- Add the following lines at the end of the file:
bash
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
- Apply changes:
bash
source ~/.bashrc
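- Confirm that the hadoop command is now on the PATH:
bash
hadoop version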
Step 4: Configure Hadoop
Hadoop uses configuration files to understand where the cluster services are running and how data should be stored in HDFS. Two important configuration files are core-site.xml and hdfs-site.xml.
- In core-site.xml
- The core-site.xml file tells Hadoop where the NameNode is located. The NameNode is the master node that manages the Hadoop file system (HDFS).
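- Hadoop's configuration files live under $HADOOP_HOME/etc/hadoop, so open the file from there:
bash
vi /usr/local/hadoop/etc/hadoop/core-site.xml
- Configuration for core-site.xml: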
xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- This configuration tells Hadoop to use the HDFS service running on the local machine at port 9000 as the default file system.
- In hdfs-site.xml
- The hdfs-site.xml file controls how data is stored in HDFS, including the replication factor (the number of copies kept of each data block).
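- Open it from the same configuration directory:
bash
vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
- Configuration for hdfs-site.xml: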
xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- This setting is commonly used for single-node Hadoop setups, where only one machine is available. In a production cluster, the replication factor is usually set to 3 for fault tolerance.
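- One extra note: the HDFS daemons are launched over non-interactive SSH sessions, which may not read ~/.bashrc. If start-dfs.sh later complains that JAVA_HOME is not set, declare it in Hadoop's own environment file as well:
bash
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh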
Step 5: Set up passwordless SSH and start Hadoop
- Set up passwordless SSH so that the Hadoop scripts can start and communicate with the daemons on this node without prompting for a password.
bash
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
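- Tighten the permissions on authorized_keys and confirm that SSH to localhost now works without a password (accept the host key the first time):
bash
chmod 600 ~/.ssh/authorized_keys
ssh localhost exit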
- Format the NameNode:
bash
hdfs namenode -format
- Now start the HDFS services:
bash
start-dfs.sh
- Check the running services:
bash
jps
Outcome
NameNode
DataNode
SecondaryNameNode
Step 6: Create Input Data
- Now create a text file for Word Count:
bash
vi input.txt
- This opens an empty file in the editor; paste in the following text:
Example
Hadoop is used for Big Data
Big Data is growing fast
Hadoop works on distributed systems
- After pasting the text, save and exit:
- Press the ESC key
- Type :wq and press Enter
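- Alternatively, you can create the same file non-interactively with a heredoc instead of using vi:
bash
cat > input.txt <<'EOF'
Hadoop is used for Big Data
Big Data is growing fast
Hadoop works on distributed systems
EOF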
Step 7: Upload File to HDFS
- Create an HDFS directory:
bash
hdfs dfs -mkdir /input
- Upload the file:
bash
hdfs dfs -put input.txt /input
- To verify the file:
bash
hdfs dfs -ls /input
Step 8: Run Word Count
- Run the Word Count program:
bash
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
- If you get an error message saying the output directory already exists, delete the old output directory and re-run the job (optional):
bash
hdfs dfs -rm -r /output
Step 9: View Output
- To see the output:
bash
hdfs dfs -cat /output/part-r-00000
- The output lists each word with its count, one per line, sorted alphabetically:
Example
Big 2
Data 2
Hadoop 2
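- You can also copy the result file from HDFS back to the local file system:
bash
hdfs dfs -get /output/part-r-00000 wordcount-result.txt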
Step 10: Explore the dashboard in a browser
- If you want to explore the dashboard graphically, you first have to open its port in the security group
- Go to the EC2 security group
- Click Edit inbound rules
- Add a rule: Custom TCP >> port 9870 >> Anywhere
- Save the rule
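- Then open http://<ec2-public-ip>:9870 in your browser; 9870 is the default NameNode web UI port in Hadoop 3.x. You can also sanity-check that the UI is up from the instance itself:
bash
# Should print 200 if the NameNode web UI is serving
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9870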

