Running Hadoop Word Count on AWS EC2 (Pseudo-Distributed Mode Guide)

Introduction

Hadoop is one of the most widely adopted Big Data technologies for processing data on distributed systems. A simple way to start learning it is to run the Word Count example, which counts the number of occurrences of each word in a text file and shows how HDFS and MapReduce work together.

This blog will guide you through installing Hadoop on an AWS EC2 Ubuntu server and running the Word Count program step by step.

Hadoop Cluster Setup Types

Hadoop can be set up in different modes depending on how many machines are used and for what purpose. The most common Hadoop cluster setups are:

  • Standalone Mode
  • Pseudo-Distributed Mode
  • Fully Distributed Mode

Standalone Mode [Single Node]

  • Standalone mode runs Hadoop without any cluster services. It uses the local file system instead of HDFS.
    • No NameNode or DataNode daemons
    • No HDFS
    • Used mainly for learning and testing MapReduce logic

Pseudo-Distributed Mode (Pseudo Cluster)

  • In pseudo-distributed mode, all Hadoop daemons run on a single machine, but they behave as if they are running in a real cluster. This is why it is called a pseudo cluster.
  • Daemons that run in this mode:
    • NameNode
    • DataNode
    • ResourceManager
    • NodeManager
    • Secondary NameNode
  • It uses HDFS, MapReduce, and YARN. It simulates a real Hadoop cluster and is best for learning, practice, and development.

Fully Distributed Mode (Real Cluster)

  • In fully distributed mode, Hadoop runs on multiple machines forming a real cluster. There are one or more master nodes (NameNode, ResourceManager) and multiple worker nodes (DataNodes, NodeManagers).
    • It is used in production environments
    • It provides high availability and fault tolerance
    • The replication factor is usually set to 3

Comparison Table

Mode                 Machines   HDFS   Use Case
Standalone           1          No     Learning MapReduce
Pseudo-Distributed   1          Yes    Practice & Development
Fully Distributed    Multiple   Yes    Production

Why Use Hadoop on AWS EC2?

  • AWS EC2 gives us a cloud-based Linux server
  • Hadoop needs a Linux environment
  • We can practice real-world Big Data skills
  • No need to buy our own hardware

Step 1: Launch an Ubuntu EC2 Instance

  • Log in to the AWS Management Console, search for EC2, and create a new Ubuntu instance with the following settings:
    • Name and Tag: wordcount-hadoop-server
    • AMI: Ubuntu
    • Instance Type: t3.medium
    • Key pair: create a key pair [hadoopkey]
    • Security Group: allow SSH (22) and HTTP (80)
    • Storage: 20 GB SSD
  • Then launch the instance.
  • Connect to it using SSH (replace ec2-public-ip with your instance's public IP):

ssh -i hadoopkey.pem ubuntu@ec2-public-ip
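  • Note: if SSH refuses the connection with an 'unprotected private key file' warning, tighten the key's permissions first:

chmod 400 hadoopkey.pem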

Step 2: Install Java

  • Now install Java, because Hadoop is written in Java and needs a Java runtime.
  • First, update the Ubuntu package repository index:

sudo apt update
  • Install OpenJDK 8:

sudo apt install openjdk-8-jdk -y
  • Verify the Java version:

java -version
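  • You can also confirm where the JDK is installed, since that path is needed for JAVA_HOME in the next step (on Ubuntu amd64 it typically resolves under /usr/lib/jvm/java-8-openjdk-amd64):

readlink -f /usr/bin/java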

Step 3: Install Hadoop

  • Now download the Hadoop package and install it on the Ubuntu server.
  • First, download Hadoop:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Extract the downloaded tar file:

tar -xvzf hadoop-3.3.6.tar.gz
  • Move the extracted Hadoop directory to /usr/local:

sudo mv hadoop-3.3.6 /usr/local/hadoop
  • Set environment variables in .bashrc. Open the file:

vi ~/.bashrc
  • Add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

  • Apply the changes:

source ~/.bashrc
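  • Verify that the hadoop command is now on the PATH:

hadoop version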

Step 4: Configure Hadoop

Hadoop uses configuration files to understand where the cluster services are running and how data should be stored in HDFS. Two important configuration files are core-site.xml and hdfs-site.xml, both located in $HADOOP_HOME/etc/hadoop/.
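Also note: the Hadoop start-up scripts do not reliably inherit JAVA_HOME from .bashrc, so it is safest to set it in hadoop-env.sh as well, reusing the JDK path from Step 2:

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh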

  • In core-site.xml
    • The core-site.xml file tells Hadoop where the NameNode is located. The NameNode is the master node that manages the Hadoop file system (HDFS).
    • Edit $HADOOP_HOME/etc/hadoop/core-site.xml and add:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  • This configuration tells Hadoop to use the HDFS service running on the local machine at port 9000 as the default file system.
  • In hdfs-site.xml
    • The hdfs-site.xml file controls how data is stored in HDFS, including the replication factor (the number of copies of each data block).
    • Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  • This setting is commonly used for single-node Hadoop setups, where only one machine is available. In a production cluster, the replication factor is usually set to 3 for fault tolerance.
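  • After editing both files, you can sanity-check that Hadoop picked up the values; hdfs getconf prints the effective configuration:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication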

Step 5: Set Up Passwordless SSH and Start Hadoop

  • Set up passwordless SSH so the Hadoop scripts can start and manage the daemons without password prompts (in pseudo-distributed mode, the worker node is the local machine itself):

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
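  • You can confirm that passwordless login works; this should drop you back to the shell without asking for a password (accept the host key fingerprint the first time):

ssh localhost exit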
  • Format the NameNode:

hdfs namenode -format
  • Now start the HDFS services:

start-dfs.sh
  • Check the running services:

jps
  • You should see these daemons in the output (along with Jps itself):

NameNode
DataNode
SecondaryNameNode
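  • Note: this guide starts only HDFS, so the Word Count job in Step 8 runs MapReduce in local mode by default. If you also want the ResourceManager and NodeManager described earlier, start YARN too (this assumes mapred-site.xml and yarn-site.xml have been configured for YARN first):

start-yarn.sh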

Step 6: Create Input Data

  • Now create a text file for the word count:

vi input.txt

  • The editor opens an empty file. Press i to enter insert mode, then paste the following text:

Hadoop is used for Big Data
Big Data is growing fast
Hadoop works on distributed systems

  • After pasting the text, save and exit:
    • Press the ESC key
    • Type :wq and press Enter

Step 7: Upload File to HDFS

  • Create an HDFS directory:

hdfs dfs -mkdir /input

  • Upload the file:

hdfs dfs -put input.txt /input

  • Verify the file:

hdfs dfs -ls /input
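  • You can also print the uploaded file directly from HDFS:

hdfs dfs -cat /input/input.txt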

Step 8: Run Word Count

  • Run the Word Count program:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
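  • Here, wordcount is the example program, /input is the HDFS input directory, and /output is created by the job, so it must not exist yet. Running the jar with no arguments prints the list of bundled example programs:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar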

  • If you get an error message saying the output directory already exists, remove it and run the job again (optional):

hdfs dfs -rm -r /output

Step 9: View Output

  • To see the output:

hdfs dfs -cat /output/part-r-00000

  • The output will look like this (only part of the alphabetically sorted word list is shown):

Big 2
Data 2
Hadoop 2
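  • The /output directory also contains an empty _SUCCESS marker file written by the job; list the directory to see both files:

hdfs dfs -ls /output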

Step 10: Explore the Dashboard Using a Browser

  • If you want to explore the dashboard graphically, you have to open the NameNode web UI port in the EC2 security group:
    • Go to the EC2 security group
    • Click Edit inbound rules
    • Add a rule: Custom TCP >> port 9870 >> Anywhere
    • Save the rule
  • Then browse to http://ec2-public-ip:9870 (your instance's public IP) to open the NameNode dashboard.
