Running Hadoop Word Count on AWS EC2 (Pseudo-Distributed Mode Guide)

Introduction

Hadoop is one of the most widely adopted Big Data technologies for processing data on distributed systems. A simple way to start learning it is to run the Word Count example, which counts the number of occurrences of each word in a text file and shows how HDFS and MapReduce work together.

This blog will guide you through installing Hadoop on an AWS EC2 Ubuntu server and running the Word Count program step by step.

Hadoop Cluster Setup Types

Hadoop can be set up in different modes depending on how many machines are used and for what purpose. The most common Hadoop cluster setups are:

  • Standalone Mode
  • Pseudo-Distributed Mode
  • Fully Distributed Mode

Standalone Mode [Single Node]

  • Standalone mode runs Hadoop without any cluster services. It uses the local file system instead of HDFS.
    • No NameNode or DataNode daemons
    • No HDFS
    • Used mainly for learning and testing MapReduce logic

Pseudo-Distributed Mode (Pseudo Cluster)

  • In pseudo-distributed mode, all Hadoop daemons run on a single machine, but they behave as if they are running in a real cluster. This is why it is called a pseudo cluster.
  • Daemons that run in this mode:
    • NameNode
    • DataNode
    • ResourceManager
    • NodeManager
    • Secondary NameNode
  • It uses HDFS, MapReduce, and YARN. It simulates a real Hadoop cluster and is best for learning, practice, and development.

Fully Distributed Mode (Real Cluster)

  • In fully distributed mode, Hadoop runs on multiple machines forming a real cluster. There are one or more master nodes (NameNode, ResourceManager) and multiple worker nodes (DataNodes, NodeManagers).
    • It is used in production environments
    • It provides high availability and fault tolerance
    • The replication factor is usually set to 3

Comparison Table

Mode                 Machines   HDFS   Use Case
Standalone           1          No     Learning MapReduce
Pseudo-Distributed   1          Yes    Practice & Development
Fully Distributed    Multiple   Yes    Production

Why Use Hadoop on AWS EC2?

  • AWS EC2 gives us a cloud-based Linux server
  • Hadoop needs a Linux environment
  • We can practice real-world Big Data skills
  • No need to buy our own hardware

Step 1: Launch an Ubuntu EC2 Instance

  • Log in to the AWS Management Console, search for EC2, and create a new Ubuntu instance with the following settings:
    • Name and Tag: wordcount-hadoop-server
    • AMI: Ubuntu
    • Instance Type: t3.medium
    • Key pair: create a key pair [hadoopkey]
    • Security Group: allow SSH (22) and HTTP (80)
    • Storage: 20 GB SSD
  • Then launch the instance.
  • Connect to it using SSH (replace ec2-public-ip with your instance's public IP):

ssh -i hadoopkey.pem ubuntu@ec2-public-ip
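  • Note: if SSH refuses the connection with an 'unprotected private key file' warning, tighten the key's permissions first:

chmod 400 hadoopkey.pem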

Step 2: Install Java

  • Now install Java, because Hadoop is written in Java and needs a Java runtime.
  • First, update the Ubuntu package repository index:

sudo apt update
  • Install OpenJDK 8:

sudo apt install openjdk-8-jdk -y
  • Verify the Java version:

java -version
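  • You can also confirm where the JDK is installed, since that path is needed for JAVA_HOME in the next step (on Ubuntu amd64 it typically resolves under /usr/lib/jvm/java-8-openjdk-amd64):

readlink -f /usr/bin/java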

Step 3: Install Hadoop

  • Now download the Hadoop package and install it on the Ubuntu server.
  • First, download Hadoop:

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
  • Extract the downloaded tar file:

tar -xvzf hadoop-3.3.6.tar.gz
  • Move the extracted Hadoop directory to /usr/local:

sudo mv hadoop-3.3.6 /usr/local/hadoop
  • Set environment variables in .bashrc. Open the file:

vi ~/.bashrc
  • Add the following lines at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

  • Apply the changes:

source ~/.bashrc
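  • Verify that the hadoop command is now on the PATH:

hadoop version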

Step 4: Configure Hadoop

Hadoop uses configuration files to understand where the cluster services are running and how data should be stored in HDFS. Two important configuration files are core-site.xml and hdfs-site.xml, both located in $HADOOP_HOME/etc/hadoop/.
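Also note: the Hadoop start-up scripts do not reliably inherit JAVA_HOME from .bashrc, so it is safest to set it in hadoop-env.sh as well, reusing the JDK path from Step 2:

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh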

  • In core-site.xml
    • The core-site.xml file tells Hadoop where the NameNode is located. The NameNode is the master node that manages the Hadoop file system (HDFS).
    • Edit $HADOOP_HOME/etc/hadoop/core-site.xml and add:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
  • This configuration tells Hadoop to use the HDFS service running on the local machine at port 9000 as the default file system.
  • In hdfs-site.xml
    • The hdfs-site.xml file controls how data is stored in HDFS, including the replication factor (the number of copies of each data block).
    • Edit $HADOOP_HOME/etc/hadoop/hdfs-site.xml and add:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
  • This setting is commonly used for single-node Hadoop setups, where only one machine is available. In a production cluster, the replication factor is usually set to 3 for fault tolerance.
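  • After editing both files, you can sanity-check that Hadoop picked up the values; hdfs getconf prints the effective configuration:

hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.replication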

Step 5: Set Up Passwordless SSH and Start Hadoop

  • Set up passwordless SSH so the Hadoop scripts can start and manage the daemons without password prompts (in pseudo-distributed mode, the worker node is the local machine itself):

ssh-keygen -t rsa -P ""
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
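  • You can confirm that passwordless login works; this should drop you back to the shell without asking for a password (accept the host key fingerprint the first time):

ssh localhost exit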
  • Format the NameNode:

hdfs namenode -format
  • Now start the HDFS services:

start-dfs.sh
  • Check the running services:

jps
  • You should see these daemons in the output (along with Jps itself):

NameNode
DataNode
SecondaryNameNode
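  • Note: this guide starts only HDFS, so the Word Count job in Step 8 runs MapReduce in local mode by default. If you also want the ResourceManager and NodeManager described earlier, start YARN too (this assumes mapred-site.xml and yarn-site.xml have been configured for YARN first):

start-yarn.sh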

Step 6: Create Input Data

  • Now create a text file for the word count:

vi input.txt

  • The editor opens an empty file. Press i to enter insert mode, then paste the following text:

Hadoop is used for Big Data
Big Data is growing fast
Hadoop works on distributed systems

  • After pasting the text, save and exit:
    • Press the ESC key
    • Type :wq and press Enter

Step 7: Upload File to HDFS

  • Create an HDFS directory:

hdfs dfs -mkdir /input

  • Upload the file:

hdfs dfs -put input.txt /input

  • Verify the file:

hdfs dfs -ls /input
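  • You can also print the uploaded file directly from HDFS:

hdfs dfs -cat /input/input.txt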

Step 8: Run Word Count

  • Run the Word Count program:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /input /output
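  • Here, wordcount is the example program, /input is the HDFS input directory, and /output is created by the job, so it must not exist yet. Running the jar with no arguments prints the list of bundled example programs:

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar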

  • If you get an error message saying the output directory already exists, remove it and run the job again (optional):

hdfs dfs -rm -r /output

Step 9: View Output

  • To see the output:

hdfs dfs -cat /output/part-r-00000

  • The output will look like this (only part of the alphabetically sorted word list is shown):

Big 2
Data 2
Hadoop 2
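  • The /output directory also contains an empty _SUCCESS marker file written by the job; list the directory to see both files:

hdfs dfs -ls /output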

Step 10: Explore the Dashboard Using a Browser

  • If you want to explore the dashboard graphically, you have to open the NameNode web UI port in the EC2 security group:
    • Go to the EC2 security group
    • Click Edit inbound rules
    • Add a rule: Custom TCP >> port 9870 >> Anywhere
    • Save the rule
  • Then browse to http://ec2-public-ip:9870 (your instance's public IP) to open the NameNode dashboard.
