AWS EMR
EMR
AWS EMR is a managed AWS service for running Spark, HDFS, Hive, and other select software.
Protip: Start the EMR cluster only after you have your project set up, to avoid unnecessary cost.
We will use EMR to run our Spark and HDFS cluster.
- Go to `AWS Service -> EMR`
- Click on `Create Cluster`
- Click on `Go to advanced options`
- Select the shown options and copy-paste the config below into the `Edit software settings` section
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3"
        }
      }
    ]
  }
]
This config tells the EMR cluster to use python3 for PySpark.
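If you would rather generate this setting than paste it by hand, the same structure can be built in Python. This is only a minimal sketch: the function name `spark_env_python3_config` and its `python_path` parameter are made up for illustration and are not part of EMR itself.

```python
import json


def spark_env_python3_config(python_path="/usr/bin/python3"):
    """Return the software setting that points PySpark at python3.

    The function name and parameter are hypothetical; only the returned
    structure mirrors the config shown above.
    """
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": python_path},
                }
            ],
        }
    ]


if __name__ == "__main__":
    # Print JSON that can be pasted into the software settings box.
    print(json.dumps(spark_env_python3_config(), indent=2))
```

The returned list is plain data, so it can be printed as JSON for the console or reused by a script.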
- This example uses a `c4.large` machine to keep costs low. You can choose any machine you like, but keeping costs low is recommended unless a larger machine is absolutely necessary. You can also choose `On-demand` vs `spot`; `spot` instances are not available by default, so you might have to ask `aws` for some. Choose a core count of at least 2. Click `Next`.
- Type in a name for your cluster.
- Choose the key pair you created in the `1. AWS Account` section above and press `Create Cluster`.
- You will now see the cluster starting. Click on the master security group. You can also note your `EMR ID` here.
- This takes you to the master security group section. Press the `add inbound rule` button and add an `ssh` rule allowing access from anywhere (DO NOT DO THIS IN REAL LIFE). We set this only because we are building a toy project, not a real-life one (companies usually have their own VPC).
- You can now `ssh` into your cluster as shown below.
- The cluster takes a few minutes to start. Wait until the cluster shows its status as `waiting` before you begin work.
- Once your work is complete, select the EMR cluster and press the `Terminate` button at the top to stop your cluster.
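The console steps above (create, wait until ready, terminate) could also be scripted with boto3. The sketch below is one possible approach under several assumptions, not the method this walkthrough uses. The function names are hypothetical, "core count" is read as the number of core nodes, the default roles `EMR_DefaultRole` and `EMR_EC2_DefaultRole` are assumed to already exist in your account, and the release label is left for you to supply.

```python
def build_cluster_request(name, key_pair, release_label,
                          instance_type="c4.large", core_count=2):
    """Build keyword arguments for the boto3 EMR client's run_job_flow call."""
    if core_count < 2:
        raise ValueError("the walkthrough recommends at least 2 core nodes")
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
        # Same python3 setting as the config pasted into the console.
        "Configurations": [{
            "Classification": "spark-env",
            "Configurations": [{
                "Classification": "export",
                "Properties": {"PYSPARK_PYTHON": "/usr/bin/python3"},
            }],
        }],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"Name": "core", "InstanceRole": "CORE",
                 "Market": "ON_DEMAND", "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            "Ec2KeyName": key_pair,
            # Keep the cluster alive (status "waiting") when it has no steps.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def start_cluster(emr, request):
    """Create the cluster, block until it is running or waiting, return its ID."""
    cluster_id = emr.run_job_flow(**request)["JobFlowId"]
    emr.get_waiter("cluster_running").wait(ClusterId=cluster_id)
    return cluster_id


def stop_cluster(emr, cluster_id):
    """Terminate the cluster so it stops accruing cost."""
    emr.terminate_job_flows(JobFlowIds=[cluster_id])


# Usage (needs boto3 and AWS credentials; every value is a placeholder):
#   import boto3
#   emr = boto3.client("emr")
#   request = build_cluster_request("toy-cluster", "my-key-pair", "<release label>")
#   cluster_id = start_cluster(emr, request)
#   ...  # do your work
#   stop_cluster(emr, cluster_id)
```

Passing the client in as an argument keeps the request-building logic usable without AWS access, and makes it harder to forget the terminate step that the Protip warns about.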