Distributed File Systems, MapReduce
Seminar 1 of NoSQL Databases (PA195)
David Novak & Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno
http://disa.fi.muni.cz/vlastislav-dohnal/teaching/nosql-databases-fall-2019/

Agenda
● MetaCentrum Hadoop Cluster
  ○ Login, basic tools, monitoring
● Hadoop Distributed File System
  ○ Basics, working with files, monitoring, advanced settings
● Hadoop MapReduce
  ○ Writing your own MapReduce program: WordCount
  ○ Running on small data, monitoring
  ○ Running on large data
  ○ Advanced MapReduce task: Average Max Temperature
● Simple example in Spark

Basic Information
● We will be using Hadoop version 2.6.0
  ○ Hadoop main page: http://hadoop.apache.org/
  ○ documentation (v2.6.4): http://hadoop.apache.org/docs/r2.6.4/
● MetaCentrum Hadoop cluster
  ○ MetaCentrum account: http://metavo.metacentrum.cz/en/application/index.html
  ○ Hadoop cluster access: https://www.metacentrum.cz/en/hadoop/
  ○ MetaCentrum Hadoop cluster documentation: https://wiki.metacentrum.cz/wiki/Hadoop_documentation

Access from Your Home
● Because the seminars are held online
● Connect to the faculty's VPN
  ○ https://www.fi.muni.cz/tech/unix/vpn.html.en
● If that fails, use the university's VPN
  ○ https://it.muni.cz/en/services/vpn
  ○ Unfortunately, not all computers @FI are accessible then
    ■ E.g., nymfe* are behind another firewall, so PuTTY/SSH port forwarding must be used
      ● https://www.akadia.com/services/ssh_putty.html

Remote Access to Nymfe*
● Nymfe* are FI's local computers located in a computer room
● Open a VNC connection (mainly for Windows users)
  ○ https://www.fi.muni.cz/tech/unix/nymfe-remote.html.en
● or forward the X11 connection when on Mac/Linux
  ○ $ ssh -X nymfe0$((RANDOM % 3 + 1)).fi.muni.cz
    ■ chooses randomly among nymfe01-nymfe03

MetaCentrum Hadoop Access
● edit the file ~/.ssh/config on your local machine
● or, for OpenSSH on Windows, the file in C:\Users\<login>\.ssh

## MetaCentrum #########
Host hador
  HostName hador.ics.muni.cz
  User <login>
  Port 22

● log in to the Hadoop cluster frontend
$ ssh hador
(run on: localhost / nymfe)

MetaCentrum Hadoop Access: Web Interface
● First, obtain a Kerberos ticket on your local machine:
$ scp <login>@hador.ics.muni.cz:/etc/krb5.conf .
$ export KRB5_CONFIG=krb5.conf
$ kinit <login>@META
● Chrome on the local machine (in the same terminal):
$ /opt/google/chrome/chrome --auth-server-whitelist="hador*.ics.muni.cz" &
  ■ open https://hador-c1.ics.muni.cz:9871/
● or use Firefox and configure it using this manual:
  https://wiki.metacentrum.cz/wiki/Hadoop_documentation#Web_accessibility
(run on: nymfe)

HDFS DFS (1)
● HDFS system monitoring & basic commands
$ hdfs dfs -help
● documentation of the HDFS DFS file system commands
● get some data (the complete plays of Shakespeare)
$ wget https://goo.gl/KyDfc7 -O shake.txt
$ hdfs dfs -put shake.txt
● or, alternatively
$ hdfs dfs -put shake.txt /user/<login>/shake.txt
$ hdfs dfs -ls
(run on: hador)

HDFS DFS (2)
$ hdfs dfs -ls
$ hdfs dfs -setrep -w 2 shake.txt
$ hdfs dfs -rm shake.txt
$ hdfs dfs -D dfs.block.size=1048576 -put shake.txt
$ hdfs fsck /user/<login>/shake.txt -files -locations -blocks
$ hdfs dfs -mkdir input
● check the HDFS files in a browser:
  https://hador-c1.ics.muni.cz:9871/explorer.html#/user/<login>/
(run on: nymfe, hador)

Java Development
● Download the project from IS, seminar 1:
  https://is.muni.cz/auth/el/fi/podzim2020/PA195/um/seminar-1/
  ○ $ unzip pa195-hadoop-scafolding.zip
● Development with IntelliJ IDEA
$ module add jdk-1.8.0
$ module add idea-loc
$ idea.sh &
● The project is built with Maven and depends on:
  ○ org.apache.hadoop:hadoop-common
  ○ org.apache.hadoop:hadoop-mapreduce-client-core
● Compile with Maven
$ mvn install
(run on: nymfe)

MapReduce: WordCount (1)
● Task 1: Calculate the word frequency in a document.
● Sub-task 1.1: Use the Hadoop Java interface (v2.6.0) to implement WordCount as introduced in the lecture (a sketch of one possible solution follows below, after the overview of Writable classes).

Hadoop Writable Classes
● ArrayWritable, TwoDArrayWritable, NullWritable, Text, MapWritable, SortedMapWritable, ObjectWritable, BytesWritable
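For reference, here is a minimal sketch of such a job, essentially the canonical Hadoop WordCount written against the org.apache.hadoop.mapreduce API of the two Maven dependencies above. The package and class name pa195.hadoop.WordCount are taken from the run command on the next slide; everything else is illustrative, and the provided scaffolding project may structure the code differently.

package pa195.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts of each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // job.setCombinerClass(IntSumReducer.class);   // one way to attack sub-task 1.2
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Uncommenting the setCombinerClass line is one way to approach sub-task 1.2: because addition is associative and commutative, the reducer can double as the combiner here, and the job counters in the MapReduce log will show the reduced shuffle volume.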
MapReduce: WordCount (2)
local$ module add maven
local$ mvn install
local$ scp target/pa195-hadoop-1.0.jar hador:
hador$ hdfs dfs -mkdir input
hador$ hdfs dfs -mv shake.txt input
hador$ hadoop jar pa195-hadoop-1.0.jar pa195.hadoop.WordCount input/ output/
hador$ hdfs dfs -get output .
hador$ sort -k 2 -g -r part-r-00000 > sorted.txt
● job history: https://hador-c2.ics.muni.cz:19890
(run on: localhost, hador)

MapReduce: WordCount (3)
● Sub-task 1.2:
  ○ try it with a Combiner and observe the difference in the MapReduce log (the output of the hadoop process)
● Sub-task 1.3:
  ○ clean the input: remove the characters ,.();:!?- and the numbers

MapReduce: WordCount (4)
● Sub-task 1.4: do not lowercase the characters, but ignore case when counting the words
● Sub-task 1.5: sort the results by word frequency (descending)
  ○ use a second MapReduce job to do this

MapReduce: Large-scale Test
● Task 2: Run the WordCount (count & sort) on a multi-GB collection of documents
  ○ observe the performance
    ■ the actual output is not important
  ○ a downloaded Wikipedia dump is available in:
$ DIR=/storage/brno2/home/dohnal/pa195/wikipedia
$ hdfs dfs -mkdir wiki-input
$ for F in $DIR/*.xml; do hdfs dfs -put $F wiki-input; done
  ○ increase the number of reduce tasks
(run on: hador)

Proof of Practice
● Report the time it took to sort the Wiki data
  ○ copy & paste the corresponding line from the JobHistory UI into 'wiki-sort.txt':
    ■ https://hador-c2.ics.muni.cz:19890/jobhistory/app
    ■ e.g. 2020.10.09 10:45:30 CEST 2020…. 2020.10…. job_159… word count dohnal root.dohnal SUCCEEDED 86 86 1 1 00hrs, 02mins, 30sec
  ○ copy it to the instructor's HDFS:
$ hdfs dfs -put wiki-sort.txt /user/dohnal/pa195nosql-seminar1/<login>-wiki-sort.txt
● ZIP your project in IntelliJ IDEA (the src dir)
  ○ upload it to the IS vault:
    ■ https://is.muni.cz/auth/el/fi/podzim2020/PA195/ode/105893552/?predmet=1324117

MapReduce: Weather
● Task 3: Find out the average maximum temperature for each month.
● Data: historic temperatures in Milano (CSV format)
date,day-min,day-max
01012000,-4.0,5.0
02012000,-5.0,5.1
03012000,-5.0,7.7
04012000,-3.0,9.7
…
● data file: /storage/brno2/home/dohnal/pa195/weather.csv
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449
(run on: hador)

Is This Correct?
[figure omitted: a candidate MapReduce design for the average computation]
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449

Weather: Partial Avg
Example source data:
01012000, -4.0, 10.0
02012000, -5.0, 20.0
03012000, -5.0, 2.0
04012000, -3.0, 4.0
05012000, -3.0, 3.0
Mapper #1 gets lines 1, 2
Mapper #2 gets lines 3, 4, 5
Mapper #1 avg: (10 + 20) / 2 = 15
Mapper #2 avg: (2 + 4 + 3) / 3 = 3
Reducer avg: (15 + 3) / 2 = 9
Correct avg: (10 + 20 + 2 + 4 + 3) / 5 = 7.8
Not correct! Averaging the mappers' partial averages gives a wrong result because the partitions have different sizes.
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449

This Is Correct
[figure omitted: the corrected MapReduce design for the average computation]
● the mappers must not pre-average; either they emit the raw values, or they emit (sum, count) pairs, so that the reducer can compute the true average (see the sketch below)
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449
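To make the point of the figure concrete, here is a minimal sketch of one correct design; it is an illustration, not necessarily the exact design from the original slide. The mapper emits a (month, day-max) pair per CSV record and only the reducer averages, so partial averages are never combined. The class name AverageMaxTemperature and the month-key format are illustrative choices.

package pa195.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageMaxTemperature {

    // Mapper: parses one CSV line "ddMMyyyy,day-min,day-max" and emits (month, day-max).
    // No averaging happens here, so the reducer sees every daily maximum of the month.
    public static class MaxTempMapper extends Mapper<Object, Text, Text, DoubleWritable> {
        private final Text month = new Text();
        private final DoubleWritable dayMax = new DoubleWritable();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3 || fields[0].startsWith("date")) {
                return;                                   // skip the header and malformed lines
            }
            String date = fields[0];                      // ddMMyyyy
            month.set(date.substring(2, 4) + "/" + date.substring(4));   // e.g. "01/2000"
            try {
                dayMax.set(Double.parseDouble(fields[2]));
            } catch (NumberFormatException e) {
                return;                                   // skip lines with an unparsable temperature
            }
            context.write(month, dayMax);
        }
    }

    // Reducer: averages all daily maxima of one month; this is correct because it sees
    // the complete list of values, unlike the "average of partial averages" design.
    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable average = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            average.set(sum / count);
            context.write(key, average);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average max temperature");
        job.setJarByClass(AverageMaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        // Note: do NOT reuse AvgReducer as a Combiner; averaging partial averages is
        // exactly the mistake shown on the previous slides.
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If a combiner is wanted for efficiency, it must emit (sum, count) pairs that the reducer then merges; sums and counts, unlike averages, can be combined safely regardless of partition sizes.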
Spark: Simple Example
● The MetaCentrum cluster has Spark installed (see the MetaCentrum Hadoop documentation)
● A simple example counting words in Shakespeare:
$ spark-shell --master yarn
scala> :help
scala> val file = sc.textFile("hdfs://hador-cluster/user/<login>/shake.txt")
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("spark-output")
scala> :quit
$ hdfs dfs -get spark-output/
(run on: hador)

Lessons Learned & Cleanup
What lessons did we take from the following?
● basic work with the HDFS distributed file system
● Hadoop MapReduce in Java
  ○ a simple word count and its modifications
  ○ a large-scale distributed job
  ○ a distributed average
● Please clean the large files from both HDFS and your home directory on the Hadoop cluster
$ hdfs dfs -rm -R wiki-input/
$ hdfs dfs -rm -R output
(run on: hador)

Cleanup on Nymfe
● Log out of the Gnome session
  ○ it may take a while (about 20 seconds) for the Log Out prompt to appear, so wait
  ○ if it fails, exit x11vnc in the terminal window (Ctrl-C) and run:
$ gnome-session-quit --force --logout
● Check and kill any remaining processes
  ○ $ ps ux
  ○ $ kill <PID>