Distributed File Systems, MapReduce
Seminar 1 of NoSQL Databases (PA195)
David Novak & Vlastislav Dohnal
Faculty of Informatics, Masaryk University, Brno
http://disa.fi.muni.cz/vlastislav-dohnal/teaching/nosql-databases-fall-2019/

Agenda
● MetaCentrum Hadoop Cluster
  ○ Login, basic tools, monitoring
● Hadoop Distributed File System
  ○ Basics, working with files, monitoring, advanced settings
● Hadoop MapReduce
  ○ Writing your own MapReduce program: WordCount
  ○ Running on small data, monitoring
  ○ Running on large data
  ○ Advanced MapReduce task: Average Max Temperature
● Simple example in Spark

Basic Information
● We will be using Hadoop version 2.6.0
  ○ Hadoop main page: http://hadoop.apache.org/
  ○ documentation (v2.6.4): http://hadoop.apache.org/docs/r2.6.4/
● MetaCentrum Hadoop cluster
  ○ MetaCentrum account: http://metavo.metacentrum.cz/en/application/index.html
  ○ Hadoop cluster access: https://www.metacentrum.cz/en/hadoop/
  ○ MetaCentrum Hadoop cluster documentation: https://wiki.metacentrum.cz/wiki/Hadoop_documentation

Access from Your Home
● Because the seminars are held online
● Connect to the faculty's VPN
  ○ https://www.fi.muni.cz/tech/unix/vpn.html.en
● If that fails, use the university's VPN
  ○ https://it.muni.cz/en/services/vpn
  ○ Unfortunately, not all computers @FI are accessible then
    ■ E.g., nymfe* are behind another firewall, so PuTTY/SSH port forwarding must be used
      ● https://www.akadia.com/services/ssh_putty.html

Remote Access to Nymfe*
● Nymfe* are FI's local computers located in a computer room
● Open a VNC connection (mainly for Windows users)
  ○ https://www.fi.muni.cz/tech/unix/nymfe-remote.html.en
● or forward the X11 connection when on Mac/Linux
  ○ $ ssh -X nymfe0$((RANDOM % 3 + 1)).fi.muni.cz
    ■ chooses randomly among nymfe01-nymfe03

MetaCentrum Hadoop Access
● edit the file ~/.ssh/config on your local machine
● or, for OpenSSH on Windows, the file in C:\Users\<login>\.ssh

## MetaCentrum #########
Host hador
  HostName hador.ics.muni.cz
  User <login>
  Port 22

● log in to the Hadoop cluster frontend
$ ssh hador
(run on: localhost / nymfe)

MetaCentrum Hadoop Access: Web Interface
● First, obtain a Kerberos ticket on your local machine:
$ scp <login>@hador.ics.muni.cz:/etc/krb5.conf .
$ export KRB5_CONFIG=krb5.conf
$ kinit <login>@META
● Chrome on the local machine (in the same terminal):
$ /opt/google/chrome/chrome --auth-server-whitelist="hador*.ics.muni.cz" &
  ■ open https://hador-c1.ics.muni.cz:9871/
● or use Firefox and configure it using this manual:
  https://wiki.metacentrum.cz/wiki/Hadoop_documentation#Web_accessibility
(run on: nymfe)

HDFS DFS (1)
● HDFS system monitoring & basic commands
$ hdfs dfs -help
● documentation of the HDFS DFS file system commands
● get some data (the complete plays of Shakespeare)
$ wget https://goo.gl/KyDfc7 -O shake.txt
$ hdfs dfs -put shake.txt
● or, alternatively
$ hdfs dfs -put shake.txt /user/<login>/shake.txt
$ hdfs dfs -ls
(run on: hador)

HDFS DFS (2)
$ hdfs dfs -ls
$ hdfs dfs -setrep -w 2 shake.txt
$ hdfs dfs -rm shake.txt
$ hdfs dfs -D dfs.block.size=1048576 -put shake.txt
$ hdfs fsck /user/<login>/shake.txt -files -locations -blocks
$ hdfs dfs -mkdir input
● check the HDFS files in a browser:
  https://hador-c1.ics.muni.cz:9871/explorer.html#/user/<login>/
(run on: nymfe, hador)

Java Development
● Download the project from IS, seminar 1:
  https://is.muni.cz/auth/el/fi/podzim2020/PA195/um/seminar-1/
  ○ $ unzip pa195-hadoop-scafolding.zip
● Development with IntelliJ IDEA
$ module add jdk-1.8.0
$ module add idea-loc
$ idea.sh &
● The project is built with Maven and depends on:
  ○ org.apache.hadoop:hadoop-common
  ○ org.apache.hadoop:hadoop-mapreduce-client-core
● Compile with Maven
$ mvn install
(run on: nymfe)

MapReduce: WordCount (1)
● Task 1: Calculate the word frequency in a document.
● Sub-task 1.1: Use the Hadoop Java interface (v2.6.0) to implement WordCount as introduced in the lecture (a sketch of one possible solution follows below, after the overview of Writable classes).

Hadoop Writable Classes
● ArrayWritable, TwoDArrayWritable, NullWritable, Text, MapWritable, SortedMapWritable, ObjectWritable, BytesWritable
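For reference, here is a minimal sketch of such a job, essentially the canonical Hadoop WordCount written against the org.apache.hadoop.mapreduce API of the two Maven dependencies above. The package and class name pa195.hadoop.WordCount are taken from the run command on the next slide; everything else is illustrative, and the provided scaffolding project may structure the code differently.

package pa195.hadoop;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token of the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts of each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // job.setCombinerClass(IntSumReducer.class);   // one way to attack sub-task 1.2
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Uncommenting the setCombinerClass line is one way to approach sub-task 1.2: because addition is associative and commutative, the reducer can double as the combiner here, and the job counters in the MapReduce log will show the reduced shuffle volume.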
MapReduce: WordCount (2)
local$ module add maven
local$ mvn install
local$ scp target/pa195-hadoop-1.0.jar hador:
hador$ hdfs dfs -mkdir input
hador$ hdfs dfs -mv shake.txt input
hador$ hadoop jar pa195-hadoop-1.0.jar pa195.hadoop.WordCount input/ output/
hador$ hdfs dfs -get output .
hador$ sort -k 2 -g -r part-r-00000 > sorted.txt
● job history: https://hador-c2.ics.muni.cz:19890
(run on: localhost, hador)

MapReduce: WordCount (3)
● Sub-task 1.2:
  ○ try it with a Combiner and observe the difference in the MapReduce log (the output of the hadoop process)
● Sub-task 1.3:
  ○ clean the input: remove the characters ,.();:!?- and the numbers

MapReduce: WordCount (4)
● Sub-task 1.4: do not lowercase the characters, but ignore case when counting the words
● Sub-task 1.5: sort the results by word frequency (descending)
  ○ use a second MapReduce job to do this

MapReduce: Large-scale Test
● Task 2: Run the WordCount (count & sort) on a multi-GB collection of documents
  ○ observe the performance
    ■ the actual output is not important
  ○ a downloaded Wikipedia dump is available in:
$ DIR=/storage/brno2/home/dohnal/pa195/wikipedia
$ hdfs dfs -mkdir wiki-input
$ for F in $DIR/*.xml; do hdfs dfs -put $F wiki-input; done
  ○ increase the number of reduce tasks
(run on: hador)

Proof of Practice
● Report the time it took to sort the Wiki data
  ○ copy & paste the corresponding line from the JobHistory UI into 'wiki-sort.txt':
    ■ https://hador-c2.ics.muni.cz:19890/jobhistory/app
    ■ e.g. 2020.10.09 10:45:30 CEST 2020…. 2020.10…. job_159… word count dohnal root.dohnal SUCCEEDED 86 86 1 1 00hrs, 02mins, 30sec
  ○ copy it to the instructor's HDFS:
$ hdfs dfs -put wiki-sort.txt /user/dohnal/pa195nosql-seminar1/<login>-wiki-sort.txt
● ZIP your project in IntelliJ IDEA (the src dir)
  ○ upload it to the IS vault:
    ■ https://is.muni.cz/auth/el/fi/podzim2020/PA195/ode/105893552/?predmet=1324117

MapReduce: Weather
● Task 3: Find out the average maximum temperature for each month.
● Data: historic temperatures in Milano (CSV format)
date,day-min,day-max
01012000,-4.0,5.0
02012000,-5.0,5.1
03012000,-5.0,7.7
04012000,-3.0,9.7
…
● data file: /storage/brno2/home/dohnal/pa195/weather.csv
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449
(run on: hador)

Is This Correct?
[figure omitted: a candidate MapReduce design for the average computation]
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449

Weather: Partial Avg
Example source data:
01012000, -4.0, 10.0
02012000, -5.0, 20.0
03012000, -5.0, 2.0
04012000, -3.0, 4.0
05012000, -3.0, 3.0
Mapper #1 gets lines 1, 2
Mapper #2 gets lines 3, 4, 5
Mapper #1 avg: (10 + 20) / 2 = 15
Mapper #2 avg: (2 + 4 + 3) / 3 = 3
Reducer avg: (15 + 3) / 2 = 9
Correct avg: (10 + 20 + 2 + 4 + 3) / 5 = 7.8
Not correct! Averaging the mappers' partial averages gives a wrong result because the partitions have different sizes.
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449

This Is Correct
[figure omitted: the corrected MapReduce design for the average computation]
● the mappers must not pre-average; either they emit the raw values, or they emit (sum, count) pairs, so that the reducer can compute the true average (see the sketch below)
source: http://www.slideshare.net/andreaiacono/mapreduce-34478449
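To make the point of the figure concrete, here is a minimal sketch of one correct design; it is an illustration, not necessarily the exact design from the original slide. The mapper emits a (month, day-max) pair per CSV record and only the reducer averages, so partial averages are never combined. The class name AverageMaxTemperature and the month-key format are illustrative choices.

package pa195.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageMaxTemperature {

    // Mapper: parses one CSV line "ddMMyyyy,day-min,day-max" and emits (month, day-max).
    // No averaging happens here, so the reducer sees every daily maximum of the month.
    public static class MaxTempMapper extends Mapper<Object, Text, Text, DoubleWritable> {
        private final Text month = new Text();
        private final DoubleWritable dayMax = new DoubleWritable();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 3 || fields[0].startsWith("date")) {
                return;                                   // skip the header and malformed lines
            }
            String date = fields[0];                      // ddMMyyyy
            month.set(date.substring(2, 4) + "/" + date.substring(4));   // e.g. "01/2000"
            try {
                dayMax.set(Double.parseDouble(fields[2]));
            } catch (NumberFormatException e) {
                return;                                   // skip lines with an unparsable temperature
            }
            context.write(month, dayMax);
        }
    }

    // Reducer: averages all daily maxima of one month; this is correct because it sees
    // the complete list of values, unlike the "average of partial averages" design.
    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        private final DoubleWritable average = new DoubleWritable();

        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            average.set(sum / count);
            context.write(key, average);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "average max temperature");
        job.setJarByClass(AverageMaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        // Note: do NOT reuse AvgReducer as a Combiner; averaging partial averages is
        // exactly the mistake shown on the previous slides.
        job.setReducerClass(AvgReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If a combiner is wanted for efficiency, it must emit (sum, count) pairs that the reducer then merges; sums and counts, unlike averages, can be combined safely regardless of partition sizes.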
Spark: Simple Example
● The MetaCentrum cluster has Spark installed (see the MetaCentrum Hadoop documentation)
● A simple example counting words in Shakespeare:
$ spark-shell --master yarn
scala> :help
scala> val file = sc.textFile("hdfs://hador-cluster/user/<login>/shake.txt")
scala> val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.saveAsTextFile("spark-output")
scala> :quit
$ hdfs dfs -get spark-output/
(run on: hador)

Lessons Learned & Cleanup
What lessons did we take from the following?
● basic work with the HDFS distributed file system
● Hadoop MapReduce in Java
  ○ a simple word count and its modifications
  ○ a large-scale distributed job
  ○ a distributed average
● Please clean the large files from both HDFS and your home directory on the Hadoop cluster
$ hdfs dfs -rm -R wiki-input/
$ hdfs dfs -rm -R output
(run on: hador)

Cleanup on Nymfe
● Log out of the Gnome session
  ○ it may take a while (about 20 seconds) for the Log Out prompt to appear, so wait
  ○ if it fails, exit x11vnc in the terminal window (Ctrl-C) and run:
$ gnome-session-quit --force --logout
● Check and kill any remaining processes
  ○ $ ps ux
  ○ $ kill <PID>