Lab 6 ‑ HBase



5/17/2018 Lab 6 - HBase https://canvas.du.edu/courses/60870/assignments/427677

Due May 24 by 11:59pm
Points 10
Available after May 17 at 8am

Introduction

In this lab, you’ll use HBase and Spark to read data from one HBase table and write the data to another HBase table. Links to HBase documentation can be found at HBase Tutorial and Reference. For this assignment, you may use code from the in-class HBase example (SparkHBase.zip) as well as any of the posted solutions to previous assignments.

Input

You’ll read input data from a public, read-only HBase table named ‘Baseball’. This table has two column families, ‘Players’ and ‘HallOfFame’. Row keys are the unique player IDs from the baseball database at https://www.kaggle.com/opensourcesports/baseballdatabank.

‘Players’ Column Family

This column family contains the data from the Master.csv file in the baseball database. The column names are:

  birthYear: Year player was born
  birthMonth: Month player was born
  birthDay: Day player was born
  birthCountry: Country where player was born
  birthState: State where player was born
  birthCity: City where player was born
  deathYear: Year player died
  deathMonth: Month player died
  deathDay: Day player died
  deathCountry: Country where player died
  deathState: State where player died
  deathCity: City where player died
  nameFirst: Player’s first name
  nameLast: Player’s last name
  nameGiven: Player’s given name (typically first and middle)
  weight: Player’s weight in pounds
  height: Player’s height in inches
  bats: Player’s batting hand (left, right, or both)
  throws: Player’s throwing hand (left or right)
  debut: Date that player made first major league appearance
  finalGame: Date that player made last major league appearance (blank if still active)
  retroID: ID used by retrosheet
  bbrefID: ID used by Baseball Reference website

‘HallOfFame’ Column Family

This column family contains the data from the HallOfFame.csv file in the baseball database. Only a subset of rows have any data in this column family. The column names are:

  yearid: Year of ballot
  votedBy: Method by which player was voted upon
  ballots: Total ballots cast in that year
  needed: Number of votes needed for selection in that year
  votes: Total votes received
  inducted: Whether player was inducted by that vote or not (Y or N)
  category: Category in which candidate was honored
  needed_note: Explanation of qualifiers for special elections

Output

Using the hbase shell, create a new table named ‘username:Lab6’ with a single column family called ‘HOF’ (username is your Unix username). Your program will write to this table (see below).

Program Arguments

Your Spark Java program must accept 3 command line arguments:

  sparkMaster inputTable outputTable

As in previous labs, “sparkMaster” will be “local” or “local[*]” to run the program locally, and “yarn” to run on the cluster. The “inputTable” argument is the name of the input HBase table (use “Baseball” for this assignment). The “outputTable” argument is the name of the HBase table to use for output (use “username:Lab6” for this assignment; a different table will be used for grading).

Running Your Program on YARN

Modify the shell script file run_spark_yarn.sh in the example code in /u/home/mikegoss/COMPx705Public/Demos/SparkHBase.zip to run your program. It shows how to specify the necessary HBase class path.

Your Task

Create your program in a directory called Lab 6. Include a build.xml file, and shell script files to run your program locally and on YARN.

Write a Spark program that will:

  - Create an RDD from the input HBase table.
  - For each row of the input table that meets all of the following conditions:
      - HallOfFame:yearid is present
      - HallOfFame:inducted value is “Y”
    write a row to the output HBase table as follows:
      - Row key is the input row key (unique player ID)
      - HOF:nameFull is the player’s full name (Players:nameFirst + “ ” + Players:nameLast)
      - HOF:yearid is the value of the input HallOfFame:yearid
      - HOF:category is the value of the input HallOfFame:category

Nothing should be output for rows that do not meet the input conditions. The method mapPartitionsToPair is probably the easiest way to implement the creation of the output RDD.

Hints

  - Notice that the data type NavigableMap provides the methods “containsKey” and “get”, so you don’t need to iterate over the columns like the sample HBase code does.
  - Don’t forget that all the HBase key and data values in the map are of type byte[].
  - You can add multiple columns (for the same row) to a single Put object.

Sample Output

The HBase table “Lab6SampleOutput” contains the sample output. If you list this table with the “scan” command in the HBase shell, your output should match exactly except for the time stamps.

Submitting the Lab

After completing the lab, commit your source Java file, Makefile, and shell files to run the code to SVN. Your code must execute correctly on the parcom machines using YARN.
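
The row conditions and the NavigableMap hint can be sketched in plain Java, with no HBase or Spark on the class path. This is an illustration under stated assumptions, not the required solution. Ordinary TreeMaps stand in for the per-family column maps that the HBase client's Result.getFamilyMap returns. The class name HofRowSketch and its helpers (family, text, hofColumns) are hypothetical. The row values are made up. Treating a missing column as an empty string is an assumption, since the lab text does not say how to handle it. In the real program, HBase's Bytes.toBytes and Bytes.toString would replace the manual string conversions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch only: TreeMaps stand in for the per-family column maps of one HBase row.
class HofRowSketch {
    // byte[] has no natural ordering, so compare keys as unsigned bytes, lexicographically.
    static final Comparator<byte[]> BYTE_ORDER = (x, y) -> {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xff) - (y[i] & 0xff);
            if (d != 0) return d;
        }
        return x.length - y.length;
    };

    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Hypothetical helper: builds a column-name -> value map from name/value pairs.
    static NavigableMap<byte[], byte[]> family(String... pairs) {
        NavigableMap<byte[], byte[]> m = new TreeMap<>(BYTE_ORDER);
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            m.put(toBytes(pairs[i]), toBytes(pairs[i + 1]));
        }
        return m;
    }

    // Reads one column as text; a missing column becomes "" (an assumption of this sketch).
    static String text(NavigableMap<byte[], byte[]> cols, String name) {
        byte[] v = (cols == null) ? null : cols.get(toBytes(name));
        return v == null ? "" : new String(v, StandardCharsets.UTF_8);
    }

    // Returns the HOF columns for a qualifying row, or null if the row should be skipped.
    static Map<String, String> hofColumns(NavigableMap<byte[], byte[]> players,
                                          NavigableMap<byte[], byte[]> hallOfFame) {
        if (hallOfFame == null) return null;
        // Condition 1: HallOfFame:yearid is present (direct lookup, no iteration).
        if (!hallOfFame.containsKey(toBytes("yearid"))) return null;
        // Condition 2: HallOfFame:inducted is "Y".
        if (!"Y".equals(text(hallOfFame, "inducted"))) return null;

        Map<String, String> out = new LinkedHashMap<>();
        out.put("nameFull", text(players, "nameFirst") + " " + text(players, "nameLast"));
        out.put("yearid", text(hallOfFame, "yearid"));
        out.put("category", text(hallOfFame, "category"));
        return out;
    }

    public static void main(String[] args) {
        // Made-up row values, for illustration only.
        NavigableMap<byte[], byte[]> players = family("nameFirst", "Ada", "nameLast", "Example");
        NavigableMap<byte[], byte[]> hof =
                family("yearid", "1999", "inducted", "Y", "category", "Player");
        System.out.println(hofColumns(players, hof));
    }
}
```

In the Spark program, the three returned values would presumably be added as HOF columns to a single Put keyed by the input row key, and rows for which hofColumns returns null would contribute nothing to the output RDD.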
