Example: data set: collections of text documents. Problem: count the frequency of nouns that appear at least 100 times in the documents.

Question

Example:
Data set: collections of text documents.
Problem: count the frequency of nouns that appear at least 100 times in the documents.
(i) Mapper function: tokenize each line into a set of terms (words), and filter out terms that are not nouns.
(ii) Mapper output: key is a noun, value is 1.
(iii) Reducer input: key is a word, value is a list of 1's.
(iv) Reduce function: sums up the 1's for each key (noun).
(v) Reducer output: key is a noun, value is the frequency of the word (filter out the nouns whose frequencies are below 100).

(a) Data set: Amazon book ratings data. Each line in the data file has 4 columns (reviewer id, book id, book genre, rating), where ratings are integer-valued, ranging from 1 to 4.
Problem: identify the highest rated book, i.e., the book with the highest average rating, for each book genre. Note that each book can have more than one rating (e.g., by different reviewers).

(b) Data set: movie preference data. Each record in the data file contains the movie title and the list of users who liked the movie. For example:
    jaws user111 user134 user313 user5812
    star_wars user111 user313 user388 user4422
Problem: for each pair of users, count the number of movies they both liked. The output may exclude pairs of users who do not have any movies they both liked.

(c) Data set: maximum and minimum daily temperature readings for weather stations from around the world. Each line in the data files has 4 columns (station id, date, max temperature, min temperature).
Problem: find the station id and date of anomalous temperature readings in the dataset. A temperature reading is anomalous if the minimum daily temperature exceeds the maximum temperature for the given day.

(d) Data set: Instagram friendship graph. Each record corresponds to an Instagram user, followed by a list of his/her friends. For example, the graph data may contain the following records:
    john123 mary456 tom312 lee222
    mary456 john123
    tom312 john123 lee222
    lee222 john123 tom312
The first line above states that mary456, tom312, and lee222 are friends of john123.
Problem: find pairs of Instagram users who are not friends with each other but who share one or more common friends. This is known as the friend-of-a-friend (FoF) problem. For example, mary456 and tom312 are both friends of john123, but they are not friends with each other. The Hadoop program should only output the pair (u, v) if u < v. In the previous example, the program should only output the pair (mary456, tom312) but not (tom312, mary456).

(e) Data set: cancer data. Each line in the data file corresponds to a patient with the following nominal-valued attributes: patientid, gender, marital status, smoker, weight class, and class, where the class attribute has value yes or no to indicate whether the patient has cancer. For example:
    12345, female, married, smoker, normal, yes
    13, male, single, nonsmoker, normal, no
    14423, male, married, smoker, overweight, yes
Problem: compute the Gini index for each of the following attributes: gender, marital status, smoker, and weight class, based on the distribution of their class values.
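One possible way to approach the friend-of-a-friend problem is sketched below in plain Python rather than as an actual Hadoop job. The names fof_mapper, fof_reducer, run_job, and the DIRECT/COMMON tags are hypothetical, and run_job only imitates the shuffle step in memory. The sketch assumes each user has exactly one record and that user ids are compared as strings. The mapper tags every (user, friend) pair as a direct friendship and every pair of users on the same friend list as sharing a common friend; the reducer keeps a pair only if it was never tagged as direct.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical value tags: a direct friendship vs. one shared friend.
DIRECT, COMMON = 0, 1

def fof_mapper(record):
    """Emit ((u, v), tag) with u < v for one 'user friend1 friend2 ...' record."""
    user, *friends = record.split()
    for friend in friends:
        # The user and each listed friend are directly connected.
        yield tuple(sorted((user, friend))), DIRECT
    # Any two users on the same friend list share `user` as a common friend.
    for u, v in combinations(sorted(set(friends)), 2):
        yield (u, v), COMMON

def fof_reducer(pair, values):
    """Keep the pair only if no record marked it as a direct friendship."""
    values = list(values)
    if DIRECT not in values:
        # Every remaining value is one common friend.
        yield pair, len(values)

def run_job(records, mapper, reducer):
    """In-memory stand-in for map, shuffle (group by key), and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Sample records from the question.
GRAPH = [
    "john123 mary456 tom312 lee222",
    "mary456 john123",
    "tom312 john123 lee222",
    "lee222 john123 tom312",
]

fof_pairs = run_job(GRAPH, fof_mapper, fof_reducer)
```

On the sample records this keeps (mary456, tom312), as the question expects, along with (lee222, mary456), since both are friends of john123 and neither lists the other. Emitting sorted pairs in the mapper is what enforces the u < v requirement.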
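For the cancer data problem, a hedged sketch in plain Python (not Hadoop code) follows. It assumes that "Gini index for an attribute" means the weighted Gini impurity of the split on that attribute: for each attribute value v, compute 1 minus the sum of squared class proportions among the patients with value v, then weight by the share of patients having v. The names gini_mapper, gini_reducer, and run_job are hypothetical, and run_job only imitates the shuffle step in memory.

```python
from collections import Counter, defaultdict

# Attribute columns between patientid and class, in file order.
ATTRIBUTES = ["gender", "marital status", "smoker", "weight class"]

def gini_mapper(line):
    """Emit (attribute name, (attribute value, class label)) for one patient."""
    fields = [field.strip() for field in line.split(",")]
    _patient_id, *attr_values, label = fields
    for name, value in zip(ATTRIBUTES, attr_values):
        yield name, (value, label)

def gini_reducer(attribute, values):
    """Weighted Gini impurity of the class labels over the attribute's values."""
    counts = defaultdict(Counter)  # attribute value -> class label -> count
    for value, label in values:
        counts[value][label] += 1
    total = sum(sum(c.values()) for c in counts.values())
    weighted = 0.0
    for class_counts in counts.values():
        n = sum(class_counts.values())
        impurity = 1.0 - sum((k / n) ** 2 for k in class_counts.values())
        weighted += (n / total) * impurity
    yield attribute, weighted

def run_job(records, mapper, reducer):
    """In-memory stand-in for map, shuffle (group by key), and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Sample records from the question.
PATIENTS = [
    "12345, female, married, smoker, normal, yes",
    "13, male, single, nonsmoker, normal, no",
    "14423, male, married, smoker, overweight, yes",
]

gini_by_attribute = dict(run_job(PATIENTS, gini_mapper, gini_reducer))
```

On the three sample records, marital status and smoker separate the classes perfectly (Gini 0), while gender and weight class each leave one mixed group of two patients, giving 2/3 × 0.5 = 1/3. Keying on the attribute name sends all of one attribute's (value, class) pairs to a single reducer call, which is fine for nominal attributes with few distinct values.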

Guest · Answer

Answer (c) worn brakes pads;...
