Saturday, September 11, 2021

Big Data Computing: Quiz Assignment-II Solutions (Week-2)

1. Consider the following statements:

Statement 1: The Job Tracker is hosted inside the master and it receives the job execution request from the client.

Statement 2: The task tracker is the MapReduce component on the slave machines, as there are multiple slave machines.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false


Answer: C) Both statements are true


2. _______________is the slave/worker node and holds the user data in the form of Data Blocks.

A. NameNode

B. Data block

C. Replication

D. DataNode

Answer: D) DataNode

Explanation: The NameNode acts as the master server: it manages the file system namespace, regulates client access to files, and keeps track of which DataNodes hold the data and how the blocks are distributed. The DataNode, on the other hand, is the slave/worker node and stores the actual user data in the form of data blocks.
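
To make this division of labor concrete, here is a toy Python sketch of the NameNode's two core lookup tables. The file name, block IDs, and DataNode names are invented for illustration; real HDFS metadata is far more involved.

# Toy sketch of NameNode bookkeeping: files map to blocks, and blocks
# map to the DataNodes holding their replicas. Illustrative only.
namespace = {
    "/logs/app.log": ["blk_0001", "blk_0002"],  # file -> ordered block list
}

block_locations = {
    "blk_0001": ["dn1", "dn3", "dn4"],  # block -> DataNodes with a replica
    "blk_0002": ["dn2", "dn3", "dn5"],
}

def locate(path):
    """What a client asks the NameNode: which DataNodes serve each block?"""
    return {blk: block_locations[blk] for blk in namespace[path]}

print(locate("/logs/app.log"))
# {'blk_0001': ['dn1', 'dn3', 'dn4'], 'blk_0002': ['dn2', 'dn3', 'dn5']}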


3. ________works as a master server that manages the file system namespace, regulates access to these files from clients, and also keeps track of where the data is on the Data Nodes and how the blocks are distributed.

A. Name Node

B. Data block

C. Replication

D. Data Node

Answer: A) Name Node

Explanation: The NameNode, as the master server, manages the file system namespace and regulates clients' access to files; at the same time, it tracks where the data resides on the DataNodes and how the blocks are distributed. Data nodes, on the other hand, are the slave/worker nodes, which hold the user data in the form of data blocks.


4. The number of maps in MapReduce is usually driven by the total size of

A. Inputs

B. Outputs

C. Tasks

D. None of the mentioned

Answer: A) Inputs

Explanation: The number of map tasks is driven by the total size of the inputs, i.e., the total number of input splits (blocks) of the input files. Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
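
As a rough illustration of how input size drives the map count, here is a sketch assuming Hadoop's usual default of one map task per input split, with a hypothetical 128 MB split size:

import math

# Sketch: estimate the number of map tasks for a job, assuming one map
# task per input split (Hadoop's usual default). The 128 MB split size
# is an assumption for illustration, not a fixed constant.
SPLIT_SIZE = 128 * 1024 * 1024

def estimate_map_tasks(file_sizes_bytes):
    # One split (hence one map task) per SPLIT_SIZE chunk of each file.
    return sum(math.ceil(s / SPLIT_SIZE) for s in file_sizes_bytes)

# Example: input files of 1 GB, 200 MB, and 50 MB -> 8 + 2 + 1 = 11 maps
print(estimate_map_tasks([1024**3, 200 * 1024**2, 50 * 1024**2]))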


5. True or False ?

The main duties of the task tracker are to break down the received job (i.e., a big computation) into small parts, allocate the partial computations (i.e., tasks) to the slave nodes, and monitor the progress and report of task execution from the slaves.

A. True

B. False

Answer: B) False

Explanation: These are the duties of the job tracker, not the task tracker. The task tracker executes the tasks assigned to it and communicates the progress and reports the results to the job tracker.


6. Point out the correct statement in context of YARN:

A. YARN is highly scalable.

B. YARN enhances a Hadoop compute cluster in many ways

C. YARN extends the power of Hadoop to incumbent and new technologies found within the data center

D. All of the mentioned

Answer: D) All of the mentioned


7. Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the frequency of phrases consisting of 3 words each instead of determining the frequency of single words. Which part of the (pseudo-)code do you need to adapt?

A. Only map()

B. Only reduce()

C. map() and reduce()

D. The code does not have to be changed

Answer: A) Only map()

Explanation: The map function takes a value and outputs key:value pairs.

For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value, then map("steve") would return 5:"steve" and map("savannah") would return 8:"savannah".

This allows us to run the map function against values in parallel. So we only have to adapt the map() function of the pseudo-code: instead of emitting single words as keys, it must emit 3-word phrases, while reduce() still just sums the counts per key.
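
A minimal sketch of the adapted mapper, assuming the classic WordCount shape where map() emits (key, 1) pairs and reduce() sums counts per key (the function name here is invented):

def map_phrases(line):
    # Emit every 3-word phrase (trigram) in the line with a count of 1;
    # reduce() can stay exactly as in WordCount, summing per key.
    words = line.split()
    for i in range(len(words) - 2):
        yield (" ".join(words[i:i + 3]), 1)

# list(map_phrases("the quick brown fox"))
# -> [('the quick brown', 1), ('quick brown fox', 1)]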


8. The namenode knows that the datanode is active using a mechanism known as

A. Heartbeats

B. Datapulse

C. h-signal

D. Active-pulse

Answer: A) Heartbeats

Explanation: The Hadoop NameNode and DataNodes communicate using heartbeats. A heartbeat is a signal that each DataNode sends to the NameNode at a regular interval to indicate its presence, i.e., to indicate that it is alive. If heartbeats stop arriving from a DataNode, the NameNode marks that node as dead.
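
A toy, single-process sketch of heartbeat-based liveness tracking; the interval and timeout values below are illustrative assumptions, not HDFS defaults:

import time

HEARTBEAT_INTERVAL = 3.0   # seconds between heartbeats (illustrative)
DEAD_TIMEOUT = 10.0        # declare a node dead after this much silence

class NameNode:
    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> time last heard from

    def receive_heartbeat(self, datanode_id):
        self.last_heartbeat[datanode_id] = time.monotonic()

    def live_datanodes(self):
        now = time.monotonic()
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t < DEAD_TIMEOUT]

nn = NameNode()
nn.receive_heartbeat("datanode-1")
print(nn.live_datanodes())  # ['datanode-1']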


9. True or False ?

HDFS performs replication, although it results in data redundancy.

A. True

B. False

Answer: A) True

Explanation: Once data has been written to HDFS, it is immediately replicated across the cluster, so that different copies of the data are stored on different DataNodes. The replication factor is normally 3, which provides fault tolerance: if one replica is lost, the data can still be served from the remaining copies.
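
A simplified sketch of replica placement under a replication factor of 3; real HDFS placement is rack-aware (e.g., the second replica goes to a different rack), which this deliberately ignores:

import random

REPLICATION = 3  # HDFS default replication factor

def place_block(block_id, datanodes):
    # Pick REPLICATION distinct DataNodes to hold copies of the block.
    return {block_id: random.sample(datanodes, REPLICATION)}

print(place_block("blk_0001", ["dn1", "dn2", "dn3", "dn4", "dn5"]))
# e.g. {'blk_0001': ['dn4', 'dn1', 'dn3']}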


10. _____________function processes a key/value pair to generate a set of intermediate key/value pairs.

A. Map

B. Reduce

C. Both Map and Reduce

D. None of the mentioned

Answer: A) Map

Explanation: Map is the task that converts input records into intermediate key/value records; Reduce then processes and merges all intermediate values associated with the same intermediate key.
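
Putting the two halves together, here is a minimal in-memory sketch of the map -> shuffle -> reduce flow for WordCount. It only illustrates the data flow; a real job runs the map and reduce tasks in parallel across a cluster.

from collections import defaultdict

def map_fn(line):
    for word in line.split():
        yield (word, 1)  # intermediate key/value pairs

def reduce_fn(word, counts):
    return (word, sum(counts))  # merge values for one intermediate key

def mapreduce(lines):
    groups = defaultdict(list)  # shuffle: group values by key
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in groups.items()]

print(mapreduce(["big data big compute", "big cluster"]))
# [('big', 3), ('data', 1), ('compute', 1), ('cluster', 1)]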

Big Data Computing: Quiz Assignment-I Solutions (Week-1)

1. _____________is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.


A. Hadoop Common
B. Hadoop Distributed File System (HDFS)
C. Hadoop YARN
D. Hadoop MapReduce

Answer: C) Hadoop YARN

Explanation:

Hadoop Common: contains the libraries and utilities needed by other Hadoop modules.
HDFS: a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource management platform responsible for managing compute resources in the cluster and using them to schedule users' applications. YARN allocates system resources to the various applications running in a Hadoop cluster and schedules tasks to be executed on different cluster nodes.
Hadoop MapReduce: a programming model that scales data processing across many machines.

 

2. Which of the following tools is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases?
A. Pig
B. Mahout
C. Apache Sqoop
D. Flume

Answer: C) Apache Sqoop

Explanation: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

 

3. ________________is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
A. Flume
B. Apache Sqoop
C. Pig
D. Mahout

Answer: A) Flume

Explanation: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.

 

4. _____________refers to the connectedness of big data.

A. Value
B. Veracity
C. Velocity
D. Valence

Answer: D) Valence

Explanation: Valence refers to the connectedness of big data, such as in the form of graph networks.

 

5. Consider the following statements:

Statement 1: Volatility refers to the data velocity relative to timescale of event being studied

Statement 2: Viscosity refers to the rate of data loss and stable lifetime of data

A. Only statement 1 is true
B. Only statement 2 is true
C. Both statements are true
D. Both statements are false

Answer: D) Both statements are false

Explanation: The correct statements are:

Statement 1: Viscosity refers to the data velocity relative to the timescale of the event being studied.

Statement 2: Volatility refers to the rate of data loss and the stable lifetime of data.

 

6.___________refers to the biases, noise and abnormality in data, trustworthiness of data.

A. Value
B. Veracity
C. Velocity
D. Volume

Answer: B) Veracity

Explanation: Veracity refers to the biases, noise, and abnormality in data, and the trustworthiness of the data.

 

7. ___________brings scalable parallel database technology to Hadoop and allows users to submit low-latency queries to data stored within HDFS or HBase without incurring a lot of data movement and manipulation.
A. Apache Sqoop
B. Mahout
C. Flume
D. Impala

Answer: D) Impala

Explanation: Impala is a query engine developed by Cloudera that runs on Apache Hadoop. The project was officially announced in late 2012 and became a publicly available open-source distribution. Impala brings scalable parallel database technology to Hadoop, allowing users to submit low-latency queries to data stored in HDFS or HBase without incurring a lot of data movement and manipulation.

 

8. True or False ?

NoSQL databases store unstructured data with no particular schema
A. True
B. False

Answer: A) True

Explanation: Traditional SQL can handle large amounts of structured data effectively, but we need NoSQL ("not only SQL") to handle unstructured data. NoSQL databases store unstructured data with no particular schema.
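
A minimal sketch of the schema-less idea; the list below just stands in for a document store, and the field names are invented:

import json

# Documents in the same "collection" need not share fields or structure.
users = [
    {"id": 1, "name": "alice", "email": "alice@example.com"},
    {"id": 2, "name": "bob", "followers": 1024},           # different fields
    {"id": 3, "name": "carol", "tags": ["admin", "ops"]},  # nested values
]

for doc in users:  # no fixed schema is enforced on insert or read
    print(json.dumps(doc))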

 

9. _____________is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leadership election, work queues, etc.
A. Apache Sqoop
B. Mahout
C. ZooKeeper
D. Flume

Answer: C) ZooKeeper

Explanation: ZooKeeper is a centralized key-value store that distributed systems can use to coordinate. Since it must be able to handle heavy load, ZooKeeper runs replicated across many machines.
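
As one concrete example of the distributed-locking use case, here is a sketch using the third-party kazoo Python client for ZooKeeper (an assumption; the quiz itself does not mention kazoo). The ensemble address and lock path are placeholders:

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder ensemble address
zk.start()

# A distributed lock: only one client holding /app/lock runs the
# critical section at a time, even across machines.
lock = zk.Lock("/app/lock", "worker-1")
with lock:
    print("lock acquired; doing exclusive work")

zk.stop()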

 

10. True or False ?

MapReduce is a programming model and an associated implementation for processing and generating large data sets.
A. True
B. False

Answer: A) True
