
Friday, September 24, 2021

Big Data Computing: Quiz Assignment-III Solutions (Week-3)

1. In Spark, a _____________ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A. Spark Streaming

B. FlatMap

C. Driver

D. Resilient Distributed Dataset (RDD)

Answer: D) Resilient Distributed Dataset (RDD)

Explanation: The Resilient Distributed Dataset (RDD) is Spark's basic data structure: a distributed, immutable collection of objects. Each RDD is divided into logical partitions that can be computed on different nodes in the cluster. RDDs can contain any type of Python, Java, or Scala object, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.


2. Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Where join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Output the result of joinrdd, when the following code is run.

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))

val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))

val joinrdd = rdd1.join(rdd2)

joinrdd.collect


A. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))

B. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))

C. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

D. None of the mentioned

Answer: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

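Explanation: join() is a transformation that returns an RDD containing all pairs of elements with matching keys in this RDD and the other RDD. Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in the other. Keys that appear in only one of the two RDDs ("e" in rdd1 and "h" in rdd2) produce no output, which rules out options A and B.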
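
The join semantics described above can be checked without a cluster. The following is a minimal sketch that models the two pair RDDs as plain Scala sequences; `joinPairs` is a hypothetical helper written for illustration, not part of the Spark API, and a real Spark job does not guarantee the order of the collected pairs.

```scala
// Plain-Scala model of pair-RDD join semantics (no Spark required).
// `joinPairs` is a hypothetical helper, not a Spark method.
object JoinSketch {
  // Emit (k, (v, w)) for every left/right combination that shares key k.
  def joinPairs[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v)  <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    val rdd1 = Seq(("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54))
    val rdd2 = Seq(("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64))

    val joined = joinPairs(rdd1, rdd2)
    joined.foreach(println)

    // Only keys present on both sides survive: "m" and "s".
    println(joined.map(_._1).distinct)
  }
}
```

Each of the two "m" values on the left pairs with each of the two on the right, and likewise for "s", giving 4 + 4 = 8 pairs, as in option C.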

 

3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and also provides an interactive shell.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false

Answer: C) Both statements are true

Explanation: Apache Spark is a fast, general-purpose cluster computing system. It offers high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a variety of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark comes with several sample programs. Spark also offers an interactive shell, a powerful tool for interactive data analysis, which is available in Scala or Python. Spark improves efficiency through in-memory computing primitives. With in-memory computing, data is kept in random access memory (RAM) instead of on slower disk drives and is processed in parallel. This makes it practical to detect patterns in and analyze large amounts of data. The approach has become popular as the cost of memory has fallen, which makes in-memory processing economical for applications.


4. True or False ?

Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

A. True

B. False

Answer: A) True

Explanation: Resilient Distributed Datasets (RDDs) are:

1. Immutable collections of objects spread across a cluster

2. Built through parallel transformations (map, filter, etc.)

3. Automatically rebuilt on failure

4. Controllable persistence (e.g. caching in RAM)
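
The four properties above can be illustrated with a toy model. This is a sketch under stated assumptions: `ToyRdd` is a hypothetical class invented for this example, not Spark's RDD, and it stores each partition as a recipe (a function) so that a lost partition can be recomputed from its lineage.

```scala
// Toy model of immutability and lineage-based recovery.
// `ToyRdd` is hypothetical and only mimics the idea behind Spark's RDD.
final case class ToyRdd[A](partitions: Vector[() => Vector[A]]) {
  // Transformations build a new ToyRdd; the receiver is never modified.
  def map[B](f: A => B): ToyRdd[B] =
    ToyRdd(partitions.map(p => () => p().map(f)))

  def filter(pred: A => Boolean): ToyRdd[A] =
    ToyRdd(partitions.map(p => () => p().filter(pred)))

  // Re-run the recipe for one partition, as would happen after a failure.
  def computePartition(i: Int): Vector[A] = partitions(i)()

  // Evaluate every partition and concatenate the results.
  def collect(): Vector[A] = partitions.flatMap(p => p())
}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val base  = ToyRdd(Vector(() => Vector(1, 2, 3), () => Vector(4, 5, 6)))
    val evens = base.map(_ * 10).filter(_ % 20 == 0)

    println(evens.collect())           // Vector(20, 40, 60)
    println(evens.computePartition(1)) // Vector(40, 60), rebuilt from lineage
    println(base.collect())            // Vector(1, 2, 3, 4, 5, 6), unchanged
  }
}
```

The toy version recomputes on every access; controllable persistence in Spark exists precisely so that frequently used results can be kept in memory instead.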


5. Which of the following is not a NoSQL database?

A. HBase

B. Cassandra

C. SQL Server

D. None of the mentioned

Answer: C) SQL Server

Explanation: NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases, in which data is stored in tables and the schema is carefully designed before the database is created. NoSQL databases are particularly useful for working with large amounts of distributed data. HBase and Cassandra are NoSQL databases, whereas SQL Server is a relational database.

 

6. True or False ?

Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

A. True

B. False

Answer: A) True

Explanation: Spark's biggest claim about speed is that "it can run programs up to 100 times faster than Hadoop MapReduce in memory or 10 times faster on disk." Spark can make this claim because it does its processing in the main memory of the worker nodes and avoids unnecessary I/O operations on disk. The other benefit Spark offers is the ability to chain tasks at the application programming level without writing to disk at all, or while minimizing the number of writes to disk.


7. _____________ leverages Spark Core's fast scheduling capability to perform streaming analytics.

A. MLlib

B. Spark Streaming

C. GraphX

D. RDDs

Answer: B) Spark Streaming

Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
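
The mini-batch idea can be sketched with plain Scala collections. This is a simplified illustration, not the Spark Streaming API: `processInBatches` is a hypothetical helper, and it cuts batches by element count, whereas Spark Streaming cuts them by time interval.

```scala
// Simplified model of micro-batching: split a stream of events into
// small batches and apply the same transformation to each batch.
// `processInBatches` is a hypothetical helper, not a Spark method.
object MicroBatchSketch {
  def processInBatches(events: Seq[String], batchSize: Int): List[Map[String, Int]] =
    events
      .grouped(batchSize) // one mini-batch per group of `batchSize` events
      .map { batch =>
        // Per-batch transformation: count occurrences of each event.
        batch.groupBy(identity).map { case (event, hits) => event -> hits.size }
      }
      .toList

  def main(args: Array[String]): Unit = {
    val events = Seq("a", "b", "a", "c", "c", "c")
    processInBatches(events, batchSize = 3).foreach(println)
  }
}
```

Each batch is processed independently, which is why the counts for "a" and "c" are reported per batch and not across the whole stream.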


8. _________ is a distributed graph processing framework on top of Spark.

A. MLlib

B. Spark streaming

C. GraphX

D. All of the mentioned

Answer: C) GraphX

Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is a distributed graph processing framework on top of Spark.


9. Point out the incorrect statement in the context of Cassandra:

A. It is a centralized key-value store

B. It is originally designed at Facebook

C. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure

D. It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Answer: A) It is a centralized key-value store

Explanation: Cassandra is a distributed key-value store.


10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are false

D. Both statements are true

Answer: C) Both statements are false

Explanation: The correct statements are:

Scale up = grow your cluster capacity by replacing with more powerful machines

Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf)

Saturday, September 11, 2021

Big Data Computing: Quiz Assignment-I Solutions (Week-1)

1. _____________ is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.


A. Hadoop Common
B. Hadoop Distributed File System (HDFS)
C. Hadoop YARN
D. Hadoop MapReduce

Answer: C) Hadoop YARN

Explanation:

Hadoop Common: Contains the libraries and utilities needed by other Hadoop modules.
HDFS: A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: A resource management platform responsible for managing compute resources in the cluster and using them to schedule users and applications. YARN is responsible for allocating system resources to the various applications running in a Hadoop cluster and for scheduling tasks to be executed on different cluster nodes.
Hadoop MapReduce: A programming model that scales data processing across many different processes.

 

2. Which of the following tools is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases?
A. Pig
B. Mahout
C. Apache Sqoop
D. Flume

Answer: C) Apache Sqoop

Explanation: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

 

3. ________________ is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
A. Flume
B. Apache Sqoop
C. Pig
D. Mahout

Answer: A) Flume
Explanation: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

 

4. _____________ refers to the connectedness of big data.

A. Value
B. Veracity
C. Velocity
D. Valence

Answer: D) Valence

Explanation: Valence refers to the connectedness of big data, for example in the form of graph networks.

 

5. Consider the following statements:

Statement 1: Volatility refers to the data velocity relative to timescale of event being studied

Statement 2: Viscosity refers to the rate of data loss and stable lifetime of data

A. Only statement 1 is true
B. Only statement 2 is true
C. Both statements are true
D. Both statements are false

Answer: D) Both statements are false

Explanation: The correct statements are:

Statement 1: Viscosity refers to the data velocity relative to timescale of event being studied

Statement 2: Volatility refers to the rate of data loss and stable lifetime of data

 

6. ___________ refers to the biases, noise, and abnormality in data, and the trustworthiness of data.

A. Value
B. Veracity
C. Velocity
D. Volume

Answer: B) Veracity

Explanation: Veracity refers to the biases, noise, and abnormality in data, and the trustworthiness of data.

 

7. ___________ brings scalable parallel database technology to Hadoop and allows users to submit low-latency queries to the data stored within HDFS or HBase without requiring a ton of data movement and manipulation.
A. Apache Sqoop
B. Mahout
C. Flume
D. Impala

Answer: D) Impala

Explanation: Impala was developed by Cloudera and is a query engine that runs on Apache Hadoop. The project was officially announced in late 2012 and became a publicly available open source distribution. Impala brings scalable parallel database technology to Hadoop, allowing users to submit low-latency queries to data stored in HDFS or HBase without requiring a lot of data movement and manipulation.

 

8. True or False ?

NoSQL databases store unstructured data with no particular schema
A. True
B. False

Answer: A) True
Explanation: Traditional SQL databases handle large amounts of structured data effectively, while NoSQL (not only SQL) is needed to handle unstructured data. A NoSQL database stores unstructured data without a specific schema.

 

9. _____________ is a highly reliable distributed coordination kernel, which can be used for distributed locking, configuration management, leadership election, work queues, etc.
A. Apache Sqoop
B. Mahout
C. ZooKeeper
D. Flume

Answer: C) ZooKeeper

Explanation: ZooKeeper is a central key-value store that distributed systems can use to coordinate. Because it needs to be able to handle the load, ZooKeeper itself runs on many machines.

 

10. True or False ?

MapReduce is a programming model and an associated implementation for processing and generating large data sets.
A. True
B. False

Answer: A) True
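
As an illustration of the MapReduce programming model, here is a minimal word-count sketch using plain Scala collections. It runs in a single process and does not use the Hadoop API; the names `mapPhase`, `shuffle`, `reducePhase`, and `wordCount` are hypothetical and chosen only to mirror the stages of the model.

```scala
// Single-process sketch of the MapReduce stages for word count.
// All function names are illustrative, not Hadoop API calls.
object WordCountSketch {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  // Shuffle: group the emitted values by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Reduce phase: sum the values collected for each key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (k, vs) => k -> vs.sum }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(lines.flatMap(mapPhase)))

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("big data", "big compute")))
}
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sketch keeps only the data flow.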
