Showing posts with label MLlib. Show all posts
Showing posts with label MLlib. Show all posts

Friday, September 24, 2021

Big Data Computing: Quiz Assignment-III Solutions (Week-3)

1. In Spark, a is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A. Spark Streaming

B. FlatMap

C. Driver

D. Resilient Distributed Dataset (RDD)

Answer: D) Resilient Distributed Dataset (RDD)

Explanation: Resilient Distributed Data Sets (RDDs) are a basic Spark data structure. It is a distributed and immutable collection of objects. Each dataset in RDD is divided into logical partitions that can be computed on different nodes in the cluster. RDDs can contain any type of Python, Java, or Scala object, including custom classes. Formally, an RDD is a read-only, partitioned collection of data sets. RDDs can be created by deterministic operations on data in stable storage or other RDDs. RDD is a collection of fault tolerant elements that can be operated in parallel.


2. Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Where join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Output the result of joinrdd, when the following code is run.

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))

val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64))) val joinrdd = rdd1.join(rdd2)

joinrdd.collect


A. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))

B. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))

C. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

D. None of the mentioned

Answer: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

Explanation: join() is transformation which returns an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other.

 

3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and also provides an interactive shell.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false

Answer: C) Both statements are true

Explanation: Apache Spark is a fast and universal cluster computing system. It offers high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general-execution graphics. It also supports a variety of higher-level tools, including Spark SQL for SQL and structured computing, MLlib for machine learning, GraphX ​​for graph processing, and Spark Streaming. Spark comes with several sample programs. Spark offers an interactive shell, a powerful tool for interactive data analysis. It is available in Scala or Python language. Spark improves efficiency through in-memory computing primitives. With in-memory computing, data is kept in random access memory (RAM) instead of some slow disk drives and is processed in parallel. This allows us to recognize a pattern and analyze large amounts of data. This has become popular because it reduces the cost of storage. Therefore, in-memory processing is economical for applications.


4. True or False ?

Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

A. True

B. False

Answer: True

Explanation: Resilient Distributed Datasets (RDDs) are:

1. Immutable collections of objects spread across a cluster

2. Built through parallel transformations (map, filter, etc.)

3. Automatically rebuilt on failure

4. Controllable persistence (e.g. caching in RAM)


5. Which of the following is not a NoSQL database?

A. HBase

B. Cassandra

C. SQL Server

D. None of the mentioned

Answer: C) SQL Server

Explanation: NoSQL, which stands for "not just SQL", is an alternative to traditional relational databases where the data is stored in tables and the data schema is carefully designed before the database is created. NoSQL databases are particularly useful for working with large amounts of distributed data.

 

6. True or False ?

Apache Spark potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

A. True

B. False

Answer: True

Explanation: Spark's biggest claim about speed is that "it can run programs up to 100 times faster than Hadoop MapReduce in memory or 10 times faster on disk." Spark could make this claim because it takes care of the processing in the main memory of the worker nodes and avoids unnecessary I / O operations on the disks. The other benefit that Spark offers is the ability to chain tasks at the application programming level without actually writing to disks or minimizing the amount of writes to disks.


7. _____________leverages Spark Core fast scheduling capability to perform streaming analytics.

A. MLlib

B. Spark Streaming

C. GraphX

D. RDDs

Answer: B) Spark Streaming

Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.


8. _________ is a distributed graph processing framework on top of Spark.

A. MLlib

B. Spark streaming

C. GraphX

D. All of the mentioned

Answer: C) GraphX

Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is a distributed graph processing framework on top of Spark.


9. Point out the incorrect statement in the context of Cassandra:

A. It is a centralized key-value store

B. It is originally designed at Facebook

C. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure

D. It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Answer: A) It is a centralized key-value store

Explanation: Cassandra is a distributed key-value store.


10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are false

D. Both statements are true

Answer: C) Both statements are false

Explanation: The correct statements are:

Scale up = grow your cluster capacity by replacing with more powerful machines

Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf)

Search Aptipedia