Friday, September 24, 2021

Big Data Computing: Quiz Assignment-IV Solutions (Week-4)

1. Identify the correct choices for the given scenarios:
P: The system allows operations all the time, and operations return quickly
Q: All nodes see the same data at any time, or reads return the latest written value by any client
R: The system continues to work in spite of network partitions
A. P: Consistency, Q: Availability, R: Partition tolerance
B. P: Availability, Q: Consistency, R: Partition tolerance
C. P: Partition tolerance, Q: Consistency, R: Availability
D. P: Consistency, Q: Partition tolerance, R: Availability
 

Answer: B) P: Availability, Q: Consistency, R: Partition tolerance 

Explanation:
The CAP theorem states the following properties:
Consistency: All nodes see the same data at any time, or reads return the latest value written by any client.
Availability: The system allows operations all the time, and operations return quickly.
Partition tolerance: The system continues to work in spite of network partitions.

2. Cassandra uses a protocol called __________ to discover location and state information about the other nodes participating in a Cassandra cluster.
A. Key-value
B. Memtable
C. Heartbeat
D. Gossip

Answer: D) Gossip

Explanation: Cassandra uses a protocol called Gossip to obtain information about the location and status of the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes regularly exchange status information about themselves and about other nodes they know.
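The periodic exchange described above can be sketched as a merge of two nodes' membership views. This is an illustrative toy, not Cassandra's actual implementation; `gossip_round` and the per-peer heartbeat counters are assumed names for the sketch.

```python
# Toy sketch of gossip-style state exchange (not Cassandra internals):
# each node keeps a heartbeat counter per known peer, and a gossip round
# merges two nodes' views by keeping the freshest counter for every peer.

def gossip_round(view_a, view_b):
    """Merge two views; the higher heartbeat wins for each peer."""
    merged = dict(view_a)
    for node, heartbeat in view_b.items():
        if heartbeat > merged.get(node, -1):
            merged[node] = heartbeat
    return merged

# Node A has stale info about C; node B has stale info about A.
view_a = {"A": 10, "B": 3, "C": 1}
view_b = {"A": 7, "B": 5, "C": 4}

merged = gossip_round(view_a, view_b)
print(merged)  # {'A': 10, 'B': 5, 'C': 4}
```

After enough such rounds between random peer pairs, every node's view converges to the freshest state known anywhere in the cluster.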


3. In Cassandra, __________ is used to specify data centers and the number of replicas to place within each data center. It attempts to place replicas on distinct racks to withstand node failure and to ensure data availability.
A. Simple strategy
B. Quorum strategy
C. Network topology strategy
D. None of the mentioned 

Answer: C) Network topology strategy

Explanation: The network topology strategy is used to specify the data centers and the number of replicas to be placed in each data center. It attempts to place replicas on distinct racks to withstand node failure and ensure data availability. With the network topology strategy, the two most common ways to configure multiple-data-center clusters are two replicas in each data center and three replicas in each data center.
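The rack-aware placement idea can be sketched as follows. This is a minimal illustration of the policy (prefer a new rack for each replica, fall back only when racks run out), not Cassandra's real placement code; the node and rack names are made up.

```python
# Illustrative sketch of rack-aware replica placement: walk the nodes of one
# data center in ring order, taking one replica per distinct rack first, then
# filling any remaining slots from already-used racks.

def place_replicas(nodes, replicas):
    """nodes: list of (node_name, rack) in ring order."""
    chosen, used_racks = [], set()
    # First pass: one replica per distinct rack.
    for name, rack in nodes:
        if len(chosen) == replicas:
            break
        if rack not in used_racks:
            chosen.append(name)
            used_racks.add(rack)
    # Second pass: fill remaining slots if there are fewer racks than replicas.
    for name, rack in nodes:
        if len(chosen) == replicas:
            break
        if name not in chosen:
            chosen.append(name)
    return chosen

dc_nodes = [("n1", "rack1"), ("n2", "rack1"), ("n3", "rack2"), ("n4", "rack2")]
print(place_replicas(dc_nodes, 3))  # ['n1', 'n3', 'n2'] (distinct racks first)
```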

 

4. True or False ?
A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks.
A. True
B. False

Answer: True

Explanation: A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests can be routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. In particular, the replication strategy places replicas based on the information provided by the snitch. All nodes must use the same snitch so that rack and data center assignments are consistent across the cluster. Cassandra does its best not to place more than one replica on the same rack (which is not necessarily a physical grouping).


5. Consider the following statements:
Statement 1: In Cassandra, during a write operation, when hinted handoff is enabled and any replica is down, the coordinator writes to all other replicas and keeps the write locally until the down replica comes back up.
Statement 2: In Cassandra, Ec2Snitch is a simple snitch for Amazon EC2 deployments where all nodes are in a single region. In Ec2Snitch, the region name refers to the data center and the availability zone refers to the rack in a cluster.
A. Only Statement 1 is true
B. Only Statement 2 is true
C. Both Statements are true
D. Both Statements are false

Answer: C) Both Statements are true

Explanation: With hinted handoff enabled, if a replica is down during a write, the coordinator writes to all other replicas and stores a hint locally; the hinted write is replayed to the down replica once it comes back up. Ec2Snitch is a simple snitch for Amazon EC2 deployments where all nodes are in a single region: the EC2 region name is treated as the data center and the availability zone as the rack.

 

6. What is Eventual Consistency ?
A. At any time, the system is linearizable
B. If writes stop, all reads will return the same value after a while
C. At any time, concurrent reads from any node return the same values
D. If writes stop, a distributed system will become consistent

Answer: B) If writes stop, all reads will return the same value after a while

Explanation: Cassandra offers eventual consistency: if writes to a key stop, all replicas of that key will converge automatically.
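Convergence after writes stop can be sketched with a last-write-wins merge over replicas. This is a minimal illustration of the idea, assuming each replica stores a (timestamp, value) pair; `anti_entropy` is a made-up name, not a Cassandra API.

```python
# Minimal sketch of eventual consistency: replicas of one key diverge while
# writes are in flight, and a last-write-wins repair step makes every replica
# adopt the pair with the highest timestamp. Once writes stop, all reads
# return the same value.

def anti_entropy(replicas):
    latest = max(replicas)           # pair with the highest timestamp wins
    return [latest] * len(replicas)  # every replica adopts it

# Three replicas of one key, diverged while writes were in flight.
replicas = [(1, "v1"), (3, "v3"), (2, "v2")]
replicas = anti_entropy(replicas)
print(replicas)  # all replicas now hold the write with timestamp 3
```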

 

7. Consider the following statements:
Statement 1: When two processes are competing with each other causing data corruption, it is called deadlock
Statement 2: When two processes are waiting for each other directly or indirectly, it is called race condition
A. Only Statement 1 is true
B. Only Statement 2 is true
C. Both Statements are false
D. Both Statements are true 

Answer: C) Both Statements are false 

Explanation: The correct statements are:
Statement 1: When two processes are competing with each other causing data corruption, it is called Race Condition
Statement 2: When two processes are waiting for each other directly or indirectly, it is called deadlock.
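The corrected definition of deadlock (processes waiting on each other directly or indirectly) can be checked as a cycle in a "wait-for" graph. The sketch below is illustrative, not an OS facility; the process names are made up.

```python
# A deadlock shows up as a cycle in a wait-for graph, where an edge P -> Q
# means process P is waiting on process Q. Depth-first search finds the cycle.

def has_deadlock(wait_for):
    visited, on_path = set(), set()

    def dfs(p):
        if p in on_path:   # back-edge: P is (transitively) waiting on itself
            return True
        if p in visited:
            return False
        visited.add(p)
        on_path.add(p)
        for q in wait_for.get(p, []):
            if dfs(q):
                return True
        on_path.discard(p)
        return False

    return any(dfs(p) for p in wait_for)

print(has_deadlock({"P1": ["P2"], "P2": ["P1"]}))  # True: mutual wait
print(has_deadlock({"P1": ["P2"], "P2": []}))      # False: P2 waits on nobody
```

The indirect case is covered too: P1 waiting on P2, P2 on P3, and P3 on P1 forms a three-process cycle and is also a deadlock.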


8. ZooKeeper allows distributed processes to coordinate with each other through registers, known as
A. znodes
B. hnodes
C. vnodes
D. rnodes

Answer: A) znodes

Explanation: Every znode is identified by a path, with path elements separated by a slash.
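The slash-separated namespace can be modeled as a flat registry keyed by path, where creating a child requires its parent to exist (as ZooKeeper requires). This is a toy model for illustration, not the ZooKeeper client API; the class and paths are invented.

```python
# Toy model of the znode namespace: znodes are identified by slash-separated
# paths, each holds a small blob of data, and a child cannot be created
# before its parent.

class ZnodeTree:
    def __init__(self):
        self.nodes = {"/": b""}  # root znode always exists

    def create(self, path, data=b""):
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise ValueError(f"parent {parent} does not exist")
        self.nodes[path] = data

    def get(self, path):
        return self.nodes[path]

tree = ZnodeTree()
tree.create("/app", b"config")
tree.create("/app/lock-1", b"owner=client42")
print(tree.get("/app/lock-1"))  # b'owner=client42'
```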



9. In ZooKeeper, when a __________ is triggered the client receives a packet saying that the znode has changed.
A. Event
B. Row
C. Watch
D. Value

Answer: C) Watch

Explanation: ZooKeeper supports the concept of watches. Clients can set a watch on a znode.


10. Consider the Table temperature_details in Keyspace “day3” with schema as follows:
temperature_details(daynum, year,month,date,max_temp)
with primary key(daynum,year,month,date) 

DayNum | Year | Month | Date | MaxTemp (°C)
-------|------|-------|------|-------------
1      | 1943 | 10    | 1    | 14.1
2      | 1943 | 10    | 2    | 16.4
541    | 1945 | 3     | 24   | 21.1
9970   | 1971 | 1     | 16   | 21.4
20174  | 1998 | 12    | 24   | 36.7
21223  | 2001 | 11    | 7    | 16
4317   | 1955 | 7     | 26   | 16.7

The same maximum temperature occurs at different hours of the same day. Choose the correct CQL query to:

a) Alter table temperature_details to add a new column called "seasons" using a map of type <varint, text>, represented as <month, season>. Season can have the following values: season = {spring, summer, autumn, winter}.
b) Update table temperature_details where columns daynum, year, month, date contain the values 4317, 1955, 7, 26 respectively.
c) Use a select statement to output the row after the update.
Note: A map relates one item to another with a key-value pair. For each key, only one value may exist, and duplicates cannot be stored. Both the key and the value are designated with a data type.

A)
cqlsh:day3> alter table temperature_details add hours1 set<varint>;
cqlsh:day3> update temperature_details set hours1={1,5,9,13,5,9} where daynum=4317;
cqlsh:day3> select * from temperature_details where daynum=4317;


B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


C)
cqlsh:day3>alter table temperature_details add hours1 list<varint>;
cqlsh:day3> update temperature_details set hours1=[1,5,9,13,5,9] where daynum=4317 and year = 1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


D)
cqlsh:day3> alter table temperature_details add seasons map<month, season>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317;
cqlsh:day3> select * from temperature_details where daynum=4317;

Answer: B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


Explanation:
The correct steps are:
a) Add column “seasons”
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;

b) Update table
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;

c) Select query
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;

 

daynum | year | month | date | hours      | hours1         | max_temp | seasons
-------|------|-------|------|------------|----------------|----------|--------------
4317   | 1955 | 7     | 26   | {1,5,9,13} | [1,5,9,13,5,9] | 16.7     | {7: 'spring'}
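The `seasons = seasons + {7:'spring'}` update adds entries to an existing map column. A plain Python dict merge models that semantics (new keys are added, existing keys are overwritten); this is a sketch of the behavior, not driver code, and `map_add` is an invented helper name.

```python
# CQL map addition modeled as a dict merge: an unset map column behaves like
# an empty map, new keys are added, and an existing key's value is replaced.

row = {"daynum": 4317, "year": 1955, "month": 7, "date": 26,
       "max_temp": 16.7, "seasons": None}

def map_add(current, additions):
    merged = dict(current or {})  # None (unset column) acts as an empty map
    merged.update(additions)
    return merged

row["seasons"] = map_add(row["seasons"], {7: "spring"})
print(row["seasons"])  # {7: 'spring'}
```

Adding a second entry for a different month would extend the map, while re-adding key 7 with a new value would overwrite it, matching the note that a map stores at most one value per key.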

Big Data Computing: Quiz Assignment-III Solutions (Week-3)

1. In Spark, a __________ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A. Spark Streaming

B. FlatMap

C. Driver

D. Resilient Distributed Dataset (RDD)

Answer: D) Resilient Distributed Dataset (RDD)

Explanation: The Resilient Distributed Dataset (RDD) is a basic Spark data structure: a distributed, immutable collection of objects. Each dataset in an RDD is divided into logical partitions that can be computed on different nodes in the cluster. RDDs can contain any type of Python, Java, or Scala object, including custom classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created by deterministic operations on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.


2. Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Where join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Output the result of joinrdd, when the following code is run.

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))

val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))

val joinrdd = rdd1.join(rdd2)

joinrdd.collect


A. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))

B. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))

C. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

D. None of the mentioned

Answer: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

Explanation: join() is a transformation that returns an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other.
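The join semantics can be emulated on plain lists of (key, value) pairs: for each key present in both datasets, emit every (v1, v2) pairing. This is a sketch of the semantics only, not Spark itself, so output ordering may differ from the cluster's.

```python
# Emulation of an inner join on (key, value) pairs: keys appearing in only
# one dataset ("e" and "h" below) produce no output; matching keys produce
# the full cross product of their values.

def join(left, right):
    out = []
    for k, v in left:
        for k2, w in right:
            if k == k2:
                out.append((k, (v, w)))
    return out

rdd1 = [("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54)]
rdd2 = [("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64)]

print(join(rdd1, rdd2))  # the eight pairs of option C, as "e" and "h" drop out
```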

 

3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and also provides an interactive shell.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false

Answer: C) Both statements are true

Explanation: Apache Spark is a fast and general-purpose cluster computing system. It offers high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a variety of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark comes with several sample programs. Spark offers an interactive shell, a powerful tool for interactive data analysis, available in Scala or Python. Spark improves efficiency through in-memory computing primitives: data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel, which makes it practical to recognize patterns and analyze large amounts of data. As the cost of memory has fallen, in-memory processing has become economical for many applications.


4. True or False ?

Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

A. True

B. False

Answer: True

Explanation: Resilient Distributed Datasets (RDDs) are:

1. Immutable collections of objects spread across a cluster

2. Built through parallel transformations (map, filter, etc.)

3. Automatically rebuilt on failure

4. Controllable persistence (e.g. caching in RAM)
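The "automatically rebuilt on failure" property comes from lineage: an RDD remembers the deterministic transformation that produced it, so a lost partition can be recomputed from its parent rather than restored from a replica. The sketch below is a toy model of that idea, not Spark internals; `ToyRDD` is an invented name.

```python
# Toy model of lineage-based recovery: the RDD records its parent data and
# the transformation that produced it, so any partition can be recomputed
# on demand after a failure.

class ToyRDD:
    def __init__(self, parent_partitions, fn):
        self.parent = parent_partitions  # source data, partition by partition
        self.fn = fn                     # deterministic transformation

    def compute(self, i):
        """(Re)compute partition i from its parent via the recorded lineage."""
        return [self.fn(x) for x in self.parent[i]]

rdd = ToyRDD([[1, 2], [3, 4]], fn=lambda x: x * 10)
cache = {0: rdd.compute(0), 1: rdd.compute(1)}

del cache[1]                 # simulate losing a cached partition
cache[1] = rdd.compute(1)    # rebuilt deterministically from lineage
print(cache[1])              # [30, 40]
```

Because the transformation is deterministic and the source is stable, the rebuilt partition is identical to the lost one, which is why RDDs can stay fault tolerant without replicating every partition.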


5. Which of the following is not a NoSQL database?

A. HBase

B. Cassandra

C. SQL Server

D. None of the mentioned

Answer: C) SQL Server

Explanation: NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases, in which data is stored in tables and the data schema is carefully designed before the database is built. NoSQL databases are particularly useful for working with large amounts of distributed data.

 

6. True or False ?

Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

A. True

B. False

Answer: True

Explanation: Spark's biggest claim about speed is that it "can run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk." Spark can make this claim because it performs processing in the main memory of the worker nodes and avoids unnecessary disk I/O. The other benefit Spark offers is the ability to chain tasks at the application programming level without writing to disk at all, or minimizing the number of disk writes.


7. _____________leverages Spark Core fast scheduling capability to perform streaming analytics.

A. MLlib

B. Spark Streaming

C. GraphX

D. RDDs

Answer: B) Spark Streaming

Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.


8. _________ is a distributed graph processing framework on top of Spark.

A. MLlib

B. Spark streaming

C. GraphX

D. All of the mentioned

Answer: C) GraphX

Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is a distributed graph processing framework on top of Spark.


9. Point out the incorrect statement in the context of Cassandra:

A. It is a centralized key-value store

B. It is originally designed at Facebook

C. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure

D. It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Answer: A) It is a centralized key-value store

Explanation: Cassandra is a distributed key-value store.


10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are false

D. Both statements are true

Answer: C) Both statements are false

Explanation: The correct statements are:

Scale up = grow your cluster capacity by replacing with more powerful machines

Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf)
