
Friday, October 15, 2021

Big Data Computing: Quiz Assignment-V Solutions (Week-5)

1. Columns in HBase are organized into
A. Column group
B. Column list
C. Column base
D. Column families
Answer: D) Column families
Explanation: An HBase table consists of column families, which are logical and physical groupings of columns. The columns of one family are stored separately from the columns of other families.
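To make this concrete, here is a minimal sketch of creating a table with two column families using the HBase 2.x Java client (the table name "employee" and the family names "personal" and "professional" are invented for this example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Each column family is a separate physical grouping: HBase stores
            // the "personal" columns apart from the "professional" columns.
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("employee"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                    .build());
        }
    }
}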

2. HBase is a _______________ distributed database built on top of the Hadoop file system.
A. Row-oriented
B. Tuple-oriented
C. Column-oriented
D. None of the mentioned
Answer: C) Column-oriented
Explanation: HBase is a column-oriented distributed data store capable of scaling horizontally to 1,000 commodity servers and petabytes of indexed storage.

3. A small chunk of data residing in one machine, which is part of a cluster of machines holding one HBase table, is known as
A. Region
B. Split
C. Rowarea
D. Tablearea
Answer: A) Region
Explanation: In HBase, tables are split into regions, and regions are served by region servers.


4. In HBase, a _______________ is a combination of row, column family, and column qualifier, and contains a value and a timestamp.
A. Cell
B. Stores
C. HMaster
D. Region Server
Answer: A) Cell
Explanation: Data is stored in the cells of an HBase table. A cell is a combination of row, column family, and column qualifier, and it contains a value and a timestamp.
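As a sketch of how these cell coordinates appear in the HBase Java client (reusing the hypothetical "employee" table from the earlier example):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Write a value at the coordinates (row, column family, qualifier);
            // HBase attaches a timestamp to the cell automatically.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Read the cell back: value and timestamp together identify one version.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            Cell cell = result.getColumnLatestCell(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(CellUtil.cloneValue(cell)) + " @ " + cell.getTimestamp());
        }
    }
}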

5. HBase architecture has 3 main components:
A. Client, Column family, Region Server
B. Cell, Rowkey, Stores
C. HMaster, Region Server, Zookeeper
D. HMaster, Stores, Region Server
Answer: C) HMaster, Region Server, Zookeeper
Explanation: HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.

1. HMaster: The Master Server implementation in HBase is HMaster. It is the process that assigns regions to region servers, handles DDL operations (creating and dropping tables), and monitors all region server instances in the cluster.

2. Region Servers: An HBase table is divided horizontally by row-key range into regions. Regions are the basic building blocks of an HBase cluster; each region holds a portion of a table's data, organized into column families. Region servers run on the HDFS DataNodes of the Hadoop cluster.

3. ZooKeeper: It acts as the coordinator for HBase. It provides services such as maintaining configuration information, naming, distributed synchronization, and notification of server failures. Clients use ZooKeeper to locate region servers.


6. HBase stores data in
A. As many filesystems as the number of region servers
B. One filesystem per column family
C. A single filesystem available to all region servers
D. One filesystem per table.
Answer: C) A single filesystem available to all region servers
Explanation: All region servers in an HBase cluster read from and write to the same underlying distributed filesystem (HDFS).


7. Kafka is run as a cluster comprising one or more servers, each of which is called a
A. cTakes
B. Chunks
C. Broker
D. None of the mentioned
Answer: C) Broker
Explanation: A Kafka broker allows consumers to fetch messages by topic, partition, and offset. Brokers form a Kafka cluster by sharing information among themselves, directly or indirectly via ZooKeeper. A Kafka cluster has exactly one broker that acts as the controller.
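To illustrate, a minimal sketch of fetching messages by topic, partition, and offset with the Kafka Java consumer (the broker address, topic name, and starting offset are invented):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class OffsetReadExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // any broker in the cluster
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Messages are addressed by (topic, partition, offset), exactly as
            // the broker stores them.
            TopicPartition tp = new TopicPartition("events", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 42L); // start reading at offset 42
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records)
                System.out.printf("%s-%d@%d: %s%n", r.topic(), r.partition(), r.offset(), r.value());
        }
    }
}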


8. True or False?
Statement 1: Batch processing provides the ability to process and analyze data at rest (stored data).
Statement 2: Stream processing provides the ability to ingest, process, and analyze data in motion, in real or near-real time.
A. Only statement 1 is true
B. Only statement 2 is true
C. Both statements are true
D. Both statements are false
Answer: C) Both statements are true


9. _________________is a central hub to transport and store event streams in real time.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: A) Kafka Core
Explanation: Kafka Core is a central hub to transport and store event streams in real time.
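For illustration, a minimal sketch of publishing events to this hub with the Kafka Java producer (the broker address, topic name, key, and payload are invented):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PublishExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one event to the hub; the brokers store it durably so
            // any number of consumers can read the stream later.
            producer.send(new ProducerRecord<>("events", "sensor-1", "{\"temp\": 21.4}"));
        }
    }
}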


10. What are the parameters defined to specify a window operation?
A. State size, window length
B. State size, sliding interval
C. Window length, sliding interval
D. None of the mentioned
Answer: C) Window length, sliding interval
Explanation:
The following parameters are used to specify a window operation:
(i) Window length: the duration of the window.
(ii) Sliding interval: the interval at which the window operation is performed.
Both parameters must be a multiple of the batch interval.
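As a sketch in the Spark Streaming Java API, assuming a batch interval of 10 seconds and a socket source on localhost:9999 (both invented for the example): a window length of 30 seconds and a sliding interval of 10 seconds are both multiples of the batch interval.

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("WindowExample");
        // Batch interval: 10 seconds.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        // Window length 30 s, sliding interval 10 s: every 10 seconds, count
        // the records that arrived in the last 30 seconds.
        JavaDStream<Long> counts = lines.window(Durations.seconds(30), Durations.seconds(10)).count();
        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}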

11. _________________is a Java library to process event streams live as they occur.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: C) Kafka Streams  

Explanation: Kafka Streams is a Java library to process event streams live as they occur.
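For illustration, a minimal Kafka Streams sketch that transforms each record as it arrives (the application id, broker address, and topic names are invented):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read events from one topic, process each record live as it occurs,
        // and write the result to another topic.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events");
        events.mapValues(v -> v.toUpperCase()).to("events-upper");

        new KafkaStreams(builder.build(), props).start();
    }
}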

Friday, September 24, 2021

Big Data Computing: Quiz Assignment-IV Solutions (Week-4)

1. Identify the correct choices for the given scenarios:
P: The system allows operations all the time, and operations return quickly
Q: All nodes see same data at any time, or reads return latest written value by any client
R: The system continues to work in spite of network partitions
A. P: Consistency, Q: Availability, R: Partition tolerance
B. P: Availability, Q: Consistency, R: Partition tolerance
C. P: Partition tolerance, Q: Consistency, R: Availability
D. P: Consistency, Q: Partition tolerance, R: Availability

Answer: B) P: Availability, Q: Consistency, R: Partition tolerance 

Explanation:
The CAP theorem states the following properties:
Consistency: All nodes see same data at any time, or reads return latest written value by any client.
Availability: The system allows operations all the time, and operations return quickly.
Partition-tolerance: The system continues to work in spite of network partitions.

2. Cassandra uses a protocol called _______________ to discover location and state information about the other nodes participating in a Cassandra cluster.
A. Key-value
B. Memtable
C. Heartbeat
D. Gossip

Answer: D) Gossip

Explanation: Cassandra uses a protocol called Gossip to obtain information about the location and status of the other nodes participating in a Cassandra cluster. Gossip is a peer-to-peer communication protocol in which nodes regularly exchange status information about themselves and about other nodes they know.


3. In Cassandra, _______________ is used to specify data centers and the number of replicas to place within each data center. It attempts to place replicas on distinct racks to guard against node failure and to ensure data availability.
A. Simple strategy
B. Quorum strategy
C. Network topology strategy
D. None of the mentioned 

Answer: C) Network topology strategy

Explanation: The network topology strategy is used to specify the data centers and the number of replicas to be placed in each data center. It attempts to place replicas on distinct racks to guard against node failure and to ensure data availability. With the network topology strategy, the two most common ways to configure a multiple-data-center cluster are two replicas in each data center and three replicas in each data center.
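As a sketch using the DataStax Java driver (the keyspace name and the data center names dc1 and dc2 are invented; in practice they must match the names reported by the snitch):

import com.datastax.oss.driver.api.core.CqlSession;

public class KeyspaceExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Three replicas in each of two data centers; within a data center,
            // Cassandra tries to spread the replicas across distinct racks.
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS day3 WITH replication = "
              + "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}");
        }
    }
}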


4. True or False?
A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks.
A. True
B. False

Answer: A) True

Explanation: A snitch determines which data centers and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests can be routed efficiently, and they allow Cassandra to distribute replicas by grouping machines into data centers and racks. In particular, the replication strategy places replicas based on the information provided by the snitch. All snitches must report the same rack and data center for a given node. Cassandra does its best not to have more than one replica on the same rack (which is not necessarily a physical grouping).


5. Consider the following statements:
Statement 1: In Cassandra, during a write operation, when hinted handoff is enabled and a replica is down, the coordinator writes to all other replicas and keeps the write locally until the downed replica comes back up.
Statement 2: In Cassandra, Ec2Snitch is a simple snitch for Amazon EC2 deployments where all nodes are in a single region. In Ec2Snitch, the region name refers to the data center and the availability zone refers to the rack in a cluster.
A. Only Statement 1 is true
B. Only Statement 2 is true
C. Both Statements are true
D. Both Statements are false

Answer: C) Both Statements are true

Explanation: Both statements are correct. With hinted handoff enabled, if a replica is down during a write, the coordinator writes to the remaining replicas and stores a hint locally, replaying the write once the downed replica comes back up. Ec2Snitch is a simple snitch for Amazon EC2 deployments confined to a single region, where the region name maps to the data center and the availability zone maps to the rack.


6. What is Eventual Consistency?
A. At any time, the system is linearizable
B. If writes stop, all reads will return the same value after a while
C. At any time, concurrent reads from any node return the same values
D. If writes stop, a distributed system will become consistent

Answer: B) If writes stop, all reads will return the same value after a while

Explanation: Cassandra offers eventual consistency: if writes to a key stop, all replicas of that key will automatically converge to the same value after a while.


7. Consider the following statements:
Statement 1: When two processes are competing with each other causing data corruption, it is called deadlock
Statement 2: When two processes are waiting for each other directly or indirectly, it is called race condition
A. Only Statement 1 is true
B. Only Statement 2 is true
C. Both Statements are false
D. Both Statements are true 

Answer: C) Both Statements are false 

Explanation: The correct statements are:
Statement 1: When two processes are competing with each other causing data corruption, it is called Race Condition
Statement 2: When two processes are waiting for each other directly or indirectly, it is called deadlock.
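To make the distinction concrete, a minimal Java sketch of a race condition: two threads compete over an unsynchronized counter and corrupt the result. A deadlock, by contrast, would have each thread waiting on a resource the other holds.

public class RaceConditionExample {
    static int counter = 0; // shared, unsynchronized state

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) counter++; // read-modify-write is not atomic
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // The two threads' increments interleave and overwrite each other,
        // so the printed total is usually less than 200000.
        System.out.println(counter);
    }
}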


8. ZooKeeper allows distributed processes to coordinate with each other through registers, known as
A. znodes
B. hnodes
C. vnodes
D. rnodes

Answer: A) znodes

Explanation: Every znode is identified by a path, with path elements separated by a slash.



9. In ZooKeeper, when a _______________ is triggered, the client receives a packet saying that the znode has changed.
A. Event
B. Row
C. Watch
D. Value

Answer: C) Watch

Explanation: ZooKeeper supports the concept of watches. A client can set a watch on a znode; when the znode changes, the watch is triggered and the client is notified.
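A minimal sketch with the ZooKeeper Java client, assuming a server on localhost:2181 and an invented znode path /config: it creates the znode, registers a watch, and then triggers the watch with an update.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WatchExample {
    public static void main(String[] args) throws Exception {
        // Session watcher: receives a packet when a watched znode changes.
        Watcher watcher = (WatchedEvent event) ->
            System.out.println("Event: " + event.getType() + " on " + event.getPath());

        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, watcher);

        // Every znode is identified by a slash-separated path.
        zk.create("/config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Passing true registers a watch: the next change to /config triggers
        // a single NodeDataChanged notification to this client.
        zk.getData("/config", true, null);

        zk.setData("/config", "v2".getBytes(), -1); // fires the watch
        Thread.sleep(1000);
        zk.close();
    }
}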


10. Consider the Table temperature_details in Keyspace “day3” with schema as follows:
temperature_details(daynum, year,month,date,max_temp)
with primary key(daynum,year,month,date) 

DayNum | Year | Month | Date | MaxTemp (°C)
-------+------+-------+------+-------------
     1 | 1943 |    10 |    1 | 14.1
     2 | 1943 |    10 |    2 | 16.4
   541 | 1945 |     3 |   24 | 21.1
  9970 | 1971 |     1 |   16 | 21.4
 20174 | 1998 |    12 |   24 | 36.7
 21223 | 2001 |    11 |    7 | 16
  4317 | 1955 |     7 |   26 | 16.7

The same maximum temperature may occur at different hours of the same day. Choose the correct CQL query to:

Alter table temperature_details to add a new column called “seasons” using a map of type <varint, text>, represented as <month, season>. Season can have the following values: season = {spring, summer, autumn, winter}.
Update table temperature_details where the columns daynum, year, month, and date contain the values 4317, 1955, 7, and 26, respectively.
Use a select statement to output the row after the update.
Note: A map relates one item to another with a key-value pair. For each key, only one value may exist, and duplicates cannot be stored. Both the key and the value are designated with a data type.

A)
cqlsh:day3> alter table temperature_details add hours1 set<varint>;
cqlsh:day3> update temperature_details set hours1={1,5,9,13,5,9} where daynum=4317;
cqlsh:day3> select * from temperature_details where daynum=4317;


B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


C)
cqlsh:day3>alter table temperature_details add hours1 list<varint>;
cqlsh:day3> update temperature_details set hours1=[1,5,9,13,5,9] where daynum=4317 and year = 1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


D)
cqlsh:day3> alter table temperature_details add seasons map<month, season>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317;
cqlsh:day3> select * from temperature_details where daynum=4317;

Answer: B)
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


Explanation:
The correct steps are:
a) Add column “seasons”
cqlsh:day3> alter table temperature_details add seasons map<varint,text>;

b) Update table
cqlsh:day3> update temperature_details set seasons = seasons + {7:'spring'} where daynum=4317 and year =1955 and month = 7 and date=26;

c) Select query
cqlsh:day3> select * from temperature_details where daynum=4317 and year=1955 and month=7 and date=26;


daynum | year | month | date | hours      | hours1         | max_temp | seasons
-------+------+-------+------+------------+----------------+----------+--------------
  4317 | 1955 |     7 |   26 | {1,5,9,13} | [1,5,9,13,5,9] |     16.7 | {7: 'spring'}

Saturday, September 11, 2021

Big Data Computing: Quiz Assignment-I Solutions (Week-1)

1. _____________is responsible for allocating system resources to the various applications running in a Hadoop cluster and scheduling tasks to be executed on different cluster nodes.


A. Hadoop Common
B. Hadoop Distributed File System (HDFS)
C. Hadoop YARN
D. Hadoop MapReduce

Answer: C) Hadoop YARN

Explanation:

Hadoop Common: contains the libraries and utilities needed by the other Hadoop modules.
HDFS: a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in the cluster and using them to schedule users' applications. YARN allocates system resources to the various applications running in a Hadoop cluster and schedules tasks to be executed on different cluster nodes.
Hadoop MapReduce: a programming model for processing large data sets in parallel across the cluster.


2. Which of the following tools is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases?
A. Pig
B. Mahout
C. Apache Sqoop
D. Flume

Answer: C) Apache Sqoop

Explanation: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


3. ________________is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
A. Flume
B. Apache Sqoop
C. Pig
D. Mahout

Answer: A) Flume
Explanation: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault-tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic applications.


4. _____________refers to the connectedness of big data.

A. Value
B. Veracity
C. Velocity
D. Valence

Answer: D) Valence

Explanation: Valence refers to the connectedness of big data, such as data items linked in the form of graph networks.


5. Consider the following statements:

Statement 1: Volatility refers to the data velocity relative to timescale of event being studied

Statement 2: Viscosity refers to the rate of data loss and stable lifetime of data

A. Only statement 1 is true
B. Only statement 2 is true
C. Both statements are true
D. Both statements are false

Answer: D) Both statements are false

Explanation: The correct statements are:

Statement 1: Viscosity refers to the data velocity relative to the timescale of the event being studied.
Statement 2: Volatility refers to the rate of data loss and the stable lifetime of data.


6.___________refers to the biases, noise and abnormality in data, trustworthiness of data.

A. Value
B. Veracity
C. Velocity
D. Volume

Answer: B) Veracity

Explanation: Veracity refers to the biases, noise, and abnormality in data, and the trustworthiness of data.


7. ___________ brings scalable parallel database technology to Hadoop and allows users to submit low-latency queries to the data stored within HDFS or HBase without incurring a lot of data movement and manipulation.
A. Apache Sqoop
B. Mahout
C. Flume
D. Impala

Answer: D) Impala

Explanation: Impala is a query engine that runs on Apache Hadoop, originally developed by Cloudera. The project was officially announced in late 2012 and became a publicly available open-source distribution. Impala brings scalable parallel database technology to Hadoop, allowing users to issue low-latency queries against data stored in HDFS or HBase without requiring extensive data movement or manipulation.


8. True or False?

NoSQL databases store unstructured data with no particular schema
A. True
B. False

Answer: A) True
Explanation: Traditional SQL databases can handle large amounts of structured data effectively; NoSQL (Not Only SQL) databases are needed to handle unstructured data. NoSQL databases store unstructured data with no particular schema.


9. _____________is a highly reliable distributed coordination kernel , which can be used for distributed locking, configuration management, leadership election, and work queues etc.
A. Apache Sqoop
B. Mahout
C. ZooKeeper
D. Flume

Answer: C) ZooKeeper

Explanation: ZooKeeper is a centralized key-value store that distributed systems can use for coordination. Because it must be able to handle the load, ZooKeeper is replicated over many machines.


10. True or False?

MapReduce is a programming model and an associated implementation for processing and generating large data sets.
A. True
B. False

Answer: A) True
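Explanation: For context, the canonical word-count job is a minimal illustration of the MapReduce model in the Hadoop Java API: the map phase emits (word, 1) pairs, and the reduce phase sums the counts per word (input and output paths come from the command line).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // sum the counts per word
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}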
