
Friday, September 24, 2021

Big Data Computing: Quiz Assignment-III Solutions (Week-3)

1. In Spark, a ___________ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A. Spark Streaming

B. FlatMap

C. Driver

D. Resilient Distributed Dataset (RDD)

Answer: D) Resilient Distributed Dataset (RDD)

Explanation: The Resilient Distributed Dataset (RDD) is the basic Spark data structure: a distributed, immutable collection of objects. Each RDD is divided into logical partitions that can be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala object, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through deterministic operations either on data in stable storage or on other RDDs. An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
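As a quick illustration, here is a minimal Scala sketch (run in spark-shell, where the SparkContext sc is predefined) that creates an RDD, checks its partitioning, and derives a new RDD through a transformation:

val numbers = sc.parallelize(1 to 100, numSlices = 4) // distributed over 4 partitions

val squares = numbers.map(n => n * n) // RDDs are read-only: map returns a new RDD

println(numbers.getNumPartitions) // 4

println(squares.take(5).mkString(", ")) // 1, 4, 9, 16, 25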


2. Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Where join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Output the result of joinrdd when the following code is run.

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))

val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))

val joinrdd = rdd1.join(rdd2)

joinrdd.collect


A. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))

B. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))

C. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

D. None of the mentioned

Answer: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)), (m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

Explanation: join() is a transformation that returns an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements is returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. It is an inner join, so keys that appear in only one of the two RDDs are dropped: here "e" occurs only in rdd1 and "h" only in rdd2, which is why neither appears in the result.

 

3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and also provides an interactive shell.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false

Answer: C) Both statements are true

Explanation: Apache Spark is a fast, general-purpose cluster computing system. It offers high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a variety of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Spark ships with several sample programs and offers an interactive shell, a powerful tool for interactive data analysis, available in Scala or Python. Spark improves efficiency through in-memory computing primitives: data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel, which makes it practical to recognize patterns in and analyze large amounts of data. As the cost of memory has fallen, in-memory processing has become economical for many applications.
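For instance, a short session in the interactive shell (started with bin/spark-shell, where the SparkContext sc is predefined) looks like this:

scala> val data = sc.parallelize(1 to 1000)
scala> data.filter(_ % 2 == 0).count()
res0: Long = 500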


4. True or False ?

Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

A. True

B. False

Answer: True

Explanation: Resilient Distributed Datasets (RDDs) are:

1. Immutable collections of objects spread across a cluster

2. Built through parallel transformations (map, filter, etc.)

3. Automatically rebuilt on failure

4. Controllable persistence (e.g. caching in RAM)
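A minimal Scala sketch of these properties (assuming a predefined SparkContext sc, as in spark-shell):

val base = sc.parallelize(Seq(1, 2, 3, 4)) // immutable source RDD
val evens = base.filter(_ % 2 == 0) // a transformation builds a new RDD; base is unchanged
evens.cache() // controllable persistence: keep this RDD in RAM
println(evens.toDebugString) // the lineage Spark uses to rebuild lost partitions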


5. Which of the following is not a NoSQL database?

A. HBase

B. Cassandra

C. SQL Server

D. None of the mentioned

Answer: C) SQL Server

Explanation: NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases, in which data is stored in tables and the schema is carefully designed before the database is created. NoSQL databases are particularly useful for working with large amounts of distributed data. HBase and Cassandra are NoSQL databases, whereas SQL Server is a relational database.

 

6. True or False ?

Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

A. True

B. False

Answer: True

Explanation: Spark's biggest claim about speed is that it "can run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk." Spark can make this claim because it performs processing in the main memory of the worker nodes and avoids unnecessary I/O operations on disk. The other benefit Spark offers is the ability to chain tasks at the application programming level without writing to disk at all, or while minimizing the number of disk writes.
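As an illustration, caching an RDD that several actions reuse keeps it in memory instead of re-reading it from disk each time. A sketch, with a hypothetical input path:

val logs = sc.textFile("hdfs:///data/logs.txt") // hypothetical path, for illustration only
logs.cache() // keep the lines in memory after first use

val errorCount = logs.filter(_.contains("ERROR")).count() // first action materializes the cache
val warnCount = logs.filter(_.contains("WARN")).count() // second action reuses the in-memory copy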


7. _____________ leverages Spark Core's fast scheduling capability to perform streaming analytics.

A. MLlib

B. Spark Streaming

C. GraphX

D. RDDs

Answer: B) Spark Streaming

Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
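A minimal Scala sketch of this mini-batch model, assuming a text source on localhost port 9999 (for example, one started with nc -lk 9999):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second mini-batches

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print() // word counts for each mini-batch

ssc.start()
ssc.awaitTermination()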


8. _________ is a distributed graph processing framework on top of Spark.

A. MLlib

B. Spark Streaming

C. GraphX

D. All of the mentioned

Answer: C) GraphX

Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is a distributed graph processing framework on top of Spark.
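A minimal GraphX sketch (assuming a predefined SparkContext sc):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices are (VertexId, attribute) pairs; edges carry their own attribute
val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

val graph = Graph(users, follows)
println(graph.inDegrees.collect().mkString(", ")) // (2,2): "bob" is followed twice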


9. Point out the incorrect statement in the context of Cassandra:

A. It is a centralized key-value store

B. It was originally designed at Facebook

C. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure

D. It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Answer: A) It is a centralized key-value store

Explanation: Cassandra is a distributed key-value store, not a centralized one; it is designed to handle large amounts of data across many commodity servers with no single point of failure.


10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are false

D. Both statements are true

Answer: C) Both statements are false

Explanation: The correct statements are:

Scale up = grow your cluster capacity by replacing with more powerful machines

Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf)

Monday, August 30, 2021

Bigdata: Challenges and solutions

Big Data: Very large, abundant amounts of data, information, or correlated statistics collected by big organizations. Much of the software and data storage in this field was developed because big data is too hard to evaluate manually. It is used to find patterns and trends for making decisions concerning humans and interactive technology.

Applications of Big Data

1. Banking and Financial Services

Credit card companies, retail banks, private wealth management services, insurance companies, and institutional investment houses all use big data analysis for their financial services. Their common problem is that massive amounts of multi-structured data sit in multiple systems, which big data analysis can consolidate quickly to support decisions. Big data is used in many ways, such as:

• Customer analytics

• Compliance analytics

• Fraud analytics

• Operational analytics

2. Big Data in telecommunications

Gaining new subscribers, retaining existing customers, and expanding within the current customer base are top priorities for telecommunications companies. The solution to these challenges lies in the ability to collate and analyze the customer-generated and machine-generated data that is created every day.

3. Big Data for Retail marketing

Whether a company is an online retailer or an offline business, it wants to understand customer demand and how customer needs change. This requires analyzing all the different data sources (data marts) that companies deal with day to day, including customer transaction data, weblogs, social media, credit card data, and reward/coupon program data.

Bigdata challenges and solution

1. Lack of understanding of Big Data

Many organizations fail in their Big Data initiatives due to a lack of understanding. Employees may not know what data is, how it is stored and processed, why it matters, or where it comes from. Data professionals may know what needs to be done, but others may not have a clear view.

For example, if an employee does not understand the significance of data storage, they may not keep backups of confidential or sensitive data, or may not use database systems properly for storage. As a result, when this data is required and needs to be accessed, it cannot be retrieved easily.

Solution:

Big Data workshops and hands-on practice must be conducted for everyone. Basic training programs must be arranged for all employees who handle data daily or as part of Big Data projects. A basic understanding of Big Data concepts must be inculcated across the whole organization.

2. Data growth issues

One of the most complex challenges of Big Data is storing all of this voluminous data properly. The amount of data stored in companies' data marts and databases is growing rapidly.

As these data sets grow over time, they become difficult to handle. Much of the data is unstructured and comes from documents, audio, video, text files, and other sources, which means it cannot simply be searched in relational databases.

Solution:

To maintain these large data sets, companies use modern techniques such as compression, tiering (level-wise storage), and de-duplication. Compression reduces redundancy in the data, shrinking its overall size to some extent without changing its meaning. De-duplication is the process of eradicating duplicate and unwanted records from a data set. Data tiering lets companies store data in different storage tiers so that each data set resides in the most appropriate storage space; tiers can be private cloud, public cloud, or flash storage, depending on the data's size and significance.
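As a small illustration of de-duplication, here is a minimal Spark sketch in Scala (assuming a predefined SparkContext sc and plain string records; real pipelines would match records more loosely than exact equality):

val records = sc.parallelize(Seq("a@example.com", "b@example.com", "a@example.com"))
val deduplicated = records.distinct() // drops exact duplicate records
println(deduplicated.count()) // 2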

3. Confusion in selecting Bigdata tool

Companies sometimes get confused while selecting the best tools for Big Data analysis and storage. Many questions arise, such as:

Is HBase or Cassandra the best technology for storage?

Is Hadoop MapReduce good enough, or would Spark be a better choice for data analytics and storage?

These questions bother companies, and often they cannot find the answers. They end up making poor decisions and selecting unsuitable technology, so money, time, and effort are wasted.

Solution:

The best way is to seek professional assistance. You can hire experienced Big Data professionals who know the tools well, or engage a Big Data consultancy for proper advice. Consultants will recommend the tools best suited to the company's scenario; based on their advice, you can form a strategy and then select the best tool for the company.

4. Lack of data professionals

To utilize these new technologies and Big Data tools, companies need skilled data professionals: data scientists, data analysts, and data engineers who are experienced in working with data handling tools and in making sense of voluminous data sets. Companies currently face a shortage of such professionals because data handling tools have evolved rapidly while, in many cases, professionals' skills have not kept pace.

Solution:

Companies are investing more and more money in hiring skilled professionals. They also need to offer training programs to their existing staff to get the most out of them.

Another significant step taken by companies is to purchase data analytics software/tools that are powered by artificial intelligence and/or machine learning. These tools can be used by professionals who are not data science experts but have preliminary knowledge.

5. Securing the data

Securing huge data sets is one of the most challenging tasks in Big Data. Big companies are often so busy collecting, understanding, storing, and analyzing their data that they push data security to later stages. This is not a good move, as unprotected data repositories can become breeding grounds for hackers, and companies can lose both their data and their revenue.

Solution:

Companies should recruit cyber-security professionals to protect their data. Other steps for securing data include:

• Data encryption

• Data segregation

• Identity and access control

• Implementation of endpoint security

• Real-time security monitoring

• Use Big Data security tools

6. Integrating data from various sources

Data in a company comes from a variety of sources or data marts, such as social media pages, ERP applications, MIS applications, customer logs, financial reports, e-mails, presentations, and reports created by employees. Combining all these types of data into a single report is a challenging task, and it is a field often neglected by firms. But data integration is essential for analysis, reporting, and business intelligence, so it has to be worked out.

Solution:

Companies can resolve data integration problems by buying the right data handling tools. A few of them are mentioned below:

• Talend Data Integration

• Centerprise Data Integrator

• ArcESB

• IBM InfoSphere

• Xplenty

• Informatica PowerCenter

• CloverDX

• Microsoft SQL

• QlikView

• Oracle Data Service Integrator
