Showing posts with label MapReduce. Show all posts

Saturday, September 11, 2021

Big Data Computing: Quiz Assignment-II Solutions (Week-2)

1. Consider the following statements:

Statement 1: The Job Tracker is hosted inside the master and it receives the job execution request from the client.

Statement 2: Task tracker is the MapReduce component on the slave machine as there are multiple slave machines.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false


Answer: C) Both statements are true


2. _______________is the slave/worker node and holds the user data in the form of Data Blocks.

A. NameNode

B. Data block

C. Replication

D. DataNode

Answer: D) DataNode

Explanation: The NameNode acts as the master server: it manages the file system namespace, regulates client access to files, and keeps track of where the data resides on the DataNodes and how the blocks are distributed. The DataNode, in contrast, is a slave/worker node that stores user data in the form of data blocks.


3. ________works as a master server that manages the file system namespace and basically regulates access to these files from clients, and it also keeps track of where the data is on the Data Nodes and where the blocks are distributed essentially.

A. Name Node

B. Data block

C. Replication

D. Data Node

Answer: A) Name Node

Explanation: The NameNode, as the master server, manages the file system namespace and regulates client access to files. It also tracks where the data resides on the DataNodes and how the blocks are distributed. DataNodes, in contrast, are slave/worker nodes that hold user data in the form of data blocks.


4. The number of maps in MapReduce is usually driven by the total size of

A. Inputs

B. Outputs

C. Tasks

D. None of the mentioned

Answer: A) Inputs

Explanation: The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. The map function, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all the intermediate values associated with the same intermediate key and passes them to the reduce function.
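
The link between input size and the number of maps can be sketched with a little arithmetic. This is a rough estimate only: the helper name `estimate_map_tasks` and the 128 MiB split size are assumptions made for illustration, and a real framework applies finer rules when it computes splits.

```python
import math

def estimate_map_tasks(file_sizes_bytes, split_size_bytes=128 * 1024 * 1024):
    """Roughly one map task per input split, with splits computed per file."""
    return sum(
        math.ceil(size / split_size_bytes)
        for size in file_sizes_bytes
        if size > 0  # empty files contribute no splits in this sketch
    )

one_gib = 1024 ** 3
print(estimate_map_tasks([one_gib]))                      # 8
print(estimate_map_tasks([one_gib, 200 * 1024 * 1024]))   # 10
```

Doubling the input roughly doubles the number of maps, while the size of the output has no effect on it.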


5. True or False ?

The main duties of the task tracker are to break down the received job (a big computation) into small parts, allocate the partial computations (tasks) to the slave nodes, and monitor the progress and reports of task execution from the slaves.

A. True

B. False

Answer: B) False

Explanation: The duties described are those of the Job Tracker. The Task Tracker runs the tasks on its slave node and communicates progress and results to the Job Tracker.


6. Point out the correct statement in context of YARN:

A. YARN is highly scalable.

B. YARN enhances a Hadoop compute cluster in many ways

C. YARN extends the power of Hadoop to incumbent and new technologies found within the data center

D. All of the mentioned

Answer: D) All of the mentioned


7. Consider the pseudo-code for MapReduce's WordCount example (not shown here). Let's now assume that you want to determine the frequency of phrases consisting of 3 words each instead of determining the frequency of single words. Which part of the (pseudo-)code do you need to adapt?

A. Only map()

B. Only reduce()

C. map() and reduce()

D. The code does not have to be changed

Answer: A) Only map()

Explanation: The map function takes a value and outputs key:value pairs.

For instance, if we define a map function that takes a string and outputs the length of the word as the key and the word itself as the value then

map(steve) would return 5:steve and map(savannah) would return 8:savannah.

This allows us to run the map function against values in parallel. To count three-word phrases, only map() has to change so that it emits each three-word phrase as the key; reduce() still just sums the counts for each key.
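
A minimal in-memory sketch of this idea follows. It is not the course's pseudo-code (which is not shown here); the names `map_words`, `map_trigrams`, `reduce_counts`, and `run_job` are hypothetical, and phrases that span two input lines are ignored for simplicity.

```python
from collections import defaultdict

def map_words(line):
    """Original WordCount map: emit (word, 1) for every word."""
    for word in line.split():
        yield word, 1

def map_trigrams(line):
    """Adapted map: emit (three-word phrase, 1) for every 3-word window."""
    words = line.split()
    for i in range(len(words) - 2):
        yield " ".join(words[i:i + 3]), 1

def reduce_counts(key, values):
    """Unchanged reduce: sum the counts collected for one key."""
    return key, sum(values)

def run_job(lines, map_fn, reduce_fn):
    """Tiny stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())

lines = ["the quick brown fox", "the quick brown dog"]
print(run_job(lines, map_words, reduce_counts))
print(run_job(lines, map_trigrams, reduce_counts))
```

Both runs use the same `reduce_counts`; only the map function passed to `run_job` differs.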


8. The namenode knows that the datanode is active using a mechanism known as

A. Heartbeats

B. Datapulse

C. h-signal

D. Active-pulse

Answer: A) Heartbeats

Explanation: Heartbeats are used for communication between the Hadoop NameNode and the DataNodes. A heartbeat is a signal that a DataNode sends to the NameNode at a regular interval to indicate that it is alive.
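
The heartbeat mechanism can be sketched as a table of last-seen times checked against a timeout. This is an illustration only, not HDFS code: the class `HeartbeatMonitor`, the node names, and the 30-second timeout are all hypothetical, and times are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time reported by each worker node."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}

    def record_heartbeat(self, node_id, now):
        """Called whenever a worker node reports in."""
        self.last_seen[node_id] = now

    def is_alive(self, node_id, now):
        """A node is alive if it reported within the timeout window."""
        last = self.last_seen.get(node_id)
        return last is not None and now - last <= self.timeout_seconds

monitor = HeartbeatMonitor(timeout_seconds=30)  # hypothetical timeout
monitor.record_heartbeat("datanode-1", now=0)
monitor.record_heartbeat("datanode-2", now=0)
monitor.record_heartbeat("datanode-1", now=25)

print(monitor.is_alive("datanode-1", now=40))  # True: 15 s since last heartbeat
print(monitor.is_alive("datanode-2", now=40))  # False: 40 s since last heartbeat
```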


9. True or False ?

HDFS performs replication, although it results in data redundancy?

A. True

B. False

Answer: A) True

Explanation: Once data has been written to HDFS, it is replicated across the cluster so that the copies of each block are stored on different DataNodes. The default replication factor is 3, which keeps the data from being either over-replicated or under-replicated.
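
The cost and benefit of this redundancy can be put into numbers. The sketch below is simple arithmetic under the assumption that every block is stored exactly `replication_factor` times; the helper names are made up for illustration.

```python
def raw_storage_bytes(logical_bytes, replication_factor=3):
    """Disk space consumed when every block is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_bytes * replication_factor

def tolerated_copy_losses(replication_factor=3):
    """Copies of a block that can be lost while one copy still survives."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor - 1

one_tib = 1024 ** 4
print(raw_storage_bytes(one_tib) // one_tib)  # 3
print(tolerated_copy_losses())                # 2
```

With a factor of 3, storing 1 TiB of user data consumes 3 TiB of raw disk, and a block stays readable after two of its copies are lost.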


10. _____________function processes a key/value pair to generate a set of intermediate key/value pairs.

A. Map

B. Reduce

C. Both Map and Reduce

D. None of the mentioned

Answer: A) Map

Explanation: Map is an individual task that transforms input records into intermediate records, whereas reduce processes and merges all the intermediate values associated with each key.

Monday, August 30, 2021

Bigdata: Challenges and solutions

Big Data: A very large volume of data, information, or related statistics collected by large organizations. Because such data is too difficult to evaluate manually, dedicated software and data storage systems have been developed for it. Big data is used to find patterns and trends that inform decisions about human behavior and interactive technology.

Applications of Big Data

1. Banking and Financial Services

Credit card companies, retail banks, private wealth management services, insurance companies, and institutional investment houses all use big data analysis for their financial services. Their common problem is the massive amount of multi-structured data stored across multiple systems, which big data tools can analyze quickly enough to support decisions. Big data is used in many ways, such as:

• Customer analytics

• Compliance analytics

• Fraud analytics

• Operational analytics

2. Big Data in telecommunications

Gaining new subscribers, retaining existing customers, and expanding within the current customer base are top priorities for telecommunication companies. The solution to these challenges lies in the ability to collate and analyze the customer-generated and machine-generated data that is created every day.

3. Big Data for Retail marketing

Whether a company is an online retailer or an offline company, it wants to understand customer demand and how customers' needs change. This requires analyzing all the different data sources (data marts) that the company deals with day to day, including customer transaction data, weblogs, social media, credit card data, and reward/coupon program data.

Big Data challenges and solutions

1. Lack of understanding of Big Data

Many organizations fail in their Big Data initiatives due to a lack of understanding. Employees may not know what data is, how it is stored and processed, why it matters, or where it comes from. Data professionals may know what needs to be done, but others may not have a clear view.

For example, employees who do not understand the significance of data storage may not keep backups of confidential or sensitive data, or may not use database systems properly for storage. As a result, when this data is needed, it cannot be retrieved easily.

Solution:

Big Data workshops and hands-on practice should be conducted for everyone. Basic training programs should be run for all employees who handle data daily or as part of Big Data projects. Every organization should instill a basic understanding of Big Data concepts in its staff.

2. Data growth issues

One of the most complex challenges of Big Data is storing all of this voluminous data properly. The amount of data stored in companies' data marts and databases is growing rapidly.

As this data grows over time, it will become difficult to handle. Much of it is unstructured and comes from documents, audio, video, text files, and other sources, which means it cannot be searched the way records in a database can.

Solution:

To maintain these large data sets, companies are adopting modern techniques such as compression, tiering (level-wise storage), and de-duplication. Compression reduces the redundancy in the data, and thus its overall size, to some extent without changing its meaning. De-duplication is the process of removing duplicate and unwanted data from a data set. Data tiering allows companies to store data in different storage tiers so that each piece of data resides in the most appropriate storage space. Data tiers can be private cloud, public cloud, or flash storage, depending on the data size and significance.
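
Two of these techniques, de-duplication and compression, can be sketched with Python's standard library. This is a toy illustration with made-up records, not a production storage pipeline; the helper name `deduplicate` is hypothetical.

```python
import hashlib
import zlib

def deduplicate(records):
    """Keep the first occurrence of each record, comparing by content hash."""
    seen = set()
    unique = []
    for record in records:
        digest = hashlib.sha256(record).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

# Made-up sample data: three copies of one record plus one distinct record.
records = [b"order:1001,total:25.00"] * 3 + [b"order:1002,total:40.00"]
unique = deduplicate(records)
print(len(records), "records ->", len(unique), "unique")

# Lossless compression: the original bytes are fully recoverable.
blob = b"\n".join(records * 100)
packed = zlib.compress(blob)
assert zlib.decompress(packed) == blob
print(len(blob), "bytes ->", len(packed), "bytes compressed")
```

De-duplication discards repeated records outright, while compression keeps every record but encodes the repetition more compactly.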

3. Confusion in selecting a Big Data tool

Companies sometimes get confused while selecting the best tool for Big Data analysis and storage. Many questions arise, such as:

Is HBase or Cassandra the best technology for storage?

Is Hadoop MapReduce good enough, or would Spark be a better choice for data analytics and storage?

These questions trouble companies, and often they are unable to find the answers. They end up making poor decisions and selecting an unsuitable technology. As a result, money, time, and effort are wasted.

Solution:

The best way forward is to seek professional assistance. You can hire experienced Big Data professionals who know much more about the tools. Another option is to go to a Big Data consultancy for advice. Consultants will recommend the best tools based on the company’s scenario. Based on their advice, you can draw up a strategy and then select the best tool for the company.

4. Lack of data professionals

To utilize these new technologies and Big Data tools, companies need skilled data professionals. These include data scientists, data analysts, and data engineers who are experienced in working with data handling tools and making sense of voluminous data sets. Companies currently face a shortage of Big Data professionals. This is because data handling tools have evolved rapidly, but in many cases the professionals' skills have not kept pace.

Solution:

Companies are investing more and more money in hiring skilled professionals. They also need to offer free training programs to existing staff to get the most out of them.

Another significant step companies take is to purchase data analytics software and tools powered by artificial intelligence and/or machine learning. These tools can be used by professionals who are not data science experts but have basic knowledge.

5. Securing the data

Securing such huge volumes of data is one of the challenging tasks of Big Data. Companies are often so busy collecting, understanding, storing, and analyzing data that they push data security to later stages. This is not a good move, as unprotected data repositories may become breeding grounds for hackers. Companies can lose both their data and their revenue.

Solution:

Companies should recruit cyber-security professionals to protect the data. Other steps for securing data include:

• Data encryption

• Data segregation

• Identity and access control

• Implementation of endpoint security

• Real-time security monitoring

• Use Big Data security tools

6. Integrating data from various sources

Data in a company comes from a variety of sources or data marts, such as social media pages, ERP applications, MIS applications, customer logs, financial reports, e-mails, presentations, and data reports created by employees. Combining all these types of data to prepare a single report is a challenging task. This is a field often neglected by firms. However, data integration is important for analysis, reporting, and business intelligence, so it has to be worked out.

Solution:

Companies have to resolve data integration problems by buying the right data handling tools. A few of them are listed below:

• Talend Data Integration

• Centerprise Data Integrator

• ArcESB

• IBM InfoSphere

• Xplenty

• Informatica PowerCenter

• CloverDX

• Microsoft SQL

• QlikView

• Oracle Data Service Integrator

Wednesday, May 27, 2015

Cloud Computing question bank


Objective type questions:
1. Write full form of GFS ……………………………………………………………………………….

2. Write full form of HDFS …………………………………………………………………………………………………………

3. MapReduce is used for big data analysis. (T/F)

4. MapReduce uses parallel computing paradigm. (T/F)

5. BigTable is introduced by Yahoo Inc. (T/F)

6. HBase is introduced by Google Inc. (T/F)

7. DynamoDb is introduced by Amazon. (T/F)
