Showing posts with label Scala. Show all posts

Wednesday, April 20, 2022

Data Science vs. Business Analytics


Key Differences Between Data Science and Business Analytics:

Here are some of the key differences between data science and business analytics.

1. Data science is the study of data using statistics, algorithms, and technology, while business analytics is the statistical study of business data.

2. Data science is a relatively recent development in analytics, while business analytics has existed since the late 19th century.

3. Data science requires substantial programming skill, while business analytics requires relatively little programming.

4. Business analytics can be seen as a subset of data science. Anyone with data science skills can therefore do business analytics, but not vice versa.

5. Data science, being a step ahead of business analytics, is something of a luxury. Business analytics, however, is a necessity: companies need it to understand how the business works and to gain insights.

6. Data science results are rarely used directly in everyday business decision-making, whereas business analytics is essential for critical management decisions.

7. Data science does not answer predefined questions; its questions are usually open-ended. Business analytics, by contrast, mainly answers very specific questions about finance and the business.

8. Data science can answer the questions posed in business analytics, but not the other way around.

9. Data science uses both structured and unstructured data, while business analytics primarily uses structured data.

10. Data science has the potential to make big leaps, especially with the advent of machine learning and artificial intelligence, while business analytics is evolving more slowly.

11. Unlike business analysts, data scientists do not encounter much dirty data.

12. In contrast to business analytics, data science relies heavily on data availability.

13. The cost of investing in data science is high, while the cost of business analytics is comparatively low.

14. Data science can keep pace with today's data, which is growing rapidly and diverging into many types. Data scientists have the skills to handle it; business analysts generally do not.


Data Science and Business Analytics Comparison Table

Below is the comparison table between Data Science and Business Analytics.

Coining of Term

Data Science: The term "data scientist" was coined in 2008 by DJ Patil and Jeff Hammerbacher, then at LinkedIn and Facebook respectively.

Business Analytics: In use since Frederick Winslow Taylor's work in the late 1800s.

Concept

Data Science: An interdisciplinary field covering data inference, algorithm development, and data-driven systems.

Business Analytics: The application of statistical principles to derive insights from business data.

Application - Top 5 Industries

Data Science: Technology; Financial; Mix of fields; Internet-based; Academic.

Business Analytics: Financial; Technology; Mix of fields; CRM/Marketing; Retail.

Coding

Data Science: Coding is needed; the field combines traditional analytics approaches with a solid understanding of computer science.

Business Analytics: Little coding is involved; the work is statistically oriented.

Language Recommendations

Data Science: C/C++/C#, Haskell, Java, Julia, Matlab, Python, R, SAS, Scala, SQL

Business Analytics: C/C++/C#, Java, Matlab, Python, R, SAS, Scala, SQL

Statistics

Data Science: Statistics is applied at the end of the analysis, after algorithms have been designed and coded.

Business Analytics: The entire analysis rests on statistical principles.

Work Challenges

Data Science:

·         Business decision-makers do not use data science results.

·         Results cannot be adapted to the company's decision-making process.

·         Lack of clarity about the questions that must be answered with the given data set.

·         Data is unavailable or difficult to obtain.

·         IT needs to be consulted.

·         Notable lack of domain-expert involvement.

Business Analytics:

·         Data is unavailable or difficult to obtain.

·         Dirty data.

·         Privacy concerns.

·         Insufficient budget to purchase useful data sets from outside sources.

·         Results cannot be adapted to the company's decision-making process.

·         Lack of clarity about the questions that must be answered with the given data set.

·         Tool limitations.

·         IT needs to be consulted.

Data Needed

Data Science: Both structured and unstructured data.

Business Analytics: Predominantly structured data.

Future Trends

Data Science: Machine Learning and Artificial Intelligence

Business Analytics: Cognitive Analytics, Tax Analytics

Friday, September 24, 2021

Big Data Computing: Quiz Assignment-III Solutions (Week-3)

1. In Spark, a _____________ is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

A. Spark Streaming

B. FlatMap

C. Driver

D. Resilient Distributed Dataset (RDD)

Answer: D) Resilient Distributed Dataset (RDD)

Explanation: Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure: distributed, immutable collections of objects. Each RDD is divided into logical partitions that can be computed on different nodes of the cluster, and RDDs can contain any type of Python, Java, or Scala object, including user-defined classes. Formally, an RDD is a read-only, partitioned collection of records that can be created through deterministic operations either on data in stable storage or on other RDDs. RDDs are fault-tolerant collections of elements that can be operated on in parallel.


2. Given the following definition about the join transformation in Apache Spark:

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

Where join operation is used for joining two datasets. When it is called on datasets of type (K, V) and (K, W), it returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key.

Output the result of joinrdd, when the following code is run.

val rdd1 = sc.parallelize(Seq(("m",55),("m",56),("e",57),("e",58),("s",59),("s",54)))

val rdd2 = sc.parallelize(Seq(("m",60),("m",65),("s",61),("s",62),("h",63),("h",64)))

val joinrdd = rdd1.join(rdd2)

joinrdd.collect


A. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (h,(63,64)), (s,(54,61)), (s,(54,62)))

B. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (e,(57,58)), (s,(54,61)), (s,(54,62)))

C. Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

D. None of the mentioned

Answer: C) Array[(String, (Int, Int))] = Array((m,(55,60)), (m,(55,65)), (m,(56,60)),

(m,(56,65)), (s,(59,61)), (s,(59,62)), (s,(54,61)), (s,(54,62)))

Explanation: join() is a transformation that returns an RDD containing all pairs of elements with matching keys in this RDD and the other one. Each pair is returned as a (k, (v1, v2)) tuple, where (k, v1) is in this RDD and (k, v2) is in the other. Keys that appear on only one side ("e" and "h" here) are dropped.
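As a rough illustration of these semantics, here is a sketch in plain Python rather than Spark, with ordinary in-memory lists standing in for the RDDs (the helper name rdd_join is ours, not a Spark API):

```python
from collections import defaultdict
from itertools import product

def rdd_join(left, right):
    """Inner join of two (key, value) lists, mimicking Spark's RDD join():
    for every key present on both sides, emit (k, (v, w)) for each
    combination of a left value v and a right value w."""
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_by_key[k].append(v)
    for k, w in right:
        right_by_key[k].append(w)
    return [(k, (v, w))
            for k in left_by_key if k in right_by_key       # inner join: key on both sides
            for v, w in product(left_by_key[k], right_by_key[k])]

rdd1 = [("m", 55), ("m", 56), ("e", 57), ("e", 58), ("s", 59), ("s", 54)]
rdd2 = [("m", 60), ("m", 65), ("s", 61), ("s", 62), ("h", 63), ("h", 64)]
joined = rdd_join(rdd1, rdd2)
print(joined)  # 8 pairs; "e" and "h" are dropped
```

Up to ordering (which Spark does not guarantee across partitions), the eight pairs match answer C exactly.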

 

3. Consider the following statements in the context of Spark:

Statement 1: Spark improves efficiency through in-memory computing primitives and general computation graphs.

Statement 2: Spark improves usability through high-level APIs in Java, Scala, Python and also provides an interactive shell.

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are true

D. Both statements are false

Answer: C) Both statements are true

Explanation: Apache Spark is a fast, general-purpose cluster computing system. It offers high-level APIs in Java, Scala, and Python, as well as an optimized engine that supports general execution graphs. It also supports a variety of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming, and it ships with several sample programs. Spark provides an interactive shell, available in Scala or Python, which is a powerful tool for interactive data analysis. Spark improves efficiency through in-memory computing primitives: data is kept in random-access memory (RAM) instead of slow disk drives and is processed in parallel, which makes it practical to find patterns in and analyze large amounts of data. As the cost of memory has fallen, in-memory processing has become economical for many applications.


4. True or False?

Resilient Distributed Datasets (RDDs) are fault-tolerant and immutable.

A. True

B. False

Answer: True

Explanation: Resilient Distributed Datasets (RDDs) are:

1. Immutable collections of objects spread across a cluster

2. Built through parallel transformations (map, filter, etc.)

3. Automatically rebuilt on failure

4. Controllable persistence (e.g. caching in RAM)
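The transformation-based style in point 2 can be sketched in plain Python, with an ordinary list standing in for a distributed partition (an analogy only, not Spark code):

```python
# Plain-Python analogue of an RDD pipeline: each step returns a NEW
# collection instead of mutating the old one (immutability, point 1),
# and the steps mirror Spark's map/filter transformations (point 2).
data = [1, 2, 3, 4, 5, 6]

doubled = list(map(lambda x: x * 2, data))             # like rdd.map(_ * 2)
big_values = list(filter(lambda x: x > 6, doubled))    # like .filter(_ > 6)

print(big_values)  # [8, 10, 12]
print(data)        # original collection unchanged: [1, 2, 3, 4, 5, 6]
```

In Spark the same chain would run in parallel across partitions, and the recorded lineage of transformations is what lets a lost partition be rebuilt (point 3).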


5. Which of the following is not a NoSQL database?

A. HBase

B. Cassandra

C. SQL Server

D. None of the mentioned

Answer: C) SQL Server

Explanation: NoSQL, which stands for "not only SQL", is an alternative to traditional relational databases, in which data is stored in tables and the data schema is carefully designed before the database is built. NoSQL databases are particularly useful for working with large amounts of distributed data. SQL Server is a relational database, not a NoSQL one.

 

6. True or False?

Apache Spark can potentially run batch-processing programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk.

A. True

B. False

Answer: True

Explanation: Spark's biggest claim about speed is that it "can run programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk." Spark can make this claim because it performs processing in the main memory of the worker nodes and avoids unnecessary disk I/O. Spark also lets applications chain tasks at the programming level without writing intermediate results to disk, minimizing the number of disk writes.


7. _____________leverages Spark Core fast scheduling capability to perform streaming analytics.

A. MLlib

B. Spark Streaming

C. GraphX

D. RDDs

Answer: B) Spark Streaming

Explanation: Spark Streaming ingests data in mini-batches and performs RDD transformations on those mini-batches of data.
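The mini-batch idea can be sketched in plain Python, with an in-memory list standing in for the incoming stream (an analogy only; Spark Streaming slices by time interval, not by element count):

```python
def mini_batches(stream, batch_size):
    """Split an incoming sequence into fixed-size mini-batches, the way
    Spark Streaming slices a live stream into a series of small RDDs."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = list(range(10))  # stand-in for a live data stream
# Each mini-batch is processed independently, e.g. a per-batch sum,
# analogous to applying an RDD transformation to every batch.
batch_sums = [sum(batch) for batch in mini_batches(stream, 3)]
print(batch_sums)  # [3, 12, 21, 9]
```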


8. _________ is a distributed graph processing framework on top of Spark.

A. MLlib

B. Spark streaming

C. GraphX

D. All of the mentioned

Answer: C) GraphX

Explanation: GraphX is Apache Spark's API for graphs and graph-parallel computation. It is a distributed graph processing framework on top of Spark.


9. Point out the incorrect statement in the context of Cassandra:

A. It is a centralized key-value store

B. It is originally designed at Facebook

C. It is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure

D. It uses a ring-based DHT (Distributed Hash Table) but without finger tables or routing

Answer: A) It is a centralized key-value store

Explanation: Cassandra is a distributed key-value store.


10. Consider the following statements:

Statement 1: Scale out means grow your cluster capacity by replacing with more powerful machines.

Statement 2: Scale up means incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf).

A. Only statement 1 is true

B. Only statement 2 is true

C. Both statements are false

D. Both statements are true

Answer: C) Both statements are false

Explanation: The correct statements are:

Scale up = grow your cluster capacity by replacing existing machines with more powerful ones

Scale out = incrementally grow your cluster capacity by adding more COTS machines (Components Off the Shelf)
