Sunday, October 24, 2021

Big Data Computing: Quiz Assignment-VIII Solutions (Week-8)

1. Which of the following are provided by the Spark API for graph-parallel computations?
i. joinVertices
ii. subgraph
iii. aggregateMessages
A. Only (i)
B. Only (i) and (ii)
C. Only (ii) and (iii)
D. All of the mentioned
Answer: D) All of the mentioned


2. Which of the following statement(s) is/are true in the context of Apache Spark GraphX operators?
S1: Property operators modify the vertex or edge properties using a user-defined map function and produce a new graph.
S2: Structural operators operate on the structure of an input graph and produce a new graph.
S3: Join operators add data to graphs and produce new graphs.
A. Only S1 is true
B. Only S2 is true
C. Only S3 is true
D. All of the mentioned
Answer: D) All of the mentioned


3. True or False ?
The outerJoinVertices() operator joins the input RDD data with the vertices and returns a new graph. The vertex properties are obtained by applying the user-defined map() function to all vertices, including those that are not present in the input RDD.
A. True
B. False
Answer: A) True


4. Which of the following statements are true ?
S1: Apache Spark GraphX provides the following property operators - mapVertices(), mapEdges(), mapTriplets()
S2: The RDDs in Spark, depend on one or more other RDDs. The representation of dependencies in between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever a part of persistent RDD is lost, the data that is lost can be recovered using the lineage graph information.
A. Only S1 is true
B. Only S2 is true
C. Both S1 and S2 are true
D. None of the mentioned
Answer: C) Both S1 and S2 are true


5. GraphX provides an API for expressing graph computation that can model the _________________ abstraction.
A. GaAdt
B. Pregel
C. Spark Core
D. None of the mentioned
Answer: B) Pregel


6. Match the following:
A. Dataflow Systems i. Vertex Programs
B. Graph Systems ii. Parameter Servers
C. Shared Memory Systems iii. Guinea Pig
A. A:ii, B: i, C: iii
B. A:iii, B: i, C: ii
C. A:ii, B: iii, C: i
D. A:iii, B: ii, C: i
Answer: B) A:iii, B: i, C: ii


7. Which of the following statement(s) is/are true in the context of Parameter Servers?
S1: A machine learning framework
S2: Distributes a model over multiple machines
S3: It offers two operations: (i) Pull to query parts of the model, (ii) Push to update parts of the model.
A. Only S1 is true
B. Only S2 is true
C. Only S3 is true
D. All of the mentioned
Answer: D) All of the mentioned
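
The pull and push operations from S3 can be illustrated with a toy, single-process sketch. The class name `ShardedParameterServer`, the integer keys and the modulo sharding rule are all assumptions made for illustration; a real parameter server runs its shards on separate machines and is not shown here.

```python
class ShardedParameterServer:
    """Toy model store split over several in-memory shards (hypothetical)."""

    def __init__(self, num_shards):
        # Each dict stands in for the slice of the model held by one machine.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # Assumed placement rule: integer keys are spread by modulo.
        return self.shards[key % len(self.shards)]

    def pull(self, keys):
        """Query part of the model; unknown keys default to 0.0."""
        return {k: self._shard_for(k).get(k, 0.0) for k in keys}

    def push(self, updates):
        """Update part of the model by adding the given deltas."""
        for k, delta in updates.items():
            shard = self._shard_for(k)
            shard[k] = shard.get(k, 0.0) + delta


server = ShardedParameterServer(num_shards=2)
server.push({0: 0.5, 1: -1.0, 2: 2.0})  # a worker sends updates
server.push({1: 0.25})                  # another worker updates key 1
pulled = server.pull([1, 2, 7])         # a worker reads only the keys it needs
print(pulled)
```

Workers only ever touch the keys they pull or push, which is what lets the model be larger than the memory of any single machine.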


8. [Figure: directed graph with vertices A, B, C and D]

What is the PageRank score of vertex B after the second iteration? (Without damping factor)
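Hint: The basic PageRank formula is:

$$PR_{t+1}(u) = \sum_{v \to u} \frac{PR_t(v)}{C(v)}$$

where the sum runs over every node v pointing to node u, PR_{t+1}(u) is the PageRank of the node u under consideration, PR_t(v) is the previous PageRank of node v, and C(v) is the out-degree of vertex v.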
A. 1/6
B. 1.5/12
C. 2.5/12
D. 1/3 

Answer: A) 1/6

Explanation: The PageRank score of each vertex is calculated as follows:

 

Vertex | Iteration 0 | Iteration 1 | Iteration 2 | Page Rank
A      | 1/4         | 1/12        | 1.5/12      | 1
B      | 1/4         | 2.5/12      | 2/12        | 2
C      | 1/4         | 4.5/12      | 4.5/12      | 4
D      | 1/4         | 4/12        | 4/12        | 3
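
The iteration can be checked with a short sketch. The graph figure is not available here, so the edge list below is an assumption: it was chosen because it reproduces every entry in the table above, but the original figure may differ. Exact fractions are used so that 2/12 prints as 1/6.

```python
from fractions import Fraction

# ASSUMED edge list (the original figure is missing). It reproduces the
# Iteration 1 and Iteration 2 columns of the table above.
edges = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}


def pagerank_step(ranks, edges):
    """One undamped iteration: each vertex v gives PR(v) / C(v) to every out-neighbour."""
    new_ranks = {u: Fraction(0) for u in ranks}
    for v, targets in edges.items():
        share = ranks[v] / len(targets)  # C(v) is the out-degree of v
        for u in targets:
            new_ranks[u] += share
    return new_ranks


ranks = {v: Fraction(1, 4) for v in edges}  # iteration 0: uniform scores
history = [ranks]
for _ in range(2):
    ranks = pagerank_step(ranks, edges)
    history.append(ranks)

for t, r in enumerate(history):
    print(f"iteration {t}:", {v: str(score) for v, score in r.items()})
```

Under this assumed graph, vertex B receives 1/24 from A and 1/8 from C in the second iteration, which sums to 1/6 and matches answer A.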

 

Friday, October 15, 2021

Big Data Computing: Quiz Assignment-VII Solutions (Week-7)

1. Suppose you are using a bagging-based algorithm, say a Random Forest, in model building. Which of the following can be true?
1 The number of trees should be as large as possible
2 You will have interpretability after using Random Forest
A. Only 1
B. Only 2
C. Both 1 and 2
D. None of these
Answer: A) Only 1
Explanation: Since Random Forest aggregates the results of many weak learners, we would like as many trees as possible when building the model. Random Forest is a black-box model, so you lose interpretability after using it.
 
2. When applying bagging to regression trees, which of the following is/are true?
1. We build N regression trees with N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has a high variance with low bias
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1,2 and 3
Answer: D) 1,2 and 3
Explanation: All of the options are correct and self-explanatory.
 
3. In which of the following scenarios is gain ratio preferred over Information Gain?
A. When a categorical variable has a very small number of categories
B. The number of categories is not the reason
C. When a categorical variable has a very large number of categories
D. None of the mentioned
Answer: C) When a categorical variable has a very large number of categories
Explanation: For attributes with high cardinality, gain ratio is preferred over information gain.
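
A small sketch shows why high cardinality matters. The 16-row dataset, the `record_id` and `weather` attributes and the helper names are all made up for illustration. Gain ratio is computed here as information gain divided by the entropy of the attribute's own value distribution (the split information).

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on the attribute `values`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    """Information gain normalised by the split information of the attribute."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0


# Hypothetical data: 8 "yes" rows followed by 8 "no" rows.
labels = ["yes"] * 8 + ["no"] * 8
record_id = list(range(16))  # 16 distinct categories, one per row
weather = ["sun"] * 7 + ["rain"] + ["rain"] * 7 + ["sun"]  # 2 categories

for name, attr in [("record_id", record_id), ("weather", weather)]:
    print(name, round(information_gain(attr, labels), 3), round(gain_ratio(attr, labels), 3))
```

Information gain ranks the useless `record_id` attribute first because every one-row group is pure, while gain ratio penalises its 16-way split and ranks `weather` higher.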
 
4. Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification task
2. Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
3. Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
4. Both methods can be used for regression task
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1 and 4
Answer: D) 1 and 4
Explanation: Both algorithms are designed for classification as well as regression tasks.
 
5. True or False ?
Bagging provides an averaging over a set of possible datasets, removing noisy and non-stable parts of models.
A. True
B. False
Answer: A) True
 
6. Hundreds of trees can be aggregated to form a Random forest model. Which of the following is true about any individual tree in Random Forest?
1. Individual tree is built on a subset of the features
2. Individual tree is built on all the features
3. Individual tree is built on a subset of observations
4. Individual tree is built on full set of observations
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
Answer: A) 1 and 3
Explanation: Random forest is based on the bagging concept, which uses a fraction of the samples and a fraction of the features to build each individual tree.
 
7. Boosting any algorithm takes into consideration the weak learners. Which of the following is the main reason behind using weak learners?
Reason I: To prevent overfitting
Reason II: To prevent underfitting
A. Reason I
B. Reason II
C. Both Reason I and Reason II
D. None of the Reasons
Answer: A) Reason I
Explanation: To prevent overfitting, because the overall complexity of the learner increases with each step. Starting with weak learners implies that the final learner will be less likely to overfit.

Big Data Computing: Quiz Assignment-VI Solutions (Week-6)

1. Which of the following is required by K-means clustering ?
A. Defined distance metric
B. Number of clusters
C. Initial guess as to cluster centroids
D. All of the mentioned
Answer: D) All of the mentioned
Explanation: K-means clustering follows partitioning approach.
 
 
2. Identify the correct statement in context of Regressive model of Machine Learning.
A. Regressive model predicts a numeric value instead of category.
B. Regressive model organizes similar items in your dataset into groups.
C. Regressive model comes up with a set of rules to capture associations between items or events.
D. None of the Mentioned
Answer: A) Regressive model predicts a numeric value instead of category.
 
 
3. Which of the following tasks can be best solved using Clustering ?
A. Predicting the amount of rainfall based on various cues
B. Training a robot to solve a maze
C. Detecting fraudulent credit card transactions
D. All of the mentioned
Answer: C) Detecting fraudulent credit card transactions
Explanation: Credit card transactions can be clustered into fraud transactions using unsupervised learning.
 
 
4. Identify the correct method for choosing the value of ‘k’ in k-means algorithm ?
A. Dimensionality reduction
B. Elbow method
C. Both Dimensionality reduction and Elbow method
D. Data partitioning
Answer: C) Both Dimensionality reduction and Elbow method
 
 
5. Identify the correct statement(s) in context of overfitting in decision trees:
Statement I: The idea of Pre-pruning is to stop tree induction before a fully grown tree is built, that perfectly fits the training data.
Statement II: The idea of Post-pruning is to grow a tree to its maximum size and then remove the nodes using a top-bottom approach.
A. Only statement I is true
B. Only statement II is true
C. Both statements are true
D. Both statements are false
Answer: A) Only statement I is true
Explanation: With pre-pruning, the idea is to stop tree induction before a fully grown tree is built that perfectly fits the training data.
In post-pruning, the tree is grown to its maximum size and then pruned by removing nodes using a bottom-up approach, so Statement II is false.
 
 
6. Which of the following options is/are true for K-fold cross-validation ?
1. Increase in K will result in higher time required to cross validate the result.
2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K.
3. If K=N, then it is called Leave one out cross validation, where N is the number of observations.
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2 and 3
Answer: D) 1,2 and 3
Explanation: A larger k value means less bias towards the true expected error estimate (because the training fold will be closer to the total dataset) and higher runtime (when you approach the edge case: Leave-One-Out CV). We must also consider the variance between the accuracy of k folds when selecting k.
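
The three statements can be made concrete with a minimal index splitter. This is a sketch only: the function name `k_fold_indices` is made up, the folds are contiguous and unshuffled, and no model is actually trained.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for k contiguous folds over n samples."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, validation
        start += size


n = 6
three_fold = list(k_fold_indices(n, 3))
leave_one_out = list(k_fold_indices(n, n))  # K = N

print(len(three_fold), "model fits for K=3")
print(len(leave_one_out), "model fits for K=N")
```

One model is fitted per fold, so the running time grows with K, and with K = N every validation fold holds a single observation, which is leave-one-out cross-validation.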
 
 
7. Imagine you are working on a project that is a binary classification problem. You trained a model on the training dataset and got the confusion matrix below on the validation dataset.
 
Based on the above confusion matrix, choose which option(s) below will give you correct predictions ?
1. Accuracy is ~0.91
2. Misclassification rate is ~0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A. 1 and 3
B. 2 and 4
C. 2 and 3
D. 1 and 4 
Answer: D) 1 and 4 
Explanation:
The accuracy (correct classification rate) is (50+100)/165, which is approximately 0.91.
The true positive rate is the fraction of actual positives that are predicted correctly, so it is 100/105 ≈ 0.95. It is also known as “sensitivity” or “recall”.
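
The four options can be checked numerically. The confusion matrix image is missing, so the counts below are inferred from the explanation: 100 true positives, 50 true negatives, 105 actual positives and 165 samples in total. The false negative and false positive counts are derived from those totals and are therefore assumptions.

```python
# Counts taken from the explanation above.
tp, tn = 100, 50
total, actual_positives = 165, 105

# Derived counts (assumed, since the matrix itself is not shown).
fn = actual_positives - tp   # positives the model missed
fp = total - tp - tn - fn    # negatives predicted as positive

accuracy = (tp + tn) / total
misclassification_rate = 1 - accuracy
true_positive_rate = tp / (tp + fn)   # sensitivity / recall
false_positive_rate = fp / (fp + tn)

print(f"accuracy               {accuracy:.2f}")
print(f"misclassification rate {misclassification_rate:.2f}")
print(f"true positive rate     {true_positive_rate:.2f}")
print(f"false positive rate    {false_positive_rate:.2f}")
```

Only options 1 and 4 match these values; the misclassification rate and the false positive rate both come out far below the figures quoted in options 2 and 3.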
 
 
8. Identify the correct statement(s) in context of machine learning approaches:
Statement I: In supervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.
Statement II: In unsupervised approaches the target, which is what the model is predicting, is provided. This is referred to as having labeled data because the target is labeled for every sample that you have in your data set.
A. Only Statement I is true
B. Only Statement II is true
C. Both Statements are false
D. Both Statements are true
Answer: C) Both Statements are false
Explanation: The correct statements are:
Statement I: In supervised approaches, the target that the model is predicting is provided. This is referred to as having labeled data because the target is labeled for every sample in your data set.
Statement II: In unsupervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.

Big Data Computing: Quiz Assignment-V Solutions (Week-5)

1. Columns in HBase are organized into _________________
A. Column group
B. Column list
C. Column base
D. Column families
Answer: D) Column families
Explanation: The HBase table consists of a column family which is a logical and physical grouping of columns. Columns of one family are stored separately from columns of other families.

 

 

2. HBase is a _________________ distributed database built on top of the Hadoop file system.
A. Row-oriented
B. Tuple-oriented
C. Column-oriented
D. None of the mentioned
Answer: C) Column-oriented
Explanation: HBase is a column-oriented distributed data store capable of scaling horizontally up to 1,000 standard servers and petabytes of indexed storage.

 

 

3. A small chunk of data residing in one machine which is part of a cluster of machines holding one HBase table is known as
A. Region
B. Split
C. Rowarea
D. Tablearea
Answer: A) Region
Explanation: In HBase, a table is split into regions, which are served by region servers.


4. In HBase, a _________________ is a combination of row, column family and column qualifier, and contains a value and a timestamp.
A. Cell
B. Stores
C. HMaster
D. Region Server
Answer: A) Cell
Explanation: Data in an HBase table is stored in cells. A cell is a combination of row, column family and column qualifier, and contains a value and a timestamp.
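
As a rough illustration of that structure, a cell can be modelled as a small record. This is a plain-Python sketch, not the HBase client API: the `Cell` type, the `latest_value` helper and the sample rows are all hypothetical.

```python
from typing import NamedTuple, Optional


class Cell(NamedTuple):
    """Hypothetical model of a cell: coordinates plus a timestamped value."""
    row: str
    family: str
    qualifier: str
    timestamp: int
    value: str


def latest_value(cells, row, family, qualifier) -> Optional[str]:
    """Return the value of the newest version stored at the given coordinates."""
    versions = [c for c in cells if (c.row, c.family, c.qualifier) == (row, family, qualifier)]
    if not versions:
        return None
    return max(versions, key=lambda c: c.timestamp).value


cells = [
    Cell("user1", "info", "email", 100, "old@example.com"),
    Cell("user1", "info", "email", 200, "new@example.com"),
    Cell("user1", "info", "name", 100, "Ada"),
]

print(latest_value(cells, "user1", "info", "email"))
```

The same row, column family and qualifier can hold several versions, and the timestamp is what tells them apart.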

 

 

5. HBase architecture has 3 main components:
A. Client, Column family, Region Server
B. Cell, Rowkey, Stores
C. HMaster, Region Server, Zookeeper
D. HMaster, Stores, Region Server
Answer: C) HMaster, Region Server, Zookeeper
Explanation: HBase architecture has 3 main components: HMaster, Region Server, Zookeeper.

1. HMaster: HMaster is the implementation of the Master Server in HBase. It is the process that assigns regions to region servers and handles DDL operations (creating and dropping tables). It also monitors all Region Server instances in the cluster.

2. Region Servers: HBase tables are divided horizontally by row key range into regions. Regions are the basic building blocks of an HBase cluster; they hold the distributed portions of tables and are made up of column families. A Region Server runs on an HDFS DataNode in the Hadoop cluster.

3. Zookeeper: It acts as the coordinator in HBase. It provides services such as maintaining configuration information, naming, distributed synchronization and server failure notification. Clients communicate with region servers via Zookeeper.


6. HBase stores data in
A. As many filesystems as the number of region servers
B. One filesystem per column family
C. A single filesystem available to all region servers
D. One filesystem per table.
Answer: C) A single filesystem available to all region servers


7. Kafka is run as a cluster comprised of one or more servers each of which is called
A. cTakes
B. Chunks
C. Broker
D. None of the mentioned
Answer: C) Broker
Explanation: Kafka brokers allow consumers to fetch messages by topic, partition and offset. Kafka brokers form a Kafka cluster by sharing information with each other directly or indirectly through Zookeeper. A Kafka cluster has exactly one broker that acts as the controller.


8. True or False ?
Statement 1: Batch Processing provides ability to process and analyze data at-rest (stored data)
Statement 2: Stream Processing provides the ability to ingest, process and analyze data in-motion in real or near-real-time.
A. Only statement 1 is true
B. Only statement 2 is true
C. Both statements are true
D. Both statements are false
Answer: C) Both statements are true


9. _________________is a central hub to transport and store event streams in real time.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: A) Kafka Core
Explanation: Kafka Core is a central hub to transport and store event streams in real time.


10. What are the parameters defined to specify window operation ?
A. State size, window length
B. State size, sliding interval
C. Window length, sliding interval
D. None of the mentioned
Answer: C) Window length, sliding interval
Explanation:
The following parameters are used to specify a window operation:
(i) Window length: the duration of the window
(ii) Sliding interval: the interval at which the window operation is performed
Both parameters must be a multiple of the batch interval.
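
The relationship between the two parameters and the batch interval can be sketched with a toy model of micro-batches. This is not a streaming engine's API: the `windowed` function and the sample batches are hypothetical, and the first window is allowed to be partially filled.

```python
def windowed(batches, batch_interval, window_length, sliding_interval):
    """Group per-interval batches into sliding windows (toy model)."""
    if window_length % batch_interval or sliding_interval % batch_interval:
        raise ValueError(
            "window length and sliding interval must be multiples of the batch interval"
        )
    batches_per_window = window_length // batch_interval
    step = sliding_interval // batch_interval
    windows = []
    # A window is emitted every `step` batches and covers the most recent
    # `batches_per_window` batches.
    for end in range(step, len(batches) + 1, step):
        start = max(0, end - batches_per_window)
        windows.append([x for batch in batches[start:end] for x in batch])
    return windows


# One inner list per 1-second batch (made-up data).
batches = [[1], [2, 3], [4], [5, 6], [7], [8]]
out = windowed(batches, batch_interval=1, window_length=3, sliding_interval=2)
print(out)
```

Because a window is assembled from whole batches, a window length or sliding interval that is not a multiple of the batch interval has no well-defined set of batches to cover, which is why the sketch rejects it.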

 

 

11. _________________is a Java library to process event streams live as they occur.
A. Kafka Core
B. Kafka Connect
C. Kafka Streams
D. None of the mentioned
Answer: C) Kafka Streams  

Explanation: Kafka Streams is a Java library to process event streams live as they occur.

Monday, September 27, 2021

Promote your blog or website

1. Submit to Search Engines - Submit your blog / website to the major and regional search engines so that it is easy to find and visit. Use meta tags to help search engines crawl the website. Use the correct keywords for your website and a title that describes it. Use search engine submitters; many websites offer free submission to search engines.

For example: http://www.coltdurl.com/

2. Submit RSS Feed - When submitting your website URL, you must submit your website RSS feed (if available) in the same way. This improves the revenue generated by AdSense for feeds, a feature of Google AdSense. There are many websites and software that offer RSS feeds for free. This feature sends RSS feeds to online news channels and readers like Google Reader, etc.

For example: http://www.pingomatic.com/

3. Submit URL to Directory - Many search engines rely on online directories to find the relevant content / website being searched for. So please submit your blog / website URL to a specific directory under a specific category. For example, Google uses the directory www.dmog.org to crawl the relevant website, so submit your blog to a specific category. If you have an educational website / blog, please submit it in the Education / Academic category etc.

For example: http://www.freedirectorysubmission.com/

4. Email signature - Set your email signature with the name of your website and the URL with your name. When you send an email to someone, that person will see your email signature with a clickable URL, and if they are interested, they will click on this URL. However, do not send mass emails promoting your website, as this violates the policies of all email providers. Email is limited to 1000. Best used in email signature.

For example: http://www.gmail.com/

5. Join the forum - Join any website advertising forum to promote your website, and you can also discuss your website and receive opinions from other members. There are many experienced bloggers and website marketers who will surely help you promote your website / blog. They will also help you design the blog / website to match the advertisement.

For example: http://www.linkreferral.com/

6. Join social networking sites - Join as many social networking sites as possible and provide the link to your website with the correct title and description. Please use the promotional features provided by these social media sites. Attach your feed to www.twitter.com to promote the website using the publishing service www.feedburner.com.

For example: http://www.facebook.com/

7. Use reciprocal links - This is also known as backlinks or exchange links. There are many websites that offer backlink and exchange link functionality. But you can also exchange the link with your friend and put a clickable link on your website / blog with the agreement with your friend that they will also put your website link on their website.

For example: http://www.stufenraffic.net/

8. Online Advertising - There are many websites that offer free online advertising. In it you have to provide all the information about your website and your contact details. If you have an approved Google AdSense account, you will receive a letter from Google AdWords providing you with Rs. 1500/- of free advertising. You can advertise on the huge Google network with the redemption code provided by Google AdWords.

For example: http://www.quikr.com/

9. Set as homepage - If you depend on an Internet cafe for your work, set the browser's homepage to the URL of your website. When someone else opens their web browser, they will see your website and will keep your website name / URL if interested. Feel free to do this. I discovered this idea and it works for me because a lot of people like my blog.

10. Tell a friend -Talk to your friends about your website and blog and try to convince them to use them. It doesn't matter what they think of your website. If possible, tell someone else about your website and the features you have provided on your website. Always speak positively about your website.

11. Email Subscription - Many online readers and websites offer email subscription functionality. Through this feature, every subscribed visitor receives a notification at their email address when you update / publish content on your blog / website.

For example: http://www.feedburner.com/

12. SMS Alert - Google has provided an SMS notification feature, with which each member who subscribes to your website / blog with their mobile phone number receives an SMS when the blog / website is updated.

For example: http://www.googlelabs.com/

 

SEO techniques:

SEO: Search engine optimization is a set of techniques for improving the ranking of websites, such as using suitable keywords. Here are a few:

1. MetaTag: Create or generate a meta tag with the meta tag generator or you can create it yourself using the tag in HTML and inserting it into the HEAD section of the HTML document. It should contain keywords about your website, the description and title of your website, and information about you.

2. Keywords: Keywords are the most important feature of any website that the search engine wants to crawl. Include the most searchable keyword in your meta tag. You can also use the Keyword Builder, but these are not good enough. So, use your SEO knowledge and do it manually.

3. Website Title: Make the website title easy for your website visitors to remember and suitable for search engine searches. Your website title should be perfect and it is best to use common English words and phrases as they are easy to remember.

4. Description: Please describe your website as accurately as possible and be very correct. Before writing a description, analyze your website. Please use the correct keyword in your description so that it is traceable.

5. HTML Tags: Use fewer HTML tags if possible, as HTML tags interfere with the search engine when crawling a search term. Use h1, h2, h3 tags for titles and subtitles and B tag for blogs, etc. 

6. Minimize: minimize the use of CSS, Flash, etc. It slows down the loading of the website in the client-side web browser. You can use it if you need to, but you can't use it in the content area.
