Friday, October 15, 2021

Big Data Computing: Quiz Assignment-VII Solutions (Week-7)

1. Suppose you are using a bagging-based algorithm, say a Random Forest, for model building. Which of the following can be true?
1. The number of trees should be as large as possible
2. You will have interpretability after using Random Forest
A. Only 1
B. Only 2
C. Both 1 and 2
D. None of these
Answer: A) Only 1
Explanation: Since a Random Forest aggregates the results of many weak learners, we want as many trees as possible when building the model. Random Forest is a black-box model, so you lose interpretability by using it.
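A minimal sketch of the first point, assuming scikit-learn and a synthetic dataset (make_classification is only a stand-in for real data): cross-validated accuracy typically stabilises as the number of trees grows, while any single prediction remains hard to interpret.

# Effect of the number of trees on a Random Forest (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for n_trees in (5, 50, 200):
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(n_trees, "trees -> mean 5-fold CV accuracy:", round(score, 3))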
 
2. To apply bagging to regression trees, which of the following is/are true?
1. We build N regression trees using N bootstrap samples
2. We take the average of the N regression trees
3. Each tree has a high variance with low bias
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1,2 and 3
Answer: D) 1,2 and 3
Explanation: All three statements are correct: each tree is fit on its own bootstrap sample, the trees' predictions are averaged, and each individual tree has high variance and low bias.
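A minimal sketch of bagged regression trees, assuming scikit-learn (the estimator keyword is named base_estimator in older scikit-learn releases): N trees are fit on N bootstrap samples and their predictions are averaged, which reduces the variance of the individual high-variance, low-bias trees.

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

bagger = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # high-variance, low-bias base learner
    n_estimators=100,                   # N bootstrap samples -> N trees
    bootstrap=True,
    random_state=0,
)
bagger.fit(X, y)
print(bagger.predict(X[:3]))            # each value is the average over 100 trees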
 
3. In which of the following scenarios is gain ratio preferred over information gain?
A. When a categorical variable has a very small number of categories
B. The number of categories is not the reason
C. When a categorical variable has a very large number of categories
D. None of the mentioned
Answer: C) When a categorical variable has a very large number of categories
Explanation: For high-cardinality problems, gain ratio is preferred over information gain because it divides the gain by the split information, which penalises splits into many categories.
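A minimal sketch in plain NumPy (entropy and gain_ratio are hypothetical helper names, not from any library) showing why gain ratio penalises high-cardinality attributes: the information gain is divided by the split information, which grows with the number of categories.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, feature_values):
    parent = entropy(labels)
    values, counts = np.unique(feature_values, return_counts=True)
    weights = counts / counts.sum()
    children = sum(w * entropy(labels[feature_values == v])
                   for v, w in zip(values, weights))
    info_gain = parent - children
    split_info = -np.sum(weights * np.log2(weights))  # large when there are many categories
    return info_gain / split_info if split_info > 0 else 0.0

labels = np.array(["yes", "yes", "no", "no", "yes", "no"])
ids = np.array(["a", "b", "c", "d", "e", "f"])  # one category per row (worst case)
print(gain_ratio(labels, ids))                  # information gain is maximal, yet the ratio stays modest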
 
4. Which of the following is/are true about Random Forest and Gradient Boosting ensemble methods?
1. Both methods can be used for classification tasks
2. Random Forest is used for classification whereas Gradient Boosting is used for regression tasks
3. Random Forest is used for regression whereas Gradient Boosting is used for classification tasks
4. Both methods can be used for regression tasks
A. 1 and 2
B. 2 and 3
C. 2 and 4
D. 1 and 4
Answer: D) 1 and 4
Explanation: Both algorithms are designed for classification as well as regression tasks.
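A minimal sketch, assuming scikit-learn, simply showing that both ensemble families ship in classifier and regressor variants.

from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
)

classification_models = [RandomForestClassifier(), GradientBoostingClassifier()]
regression_models = [RandomForestRegressor(), GradientBoostingRegressor()]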
 
5. True or False ?
Bagging provides an averaging over a set of possible datasets, removing noisy and non-stable parts of models.
A. True
B. False
Answer: A) True
 
6. Hundreds of trees can be aggregated to form a Random forest model. Which of the following is true about any individual tree in Random Forest?
1. An individual tree is built on a subset of the features
2. An individual tree is built on all of the features
3. An individual tree is built on a subset of the observations
4. An individual tree is built on the full set of observations
A. 1 and 3
B. 1 and 4
C. 2 and 3
D. 2 and 4
Answer: A) 1 and 3
Explanation: Random Forest is based on the bagging concept: each individual tree is built on a random subset (bootstrap sample) of the observations and considers a random subset of the features when constructing its splits.
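A minimal sketch, assuming scikit-learn, of the two sources of randomness behind statements 1 and 3: each tree sees a bootstrap sample of the observations and a random subset of the features at every split.

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    max_samples=0.8,      # optionally cap the sample at 80% of the observations
    max_features="sqrt",  # each split considers only a random subset of the features
    random_state=0,
)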
 
7. Any boosting algorithm builds on weak learners. Which of the following is the main reason behind using weak learners?
Reason I: To prevent overfitting. Reason II: To prevent underfitting.
A. Reason I
B. Reason II
C. Both Reason I and Reason II
D. None of the Reasons
Answer: A) Reason I
Explanation: Weak learners help prevent overfitting, because the overall complexity of the ensemble increases with each boosting step. Starting from weak learners means the final model tends to remain relatively simple.
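A minimal sketch, assuming scikit-learn, of boosting with weak learners: depth-1 trees ("stumps") keep every stage simple, so the overall complexity grows slowly across boosting steps.

from sklearn.ensemble import GradientBoostingClassifier

booster = GradientBoostingClassifier(
    n_estimators=300,    # many boosting steps...
    learning_rate=0.05,  # ...each contributing a small correction
    max_depth=1,         # each stage is a weak learner (a decision stump)
    random_state=0,
)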

Big Data Computing: Quiz Assignment-VI Solutions (Week-6)

1. Which of the following is required by K-means clustering ?
A. Defined distance metric
B. Number of clusters
C. Initial guess as to cluster centroids
D. All of the mentioned
Answer: D) All of the mentioned
Explanation: K-means clustering follows the partitioning approach and requires a distance metric, the number of clusters, and an initial guess of the cluster centroids.
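A minimal sketch, assuming scikit-learn and random toy data, of the three inputs listed above: the number of clusters, an initial guess of the centroids, and the (Euclidean) distance metric that k-means uses implicitly.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((100, 2))

initial_centroids = X[:3]                  # explicit initial guess for the centroids
km = KMeans(n_clusters=3, init=initial_centroids, n_init=1, random_state=0)
labels = km.fit_predict(X)                 # assignments use Euclidean distance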
 
 
2. Identify the correct statement in the context of a regression model in machine learning.
A. A regression model predicts a numeric value instead of a category.
B. A regression model organizes similar items in your dataset into groups.
C. A regression model comes up with a set of rules to capture associations between items or events.
D. None of the mentioned
Answer: A) A regression model predicts a numeric value instead of a category.
 
 
3. Which of the following tasks can be best solved using Clustering ?
A. Predicting the amount of rainfall based on various cues
B. Training a robot to solve a maze
C. Detecting fraudulent credit card transactions
D. All of the mentioned
Answer: C) Detecting fraudulent credit card transactions
Explanation: Fraudulent credit card transactions can be detected with unsupervised learning by clustering transactions and flagging the ones that fall outside the normal clusters; the other tasks are better suited to supervised or reinforcement learning.
 
 
4. Identify the correct method for choosing the value of ‘k’ in k-means algorithm ?
A. Dimensionality reduction
B. Elbow method
C. Both Dimensionality reduction and Elbow method
D. Data partitioning
Answer: C) Both Dimensionality reduction and Elbow method
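A minimal sketch, assuming scikit-learn and random toy data, of the elbow method: run k-means for a range of k, record the within-cluster sum of squares (inertia), and pick the k where the curve bends.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((300, 2))

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(round(km.inertia_, 2))

# The "elbow" is the k after which the inertia stops dropping sharply.
print(list(zip(range(1, 10), inertias)))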
 
 
5. Identify the correct statement(s) in the context of overfitting in decision trees:
Statement I: The idea of Pre-pruning is to stop tree induction before a fully grown tree is built, that perfectly fits the training data.
Statement II: The idea of Post-pruning is to grow a tree to its maximum size and then remove the nodes using a top-bottom approach.
A. Only statement I is true
B. Only statement II is true
C. Both statements are true
D. Both statements are false
Answer: A) Only statement I is true
Explanation: With pre-pruning, the idea is to stop tree induction before a fully grown tree that perfectly fits the training data is built.
In post-pruning, the tree is grown to its maximum size and then pruned by removing nodes using a bottom-up approach, not a top-bottom one, which is why Statement II is false.
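A minimal sketch, assuming scikit-learn, contrasting the two strategies: pre-pruning limits growth up front (e.g. a maximum depth), while post-pruning grows the full tree and then removes nodes via cost-complexity pruning.

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pre-pruning: stop induction early with growth constraints.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)

# Post-pruning: grow the full tree, then prune back with a cost-complexity alpha.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # one candidate alpha, for illustration
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())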
 
 
6. Which of the following options is/are true for K-fold cross-validation ?
1. Increase in K will result in more time required to cross-validate the result.
2. Higher values of K will result in higher confidence on the cross-validation result as compared to lower value of K.
3. If K=N, then it is called Leave one out cross validation, where N is the number of observations.
A. 1 and 2
B. 2 and 3
C. 1 and 3
D. 1, 2 and 3
Answer: D) 1,2 and 3
Explanation: A larger K means less bias towards the true expected error estimate (because each training fold is closer to the full dataset) and a higher runtime (as you approach the edge case of Leave-One-Out CV). The variance between the accuracies of the K folds should also be considered when selecting K.
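A minimal sketch, assuming scikit-learn and its built-in iris dataset, of K-fold cross-validation; setting K equal to the number of observations gives Leave-One-Out CV, and the runtime grows with K because one model is trained per fold.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

five_fold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
loo = cross_val_score(model, X, y, cv=LeaveOneOut())   # K = N: slowest, least biased

print(round(five_fold.mean(), 3), round(loo.mean(), 3))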
 
 
7. Imagine you are working on a project which is a binary classification problem. You trained a model on the training dataset and obtained the confusion matrix below on the validation dataset.
[Confusion matrix: true positives = 100, false negatives = 5, false positives = 10, true negatives = 50; total = 165 observations]
Based on the above confusion matrix, choose which option(s) below is/are correct ?
1. Accuracy is ~0.91
2. Misclassification rate is ~ 0.91
3. False positive rate is ~0.95
4. True positive rate is ~0.95
A. 1 and 3
B. 2 and 4
C. 2 and 3
D. 1 and 4 
Answer: D) 1 and 4 
Explanation:
The accuracy (fraction of correct classifications) is (50 + 100)/165, which is nearly equal to 0.91.
The true positive rate measures how often the positive class is predicted correctly, so it is 100/105 ≈ 0.95; it is also known as “Sensitivity” or “Recall”.
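A minimal sketch in plain Python, using the counts from the matrix above (TP = 100, FN = 5, FP = 10, TN = 50), working through all four options and showing why only 1 and 4 hold.

TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                      # 165

accuracy = (TP + TN) / total                   # (100 + 50) / 165 ~ 0.91  -> option 1 is correct
true_positive_rate = TP / (TP + FN)            # 100 / 105 ~ 0.95         -> option 4 is correct
misclassification_rate = (FP + FN) / total     # 15 / 165 ~ 0.09, not 0.91
false_positive_rate = FP / (FP + TN)           # 10 / 60  ~ 0.17, not 0.95

print(round(accuracy, 2), round(true_positive_rate, 2))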
 
 
8. Identify the correct statement(s) in the context of machine learning approaches:
Statement I: In supervised approaches, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.
Statement II: In unsupervised approaches the target, which is what the model is predicting, is provided. This is referred to as having labeled data because the target is labeled for every sample that you have in your data set.
A. Only Statement I is true
B. Only Statement II is true
C. Both Statements are false
D. Both Statements are true
Answer: C) Both Statements are false
Explanation: The correct statements are:
Statement I: In the supervised approach, the target that the model is predicting is provided. This is referred to as having labeled data, because the target is labeled for every sample in your data set.
Statement II: In the unsupervised approach, the target that the model is predicting is unknown or unavailable. This means that you have unlabeled data.
