Friday, April 22, 2022

Pearson Chi-square test

Chi-square (χ2) test for independence (Pearson Chi-square test)

· The Chi-square test is a non-parametric (distribution-free) method for testing whether two categorical (nominal) variables summarized in a contingency table are associated.

· For example, if we have two treatment groups (treated and non-treated) and two treatment outcomes (cured and non-cured), we can apply the chi-square test for independence to see whether treatment is associated with outcome.

· Because the Chi-square test is based on an approximation (it returns an approximate p value), it requires a reasonably large sample size. The expected frequency count should not be less than 5 in more than 20% of the cells. If the sample size is small, the chi-square approximation is unreliable and Fisher's exact test should be used instead.

· The chi-square test for independence is not the same as the chi-square goodness-of-fit test, which compares the observed counts of a single categorical variable against a hypothesized distribution.

Formula

χ2 = Σ [(Oi − Ei)² / Ei], where Oi is the observed frequency and Ei is the expected frequency for each cell of the contingency table, with Ei = (row total × column total) / grand total.

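To make the formula above concrete, here is a minimal sketch in Python (assuming NumPy is available) that computes the statistic by hand for a hypothetical treated/non-treated versus cured/non-cured table; the counts are invented purely for illustration.

import numpy as np

# Hypothetical 2x2 contingency table: rows = treated / non-treated,
# columns = cured / non-cured (counts invented for illustration)
observed = np.array([[60, 40],
                     [30, 70]])

# Expected count for each cell: (row total x column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals @ col_totals / grand_total

# Chi-square statistic: sum over all cells of (O - E)^2 / E
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print("Chi-square statistic:", chi2_stat)
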
Hypotheses for Chi-square test for independence

· Null hypothesis (H0): the two categorical variables are independent, i.e. there is no association between the two variables (H0: Oi = Ei)

· Alternative hypothesis (Ha): the two categorical variables are dependent, i.e. there is an association between the two variables (Ha: Oi ≠ Ei)

· There is no choice between a one-tailed and a two-tailed p value: the rejection region of the chi-square test always lies in the right tail of the distribution.
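
As a minimal sketch of this right-tailed decision rule, the snippet below uses scipy.stats on the same hypothetical table as above; the 0.05 significance level is an assumption chosen for illustration.

import numpy as np
from scipy.stats import chi2, chi2_contingency

# Same hypothetical treated/non-treated vs cured/non-cured table as above
observed = np.array([[60, 40],
                     [30, 70]])

# chi2_contingency returns the statistic, p value, degrees of freedom and
# the table of expected counts; correction=False matches the plain formula
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# The rejection region lies entirely in the right tail of the distribution
alpha = 0.05
critical_value = chi2.ppf(1 - alpha, dof)

print(f"statistic={chi2_stat:.3f}, critical value={critical_value:.3f}, p={p_value:.4f}")
if chi2_stat > critical_value:
    print("Reject H0: the two variables appear to be associated")
else:
    print("Fail to reject H0: no evidence of an association")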



Chi-square test assumptions

· Data is randomly sampled and the two variables are categorical (nominal).

· The levels of the variables are mutually exclusive.

· The expected frequency count is at least 5 in at least 80% of the cells of the contingency table. For small expected counts, Fisher's exact test is appropriate.

· The expected frequency count in every cell must be at least 1.

· Observations should be independent of one another, and the data should be raw frequency counts, not percentages, proportions or otherwise transformed values.
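
The sketch below shows one way to check the expected-count assumptions and fall back to Fisher's exact test for a 2x2 table with small counts; the table and the explicit 80%/5 threshold check are assumptions written out for illustration.

import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical 2x2 contingency table with small counts
observed = np.array([[3, 9],
                     [7, 4]])

# Expected counts implied by the table's margins
_, _, _, expected = chi2_contingency(observed, correction=False)

# Assumption check: expected count >= 5 in at least 80% of cells,
# and no expected count below 1
assumptions_ok = (expected >= 5).mean() >= 0.80 and (expected >= 1).all()

if assumptions_ok:
    stat, p, _, _ = chi2_contingency(observed, correction=False)
    print(f"Chi-square test: statistic={stat:.3f}, p={p:.4f}")
else:
    # Expected counts are too small: use Fisher's exact test (2x2 tables only)
    odds_ratio, p = fisher_exact(observed)
    print(f"Fisher's exact test: p={p:.4f}")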

Data Science applications

 If you are a certified data scientist, you have probably encountered some of these problems before. If you are a beginner, these use cases will help you learn data science ideas that can be applied across the industry. For most organizations, data science challenges do not evolve as quickly as they could. Use cases grow across organizations depending on your planning needs and expectations, so it is crucial to capture insights from current use cases so that they can be condensed and carried over to new ones. You will occasionally encounter scenarios that have not been written about in articles or studied at institutions. The appeal of data science is that it is scalable and applicable to many problems while requiring relatively little extra effort.

 1. Credit Card Fraud Detection

In this situation, we would build a supervised model that classifies each transaction as either fraud or not fraud. In an ideal world, you would have many labeled examples of what fraudulent activity looks like in your data.
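
A minimal sketch of that supervised approach, using scikit-learn on synthetic data; the feature count, class imbalance and choice of a random forest are assumptions for illustration, not a production fraud system.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced data standing in for transactions:
# label 1 = fraud, label 0 = legitimate
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.97, 0.03], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight="balanced" compensates for how rare the fraud examples are
model = RandomForestClassifier(n_estimators=200,
                               class_weight="balanced",
                               random_state=0)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))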

 2. Customer Segmentation

In this circumstance, unsupervised learning (clustering) is preferred over supervised classification; K-Means is a traditional clustering algorithm. The task is unsupervised because you do not have labels and do not know in advance what the groups should be. Instead, you would like to uncover groups of customers that share similar characteristics.
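
A short K-Means sketch on synthetic customer features; the three features, the number of clusters and the scaling step are assumptions chosen for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic customer features: [annual spend, visits per month, average basket size]
rng = np.random.default_rng(0)
customers = rng.normal(loc=[500.0, 4.0, 35.0], scale=[200.0, 2.0, 10.0], size=(300, 3))

# Scale the features so each contributes comparably to the distance metric
X = StandardScaler().fit_transform(customers)

# No labels are available; K-Means discovers k groups from the data alone
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)

print(np.bincount(segments))    # size of each discovered segment
print(kmeans.cluster_centers_)  # segment centroids in scaled feature space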

 3. Customer Churn Prediction

This problem is similar to credit card fraud detection, and the same family of supervised machine learning techniques can help. We collect customer information together with a specific label, such as churn or no-churn, and train a classifier on it.
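
A minimal sketch of that labeled-classification pattern applied to churn, here with logistic regression on synthetic data; the feature count, churn rate and model choice are assumptions for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic customer records; label 1 = churn, label 0 = no-churn
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.8, 0.2], random_state=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# The predicted churn probability lets the business rank customers for retention offers
churn_probability = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, churn_probability))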

 4. Sales Forecasting

Sales forecasting differs the most from the three use cases discussed so far. In this example, we can apply deep learning to anticipate future commodity purchases. A common choice is an LSTM network; LSTM stands for Long Short-Term Memory, a recurrent architecture designed for sequence data.
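
A bare-bones LSTM forecasting sketch using Keras on a synthetic monthly sales series; the window length, network size and training settings are assumptions for illustration rather than a tuned model.

import numpy as np
from tensorflow import keras

# Synthetic monthly sales series: trend + yearly seasonality + noise
rng = np.random.default_rng(0)
t = np.arange(240, dtype="float32")
sales = 100 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 3, t.size)

# Turn the series into (12 past months -> next month) training samples
window = 12
X = np.array([sales[i:i + window] for i in range(len(sales) - window)])
y = sales[window:]
X = X[..., np.newaxis]  # LSTM expects shape (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=16, verbose=0)

# Forecast the month after the last observed window
next_month = model.predict(sales[-window:].reshape(1, window, 1), verbose=0)
print("Forecast:", float(next_month[0, 0]))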

