The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it should include the entire target population (and nobody who is not part of that population).
Probability sampling means that every member of the population has a chance of being selected. It is mainly used in quantitative research. If you want to produce results that are representative of the whole population, probability sampling techniques are the most valid choice.
There are four main types of probability sample.
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.
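As an illustration, a simple random draw can be sketched with Python's standard library; the sampling frame and sizes here are made up:

```python
import random

# Hypothetical sampling frame: 500 numbered members of the population.
sampling_frame = list(range(1, 501))

random.seed(42)  # fixed seed so the draw is reproducible
# Draw a simple random sample of 20 members; every member has an
# equal chance of selection and no one is picked twice.
sample = random.sample(sampling_frame, k=20)

print(len(sample))       # 20
print(len(set(sample)))  # 20 -> all selected members are distinct
```

`random.sample` draws without replacement, which matches the usual definition of a simple random sample.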
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.
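A minimal sketch of systematic selection, assuming a made-up frame of 100 people and a desired sample of 10:

```python
import random

# Hypothetical sampling frame of 100 people.
frame = [f"person_{i}" for i in range(1, 101)]

k = len(frame) // 10         # sampling interval: 100 / 10 = 10
random.seed(0)
start = random.randrange(k)  # random starting point within the first interval
sample = frame[start::k]     # then take every k-th member

print(len(sample))  # 10
```

Starting from a random point within the first interval keeps the method probability-based; always starting at the first member would not.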
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g., gender identity, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.
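The two steps above (proportional allocation, then random sampling within each stratum) can be sketched as follows; the strata and their sizes are invented:

```python
import random

# Hypothetical population grouped into strata by job role.
strata = {
    "engineering": list(range(600)),  # 60% of the population
    "sales":       list(range(300)),  # 30%
    "management":  list(range(100)),  # 10%
}
total = sum(len(members) for members in strata.values())
sample_size = 50

random.seed(1)
sample = {}
for name, members in strata.items():
    # Proportional allocation: each stratum contributes according to its share.
    n = round(sample_size * len(members) / total)
    sample[name] = random.sample(members, n)

print({name: len(picked) for name, picked in sample.items()})
# {'engineering': 30, 'sales': 15, 'management': 5}
```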
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have characteristics similar to those of the whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.
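A sketch of the two stages, using invented schools and student IDs:

```python
import random

# Hypothetical population: 12 schools (clusters) of 40 students each.
clusters = {f"school_{i}": [f"s{i}_{j}" for j in range(40)] for i in range(12)}

random.seed(7)
# Stage 1 (cluster sampling): randomly select 3 whole clusters.
chosen = random.sample(list(clusters), 3)

# Stage 2 (multistage): simple random sample of 10 students per chosen cluster.
sample = {school: random.sample(clusters[school], 10) for school in chosen}

print(len(sample))  # 3 clusters, 10 students each
```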
This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole population.
A convenience sample simply includes the individuals who happen to be most accessible to the researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample is representative of the population, so it can’t produce generalizable results. Convenience samples are at risk for both sampling bias and selection bias.
Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by responding to a public online survey).
Voluntary response samples are always at least somewhat biased, as some people will inherently be more likely to volunteer than others, leading to self-selection bias.
Purposive sampling, also known as judgement sampling, involves the researcher using their expertise to select a sample that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed knowledge about a specific phenomenon rather than make statistical inferences, or where the population is very small and specific. An effective purposive sample must have clear criteria and rationale for inclusion. Always make sure to describe your inclusion and exclusion criteria and beware of observer bias affecting your arguments.
If the population is hard to access, snowball sampling can be used to recruit participants via other participants. The number of people you have access to “snowballs” as you get in contact with more people. The downside here is again representativeness: you have no way of knowing how representative your sample is, due to the reliance on participants recruiting others. This can lead to sampling bias.
Quota sampling relies on the non-random selection of a predetermined number or proportion of units. This is called a quota.
You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units until you reach your quota. These units share specific characteristics, determined by you prior to forming your strata. The aim of quota sampling is to control what or who makes up your sample.
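Filling quotas can be sketched as below; the subgroups, quota sizes, and volunteer stream are all made up:

```python
import random

random.seed(5)
quotas = {"male": 5, "female": 5}   # predetermined quota per subgroup
sample = {"male": [], "female": []}

def next_volunteer(i):
    # Simulated recruit; in a real study this would be an actual participant.
    return {"id": i, "gender": random.choice(["male", "female"])}

i = 0
while any(len(sample[g]) < quotas[g] for g in quotas):
    v = next_volunteer(i)
    g = v["gender"]
    if len(sample[g]) < quotas[g]:  # non-random: accept recruits until the quota is met
        sample[g].append(v["id"])
    i += 1

print({g: len(ids) for g, ids in sample.items()})  # {'male': 5, 'female': 5}
```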
You’ve probably encountered this underlying bias every day of your life. We all love being right, so our brains are constantly on the hunt for evidence that supports our prior beliefs. Even if we’re trying our best to be open to alternative ideas, our minds are pushing back towards the safety and comfort of our own first thoughts. This can happen subconsciously through biases in how we search for, interpret, or recall information, or consciously, when we decide to cherry pick, by focusing on information that supports our arguments.
How to avoid confirmation bias
Selection biases occur when looking at samples that are not representative of the population. This can happen organically when working with small sets of data, or when the sampling methodology is not truly randomized.
How to avoid selection bias
Historical data bias occurs when socio-cultural prejudices and beliefs are mirrored into systematic processes. This becomes particularly challenging when data from historically-biased sources are used to train machine learning models—for example, if manual systems give certain groups of people poor credit ratings, and you’re using that data to train the automatic system, the automatic system will replicate and may amplify the original system’s biases.
How to avoid historical bias
It’s easier to focus on the winners rather than the runners-up. If you think back to your favorite competition from the 2016 Olympics, it’s probably pretty tough to recall who got the silver and bronze. Survivorship bias leads us to focus on the characteristics of winners, because the other cases are less visible, which muddles our ability to distinguish correlation from causation.
How to avoid survivorship bias
Availability of data has a big influence on how we view the world—but not all data is investigated and weighed equally. Have you ever found yourself wondering if crime has increased in your neighborhood because you’ve seen a broken car window? You’ve seen a vivid clue that something might be going on, but since you probably didn’t go on to investigate crime statistics, it’s likely that your perception shifted based on the immediately available information.
How to avoid availability bias
Averages are a great place to hide uncomfortable truths. Some data is convenient to visualize as an average, but this simple operation hides the effect of outliers and anomalies, and skews our observations.
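A quick illustration with invented salary figures shows how a single outlier distorts the mean while the median stays put:

```python
from statistics import mean, median

# Nine similar salaries plus one extreme outlier (in $1000s).
salaries = [42, 45, 44, 43, 46, 41, 44, 45, 43, 400]

print(mean(salaries))    # 79.3 -> pulled far above the typical value
print(median(salaries))  # 44.0 -> robust to the outlier
```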
How to avoid outlier bias
Hypothesis testing means using a systematic procedure to decide whether the data from a research study support a particular theory that applies to a population.
We do this by using two mutually exclusive hypotheses about a population, and evaluating these statements to decide if the statements are supported by the sample data.
If you want to check your results against predictions, hypothesis testing is the tool to use. It allows you to compare, for example, the before and after results of an intervention.
It is generally used when we want to compare:
In the world of Data Science, there are two parts to consider when putting together a hypothesis.
Hypothesis Testing is when the team builds a strong hypothesis based on the available dataset. This helps direct the team and plan accordingly throughout the data science project. The hypothesis is then tested against a complete dataset to determine whether it is supported or rejected.
Hypothesis Generation is an educated guess based on various factors that can be used to resolve the problem at hand. It is the process of combining our problem-solving skills with our business intuition. You focus on how specific factors impact the target variable and then draw conclusions about the relationship between the variables using hypothesis testing.
A null hypothesis states that there is no relationship between the statistical variables; we refer to this type of testing as null hypothesis testing. A null hypothesis is represented as H0. There are several types of null hypotheses:
An alternative hypothesis states that there is a relationship between two variables, i.e., that they have a statistical bond. An alternative hypothesis is represented as H1 or HA. The alternative hypothesis can be split into:
One-tailed. This is when you are testing in one direction and disregarding the possibility of an effect in the other direction. The sample mean would be higher or lower than the population mean, but not both.
Two-tailed. This is when you are testing in both directions, checking whether the sample mean is higher or lower than the population mean.
Non-directional. This is when a hypothesis does not state a direction but states that one factor affects another, or that there is a correlation between two variables. The main point is that no direction is specified between the two variables.
Directional. This is when a hypothesis specifies the directional relationship between two variables and is based upon existing theory.
Hypothesis testing helps data scientists to:
A parameter is a summary description of the target population. For example, if you were given the task of finding the average height of your classmates, you would ask everyone in your class (the population) about their height. Because everyone was asked, you have a true description of the population, and the resulting value is a parameter.
A statistic is a description of a small portion of a population (a sample). Using the same example as above, if you are now tasked with finding the average height of your age group (the population), you can use the information gathered from your class (the sample). This value is known as a statistic.
A sampling distribution is the probability distribution of a statistic obtained by drawing a large number of samples from a specific population. For example, suppose you take a random sample of 10 coffee shops in your borough from a population of 200 coffee shops. The random sample could be coffee shops 4, 7, 13, 76, 94, 145, 11, 189, 52, and 165, or any other combination.
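The idea can be simulated: draw many samples from one (made-up) population and look at the distribution of their means:

```python
import random
from statistics import mean

random.seed(2)
# Hypothetical population: daily sales figures for 200 coffee shops.
population = [random.gauss(500, 80) for _ in range(200)]

# Sampling distribution of the mean: draw many random samples of 10 shops
# and record each sample's mean.
sample_means = [mean(random.sample(population, 10)) for _ in range(1000)]

print(round(mean(population), 1))
print(round(mean(sample_means), 1))  # clusters close to the population mean
```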
Standard error is similar to standard deviation in that both measure spread: the higher the value, the more spread out your data is. The difference is that standard error measures the spread of a sample statistic across repeated samples, whereas standard deviation measures the spread of individual data points. The standard error tells you how far your sample statistic is likely to be from the actual population mean.
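The standard error of the mean is the sample standard deviation divided by the square root of the sample size; a sketch with invented measurements:

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of 8 measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
print(round(se, 3))  # 0.087
```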
Type-I error, also known as a false positive, happens when the team incorrectly rejects a true null hypothesis. This means that the report states that your findings are significant when in fact they occurred by chance.
Type-II error, also known as a false negative, happens when the team fails to reject a null hypothesis that is in fact false. This means that the report states that your findings are not significant when they actually are.
The level of significance is the maximum probability of making a false positive conclusion (Type I error) that you are willing to accept. Data scientists and researchers set this in advance and use it as a threshold for statistical significance.
P-value means probability value: a number compared to the significance level to decide whether to reject the null hypothesis. It measures how compatible the sample data are with the null hypothesis. If the p-value is higher than the significance level, the null hypothesis is not rejected and the results are not statistically significant. If the p-value is lower than the significance level, the results are interpreted as evidence against the null hypothesis and are considered statistically significant.
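The comparison against the significance level can be sketched for a z statistic; the test statistic here is made up:

```python
from math import erf, sqrt

def two_sided_p_from_z(z):
    # Two-sided p-value under the standard normal distribution.
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # CDF evaluated at |z|
    return 2 * (1 - phi)

alpha = 0.05  # significance level, chosen in advance
z = 2.3       # hypothetical test statistic
p = two_sided_p_from_z(z)

print(round(p, 4))
print("reject H0" if p < alpha else "fail to reject H0")  # reject H0
```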
What Is a Confidence Interval?
A confidence interval shows the probability that a parameter will fall between a pair of values around the mean. Confidence intervals show the degree of uncertainty or certainty in a sampling method. They are most often constructed using confidence levels of 95% or 99%.
Statisticians use confidence intervals to measure the uncertainty in a sample variable. The confidence is in the method, not in a particular CI: if the sampling method were repeated many times, approximately 95% of the intervals constructed would capture the true population mean.

Confidence Interval Formula
The formula to find a confidence interval is:

CI = X̄ ± Z × (S / √n)

where X̄ is the sample mean, Z is the Z value for the chosen confidence level, S is the standard deviation, and n is the number of observations.
The value after the ± symbol is known as the margin of error.
Question: A tree holds hundreds of mangoes. You randomly choose 40 mangoes; the sample has a mean of 80 and a standard deviation of 4.3. Construct a 95% confidence interval for the mean size of the mangoes.
Solution:
Mean = 80
Standard deviation = 4.3
Number of observations = 40
Take the confidence level as 95%. Therefore, the value of Z = 1.960.
Substituting the value in the formula, we get
= 80 ± 1.960 × [ 4.3 / √40 ]
= 80 ± 1.960 × [ 4.3 / 6.32]
= 80 ± 1.960 × 0.6803
= 80 ± 1.33
The margin of error is 1.33
The true mean size of all the hundreds of mangoes is therefore likely (at the 95% confidence level) to lie between 78.67 and 81.33.
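The mango calculation above can be wrapped in a small helper (a sketch, not a full statistics routine):

```python
from math import sqrt

def confidence_interval(mean, sd, n, z=1.960):
    # CI = mean ± z * (sd / sqrt(n)); the second term is the margin of error.
    margin = z * (sd / sqrt(n))
    return mean - margin, mean + margin

low, high = confidence_interval(mean=80, sd=4.3, n=40)
print(round(low, 2), round(high, 2))  # 78.67 81.33
```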
Imagine a group of researchers who are trying to decide whether or not the oranges produced on a certain farm are large enough to be sold to a potential grocery chain. This will serve as an example of how to compute a confidence interval.
46 oranges are chosen at random by the researchers from the farm's trees.
Consequently, n is 46.
The researchers next determine the sample's mean weight, which comes out to be 86 grams.
X = 86.
Although using the population-wide standard deviation is ideal, this data is frequently unavailable to researchers. In this scenario, the researchers should use the standard deviation calculated from the sample.
Let's assume, for our example, that the researchers have chosen to compute the standard deviation from their sample. They get a standard deviation of 6.2 grams.
S = 6.2.
In ordinary market research studies, 95% and 99% are the most popular choices for confidence intervals.
For this example, let's assume that the researchers employ a 95% confidence interval.
The researchers would subsequently use the following table to establish their Z value:

Confidence level    Z value
80%                 1.282
85%                 1.440
90%                 1.645
95%                 1.960
99%                 2.576
99.5%               2.807
99.9%               3.291
The next step would be for the researchers to enter their known values into the formula. Following our example, this formula would look like this:
86 ± 1.960 × (6.2 / √46) = 86 ± 1.960 × (6.2 / 6.782)
This calculation yields 86 ± 1.79, which the researchers use as their confidence interval.
According to the study's findings, the real mean of the larger population of oranges probably lies (with a 95% confidence level) between 84.21 grams and 87.79 grams.
Z is the number of standard deviations from the sample mean (1.96 for 95% confidence, 2.576 for 99%). Z-scores can be positive or negative. The sign tells you whether the observation is above or below the mean. For example, a z-score of +1 shows that the data point falls one standard deviation above the mean, while a -1 signifies it is one standard deviation below the mean. A z-score of zero equals the mean.
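Computing a z-score for one observation, with invented data:

```python
from statistics import mean, stdev

data = [10, 12, 14, 16, 18]  # hypothetical observations
x = 16

z = (x - mean(data)) / stdev(data)  # standard deviations from the mean
print(round(z, 2))  # 0.63 -> about 0.63 standard deviations above the mean
```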
Statisticians use confidence intervals to measure the uncertainty in a sample variable. For instance, a researcher may randomly select different samples from the same population and compute a confidence interval for each sample to determine how well it represents the actual value of the population variable. The resulting intervals are all different, with some including and others not including the true population parameter.
Statistical methods such as the t-test are used to calculate confidence intervals. A t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups that could be linked to specific characteristics. Three fundamental data values are required to calculate a t-test: the mean difference (the difference between the mean values of the two data sets), the standard deviation of each group, and the number of data points in each group.

Mean Of Normally-Distributed Data
A standard normal distribution has a mean of 0 and a standard deviation of 1, with zero skew and a kurtosis of 3.

Confidence Interval For Proportions
In newspaper stories during election years, confidence intervals are expressed as proportions or percentages. For instance, a survey for a specific presidential contender may indicate that they are within three percentage points of 40% of the vote (if the sample is large enough). The pollsters would be 95% certain that the actual percentage of voters who supported the candidate would be between 37% and 43% because election polls are frequently computed with a 95% confidence level.
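The poll numbers above follow from the usual proportion formula, p̂ ± z·√(p̂(1−p̂)/n); a sketch with an assumed sample of 1,000 respondents:

```python
from math import sqrt

# Hypothetical poll: 400 of 1000 respondents support the candidate.
p_hat = 400 / 1000
n = 1000
z = 1.960  # 95% confidence

margin = z * sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - margin, 3), round(p_hat + margin, 3))  # 0.37 0.43
```

With n = 1000 the margin comes out near three percentage points, matching the example.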
Stock market investors are most interested in knowing the actual percentage of equities that rise and fall each week. The percentage of American households with personal computers is relevant to companies selling computers. Confidence intervals may be established for the weekly percentage change in stock prices and the percentage of American homes with personal computers.

Confidence Interval For Non-Normally Distributed Data
In data analysis, calculating the confidence interval is a typical step that may be easily derived for populations with normally distributed data using the well-known x̄ ± (t·s)/√n formula. The confidence interval, however, is not always easy to determine when working with data that is not normally distributed. References for this case are fewer and far less readily available in the literature.
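One distribution-free option is the percentile bootstrap: resample the data with replacement many times, recompute the statistic each time, and take percentiles of the results. A sketch for the median, with made-up skewed data:

```python
import random
from statistics import median

random.seed(3)
# Hypothetical, skewed (non-normal) measurements.
data = [0.4, 0.5, 0.5, 0.7, 0.9, 1.1, 1.6, 2.4, 3.8, 7.2]

# Percentile bootstrap: resample with replacement many times, recompute
# the median each time, then take the 2.5th and 97.5th percentiles.
boot_medians = sorted(
    median(random.choices(data, k=len(data))) for _ in range(10_000)
)
low = boot_medians[int(0.025 * len(boot_medians))]
high = boot_medians[int(0.975 * len(boot_medians))]
print(low, high)  # 95% bootstrap CI for the median
```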
We explain the percentile, bias-corrected, and accelerated versions of the bootstrap method for calculating confidence intervals in plain terms. This approach is suitable for both normal and non-normal data sets and may be used to calculate a broad range of metrics, including the mean, the median, the slope of a calibration curve, etc. As a practical example, the bootstrap method determines the confidence interval around the median level of cocaine in femoral blood.

Reporting Confidence Intervals
We always present confidence intervals in the manner shown below:
95% CI [LL, UL]
LL: Lower limit of the confidence interval,
UL: Upper confidence interval limit
The practice of reporting confidence intervals for various statistical tests is demonstrated in the examples below.
Let's say a scientist is interested in learning the average weight of a certain turtle species.
She weighs 25 turtles at random and determines that the mean weight of the sample is 300 pounds, with a 95% confidence interval of [292.75 pounds, 307.25 pounds].
She may report the findings as follows:
According to a formal study, this population's turtles weigh an average of 300 pounds, 95% confidence interval [292.75, 307.25].
Let's say a scientist wishes to calculate the variation in mean weight between two turtle populations.
The mean difference, with a 90% confidence interval of [-3.07 pounds, 23.07 pounds], is 10 pounds after she gathers data for both turtle populations.
She may report the findings as follows:
According to formal research, there is an average weight difference of 10 pounds, 90% CI [-3.07, 23.07], between the two groups of turtles.

Caution When Using Confidence Intervals
A common misreading of confidence intervals is that the 'actual value' of your estimate must lie inside the interval. That is not the case. Because the confidence interval is based on a sample rather than the entire population, it cannot tell you how probable it is that you have found the real value of your statistical estimate. It can only tell you what range of values to expect if you repeat your sampling or conduct your experiment in the same manner.

Misconception About Confidence Intervals
Since a confidence interval is not a probability, it is incorrect to state that there is a 95% chance that a particular 95% confidence interval will include the actual value of the estimated parameter.

How Do You Interpret P-Values And Confidence Intervals?
Statistical tests are used in confirmatory (evidential) research to determine whether null hypotheses should be rejected. The outcome of such a statistical test is the p-value, a probability that indicates the strength of the evidence against the null hypothesis. Low p-values correspond to strong evidence. The results are deemed "statistically significant" if the p-value falls below a certain threshold.

Confidence Interval Example
If you compute a 95% confidence interval around the mean proportion of female infants born each year using a random sample of newborns, you may find a lower bound of 0.48 and an upper bound of 0.56. These are the lower and upper limits of the confidence interval, at a confidence level of 95%.

Conclusion
In this confidence interval in statistics tutorial, you have learned the importance of confidence intervals and the formula to calculate the same. The confidence interval tells you the range of values you can expect if you re-do the experiment in the same way.