The sampling frame is the actual list of individuals that the sample will be drawn from. Ideally, it should include the entire target population (and nobody who is not part of that population).
Probability sampling means that every member of the population has a chance of being selected. It is mainly used in quantitative research. If you want to produce results that are representative of the whole population, probability sampling techniques are the most valid choice.
There are four main types of probability sample.
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.
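As an illustration, a simple random draw can be sketched with Python's standard library; the sampling frame and sizes here are made up:

```python
import random

# Hypothetical sampling frame: 500 numbered members of the population.
sampling_frame = list(range(1, 501))

random.seed(42)  # fixed seed so the draw is reproducible
# Draw a simple random sample of 20 members; every member has an
# equal chance of selection and no one is picked twice.
sample = random.sample(sampling_frame, k=20)

print(len(sample))       # 20
print(len(set(sample)))  # 20 -> all selected members are distinct
```

`random.sample` draws without replacement, which matches the usual definition of a simple random sample.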
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.
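A minimal sketch of systematic selection, assuming a made-up frame of 100 people and a desired sample of 10:

```python
import random

# Hypothetical sampling frame of 100 people.
frame = [f"person_{i}" for i in range(1, 101)]

k = len(frame) // 10         # sampling interval: 100 / 10 = 10
random.seed(0)
start = random.randrange(k)  # random starting point within the first interval
sample = frame[start::k]     # then take every k-th member

print(len(sample))  # 10
```

Starting from a random point within the first interval keeps the method probability-based; always starting at the first member would not.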
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g., gender identity, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.
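The two steps above (proportional allocation, then random sampling within each stratum) can be sketched as follows; the strata and their sizes are invented:

```python
import random

# Hypothetical population grouped into strata by job role.
strata = {
    "engineering": list(range(600)),  # 60% of the population
    "sales":       list(range(300)),  # 30%
    "management":  list(range(100)),  # 10%
}
total = sum(len(members) for members in strata.values())
sample_size = 50

random.seed(1)
sample = {}
for name, members in strata.items():
    # Proportional allocation: each stratum contributes according to its share.
    n = round(sample_size * len(members) / total)
    sample[name] = random.sample(members, n)

print({name: len(picked) for name, picked in sample.items()})
# {'engineering': 30, 'sales': 15, 'management': 5}
```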
Cluster sampling also involves dividing the population into subgroups, but each subgroup should have characteristics similar to those of the whole population. Instead of sampling individuals from each subgroup, you randomly select entire subgroups.
If it is practically possible, you might include every individual from each sampled cluster. If the clusters themselves are large, you can also sample individuals from within each cluster using one of the techniques above. This is called multistage sampling.
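A sketch of the two stages, using invented schools and student IDs:

```python
import random

# Hypothetical population: 12 schools (clusters) of 40 students each.
clusters = {f"school_{i}": [f"s{i}_{j}" for j in range(40)] for i in range(12)}

random.seed(7)
# Stage 1 (cluster sampling): randomly select 3 whole clusters.
chosen = random.sample(list(clusters), 3)

# Stage 2 (multistage): simple random sample of 10 students per chosen cluster.
sample = {school: random.sample(clusters[school], 10) for school in chosen}

print(len(sample))  # 3 clusters, 10 students each
```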
This method is good for dealing with large and dispersed populations, but there is more risk of error in the sample, as there could be substantial differences between clusters. It’s difficult to guarantee that the sampled clusters are really representative of the whole population.
A convenience sample simply includes the individuals who happen to be most accessible to the researcher.
This is an easy and inexpensive way to gather initial data, but there is no way to tell if the sample is representative of the population, so it can’t produce generalizable results. Convenience samples are at risk for both sampling bias and selection bias.
Similar to a convenience sample, a voluntary response sample is mainly based on ease of access. Instead of the researcher choosing participants and directly contacting them, people volunteer themselves (e.g. by responding to a public online survey).
Voluntary response samples are always at least somewhat biased, as some people will inherently be more likely to volunteer than others, leading to self-selection bias.
Purposive sampling, also known as judgement sampling, involves the researcher using their expertise to select a sample that is most useful to the purposes of the research.
It is often used in qualitative research, where the researcher wants to gain detailed knowledge about a specific phenomenon rather than make statistical inferences, or where the population is very small and specific. An effective purposive sample must have clear criteria and rationale for inclusion. Always make sure to describe your inclusion and exclusion criteria and beware of observer bias affecting your arguments.
If the population is hard to access, snowball sampling can be used to recruit participants via other participants. The number of people you have access to “snowballs” as you get in contact with more people. The downside here is again representativeness: you have no way of knowing how representative your sample is, due to the reliance on participants recruiting others. This can lead to sampling bias.
Quota sampling relies on the non-random selection of a predetermined number or proportion of units. This is called a quota.
You first divide the population into mutually exclusive subgroups (called strata) and then recruit sample units until you reach your quota. These units share specific characteristics, determined by you prior to forming your strata. The aim of quota sampling is to control what or who makes up your sample.
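Filling quotas can be sketched as below; the subgroups, quota sizes, and volunteer stream are all made up:

```python
import random

random.seed(5)
quotas = {"male": 5, "female": 5}   # predetermined quota per subgroup
sample = {"male": [], "female": []}

def next_volunteer(i):
    # Simulated recruit; in a real study this would be an actual participant.
    return {"id": i, "gender": random.choice(["male", "female"])}

i = 0
while any(len(sample[g]) < quotas[g] for g in quotas):
    v = next_volunteer(i)
    g = v["gender"]
    if len(sample[g]) < quotas[g]:  # non-random: accept recruits until the quota is met
        sample[g].append(v["id"])
    i += 1

print({g: len(ids) for g, ids in sample.items()})  # {'male': 5, 'female': 5}
```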
You’ve probably encountered this underlying bias every day of your life. We all love being right, so our brains are constantly on the hunt for evidence that supports our prior beliefs. Even if we’re trying our best to be open to alternative ideas, our minds are pushing back towards the safety and comfort of our own first thoughts. This can happen subconsciously through biases in how we search for, interpret, or recall information, or consciously, when we decide to cherry pick, by focusing on information that supports our arguments.
How to avoid confirmation bias
Selection biases occur when looking at samples that are not representative of the population. This can happen organically when working with small sets of data, or when the sampling methodology is not truly randomized.
How to avoid selection bias
Historical data bias occurs when socio-cultural prejudices and beliefs are mirrored into systematic processes. This becomes particularly challenging when data from historically-biased sources are used to train machine learning models—for example, if manual systems give certain groups of people poor credit ratings, and you’re using that data to train the automatic system, the automatic system will replicate and may amplify the original system’s biases.
How to avoid historical bias
It’s easier to focus on the winners rather than the runners-up. If you think back to your favorite competition from the 2016 Olympics, it’s probably pretty tough to recall who got the silver and bronze. Survivorship bias leads us to focus on the characteristics of winners, because the other cases are less visible, which muddles our ability to distinguish correlation from causation.
How to avoid survivorship bias
Availability of data has a big influence on how we view the world—but not all data is investigated and weighed equally. Have you ever found yourself wondering if crime has increased in your neighborhood because you’ve seen a broken car window? You’ve seen a vivid clue that something might be going on, but since you probably didn’t go on to investigate crime statistics, it’s likely that your perception shifted based on the immediately available information.
How to avoid availability bias
Averages are a great place to hide uncomfortable truths. Some data is convenient to visualize as an average, but this simple operation hides the effect of outliers and anomalies, and skews our observations.
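A quick illustration with invented salary figures shows how a single outlier distorts the mean while the median stays put:

```python
from statistics import mean, median

# Nine similar salaries plus one extreme outlier (in $1000s).
salaries = [42, 45, 44, 43, 46, 41, 44, 45, 43, 400]

print(mean(salaries))    # 79.3 -> pulled far above the typical value
print(median(salaries))  # 44.0 -> robust to the outlier
```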
How to avoid outlier bias
Hypothesis testing means using a systematic procedure to decide whether the data from a research study support a particular theory that applies to a population.
We do this by using two mutually exclusive hypotheses about a population, and evaluating these statements to decide if the statements are supported by the sample data.
If you want to check your results against predictions, hypothesis testing is the tool to use. It allows you to compare, for example, the before and after results of an intervention.
It is generally used when we want to compare:
In the world of Data Science, there are two parts to consider when putting together a hypothesis.
Hypothesis Testing is when the team builds a strong hypothesis based on the available dataset. This helps direct the team and plan accordingly throughout the data science project. The hypothesis is then tested against a complete dataset to determine whether it is supported or rejected.
Hypothesis Generation is an educated guess based on various factors that can be used to resolve the problem at hand. It is the process of combining our problem-solving skills with our business intuition. You focus on how specific factors impact the target variable and then draw conclusions about the relationship between the variables using hypothesis testing.
A null hypothesis states that there is no relationship between the statistical variables; we refer to this type of testing as null hypothesis testing. A null hypothesis is represented as H0. There are several types of null hypotheses:
An alternative hypothesis states that there is a relationship between two variables, i.e., that they have a statistical bond. An alternative hypothesis is represented as H1 or HA. The alternative hypothesis can be split into:
One-tailed. This is when you are testing in one direction and disregarding the possibility of an effect in the other direction. The sample mean would be higher or lower than the population mean, but not both.
Two-tailed. This is when you are testing in both directions, checking whether the sample mean is higher or lower than the population mean.
Non-directional. This is when a hypothesis does not state a direction but states that one factor affects another, or that there is a correlation between two variables. The main point is that no direction is specified between the two variables.
Directional. This is when a hypothesis specifies the directional relationship between two variables and is based upon existing theory.
Hypothesis testing helps data scientists to:
A parameter is a summary description of the target population. For example, if you were given the task of finding the average height of your classmates, you would ask everyone in your class (the population) about their height. Because everyone was asked, you have a true description of the population, and the resulting value is a parameter.
A statistic is a description of a small portion of a population (a sample). Using the same example as above, if you are now tasked with finding the average height of your age group (the population), you can use the information gathered from your class (the sample). This value is known as a statistic.
A sampling distribution is the probability distribution of a statistic obtained by drawing a large number of samples from a specific population. For example, suppose you take a random sample of 10 coffee shops in your borough from a population of 200 coffee shops. The random sample could be coffee shops 4, 7, 13, 76, 94, 145, 11, 189, 52, and 165, or any other combination.
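The idea can be simulated: draw many samples from one (made-up) population and look at the distribution of their means:

```python
import random
from statistics import mean

random.seed(2)
# Hypothetical population: daily sales figures for 200 coffee shops.
population = [random.gauss(500, 80) for _ in range(200)]

# Sampling distribution of the mean: draw many random samples of 10 shops
# and record each sample's mean.
sample_means = [mean(random.sample(population, 10)) for _ in range(1000)]

print(round(mean(population), 1))
print(round(mean(sample_means), 1))  # clusters close to the population mean
```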
Standard error is similar to standard deviation in that both measure spread: the higher the value, the more spread out your data is. The difference is that standard error measures the spread of a sample statistic across repeated samples, whereas standard deviation measures the spread of individual data points. The standard error tells you how far your sample statistic is likely to be from the actual population mean.
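The standard error of the mean is the sample standard deviation divided by the square root of the sample size; a sketch with invented measurements:

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of 8 measurements.
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]

se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
print(round(se, 3))  # 0.087
```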
Type-I error, also known as a false positive, happens when the team incorrectly rejects a true null hypothesis. This means that the report states that your findings are significant when in fact they occurred by chance.
Type-II error, also known as a false negative, happens when the team fails to reject a null hypothesis that is in fact false. This means that the report states that your findings are not significant when they actually are.
The level of significance is the maximum probability of making a false positive conclusion (Type I error) that you are willing to accept. Data scientists and researchers set this in advance and use it as a threshold for statistical significance.
P-value means probability value: a number compared to the significance level to decide whether to reject the null hypothesis. It measures how compatible the sample data are with the null hypothesis. If the p-value is higher than the significance level, the null hypothesis is not rejected and the results are not statistically significant. If the p-value is lower than the significance level, the results are interpreted as evidence against the null hypothesis and are considered statistically significant.
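The comparison against the significance level can be sketched for a z statistic; the test statistic here is made up:

```python
from math import erf, sqrt

def two_sided_p_from_z(z):
    # Two-sided p-value under the standard normal distribution.
    phi = 0.5 * (1 + erf(abs(z) / sqrt(2)))  # CDF evaluated at |z|
    return 2 * (1 - phi)

alpha = 0.05  # significance level, chosen in advance
z = 2.3       # hypothetical test statistic
p = two_sided_p_from_z(z)

print(round(p, 4))
print("reject H0" if p < alpha else "fail to reject H0")  # reject H0
```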
What Is a Confidence Interval?
A confidence interval shows the probability that a parameter will fall between a pair of values around the mean. Confidence intervals show the degree of uncertainty or certainty in a sampling method. They are most often constructed using confidence levels of 95% or 99%.
Statisticians use confidence intervals to measure the uncertainty in a sample variable. The confidence is in the method, not in a particular CI: if the sampling method were repeated many times, approximately 95% of the intervals constructed would capture the true population mean.

Confidence Interval Formula
The formula to find a confidence interval is:

CI = X̄ ± Z × (S / √n)

where X̄ is the sample mean, Z is the Z value for the chosen confidence level, S is the standard deviation, and n is the number of observations.
The value after the ± symbol is known as the margin of error.
Question: A tree holds hundreds of mangoes. You randomly choose 40 mangoes; the sample has a mean of 80 and a standard deviation of 4.3. Construct a 95% confidence interval for the mean size of the mangoes.
Solution:
Mean = 80
Standard deviation = 4.3
Number of observations = 40
Take the confidence level as 95%. Therefore, the value of Z = 1.960.
Substituting the value in the formula, we get
= 80 ± 1.960 × [ 4.3 / √40 ]
= 80 ± 1.960 × [ 4.3 / 6.32]
= 80 ± 1.960 × 0.6803
= 80 ± 1.33
The margin of error is 1.33
The true mean size of all the hundreds of mangoes is therefore likely (at the 95% confidence level) to lie between 78.67 and 81.33.
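The mango calculation above can be wrapped in a small helper (a sketch, not a full statistics routine):

```python
from math import sqrt

def confidence_interval(mean, sd, n, z=1.960):
    # CI = mean ± z * (sd / sqrt(n)); the second term is the margin of error.
    margin = z * (sd / sqrt(n))
    return mean - margin, mean + margin

low, high = confidence_interval(mean=80, sd=4.3, n=40)
print(round(low, 2), round(high, 2))  # 78.67 81.33
```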
Imagine a group of researchers who are trying to decide whether or not the oranges produced on a certain farm are large enough to be sold to a potential grocery chain. This will serve as an example of how to compute a confidence interval.
46 oranges are chosen at random by the researchers from the farm's trees.
Consequently, n is 46.
The researchers next determine the sample's mean weight, which comes out to be 86 grams.
X = 86.
Although using the population-wide standard deviation is ideal, this data is frequently unavailable to researchers. In this scenario, the researchers should use the standard deviation calculated from the sample.
Let's assume, for our example, that the researchers have chosen to compute the standard deviation from their sample. They get a standard deviation of 6.2 grams.
S = 6.2.
In ordinary market research studies, 95% and 99% are the most popular choices for confidence intervals.
For this example, let's assume that the researchers employ a 95% confidence interval.
The researchers would subsequently use the following table to establish their Z value:

Confidence level    Z value
80%                 1.282
85%                 1.440
90%                 1.645
95%                 1.960
99%                 2.576
99.5%               2.807
99.9%               3.291
The next step would be for the researchers to enter their known values into the formula. Following our example, this formula would look like this:
86 ± 1.960 × (6.2 / √46) = 86 ± 1.960 × (6.2 / 6.782)
This calculation yields 86 ± 1.79, which the researchers use as their confidence interval.
According to the study's findings, the real mean of the larger population of oranges probably lies (with a 95% confidence level) between 84.21 grams and 87.79 grams.
Z is the number of standard deviations from the sample mean (1.96 for 95% confidence, 2.576 for 99%). Z-scores can be positive or negative. The sign tells you whether the observation is above or below the mean. For example, a z-score of +1 shows that the data point falls one standard deviation above the mean, while a -1 signifies it is one standard deviation below the mean. A z-score of zero equals the mean.
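Computing a z-score for one observation, with invented data:

```python
from statistics import mean, stdev

data = [10, 12, 14, 16, 18]  # hypothetical observations
x = 16

z = (x - mean(data)) / stdev(data)  # standard deviations from the mean
print(round(z, 2))  # 0.63 -> about 0.63 standard deviations above the mean
```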
Statisticians use confidence intervals to measure the uncertainty in a sample variable. For instance, a researcher may randomly select different samples from the same population and compute a confidence interval for each sample to determine how well it represents the actual value of the population variable. The resulting intervals are all different, with some including and others not including the true population parameter.
Statistical methods such as the t-test are used to calculate confidence intervals. A t-test is an inferential statistic used to determine whether there is a significant difference between the means of two groups that could be linked to specific characteristics. Three fundamental data values are required to calculate a t-test: the mean difference (the difference between the mean values of the two data sets), the standard deviation of each group, and the number of data points in each group.

Mean Of Normally-Distributed Data
A standard normal distribution has a mean of 0 and a standard deviation of 1, with zero skew and a kurtosis of 3.

Confidence Interval For Proportions
In newspaper stories during election years, confidence intervals are expressed as proportions or percentages. For instance, a survey for a specific presidential contender may indicate that they are within three percentage points of 40% of the vote (if the sample is large enough). The pollsters would be 95% certain that the actual percentage of voters who supported the candidate would be between 37% and 43% because election polls are frequently computed with a 95% confidence level.
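The poll numbers above follow from the usual proportion formula, p̂ ± z·√(p̂(1−p̂)/n); a sketch with an assumed sample of 1,000 respondents:

```python
from math import sqrt

# Hypothetical poll: 400 of 1000 respondents support the candidate.
p_hat = 400 / 1000
n = 1000
z = 1.960  # 95% confidence

margin = z * sqrt(p_hat * (1 - p_hat) / n)
print(round(p_hat - margin, 3), round(p_hat + margin, 3))  # 0.37 0.43
```

With n = 1000 the margin comes out near three percentage points, matching the example.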
Stock market investors are most interested in knowing the actual percentage of equities that rise and fall each week. The percentage of American households with personal computers is relevant to companies selling computers. Confidence intervals may be established for the weekly percentage change in stock prices and the percentage of American homes with personal computers.

Confidence Interval For Non-Normally Distributed Data
In data analysis, calculating the confidence interval is a typical step that may be easily derived for populations with normally distributed data using the well-known x̄ ± (t·s)/√n formula. The confidence interval, however, is not always easy to determine when working with data that is not normally distributed. References for this case are fewer and far less readily available in the literature.
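One distribution-free option is the percentile bootstrap: resample the data with replacement many times, recompute the statistic each time, and take percentiles of the results. A sketch for the median, with made-up skewed data:

```python
import random
from statistics import median

random.seed(3)
# Hypothetical, skewed (non-normal) measurements.
data = [0.4, 0.5, 0.5, 0.7, 0.9, 1.1, 1.6, 2.4, 3.8, 7.2]

# Percentile bootstrap: resample with replacement many times, recompute
# the median each time, then take the 2.5th and 97.5th percentiles.
boot_medians = sorted(
    median(random.choices(data, k=len(data))) for _ in range(10_000)
)
low = boot_medians[int(0.025 * len(boot_medians))]
high = boot_medians[int(0.975 * len(boot_medians))]
print(low, high)  # 95% bootstrap CI for the median
```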
We explain the percentile, bias-corrected, and accelerated versions of the bootstrap method for calculating confidence intervals in plain terms. This approach is suitable for both normal and non-normal data sets and may be used to calculate a broad range of metrics, including the mean, the median, the slope of a calibration curve, etc. As a practical example, the bootstrap method determines the confidence interval around the median level of cocaine in femoral blood.

Reporting Confidence Intervals
We always present confidence intervals in the manner shown below:
95% CI [LL, UL]
LL: Lower limit of the confidence interval,
UL: Upper confidence interval limit
The practice of reporting confidence intervals for various statistical tests is demonstrated in the examples below.
Let's say a scientist is interested in learning the average weight of a certain turtle species.
She weighs 25 turtles at random and determines that the mean weight of the sample is 300 pounds, with a 95% confidence interval of [292.75 pounds, 307.25 pounds].
She may report the findings as follows:
According to a formal study, this population's turtles weigh an average of 300 pounds, 95% confidence interval [292.75, 307.25].
Let's say a scientist wishes to calculate the variation in mean weight between two turtle populations.
The mean difference, with a 90% confidence interval of [-3.07 pounds, 23.07 pounds], is 10 pounds after she gathers data for both turtle populations.
She may report the findings as follows:
According to formal research, there is an average weight difference of 10 pounds, 90% CI [-3.07, 23.07], between the two groups of turtles.

Caution When Using Confidence Intervals
A common misreading of confidence intervals is that the 'actual value' of your estimate must lie inside the interval. That is not the case. Because the confidence interval is based on a sample rather than the entire population, it cannot tell you how probable it is that you have found the real value of your statistical estimate. It can only tell you what range of values to expect if you repeat your sampling or conduct your experiment in the same manner.

Misconception About Confidence Intervals
Since a confidence interval is not a probability, it is incorrect to state that there is a 95% chance that a particular 95% confidence interval will include the actual value of the estimated parameter.

How Do You Interpret P-Values And Confidence Intervals?
Statistical tests are used in confirmatory (evidential) research to determine whether null hypotheses should be rejected. The outcome of such a statistical test is the p-value, a probability that indicates the strength of the evidence against the null hypothesis. Low p-values correspond to strong evidence. The results are deemed "statistically significant" if the p-value falls below a certain threshold.

Confidence Interval Example
If you compute a 95% confidence interval around the mean proportion of female infants born each year using a random sample of newborns, you may find a lower bound of 0.48 and an upper bound of 0.56. These are the lower and upper limits of the confidence interval, at a confidence level of 95%.

Conclusion
In this confidence interval in statistics tutorial, you have learned the importance of confidence intervals and the formula to calculate the same. The confidence interval tells you the range of values you can expect if you re-do the experiment in the same way.