Statistics and Probability Reviewer
Classifications of Variables

1. Quantitative - numerical and can be ordered or ranked (age, height, weight, body temperature)
a) Discrete - values that can be counted
b) Continuous - assume an infinite number of values between any two specific values; obtained by measuring and often include fractions and decimals

2. Qualitative - variables that can be placed into distinct categories according to some characteristic or attribute (gender, religion, geographic location)

Measuring Variables - to establish relationships between variables; observe the variables and measure/record the observations.

Scale of measurement - measuring a variable into a set of categories and a process that classifies each individual into one category

4 Types of Measurement Scales

1. Nominal level of measurement - classifies data into mutually exclusive (non-overlapping), exhaustive categories in which no order or ranking can be imposed on the data (gender, zip code, eye color, nationality, religion)

4 Basic Sampling Techniques

1. Random sampling - subjects are selected by random numbers from calculators, computers, or tables; for a sample of size n, all possible samples of this size have an equal chance of being selected from the population.
Limitation: if the population is extremely large, it is time consuming to number and select the sample elements.

Methods for Random Sampling
a) Fish bowl - number each element of the population, place the numbers on cards in a hat or fishbowl, mix them, and select the sample by drawing the cards
b) Random numbers - number the elements of the population sequentially and then select each element by using random numbers

2. Systematic random sampling - selecting every kth subject after the first subject is selected at random from 1 through k. The advantage of systematic sampling is the ease of selecting the sample elements.

3. Stratified random sampling - dividing the population into subgroups, called strata, and randomly selecting subjects within each group; ensures representation of all population subgroups that are important to the study. Disadvantage: with many variables of interest, dividing a large population into representative subgroups requires a great deal of effort.

4. Cluster sampling - subjects are selected by using an intact group (cluster) that is representative of the population.
Advantages: a cluster sample can reduce costs, simplify fieldwork, and is convenient.
Disadvantage: members of a cluster may be more homogeneous (alike) than the population as a whole.
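A minimal Python sketch of the first three techniques, using the standard random module; the population list, strata, and sample sizes are illustrative, not from the notes:

    import random

    population = list(range(1, 101))   # 100 hypothetical population elements, numbered 1-100
    n = 10                             # desired sample size

    # 1. Simple random sampling: every sample of size n is equally likely
    srs = random.sample(population, n)

    # 2. Systematic random sampling: pick a random start from the first k elements,
    #    then take every kth element after it
    k = len(population) // n
    start = random.randint(0, k - 1)
    systematic = population[start::k]

    # 3. Stratified random sampling: split into strata, sample randomly within each stratum
    strata = {"A": population[:50], "B": population[50:]}   # two illustrative strata
    stratified = [x for group in strata.values() for x in random.sample(group, n // 2)]

    print(srs, systematic, stratified, sep="\n")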
Frequency Distribution and Graphs

Constructing a frequency distribution - most convenient method of organizing data

Frequency distribution - organization of raw data in table form, using classes and frequencies; a way of presenting a summary of the data that shows
a) the possibility of seeing patterns or relationships in the data
b) how many times each data point (observation/outcome) occurs in the data set

Components of a frequency distribution table
Class - quantitative/qualitative category each raw data value is placed into
Tally - data recorded in the sequence in which they are collected, before they are processed/ranked
Frequency - number of data values contained in a specific class

1. Qualitative variable (ordinal/nominal data)
a) Class, tally, frequency, percent
2. Quantitative variable (numerical data)
a) Class limits, class boundaries - numbers used to separate the classes so there are no gaps in the frequency distribution; tally, frequency

Basic Rules: Constructing "Class" in the Frequency Distribution
1. There should be 5-20 classes
2. Class limits should have the same decimal place value as the data
a) Class boundaries should have one additional place value and end in a 5
Lower limit - 0.5 = lower boundary
Upper limit + 0.5 = upper boundary
3. Classes must be equal in width - found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class, or the boundaries if boundaries are given. The class width can also be found by dividing the range by the number of classes
* don't subtract the limits of a single class; this gives an incorrect answer
* the researcher decides how many classes to use and the width of each class
4. Classes must be mutually exclusive - non-overlapping class limits so that data cannot be placed into two classes
5. Classes must be continuous - no gaps in the frequency distribution
6. Classes must be exhaustive - enough classes to accommodate all the data

Sturge's Rule - determines the number of classes to use in a histogram or frequency distribution table (see the sketch at the end of this section)

Reasons for constructing a frequency distribution
1. To organize the data in a meaningful, intelligible way
2. To enable the reader to determine the nature or shape of the distribution
3. To facilitate computational procedures for measures of average and spread
4. To enable the researcher to draw charts and graphs to present the data
5. To enable the reader to compare different data sets

Types of Frequency Distribution
1. Categorical Frequency Distribution - used for data that can be placed in specific categories, such as nominal- or ordinal-level data
2. Grouped Frequency Distribution - used when the range of the data is large; the data must be grouped into classes that are more than one unit in width
3. Ungrouped Frequency Distribution - used when the range of the data values is relatively small; a frequency distribution can be constructed using single data values for each class
4. Cumulative Frequency Distribution - gives the total number of values that fall below the upper boundary of each class. Values are found by adding the frequencies of the classes less than or equal to the upper class boundary of a specific class (ascending cumulative frequency)

Sample of Frequency Distribution Table

Constructing statistical charts and graphs - most useful method of presenting the data

Uses of graphs in statistics
1. Convey data to viewers in pictorial form
2. Useful in getting the audience's attention in a presentation
3. Describe/analyze a data set
4. Discuss an issue, reinforce a critical point, summarize a data set
5. Discover trends/patterns in a situation
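A minimal Python sketch of the class-construction rules and Sturge's Rule above. It assumes the usual statement of the rule, number of classes ≈ 1 + 3.322·log10(n), which the notes name but do not write out; the data values are illustrative:

    import math

    data = [12, 15, 21, 22, 25, 27, 27, 30, 33, 35, 38, 41, 44, 45, 48]

    # Sturge's Rule (assumed form): approximate number of classes for n data values
    n = len(data)
    num_classes = round(1 + 3.322 * math.log10(n))

    # Class width: range divided by the number of classes, rounded up so classes are exhaustive
    width = math.ceil((max(data) - min(data)) / num_classes)

    # Build mutually exclusive, continuous, equal-width classes
    lower = min(data)
    for i in range(num_classes):
        low = lower + i * width
        high = low + width - 1                      # class limits with the same place value as the data
        freq = sum(low <= x <= high for x in data)  # frequency: count of values in the class
        print(f"{low}-{high}  boundaries {low - 0.5}-{high + 0.5}  frequency {freq}")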
Measures of Central Tendency

Central tendency - descriptive statistical measure that determines a single value that best describes the center and represents the entire distribution; condenses a large set of data into a single value
- goal is to identify the single value that is the best representative for the entire set of data

Statistic - a characteristic or measure obtained by using the data values from a sample
Parameter - a characteristic or measure obtained by using all the data values from a specific population

1. Mean - most commonly used measure of central tendency; balance point of the distribution; sum of the values divided by the total number of values
2. Median - midpoint of the list when the scores in a distribution are listed from smallest to largest; a more appropriate measure of central tendency than the mean when the distribution is skewed; divides the scores so that 50% of the scores in the distribution have values that are equal to or less than the median
3. Mode - most frequently occurring category or score in the distribution or in the data set; peak or high point of the distribution
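A minimal Python sketch of the three measures using the standard statistics module; the score list is illustrative:

    import statistics

    scores = [4, 7, 7, 8, 9, 10, 12]

    mean = statistics.mean(scores)      # sum of values / number of values -> 8.14...
    median = statistics.median(scores)  # midpoint of the ordered list -> 8
    mode = statistics.mode(scores)      # most frequently occurring value -> 7

    print(mean, median, mode)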
Data Distribution

3. Negatively Skewed or Left-skewed Distribution - the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left. The mean is to the left of the median, and the mode is to the right of the median.
*When a distribution is extremely skewed, the value of the mean will be pulled toward the tail

Central Tendency and Variability - two primary values that are used to describe a distribution of scores
Central tendency - the central point of the distribution
Variability - descriptive statistic that describes how the scores are scattered around that central point; determined by measuring distance
- inferential statistic that describes how accurately any individual score or sample represents the entire population
Measures of Variation

1. Range - total distance covered by the distribution, from the highest score to the lowest score
R = highest value - lowest value

2. Variance (σ² or s²) - the average of the squared distances of each value from the mean
Population: σ² = Σ(X - μ)² / N
Sample: s² = Σ(X - X̄)² / (n - 1)
where X = individual value, X̄ = sample mean, μ = population mean, n = sample size, N = population size

3. Standard Deviation (σ or s) - the standard distance between a score and the mean; square root of the variance
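A minimal Python sketch of the population and sample formulas above, written out so the divisors N and n - 1 are visible; the data are illustrative:

    import math

    x = [5, 8, 9, 12, 16]
    n = len(x)
    mean = sum(x) / n

    # Population variance: divide the sum of squared deviations by N
    pop_var = sum((v - mean) ** 2 for v in x) / n          # 14.0
    # Sample variance: divide by n - 1 instead
    samp_var = sum((v - mean) ** 2 for v in x) / (n - 1)   # 17.5

    # Standard deviation is the square root of the variance
    print(pop_var, math.sqrt(pop_var))
    print(samp_var, math.sqrt(samp_var))

statistics.pvariance(x) and statistics.variance(x) return the same two values.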
Uses of Variance and Standard Deviation
1. To determine the spread of the data.
2. To determine the consistency of a variable
3. To determine the number of data values that fall within
a specified interval in a distribution
4. Used quite often in inferential statistics.
Coefficient of Variation (CVar) - a statistic that allows standard deviations to be compared when the units are different; the standard deviation divided by the mean, with the result expressed as a percentage
For samples: CVar = (s / X̄) × 100%
For populations: CVar = (σ / μ) × 100%
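A minimal Python sketch comparing the relative spread of two samples measured in different units, following the sample formula above; the numbers are illustrative:

    import statistics

    heights_cm = [160, 165, 170, 175, 180]
    weights_kg = [55, 60, 70, 80, 95]

    def cvar(sample):
        # sample standard deviation divided by the sample mean, as a percentage
        return statistics.stdev(sample) / statistics.mean(sample) * 100

    # weights vary relatively more than heights, even though the units differ
    print(f"heights CVar = {cvar(heights_cm):.1f}%")
    print(f"weights CVar = {cvar(weights_kg):.1f}%")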
Measures of Position - used to locate the relative position of a data value in the data set

1. Standard score (z-score) - tells how many standard deviations a data value is above or below the mean for a specific distribution of values
a) If the z score is 0, the data value is the same as the mean
b) If the z score is positive (+), the score is above the mean
c) If the z score is negative (-), the score is below the mean
When all data for a variable are transformed into z scores, the resulting distribution will have a mean of 0 and a standard deviation of 1
z = (value - mean) / sd
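A minimal Python sketch of the z-score transformation on illustrative exam scores, checking that the transformed values have mean 0 and standard deviation 1:

    import statistics

    scores = [60, 70, 75, 80, 90]
    mean = statistics.mean(scores)       # 75
    sd = statistics.pstdev(scores)       # population standard deviation, 10

    z_scores = [(x - mean) / sd for x in scores]

    print(z_scores)                      # -1.5, -0.5, 0.0, 0.5, 1.5
    print(statistics.mean(z_scores))     # 0
    print(statistics.pstdev(z_scores))   # 1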
2. Percentile - divides the data set into 100 equal groups
percentile = [(# of values below X) + 0.5] / (total # of values) × 100%

3. Quartiles - divide the distribution into four groups, separated by Q1, Q2, Q3
Q1 is the same as the 25th percentile
Q2 is the same as the 50th percentile, or the median
Q3 corresponds to the 75th percentile

4. Interquartile Range (IQR) - the difference between Q3 and Q1 (IQR = Q3 - Q1) and the range of the middle 50% of the data; used to identify outliers and as a measure of variability in exploratory data analysis (EDA)

5. Deciles - divide the distribution into 10 groups, denoted by D1, D2, etc. Deciles can be found by using the formulas given for percentiles

Relationships Among Percentiles, Deciles, and Quartiles
• Deciles are denoted by D1, D2, D3, ... and they correspond to P10, P20, P30, ...
• Quartiles are denoted by Q1, Q2, Q3 and they correspond to P25, P50, P75
• The median is the same as P50 or Q2 or D5

Exploratory (Descriptive) Data Analysis, EDA - examining data to find out what information can be discovered about the data, such as the center and the spread

Stem-and-Leaf Plot - a data plot that uses part of the data value as the stem and part of the data value as the leaf to form groups or classes: leading digit (stem), trailing digit (leaf), frequency

Boxplot (Box and Whisker Plot) - a graph of a data set obtained by drawing: the lowest value of the data set (minimum), Q1, the median, Q3, and the highest value of the data set (maximum)

Comparing Boxplots for Two or More Data Sets - compare the location of the medians. To compare the variability, use the interquartile range or the length of the boxes.
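A minimal Python sketch of the five-number summary and IQR that a boxplot is drawn from; the data are illustrative, and different packages use slightly different quartile conventions:

    import statistics

    data = [5, 7, 8, 9, 11, 12, 15, 18, 22, 40]

    q1, q2, q3 = statistics.quantiles(data, n=4)   # quartiles Q1, Q2 (median), Q3
    iqr = q3 - q1                                  # range of the middle 50% of the data

    five_number = (min(data), q1, q2, q3, max(data))
    print("five-number summary:", five_number)
    print("IQR:", iqr)

    # A common EDA rule of thumb: flag values beyond 1.5 * IQR from Q1 or Q3 as possible outliers
    outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
    print("possible outliers:", outliers)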
Probability and Counting Rules

Probability - the chance of an event occurring

Basic Concepts of Probability
1. Probability experiment - a chance process that generates a set of data or well-defined results called outcomes
2. Outcome - the result of a single trial of a probability experiment
3. Sample space (S) - the set of all possible outcomes of a statistical experiment

Tree Diagram - used to determine all possible outcomes of a probability experiment

Classifications of Events

Event (E) - consists of a set of outcomes of a probability experiment
1. Independent - the first event does not affect the probability of the next event occurring
2. Dependent - the probability of the second event occurring depends on the first event
3. Complementary event (Ē) - the set of outcomes in the sample space that are not included in the outcomes of event E; E and Ē are mutually exclusive
P(Ē) = 1 - P(E)    P(E) = 1 - P(Ē)    P(E) + P(Ē) = 1

Three Basic Interpretations of Probability

1. Classical Probability - relies on the sample space; assumes all outcomes are equally likely to occur; actual performance of the experiment is not necessary; outcomes are obtained by observation and tree diagram
P(E) = (# of outcomes in E) / (total # of outcomes in the sample space)
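A minimal sketch of classical probability for one roll of a fair die; the sample space and event are illustrative:

    from fractions import Fraction

    sample_space = {1, 2, 3, 4, 5, 6}                       # equally likely outcomes of one roll
    event = {x for x in sample_space if x % 2 == 0}         # E = "roll an even number"

    p_e = Fraction(len(event), len(sample_space))           # # of outcomes in E / # in sample space
    print(p_e)            # 1/2
    print(1 - p_e)        # P(complement of E) = 1 - P(E)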
*Probability values range from 0 to 1
*When the probability is near 0, the occurrence of the event is highly unlikely
*When the probability is near 0.5, there is a 50-50 chance the event will occur
*When the probability is near 1, the event is likely to occur
*When the probability of an event or of its complement is known, the other can be found by subtracting that probability from 1

1. Addition Rules
a) Mutually Exclusive Events - when two events A and B are mutually exclusive, P(A or B) = P(A) + P(B)
b) Non-mutually Exclusive Events - if A and B are not mutually exclusive, P(A or B) = P(A) + P(B) - P(A and B)

2. Multiplication Rule and Conditional Probability
a) Independent Events - the probability of both occurring is P(A and B) = P(A) x P(B)
b) Dependent Events - conditional probability P(B|A); the probability of both occurring is P(A and B) = P(A) x P(B|A)

Conditional Probability
The probability that event B occurs given that event A has already occurred:
P(B|A) = P(A and B) / P(A)

Determination of the Number of Outcomes of Events
1. Fundamental Counting Rule - multiply: k1 x k2 x k3 x ... x kn
2. Permutation - an arrangement of n objects in a specific order
Permutation Rule - the number of permutations of n objects taking r objects at a time; order is important
nPr = n! / (n - r)!    where n! = n factorial
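A minimal sketch of the two counting rules above, using math.perm (Python 3.8+) for nPr; the numbers are illustrative:

    import math

    # Fundamental counting rule: 3 shirts x 4 pants x 2 shoes
    outfits = 3 * 4 * 2                  # 24

    # Permutation rule: arrange 3 of 10 books on a shelf, order matters
    n, r = 10, 3
    nPr = math.perm(n, r)                # same as math.factorial(n) // math.factorial(n - r)
    print(outfits, nPr)                  # 24 720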
Random Variables
1. Discrete Random Variables - a finite or countable number of values (0, 1, 2, …)
2. Continuous Random Variables - infinitely many values associated with measurements on a continuous scale, with no gaps or interruptions (5, 5.1, 6.2, …)

Requirements for a Probability Distribution
1. ΣP(x) = 1, where x is a discrete variable and P(x) is the probability of x
2. 0 ≤ P(x) ≤ 1 for every value of x

Mean of a Probability Distribution - the expected value; the typical value that represents the central location of a probability distribution
μ = Σ[x · P(x)]
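A minimal sketch of the mean of a discrete probability distribution, using the number of heads in two fair coin tosses as an illustrative distribution; it also checks the two requirements listed above:

    # x values and their probabilities: number of heads in two fair coin tosses
    dist = {0: 0.25, 1: 0.50, 2: 0.25}

    assert abs(sum(dist.values()) - 1) < 1e-9          # requirement: probabilities sum to 1
    assert all(0 <= p <= 1 for p in dist.values())     # requirement: each P(x) is between 0 and 1

    mu = sum(x * p for x, p in dist.items())           # mean = sum of x * P(x)
    print(mu)                                          # 1.0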
Variance and Standard Deviation of a Probability Distribution - measure the amount of spread in a distribution
σ² = Σ[(x - μ)² · P(x)]

Binomial Distribution - with parameters n and p, the discrete probability distribution of the # of successes in a sequence of n independent experiments

4 Properties of a Binomial Distribution
1. Fixed number of trials (n)
2. Two outcomes in a trial, success or failure
3. Trials are independent
4. Probability of success p remains constant

General Formula
X ~ B(n, p)
P(X = r) = nCr · p^r · q^(n-r)
X = random variable
n = # of trials
r = # of successes
p = probability of success
q = probability of failure (q = 1 - p)

Mean and Variance
For X ~ B(n, p):
mean: μ = E(X) = np
variance: σ² = Var(X) = npq
where q = 1 - p

Mode of a binomial B(n, p) distribution
⌊(n+1)p⌋ (the floor of (n+1)p) if (n+1)p is 0 or a noninteger
(n+1)p and (n+1)p - 1 if (n+1)p ∈ {1, ..., n}
n if (n+1)p = n + 1
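A minimal sketch of the binomial formula above, with math.comb standing in for nCr; n, p, and r are illustrative:

    import math

    n, p, r = 10, 0.3, 4
    q = 1 - p

    # P(X = r) = nCr * p^r * q^(n - r)
    p_r = math.comb(n, r) * p**r * q**(n - r)

    mean = n * p          # E(X) = np
    variance = n * p * q  # Var(X) = npq

    print(round(p_r, 4), mean, variance)   # 0.2001 3.0 2.1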
Hypergeometric Random Variable - the number X of successes of a hypergeometric experiment
Probability mass function (pmf):
P(X = k) = [C(K, k) · C(N - K, n - k)] / C(N, n)
where N = population size
K = # of success states in the population
n = # of draws
k = # of observed successes
C(a, b) is a binomial coefficient ("a choose b")
The pmf is positive when max(0, n + K - N) ≤ k ≤ min(K, n)
The pmf satisfies a recurrence relation with initial condition P(X = 0) = C(N - K, n) / C(N, n)

Multinomial Distribution - probability that, in n independent trials, outcome 1 occurs n1 times, outcome 2 occurs n2 times, ..., and outcome k occurs nk times:
P = [n! / (n1! · n2! · ... · nk!)] · p1^n1 · p2^n2 · ... · pk^nk
where P = probability
n = total # of events (trials)
n1 = # of times outcome 1 occurs
n2 = # of times outcome 2 occurs
nk = # of times outcome k occurs
p1 = probability of outcome 1
p2 = probability of outcome 2
pk = probability of outcome k
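A minimal sketch of both pmfs above using math.comb and math.factorial; the population counts, draws, and outcome probabilities are illustrative:

    import math

    # Hypergeometric: N = 50 items, K = 5 defective, n = 10 drawn, k = 2 defectives observed
    N, K, n, k = 50, 5, 10, 2
    assert max(0, n + K - N) <= k <= min(K, n)          # support condition for a positive pmf
    p_hyper = math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

    # Multinomial: 10 trials with three outcomes of probability 0.5, 0.3, 0.2,
    # observed counts n1 = 5, n2 = 3, n3 = 2
    counts = [5, 3, 2]
    probs = [0.5, 0.3, 0.2]
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)                       # n! / (n1! * n2! * n3!)
    p_multi = coef * math.prod(p**c for p, c in zip(probs, counts))

    print(round(p_hyper, 4), round(p_multi, 4))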