STATISTICS: GLOSSARY
- Data: The facts and figures that are collected, analyzed, presented,
and interpreted. Data may be numeric or nonnumeric.
- Data Set: All the data collected in a particular study.
- Elements: The entities on which data are collected.
- Variable: A characteristic of interest for the elements.
- Observation: The set of measurements or data obtained for a single
element.
- Nominal Scale: A scale of measurement that uses a label or category to
define an attribute of an element. Nominal data may be recorded with a
nonnumeric description or with a numeric code.
- Ordinal Scale: A scale of measurement that has the properties of a
nominal scale and can be used to rank or order the observations. Ordinal data
may be recorded with a nonnumeric description or with a numeric code.
- Interval Scale: A scale of measurement that has the properties of an
ordinal scale and the interval between observations is expressed in terms of a
fixed unit of measure. Interval data are always numeric.
- Ratio Scale: A scale of measurement that has the properties of an
interval scale and the ratio of observations is meaningful. Ratio data are
always numeric.
- Qualitative Data: Data obtained with a nominal or ordinal scale of
measurement. Qualitative data may be recorded with a nonnumeric description
or with a numeric code.
- Quantitative Data: Data obtained with an interval or ratio scale of
measurement. Quantitative data are always numeric and indicate how much or how
many for the variable of interest.
- Descriptive Statistics: Tabular, graphical, and numerical methods used
to summarize data.
- Population: The collection of all elements of interest in a particular
study.
- Sample: A subset of the population.
- Statistical Inference: The process of using data obtained from a sample
to make estimates or test claims about the characteristics of a
population.
- Qualitative Data: Data that provide labels or names for categories of
like items. These data are provided by either a nominal or ordinal scale of
measurement.
- Quantitative Data: Data that indicate how much or how many. These data
are provided by either an interval or ratio scale of measurement.
- Frequency Distribution: A tabular summary of a set of data showing the
frequency (or number) of items in each of several nonoverlapping classes.
- Relative Frequency Distribution: A tabular summary of a set of data
showing the relative frequency (that is, the fraction or proportion of the
total number of items) in each of several nonoverlapping classes.
- Bar Graph: A graphical device for depicting the information presented
in a frequency distribution or relative frequency distribution of qualitative
data.
- Pie Chart: A pictorial device for presenting qualitative data summaries
based upon subdividing a circle into sectors that correspond to the relative
frequency for each class.
- Histogram: A graphical presentation of a frequency distribution or
relative frequency distribution of quantitative data constructed by placing
the class intervals on the horizontal axis and the frequencies or relative
frequencies on the vertical axis.
- Cumulative Frequency Distribution: A tabular summary of a set of
quantitative data showing the number of items having values less than or equal
to the upper class limit of each class.
- Cumulative Relative Frequency Distribution: A tabular summary of a set
of quantitative data showing the fraction or proportion of the items having
values less than or equal to the upper class limit of each class.
- Class Midpoint: The point in each class that is halfway between the
lower and upper class limits.
- Exploratory Data Analysis: The use of simple arithmetic and
easy-to-draw pictures to present data more effectively.
- Population Parameter: A numerical value used as a summary measure for a
population of data (e.g., the population mean, μ, the population variance, σ²,
and the population standard deviation, σ).
- Sample Statistic: A numerical value used as a summary measure for a
sample (e.g., the sample mean, x̄, the sample variance, s², and the sample
standard deviation, s).
- Mean: A measure of central location for a data set. It is computed by
summing all the data values and dividing by the number of items.
- Trimmed Mean: The mean of the data remaining after a percent of the
smallest and a percent of the largest items have been removed. The purpose of
a trimmed mean is to provide a measure of central location that has eliminated
the effect of extremely large and extremely small data values.
- Median: A measure of central location. It is the value which splits the
data into two equal groups - one with values greater than or equal to
the median, and one with values less than or equal to the median.
- Mode: A measure of location, defined as the most frequently occurring
data value.
- Percentile: A value such that at least p percent of the items are less
than or equal to this value and at least (100 - p) percent of the items are
greater than or equal to this value. The 50th percentile is the median.
- Quartiles: The 25th, 50th, and 75th percentiles of the data, referred to
as the first quartile, the second quartile (median), and third quartile,
respectively. The quartiles can be used to divide the data into four parts,
with each part containing approximately 25% of the data.
- Hinges: The value of the lower hinge is approximately the first
quartile, or 25th percentile. The value of the upper hinge is approximately
the third quartile, or 75th percentile. The values of the hinges and quartiles
may differ slightly due to differing computational conventions.
- Range: A measure of dispersion, defined to be the difference between
the largest and smallest data values.
- Interquartile Range: A measure of dispersion, defined to be the
difference between the third and first quartiles.
- Variance: A measure of dispersion for a data set, found by summing the
squared deviations of the data values about the mean and then dividing the
total by N if the data are from a population or by n - 1 if the data
are from a sample.
- Standard Deviation: A measure of dispersion for a data set, found by
taking the positive square root of the variance.
- Coefficient of Variation: A measure of relative dispersion for a data
set, found by dividing the standard deviation by the mean and multiplying by
100.
- z-score: For each data item, a value found by dividing the deviation
about the mean (x_j - μ) by the standard deviation σ. A z-score is
referred to as a standardized value and denotes the number of standard
deviations a data value x_j is from the mean.
- Chebyshev's Theorem: A theorem applying to any data set that can be
used to make statements about the percentage of items that must be within a
specified number of standard deviations of the mean.
- Empirical Rule: A rule that states the percentages of items that are
within one, two, and three standard deviations from the mean for mound-shaped,
or bell-shaped, distributions.
- Outlier: An unusually small or unusually large data value.
- 5-Number Summary: An exploratory data analysis technique that uses the
following 5 numbers to summarize the data set: smallest value, first quartile,
median, third quartile, and largest value.
- Grouped Data: Data available in class intervals as summarized by a
frequency distribution. Individual values of the original data are not
recorded.
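
The numerical measures defined above can be sketched in a few lines of Python. This is only an illustration: the helper name `describe` and the five data values are assumptions, and the data are treated as a sample, so the variance divides by n - 1.

```python
import math

def describe(values):
    """Return basic descriptive measures for a sample (hypothetical helper)."""
    data = sorted(values)
    n = len(data)
    mean = sum(data) / n
    mid = n // 2
    # Median: middle value, or the average of the two middle values
    median = data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2
    # Sample variance: sum of squared deviations about the mean over n - 1
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    std_dev = math.sqrt(variance)
    return {
        "mean": mean,
        "median": median,
        "range": data[-1] - data[0],
        "variance": variance,
        "std_dev": std_dev,
        # Coefficient of variation, expressed as a percentage
        "coef_var": std_dev / mean * 100,
        # z-score: deviation about the mean divided by the standard deviation
        "z_scores": [(x - mean) / std_dev for x in values],
    }

# Made-up sample values, used only for illustration
summary = describe([46, 54, 42, 46, 32])
print(summary["mean"], summary["variance"], summary["std_dev"])  # 44.0 64.0 8.0
```

For population data the divisor would be N rather than n - 1, as the Variance entry notes.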
- Probability: A numerical measure of the likelihood that an event will
occur.
- Experiment: Any process that generates well-defined outcomes.
- Sample Space: The set of all possible sample points (experimental
outcomes).
- Sample Points: The individual outcomes of an experiment.
- Tree Diagram: A graphical device helpful in defining sample points of
an experiment involving multiple steps.
- Classical Method: A method of assigning probabilities which assumes that
the experimental outcomes are equally likely.
- Relative Frequency Method: A method of assigning probabilities based
upon experimentation or historical data.
- Subjective Method: A method of assigning probabilities based upon
judgment.
- Event: A collection of sample points.
- Complement of Event A: The event containing all sample points that are
not in A.
- Venn Diagram: A graphical device for representing symbolically the
sample space and operations involving events.
- Union of Events A and B: The event containing all sample points that
are in A, in B, or in both. The union is denoted A ∪ B.
- Intersection of A and B: The event containing all sample points that
are in both A and B. The intersection is denoted A ∩ B.
- Addition Law: A probability law used to compute the probability of a
union, P(A ∪ B). It is P(A ∪ B) = P(A) + P(B) - P(A ∩ B). For mutually
exclusive events, since P(A ∩ B) = 0, it reduces to P(A ∪ B) = P(A) +
P(B).
- Mutually Exclusive Events: Events that have no sample points in common;
that is, A ∩ B is empty and P(A ∩ B) = 0.
- Conditional Probability: The probability of an event given that another
event has occurred. The conditional probability of A given B is
P(A | B) = P(A ∩ B)/P(B).
- Independent Events: Two events A and B where P(A | B) = P(A) or
P(B | A) = P(B); that is, the events have no influence on each other.
- Multiplication Law: A probability law used to compute the probability
of an intersection, P(A ∩ B). It is P(A ∩ B) = P(A)P(B | A) or P(A ∩ B) =
P(B)P(A | B). For independent events it reduces to P(A ∩ B) = P(A)P(B).
- Prior Probabilities: Initial estimates of the probabilities of
events.
- Posterior Probabilities: Revised probabilities of events based on
additional information.
- Bayes' Theorem: A method used to compute posterior probabilities.
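
The probability laws above can be checked numerically. The sketch below uses made-up probabilities and a hypothetical helper, `bayes_posterior`, which assumes the events A_i are mutually exclusive and together cover the sample space; the glossary does not state the Bayes' theorem formula, so the standard form P(A_i | B) = P(A_i)P(B | A_i) / Σ P(A_j)P(B | A_j) is assumed here.

```python
from fractions import Fraction

# Hypothetical probabilities chosen only for illustration
p_a = Fraction(3, 10)        # P(A)
p_b = Fraction(2, 5)         # P(B)
p_a_and_b = Fraction(3, 25)  # P(A ∩ B)

# Addition law: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A | B) = P(A ∩ B) / P(B)
p_a_given_b = p_a_and_b / p_b

# A and B are independent when P(A | B) = P(A)
independent = p_a_given_b == p_a

def bayes_posterior(priors, likelihoods):
    """Posterior P(A_i | B) from priors P(A_i) and likelihoods P(B | A_i)."""
    # Multiplication law gives each joint probability P(A_i ∩ B)
    joint = [prior * like for prior, like in zip(priors, likelihoods)]
    total = sum(joint)  # P(B)
    return [j / total for j in joint]

posteriors = bayes_posterior(
    [Fraction(65, 100), Fraction(35, 100)],  # prior probabilities
    [Fraction(2, 100), Fraction(5, 100)],    # P(B | A_1), P(B | A_2)
)
print(p_a_or_b, p_a_given_b, independent)  # 29/50 3/10 True
print(posteriors)                          # [Fraction(26, 61), Fraction(35, 61)]
```

Exact fractions are used so that the independence check is not affected by floating-point rounding.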
- Random Variable: A numerical description of the outcome of an
experiment.
- Discrete Random Variable: A random variable that can assume only a
finite or infinite sequence of values.
- Continuous Random Variable: A random variable that may assume all
values in an interval or collection of intervals.
- Probability Distribution: A description of how the probabilities are
distributed over the values the random variable can take on.
- Probability Function: A function, denoted by f(x), that for a discrete
random variable, provides the probability that x takes on a particular
value.
- Expected Value: A measure of the mean, or central location, of a
random variable.
- Variance: A measure of the dispersion, or variability, of a random
variable.
- Standard Deviation: The positive square root of the variance.
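- Binomial Experiment: A probability experiment possessing four
properties: a fixed number n of identical trials, two possible outcomes
(success or failure) on each trial, a probability of success that is the same
on every trial, and independent trials.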
- Binomial Probability Distribution: A probability distribution showing
the probability of x successes in n trials of a binomial experiment.
- Binomial Probability Function: The function used to compute
probabilities in a binomial experiment.
- Poisson Probability Distribution: A probability distribution showing
the probability of x occurrences of an event over a specified interval of time
or space.
- Poisson Probability Function: The function used to compute Poisson
probabilities.
- Hypergeometric Probability Function: The function used to compute the
probability of x successes in n trials when the trials are dependent.
- Uniform Probability Distribution: A continuous probability distribution
where the probability that the random variable will assume a value in any
interval of equal length is the same for each interval.
- Probability Density Function: The function that defines the probability
distribution of a continuous random variable.
- Normal Probability Distribution: A continuous probability distribution.
Its probability density function is bell shaped and determined by the mean and
standard deviation.
- Standard Normal Distribution: A normal distribution with a mean of 0
and a standard deviation of 1.
- Continuity Correction Factor: A value of 0.5 that is added and/or
subtracted from a value of x when the continuous normal probability
distribution is used to approximate the discrete binomial probability
distribution.
- Exponential Probability Distribution: A continuous probability
distribution that is useful in computing probabilities for the time, or space,
between occurrences of an event.
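
The glossary names the binomial and Poisson probability functions without writing them out. A minimal sketch, assuming the standard forms f(x) = C(n, x) p^x (1 - p)^(n - x) and f(x) = μ^x e^(-μ) / x!, might look like this; the function names and the values n = 10, p = 0.3 are illustrative assumptions.

```python
import math

def binomial_pmf(x, n, p):
    """Probability of x successes in n trials of a binomial experiment."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, mu):
    """Probability of x occurrences when the mean number of occurrences is mu."""
    return mu**x * math.exp(-mu) / math.factorial(x)

# Illustrative values only
n, p = 10, 0.3

# Expected value and variance computed directly from the probability function
expected = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
variance = sum((x - expected) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))

print(binomial_pmf(2, 3, 0.5))  # 0.375
print(round(expected, 6), round(variance, 6))  # 3.0 2.1
```

The computed expected value and variance agree with the binomial shortcuts np and np(1 - p) up to rounding.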
- Parameter: A population characteristic, such as a population mean, a
population standard deviation, a population proportion, and so on.
- Simple Random Sampling (Finite Population): A sample selected such that
each possible sample of size n has the same probability of being selected.
- Simple Random Sampling (Infinite Population): A sample selected such that
each element comes from the same population and the successive elements are
selected independently.
- Sampling Without Replacement: Once an item from the population has been
included in the sample it is removed from further consideration and thus
cannot be selected a second time.
- Sampling With Replacement: As each item is selected for the sample, it
is returned to the population. It is possible that a previously selected item
may be selected again and therefore appear in the sample more than once.
- Sample Statistic: A sample characteristic, such as a sample mean, x̄, a
sample standard deviation, s, a sample proportion, p̄, and so on. The value of
the sample statistic is used to estimate the value of the population
parameter.
- Sampling Distribution: A probability distribution consisting of all
possible values of a sample statistic.
- Point Estimate: A single numerical value used as an estimate of a
population parameter.
- Point Estimator: The sample statistic, such as x̄, s, p̄, etc., that
provides the point estimate of the population parameter.
- Finite Population Correction Factor: The term [(N - n)/(N - 1)]^(1/2) that
is used in the formulas for standard deviations whenever a finite population,
rather than an infinite population, is being sampled. The generally accepted
rule of thumb is to ignore the finite population correction factor whenever
n/N < .05.
- Standard Error: The standard deviation of a point estimator.
- Central Limit Theorem: A theorem that allows us to use the normal
probability distribution to approximate the sampling distribution of x̄ and p̄
whenever the sample size is large.
- Unbiasedness: A property of a point estimator that occurs whenever the
expected value of the point estimator is equal to the population parameter it
estimates.
- Relative Efficiency: Given two unbiased point estimators of the same
population parameter, the point estimator with the smaller variance is said to
have greater relative efficiency than the other.
- Consistency: A property of a point estimator that occurs whenever
larger sample sizes tend to provide point estimates closer to the population
parameter.
- Probability Sample: A sample selected such that each element in the
population has a known probability of being included in the sample. Simple
random sampling, stratified simple random sampling, cluster sampling, and
systematic sampling are probability samples.
- Nonprobability Sample: A sample selected such that the probability of
each element being included in the sample is unknown. Convenience and judgment
samples are nonprobability samples.
- Interval Estimate: An estimate of a population parameter that provides
an interval of values believed to contain the value of the parameter.
- Sampling Error: The magnitude of the difference between the value of an
unbiased point estimator and the true population parameter. In the case of
the mean, the sampling error is |x̄ - μ|.
- Precision: A probability statement about the sampling error.
- Confidence Level: The confidence associated with an interval estimate.
For example, if an interval estimation procedure provides intervals such that
95% of the intervals developed will include the population parameter, an
interval estimate is said to be constructed at the 95% confidence level. Note
that 0.95 is referred to as the confidence coefficient.
- t distribution: A family of probability distributions which can be used
to develop interval estimates of a population mean whenever the population
standard deviation is unknown and the population has a normal or near-normal
probability distribution.
- Degrees of Freedom: A parameter of the t distribution. When the t
distribution is used in the computation of an interval estimate of a
population mean, the appropriate t distribution has n - 1 degrees of freedom,
where n is the size of the simple random sample.
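
As a rough illustration of the standard error, the finite population correction factor, and an interval estimate, the sketch below assumes the population standard deviation σ is known, so it uses the standard normal distribution rather than the t distribution. The function names and all numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist

def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean, with an optional finite population size N."""
    se = sigma / math.sqrt(n)
    # Rule of thumb from the glossary: ignore the correction when n/N < .05
    if N is not None and n / N >= 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

def z_interval(x_bar, sigma, n, confidence=0.95, N=None):
    """Interval estimate of a population mean when sigma is assumed known."""
    # z value leaving (1 - confidence)/2 in the upper tail of the standard normal
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * standard_error_mean(sigma, n, N)
    return x_bar - margin, x_bar + margin

# Illustrative numbers only
low, high = z_interval(x_bar=82, sigma=20, n=100)
print(round(low, 2), round(high, 2))  # 78.08 85.92
```

When σ is unknown and estimated by s, the z value would be replaced by a t value with n - 1 degrees of freedom, as described in the t distribution entry.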
- Null Hypothesis: The hypothesis tentatively assumed true in the
hypothesis-testing procedure.
- Alternative Hypothesis: The hypothesis concluded to be true if the null
hypothesis is rejected.
- Type I error: The error of rejecting H₀ when it is true.
- Type II error: The error of accepting H₀ when it is false.
- Critical Value: A value that is compared with the test statistic to
determine whether or not H₀ should be rejected.
- Level of Significance: The maximum allowable probability of a Type I
error.
- One-tailed Test: A hypothesis test in which rejection of the null
hypothesis occurs for values of the test statistic in one tail of the sampling
distribution.
- Two-tailed Test: A hypothesis test in which rejection of the null
hypothesis occurs for values of the test statistic in either tail of the
sampling distribution.
- p-value: The probability, when the null hypothesis is true, of
obtaining a sample result that is at least as unlikely as what is observed. It is
often called the observed level of significance.
- Power Curve: A graph of the probability of rejecting H₀ for all
possible values of the population parameter not satisfying the null
hypothesis. The power curve provides the probability of correctly rejecting
the null hypothesis.
- Pooled Variance: An estimate of the variance of a population based on
the combination of two (or more) sample results. The pooled variance estimate
is appropriate whenever the variances of two (or more) populations are assumed
equal.
- Independent Samples: Samples selected from two (or more) populations
where the elements making up one sample are chosen independently of the
elements making up the other sample(s).
- Matched Samples: Samples where each data value in one sample is matched
with a corresponding data value in the other sample.
- Analysis of Variance (ANOVA) Procedure: A statistical approach for
determining whether or not the means of several different populations are
equal.
- Factor: Another word for the variable of interest in an ANOVA
procedure.
- Treatment: Different levels of a factor.
- Single-factor Experiment: An experiment involving only one factor with
k populations or treatments.
- Experimental Units: The objects of interest in the experiment.
- Completely Randomized Design: An experimental design where the
treatments are randomly assigned to the experimental units.
- Mean Square: The sum of squares divided by its corresponding degrees of
freedom. This quantity is used in the F ratio to determine if significant
differences among means exist or not.
- ANOVA Table: A table used to summarize the analysis of variance
computations and results. It contains columns showing the source of variation,
the degrees of freedom, the sum of squares, the mean squares, and the F
values.
- Partitioning: The process of allocating the total sum of squares and
degrees of freedom into the various components.
- Replication: The number of times each experimental condition is
repeated in an experiment. It is the sample size associated with each
treatment combination.
- Interaction: The response produced when the levels of one factor
interact with the levels of another factor in influencing the response
variable.
- Note: The definitions here are all stated with the understanding that
simple linear regression and correlation are being considered.
- Dependent Variable: The variable that is being predicted or explained.
It is denoted by y in the regression equation.
- Independent Variable: The variable that is doing the predicting or
explaining. It is denoted by x in the regression equation.
- Simple Linear Regression: The simplest kind of regression, involving
only two variables that are related approximately by a straight line.
- Regression Equation: The mathematical equation relating the independent
variable to the expected value of the dependent variable; that is,
E(y) = β₀ + β₁x.
- Estimated Regression Equation: The estimate of the regression equation
obtained by the least squares method; i.e., ŷ = b₀ + b₁x.
- Scatter Diagram: A graph of the available data in which the independent
variable appears on the horizontal axis and the dependent variable appears on
the vertical axis.
- Least Squares Method: The approach used to develop the estimated
regression equation which minimizes the sum of squared residuals.
- Coefficient of Determination (r²): A measure of the variation explained
by the estimated regression equation. It is a measure of how well the
estimated regression equation fits the data.
- Deterministic Model: A relationship between an independent variable and
a dependent variable whereby specifying the value of the independent variable
allows one to compute exactly the value of the dependent variable.
- Probabilistic Model: A relationship between an independent variable and
a dependent variable in which specifying the value of the independent variable
is not sufficient to allow determination of the value of the dependent
variable.
- Residual: The difference between the observed value of the dependent
variable and the value predicted using the estimated regression equation; i.e.,
y_i - ŷ_i.
- Standardized Residual: The value obtained by dividing the residual by
its standard deviation.
- Sample Correlation Coefficient (r_xy): A statistical measure of the
linear association between two variables.
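
The simple linear regression terms above fit together as in the following sketch. It assumes the usual closed-form least squares solution, b₁ = Σ(x_i - x̄)(y_i - ȳ) / Σ(x_i - x̄)² and b₀ = ȳ - b₁x̄, which the glossary does not spell out; the function name and the five data points are made up for illustration.

```python
import math

def least_squares(x, y):
    """Fit y-hat = b0 + b1*x by minimizing the sum of squared residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual: observed value minus value predicted by the estimated equation
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e**2 for e in residuals)
    r_squared = 1 - sse / syy            # coefficient of determination
    r_xy = sxy / math.sqrt(sxx * syy)    # sample correlation coefficient
    return b0, b1, r_squared, r_xy, residuals

# Illustrative data only
b0, b1, r_squared, r_xy, residuals = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 4), round(b1, 4), round(r_squared, 4), round(r_xy, 4))
# 2.2 0.6 0.6 0.7746
```

In simple linear regression the squared sample correlation coefficient equals the coefficient of determination, which the output reflects.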
- Multiple Regression Model: A regression model in which more than one
independent variable is used to predict the dependent variable.
- Multiple Coefficient of Determination (R²): A measure of the goodness
of fit for the estimated regression equation.
- Adjusted Multiple Coefficient of Determination (adjusted R²): A measure of the
goodness of fit for the estimated regression equation which accounts for the
number of independent variables.
- Multicollinearity: A term used to describe the case when the
independent variables in a multiple regression model are correlated.
- General Linear Model: A model of the form y = β₀ + β₁x₁ + β₂x₂ + β₃x₃ + ε,
where each of the independent variables x_j is a function of x, the
variables for which data have been collected.
- Interaction: The joint effect of two variables acting together.
- Qualitative Variable: A variable that is not measured in terms of how
much or how many but instead is assigned values to represent categories.
- Dummy Variable: A variable that takes on the values 0 or 1 and is used
to incorporate the effects of qualitative variables in a regression model.
- Variable-selection Procedures: Computer-based methods for selecting a
subset of the potential independent variables for a regression model.
- Outlier: An observation with a residual that is far greater in
magnitude than the rest of the residual values.
- Influential Observation: An observation that has a great deal of
influence in determining the estimated regression equation.
- Leverage: A measure designed to indicate how far an observation is from
the others in terms of the values of the independent variables.
- Autocorrelation: Correlation in the errors that arises when the error
terms at successive points in time are related. First-order autocorrelation is
when e_t and e_(t-1) are related, second-order is when e_t and e_(t-2) are
related, and so on.
- Serial correlation: Same as autocorrelation.
- Durbin-Watson Test: A test to determine whether or not first-order
autocorrelation is present.
- Time Series: A set of observations measured at successive points in
time or over successive periods of time.
- Forecast: A projection or prediction of future values of a time
series.
- Trend: The long-run shift or movement in the time series observable
over several periods of data.
- Cyclical Component: The component of the time series model that results
in periodic above-trend and below-trend behavior of the time series lasting
more than 1 year.
- Seasonal Component: The component of the time series model that shows a
periodic pattern over 1 year or less.
- Irregular Component: The component of the time series model that
reflects the random variation of the actual time series values beyond what can
be explained by the trend, cyclical, and seasonal components.
- Moving Averages: A method of forecasting or smoothing a time series by
averaging each successive group of data points. The moving averages method can
be used to isolate the seasonal component of the time series.
- Mean Squared Error (MSE): One approach to measuring the accuracy of a
forecasting model. This measure is the average of the squared
differences between the forecast values and the actual time series values.
- Weighted Moving Averages: A method of forecasting or smoothing a time
series by computing a weighted average of past data values. The sum of the
weights must equal one.
- Exponential Smoothing: A forecasting technique that uses a weighted
average of past time series values in order to arrive at smoothed time series
values which can be used as forecasts.
- Smoothing Constant: A parameter of the exponential-smoothing model
which provides the weight given to the most recent time series value in the
calculation of the forecast value.
- Multiplicative Time-series Model: A model that assumes that the
separate components of the time series can be multiplied together to identify
the actual time series value. When the 4 components of trend, cyclical,
seasonal, and irregular are assumed present we obtain: Y_t = T_t × C_t × S_t × I_t.
When cyclical is not modeled we obtain: Y_t = T_t × S_t × I_t.
- Deseasonalized Time-series: A time series that has had the effect of
season removed by dividing each original time series observation by the
corresponding seasonal index.
- Causal Forecasting Method: Forecasting methods that relate a time
series to other variables that are believed to explain or cause its
behavior.
- Autoregressive Model: A time series model that uses a regression
relationship based on past time series values to predict the future time
series values.
- Delphi Approach: A qualitative forecasting method that obtains
forecasts through a group consensus.
- Scenario Writing: A qualitative forecasting method which consists of
developing a conceptual scenario of the future based upon a well-defined set
of assumptions.
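
The smoothing methods and the MSE accuracy measure defined in the time series entries can be sketched as follows. The series values, the function names, and the choice to start exponential smoothing with the first observation as the first forecast are all assumptions made for illustration; the recurrence F_(t+1) = αY_t + (1 - α)F_t is the usual form and is not written out in the glossary.

```python
def moving_average_forecasts(series, k):
    """Forecast each period as the mean of the k values that precede it."""
    return [sum(series[t - k:t]) / k for t in range(k, len(series))]

def exponential_smoothing_forecasts(series, alpha):
    """Forecasts for periods 2..n using F_(t+1) = alpha*Y_t + (1 - alpha)*F_t.

    Assumption: the forecast for period 2 is set equal to the first observation.
    """
    forecasts = [series[0]]
    for y in series[1:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

def mean_squared_error(actual, forecast):
    """Average of the squared differences between actual and forecast values."""
    return sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)

# Made-up time series, for illustration only
series = [17, 21, 19, 23, 18, 16, 20]

ma = moving_average_forecasts(series, k=3)           # forecasts for periods 4..7
es = exponential_smoothing_forecasts(series, 0.2)    # forecasts for periods 2..7

print(ma)                                    # [19.0, 21.0, 20.0, 19.0]
print(mean_squared_error(series[3:], ma))    # 10.5
print(round(mean_squared_error(series[1:], es), 2))
```

Comparing the MSE of the two methods on the same periods is one way to choose between them or to tune the smoothing constant.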