Statistical Glossary




Constant: A mathematical value that does not change. For example, in a linear regression equation of the general form

Y = aX + b

the letter b represents a constant. (See also variable and regression.)




Coefficient (as in "correlation coefficient" or "regression coefficient"; also as in "unstandardized coefficient" or "standardized coefficient"): a mathematical constant that is multiplied by a variable value to calculate a predicted value in a regression equation. For example, in a regression equation of the general form

Y = aX + b

a is a coefficient that is multiplied by the value of a predictor variable (X) to help determine a predicted value (Y).

In a linear regression model, unstandardized coefficients (labeled in statistical tables as "B") represent the absolute amount of change in the dependent variable that is associated with a one-unit change in an independent variable (i.e., a predictor variable), measured in units appropriate to that predictor. Standardized coefficients (labeled in statistical tables as "Beta") represent this relationship on a standardized scale, which allows us to directly compare the relative strength of influence of each of several predictors. (See also regression.)
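
The relationship between "B" and "Beta" can be sketched in a few lines of Python. This is a minimal illustration, not the YOR analysis itself: it assumes the common conversion Beta = B x (standard deviation of the predictor) / (standard deviation of the dependent variable), and the data and the name standardized_coefficient are hypothetical.

```python
import statistics

def standardized_coefficient(b_unstd, xs, ys):
    """Convert an unstandardized coefficient B to a standardized Beta.

    Assumes the usual rescaling: Beta = B * sd(x) / sd(y).
    """
    return b_unstd * statistics.stdev(xs) / statistics.stdev(ys)

# Hypothetical data (not from the YOR database): yield lies exactly on the
# line yield = 0.001 * lumens + 10, so B for lumens is 0.001.
lumens = [20000.0, 40000.0, 60000.0, 80000.0, 100000.0]
crop_yield = [0.001 * x + 10.0 for x in lumens]

beta = standardized_coefficient(0.001, lumens, crop_yield)
```

Because these made-up points fall exactly on a line, Beta comes out at 1 even though B is only 0.001. With real data the points scatter around the line and Beta for a single predictor is smaller in magnitude.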




Confidence (as in "statistical confidence" or "level of confidence"): This is a statistical term referring to the level or degree of confidence that a statistical finding is genuine and not the result of chance or error. For example, if a statistical finding has associated with it a level of confidence of .99 or 99%, then we would expect to see this result or one very close to it in about 99% of analytic samples drawn randomly from the population at large. In the YOR analyses, we are dealing with a statistical sample of approximately 140 to 160 observations (grow reports), depending on the particular analysis under consideration. So if we see a correlation between number of lumens and crop yield that is statistically significant at the 99% level of confidence, that means we would expect to see a correlation of similar magnitude in approximately 99% of samples of similar size drawn randomly from the larger population of indoor-grown marijuana crops.




Confounding: This is a statistical term for a situation in which it is difficult to tell which of two or more predictors is responsible for having a statistically significant influence on a dependent variable (the thing we are trying to predict, such as crop yield). For example, if percent of lumens from HPS lighting has a significant positive correlation with crop yield (weight), but so does use of a particular growing medium, then there is a potential statistical confounding. Simple correlations will not necessarily tell us definitively how important a predictor's influence is on a dependent variable. Instead, we must turn to a more sophisticated technique such as regression, which compares the influences of more than one predictor at the same time and allows us to better sort out what is really an important influence, and what is not. Thus, regression will help us to "unconfound" or untangle these influences.




Control (as in "statistically control"): In the context of our YOR analyses, this is a term that describes one of the valuable functions performed by regression modeling. By examining the possible influences of several candidate predictors of crop yield simultaneously, regression is able to clarify the unique influence, if any, of each individual candidate predictor. It does this by examining a predictor's influence on crop yield while "controlling for" the influence of other candidate predictors. It holds constant the other predictors' influence while examining the influence of the single predictor in question. This is analogous to the common phrase, "All other things being equal . . ."

In other words, when we statistically control for other influences, we can better assess the true impact of a single predictor on crop yield. This sort of statistical control is often reported in, say, medical studies of the effects of cigarette smoking on life expectancy or mortality rates, when public health officials say that, in their analysis, they have controlled for factors such as gender, race, alcohol use, obesity, amount of daily physical exercise, and so forth. They are "removing" or "controlling for" these influences statistically, so that they can more clearly see the unique influence of smoking on health.




Correlation: This is a statistical technique which measures the degree of association between two quantitative variables, such as number of lumens and crop yield. It indicates the degree to which the two variables may vary in parallel. If one variable increases in value as the other one does, then they are said to have a positive correlation. If one variable decreases in value as the other increases, then they are said to have a negative or inverse correlation. If there is no discernible pattern of change on one variable when the other variable changes, then there is said to be no correlation between the two variables.

For example, there tends to be a positive correlation between level of education and income: more highly educated people generally tend to earn more money. Conversely, there tends to be a negative correlation between heaviness of cigarette smoking and lifespan, with people who smoke, or who smoke more heavily, tending to have shorter lifespans. A correlation coefficient can range from -1.00 to +1.00. A zero correlation means there is no discernible linear relationship between the two variables; a correlation of +1.00 means there is a perfect positive correspondence between the two variables; a correlation of -1.00 means there is a perfect negative or inverse correspondence between the two variables.

If we square the correlation coefficient, this result tells us what percent of the variation in one variable is accounted for or predicted by the other variable. Thus, a perfect positive correlation of 1.00 means that we can predict any given value of one variable with perfect accuracy if we know the value of the other variable for that observation (grow report). In the following statistical table documenting the correlation between lumens and crop yield, the correlation of .562 means that approximately 32% of the variation in crop yield across the sample of 153 grow reports can be accounted for by the number of lumens the grower is using.










** Correlation is significant at the 0.01 level
* Correlation is significant at the 0.05 level
Number of records=153
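
As a rough sketch of the arithmetic described above, the following Python computes a correlation coefficient from paired values and squares it. The function name pearson_r and the five data points are hypothetical illustrations, not YOR data; only the reported value of .562 comes from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical illustration data (not from the YOR database).
lumens = [20000, 40000, 60000, 80000, 100000]
yield_g = [18.0, 30.0, 27.0, 45.0, 41.0]

r = pearson_r(lumens, yield_g)
r_squared = r ** 2  # share of variation in one variable accounted for by the other

# The figure reported in this glossary: r = .562
reported_r = 0.562
reported_r_squared = reported_r ** 2  # about 0.316, i.e. roughly 32%
```

Squaring .562 gives about .316, which is where the "approximately 32%" figure comes from.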




Dependent variable: In a predictive statistical model, the variable whose values we are trying to predict using one or more predictor variables. The model attempts to determine to what extent the dependent variable varies across our sample of grow reports as a function of the values of the predictor variables. That is, in our statistical model, the value of this dependent variable "depends" on the values of the predictors. (Predictors are sometimes referred to as "independent variables" to distinguish them from the dependent variable.)




Dummy variable: This is a quantitative variable created to represent a single category of a non-quantitative variable. A non-quantitative variable is sometimes also referred to as a nominal or qualitative variable, because it does not represent a measurable quantity. Lumens represent a quantitative variable because we can quantify how many lumens we are talking about. But "type of growing medium" is a qualitative variable because there are qualitatively different types of growing medium, but we cannot sensibly say that one medium is quantitatively "larger" or "smaller" than another. (Similarly, in typical demographic analyses, income is a quantitative variable, but religion is a qualitative variable.)

In order to use a variable in a statistical calculation, such as in a correlation or regression model, it must be a quantitative variable, so that we can try to associate its values with those of other quantitative variables. For example, we try to associate the amount of light (number of lumens) with crop yield (weight measured in grams per square foot), perhaps with the expectation that more lumens will produce greater crop yield.

So in order to use the information contained in qualitative variables in our analyses, we take each category of a qualitative variable, such as "growing medium," and we create a single quantitative dummy variable for each category to represent that category in our model. We give this dummy variable a quantitative value of either 1 or zero to indicate the presence vs. absence of the category of the original qualitative variable. So, for example, one of the dummy variables might be "hydroponic" to represent the hydroponic category of growing medium. If a grower used a hydroponic medium, then the hydroponic quantitative dummy variable would be assigned a value of 1 for that grow report. If a hydroponic medium was not used, then the hydroponic quantitative dummy variable is assigned a value of zero. Each category of the original qualitative variable is similarly represented by a single quantitative dummy variable, such that if one of these dummy variables has a value of 1, then the others must necessarily all have values of zero for a given grow report.
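
The coding scheme described above can be sketched in Python as follows. The function name make_dummies and the list of growing media are hypothetical examples, not values taken from the YOR database.

```python
def make_dummies(values):
    """Return {category: [0 or 1 per record]} for a list of category labels."""
    categories = sorted(set(values))
    # One dummy variable per category: 1 if the record is in it, else 0.
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical growing-medium values for five grow reports.
medium = ["hydroponic", "soil", "soil", "soilless mix", "hydroponic"]
dummies = make_dummies(medium)

hydroponic = dummies["hydroponic"]  # [1, 0, 0, 0, 1]
```

For any single grow report, exactly one of the dummy variables is 1 and the rest are zero. One practical note, offered as general background and not as a description of the YOR model: when a regression includes a constant term, analysts commonly leave one category's dummy out as a reference category, because the full set of dummies always sums to 1.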




Predictor: In a predictive statistical model, a variable whose influence on an outcome is measured by the model. For example, in a model studying the influences of various factors on crop yield, number of lumens might be one of several predictors of crop yield that we may test in the model. (A predictor is also sometimes referred to as an "independent variable" to distinguish it from the "dependent variable," which is the outcome we are trying to predict, e.g., crop yield.)




Qualitative (as in "qualitative difference"): A property of a variable that cannot be quantified mathematically. For example, there are qualitatively different types of growing media or fertilizers, but one cannot reasonably say that one medium is somehow "larger" or "smaller" than another. They are just different media, but their differences are not quantifiable in the way that number of lumens is quantifiable.




Quantitative (as in "quantitative variable"): A property of a variable that can be measured on at least an ordinal scale of measurement so that we can speak of "more" or "less" of a quantity. Number of lumens is an example of a quantitative variable because we can measure and refer to a quantity of lumens. Only quantitative variables (not qualitative variables) can be used in calculating correlations or in a predictive regression model. In such an analysis, qualitative variables cannot be used directly, but must instead be represented by a series of quantitative "dummy" variables, where each category of the original qualitative variable is represented by a single dummy variable, coded with a value of "1" for the presence of the category and a "0" for the absence of that category in the original qualitative variable as reported in a given grow report. (See also Dummy variable.)




R (as in "R statistic"): This is a statistical abbreviation for a correlation, often used in tables of statistical output. If R = .50 then the correlation has a value of .50.




R square (as in "R square statistic"): This is a measure of how much variation in a dependent variable is accounted for by the variation in a predictor or independent variable, and is derived by squaring a correlation coefficient. For example, if we determine that the correlation between lumens and crop yield is +.562, then we can say that the variation in amount of light (measured in lumens) across our sample of grow reports is accounting for about 32% of the variation in crop yield across our sample of grow reports.




Record (as in "number of records"): In a database such as the YOR database, this represents a single observation or grow report. Most of our statistical analyses of the YOR data are based on approximately 140-160 grow reports, or database records.




Regression: This is a fairly powerful multivariate statistical modeling technique that is employed to try to understand the simultaneous influences of multiple predictor variables (thus the term multivariate) on a dependent variable of interest (e.g., crop yield). While a simple correlation analysis examines the influence of a single predictor in isolation, a regression model examines influences of two or more predictors simultaneously.

Thus, regression is more powerful and informative than simple correlation analysis, because it is able to statistically "control for" other predictors' influences when determining the unique influence of a given predictor. This helps us to "unconfound" or tease apart multiple influences, so that we can better determine which variables are really important. (See also confounding and control.)

The general mathematical form of the linear regression equation for a single predictor variable is:

Y = aX + b

where Y = the predicted value of the dependent variable (e.g., crop yield), X is the value of the predictor variable (e.g., lumens), a is the regression coefficient for the predictor variable, and b is a constant. When we have more than one predictor, the equation takes the form:

Y = a1X1 + a2X2 + . . . + anXn + b

so that the predicted value of the dependent variable is a linear combination of the values of the predictors multiplied by their respective regression coefficients, plus a constant term.
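
To make the multi-predictor equation concrete, here is a minimal Python sketch that estimates a1, a2, and b by ordinary least squares using only the standard library. It is an illustration under stated assumptions, not the procedure used for the YOR analyses: the function names, the records, and the choice to measure light in thousands of lumens ("kilolumens") are all hypothetical, and the yields are constructed to follow a known equation exactly so the result is easy to check.

```python
def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | y], copied so the inputs are not modified.
    M = [list(A[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(col + 1, n):
            factor = M[row][col] / M[col][col]
            for k in range(col, n + 1):
                M[row][k] -= factor * M[col][k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(M[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (M[row][n] - tail) / M[row][row]
    return x

def fit_linear_regression(rows, ys):
    """Least-squares fit of Y = a1*X1 + ... + an*Xn + b.

    rows: one list of predictor values per record.
    Returns [a1, ..., an, b].
    """
    design = [list(r) + [1.0] for r in rows]  # trailing 1.0 carries the constant b
    p = len(design[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical records: [kilolumens, hydroponic dummy variable]
records = [[20, 1], [40, 0], [60, 1], [80, 0], [100, 1], [30, 0]]
# Yields constructed so that yield = 1.0*kilolumens + 5.0*hydroponic + 10.0 exactly.
yields = [1.0 * kl + 5.0 * h + 10.0 for kl, h in records]

a1, a2, b = fit_linear_regression(records, yields)
predicted = [a1 * kl + a2 * h + b for kl, h in records]
```

Because the made-up yields contain no noise, the fit recovers the coefficients used to build them (1.0, 5.0, and 10.0, up to rounding). Each coefficient describes the change in predicted yield for a one-unit change in its predictor while the other predictor is held constant.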




Scatterplot: This is a simple graph which shows the relationship between a variable on the X-axis (horizontal axis) and a variable on the Y-axis (vertical axis) by placing a dot or other marker in the graph to represent each intersection of the two dimensions in our data. For example, in the section of our analysis where we show the results of the final regression model, we show a scatterplot graph indicating the relationship or correspondence between actual crop yield (on the X-axis) and predicted crop yield (on the Y-axis). Each dot in the scatterplot represents the intersection of the X and Y values for a single grow report.

The dots portrayed in this graph indicate a fairly good correspondence between actual and predicted crop yield (weight in grams per square foot), suggesting that we have a fairly good predictive regression model. In the scatterplot of actual vs. predicted crop yield, the pattern of dots tends to run from the lower left to the upper right, showing that as crop yields predicted by the model increase, so do actual reported crop yields in the grow reports.




Significant/significance (or statistically significant/statistical significance): This is a measure of the probability that a statistical finding is spurious (has occurred by chance). It is the complement of statistical confidence: the significance level equals 1 minus the level of confidence. So if the level of statistical significance is .01, then the level of statistical confidence is .99 or 99%. Thus, a higher level of confidence, or a correspondingly lower significance value or probability that an outcome has occurred by chance, makes us more likely to believe that a statistical finding is genuine instead of spurious. In tables of statistical results, the significance level (sometimes labeled simply as "Sig.") is usually reported instead of the level of confidence. Thus, when we see a significance level of .001 in such a table, this means that we would expect the observed result to occur spuriously, or by chance, only about one out of a thousand times when examining a large number of samples drawn randomly from the larger statistical population. (See also confidence.)




Statistically stable sample: A sample that is large enough so that the statistical findings of the analysis are likely to be reliably confirmed when examining other similarly sized samples drawn at random from the population at large. If a sample is too small, the statistical findings may not be relied upon.




t statistic: This is a statistic that is used to determine the statistical significance of a finding, and is often reported in tables of statistical output along with its associated statistical significance level. (See also significant/significance.)
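
This glossary does not say how the t statistic is calculated. As one hedged illustration, a standard formula for testing whether a correlation differs from zero is t = r x sqrt(n - 2) / sqrt(1 - r squared), evaluated with n - 2 degrees of freedom. The Python below applies it to the correlation of .562 and the 153 records mentioned in the Correlation entry; the function name is hypothetical, and this is not necessarily the calculation behind the YOR tables.

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson correlation r differs from zero.

    r: correlation coefficient, strictly between -1 and 1.
    n: number of records; the test uses n - 2 degrees of freedom.
    """
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Values taken from the Correlation entry: r = .562, 153 records.
t_value = t_for_correlation(0.562, 153)  # roughly 8.3
```

The larger the absolute value of t for a given sample size, the smaller the associated significance level.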




Unit (as in "unit of measurement"): This represents the mathematical scale on which a variable is typically measured and reported. For example, crop yield is often reported as grams per square foot. So one gram per square foot would be the unit of measurement of the "crop yield" variable. Canopy size is typically reported in square feet. So one square foot is the unit of measurement for the "canopy size" variable.

Unit of measurement becomes important in the regression model, because the unstandardized regression coefficients tell us how much change in the dependent variable (crop yield) is associated with a one-unit change in the value of a predictor. This allows us to determine, for example, how high a lumen level is needed in order to produce a predictable size of crop yield. For example, in our regression model, we have determined that an increase of illumination of one lumen will result in an increase of 1/1000 of a gram per square foot in crop yield. Or, to put it more usefully, for every increase of 1000 lumens, we would expect an increase in crop yield of approximately one gram per square foot.
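
The arithmetic in this example can be written out as a short Python sketch. The coefficient of 1/1000 is taken from the text above; the constant and function names are hypothetical, and the calculation assumes all other predictors are held constant.

```python
# From the glossary text: about 1/1000 gram per square foot of yield per lumen.
B_LUMENS = 0.001

def predicted_yield_change(delta_lumens, coefficient=B_LUMENS):
    """Expected change in crop yield (grams per square foot) for a change
    in illumination (lumens), with other predictors held constant."""
    return coefficient * delta_lumens

change_per_1000 = predicted_yield_change(1000)  # about 1 gram per square foot
```

Changing the unit of the predictor rescales the coefficient but not the prediction: if light were recorded in thousands of lumens instead, the same relationship would appear as a coefficient of about 1.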




Variable: A mathematical quantity that varies (as contrasted with a mathematical constant, which is invariant). In our analyses, we examine the influences of predictor variables on the variable of crop yield. The values of the predictor variables and the dependent variable (crop yield) tend to vary from one grow report to the next. This allows us to build a quantitative model to predict crop yield from a set of predictor variables. In the regression model, it also allows us to estimate how much of a change in crop yield is likely to result from a given change in the value of a given predictor (e.g., number of lumens, or "level of illumination").




Variation (as in "percent of variation"): The amount of measurable change in a variable. For example, using regression modeling, we are able to estimate the strength of influence of a predictor on crop yield. And we can express the strength of this influence in terms of the percent of variation in crop yield that is accounted for by a predictor or group of predictors. Predictor variables having a greater influence on crop yield will "explain" or be associated with greater percentages of variation in crop yield. In our regression model, for example, the most influential predictor, number of lumens, explains about 31% of the variation in crop yield across the sample of grow reports.




X-axis: The horizontal axis in a graph. When graphing the relationship between a predictor and the variable we are trying to predict, the predictor is conventionally represented on the X-axis.




Y-axis: The vertical axis in a graph. When graphing the relationship between a predictor and the variable we are trying to predict, the variable we are trying to predict is conventionally represented on the Y-axis.