From William Eddy, "Comments," JASA (March 1979)
From William Bradley, et al., "Metadata Matters: standardizing metadata for improved management and delivery in national information systems," (Social Environment Information Group, Health Canada), p. 17.
Performing quantitative analysis involves combining activities from three areas: statistics, data analysis, and social data. Each of these areas is discussed below.
Adapted from William Eddy, op. cit.
The earliest definition of statistics in the Oxford English Dictionary refers to the activities of collecting, classifying, and discussing numeric facts about states or nations. This usage arose during the 18th century, coinciding with the emergence of the modern nation-state. The statistics mills of these nation-states amassed data for the decision-makers in the government of the day and for the administrators of bureaus and departments of government services and programs. Statistics Canada, in this regard, has been aptly named.
In recent use, statistics has become the science of generalization. Built upon theories of probability and inference, statistics permits broad generalizations from specific observations, whether related to human or natural phenomena. Instrumental in making generalizations are models and techniques that incorporate stochastic or error components. In this workshop, statistics refers to the methods and techniques for summarizing data and for estimating a variety of models of inference.
We do offer, however, a brief summary of the more basic statistical techniques in Table 1. Again, the list is not comprehensive, but rather is illustrative of the match between statistical tools and data. In particular, two key properties of a set of variables help make this point. First, an analytic distinction is made between dependent and independent variables. This implies a causal relationship between variables, where the dependent variables represent the phenomenon being explained and the independent variables serve as the causal agents.
Not all quantitative analysis necessarily has to be conceived in terms of causality. For example, examining a single variable falls outside this context. Similarly, data reduction techniques do not require specifying dependent and independent variables, but instead are directed at identifying the shared measurement of a latent property, that is, an indirectly observed attribute. Factor analysis, latent class analysis and cluster analysis are methods commonly used for this purpose. Latent variables are further discussed below.
| Dependent Variables | Independent Variables: Categorical | Independent Variables: Analytic |
|---|---|---|
| Categorical | Data Type: Frequencies or Counts. Methods: Tables, Loglinear Analysis, Latent Class Analysis | Data Type: Biserial Correlation. Methods: Probit/Logit, Discriminant Analysis |
| Analytic | Data Type: Means and Variances. Methods: Analysis of Variance, Multivariate ANOVA, T-Test | Data Type: Correlation Matrices. Methods: Regression Analysis, Factor Analysis |
Variables that have been measured using a metric capture a precise amount or quantity of a property. Personal income, where dollars are the metric of measurement, is an example of this type of Analytic Variable. Furthermore, one can speak of 0 dollars as being the absence of income. Someone earning $20,000 makes twice as much as someone earning $10,000. Also, a dollar difference between 5 and 6 dollars is the same amount as the difference between 100 and 101 dollars, namely, one dollar. These three attributes, an absolute zero, ratio comparisons, and equal intervals, are all characteristics of strong Analytic Variables.
Social Data. Social data produced by governmental departments, academic research and the private sector about Canadian society can be summarized according to four basic data structures: unit record files, also known as microdata releases (e.g., the Public Use Microdata Files from the 1991 Census); aggregate files (e.g., the basic summary tables from the Census); time series records (e.g., CANSIM records); and geo-referenced or spatial files (e.g., the Census geography files).
Among the unit record files, a variety of data collection methods have been employed, including one-time cross-sectional surveys (e.g., the Family History Survey), repeated surveys (e.g., the General Social Surveys), and longitudinal surveys (e.g., the National Population Health Survey). These methods differ according to how time is controlled or manipulated. A one-time cross-sectional survey permits examining phenomena at a discrete point in time. Generalizations about change are difficult to infer from this data collection method. Repeated surveys, on the other hand, permit comparisons over time among cohorts. Individual change is difficult to infer, but aggregate change within a group can be observed. Finally, longitudinal surveys allow the examination of change at the level of the individual. While this method of data collection may appear to be ideal for explaining social phenomena, longitudinal surveys present some of the more complex data analysis problems.
A massive volume of data is collected by our national statistical information system in Canada. While concerns may exist about the scope and detail of the social issues captured by this system, volume of data is not a contentious point. A significant barrier, however, has been access to these data. One may have expected technology to have been the major hindrance to using these data, but a more basic factor has been simply getting copies of the data files. Several barriers have prevented open access to these data, including cost, concern about disclosure and confidentiality, poor data management practices, and issues of ownership. DLI is a program that specifically addresses access for post-secondary, non-commercial use of a sizable quantity of Statistics Canada data files. Nevertheless, identifying data collections that are appropriate to a specific research project remains a challenge.
Two general problems confront the secondary analysis of previously collected data. First, there is the challenge of matching units of analysis. This entails locating data observed using the same unit of analysis as your research interest. Since a large percentage of social research focuses on individuals as the unit of analysis, Statistics Canada's microdata files are an attractive source to examine. If your research, however, is about individuals from a special or rare population, none of the major microdata files may have enough cases to permit an analysis, or the information needed to identify the group of interest may not be present in the file. For example, someone interested in studying Aboriginal women who use computers in the workplace may not find enough cases in any of the microdata files, or the variables that would allow the identification of these cases may not be in the microdata file. Research requiring the identity of smaller communities of fewer than 250,000 people is particularly challenging, since one method frequently used to protect the confidentiality of individuals when constructing microdata files is to report locations only for heavily populated areas, such as CMAs, or large geographic units, such as provinces.
The second general challenge is to find data files that have captured the subject matter or content of interest. Here the focus is on finding variables that address the research question at hand. The researcher will want to ensure both that the complete mix of dependent and independent variables is within a microdata file and also that the phenomenon being studied was measured or observed as desired. The latter concern raises the issue of manifest and latent variables. Variables that are directly observed are manifest variables, while latent variables represent phenomena that, while not directly observed, are believed to underlie observed variables. A phenomenon such as worker alienation may not be measured directly by any one variable, but instead may be an underlying dimension in the measurement of several variables, including work satisfaction, salary level, relations with co-workers and management, personal goals, opportunities for advancement, etc. As mentioned above, some statistical techniques help abstract latent dimensions.
Data Analysis. Data analysis involves applying statistical computational techniques to social data. The advent of computer automation has revolutionized data processing and has introduced a number of new approaches to analyzing data. Access to computing power capable of processing public use microdata releases of the Census was once a serious concern for researchers. Today, a notebook computer with an attached CD-ROM player can process these substantial files. However, it is not just the power of a briefcase-sized computer that makes it possible to process these large files. Equally important is the processing power of statistical software operating on today's small computers. Major systems such as SAS and SPSS are available across the range of computing platforms, and there are a number of smaller-scale packages that also carry a major processing punch, including NSDStat.
Data analysis includes data management operations necessary to organize a data file for statistical analysis. This entails identifying the observations in a file that are most appropriate for an analysis, which in some instances requires defining and extracting a subset of the observations from the original file. In other instances, the structure of the file may have to be aggregated to achieve the proper unit of analysis, for example, job-level observations in the Labour Market Activity Surveys may have to be summarized to create person-level records.
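The aggregation step described above can be sketched in a few lines. This is a minimal illustration only: the field names (`person_id`, `weeks_worked`, `earnings`) and the records are hypothetical and are not taken from the Labour Market Activity Surveys.

```python
from collections import defaultdict

# Hypothetical job-level records: one row per job held by a person.
job_records = [
    {"person_id": 1, "weeks_worked": 20, "earnings": 8000},
    {"person_id": 1, "weeks_worked": 30, "earnings": 15000},
    {"person_id": 2, "weeks_worked": 52, "earnings": 31000},
]

def aggregate_to_persons(jobs):
    """Collapse job-level rows into one summary record per person."""
    persons = defaultdict(lambda: {"n_jobs": 0, "weeks_worked": 0, "earnings": 0})
    for job in jobs:
        person = persons[job["person_id"]]
        person["n_jobs"] += 1
        person["weeks_worked"] += job["weeks_worked"]
        person["earnings"] += job["earnings"]
    return dict(persons)

person_records = aggregate_to_persons(job_records)
```

After this step the unit of analysis is the person rather than the job, so each person contributes exactly one record to any later summary.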
In addition to working with the case or observation structure of a file, data management tasks also involve strategies for manipulating or transforming variables. This includes creating new variables from existing variables or combining categories in an existing variable, for example, grouping age into five- to ten-year intervals. Together, case and variable manipulation play an important role in the completion of an analysis.
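As a small sketch of this kind of variable transformation, the following recodes age in single years into fixed-width intervals. The function name `age_group` and the sample ages are illustrative assumptions.

```python
def age_group(age, width=5):
    """Return a label such as '25-29' for the interval containing `age`."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

# Hypothetical ages in single years.
ages = [23, 25, 29, 30, 41]
groups = [age_group(a) for a in ages]
```

Passing `width=10` produces ten-year intervals instead, so the same helper covers both ends of the range mentioned in the text.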
Description: summary of a sample or data collection
There are times when the primary purpose of an analysis is simply to describe the characteristics of a certain subpopulation or special group. Three types of numeric summaries are especially useful in these instances. First, accurate population estimates may be desired. While the data will be descriptive of the time period for which they were gathered, this information may be useful in describing or identifying the approximate size of a group or severity of a problem. Statistics Canada typically provides a variable (commonly referred to as a weight variable) that will produce population estimates from their sample. For example, the sample size for the number of women in the 1989 Survey of Literacy Skills Used in Daily Activities is 4,600. The population estimate applying the Statistics Canada weight variable is 7,284,422 women.
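In computational terms, a weighted population estimate is the sum of the weight variable over the sample cases of interest. The sketch below assumes a hypothetical three-person sample with made-up weights; it does not reproduce the literacy survey figures.

```python
# Hypothetical respondents; "weight" is the number of people in the
# population that each sampled person is taken to represent.
sample = [
    {"sex": "F", "weight": 1500.0},
    {"sex": "M", "weight": 1620.5},
    {"sex": "F", "weight": 1710.25},
]

def population_estimate(records, predicate):
    """Sum the weights of the records that satisfy `predicate`."""
    return sum(r["weight"] for r in records if predicate(r))

def is_woman(record):
    return record["sex"] == "F"

n_women_sample = sum(1 for r in sample if is_woman(r))        # unweighted count
n_women_population = population_estimate(sample, is_woman)    # weighted estimate
```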
Because a population estimate is calculated from a sample, confidence intervals may be reported to provide a sense of the range of values within which the 'true' population value is likely to fall. Statistics Canada documentation often includes sampling variability tables that permit the calculation of a population estimate's confidence interval. In many ways, a confidence interval is more meaningful to discuss than an actual population estimate because of natural fluctuations within social phenomena.
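One common way to form such an interval, assuming the sampling variability tables supply an approximate coefficient of variation (CV) for the estimate, is to take the estimate plus or minus 1.96 standard errors for a 95 percent interval. The estimate and CV below are hypothetical values chosen for easy arithmetic.

```python
def confidence_interval(estimate, cv, z=1.96):
    """Approximate interval: estimate +/- z * standard error, where the
    standard error is cv * estimate (cv expressed as a proportion)."""
    standard_error = cv * estimate
    margin = z * standard_error
    return estimate - margin, estimate + margin

# Hypothetical estimate of 100,000 people with a CV of 5 percent.
low, high = confidence_interval(100_000, 0.05)
```

With these inputs the interval runs from roughly 90,200 to 109,800, which conveys the uncertainty more honestly than the single figure of 100,000.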
The second useful numeric summary is based on the relative size of categories within a Categorical Variable. Percentages are typically used for this purpose. For example, instead of reporting that 12,301,788 Canadians were married in 1991, this fact is reported as 45.6 percent of Canadians. This figure adjusts the frequencies to a 100-point scale, which easily permits comparisons of the relative size of each category within a variable.
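Converting frequencies to percentages is a short calculation. The category counts in this sketch are hypothetical and are not Census figures.

```python
def percent_distribution(counts):
    """Rescale category frequencies so that they sum to 100."""
    total = sum(counts.values())
    return {category: round(100 * n / total, 1) for category, n in counts.items()}

# Hypothetical frequencies for a marital status variable.
counts = {"married": 450, "single": 400, "other": 150}
percentages = percent_distribution(counts)
```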
The third numeric summary often reported is a "typical value" (also known as a measure of central tendency) of an Analytic Variable, i.e., a variable with a continuous distribution. The most common "typical values" are the mean and median, although the most frequently occurring value (or mode) might also be reported. In summarizing a continuous distribution it is also important to know its spread, i.e., how widely the values are dispersed along the variable's continuous scale. The spread may be reported as the range, i.e., the distance between the minimum and maximum values, or as the standard deviation about the mean.
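These summaries are all available in Python's standard `statistics` module. The income values below are hypothetical and are chosen so that one large value pulls the mean above the median.

```python
import statistics

# Hypothetical incomes, in thousands of dollars.
incomes = [10, 20, 20, 30, 70]

summary = {
    "mean": statistics.mean(incomes),
    "median": statistics.median(incomes),
    "mode": statistics.mode(incomes),
    "range": max(incomes) - min(incomes),
    "std_dev": statistics.stdev(incomes),  # sample standard deviation
}
```

Here the mean is 30 while the median and mode are both 20, a gap that signals a skewed distribution and is one reason the median is often preferred for income data.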
Comparison: looking at difference in means and percentages
A second strategy involves comparing the summary figures, i.e., percentages or "typical values", of important dependent variables across groups. Instead of describing a particular population or group, phenomena are examined by contrasting differences or similarities among groups.
The comparison approach is commonly used in controlled experiments. One group, designated the experimental group, is subjected to a change, while a control group receives no treatment. The outcome, which is measured in a key dependent variable, is then compared between the control and experimental groups to see if the treatment made a difference.
Since survey research rarely involves the use of experimental controls, statistical control of variables is used during an analysis of the data. These comparisons usually entail examining differences among naturally occurring groups, e.g., income differences between women and men with similar jobs. Such differences are commonly evaluated by subtracting the percentages or means among groups and comparing the outcome to the numeric value, zero. Groups that are very similar will have differences close to zero, while large differences indicate dissimilar groups.
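A minimal sketch of such a comparison follows, using hypothetical records. Restricting the comparison to a single job category stands in, very crudely, for statistical control of the job variable.

```python
from statistics import mean

# Hypothetical survey records.
records = [
    {"sex": "F", "job": "clerk", "income": 30000},
    {"sex": "F", "job": "clerk", "income": 34000},
    {"sex": "M", "job": "clerk", "income": 36000},
    {"sex": "M", "job": "clerk", "income": 40000},
    {"sex": "M", "job": "manager", "income": 70000},
]

def group_mean(rows, sex, job):
    """Mean income for one sex within one job category."""
    return mean(r["income"] for r in rows if r["sex"] == sex and r["job"] == job)

# A difference near zero would indicate similar groups.
difference = group_mean(records, "M", "clerk") - group_mean(records, "F", "clerk")
```

Note that the manager record is excluded from both group means, so the difference reflects only respondents holding the same kind of job.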
In addition to comparing percentages or "typical values", the distribution of an observed variable may be compared to a theoretical distribution. For example, a Gini ratio, which functions as a theoretical index of equality, may be calculated and compared to the distribution of an observed variable, such as income. Differences in this instance are usually evaluated by examining the correlation between the observed variable and a variable containing the expected values from the theoretical distribution. A Chi-Squared test between expected and observed frequencies also permits this type of evaluation.
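The Chi-Squared statistic mentioned above is the sum, over categories, of the squared difference between observed and expected frequencies divided by the expected frequency. The frequencies in this sketch are hypothetical.

```python
def chi_squared(observed, expected):
    """Pearson Chi-Squared statistic for matched frequency lists."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical observed frequencies and those expected under a
# theoretical distribution, for three categories.
observed = [30, 50, 20]
expected = [25, 50, 25]
statistic = chi_squared(observed, expected)
```

The statistic is then compared with a critical value from the Chi-Squared distribution. With three categories there are two degrees of freedom, for which the 5 percent critical value is about 5.99, so a statistic of 2.0 would not indicate a significant departure from the theoretical distribution.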
Modeling: building and testing statistical representations
A third strategy in quantitative analysis consists of building and testing models to represent social phenomena. Two general tasks are involved in working with models. First, to build a model, a causal relationship must be defined and variables must be identified to represent the dependent variable, i.e., the response variable or variable being explained in the model, and the independent variables, i.e., the causal agents or predictor variables. Data are then used to estimate the values of the coefficients for the independent variables and to predict values for the dependent variable. An overall representation of this task is reflected in the following formula:
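One common way to write such a model, assuming a linear form (the notation here is illustrative and may differ from the author's), is:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_k x_{ik} + \varepsilon_i
$$

where $y_i$ is the dependent variable for observation $i$, the $x_{ij}$ are the independent variables, the $\beta_j$ are the coefficients estimated from the data, and $\varepsilon_i$ is the error component.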
Once these estimates are determined, an assessment of the model's fit with the data is possible. The literature about model assessment falls under the heading of regression diagnostics, which also includes methods for testing violations of the assumptions of linear regression. Most of these techniques focus on the residuals of the model building exercise, i.e., the differences between the observed values of the dependent variable and the predicted values resulting from the model.
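To make the idea of residuals concrete, the sketch below fits a one-predictor line by ordinary least squares and computes observed minus predicted values. The data and the helper name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical observations.
xs = [1, 2, 3, 4]
ys = [2.0, 4.1, 5.9, 8.0]

intercept, slope = fit_line(xs, ys)
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]  # observed minus predicted
```

When the model includes an intercept, least squares residuals sum to zero, so diagnostics look at their pattern (for example, whether they grow with the predictor) rather than at their total.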
The second task - model testing - focuses on applying a model to new data. Some common statistical techniques actually test the closeness of fit between newly applied data and the model upon which they are based. Analysis of variance is such an example.
Model testing is similar to the diagnostic tests used in building models. The overall test assessment is based on comparing the actual observations in the data with the predicted fit from the model.
Charles K. Humphrey
1996
All Rights Reserved