AcaStat Statistical Software

AcaStat statistical software includes this Handbook, a search-and-expand statistics glossary, and an affordable, easy-to-use analytical tool.

Available on CD-ROM or instantly as a download.

AcaStat Software, All Rights Reserved http://www.acastat.com

Data File Basics

There are two general sources of data. One source is called primary data. This source is data collected specifically for a research design developed by you to answer specific research questions. The other source is called secondary data. This source is data collected by others for purposes that may or may not match your research goals. An example of a primary data source would be an employee survey you designed and implemented for your organization to evaluate job satisfaction. An example of a secondary data source would be census data or other publicly available data such as the General Social Survey.

Designing data files

The best way to envision a data file is to use the analogy of the common spreadsheet software. In spreadsheets, you have columns and rows. For many data files, a spreadsheet provides an easy means of organizing and entering data. In a rectangular data file, columns represent variables and rows represent observations. Variables are commonly formatted as either numerical or string. A numerical variable is used whenever you wish to manipulate the data mathematically. Examples would be age, income, temperature, and job satisfaction rating. A string variable is used whenever you wish to treat the data entries like words. Examples would be names, cities, case identifiers, and race. Many times variables that could be considered string are coded as numeric. As an example, data for the variable "sex" might be coded 1 for male and 2 for female instead of using a string variable that would require letters (e.g., "Male" and "Female"). This has two benefits. First, numerical entries are easier and quicker to enter. Second, manipulation of numerical data with statistical software is generally much easier than using string variables.
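The numeric coding described above can be sketched in a few lines of Python. This is only an illustration: the names SEX_LABELS, decode_sex, and count_by_code are hypothetical, and the sample values are made up for the example. The codes themselves (1 for male, 2 for female) follow the text.

```python
# Coding scheme from the text: 1 = male, 2 = female.
# The dictionary name and the helper functions are hypothetical.
SEX_LABELS = {1: "Male", 2: "Female"}


def decode_sex(codes):
    """Translate numeric codes into labels; unknown codes become None."""
    return [SEX_LABELS.get(code) for code in codes]


def count_by_code(codes):
    """Count how many observations carry each numeric code."""
    counts = {}
    for code in codes:
        counts[code] = counts.get(code, 0) + 1
    return counts


# Made-up sample of five observations, entered as numbers.
sample = [1, 2, 2, 1, 2]
labels = decode_sex(sample)
counts = count_by_code(sample)
print(labels)
print(counts)
```

The data are entered and counted as numbers, and the labels are attached only when the results need to be read by a person.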

Data file format

There are many different formats of data files. As a general rule, however, there are data files that are considered system files and data files that are text files. System files are created by and for specific software applications. Examples would be Microsoft Access, dBase, SAS, and SPSS. Text files contain data in ASCII format and are almost universal in that they can be imported into most statistical programs.

Text files

Text files can be either fixed or free formatted.

Fixed: In fixed formatted data, each variable will have a specific column location. When importing fixed formatted data into a statistical package, you must specify these column locations. The following is an example:

++++|++++|++++|++++|++++|

10123HoustonTX12Female1

Reading from left to right, the variables and their location in the data file are:
 
Variable         Column location   Data      Coding
Case             1-3               101
Age              4-5               23
City             6-12              Houston
State            13-14             TX
Education        15-16             12
Sex              17-22             Female
Marital status   23                1         1=single 2=married
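
As a rough sketch of how software might use the column locations listed above, the following Python slices the example record into its variables. The names LAYOUT and parse_fixed and the lowercase field names are assumptions made for illustration, not part of any statistical package.

```python
# Column locations from the example, as 1-based inclusive (start, end) pairs.
# LAYOUT, parse_fixed, and the lowercase field names are hypothetical.
LAYOUT = [
    ("case", 1, 3),
    ("age", 4, 5),
    ("city", 6, 12),
    ("state", 13, 14),
    ("education", 15, 16),
    ("sex", 17, 22),
    ("marital_status", 23, 23),
]


def parse_fixed(line, layout=LAYOUT):
    """Cut one fixed-format line into named fields."""
    record = {}
    for name, start, end in layout:
        # Python slices are 0-based and exclude the end index,
        # so column 1-3 becomes line[0:3].
        record[name] = line[start - 1:end].strip()
    return record


record = parse_fixed("10123HoustonTX12Female1")
print(record)
```

Every value comes back as text; a real import step would still have to convert numerical variables such as age and education to numbers.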

Free: In free formatted data, either a space or a special value separates each variable. Common separators are tabs or commas. When importing free formatted data into a statistical package, the software treats each separator as the end of one variable and the next character as the beginning of another. The following is an example of a comma separated value data file (known as a CSV file):

101,23,Houston,TX,12,Female,1

Reading from left to right, the variables are: Case, Age, City, State, Education, Sex, Marital status

When reading either fixed or free data, statistical software counts the number of variables and assumes, when it reaches the last variable, that the next value begins another observation (case).
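
The comma separated record above can be read with Python's standard csv module. The sketch below is one possible approach, not how any particular statistical package works; the function name read_records and the list VARIABLES are hypothetical, and the check on the number of values mirrors the counting rule just described.

```python
import csv
import io

# Variable names in file order, taken from the example in the text.
VARIABLES = ["Case", "Age", "City", "State", "Education", "Sex", "Marital status"]


def read_records(text, variables=VARIABLES):
    """Turn comma separated text into one dictionary per observation."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != len(variables):
            # A record with too few or too many values cannot be matched
            # to the variable list.
            raise ValueError(f"expected {len(variables)} values, got {len(row)}")
        records.append(dict(zip(variables, row)))
    return records


records = read_records("101,23,Houston,TX,12,Female,1\n")
print(records[0])
```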

Data dictionary

A data dictionary defines the variables contained in a data file (and sometimes the format of the data file). The following is an example of a description of variable coding for a three-question survey.
 

Employee Survey     Response #: 

Q1: How satisfied are you with the current pay system?

  1. Very satisfied
  2. Somewhat satisfied
  3. Satisfied
  4. Somewhat dissatisfied
  5. Very dissatisfied
Q2: How many years have you been employed here? 

Q3: Please fill in your department name: 
 

To properly define and document a data file, you need to record the following information:
 
Variable name: An abbreviated name used by statistical software to represent the variable (generally 8 characters or fewer)
Variable type: String, numerical, or date
Variable location: If a fixed data set, the column location and possibly the row in the data file
Variable label: A longer description of the variable
Value label: If a numerical variable is coded to represent categories, the category that each value represents

 

For the employee survey, the data dictionary for a comma separated data file would look like the following:
 
Variable name:  CASEID
Variable type:  String
Variable location:  First variable in csv file
Variable label:  Response tracking number
Value labels:  (not normally used for string variables)

 
Variable name: Q1
Variable type:  Numerical (categorical)
Variable location:  Second variable
Variable label:  Satisfaction with pay system
Value labels: 
  1. Very satisfied
  2. Somewhat satisfied
  3. Satisfied
  4. Somewhat dissatisfied
  5. Very dissatisfied

 
Variable name:  Q2
Variable type:  Numerical
Variable location:  Third variable
Variable label:  Years employed
Value labels:  None

 
Variable name:  Q3
Variable type:  String
Variable location:  Fourth variable
Variable label:  Department name
Value labels:  None
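
One way to keep a data dictionary next to the data is to store it as a small lookup structure. The Python below is a minimal sketch under that assumption; DATA_DICTIONARY, its key names, and the describe function are hypothetical, while the variable names, labels, and value labels come from the employee survey dictionary above.

```python
# The employee survey data dictionary as a nested Python dictionary.
# The structure and key names ("type", "position", ...) are hypothetical.
DATA_DICTIONARY = {
    "CASEID": {"type": "string", "position": 1,
               "label": "Response tracking number", "values": None},
    "Q1": {"type": "numerical (categorical)", "position": 2,
           "label": "Satisfaction with pay system",
           "values": {1: "Very satisfied", 2: "Somewhat satisfied",
                      3: "Satisfied", 4: "Somewhat dissatisfied",
                      5: "Very dissatisfied"}},
    "Q2": {"type": "numerical", "position": 3,
           "label": "Years employed", "values": None},
    "Q3": {"type": "string", "position": 4,
           "label": "Department name", "values": None},
}


def describe(name, value):
    """Return a readable description of one coded value."""
    entry = DATA_DICTIONARY[name]
    values = entry["values"]
    # Use the value label when one is defined, otherwise the raw value.
    meaning = values.get(value, value) if values else value
    return f'{entry["label"]}: {meaning}'


print(describe("Q1", 2))
print(describe("Q2", 5))
```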

If the data for four completed surveys were entered into a spreadsheet, it would look like the following:
 
 
     A        B    C    D
1    CASEID   Q1   Q2   Q3
2    1001     2    5    Admin
3    1002     5    10   MIS
4    1003     1    23   Accounting
5    1004     4    3    Legal

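The data would look like the following if saved in a text file as comma separated (note: in this example every record ends with a comma, so the number of commas in each record equals the number of variables; many programs instead expect no trailing comma, which gives one fewer comma than variables):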

CASEID, Q1, Q2, Q3,

1001, 2, 5, Admin,

1002, 5, 10, MIS,

1003, 1, 23, Accounting,

1004, 4, 3, Legal,
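
The comma separated file above, including its first row of variable names, can be loaded with a short Python sketch like the one below. The function name load_survey and the constant RAW are hypothetical. The code assumes the layout shown in this example: a space after each comma and a comma at the end of every record, which the csv module reads as an extra empty value that has to be dropped.

```python
import csv
import io

# The example file from the text, with a header row and trailing commas.
RAW = """CASEID, Q1, Q2, Q3,
1001, 2, 5, Admin,
1002, 5, 10, MIS,
1003, 1, 23, Accounting,
1004, 4, 3, Legal,
"""


def load_survey(text):
    """Read the survey file into a list of dictionaries keyed by variable name."""
    # skipinitialspace drops the space that follows each comma.
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # ignore blank lines
    # The trailing comma yields an empty last value; remove it if present.
    rows = [row[:-1] if row[-1] == "" else row for row in rows]
    header, data = rows[0], rows[1:]
    records = []
    for row in data:
        record = dict(zip(header, row))
        # Q1 and Q2 are numerical in the data dictionary; the others stay strings.
        record["Q1"] = int(record["Q1"])
        record["Q2"] = int(record["Q2"])
        records.append(record)
    return records


records = load_survey(RAW)
mean_years = sum(r["Q2"] for r in records) / len(records)
print(records[0])
print(mean_years)
```

Converting Q1 and Q2 to integers while leaving CASEID and Q3 as text follows the variable types recorded in the data dictionary.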

Some statistical software will not read the first row in the data file as variable names. In that case, the first row must be deleted before saving to avoid computation errors. The data file would look like the following:

1001, 2, 5, Admin,

1002, 5, 10, MIS,

1003, 1, 23, Accounting,

1004, 4, 3, Legal,

This is a very efficient way to store and analyze large amounts of data, but it should be apparent at this point that a data dictionary would be necessary to understand what the aggregated data represent. Documenting data files is very important. Although this was a simple example, many research data files have hundreds of variables and thousands of observations.

Hint: Use the Output Viewer to either open one of the data files provided with StatCalc or to practice creating your own data file. If you open one of StatCalc's data files, try adding a few observations (rows) and save as "practice.csv".  Import the data into the continuous data module for analysis.

Practice Data Files