MEASURE DHS: Quality Information to plan, monitor and improve population, health, and nutrition programs
Working with Datasets

Dataset FAQ(s)

Dataset Files

There are currently 7 FAQ(s) available:

  Q. How do I unzip the files I have downloaded from your site?

  A: Please use PKUNZIP or WINZIP.

  Q. I am using SAS with a SAS dictionary and DHS data files. After reading the data, I am getting a record length error. What is going on?

  A: Older versions of the SAS Export routine do not include the parameter for maximum record length (LRECL) with the INFILE statement.

If you get a SAS error message such as "One or more lines were truncated. NOTE: SAS went to a new line when INPUT statement reached past the end of a line," look at your INFILE statement. The default value is 256, but if your longest record is greater than 256, you need to include the parameters INFILE {datasetname} LRECL={maximum record length} MISSOVER.
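To choose a value for LRECL, you need the length of the longest record in the .DAT file. The following is a minimal Python sketch, not part of the DHS tools, that measures it. The sample records and the file name `datasetname.dat` are placeholders.

```python
import io

def max_record_length(lines):
    """Return the length of the longest record, ignoring line terminators."""
    return max((len(line.rstrip("\r\n")) for line in lines), default=0)

# Hypothetical three-record file standing in for a DHS .DAT file.
sample = io.StringIO("A" * 120 + "\n" + "B" * 300 + "\n" + "C" * 45 + "\n")

lrecl = max_record_length(sample)
# Print an INFILE statement using the measured length.
print(f"INFILE 'datasetname.dat' LRECL={lrecl} MISSOVER;")
```

For a real file, pass an open file handle in place of `sample`.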

  Q. I am using the Windows version of SPSS and I am getting an error message when I try to read the .SPS file included with the DHS data. This does not happen when I use the DOS version. Why is this happening?

  A: The syntax for "include files" for the Windows version of SPSS is slightly different from the DOS version.

With the Windows version, if you are using data files with more than one record per case, as is true for virtually all ISSA exported rectangular data files, you need to inform SPSS of the number of records per case. On the same line as the DATA LIST command in the SPS file (this should be the first line of the file), you must add a RECORDS clause to specify the number of records per case.

Use a text editor to count the number of forward slashes (/) that are in the DATA LIST section of the include file. This is the number of records. The DATA LIST section ends at the first period. Sample syntax:

DATA LIST FILE ='DXIR31RT.DAT' RECORDS = 30
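If the include file is long, the slash count can be automated. The Python sketch below is an illustration only. It assumes the DATA LIST command ends at the first line whose last character is a period, and that the file name in the command contains no forward slashes. The sample syntax is made up.

```python
def count_records(syntax_text):
    """Count '/' characters in the DATA LIST command of an SPSS syntax file."""
    slashes = 0
    in_data_list = False
    for line in syntax_text.splitlines():
        stripped = line.strip()
        if not in_data_list:
            if stripped.upper().startswith("DATA LIST"):
                in_data_list = True
            else:
                continue
        slashes += stripped.count("/")
        # Assumption: a period at the end of a line terminates the command.
        if stripped.endswith("."):
            break
    return slashes

# Hypothetical include file with two records per case.
sample = """DATA LIST FILE='C:\\EXAMPLE.DAT'
/HHID 1-12 (A)
 HV000 18-20 (A)
/HV001 1-8
 HV002 9-12.
VARIABLE LABELS HHID 'Case/ID'.
"""

print(f"RECORDS = {count_records(sample)}")
```

Slashes that appear after the end of the DATA LIST command, such as the one in the label above, are not counted.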

  Q. I am using Stata and I have trouble reading the dataset I downloaded from your website. How do I read your data into Stata to create a .DTA system file?

  A: There are three ways to read the data into Stata:

1) The data file comes with .DO and .DCT files.

Two files are provided with many of the DHS datasets: the .DO file and the .DCT file. The .DO file contains instructions for reading the data, and the .DCT file contains specifications for the layout of the data. The .DO file refers directly to the .DCT file, which in turn refers to the data (.DAT) file.

You will need to modify both the .DO file and the .DCT file, changing the filenames at the beginning of each file to specify the correct drive and the full path name of the directory in which the data file resides. If any directory name includes spaces, be sure to surround that directory name with double quotes. These modifications can be made using a text editor such as Wordpad or Edit.

Once the files have been modified, run the .DO file by starting Stata, clicking on File, Do from the menu, and selecting the .DO file. This will read the data into Stata, but it may take some time, as the data files are large. After the data have been read into Stata, do not forget to save the file as a Stata dataset (.DTA extension).

Warning: Because the datasets are large you may have problems with memory limitations. Memory available to Stata can be increased (up to the capacity of your computer) using the run time parameters -- see your Stata manual for details. Also, the maximum number of variables that Intercooled Stata can process is 2047. If the number of variables in the dataset you are using exceeds this number, you can edit the .DO and .DCT files to delete variables in excess of 2047 that you will not be using in your analysis.

2) There is no .DO or .DCT file included with the data file, but you have SPSS available to you. Use the .SPS file that comes with the data file to read the data into SPSS. Then save the data as a .SAV file and use Stat/Transfer (available from StatTransfer) or DBMSCOPY (available from Conceptual Software) to convert the .SAV file into a Stata .DTA file. Note that there may be costs related to either of these products.

3) There is no .DO or .DCT file included with the data file, and you do not have SPSS available to you. Write your own .DO and .DCT files.

A small example (4K, ZIP) is provided for each of these files to give the general structure. Do not use the variable names that are in the example, but rather use the names and locations of variables as they are specified in the .SPS file. Be very careful about the use of upper and lower case in variable names, as Stata variable names are case sensitive. In the example (and in files created by DHS), we use lower case letters in variable names and upper case letters in label definition statements. Once you have created the .DO and .DCT files for the dataset, you can then follow the steps in 1) above to read the data into Stata.

  Q. I am using SAS and I have trouble reading the dataset I downloaded from your website. How do I read in your data into SAS to create a SAS system file?

  A: The DHS data file comes with .SAS and .DAT files. The .SAS file is the SAS program to be executed in order to read an ASCII dataset and create a SAS dataset, and the .DAT file is the ASCII file used in the .SAS file. The modification of the .SAS file depends on whether a temporary or permanent SAS data set is needed. In either case, the .DAT file does not need any modification.

CREATING A PERMANENT SAS DATA SET

1. In the LIBNAME statement (line 1 of the .SAS file), specify the path of the directory where the permanent SAS data set will be stored. Example:

LIBNAME user 'C:\COUNTRY';

To call up the format associated with the variables in the SAS data file, add the following statement in a blank line following line 1:

OPTIONS FMTSEARCH=(user);

2. Create a permanent format catalog by adding the name of the library reference used in the LIBNAME statement to the PROC FORMAT statement (in line 3) as follows:

PROC FORMAT LIB = user;

In this example, a permanent format catalog will be stored in the directory called 'C:\COUNTRY'.

3. On the DATA statement line, precede the name of the data file being created with the library reference name specified in the LIBNAME statement, followed by a period. The library reference name, period, and data file name should form one word:

DATA user.datasetname;

4. In the INFILE statement, add the path for the directory in which the .DAT file is located:

INFILE 'C:\COUNTRY\datasetname.dat';

5. Add a RUN statement at the end of the .SAS file if there is no other SAS statement following the DATA step:

RUN;

A permanent SAS data file will be created in the specified directory, which will also contain the format catalog. The formats associated with the variables in the SAS data file are recalled in a new SAS session by adding the following statement to the SAS program:

OPTIONS FMTSEARCH = (user);

where user is the library reference name given in the LIBNAME statement. This statement should appear before the SAS data file is used in a DATA step or PROC statement.

CREATING A TEMPORARY SAS DATA SET

1. Delete the LIBNAME statement.

2. In the INFILE statement, add the path for the directory in which the .DAT file is located.

3. Add a RUN statement at the end of the .SAS file if there is no other SAS statement following the DATA step.

A small example (4K, ZIP) is provided to give you the general structure.

NOTE: The DATA statement (Data datasetname;) must always be on the line immediately before the very first ATTRIB statement. If the statement is at a different location, please move it. Also, the default value for maximum record length is 256; if the longest record in the dataset is greater than 256, you need to edit the INFILE statement to include the parameters INFILE {datasetname} LRECL={maximum record length} MISSOVER. For example:

INFILE 'C:\COUNTRY\datasetname.dat' LRECL=max.rec.length# MISSOVER;

  Q. I am using SPSS and I have trouble reading the dataset I downloaded from your website. How do I read your data into SPSS to create a system file?

  A: The DHS data file comes with .SPS and .DAT files. The .SPS file is the syntax program to be executed in order to read an ASCII dataset and to create a SPSS data set, and the .DAT file is the ASCII file used in the .SPS file.

Open SPSS for Windows. From the File menu, choose Open, then Syntax. Select the drive and path where the .DAT and .SPS files are located and highlight the .SPS file. Click on Open. Go to the end of line 1 in the .SPS file and press the DELETE key. This should move the slash (/) on line 2 to the end of line 1. The first three lines of the .SPS file will now look like this:

DATA LIST FILE='C:\EXAMPLE.DAT' RECORDS=2/
HHID 1-12 (A)
HV000 18-20 (A)

Now, from the menu, choose RUN, then All. Depending on the size of the data, it might take a while for SPSS to return the message "SPSS Processor is Ready. Transformations pending". Once you get this message, go to the SPSS Data Editor screen. From its menu, choose TRANSFORM, then Run Pending Transforms. Your data should now be presented in the data editor screen. You may now save the data as a system file by selecting File, Save. The default extension for an SPSS system file is .sav.

A small example (3K, ZIP) is provided to give the general structure of the files.

  Q. Some of the data files I have downloaded are extremely large and I don't need to use all of the variables in the dictionary. How can I create a smaller subset of the data?

  A: The new version of the Select Utility can create a user-defined, selected subset of variables from an SPSS, SAS or STATA data file. The data descriptions files (syntax) are provided as input to the SELECT program along with a file created by the user with the names of the desired variables. This utility includes a documentation file.

Please download from www.measuredhs.com/accesssurveys/technical_assistance.cfm.




Dataset Indicators

There are currently 8 FAQ(s) available:

  Q. Why do my age distributions differ from those in the reports?

  A: This problem will not occur when using the DHS Standard Recode files. The raw files, however, contain both reported age and date, and imputed age and century month code (CMC) of date.

Always use imputed age and CMC, rather than reported age and date, because there may be inconsistencies and/or missing data.

The imputed age in the raw data files has the same name as the age variable, but with a "C" appended. For example, if the age variable is Q104, the imputed age is Q104C. In most cases, the imputed age and the reported age will be the same; only inconsistent or missing data will cause them to differ.

Whether using raw or recoded data, it is easiest to use the CMCs for any event, instead of month and year variables. In addition, in the raw data, only the reported month and year of an event are present, not final imputed month and year values, and there will be missing or inconsistent data. The CMC for each event is the number of months from January 1900 to the date of the event.

For example, a respondent was born in November 1972, gave birth in June 1992, and was interviewed in April 1994. The CMCs for these events are:

her birth (72*12 + 11 = 875)
her child's birth (92*12 + 6 = 1110)
and her interview (94*12 + 4 = 1132)

These CMCs can be used to calculate the following (ages in years are rounded down to completed years):

Her age: (1132-875)/12 = 21 years
Her age at her child's birth: (1110-875)/12 = 19 years
Her child's age: (1132-1110) = 22 months
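The CMC arithmetic in this example can be sketched in Python as follows. The function names are illustrative and do not come from any DHS program. Integer division gives ages in completed years.

```python
def cmc(year, month):
    """Century month code: months counted from the start of 1900 (Jan 1900 = 1)."""
    return (year - 1900) * 12 + month

def cmc_to_date(code):
    """Convert a CMC back to a (year, month) pair."""
    return 1900 + (code - 1) // 12, (code - 1) % 12 + 1

birth = cmc(1972, 11)        # 875
child_birth = cmc(1992, 6)   # 1110
interview = cmc(1994, 4)     # 1132

age_years = (interview - birth) // 12             # 21 completed years
age_at_birth_years = (child_birth - birth) // 12  # 19 completed years
child_age_months = interview - child_birth        # 22 months

print(age_years, age_at_birth_years, child_age_months)
```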

Although the CMCs and ages are labeled "imputed," most have simply been calculated from reported data after editing has been completed. To find the percentage that were imputed from incomplete data, see the "date flag" variables.

In the Standard Recode files, for example, the variable V014 is a date flag. In raw data files, these flags usually have the same name as the date variable, but with an "F" appended. For example, the variables for a date of birth are Q103M and Q103Y; the CMC is Q103C and the flag is Q103F.

  Q. I cannot match your results for fertility rates. How does DHS calculate them?

  A: The DHS program has converted the ISSA fertility rate programs into SAS and SPSS. The programs have been broken down into separate components and are included in downloadable Zip files.

The principle of calculating the fertility rate is as follows:

Compute the number of births in the period of interest (usually the 5 years preceding the interview) to each woman in each 5-year age group, according to the woman's age at the time of the birth.

Compute the exposure of the woman, which is the number of years spent in each 5-year age group during the period of interest.

For example, a woman who is 26 years and 2 months old at the interview has spent 15 months (1.25 years) in the 5-year age category 25-29 (25 years and 0 months through 26 years and 2 months, inclusive). The woman has also spent 45 months (3.75 years) in the 5-year age category 20-24 in the 5 years preceding the interview.

Divide the number of births by the number of years of exposure to produce the age-specific fertility rate (ASFR) for each age group.

To obtain the total fertility rate (TFR), sum the ASFRs for the 5-year age groups and then multiply the sum by the number of years in each 5-year age group (5).

You can download the fertility programs you need:

SPSS Fertility Programs (7K, ZIP)
SAS Fertility Programs (6K, ZIP)
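The exposure and rate calculations described above can be illustrated with a short Python sketch. This is a simplified illustration, not the DHS fertility program: it works from age in months rather than CMCs, ignores sample weights, and assumes the 15-49 age groups. The births and exposure totals are hypothetical.

```python
AGE_GROUPS = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39), (40, 44), (45, 49)]

def exposure_months(age_in_months, period_months=60):
    """Months spent in each 5-year age group during the period before interview.

    The current month of age is counted, matching the inclusive counting
    in the worked example.
    """
    last = age_in_months
    first = age_in_months - period_months + 1
    exposure = {}
    for low, high in AGE_GROUPS:
        group_first, group_last = low * 12, high * 12 + 11
        months = min(last, group_last) - max(first, group_first) + 1
        if months > 0:
            exposure[f"{low}-{high}"] = months
    return exposure

def asfr(births, exposure_years):
    """Age-specific fertility rate: births per woman-year of exposure."""
    return births / exposure_years

def tfr(asfrs):
    """Total fertility rate: sum of the ASFRs times the 5-year group width."""
    return 5 * sum(asfrs)

# The woman from the example: 26 years and 2 months old at interview.
print(exposure_months(26 * 12 + 2))

# Hypothetical totals of births and woman-years for two age groups.
rates = [asfr(120, 1000.0), asfr(180, 900.0)]
print(tfr(rates))
```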

  Q. I am having trouble reproducing the mortality rates from the DHS report. How were they calculated?

  A: The DHS program has converted the ISSA mortality rate programs into SAS and SPSS. The programs have been broken down into separate components and are included in downloadable Zip files.

The mortality rate is calculated using the principles mentioned in the WFS Comparative Studies number 43, December 1984, by Shea Oscar Rutstein, Infant and Child Mortality: Levels, Trends and Demographic Differentials.

The mortality rate is calculated using synthetic cohort probabilities of death. The probabilities of death are calculated for subintervals of exposure (0 months, 1-2 months, 3-5 months, 6-11 months, 12-23 months, 24-35 months, 36-47 months, 48-59 months). The probability of death for a cohort and for a given period is the result of dividing the number of deaths for that period occurring between the limits of the subinterval to children who were exposed to death during the period, by the number of children exposed (children entering the subinterval alive).

The calculations of the deaths and the exposure are similar in manner. The procedure for calculating the deaths is explained below. The procedure for calculating the exposure uses the same process, but children contribute to the exposure in every subinterval in which they enter the subinterval alive, whereas children contribute to the deaths only in the subintervals in which the children die.

For children who die between ages a and b, where the period of interest runs from date p to date p':

  1. For children born between the dates [p-a] and [p'-b], all deaths are counted as occurring in the period.
  2. For children born between the dates [p'-b] and [p'-a], half the deaths are assigned to the period [p,p'] and the other half are assigned to the following period. The exception to this is when period [p,p'] is the last period immediately preceding the interview, in which case all deaths are assigned to the period.
  3. For children born between [p-a] and [p-b], half the deaths are assigned to the period [p,p'] and the other half are assigned to the preceding period.

    The mortality rates are calculated using the probability of survival of the children of all of the relevant age groups, which is 1 minus the probability of death of these age groups. The mortality rate is calculated using the following formula:

    (n)q(x) = 1 - Product(1-q[i])

    where q[i] is the probability of dying in subinterval i, and i ranges from x to x+n

All the mortality rates are calculated in the same manner except for the post neo-natal mortality (PNN), which is defined as the infant mortality rate minus the neo-natal mortality rate (NN).
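As a rough illustration of the formula, the Python sketch below combines subinterval probabilities of dying into mortality rates. The probabilities are hypothetical. The mapping of rates to subintervals (neonatal as the first subinterval, infant as the first four, under-five as all eight) is an assumption based on the subinterval list above, not taken from the DHS programs.

```python
from math import prod

SUBINTERVALS = ["0", "1-2", "3-5", "6-11", "12-23", "24-35", "36-47", "48-59"]

def death_probability(q):
    """1 minus the product of the subinterval survival probabilities."""
    return 1 - prod(1 - qi for qi in q)

# Hypothetical probabilities of dying in each subinterval, in the order above.
q = [0.030, 0.010, 0.008, 0.012, 0.015, 0.010, 0.006, 0.004]

neonatal = death_probability(q[:1])   # assumed: subinterval 0 months
infant = death_probability(q[:4])     # assumed: subintervals 0 through 6-11 months
under_five = death_probability(q)     # all eight subintervals
post_neonatal = infant - neonatal     # PNN defined as infant minus neonatal

for name, value in [("NN", neonatal), ("PNN", post_neonatal),
                    ("Infant", infant), ("Under-five", under_five)]:
    print(f"{name}: {1000 * value:.1f} per 1,000")
```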

You can download the mortality programs you need:
SPSS Infant Mortality Programs (11K, ZIP)
SAS Infant Mortality Programs (10K, ZIP)

  Q. When I read DHS-I datasets into SPSS, variables and values are being mislabeled and missing values are not being assigned. Is this a problem with the data file or the data description file?

  A: The problem lies with the data description file (the SPSS syntax file that has an SPS extension). The SPSS data description files for rectangular data files for DHS-I surveys are based on SPSS for DOS. SPSS for Windows uses slightly different syntax rules, and these new rules are causing the problem.

The problem is associated with variables that are indexed, that is, variables whose names contain the $ character. The easiest way to solve the problem is to use the flat data file instead of the rectangular data file. Alternatively, you can correct the syntax of the data description file so that it is compatible with SPSS for Windows.

  Q. What is the weight variable?

  A: In DHS Standard Recode files, the household sample weight variable is HV005 and the individual weight variable is V005.

In raw data files, the names will vary, but they generally include the word WEIGHT or WT.

This is an eight-digit variable with six implied decimal places. Always divide this variable by 1,000,000 before applying it.

When the sample is designed for a DHS survey, there is often interest in analyzing data for regional subsets within the sample population. When the expected number of cases for some of these regions is too small for analysis, it is necessary to oversample those areas.

During analysis, it is then necessary to "weight down" the oversampled areas and "weight up" the undersampled areas. Most DHS surveys require the use of sample weights during analysis. Always use the weight variable found in the DHS data set. Even surveys that come from a self-weighting sample have the value 1,000,000 stored in the weight variable.

Here is an example of code for an SPSS program in which the results are to be weighted:

COMPUTE WTVAR = V005/1000000.
WEIGHT BY WTVAR.

The weights in DHS data files have been normalized so that the total weighted number of cases equals the total number of cases unweighted.
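Outside SPSS, the same scaling applies. The Python sketch below computes a weighted proportion from a few made-up records. The field name `uses_method` and the values are hypothetical; only V005 and the division by 1,000,000 come from the text above.

```python
def weighted_proportion(records, indicator):
    """Share of cases where the indicator is true, using V005 / 1,000,000."""
    total = sum(r["V005"] / 1_000_000 for r in records)
    hits = sum(r["V005"] / 1_000_000 for r in records if r[indicator])
    return hits / total

# Hypothetical records; the scaled weights (1.5, 0.5, 1.0) sum to the
# number of cases, as they would for normalized weights.
women = [
    {"V005": 1500000, "uses_method": True},
    {"V005": 500000, "uses_method": False},
    {"V005": 1000000, "uses_method": True},
]

weighted = weighted_proportion(women, "uses_method")
unweighted = sum(r["uses_method"] for r in women) / len(women)
print(f"weighted: {weighted:.3f}, unweighted: {unweighted:.3f}")
```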

  Q. What are the inflation factors called AWFACTT, AWFACTU, etc.?

  A: Although most DHS samples are drawn from all women of reproductive age, quite a few are samples of ever-married women only. Because certain types of analysis (marital patterns and fertility) require samples of all women, DHS constructs inflation factors to permit ever-married women to also represent never-married women.

To calculate these factors, all de facto women listed in the households are tabulated for single year of age by background characteristics, according to ever-married status.

Household weights are used in the tabulations. The factor is calculated by dividing all women of a particular age and characteristic by the ever-married women of that age and characteristic. If no ever-married women are available for the denominator, the never-married women are accumulated with those in the previous or next single year of age.

These factors are stored as five-digit variables with two implied decimal places. The variables must be divided by 100 before being used.

Once divided, the factors have a minimum value of 1.0, meaning that all women of that age and characteristic category have been married. The factors can be quite large. A factor of 189.07 indicates that each ever-married woman of that age and characteristic category is representing herself and 188.07 never-married women.

These factors should not be confused with the sample weight variable. They should only be used to inflate the total n from ever-married women to all women for the analysis.

For example, a respondent in an ever-married sample has a value of 13.04 for the variable AWFACTT. If a mean number of children born for all women is desired, the number of her children will be added to the numerator, and 13.04 (rather than 1) will be added to the denominator.
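The calculation in this example can be sketched in Python as follows. The records are hypothetical, and sample weights are left out to keep the illustration short.

```python
def all_women_mean(records):
    """Mean children ever born for all women, from an ever-married sample.

    Each respondent adds her children to the numerator and her inflation
    factor (AWFACTT / 100) to the denominator.
    """
    numerator = sum(r["children"] for r in records)
    denominator = sum(r["AWFACTT"] / 100 for r in records)
    return numerator / denominator

# Hypothetical ever-married respondents; AWFACTT is stored with two
# implied decimal places, so 1304 means 13.04.
sample = [
    {"children": 3, "AWFACTT": 1304},
    {"children": 2, "AWFACTT": 100},
    {"children": 4, "AWFACTT": 110},
]

print(f"{all_women_mean(sample):.3f}")
```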

In DHS-I Standard Recode files, these all-women expansion factors were country-specific variables and only appeared in data files of ever-married samples. In DHS-II, these factors were made part of the Standard Recode. For countries in which all women were interviewed, these factors have a value of 100, producing a factor of 1.0 after division by 100.

  Q. How do I merge household, women, men, and wealth DHS data files?

  A: When merging women’s and men’s data files with their households, you need to use the cluster and household numbers. Since there is a “one-to-many” relationship between households and individuals, you should start with the individual data, women or men, as your “base” (or ‘unit of analysis’) and locate the correct household for each person.

In the household data, the cluster number is stored in HV001, and the household is in HV002. In the women’s data, the cluster is V001, the household is V002. In the men’s data, the equivalent variables are MV001 and MV002.

Another alternative is to use the household and individual case ID variables. The household ID is HHID, and is 12 characters long. In general, this variable will consist of the cluster and household numbers, but there are exceptions (see notes below). The case ID variable for women is CASEID, which is 15 characters long. It consists of the household ID for that person, with the person’s 3-digit line number appended to the end. So in SPSS, you could create a variable TMPID:

COMPUTE TMPID = SUBSTR(CASEID,1,12).

Then you would use TMPID to match with HHID in the household.
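The same match can be sketched outside SPSS. The Python example below starts from the women's records and looks up each woman's household by the first 12 characters of CASEID. The ID layout produced by `make_hhid` and the sample values are hypothetical; actual HHID layouts vary by survey.

```python
def make_hhid(cluster, household):
    """Build a 12-character household ID (hypothetical layout)."""
    return f"{cluster:>8}{household:>4}"

def household_id(caseid):
    """First 12 characters of CASEID, like SUBSTR(CASEID,1,12) in SPSS."""
    return caseid[:12]

# Household records keyed by HHID.
households = {
    make_hhid(1, 3): {"HV001": 1, "HV002": 3, "HV025": "urban"},
    make_hhid(1, 7): {"HV001": 1, "HV002": 7, "HV025": "rural"},
}

# Women's records: CASEID is HHID plus a 3-character line number.
women = [
    {"CASEID": make_hhid(1, 3) + f"{2:>3}", "V012": 27},
    {"CASEID": make_hhid(1, 3) + f"{4:>3}", "V012": 19},
    {"CASEID": make_hhid(1, 7) + f"{1:>3}", "V012": 34},
]

# Individuals are the base; each woman picks up her household's variables.
merged = [{**w, **households[household_id(w["CASEID"])]} for w in women]

for row in merged:
    print(row["CASEID"], row["V012"], row["HV025"])
```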

The wealth index is computed for households, not for individuals. Therefore, the case ID for these files, WHHID, is the equivalent to the household ID. To merge the household and wealth index files, set the two ID variables (HHID and WHHID) equal. To merge an individual data file with the wealth index, follow the same procedure as described above for merging individual and household data, but instead of HHID, use WHHID.

When merging files from the India NFHS-2 survey, you can still use HHID, WHHID, and CASEID as described above. But if you want to use the individual variables for cluster and household number, you must also use the State variables (HV024 in the households, and V024 in the women’s files). When matching the households or individuals to the village data, you must also use the village number (since there can be more than one village in a cluster), which is SHLOCAL in the households and SLOCAL in the women’s files. The village number variable in the village data is VVILLAGE.

  Q. How do I analyze the data at the district level? Where can I get information about district names?

  A: The vast majority of the datasets available from this website cannot be analyzed at the district level. This is because the samples were drawn to support analysis only at regional, urban/rural, or higher levels of aggregation. It is usually an inappropriate use of the data to analyze them at the district level, so information about district names is not available.




GPS Data

There are currently 2 FAQ(s) available:

  Q. How can I link geographic data to the DHS data?

  A: The geographic datafile contains the cluster ID that corresponds to the cluster ID in each of the survey datasets (individual woman, child, births, etc.). Depending on the type of analysis, you may choose to aggregate the DHS data or simply attach new information to the DHS data files within your GIS.
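As an illustration of the aggregation approach, the Python sketch below computes a weighted indicator per cluster from women's records and attaches coordinates by cluster ID. The coordinate values, the `indicator` field, and the structure of the `gps` lookup are hypothetical; only V001 and V005 are names used elsewhere in this FAQ.

```python
from collections import defaultdict

# Hypothetical women's records: cluster (V001), weight (V005), 0/1 indicator.
women = [
    {"V001": 1, "V005": 1200000, "indicator": 1},
    {"V001": 1, "V005": 800000, "indicator": 0},
    {"V001": 2, "V005": 1000000, "indicator": 1},
]

# Hypothetical geographic file contents, keyed by cluster ID.
gps = {
    1: {"lat": -1.25, "lon": 36.80},
    2: {"lat": -0.10, "lon": 34.75},
}

# Accumulate the weighted numerator and denominator for each cluster.
sums = defaultdict(lambda: [0.0, 0.0])
for w in women:
    weight = w["V005"] / 1_000_000
    sums[w["V001"]][0] += weight * w["indicator"]
    sums[w["V001"]][1] += weight

# One point per cluster: coordinates plus the aggregated value.
cluster_points = [
    {"cluster": c, **gps[c], "value": num / den}
    for c, (num, den) in sorted(sums.items())
]

for point in cluster_points:
    print(point)
```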

  Q. Why do the geographic locations appear to plot outside of my boundaries?

  A: The GPS data are likely to be much more precise than publicly available administrative boundaries, which have often been digitized from maps drawn at very small scales. When the highly precise GPS data are compared to less precise boundary data, it may appear that the GPS data are in the wrong place. It is advisable to use the best possible data; the same problem can also occur with road networks and other layers.

The datums of the data must also match. The GPS data are collected in WGS84 datum. If your other data are in a different datum, you should convert either the DHS data or your other data so that the datums are consistent. Most GIS software packages have a datum conversion function.


