Recoding variables

Using recode

The most frequent use of recode is to convert the numbers that stand in for missing data into proper missing values, as Stata understands them.

Very often at the coding stage, missing values (e.g. non-response, no available data) are coded as extreme numbers such as 99 or -99. Unless we tell Stata that those numbers represent missing data, it will treat them as ordinary numerical values, which distorts any statistic computed from the variable. So we need to recode those values as ., which tells Stata to treat those observations as “missing”.

Different datasets follow different conventions for how they initially code missing data, so we need to examine the data first to determine which numbers represent missing data.

codebook female

-------------------------------------------------------------------------------
female                                                                      Sex
-------------------------------------------------------------------------------

                  type:  numeric (int)
                 label:  V240, but 3 nonmissing values are not labeled

                 range:  [-5,2]                       units:  1
         unique values:  4                        missing .:  0/89,565

            tabulation:  Freq.   Numeric  Label
                            40        -5  
                            51        -2  No answer
                        42,723         1  
                        46,751         2  

In this case, we have missing values coded as -5 and -2, and there are 91 observations that have missing data.

To recode the values of a variable, we can use recode var rule, or recode var (rule) (rule) for several rules at once. Each rule takes the form original value(s) = recoded value, and more than one original value can appear on the left-hand side.

// Recode -5 and -2 to missing value
recode female (-5 -2 = .)
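
Used as above, recode overwrites the variable in place. A minimal sketch of a non-destructive alternative uses recode's generate() option, which writes the result to a new variable and leaves the original untouched. The toy data and the name female_clean are assumptions made for illustration and are not part of the dataset used here.

```stata
// Toy data for illustration only; clear drops whatever is in memory
clear
input int female
-5
-2
1
2
2
end

// Write the recoded values to a new variable (female_clean is a hypothetical name)
recode female (-5 -2 = .), generate(female_clean)

// Cross-tabulate old against new to see how each original code was mapped
tab female female_clean, missing
```

Keeping the original variable makes it possible to compare the two codings later without reloading the data.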

Always check that the recoding was done correctly. Use tab var, missing to display a frequency table that includes the missing values (shown as .).

tab female, missing

                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  1 |     42,723       47.70       47.70
                  2 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00

Here we see that we no longer have -5 and -2 in the data, and all 91 missing values have been properly recoded to .
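
The same check can also be written so that it fails loudly rather than relying on reading the table. The sketch below assumes female has just been recoded as above, so that only 1, 2, and . should remain.

```stata
// Stop with an error if any value other than 1, 2, or missing remains
assert inlist(female, 1, 2) | missing(female)

// Count the missing observations; this should match the 91 shown in the table
count if missing(female)
```

In a do-file, assert halts execution when the condition is false for any observation, which makes a recoding mistake hard to overlook.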

We can also choose to recode the variable to something that makes more intuitive sense, or something we prefer, if the recoding does not change what the value represents.

One such case is when we have a nominal variable. Since a nominal variable has categories with no inherent order or ranking, we can freely change the value that represents each category without affecting the substantive meaning.

For example, the variable female initially has 1 representing the category “Male” and 2 representing the category “Female”. Very often, it is more intuitive to code a dichotomous “Yes/No” variable as 1/0, so that here 1 means “Female” and 0 means “Male”.

// Recode 2 to 1, 1 to 0
recode female (2 = 1) (1 = 0)
tab female, missing
(female: 89474 changes made)


                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  0 |     42,723       47.70       47.70
                  1 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00
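
After this recoding the table shows bare 0s and 1s, so attaching value labels can make the output easier to read. This is a sketch that assumes female has been recoded to 0/1 as above; the label name female_lbl is hypothetical.

```stata
// Define a value label for the new coding (female_lbl is a hypothetical name)
label define female_lbl 0 "Male" 1 "Female"

// Attach it to the variable, replacing the label that was attached before
label values female female_lbl

// The table now shows the category names instead of the numeric codes
tab female, missing
```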

Using replace

Another way to recode a variable is to use the replace command, combined with an if condition built from relational and logical operators to select the observations to change.

// Recode all negative values to missing values
replace female = . if female < 0

// Recode 1 to 0
replace female = 0 if female == 1

// Recode 2 to 1
replace female = 1 if female == 2

tab female, missing
(91 real changes made, 91 to missing)

(42,723 real changes made)

(46,751 real changes made)


                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  0 |     42,723       47.70       47.70
                  1 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00
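
Because each replace acts on the values left by the previous one, the order of the commands above matters: recoding 2 to 1 before recoding 1 to 0 would turn every valid observation into 0. A minimal sketch of a one-step alternative builds the 0/1 variable from the original codes with generate. The toy data and the name female01 are assumptions made for illustration.

```stata
// Toy data for illustration only; clear drops whatever is in memory
clear
input int female
-5
-2
1
2
end

// (female == 2) evaluates to 1 when true and 0 when false;
// the if qualifier leaves every code other than 1 or 2 as missing
generate byte female01 = (female == 2) if inlist(female, 1, 2)

// Compare the original codes with the new indicator
tab female female01, missing
```

Since the new variable is computed from the original codes in a single pass, no intermediate value can be recoded twice.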