Political Science Research Methods on Political Science Research Methods

When Does Box Plot Hide Information?

Fri, 22 Mar 2019 00:00:00 -0400

A box plot is a powerful way to visualize the distribution of a continuous variable. However, it hides crucial information when our data is not uni-modal (i.e. when the distribution has more than one peak).

A box plot is very information-rich. From the graph, we can see:

The median value, as shown by the bar in the middle.
The inter-quartile range, shown by the total length of the box.
The 1st quartile (25th percentile) and the 3rd quartile (75th percentile), indicated respectively by the lower boundary and the upper boundary of the box.
Outlier values, indicated by individual dots plotted outside the range of the whiskers.
The approximate degree of dispersion in the data, shown by the length of the box.
- A shorter box indicates less spread in the middle half of the data (a smaller inter-quartile range), and a longer box indicates more spread.
Whether the distribution is symmetrical or skewed.
- If the median bar sits near the middle of the box, the distribution is approximately symmetrical; if the bar sits towards one end of the box, the distribution is skewed.

Symmetrical distribution; Relatively low variance; Outlier value

Skewed distribution; Slightly higher variance; No outlier

However, the box plot has one drawback: it hides the shape of the distribution if our data is bi-modal (or multi-modal).

For example, here we have some data that has a bi-modal distribution – the size of the Christian population as a percentage of a country’s total population.

If we draw a box plot for this data, this bi-modal property is completely hidden.

So if our data has more than one peak, a box plot is not the most appropriate graph to display the shape of the distribution. A good old histogram is a better choice in this context.
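As a rough illustration of this point, the following Stata sketch simulates a bi-modal variable and draws both graphs. The variable name x, the seed, and the two cluster centers (20 and 80) are made up for this example; this is not the Christian-population data used above.

```stata
// Simulated bi-modal data (hypothetical; not the data used in this post)
clear
set seed 12345
set obs 1000

// About half of the observations cluster around 20, the rest around 80
gen x = cond(runiform() < 0.5, rnormal(20, 5), rnormal(80, 5))

// The numbers behind the box: quartiles, median, and inter-quartile range
tabstat x, statistics(p25 median p75 iqr)

// The box plot draws one wide box and gives no sign of the two peaks
graph box x, name(box_x, replace)

// The histogram shows the two peaks directly
histogram x, name(hist_x, replace)
```

Comparing the two graph windows side by side should make the contrast visible: the summary statistics and the box are perfectly valid, but neither reveals that almost no observations sit near the median.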

Generating variables

Wed, 20 Mar 2019 00:00:00 +0000

Cloning existing variables

I prefer to keep the original dataset untouched, so I usually create a copy of the variables that I’m interested in and work with the copy. There are two ways to do this:

clonevar clone_varName = original_varName (preferred)
- Exact clone, including data values, labels etc.
gen clone_varName = original_varName (gen is short for generate)
- Only clones the data, not the labels

Let’s try this using the World Values Survey (Wave 6) data, and make a copy of V10, a question about subjective happiness.

use WV6_Data.dta, clear

gen happiness = V10
codebook happiness V10, compact



Variable     Obs Unique      Mean  Min  Max  Label
-----------------------------------------------------------------
happiness  89565      7  1.827209   -5    4  
V10        89565      7  1.827209   -5    4  Feeling of happiness
-----------------------------------------------------------------

We see that the values for happiness (our copy) and V10 are the same, but happiness does not have a variable label. Of course, we can always manually create a label for the new variable.

Now let’s try clonevar.

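// drop the copy made with gen above; otherwise clonevar stops because happiness already exists
drop happiness
clonevar happiness = V10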
codebook happiness V10, compact



Variable     Obs Unique      Mean  Min  Max  Label
-----------------------------------------------------------------
happiness  89565      7  1.827209   -5    4  Feeling of happiness
V10        89565      7  1.827209   -5    4  Feeling of happiness
-----------------------------------------------------------------

Both values and labels are preserved in our cloned copy of V10.

Creating a categorical variable

Let’s create a dichotomous variable for having children (Yes/No) from the original variable that shows how many children someone has.

We can do this by recoding a copy of the original variable.

gen have_children = V58
recode have_children (-5/-1 = .) (1/8 = 1)

Always check that the recoding was done correctly.

tab V58 have_children, missing


 How many children do |          have_children
             you have |         0          1          . |     Total
----------------------+---------------------------------+----------
                   -5 |         0          0         29 |        29 
                   -4 |         0          0      1,000 |     1,000 
                   -2 |         0          0        529 |       529 
                   -1 |         0          0        109 |       109 
          No children |    26,142          0          0 |    26,142 
              1 child |         0     14,297          0 |    14,297 
           2 children |         0     21,579          0 |    21,579 
           3 children |         0     12,356          0 |    12,356 
           4 children |         0      6,292          0 |     6,292 
           5 children |         0      3,230          0 |     3,230 
           6 children |         0      1,775          0 |     1,775 
                    7 |         0        991          0 |       991 
                    8 |         0      1,236          0 |     1,236 
----------------------+---------------------------------+----------
                Total |    26,142     61,756      1,667 |    89,565

Or, we can do the same by using replace. Note the condition V58 >= 1: with V58 > 1, respondents with exactly one child would be left as missing.

// start over: drop the version created with recode above
drop have_children
gen have_children = .
replace have_children = 1 if V58 >= 1 & !missing(V58)
replace have_children = 0 if V58 == 0

Again, check that the new variable was created correctly.

tab V58 have_children, missing


 How many children do |          have_children
             you have |         0          1          . |     Total
----------------------+---------------------------------+----------
                   -5 |         0          0         29 |        29 
                   -4 |         0          0      1,000 |     1,000 
                   -2 |         0          0        529 |       529 
                   -1 |         0          0        109 |       109 
          No children |    26,142          0          0 |    26,142 
              1 child |         0     14,297          0 |    14,297 
           2 children |         0     21,579          0 |    21,579 
           3 children |         0     12,356          0 |    12,356 
           4 children |         0      6,292          0 |     6,292 
           5 children |         0      3,230          0 |     3,230 
           6 children |         0      1,775          0 |     1,775 
                    7 |         0        991          0 |       991 
                    8 |         0      1,236          0 |     1,236 
----------------------+---------------------------------+----------
                Total |    26,142     61,756      1,667 |    89,565
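A third option, sketched here on toy data, is to generate the dichotomous variable from a logical expression in one line. The variable name children and its values are invented stand-ins for V58, so treat this as an illustration of the idiom rather than output from the survey data.

```stata
// Toy stand-in for V58 (negative codes = missing, 0 = no children)
clear
input children
-5
-1
0
1
2
8
end

// (children >= 1) evaluates to 1 when true and 0 when false;
// the if qualifier leaves negative codes and missing values as .
gen have_children = (children >= 1) if children >= 0 & !missing(children)

// Cross-tabulate old against new to check the result
tab children have_children, missing
```

The if qualifier matters here: without it, the negative missing-data codes would be coded as 0 ("no children") instead of missing.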

Labeling variables

Wed, 20 Mar 2019 00:00:00 +0000

Variable label

A variable label helps us know what the variable is about. The label also conveniently shows up as the axis title if we draw a graph.

describe happiness


              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
happiness       float   %9.0g

We can create a label to describe what the variable is measuring using label variable var_name "label".

label variable happiness "Feelings of happiness"
describe happiness



              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------
happiness       float   %9.0g                 Feelings of happiness

Value label

For categorical variables, we can create labels to show what each level of the variable represents. This is helpful when we produce a frequency table or draw a graph.

To define the labels, we first use the command label define label_name to create a new label and give it a name. Then, for each category/level, we specify the numerical value followed by its label, a character string enclosed in " " double quotes.

Lastly, we need to apply the label we have created (happiness_label) to the corresponding variable (happiness).

// first define the label
label define happiness_label 1 "Very Happy" 2 "Rather Happy" 3 "Not very happy" 4 "Not at all happy"

// then apply the label to the variable
label values happiness happiness_label

tab happiness




     Feelings of |
       happiness |      Freq.     Percent        Cum.
-----------------+-----------------------------------
              -5 |          6        0.01        0.01
              -2 |        238        0.27        0.27
              -1 |        514        0.57        0.85
      Very Happy |     29,256       32.66       33.51
    Rather Happy |     45,786       51.12       84.63
  Not very happy |     11,214       12.52       97.15
Not at all happy |      2,551        2.85      100.00
-----------------+-----------------------------------
           Total |     89,565      100.00
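To see how the pieces fit together outside the survey data, here is a small self-contained sketch that uses the auto dataset that ships with Stata. The label name origin_lbl, the label text, and the extra value 9 are invented for this example.

```stata
// Uses the auto dataset that ships with Stata
sysuse auto, clear

// origin_lbl is a hypothetical label name chosen for this example
label define origin_lbl 0 "Domestic-made" 1 "Foreign-made"
label values foreign origin_lbl

// Show the value-to-label mapping stored in origin_lbl
label list origin_lbl

// Frequency table with the labels, then with the underlying numbers
tab foreign
tab foreign, nolabel

// Add a label for another value later without redefining everything
// (the value 9 is made up; it does not occur in this variable)
label define origin_lbl 9 "Unknown", add
```

The add option is what one would use to label leftover codes such as -5, -2 and -1 in the happiness table above, assuming we know from the codebook what those codes mean.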

Recoding variables

Wed, 20 Mar 2019 00:00:00 +0000

Using `recode`

The most frequent use of recode is to convert the numbers that represent missing values into proper missing values, as understood by Stata.

Very often at the coding stage, missing values (e.g. non-response, no available data) are coded as extreme numbers such as 99 or -99. Unless we tell Stata that those numbers represent missing data, it will treat them as ordinary numerical values, which creates problems in the analysis. So we need to recode those values as ., which tells Stata to treat those observations as “missing”.

Different datasets have different conventions for how they initially code missing data, so we need to examine the data first to determine which numbers represent missing data.

codebook female


--------------------------------------------------------------------
female                                                           Sex
--------------------------------------------------------------------

                  type:  numeric (int)
                 label:  V240, but 3 nonmissing values are not labeled

                 range:  [-5,2]                       units:  1
         unique values:  4                        missing .:  0/89,565

            tabulation:  Freq.   Numeric  Label
                            40        -5  
                            51        -2  No answer
                        42,723         1  
                        46,751         2

In this case, we have missing values coded as -5 and -2, and there are 91 observations that have missing data.

To recode the values of a variable, we can use recode var rule, or recode var (rule) (rule), where the syntax for rule takes the form original value = recoded value.

// Recode -5 and -2 to missing value
recode female (-5 -2 = .)

Always check that the recoding was done correctly. Use tab var, missing to display a frequency table that includes the missing values (.).

tab female, missing


                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  1 |     42,723       47.70       47.70
                  2 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00

Here we see that we no longer have -5 and -2 in the data, and all 91 missing values have been properly recoded to .

We can also choose to recode the variable to something that makes more intuitive sense, or something we prefer, if the recoding does not change what the value represents.

One such case is when we have a nominal variable. Since a nominal variable has categories with no inherent order or ranking, we can freely change the value that represents each category without affecting the substantive meaning.

For example, the variable female initially has 1 representing category “Male”, and 2 representing category “Female”. Very often, it is more intuitive to code a dichotomous variable “Yes/No” as 1/0.

// Recode 2 to 1, 1 to 0
recode female (2 = 1) (1 = 0)
tab female, missing

(female: 89474 changes made)


                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  0 |     42,723       47.70       47.70
                  1 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00
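Both recoding steps can also be combined into a single recode call that writes the result to a new variable, which keeps the original untouched. The sketch below runs on toy data; the variable names sex and female01 are invented for illustration.

```stata
// Toy data standing in for the original coding
// (-5 and -2 are missing-data codes, 1 is male, 2 is female)
clear
input sex
-5
-2
1
1
2
2
2
end

// One pass: missing codes to ., 2 to 1, 1 to 0; the result goes into a new variable
recode sex (-5 -2 = .) (2 = 1) (1 = 0), generate(female01)

// Cross-tabulate old against new to verify every rule
tab sex female01, missing
```

Because each value is matched against the rules only once, recoding 2 to 1 and 1 to 0 in the same call does not cause the two rules to interfere with each other.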

Using `replace`

Another way to recode a variable is to use the replace command, combined with logical operators to subset the data.

// Recode all negative values to missing values
replace female = . if female < 0

// Recode 1 to 0
replace female = 0 if female == 1

// Recode 2 to 1
replace female = 1 if female == 2

tab female, missing

(91 real changes made, 91 to missing)

(42,723 real changes made)

(46,751 real changes made)


                Sex |      Freq.     Percent        Cum.
--------------------+-----------------------------------
                  0 |     42,723       47.70       47.70
                  1 |     46,751       52.20       99.90
                  . |         91        0.10      100.00
--------------------+-----------------------------------
              Total |     89,565      100.00
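A related built-in command is mvdecode, which converts listed numeric codes into missing values. The sketch below, on toy data with a made-up variable sex, maps each code to a different extended missing value (.a and .b) so that the reason for missingness is not lost. Which code means what is an assumption made for this illustration.

```stata
// Toy data: -5 and -2 are two different missing-data codes
clear
input sex
-5
-2
1
2
end

// -5 becomes .a and -2 becomes .b; both are treated as missing in analyses
mvdecode sex, mv(-5=.a \ -2=.b)

// .a and .b are listed separately in the table
tab sex, missing
```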

Univariate Distribution

Tue, 19 Mar 2019 00:00:00 +0000

Bar plot

To draw a bar plot of a categorical variable, we use the command graph bar, over(var).

By default, graph bar with no statistic specified shows percentages on the y-axis. The full command behind the scenes is in fact graph bar (percent), over(var), where the (percent) statistic can be omitted because it is the default.

// Default bar plot, percent
graph bar, over(Cheibub4Type)

We can change the default setting, and change the y-axis to frequency / count.

// Frequency bar plot
graph bar (count), over(Cheibub4Type)

We can also rotate the graph to display horizontal bars, using graph hbar. This is helpful when we want to plot a variable with many categories. If we have too many categories, the category names tend to get crowded in a vertical bar plot, whereas the horizontal display gives us enough space to display the category names properly.

// Horizontal bar plot
graph hbar, over(Cheibub6Type)

Histogram

To draw a histogram, we can use histogram or the abbreviated hist command.

// Life expectancy at birth, 2014 (UNDP 2014)
hist UNDP_Life2014

The default is density. We can change it to frequency, fraction, or percent.

hist UNDP_Life2014, freq

Some prefer to overlay a normal density curve on the histogram to see if the observed distribution is approximately symmetrical and bell-shaped.

hist UNDP_Life2014, normal

Density plot

A density plot is similar to a histogram, but is more “smoothed over”. To draw this, we use kdensity, which stands for “kernel density”.

kdensity UNDP_Life2014

Similarly, we can overlay a normal density plot over the kernel density plot.

kdensity UNDP_Life2014, normal

Box plot

See this post for more discussion on how to read a box plot, and on its drawback.

We can use graph box to draw a box plot.

// Life expectancy at birth, 2014 (UNDP 2014)
graph box UNDP_Life2014

For a one-variable box plot, the default graph does not look very nice. There are various aesthetic changes we can make. For example, we can use outergap() to increase the gap between the box and the margin (i.e. make the box narrower), and use intensity() to change the intensity/transparency of the fill color of the box.

// Life expectancy at birth, 2014 (UNDP 2014)
graph box UNDP_Life2014, outergap(100) intensity(50)

We can also rotate the box plot horizontally by telling Stata to draw graph hbox.

graph hbox UNDP_Life2014, outergap(100) intensity(50)

Dot plot

We can think of a uni-variate dot plot as a one-way scatter plot, where each observation is represented as a dot and plotted individually.

dotplot UNDP_Life2014

While a dot plot is a good way to display all the data (we can see each observation individually), it tends to get cluttered when we have a large sample size.

Q&A Week 8: Sampling and Survey Research

Fri, 01 Mar 2019 00:00:00 -0500

In class we talked about surveys having high external validity but weak internal validity. Does external validity take precedence over internal validity in terms of importance, or vice versa?

I would say that in general, it is more important to establish internal validity than external validity. If we can ensure internal validity, at the very least, we can claim to have gained some localized knowledge ($X$ causes $Y$ in the sample we have studied), even if this knowledge might not hold in another context.

However, if we cannot be sure that the findings in our current study are internally valid (i.e. if we are unable to establish a credible claim that it is indeed $X$ that caused a change in $Y$, rather than other confounding factors), then what’s the point of generalizing this invalid claim? Only when we have confidence in the internal validity of a study (more localized knowledge) does external validity become useful (it allows us to expand on this knowledge). Otherwise, generalizing a wrong-headed conclusion only compounds the initial error, like adding more floors to a building with a faulty foundation.

Are random sampling and randomization the same thing?

A random sample is a sample (i.e. the subset of the population that we include in the study) in which each unit is chosen randomly. This concerns which cases or subjects are in the study.

Randomization (a.k.a. random assignment) refers to the process of randomly assigning each unit in our study to receive the treatment or not. This concerns whether the units/subjects (who are already included in the study) receive the treatment or are placed in the control group.
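The two steps can be seen side by side in a short Stata sketch. Everything here is simulated: the population size, the sample size of 500, and the variable names id and treated are assumptions chosen only to illustrate the distinction.

```stata
// Hypothetical population of 10,000 units
clear
set seed 2019
set obs 10000
gen id = _n

// Random sampling: decides WHO is in the study (keep 500 units at random)
sample 500, count

// Random assignment: decides WHICH of the sampled units receive the treatment
gen treated = runiform() < 0.5
tab treated
```

A study can have one without the other: a survey typically has the first step only, while a lab experiment on a convenience sample typically has the second step only.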

How can we account for coverage error in experimental studies?

Depending on how the subjects are recruited into the experiment, coverage errors in experimental studies can be difficult to avoid. Recall that many experiments, especially lab experiments, rely on convenience samples, which usually leave part of the population uncovered by the sampling process. If subjects are recruited from among the undergraduates of one university, then anyone who is not an undergraduate at that university is excluded from the sample.

This problem can be difficult to “account for” if we are using a convenience sample, since it is built into the sampling process. However, other types of non-laboratory experiments (e.g. survey experiments or field experiments) often have better coverage, which mitigates (though does not fully eliminate) the problem of non-representative samples that comes with coverage errors.

Q&A Week 7: Comparative Studies

Fri, 22 Feb 2019 00:00:00 -0500

Can you explain more about the connection between Mill’s method of difference and experiments?

The two have very similar causal logic. They both try to establish a causal claim (the difference in $Y$ can be attributed to changes in $X$) by leveraging the fact that the treatment group ($X = 1$) and the control group ($X = 0$) are similar on the confounding variables ($Z$) and differ only on the treatment variable ($X$). Since the two groups are similar in every respect except $X$, any observed difference in $Y$ must be caused by the difference in $X$.

The method of difference tries to approximate comparable treatment and control groups by selecting cases with similar attributes except $X$ (mostly based on theory and domain knowledge about which factors could be confounding variables). Experiments try to achieve this by randomly assigning the treatment.

Does the concern about external validity apply only to the method of difference, or to the method of agreement as well?

It is a problem in both types of designs. The external validity issue is present in all studies where we only have a small number of non-randomly selected cases.

Is selecting on the dependent variable only a problem for the method of agreement?

Yes, it is only a problem for comparative designs that select cases using the method of agreement.

We say a study is “selecting on the dependent variable” when the criterion for including units in (or excluding them from) the study sample is correlated with the value of the dependent variable.

For the method of agreement, we are comparing cases that have the same outcome but differ in the values of the independent variables. In other words, we include these cases in the comparison because they share the same outcome, and other cases are excluded because they have a different outcome. The decision criterion for sample selection is directly related to the status of the dependent variable.

Designs using the method of difference for case selection are not selecting on the dependent variable. In this method, the criterion for selecting cases into the sample is not related to what the outcomes are. Instead, we select cases based on the independent variables: we compare cases that are similar on all the independent variables except the one crucial explanatory factor that we are interested in.

The lecture mentioned that the method of difference has trouble estimating “multiple causes”. What are some examples of “multiple causes” cases?

An event or outcome has multiple causes when more than one factor could have led to the outcome. For example, why does the U.S. have low voter turnout? There could be multiple factors behind this: no compulsory voting; low interest in politics; election day not being a national holiday; the two-party system; the winner-take-all system, etc.

Most of the phenomena we are interested in are quite complex, so we should expect multiple causes most of the time.

Q&A Week 6: Formal Models and Game Theory

Fri, 15 Feb 2019 00:00:00 -0500

Costs of Voting

Can you explain a bit more about the table for costs of voting?

This is the table I have in the recitation slides:

                                            | Voted = Yes                | Voted = No
--------------------------------------------+----------------------------+-----------
Election outcome = Preferred candidate won  | Benefits - Costs of Voting | Benefits
Election outcome = Preferred candidate lost | - Costs of Voting          | Zero

First, we have a few assumptions when analyzing the decision to vote from a rational choice framework:

The cost of voting is incurred (a negative payoff) if we vote, and is zero if we do not vote
Benefit is positive if our preferred candidate wins; and is zero if our preferred candidate loses
Chance of any individual vote changing the outcome is very low (close to zero)

The intuition behind the table is that no matter what the election outcome is (our preferred candidate wins or loses), for us personally, the net benefit is always higher if we do not vote than if we vote.

If our preferred candidate wins (first row), Benefits > Benefits - Costs of Voting. Net benefit is higher if we do not vote.
If our preferred candidate loses (second row), Zero > - Costs of Voting. Net benefit is higher if we do not vote.
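
The same comparison can be written compactly. The symbols below are introduced here for illustration and are not from the recitation slides: $B > 0$ is the benefit if our preferred candidate wins, $C > 0$ is the cost of voting, and $p$ is the probability that our single vote changes the outcome.

$$
\begin{aligned}
\text{Preferred candidate wins:} \quad & B > B - C \\
\text{Preferred candidate loses:} \quad & 0 > -C
\end{aligned}
$$

Both inequalities hold simply because $C > 0$. If we allow our vote to be pivotal with probability $p$, the expected gain from voting rather than abstaining is $pB - C$, which is negative whenever $p < C/B$. With $p$ close to zero, as assumed above, this condition holds unless the cost of voting is negligible relative to the benefit.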

Game Theory and Government Shutdown

Trump was prolonging the shutdown in order to get funding for the wall, is that a game theory/strategic interaction scenario?

Yes! Threatening or prolonging a government shutdown in order to leverage a “better deal”, when viewed through a strategic interaction lens, is quite similar to the game of chicken (a sort of brinksmanship).

This is a situation where both players benefit if both sides yield (take a compromise budget deal), both players lose if neither side yields (the government shutdown continues), but if only one player yields and the other doesn’t (e.g. Trump gives up, Congressional Democrats do not), then the player that yields loses and the other player benefits (no funding for the wall, the government shutdown ends).

See this NPR article for a more in-depth discussion of the incentives both sides faced that shaped this negotiation into political brinksmanship, and this FiveThirtyEight article on why we (the voters) are partly to blame for this.

Q&A Week 4: Natural Experiments and Observational Studies

Fri, 01 Feb 2019 00:00:00 -0500

Natural Experiments

In class you mentioned “Natural experiments based on geographical boundaries can be complicated by human factors”. Can you explain a bit more what this means?

Recall that the key assumption that ensures internal validity in a natural experiment design is that the treatment assignment is random or “as-if” random. In other words, we have to ask: is the treatment assignment correlated with any other factors that could cause the observed difference between the treatment and control groups? If yes, then the assumption does not hold and the study’s internal validity is weakened. If no, then the assumption of “as-if” randomization holds.

In the study on whether money from the lottery increases happiness, the assumption is that the treatment (winning money from the lottery) is randomly assigned among lottery buyers, hence whether someone is in the treatment group (lottery winners) or the control group (lottery losers) is not correlated with other factors that affect their happiness. In other words, treatment assignment (whether someone gets money) is independent of other confounding factors that could have affected the outcome (happiness).

In studies that leverage geographical boundaries for natural experiment opportunities, the generic set-up is to compare Area A (treatment group) on one side of the geographical boundary, which has received the treatment, with Area B (control group) on the other side of the boundary, which has not received the treatment. This means that we have to ask: is the treatment assignment (being on one side of the boundary vs the other side) correlated with any other factors that could explain the difference in outcomes between Area A and Area B?

So what I meant by “natural experiments based on geographical boundaries can be complicated by human factors” was that sometimes how the geographical boundaries are drawn is not independent of the characteristics of the humans/political actors who draw these boundaries (i.e. the division introduced by the boundary is not random). If the reasons for how boundaries are drawn correlate with reasons that could explain the outcome, then the “as-if” randomization assumption would not hold.

Think about Posner (2004), which we read for class, where Posner found that the relative size of the two ethnic groups (treatment) within each country explained why the cultural differences between the Chewa and Tumbuka ethnic groups are politically salient in Malawi but not in Zambia (outcome). He argued that the treatment assignment (being in a country where the two ethnic groups are relatively large vs relatively small) is “as-if” random (assignment is uncorrelated with other factors that could explain the outcome), because “like many African borders, the one that separates Zambia and Malawi was drawn purely for [colonial] administrative purposes, with no attention to the distribution of groups on the ground” (Posner 2004: 530).

If, however, the boundary that separates Zambia and Malawi had been drawn for reasons that potentially correlate with factors affecting inter-group interaction (say, natural resource availability), then the treatment assignment would no longer be “as-if” random.

How would we know if the “as-if randomization” assumption is valid?

Since we have no control over the treatment assignment process in natural experiments, we cannot really “prove” whether this “as-if” randomization assumption is valid. All we can do is provide evidence to show that this assumption is plausible.

For example, we can rely on theory and background knowledge to make the case: assignment through lottery is plausibly random because we know how the winners are chosen.

And for the Posner (2004) study, if there were some qualitative evidence (e.g. written records of how boundaries were decided) showing that the boundary was indeed “drawn purely for [colonial] administrative purposes, with no attention to the distribution of groups on the ground”, then that would be an important piece of evidence to support the “as-if” randomization claim.

We can also provide empirical evidence. Recall that randomly assigning the treatment gives us comparable treatment and control groups, i.e. the groups would, on average, be similar to each other in terms of any potential confounding variables. So we should expect an “as-if” randomization process to give us such comparable groups as well.

Researchers can measure the potential confounding variables and empirically test if the treatment and control groups are similar in those aspects. If we do not find any significant difference between the two groups in terms of those potential confounders, then that would be a piece of evidence supporting the “as-if” randomization assumption.
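
A balance check of this kind can be sketched in a few lines of Stata. The data below are simulated, and treated and income are hypothetical variable names; with real data we would run the same comparison for each measured confounder.

```stata
// Simulated data; treated and income are hypothetical variable names
clear
set seed 2019
set obs 500

// Treatment assigned by a coin flip, independent of everything else
gen treated = runiform() < 0.5

// A potential confounder that, by construction, is unrelated to treatment
gen income = rnormal(50, 10)

// Compare the mean of the confounder across treatment and control groups
ttest income, by(treated)
```

With real data, a small and statistically insignificant difference is consistent with “as-if” random assignment for that variable, though it says nothing about confounders we did not measure.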

Observational studies

Is there any way to get rid of confounding variables in observational studies?

Confounding variables will always be present (we cannot “get rid of them” per se), but we can reduce the bias that confounding variables introduce into our inference/conclusion.

Whenever we want to investigate if $X \rightarrow Y$, there will be confounding variables $Z$ lurking behind the scenes; that is just a feature of the world we live in. These confounding variables will introduce bias to our inference if we mistakenly conclude that the change in $Y$ is caused by $X$, while in fact the change in $Y$ was caused by $X$ and $Z$ (or $Z$ alone). This bias is often known by the jargon omitted variable bias.

When designing a study to investigate if $X \rightarrow Y$, one of our goals is to reduce any potential bias introduced by confounding variables, in order to isolate the effects of $X$ on $Y$ (how much of the change in $Y$ can be attributed to $X$, instead of $Z$).

Two common ways to reduce this bias in observational studies:

Statistically adjusting/controlling for observable confounding variables (i.e. include the “omitted” confounding variables in the statistical model, at least for those we have the data for).
If our data has multiple time points (i.e. panel data or time series data), statistically adjusting/controlling for observable and unobservable confounding variables by leveraging on the temporal nature of the data.

The jargon for these different techniques to isolate the effects of $X$ on $Y$ is “identification strategy”: strategies that help us identify the effects of $X$ on $Y$. Randomized experiments, natural experiments, and statistical adjustment for confounders are different types of identification strategies we can use.
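
To make the first strategy (statistical adjustment) concrete, here is a simulated example in Stata. The data-generating process, including the true effect of 0.5 and the coefficients on z, is invented for illustration; x, y and z simply mirror the $X$, $Y$ and $Z$ used above.

```stata
// Simulated data in which z confounds the effect of x on y
clear
set seed 2019
set obs 1000

gen z = rnormal()                    // confounder
gen x = 0.8*z + rnormal()            // z affects the treatment
gen y = 0.5*x + 1.0*z + rnormal()    // true effect of x on y is 0.5

// Omitting z: the coefficient on x absorbs part of the effect of z
regress y x

// Adjusting for z: the coefficient on x should be close to 0.5
regress y x z
```

The first regression illustrates omitted variable bias: its coefficient on x should come out noticeably larger than 0.5. This only works because z is observed; adjustment of this kind cannot help with confounders we have no data for.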

How are longitudinal studies and cross-sectional studies different?

We have a longitudinal study if we have data for each unit at multiple time points, i.e. every unit is measured more than once. For example, a study on whether emergency events boost presidential approval ratings (i.e. rally-round-the-flag effects) would be a longitudinal study (or more specifically, a time series study): the unit of analysis is presidential approval ratings, and we have measures for this unit at multiple time points, before and after the emergency events.

A cross-sectional study is one where we only have data for each unit at one time point. If we were to examine whether partisanship affects how individuals evaluate the president’s response to an emergency event, say a devastating hurricane, using a survey conducted after the hurricane, then that would be a cross-sectional study: the unit of analysis is individual survey respondents, and we only have measures for each person at one point in time (the time they responded to the survey).

Q&A Week 3: Experiments and Ethics

Fri, 25 Jan 2019 00:00:00 -0500

About Informed Consent

Yes, informed consent is an essential element of research ethics.

Generally speaking, we have to inform the participants of the purpose of our research (e.g. “This is a study about attitudes towards political candidates”), though we do not have to tell them the exact hypothesis of the study.

It is also important that the informed consent form lets the participants know about any potential benefits or harms of taking part in the study, any compensation or incentives, the confidentiality and privacy of the data, and their rights to decline and to withdraw, so they can make an informed decision about participating. In most cases, political science experiments only involve “minimal risks”, i.e. about the same probability and magnitude of harm we would experience in daily life.

For more details on the important elements to include when obtaining informed consent, see this guide from Pitt IRB, or the American Psychological Association (APA) ethics code (Section 8.02). It is possible to request waivers with adequate justification (see here for an overview of the requirements).

Does knowing you are part of an experiment affect how you respond? Is there a way to minimize the effects of this on the outcome?

Quite likely! One possibility is the Hawthorne effect: simply being part of the experiment and knowing that you are being observed might change your behavior or how you respond, compared to an everyday-life scenario.

A more general phenomenon (which some argue subsumes the Hawthorne effect) is called demand characteristics (also see textbook p.178). It refers to how participants’ interpretation of the experiment’s purpose could change their behavior (e.g. they might behave in ways that conform to what they think the researchers want to observe, or in ways that contradict what they perceive as the researchers’ hypothesis).

It is worth noting, however, that not all experiments are equally affected by this potential problem. We might expect experiments looking at behaviors that are more susceptible to social desirability bias to be more vulnerable to bias introduced by demand characteristics than those looking at more benign phenomena.

While it is difficult to eliminate this effect completely in most experiments, some strategies exist. For example, researchers can devise a design that uses covert or unobtrusive treatments, so the participants are unaware that they are part of an experiment (e.g. Enos 2014, Sands 2017).

Deception is another common, though deeply controversial, strategy. For example, audit experiments often rely on deception to examine socially undesirable behaviors such as discrimination (e.g. Butler and Broockman 2011) or violations of norms or rules (e.g. Findley, Nielson and Sharman 2014).

Of course, the use of deception always has to be justified in the ethics review process. See this newsletter (p.13-19) for further discussion on the ethics of using deception in field experiments involving public officials as subjects.

About the Montana GOTV Experiment

The Montana experiment misled people by using an official seal. How did they get the project approved in the first place?

Only the people involved in the process would ever know! If I were to hazard a guess (take it with many, many grains of salt), it is possible that the review process did not see the mailer as being intentionally misleading. Amid the commotion that followed this controversy, one detail about the mailer did not get much attention: there was in fact a disclaimer line disclosing that the mailer was part of an academic study (below the boxes indicating candidate ideology).

Mailer from the experiment. Squint a little to see the disclaimer. Retrieved from Internet Archive

Maybe the print is too fine, but it’s there. So you could make the argument that they were not actively trying to deceive the recipients about who was sending the mailer, and this might be part of the reason why the proposal was approved. Again, I have to emphasize that this is all speculation on my part.

How did the Montana experiment affect people’s decisions? I don’t see a discussion on how it actually influenced the turnout or election outcome.

We might never know! After the whole debacle, the study is unpublishable, partly due to the ethical issue and partly because the data is likely unusable, given the spillover/contamination effects caused by the news coverage. After the news outlets reported the story, those in the treatment group who had received the mailer would have learned where the mailer came from and why they were receiving it (the treatment is contaminated by extraneous factors that the researchers did not intend to provide), and those in the control group would also have learned about the information in the mailer despite not receiving one (treatment spillover).

About Experiments on Development Programs

Are there examples of ethical and effective anti-poverty experiments?

There are many examples of using randomized experiments to evaluate the impacts of anti-poverty programs. Some good places to look for them: Poverty Action Lab (J-PAL) (research center at MIT), GiveWell (nonprofit focused on effective charities).

Not a question, just an interesting observation: in the US during the 1960s-70s, there was a program similar to Universal Basic Income. It was ended after there was an increase in the divorce rate.

Hmm, this is really interesting to know. So if the experiment shows that UBI improves some aspects of life quality (e.g. household income, children’s education), but also has other “side-effects” such as an increased divorce rate, what should policy-makers make of this? What kind of “side-effects”, or how much of them, would be considered a “reasonable” trade-off? Going back to what we discussed at the beginning of the course, empirical evidence does not always lead to a neat solution to normative questions.

Example: Testing for measurement validity and reliability

Sat, 19 Jan 2019 00:00:00 -0500

Example: Racial Resentment Scale

The racial resentment scale is commonly used to measure symbolic racism. The scale contains four items; for each one, respondents indicate whether they agree or disagree with the statement on a five-point scale. The question wording and the respective variable names as they appear in the American National Election Studies (ANES) 2016 are given below:

V162211: ‘Irish, Italians, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without any special favors.’
V162212: ‘Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class.’
V162213: ‘Over the past few years, blacks have gotten less than they deserve.’
V162214: ‘It’s really a matter of some people not trying hard enough, if blacks would only try harder they could be just as well off as whites.’

The assumption is that agreeing with statements 1 and 4 (or disagreeing with statements 2 and 3) indicates resentment towards African Americans.

Validity Test

Construct validity

To test for construct validity, we need to demonstrate that the indicator predicts what it is supposed to predict.

One aspect of construct validity is convergent validity: if theoretically we expect X and Y to be positively related, do we see a positive correlation between the indicator for X and Y?

In this case, theoretically we might expect that feelings of resentment towards African Americans would correlate with negative affective attitudes towards the group.

For illustration purposes, let’s use just a single statement from the resentment scale, statement 2 (V162212) about the effects of slavery, and see if people’s answers to this question correlate with their feeling thermometer scores towards Blacks (V162312).

. twoway (scatter V162312 V162212) (lfit V162312 V162212)

We see that higher disagreement with the statement (1 = Strongly Agree, 5 = Strongly Disagree) correlates with lower scores on the feeling thermometer (higher value means warmer feeling towards the group, lower value means colder feeling). Resentment towards African Americans (as indicated by denying the effects of slavery on their present-day hardship) indeed predicts more negative attitudes towards them (as indicated by expressing less warm feelings).

Another way to demonstrate construct validity is to show divergent/discriminant validity: if theoretically we do not expect X and Y to be related, do we then see a low or weak correlation between them?

For instance, perhaps we do not expect feelings of racial resentment to be correlated with feelings towards the Supreme Court (V162102).

. graph twoway (scatter V162102 V162212) (lfit V162102 V162212)

We see that there is no discernible correlation between responses to statement 2 and feelings towards the Supreme Court.
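
The same two checks can also be expressed as correlation coefficients rather than scatter plots. The sketch below is a minimal Python illustration with made-up responses from ten hypothetical respondents (it does not use the ANES data, and the variable names are assumptions): a strong negative correlation is the pattern we would read as convergent validity here, and a correlation near zero is the pattern we would read as discriminant validity.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical responses (1 = Strongly Agree ... 5 = Strongly Disagree).
item_disagree = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
# Hypothetical feeling thermometer scores (0-100) for the same respondents.
therm_group = [90, 85, 80, 75, 70, 60, 55, 50, 40, 30]
therm_court = [50, 60, 55, 45, 60, 50, 45, 55, 50, 60]

convergent_r = pearson_r(item_disagree, therm_group)
discriminant_r = pearson_r(item_disagree, therm_court)

print(f"Item vs. group thermometer: r = {convergent_r:.2f}")
print(f"Item vs. court thermometer: r = {discriminant_r:.2f}")
```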

Reliability Test

One common way to quantify the reliability of a multiple-indicator scale is to calculate Cronbach’s alpha ($\alpha$).

This can be done in Stata using the simple command alpha, followed by the list of variables used in the scale.

. alpha V162211 V162212 V162213 V162214

Test scale = mean(unstandardized items)
Reversed items:  V162211 V162214

Average interitem covariance:     1.090995
Number of items in the scale:            4
Scale reliability coefficient:      0.8451

In the output, we can see a “Scale reliability coefficient”, which is about 0.85 in this case. A general rule of thumb is that a value above 0.8 indicates rather high reliability, and anything below 0.7 is a sign of an unreliable scale. So in this case, the four-item Racial Resentment Scale has rather high reliability.
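
To see what goes into this number, the usual textbook formula is $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i s_i^2}{s_{total}^2}\right)$, where $k$ is the number of items, $s_i^2$ is the variance of item $i$, and $s_{total}^2$ is the variance of the summed scale. The sketch below applies it in Python to made-up responses from six hypothetical respondents (not the ANES data). The function names are assumptions for illustration, and this is not a reproduction of Stata's internal computation. Items worded in the opposite direction are reverse-coded first, which is what the “Reversed items” line in the Stata output refers to.

```python
from statistics import variance


def reverse_code(responses, low=1, high=5):
    """Flip an item so that high values mean the same thing across items."""
    return [low + high - r for r in responses]


def cronbach_alpha(items):
    """items: list of equally long response lists, one list per scale item."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical five-point responses from six respondents.
q1 = [1, 2, 3, 4, 5, 2]
q2 = reverse_code([4, 4, 3, 1, 2, 5])   # worded in the opposite direction
q3 = reverse_code([5, 3, 3, 2, 1, 4])   # worded in the opposite direction
q4 = [2, 1, 4, 4, 5, 2]

alpha = cronbach_alpha([q1, q2, q3, q4])
print(f"Cronbach's alpha: {alpha:.2f}")
```

The more the items move together across respondents, the larger the variance of the total relative to the sum of the item variances, and the closer alpha gets to 1.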

Q&A Week 2: Measurement

Fri, 18 Jan 2019 00:00:00 -0500

Types of Measurement Errors

How to differentiate between systematic vs random error? Do you have any examples of systematic errors in measurement?

Systematic errors affect all units in the sample in the same direction (all measured values are consistently more positive or more negative than the true value). For example, a self-reported measure of turnout is likely to have positive systematic error: (most) people tend to over-report, saying that they have voted even though they have not.

Random errors do not affect all units in the sample in a consistent way: some units will be more positive than the true value, and some will be more negative than the true value. Let’s use self-reported turnout as an example again. Perhaps people’s transient feelings about the current election affect whether they say they have voted or not: those who happened to read a positive news story about the election are more likely to over-report having voted, and others who happened to read a negative news story are more likely to under-report.

Very often both types of errors could be present, so we need to think carefully about the sources of potential errors. For example, crime statistics can be very noisy, with a lot of random errors introduced at various stages of collecting such data. Furthermore, statistics on certain types of crimes might additionally have systematic errors: for example, domestic abuse might be systematically biased downwards if victims under-report due to fear of retaliation.
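
The difference between the two types of error shows up clearly in a small simulation. The sketch below is a hypothetical Python illustration (the 0-100 scale, the error sizes, and the variable names are all assumptions): random error pushes units up and down so that the errors average out to roughly zero, while systematic error pushes every unit in the same direction so that the average error does not shrink.

```python
import random
from statistics import mean

rng = random.Random(2019)  # fixed seed so the sketch is reproducible

# Hypothetical "true" scores for 10,000 units on a 0-100 scale.
true_scores = [rng.uniform(0, 100) for _ in range(10_000)]

# Random error only: each unit is pushed up or down by chance.
noisy = [t + rng.gauss(0, 5) for t in true_scores]

# Systematic error only: every unit is pushed up by the same 8 points.
biased = [t + 8 for t in true_scores]

mean_random_error = mean(m - t for m, t in zip(noisy, true_scores))
mean_systematic_error = mean(m - t for m, t in zip(biased, true_scores))

print(f"Average error with random error only:     {mean_random_error:.2f}")
print(f"Average error with systematic error only: {mean_systematic_error:.2f}")
```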

Which error (systematic vs random) is worse? Which one should we try to avoid more?

Both types of errors are bad news! But they affect our analysis in different ways.

High random error adds more noise/variability to our data, which makes it harder to detect a significant correlation between X and Y. In other words, noisy measures are bad because they increase the likelihood of a false negative: we are likely to mistakenly infer that there is no relationship between X and Y, when in fact there is.
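
A quick simulation makes this concrete. In the hypothetical Python sketch below (all numbers and names are assumptions for illustration), Y truly depends on X, but when X is measured with a lot of random error the observed correlation becomes much weaker than the one we would see with a clean measure.

```python
import random
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


rng = random.Random(2019)  # fixed seed so the sketch is reproducible
n = 5_000

x_true = [rng.gauss(0, 1) for _ in range(n)]
y = [x + rng.gauss(0, 1) for x in x_true]          # Y truly depends on X
x_noisy = [x + rng.gauss(0, 2) for x in x_true]    # X measured with noise

r_clean = pearson_r(x_true, y)
r_noisy = pearson_r(x_noisy, y)

print(f"Correlation using the clean measure of X: {r_clean:.2f}")
print(f"Correlation using the noisy measure of X: {r_noisy:.2f}")
```

With a smaller sample, a correlation this weak could easily fail to reach statistical significance, which is the false negative described above.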

For systematic errors, recall that indicators with high systematic error are invalid, i.e. they are not capturing the concept of interest accurately. In such a case, an invalid indicator will never lead us to the right conclusion (think of a road sign that points in the wrong direction), even if the indicator is measured with zero random error.

One of them has got to be invalid..

In terms of which type of error is worse, one way I think about this is that an invalid indicator is more like a fatal disease, and an unreliable indicator is more like a non-fatal but chronic disease that requires lots of care. So if a study is using invalid indicators, we cannot draw any meaningful inferences about the phenomenon we are investigating (the study is “dead”), while unreliable indicators make it harder to detect a true positive (they increase uncertainty, but do not spell doom).

The textbook also has a good discussion on the different problems associated with measurement reliability and validity in political science (p.143-145).

| | Systematic error = High | Systematic error = Low |
|---|---|---|
| Random error = High | Very very bad. Invalid and unreliable measure. Lots of noise, and the signal is pointing in the wrong direction. | Problematic, but can live with. Valid, but unreliable measure. Lots of noise, harder to detect the signal; more likely to get a false negative. |
| Random error = Low | Problematic. Invalid, but reliable measure. Measure does not capture the concept of interest; conclusion does not bear on the actual phenomenon of interest. | Awesome! Valid and reliable measure. Move along. |

Ideally, we should try to minimize both types of measurement errors. The degree of random error can be empirically assessed (e.g. using Cronbach’s alpha, see example here), and can be reduced (e.g. using multiple indicators). Systematic error, however, is harder to detect, harder to quantify, and harder to correct for.

About the True Score Theory $T = X + \epsilon$, how do we know how close our measured value $X$ is to the true score $T$, if we cannot truly know $T$?

Unfortunately, we can never be 100% sure what the value of $T$ is. As mentioned above, while we can detect and correct for random errors, systematic errors cannot be corrected using statistical procedures. After we’ve done our best to minimize random error, it is up to the strength of our theory, clarity of conceptualization, and a small leap of faith to convince others (and ourselves), that our measures are indeed valid ones. This is also part of the reason why social science research can only establish a probabilistic relationship (confident within a certain range) and never a deterministic relationship. Embrace the uncertainty!

If all indicators are measured with some degree of random error, can too many indicators introduce more random error?

Although every single indicator would be measured with some random error, if we combine the multiple indicators into an index, or take their average value, we should have lower random error compared to using a single indicator, because the random errors in the individual indicators tend to cancel each other out.
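
The logic can be checked with a short simulation. The Python sketch below is a hypothetical illustration (the scale, the noise level, and the names are assumptions), and it assumes the random errors are independent across indicators. Under that assumption, averaging $k$ indicators that each have error variance $\sigma^2$ gives an index with error variance $\sigma^2 / k$, so with four indicators the typical error should be about half as large.

```python
import random
from statistics import mean, pstdev

rng = random.Random(2019)  # fixed seed so the sketch is reproducible
n_units, n_indicators, noise_sd = 5_000, 4, 10

# Hypothetical "true" scores on a 0-100 scale.
true_scores = [rng.uniform(0, 100) for _ in range(n_units)]

single_errors, index_errors = [], []
for t in true_scores:
    # Each indicator is the true score plus its own independent random error.
    indicators = [t + rng.gauss(0, noise_sd) for _ in range(n_indicators)]
    single_errors.append(indicators[0] - t)       # using one indicator only
    index_errors.append(mean(indicators) - t)     # using the average of all

sd_single = pstdev(single_errors)
sd_index = pstdev(index_errors)

print(f"Typical error, single indicator: {sd_single:.1f}")
print(f"Typical error, 4-item average:   {sd_index:.1f}")
```

Note that this only helps with random error: if all four indicators shared the same systematic error, averaging them would leave that bias untouched.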

Measurement Reliability and Validity

Is there a good analogy to help remember the difference between validity and reliability?

In class, I made an analogy comparing a valid indicator to a correct label (indicator) that matches the content of a box (concept) that you want to buy but cannot see inside.

Houston, we have an invalid indicator problem.

I don’t really have a good one for reliability, so let’s stretch the same label-on-a-box analogy a bit further. Suppose we have a machine printing the labels for the boxes. Although the label correctly matches the box content (valid indicator), the machine sometimes misprints a letter or two, so not all labels look the same (unreliable). And if we have Machine A that produces 5% misprinted labels, and Machine B that produces 15% misprinted labels, then we can say that B is less reliable (produces less consistent outcomes).

What are some examples of face validity?

Whenever you see an indicator used to measure a particular concept, simply ask yourself: does the measure appear to capture the concept you care about? If yes, then the measure has high face validity; if not, then it has low face validity.

Say I want to measure a country’s level of human rights protection. Which of the following indicators has higher face validity?

Gini coefficient
Number of political imprisonments

You probably have an answer in mind. Let’s try another one: now I want to measure a country’s income inequality. Which indicator has higher face validity?

Gini coefficient
Number of political imprisonments

Again, you have an answer, and you are probably right.

A few things I’d like to highlight from this example:

Assessing face validity relies on domain knowledge. We first need to know what “human rights protection” means; only then can we see that more political imprisonment is an indicator of low levels of human rights protection.
Assessing face validity is largely a matter of judgment informed by domain knowledge, rather than empirical demonstration.
Indicator validity is always assessed relative to the concept we are trying to capture, rather than being something inherent to the indicator itself. The Gini coefficient is a valid indicator for income inequality, but not for human rights protection.

When do we test for construct vs face validity?

Ideally both, and more if possible. Since invalid measures are really bad news, assessing the validity of a measure in multiple ways increases our confidence in it.

Face validity is rarely explicitly tested for: we already implicitly assess face validity when we choose which indicators to use to measure the concept. Although having face validity is important, high face validity alone is rather weak evidence.

Construct validity can be empirically assessed in two ways: convergent validity and divergent validity. See here for an example.

Generally, if we are using measures that have been used in published literature, we do not have to conduct a separate validity test. The assumption is that they have already been validated (though we should still remain critical). If we are using new measures in our study, instead of established ones used in published literature, then it is recommended to first conduct a pilot study to test the measures’ validity and reliability. Use the measures as part of the actual study only after we know they are valid and reliable.

About the article we read on using IAT/video games to measure implicit racial bias, how is the reliability of the measure determined? If the same respondent takes the test twice and gets different scores but in the same direction (e.g. at first longer, then shorter time), is the measure considered reliable?

For the IAT, the actual computation of the score takes quite a few steps, but to simplify it a bit, it is the reaction time differential that is used as a measure of implicit racial bias (see the test procedure here). So in this case, the test can be considered reliable if a respondent shows the same directional preference (e.g. consistently faster at the White-Pleasant association than the Black-Pleasant association) when taking the test multiple times.

For other tests, however, it could be the case that the time difference itself, rather than the directional difference, is used as the measure.

In general, test-retest reliability is measured as the degree of correlation between the different test scores, rather than their absolute difference. In psychology, the rule of thumb is that a test-retest reliability > 0.7 is an acceptable level, though this is no more than a convention used by researchers. The IAT has a test-retest reliability of about 0.6.
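
The distinction between correlation and absolute difference can be seen in a toy example. The Python sketch below uses made-up scores for five hypothetical respondents (all names and numbers are assumptions). In the first retest everyone's score shifts by the same amount, so the ordering of respondents is perfectly preserved; in the second retest the scores are about as far from the originals on average, but the ordering is scrambled.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def mean_abs_gap(x, y):
    """Average absolute difference between paired scores."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Hypothetical scores for five respondents who take the same test twice.
first_round = [10, 20, 30, 40, 50]
shifted_round = [score + 15 for score in first_round]   # everyone moves by 15
scrambled_round = [30, 10, 50, 20, 40]                  # ordering changes

shifted_r = pearson_r(first_round, shifted_round)
scrambled_r = pearson_r(first_round, scrambled_round)

print(f"Shifted retest:   r = {shifted_r:.2f}, "
      f"mean gap = {mean_abs_gap(first_round, shifted_round)}")
print(f"Scrambled retest: r = {scrambled_r:.2f}, "
      f"mean gap = {mean_abs_gap(first_round, scrambled_round)}")
```

Judged by correlation, the shifted retest is perfectly reliable even though no respondent got the same score twice, while the scrambled retest is not.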

This podcast has a pretty interesting discussion on the use and critique of IAT.

Levels of Measurement

Can you elaborate more on meaningful vs relative/arbitrary zero point, and how that relates to interval and ratio measures?

A variable with meaningful zero point means that we can interpret the zero value as the absence of that variable. For example, income measured in dollars has a meaningful zero — we can interpret income = 0 to mean an absence of income. So if someone reported zero on this measure, we know this person has no income.

On the other hand, if the variable has a relative, or arbitrary, zero point, we cannot interpret the zero value on that variable as the absence of that variable. Say we have a set of 5 questions to measure people’s political knowledge. Every correct answer gets you 1 point, and every wrong answer gets you 0 points, which gives us a range of possible scores from 0 to 5. If Ann gets score = 0 on this scale, we cannot say that Ann has no political knowledge at all. The zero here is simply an arbitrary point to signal a very low level of political knowledge.

So how does this relate to interval vs ratio measures? Interval measures have relative/arbitrary zero points, and ratio measures have absolute/meaningful zero points. For the most part, the difference is only apparent (or we only need to pay attention to the difference) when we analyze and interpret the data.

For interval measures, since the zero point is arbitrary and lacks any meaningful interpretation, we cannot compare differences in terms of proportion. It only makes sense to compare differences in magnitude. Going back to the political knowledge example, if Beth gets score = 2 on the political knowledge scale, and Cathy gets score = 4 on the same scale, we know that: 1) Cathy is more knowledgeable than Beth, and 2) the magnitude of the difference is 2 more correct answers. However, since the zero point is arbitrary in this case, we cannot say Cathy is two times more knowledgeable than Beth. Or if we observe that Beth’s score increased from 2 to 3 after attending a civics education workshop, we cannot say that Beth’s political knowledge increased by 50%.

Let’s compare this to a ratio scale, income, which has a meaningful zero point. If Abe reported income = 20k, and Ben reported income = 40k, we know that 1) Ben has a higher income than Abe, 2) Ben’s income is 20k higher than Abe’s, and 3) Ben’s income is twice as much as Abe’s.
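
One way to see why ratios are not meaningful on an interval scale is to relabel the scale. In the hypothetical Python sketch below (names and numbers follow the examples above; the relabeling itself is an assumption for illustration), scoring the knowledge quiz from 1 to 6 instead of 0 to 5 is just as legitimate because the zero point is arbitrary. The difference between Beth and Cathy survives the relabeling, but the ratio does not. For income, the zero point is fixed, so only the unit can change, and the ratio survives.

```python
# Interval scale (knowledge quiz): the zero point is arbitrary,
# so relabeling the scale from 0-5 to 1-6 is equally legitimate.
beth, cathy = 2, 4
shift = 1

diff_original = cathy - beth
diff_shifted = (cathy + shift) - (beth + shift)
ratio_original = cathy / beth
ratio_shifted = (cathy + shift) / (beth + shift)

print(f"Difference: {diff_original} vs {diff_shifted} (unchanged)")
print(f"Ratio:      {ratio_original:.2f} vs {ratio_shifted:.2f} (changed)")

# Ratio scale (income): zero means "no income", so only the unit can change.
abe_usd, ben_usd = 20_000, 40_000
ratio_usd = ben_usd / abe_usd
ratio_thousands = (ben_usd / 1000) / (abe_usd / 1000)

print(f"Income ratio: {ratio_usd:.2f} in dollars, "
      f"{ratio_thousands:.2f} in thousands of dollars (unchanged)")
```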

Can a measure be both interval and ratio?

The four levels of measurement are mutually exclusive categories. The flow chart below should help you to distinguish the four categories.
