Table of Contents
Types of Measurement Errors
How to differentiate between systematic vs random error? Do you have any examples of systematic errors in measurement?
Systematic errors affect all units in the sample in the same direction (all measured values are consistently higher or lower than the true value). For example, a self-reported measure of turnout is likely to have positive systematic error – (most) people tend to over-report, saying that they have voted even when they have not.
Random errors do not affect all units in the sample in a consistent way – some measured values will be higher than the true value, some will be lower. Let’s use self-reported turnout as an example again. Perhaps people’s transient feelings about the current election affect whether they say they have voted – those who happened to read a positive news story about the election are more likely to over-report having voted, while others who happened to read a negative news story are more likely to under-report.
Very often both types of errors could be present, so we need to think carefully about the sources of potential errors. For example, crime statistics can be very noisy, with a lot of random errors introduced at various stages of collecting such data. Furthermore, statistics on certain types of crimes might additionally have systematic errors: for example, domestic abuse might be systematically biased downwards if victims under-report due to fear of retaliation.
Which error (systematic vs random) is worse? Which one should we try to avoid more?
Both types of errors are bad news! But they affect our analysis in different ways.
High random error adds noise/variability to our data, which makes it harder to detect a significant correlation between X and Y. In other words, noisy measures are bad because they increase the likelihood of a false negative – we are likely to mistakenly infer that there is no relationship between X and Y when in fact there is.
For systematic errors, recall that indicators with high systematic error are invalid, i.e. they do not capture the concept of interest accurately. In such a case, an invalid indicator will never lead us to the right conclusion (think of a road sign that points in the wrong direction), even if the indicator is measured with zero random error.
One of them has got to be invalid…
In terms of which type of error is worse, one way I think about this is that an invalid indicator is more like a fatal disease, while an unreliable indicator is more like a non-fatal but chronic disease that requires lots of care. So if a study uses invalid indicators, we cannot draw any meaningful inferences about the phenomenon we are investigating (the study is “dead”), while unreliable indicators make it harder to detect a true positive (they increase uncertainty, but do not spell doom).
The textbook also has a good discussion of the different problems associated with measurement reliability and validity in political science (pp. 143–145).
| | Systematic error = High | Systematic error = Low |
|---|---|---|
| Random error = High | Very, very bad • Invalid and unreliable measure • Lots of noise, and the signal points in the wrong direction | Problematic, but can live with • Valid but unreliable measure • Lots of noise, harder to detect the signal; more likely to get a false negative |
| Random error = Low | Problematic • Invalid but reliable measure • Measure does not capture the concept of interest; conclusions do not bear on the actual phenomenon of interest | Awesome! • Valid and reliable measure • Move along |
Ideally, we should try to minimize both types of measurement error. The degree of random error can be empirically assessed (e.g. using Cronbach’s alpha, see example here) and can be reduced (e.g. by using multiple indicators). Systematic error, however, is harder to detect, harder to quantify, and harder to correct for.
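As a sketch of how reliability can be assessed empirically, here is a minimal Cronbach’s alpha computation in Python. All the scores are invented for illustration, and the 0.7 cutoff mentioned in the comment is only the common convention:

```python
import numpy as np

# Hypothetical data: 5 respondents answering 3 indicators of the same concept,
# each scored on a 1-5 scale. Rows = respondents, columns = indicators.
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 4, 5],
    [3, 3, 2],
    [1, 2, 1],
])

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: (k/(k-1)) * (1 - sum of item variances / variance of total)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each indicator
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # prints 0.93 - above the conventional 0.7 threshold
```

A higher alpha indicates that the indicators vary together, i.e. the measure is more internally consistent (reliable).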
About the True Score Theory $X = T + \epsilon$: how do we know how close our measured value $X$ is to the true score $T$, if we cannot truly know $T$?
Unfortunately, we can never be 100% sure what the value of $T$ is. As mentioned above, while we can detect and correct for random errors, systematic errors cannot be corrected using statistical procedures. After we’ve done our best to minimize random error, it is up to the strength of our theory, clarity of conceptualization, and a small leap of faith to convince others (and ourselves), that our measures are indeed valid ones. This is also part of the reason why social science research can only establish a probabilistic relationship (confident within a certain range) and never a deterministic relationship. Embrace the uncertainty!
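The point that random error can be averaged away while systematic error cannot be corrected statistically can be illustrated with a small simulation. The true score, error size, and bias below are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50.0  # hypothetical true score (unknowable in practice)

# Random error only: measurements scatter around T in both directions.
random_only = T + rng.normal(0, 5, size=10_000)

# Random + systematic error: a constant upward bias of +10 (e.g. over-reporting).
with_bias = T + 10 + rng.normal(0, 5, size=10_000)

# Averaging many measurements cancels random error but not systematic error.
print(round(random_only.mean(), 1))  # close to 50: random error washes out
print(round(with_bias.mean(), 1))    # close to 60: still off by the bias
```

No amount of averaging recovers $T$ in the biased case – which is why validity must ultimately rest on theory and conceptualization rather than statistics.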
If all indicators are measured with some degree of random error, can using too many indicators introduce more random error?
Although every single indicator is measured with some random error, if we combine multiple indicators into an index, or take their average value, the result will have lower random error than any single indicator.
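A quick simulation sketch of why averaging helps (all numbers hypothetical): with independent random errors, the spread of an averaged index of $k$ indicators shrinks by a factor of $\sqrt{k}$.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50.0          # hypothetical true score
n_units = 10_000  # number of units being measured

# Each of 4 indicators measures T with independent random error (sd = 8).
indicators = T + rng.normal(0, 8, size=(n_units, 4))

one_indicator = indicators[:, 0]
index = indicators.mean(axis=1)  # simple averaged index of all 4 indicators

# The averaged index has markedly less random error than any single indicator.
print(round(one_indicator.std(), 1))  # about 8
print(round(index.std(), 1))          # about 4 (= 8 / sqrt(4))
```

Note this only works for random error; if all four indicators shared the same systematic bias, averaging would preserve it.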
Measurement Reliability and Validity
Is there a good analogy to help remember the difference between validity and reliability?
In class, I made an analogy comparing a valid indicator to a correct label (indicator) matching the content of a box (concept) that you want to buy but cannot see inside.
Houston, we have an invalid indicator problem.
I don’t really have a good one for reliability, so let’s stretch the same label-on-a-box analogy a bit further. Suppose we have a machine printing the labels for the boxes. Although the label correctly matches the box content (valid indicator), the machine sometimes misprints a letter or two, so not all labels look the same (unreliable). And if Machine A produces 5% misprinted labels while Machine B produces 15%, we can say that B is less reliable (it produces less consistent outcomes).
What are some examples of face validity?
Whenever you see an indicator used to measure a particular concept, simply ask yourself: does the measure appear to capture the concept you care about? If yes, the measure has high face validity; if not, it has low face validity.
Say I want to measure a country’s level of human rights protection. Which of the following indicators has higher face validity?
- Gini coefficient
- Number of political imprisonments
You probably have an answer in your mind. Let’s try another one: now I want to measure a country’s income inequality, which indicator has a higher face validity?
- Gini coefficient
- Number of political imprisonments
Again, you have an answer, and you are probably right.
A few things I’d like to highlight from this example:
- Assessing face validity relies on domain knowledge. We need to first know what “human rights protection” means; only then can we see that more political imprisonment indicates a low level of human rights protection.
- Assessing face validity is largely a judgment based on domain knowledge, rather than an empirical demonstration.
- Indicator validity is always assessed relative to the concept we are trying to capture, rather than being something inherent to the indicator itself. The Gini coefficient is a valid indicator for income inequality, but not for human rights protection.
When do we test for construct vs face validity?
Ideally both, and more if possible. Since having invalid measures is really bad news, assessing the validity of a measure in multiple ways increases our confidence in it.
Face validity is rarely explicitly tested for – we already implicitly test for face validity when choosing which indicators to use to measure the concept. Although having face validity is important, high face validity alone is rather weak evidence.
Construct validity can be empirically assessed in two ways: convergent validity and divergent validity. See here for an example.
Generally, if we are using measures that have been used in published literature, we do not have to conduct a separate validity test. The assumption is that they have already been validated (though we should still remain critical). If we are using new measures in our study, rather than established ones from published literature, then it is recommended to first conduct a pilot study to test the measure’s validity and reliability. Use the measures as part of the actual study only after we know they are valid and reliable.
About the article we read on using IAT/video games to measure implicit racial bias, how is the reliability of the measure determined? If the same respondent takes the test twice and gets different scores but in the same direction (e.g. at first a longer, then a shorter time), is the measure considered reliable?
For the IAT, the actual computation of the score takes quite a few steps, but to simplify a bit, it is the reaction time differential that is used as a measure of implicit racial bias (see the test procedure here). So in this case, the test can be considered reliable if the respondent shows the same directional preference (e.g. consistently faster at White–Pleasant associations than Black–Pleasant associations) when taking the test multiple times.
For other tests, however, it could be the case that the time difference itself, rather than the directional difference, is used as the measure.
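To make the distinction concrete, here is a hypothetical sketch (all reaction times are invented, and this is not the actual IAT scoring algorithm) of a differential whose magnitude varies across sessions while its direction stays consistent:

```python
import numpy as np

# Hypothetical reaction times (ms) for one respondent across two sessions.
# A positive differential = faster at White-Pleasant than Black-Pleasant pairings.
session1 = {"white_pleasant": np.array([620., 650., 640.]),
            "black_pleasant": np.array([720., 700., 710.])}
session2 = {"white_pleasant": np.array([600., 630., 610.]),
            "black_pleasant": np.array([660., 680., 670.])}

def differential(session):
    return session["black_pleasant"].mean() - session["white_pleasant"].mean()

d1, d2 = differential(session1), differential(session2)
print(d1, d2)             # the magnitudes differ across sessions...
print(d1 > 0 and d2 > 0)  # ...but the directional preference is consistent
```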
In general, test-retest reliability is measured as the degree of correlation between the different test scores, rather than their absolute difference. In psychology, the rule of thumb is that a test-retest reliability > 0.7 is acceptable, though this is no more than a convention used by researchers. The IAT has a test-retest reliability of about 0.6.
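A minimal sketch of computing test-retest reliability as a correlation, using invented scores for eight respondents taking the same test twice:

```python
import numpy as np

# Hypothetical scores for 8 respondents on two administrations of a test.
test1 = np.array([12, 35, 28, 40, 22, 18, 31, 25], dtype=float)
test2 = np.array([15, 33, 30, 38, 20, 21, 29, 27], dtype=float)

# Test-retest reliability is the correlation between the two administrations,
# not the absolute difference in scores.
r = np.corrcoef(test1, test2)[0, 1]
print(round(r, 2))  # well above the conventional 0.7 threshold
```

Note that the two sets of scores differ in absolute terms, yet reliability is high because respondents keep roughly the same relative ordering.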
This Podcast has a pretty interesting discussion on the use and critique of IAT.
Levels of Measurement
Can you elaborate more on meaningful vs relative/arbitrary zero point, and how that relates to interval and ratio measures?
A variable with meaningful zero point means that we can interpret the zero value as the absence of that variable. For example, income measured in dollars has a meaningful zero — we can interpret income = 0 to mean an absence of income. So if someone reported zero on this measure, we know this person has no income.
On the other hand, if the variable has a relative, or arbitrary, zero point, we cannot interpret the zero value as the absence of that variable. Say we have a set of 5 questions to measure people’s political knowledge. Every correct answer gets you 1 point, and every wrong answer gets you 0 points, which gives a range of possible scores from 0 to 5. If Ann gets score = 0 on this scale, we cannot say that Ann has no political knowledge at all. The zero here is simply an arbitrary point signaling a very low level of political knowledge.
So how does this relate to interval vs ratio measures? Interval measures have relative/arbitrary zero points, and ratio measures have absolute/meaningful zero points. For the most part, the difference is only apparent (or we only need to pay attention to the difference) when we analyze and interpret the data.
For interval measures, since the zero point is arbitrary and lacks any meaningful interpretation, we cannot compare differences in terms of proportion. It only makes sense to compare differences in magnitude. Going back to the political knowledge example, if Beth gets score = 2 on the political knowledge scale, and Cathy gets score = 4 on the same scale, we know that: 1) Cathy is more knowledgeable than Beth, and 2) the magnitude of the difference is 2 more correct answers. However, since the zero point is arbitrary, we cannot say Cathy is twice as knowledgeable as Beth. Or if we observe that Beth’s score increased from 2 to 3 after attending a civics education workshop, we cannot say that Beth’s political knowledge increased by 50%.
Let’s compare this to a ratio scale, income, which has a meaningful zero point. If Abe reported income = 20k, and Ben reported income = 40k, we know that 1) Ben has a higher income than Abe, 2) Ben’s income is 20k higher than Abe’s, and 3) Ben’s income is twice as much as Abe’s.
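The interval-vs-ratio contrast can be summarized in a few lines of code, using the numbers from the examples above:

```python
# Interval scale: political knowledge score (arbitrary zero point).
beth, cathy = 2, 4
print(cathy - beth)  # prints 2 - the difference in magnitude IS meaningful
# cathy / beth equals 2.0 arithmetically, but "twice as knowledgeable"
# is NOT a valid claim, because the zero point is arbitrary.

# Ratio scale: income in dollars (meaningful zero point).
abe, ben = 20_000, 40_000
print(ben - abe)  # prints 20000 - difference in magnitude is meaningful
print(ben / abe)  # prints 2.0 - the ratio is ALSO meaningful: twice the income
```

The arithmetic is identical in both cases; what differs is which computations have a substantive interpretation.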
Can a measure be both interval and ratio?
The four levels of measurement are mutually exclusive categories, so a measure cannot be both interval and ratio. The flow chart below should help you distinguish the four categories.