
5 steps to constructing a composite data quality index to assess overall surveyor performance

Mitali Mathur and Shreya More, 11 November 2021

This is the first blog of a two-part series. In this blog, IDinsight’s Data on Demand team discusses how it constructed a unified, composite data quality index that can be used to assess surveyor performance. Part 2 of this series will discuss how the team used the composite data quality index to create an incentive system for surveyors to encourage improved performance. We hope that the steps described in this blog are useful to other practitioners involved in data collection.


Motivation

The Data on Demand (DoD) team has made significant innovations and investments in its data quality management systems to holistically address different sources of error that could arise during data collection. 

The DoD team monitors data quality at each stage of the data collection process. Before a survey launches, we carefully code our survey forms to minimize illogical or unfeasible responses. Additionally, we train surveyors on protocols to ensure that questions are asked and recorded properly. During data collection, we have a dedicated team of monitors who conduct back checks, spot checks, and audio audits while our team runs basic high-frequency checks on the data (see figure 1). Finally, after data collection, we account for data inconsistencies (e.g., replacing values above the 95th percentile with the value at the 95th percentile).

To be precise, the daily outputs of our data quality system flag issues for each question. These question-level checks are actionable because they show exactly where a surveyor can improve their data quality. However, it is difficult to interpret such a variety of data points and form a view of a given surveyor’s overall performance.

To that end, the DoD team constructed a unified data quality index to quantify surveyor performance during data collection.

The benefits of constructing an index are three-fold:

  1. Usability: An index is an easier way to interpret multiple question-level checks to understand overall performance. The index ranges from 0-100%, where 100% indicates perfect data quality performance.
  2. Generalizability: Because data quality checks differ across surveys depending on the questionnaire, a methodology that gives each surveyor a single score, regardless of the underlying question-level checks, can be applied to different surveys. This enables us to track the index across surveys. 
  3. Incentives: A single score can be used to incentivize surveyors to improve their performance. Part 2 of this blog post series discusses incentives in more detail.

Methodology

The aforementioned checks yield 10 data quality indicators. A more detailed description of each of these indicators can be found in the table below:

| Bucket | Indicator | Description |
| --- | --- | --- |
| Spot Checks | Spot Check Protocol Violation Rate | Number of protocols violated / Total number of protocols checked |
| Spot Checks | Spot Check Score | Score out of 3 based on whether the surveyor requires retraining |
| Audio Audits | Audio Audit Mismatch Rate | Number of mismatches with main survey / Total number of questions audio audited |
| Audio Audits | Audio Audit Protocol Violation Rate | Number of protocols violated / Total number of protocols checked |
| In-person Back Checks | In-person Back Check Mismatch Rate | Number of mismatches with main survey / Total number of questions back checked |
| Phone Back Checks | Phone Back Check Mismatch Rate | Number of mismatches with main survey / Total number of questions back checked |
| High-Frequency Checks | Proportion of Don’t Knows | Number of questions with “don’t know” as a response / Total number of questions |
| High-Frequency Checks | Proportion of Refusals | Number of questions with “refuse” as a response / Total number of questions |
| High-Frequency Checks | Proportion of Logic Violations 1 | Number of questions with a logic violation / Total number of questions |
| High-Frequency Checks | Proportion of Outlier Violations 2 | Number of questions with an outlier violation / Total number of questions |

With 10 data quality indicators, the main challenge we foresaw was that not every indicator is equally important for surveyor-level data quality. As a result, we created weights for each component using a mix of data-driven strategies and subjective preferences. We decided on a mixed approach because:

  1. We wanted to ensure that our field experiences were captured, especially given that the data we would use would only reflect a single survey.
  2. The data collected would incorporate quality issues that could percolate from pre-data collection activities like surveyor training, for which we wouldn’t want to penalise our surveyors. 
  3. While we had subjective preferences on different buckets of indicators, we did not have a good idea of how to quantify weights, differentiate between indicators within buckets, or confirm whether or not certain indicators explained other indicators. Using data would help inform some of these decisions.

We discuss the five steps we took to create the data quality index below.

Step 1: Collect Subjective Weights

We first wanted to align internally on which indicators mattered most to each person on our team (all of whom had experience with data quality). We employed the budget allocation process, a method in which different “experts” independently distribute a total of 30 points across the indicators. 3 4 Then, we revealed our preferences to each other and held a team discussion to align on the importance of different indicators. 
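To make the mechanics of this exercise concrete, here is a minimal sketch of how such point allocations can be aggregated into subjective weights. The indicator names, the three “experts”, and the point values below are hypothetical placeholders, not our actual worksheet.

```python
import pandas as pd

# Hypothetical 30-point allocations from three team members ("experts").
# Indicator names loosely mirror the table above; the numbers are illustrative only.
allocations = pd.DataFrame(
    {
        "expert_1": {"spot_check_protocol": 8, "audio_mismatch": 9, "backcheck_mismatch": 7, "hfc_outliers": 6},
        "expert_2": {"spot_check_protocol": 10, "audio_mismatch": 8, "backcheck_mismatch": 6, "hfc_outliers": 6},
        "expert_3": {"spot_check_protocol": 9, "audio_mismatch": 9, "backcheck_mismatch": 8, "hfc_outliers": 4},
    }
)

# Each expert's column should sum to the fixed budget of 30 points.
assert (allocations.sum() == 30).all()

# Average across experts and normalise so the subjective weights sum to 1.
subjective_weights = allocations.mean(axis=1)
subjective_weights /= subjective_weights.sum()
print(subjective_weights.round(3))
```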

This discussion revealed that our preferences operated at the bucket level; we did not have strong preferences among the indicators within each bucket. Our teammates largely agreed on a hierarchy in which audio audits and spot checks should carry the most weight, then back checks, and finally high-frequency checks. This was largely driven by the fact that spot checks and audio audits help us track surveyor-driven errors. In addition, audio audits give us a more objective mismatch calculation than back checks, because back checks invite the possibility of respondents changing their answers when resurveyed. 5 Finally, high-frequency checks were weighted least because they reflect questionnaire framing more than surveyor performance. We kept this hierarchy in mind as we continued our approach.

Step 2: Clean Indicators

To generate data-driven weights, we used data for the data quality checks defined above from a previous round of data collection involving 480 surveyors. We compiled data from spot checks, back checks, audio audits, and high-frequency checks for each surveyor, for every question flagged for checks. To calculate mismatches, we compared the data entered by monitors during back checks and audio audits against the main survey data entered by surveyors, matching on the unique survey unit identifier. We calculated protocol violations for each question from the spot check and audio audit data. For spot check scores, we calculated question-level averages at the surveyor level. Finally, for the high-frequency checks, we collapsed the checks to the surveyor level. 

We then used these question-level checks to generate our data quality indicators at the surveyor level. For the proportion-based indicators, we summed the number of violations across questions and divided by the total number of questions to generate unified proportions. For the spot check score, we took the average of all the scores a surveyor received. The final dataset we produced contained all indicators at the surveyor level.
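The sketch below illustrates this aggregation step for one indicator, assuming a tidy question-level table with one row per audited question. The column names and values are hypothetical and stand in for our actual check data.

```python
import pandas as pd

# Hypothetical question-level check data: one row per audio-audited question,
# with a 1/0 flag for a mismatch against the main survey response.
audio_audits = pd.DataFrame(
    {
        "surveyor_id": ["S01", "S01", "S01", "S02", "S02"],
        "question_id": ["q1", "q2", "q3", "q1", "q2"],
        "mismatch": [0, 1, 0, 0, 0],
    }
)

# Surveyor-level indicator: mismatches summed across questions divided by the
# total number of questions audited for that surveyor.
grouped = audio_audits.groupby("surveyor_id")["mismatch"]
audio_mismatch_rate = grouped.sum() / grouped.count()
print(audio_mismatch_rate)  # S01 -> 0.333..., S02 -> 0.0
```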

Step 3: Build Correlation Matrices

Next, we built correlation matrices at two levels – buckets and indicators (as defined in the table above). 

The correlation matrix of data quality buckets was used to derive data-driven weights at a high level. We took an approach called inverse covariance weighting (ICW): we produced a correlation matrix of the buckets, inverted the matrix, summed the row entries for each bucket, and finally scaled each sum by a common multiplier to arrive at the final bucket weights.6 For example, if a bucket’s row in the inverse correlation matrix added up to 1.34, we scaled it by a multiplier of 11.5 to arrive at a weight of 15. 
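Here is a minimal sketch of that ICW computation. The bucket scores are random placeholders rather than our survey data, and the common multiplier is chosen here so that the bucket weights sum to 100.

```python
import numpy as np
import pandas as pd

# Hypothetical surveyor-level bucket scores (one row per surveyor); random
# placeholders stand in for the real bucket-level data quality scores.
rng = np.random.default_rng(0)
buckets = pd.DataFrame(
    rng.random((480, 4)),
    columns=["spot_checks", "audio_audits", "back_checks", "hfc"],
)

# Inverse covariance weighting: build the correlation matrix, invert it,
# sum each row, then scale the row sums by a common multiplier
# (here chosen so the bucket weights sum to 100).
corr = buckets.corr()
inv_corr = np.linalg.inv(corr.values)
row_sums = inv_corr.sum(axis=1)
bucket_weights = pd.Series(row_sums / row_sums.sum() * 100, index=corr.index)
print(bucket_weights.round(1))
```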

The major finding from the second correlation matrix at the indicator level was that the spot check overall score was highly correlated with the spot check speed score, probing score, comfort score, and protocol score. As a result, we ended up dropping the four granular scores to minimize double penalties to surveyors and used the overall score in our final index.

Step 4: Brainstorm Different Weighting Options

Now that we had both the subjective and the data-driven weights, the team got together to brainstorm different weighting options. 

Ultimately, we used the inverse covariance weighting method described in Step 3 to derive the bucket weights. This method weighted high-frequency checks higher than in-person and phone back checks, but we decided to down-weight them because the team had unanimously agreed that they should carry the least weight. We up-weighted in-person back checks and phone back checks (both of which started with equal weights) because we agreed that both are important measures of surveyor performance, not far behind spot checks and audio audits. Within each bucket, we followed a similar approach to assign indicator weights that add up to the overall bucket weight. The audio audit mismatch rate was weighted higher than the audio audit protocol violation rate, and the outlier and logic checks were weighted higher than the don’t know and refusal checks. The graphic below summarizes the weights we assigned to each indicator (in yellow) and bucket (light blue).

Step 5: Apply Different Weights to Data

Before applying the indices to the surveyor level data-set we had compiled, there were two issues that we needed to tackle.

  1. Unintuitive index: The inputs to our data quality index were structured in a way that “good” data quality implies a higher spot check score but a lower mismatch and protocol violation rate. As a result, we redesigned the violation and mismatch indicators to [1 – violation rate or mismatch rate]. In this way, the higher the score, the better! Ultimately, the surveyor level index was converted into a percentage score for easy interpretation.
  2. Missing indicators for surveyors: There were instances where certain indicators were missing for a surveyor. This can happen when, for instance, a surveyor was never back checked in person during the data collection period, so they have no in-person back check mismatch rate. In these cases, we removed that indicator from both the numerator and the denominator when calculating the surveyor’s final data quality index (see the sketch below). As a forward-looking solution, however, we aim to structure our assignments so that all surveyors undergo all data quality checks.
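The sketch below shows how these two adjustments combine for one surveyor. The indicator weights and raw values are hypothetical, chosen only to illustrate the [1 – rate] reversal and the missing-indicator handling; the spot check score out of 3 is rescaled to 0-1 for comparability.

```python
import numpy as np
import pandas as pd

# Hypothetical indicator weights (summing to 100) and one surveyor's raw values.
weights = pd.Series({
    "audio_mismatch_rate": 20, "spot_check_score": 20,
    "inperson_bc_mismatch_rate": 15, "phone_bc_mismatch_rate": 15,
    "audio_protocol_violation_rate": 15, "hfc_outlier_rate": 10,
    "hfc_dont_know_rate": 5,
})
raw = pd.Series({
    "audio_mismatch_rate": 0.10,
    "spot_check_score": 2.4 / 3,              # score out of 3, rescaled to 0-1
    "inperson_bc_mismatch_rate": np.nan,      # never back checked in person
    "phone_bc_mismatch_rate": 0.05,
    "audio_protocol_violation_rate": 0.0,
    "hfc_outlier_rate": 0.02,
    "hfc_dont_know_rate": 0.08,
})

# Reverse rate-style indicators so that higher always means better quality.
rates = [name for name in raw.index if name.endswith("_rate")]
scores = raw.copy()
scores[rates] = 1 - raw[rates]

# Drop missing indicators from both numerator and denominator,
# then express the surveyor's index as a percentage.
available = scores.dropna().index
index = (scores[available] * weights[available]).sum() / weights[available].sum() * 100
print(f"{index:.2f}%")
```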

Our Final Index

After deriving the index, we calculated the data quality index score for each of our surveyors and analyzed the distribution. Surveyors had a mean data quality score of 80.44% and a median of 80.81%; scores ranged from 59.74% to 92.12%.

We believe that our data quality index score will be a simple and helpful way to assess a surveyor’s data quality. The index takes into account a suite of data quality checks done in each survey and weighs some checks above others according to their importance. We plan on using this score to track surveyor data quality performance over time and create bonus structures to incentivize better performance.

 

  1. For example: if a respondent says that they don’t have a bank account, but when asked whether they received x amount in their bank account via a scheme they say yes, it is considered a logic violation.
  2. For example: when a respondent is asked their age, an entry of 200 is considered an outlier violation.
  3. Greco et al., “On the Methodological Framework of Composite Indices: A Review of the Issues of Weighting, Aggregation, and Robustness”, https://link.springer.com/article/10.1007/s11205-017-1832-9
  4. European Commission, “Competence Centre on Composite Indicators and Scoreboards”, https://knowledge4policy.ec.europa.eu/composite-indicators/10-step-guide/step-6-weighting_en#budget-allocation
  5. Hochstim et al., “Reliability of Response in a Sociomedical Population Study”, https://doi.org/10.1086/267867
  6. Michael L. Anderson, “Multiple Inference and Gender Differences in the Effects of Early Intervention”, https://are.berkeley.edu/~mlanderson/pdf/Anderson%202008a.pdf