Improve Sampling Accuracy with Weighted Random Selections

Peter Lubell-Doughtie
April 24, 2017

Every data collector eventually runs into this problem: you know the makeup of your population as a whole, but you only have access to a small group that isn’t representative. For example, you know that across the whole population, the preference for plain cheese versus pepperoni pizza is 1:1. However, you only have access to a group that is heavy in vegetarians, and you need opinions that reflect everyone.

In this article, we’ll build a survey function that randomly selects people to survey in a weighted manner. Weighting adjustment is a common statistical correction technique that compensates for bias. It gives under-represented people or elements in your sample a larger weight than over-represented ones. In the pizza example, you would give more weight to non-vegetarians, who are open to eating pepperoni, in order to avoid interviewing too many vegetarians.

We’ll be using the XLSForm function random(), which returns a random number between 0.0 and 1.0, and combining it with weighted values to create random weighted selections in a survey.

Surveying citizen and refugee families

In the example below, we want to randomly survey a sample of citizen and refugee families. Everyone lives in houses with 3 families. We want to go to each house, register each family living in it, then have the survey randomly choose which family to survey. However, we know the population is mostly citizens, and we want to adequately capture both groups. Since refugees are under-represented, we will assign them a larger weight, or higher probability, of being selected to be surveyed compared to citizens.

We will assign the refugees and citizens weights of 0.6 and 0.4, respectively. This makes the refugee weight 50% larger than the citizen weight ((0.6-0.4)/0.4 = 0.5), which raises a refugee family’s chance of being selected. So, if we have 3 families in a house (two citizen families and one refugee family), we want to assign weights of 0.4, 0.4 and 0.6, respectively.

Then, for each family in the house, since we want this selection to be random, we multiply the weight by a random number generated using the random() function. The family with the highest product will be selected for a survey. Since 0.6 is larger than 0.4, a refugee family is more likely to be selected than a citizen family.
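To get a feel for how this rule behaves, here is a minimal Python sketch that simulates it outside of XLSForm. It assumes Python’s random.random() as a stand-in for the XLSForm random() function, and the names select_family and selection_frequencies are hypothetical helpers invented for this illustration. It repeats the selection many times for a house with two citizen families and one refugee family and reports how often each family wins.

```python
import random

# Weights taken from the article: refugee families 0.6, citizen families 0.4.
WEIGHTS = {"refugee": 0.6, "citizen": 0.4}


def select_family(families, rng):
    """Return the index of the family with the largest weight * random draw.

    `families` is a list of status strings ("refugee" or "citizen").
    """
    products = [WEIGHTS[status] * rng.random() for status in families]
    return max(range(len(families)), key=products.__getitem__)


def selection_frequencies(families, trials=100_000, seed=0):
    """Estimate how often each family is selected over many simulated visits."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    counts = [0] * len(families)
    for _ in range(trials):
        counts[select_family(families, rng)] += 1
    return [count / trials for count in counts]


# One house: two citizen families and one refugee family.
household = ["citizen", "citizen", "refugee"]
frequencies = selection_frequencies(household)
for status, freq in zip(household, frequencies):
    print(f"{status}: {freq:.3f}")
```

Under these assumptions, the exact probabilities for this house work out to 5/9 (about 0.556) for the refugee family and 2/9 (about 0.222) for each citizen family, so the simulated frequencies should land close to those values. The selection odds are driven by the weights but are not equal to them, so it is worth simulating a few household mixes before settling on weights.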

Authoring an XLSForm with weighted random selections

Below is an XLSForm survey sheet demonstrating how this is done. You can also view this XLSForm in a Google Sheet. Note: To provide clarity, we added note fields after every calculate field to show the results of the calculation. Note fields are not required to make the survey work.

Functions used in the XLSForm

  • if(${A1}=1,0.6,0.4) – This function assigns weights for a refugee and a citizen.
  • once(random()) – This function generates a random number from 0.0 to 1.0 for each family.
  • ${randomise}*${weights} – This expression computes the product of the weight and the random number generated for each family.
  • max(${product}) – This function returns the maximum product from the type_check repeat group.
  • position(..) – This function returns the index of the current iteration within the check_max repeat group.
  • indexed-repeat(${A3},${type_check},${pos_prod}) – This function extracts the family name matching the position in the type_check repeat group.
  • indexed-repeat(${product},${type_check},${pos_prod}) – This function extracts the product of the weight and random number of the family matching the position in the type_check repeat group.
  • if(${prod_comp}=${max_product},${name_comp},"") – This function checks whether a family’s product of weight and random number is the maximum. If it is, the family name is returned; otherwise, a blank is returned.
  • join(' ',${display_comp}) – This function displays the results of the above if function for the variable display_comp, separated by a space.
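The way these calculations chain together can be sketched in Python. This is only an illustration of the logic, not the form itself: the field names (A1, A3, weights, randomise, product, max_product, pos_prod, name_comp, prod_comp, display_comp) and the two repeat groups (type_check, check_max) come from the list above, while the sample answers, the run_form helper, and the assumption that any A1 value other than 1 means "citizen" are hypothetical.

```python
import random

# Hypothetical answers for one house. A3 is the family name; A1 is 1 for a
# refugee family (treating any other value as citizen is an assumption).
families = [
    {"A3": "Family A", "A1": 0},
    {"A3": "Family B", "A1": 0},
    {"A3": "Family C", "A1": 1},
]


def run_form(families, rng):
    """Mimic the form's calculations and return the selected family name."""
    # type_check repeat: one iteration per family.
    type_check = []
    for family in families:
        weights = 0.6 if family["A1"] == 1 else 0.4  # if(${A1}=1,0.6,0.4)
        randomise = rng.random()                     # once(random())
        product = randomise * weights                # ${randomise}*${weights}
        type_check.append({"A3": family["A3"], "product": product})

    # max(${product}): largest product across the type_check repeat.
    max_product = max(row["product"] for row in type_check)

    # check_max repeat: same repeat count, one iteration per family.
    display_comp = []
    for pos_prod in range(1, len(type_check) + 1):   # position(..) is 1-based
        row = type_check[pos_prod - 1]               # indexed-repeat lookup
        name_comp = row["A3"]
        prod_comp = row["product"]
        # if(${prod_comp}=${max_product},${name_comp},"")
        display_comp.append(name_comp if prod_comp == max_product else "")

    # join(' ',${display_comp}); strip() only tidies the blanks left by
    # the families that were not selected.
    return " ".join(display_comp).strip()


selected = run_form(families, random.Random(42))
print("Family to interview:", selected)
```

The second loop looks redundant in Python, but it mirrors why the form needs a second repeat group: the maximum is only known once every family’s product has been calculated, so the comparison has to happen in a separate pass.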

This survey works well for both Enketo and ODK Collect because the number of repeats can be set and repeat groups automatically opened without manual manipulation. The number of families is used as the repeat count for the two repeat groups used in this form.

How the survey appears in a web form

  1. Enter the number of families (three in this case).
  2. Three family repeat groups appear. Enter the family name.
  3. Mark the family as citizen or refugee. A weight will be assigned accordingly.
  4. A random number will be generated.
  5. The random number will be multiplied by the weight.
  6. After repeating steps two and three for the remaining families, the product from step five will be compared across families and the maximum product identified.
  7. The family name corresponding to the maximum product will be selected for the interview.

Filling out this form in ODK Collect works exactly the same way as in Enketo.

If you would like to test this out, please go to the Weighted Random Selection Google Sheet. Simply download the XLS file and add it as a form to your project.

We wish you extremely accurate random sampling with your projects! Please post questions and helpful hints to the Ona Community Forum.
