Support #2260
open
Request for explaining the different missing rates of variables across all waves
90%
Description
Dear Understanding Society Team,
I am currently working with data from the UKHLS and have some questions regarding response patterns across different waves.
Specifically:
1. For the GHQ12 variable (scghq2_dv), there appear to be large differences in the rates of missing, inapplicable, and proxy responses across waves. For example, the total rate of these responses in Wave 1 is about 20%, but the total rate in Wave 14 is about 5%.
2. Similarly, for the variables jbsoc00, jbsoc10, and jbsoc20, we observed significant differences in the rates of missing responses across waves. For example, the missing rate of jbsoc20 in Wave 14 is about 27%, the missing rate of jbsoc10 in Wave 14 is about 6%, and the missing rate of jbsoc00 in Wave 14 is about 1%.
We have reviewed the "Main Survey User Guide" but could not find specific explanations for these discrepancies. We would appreciate it if you could clarify the reasons for these differences. Thank you for your time and assistance.
Best regards,
Evan Zhang
Files
Updated by Understanding Society User Support Team 23 days ago
- File clipboard-202506181601-dnhew.png added
- Category set to Data documentation
- Status changed from New to Feedback
- % Done changed from 0 to 50
- Private changed from Yes to No
Hello Evan
In Waves 1 and 2, the self-completion questionnaire, including the GHQ questions, was administered on paper. We’ve observed a relatively high level of missing responses for certain GHQ items in these early waves. However, the proportion of missing data decreases steadily in later waves, as shown in the table below:
Variable: scghq2_dv
To identify valid responses, it’s important to review the Question Universe, which specifies eligibility for each question. In this case, scghq2_dv is a derived variable created from the items scghqa to scghql.
You can find information on the universe for each question using the Mainstage Variable Search. For example, here is the link for scghqa, and if you scroll the page, under “Question asked in the latest wave,” you will see the universe specification:
https://www.understandingsociety.ac.uk/documentation/mainstage/variables/scghqa/
If you're interested in the syntax used to construct the scghq2_dv variable, it’s available here:
https://www.understandingsociety.ac.uk/wp-content/uploads/documentation/main-survey/syntax/stata/ghq_dv.do
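For orientation only, here is a minimal Python sketch of how a caseness-style score such as scghq2_dv can be built from the twelve items scghqa to scghql. It assumes the usual caseness recode (item answers 1 and 2 count as 0, answers 3 and 4 count as 1, summed to a 0 to 12 score) and, as a further assumption, returns no score at all when any item lacks a valid answer. The function name and example values are hypothetical; the Stata do-file linked above remains the authoritative definition.

```python
from typing import Optional, Sequence


def ghq_caseness(items: Sequence[int]) -> Optional[int]:
    """Sketch of a GHQ-12 caseness score (0-12) from twelve item responses.

    Assumptions (not taken from the official syntax):
    - valid item answers are the integers 1 to 4;
    - answers 1 and 2 are recoded to 0, answers 3 and 4 to 1;
    - if any item is not a valid answer (e.g. a negative missing code),
      no score is computed and None is returned.
    """
    if len(items) != 12:
        raise ValueError("expected 12 GHQ item responses")
    if any(v not in (1, 2, 3, 4) for v in items):
        return None
    return sum(1 for v in items if v >= 3)


# Hypothetical respondents, invented purely for illustration.
example_complete = [1, 2, 3, 4, 1, 1, 2, 2, 3, 1, 1, 4]
example_with_gap = [1, 2, 3, 4, 1, 1, 2, 2, 3, 1, 1, -9]

print(ghq_caseness(example_complete))
print(ghq_caseness(example_with_gap))
```

Under these assumptions a single unanswered item is enough to leave the derived variable without a valid score, which is one reason item-level missingness on paper questionnaires can show up so strongly in a derived variable.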
Regarding occupation variables like jbsocXX, these are provided by the fieldwork agency and are already coded. At the moment, there is no official process for mapping SOC00 codes to SOC10 or SOC20, but it is something that could be explored further. If you'd like to carry out the conversion yourself, you can use the coding index provided by the ONS here: https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2020/soc2020volume2codingrulesandconventions
Lastly, if you're interested in survey response rates, further details can be found in the following resources:
• Main survey user guide: https://www.understandingsociety.ac.uk/documentation/mainstage/user-guides/main-survey-user-guide/response-rates/
• Response tables: https://understandingsociety.ac.uk/wp-content/uploads/documentation/user-guides/6614_main_survey_user_guide_response_tables.pdf
I hope this information is helpful.
Best wishes,
Roberto Cavazos
Understanding Society User Support Team
Updated by Evan Zhang 17 days ago
Hi Roberto,
Thanks for your reply. I have a further question about the steadily decreasing proportion of missing data in later waves: what factors have contributed to this decline in missing responses? Thank you very much.
Best wishes,
Evan
Updated by Understanding Society User Support Team 17 days ago
Dear Evan,
Is there a particular element you'd like us to explain further?
Best wishes,
UKHLS User Support
Updated by Evan Zhang 12 days ago
Hi,
Sorry for the late reply. Regarding the variable scghq2_dv, you previously explained that “In Waves 1 and 2, the self-completion questionnaire, including the GHQ questions, was administered on paper. We’ve observed a relatively high level of missing responses for certain GHQ items in these early waves.”
I would like to further understand why the rate of missing responses gradually decreased in later waves. Was this due to specific policies or measures taken to encourage responses, or were there other reasons? Thank you very much.
Best wishes,
Evan
Updated by Understanding Society User Support Team 7 days ago
- % Done changed from 50 to 90
Hi Evan,
The gradual decline in missing values for these variables is due to a mode effect. Missing values are highest in Waves 1 and 2, because that section of the questionnaire was administered on paper (as Roberto pointed out above). The next noticeable decrease happens in Wave 6, when web interviews were introduced. If you tabulate w_scghq2_dv by w_indmode, you'll see fewer missing values for web interviews and more for face-to-face interviews. Since the share of web interviews has been gradually growing, the number of missing values in scghq2_dv has gradually declined. This trend is driven by two factors: 1) web routing was more liberal: unlike in face-to-face mode, web respondents didn’t need to give separate consent to complete the self-completion part (web is self-completion by default; you can check the universe/routing changes by wave in the PDFs of the questionnaires: https://www.understandingsociety.ac.uk/documentation/mainstage/questionnaires/); 2) there are no proxy web interviews.
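As a rough illustration of the tabulation suggested above, the following Python sketch computes the share of non-valid values per interview mode. It is a sketch under stated assumptions, not official syntax: the records are invented toy values (not survey results), the mode labels are hypothetical stand-ins for the categories of w_indmode, and negative values are assumed to mark missing, proxy, or inapplicable codes.

```python
from collections import defaultdict

# Invented toy records for illustration only: (interview mode, scghq2_dv value).
# Assumption: negative values stand for missing/proxy/inapplicable codes.
rows = [
    ("web", 3), ("web", 0), ("web", 11), ("web", -9),
    ("face-to-face", 2), ("face-to-face", -7),
    ("face-to-face", -9), ("face-to-face", 5),
]


def missing_share_by_mode(records):
    """Return {mode: share of records with a negative (non-valid) value}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for mode, value in records:
        totals[mode] += 1
        if value < 0:
            missing[mode] += 1
    return {mode: missing[mode] / totals[mode] for mode in totals}


shares = missing_share_by_mode(rows)
for mode, share in sorted(shares.items()):
    print(f"{mode}: {share:.0%} non-valid values")
```

With real data, the same comparison would be run separately for each wave, replacing the toy records with the wave-specific mode and GHQ variables.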
I hope this clarifies the issue.
Best wishes,
Piotr Marzec
UKHLS User Support