Weights when pooling data from different waves
Dear Understanding society support team,
I am currently working on a study for which I would like to pool data from all USOC waves (1-11) for a cross-sectional analysis of labour market, educational and health outcomes for different minority groups, comparing first and second generation.
From what I have read in previous posts if I pool data from all waves in the "long format", treating the same individuals across different waves as different persons, I can use the corresponding cross-sectional weight from each wave. Is that correct?
However, my initial idea was to pool the most recent available information for all individuals in Understanding Society (waves 1-11), so that I would have one entry for each person who ever participated in the study, taken from the most recent wave in which they took part. Would you be able to recommend which weights I should use in this case?
Updated by Understanding Society User Support Team over 1 year ago
- Category set to Weights
- Status changed from New to In Progress
- Assignee set to Olena Kaminska
- Private changed from Yes to No
Many thanks for your enquiry. The Understanding Society team is looking into it and we will get back to you as soon as we can.
We aim to respond to simple queries within 48 hours and more complex issues within 7 working days.
Understanding Society User Support Team
Updated by Olena Kaminska over 1 year ago
Thank you for your question. The first approach is correct and you can use it with relevant cross-sectional weights.
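For readers following along, here is a minimal sketch of the first approach on toy data. The two small datasets stand in for wave-specific files; the names xw, employed, wave_a and wave_b and all values are hypothetical placeholders, and the real file and weight names should be taken from the data documentation.

```stata
clear all

* Toy stand-in for a wave 1 file (hypothetical variables and values)
input pidp a_xw a_employed
1 0.8 1
2 1.2 0
end
tempfile wave_a
save `wave_a'

* Toy stand-in for a wave 2 file
clear
input pidp b_xw b_employed
1 0.9 1
3 1.1 1
end
tempfile wave_b
save `wave_b'

* Stack the waves in long format: one row per person per wave
tempfile pooled
local n = 0
foreach w in a b {
    local ++n
    use `wave_`w'', clear
    rename `w'_* *          // drop the wave prefix so names match across waves
    gen wave = `n'
    if `n' > 1 {
        append using `pooled'
    }
    save `pooled', replace
}

list pidp wave xw employed, sepby(wave)
```

Person 1 appears twice, once per wave, and each row carries that wave's own cross-sectional weight.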
The second approach is generally not legitimate, and there won't be a ready-made weight for that analysis. Our weights are created to represent 'real' subpopulations defined by substantive variables, i.e. subpopulations that you could describe even if our panel had never existed. Once a definition depends on the survey process, it runs into problems with the sampling concept. Following your description, from wave 2 you would select only those young people, for example, who never participated again; a young person who also participated at wave 3 would not be included with their wave 2 data. The weight from wave 2 can represent all young people, or all young people who left home etc., but it can't represent a subset of them defined simply by survey participation.
Having said that, a solution may be to create your own tailored weight. This would be a longitudinal weight with a base weight at wave 1, where everyone is present and represented as in the population; by wave 12, those who dropped out can be treated as missing data and those who are in the model as respondents. Such an approach would let you create your own correction for attrition specific to your model. This analysis, though, would not be pooled but a simple longitudinal analysis.
Just a note, if you pool over waves, make sure you correct for clustering within PSU.
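As a rough illustration of this note, the sketch below declares the survey design on simulated data. All variable names (psu, strata, xw, employed) and values are made up for the example; in a real pooled file you would substitute the PSU, strata and weight variables from the data.

```stata
clear all
set seed 12345
set obs 400

* Simulated pooled file: 4 strata, 5 PSUs per stratum, 2 waves
gen strata   = ceil(_n/100)
gen psu      = ceil(_n/20)
gen wave     = cond(mod(_n, 2) == 1, 2, 4)
gen xw       = 0.5 + runiform()      // stand-in for a cross-sectional weight
gen employed = runiform() < 0.6

* Declare PSU, weight and strata once; svy: commands then report
* standard errors that allow for clustering within PSU
svyset psu [pweight = xw], strata(strata)
svy: mean employed, over(wave)
```

Rows from different waves that share a PSU are treated as one cluster, which is what allows for repeated observations in the pooled file.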
Hope this helps,
Updated by Jonas Kaufmann over 1 year ago
Thank you very much for your clear and quick response.
I will go with the first approach. I just have two more questions:
1) Am I right in assuming that I do not have to rescale cross-sectional weights when pooling entire waves?
2) You mentioned correcting for clustering within PSU: I assumed that this would be automatically taken into account when including psu in svyset (svyset psu [weight], strata (strata)) or is there more to this?
I hope that these questions do not overstep the remit of this forum and apologise in advance if they do.
Thanks again for all your help!
Updated by Olena Kaminska over 1 year ago
2) Yes, svyset command is sufficient;
1) We do recommend rescaling, especially if you include BHPS data. This example will help you:
In pooled analysis, and sometimes in other types of analysis, you may need to apply an additional scaling to our weights. Our weights have a mean of 1 in each wave, which means that, if combined in a pooled analysis, the waves with a smaller sample size will contribute less to your analysis. This applies to BHPS waves and to later waves (as sample size decreases with attrition). Ideally, when combining events / states over 30 years (for example), you want each year to have the same importance. To ensure this, follow this example to calculate an additional scaling for your weights.
For example, you are looking at job quality and therefore are pooling information from wave 2, 4, 6 & 8 as these are the waves when the questions are asked. Here is how to create a scaled weight for this analysis.
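* assumption: ind is a constant equal to 1 (skip this line if ind already exists)
gen ind=1
* store each wave's sum of weights
sum ind [aw=b_indpxub_xw] if wave==2
gen bwtdtot=r(sum_w)
sum ind [aw=d_indpxub_xw] if wave==4
gen dwtdtot=r(sum_w)
sum ind [aw=f_indpxub_xw] if wave==6
gen fwtdtot=r(sum_w)
sum ind [aw=h_indpxub_xw] if wave==8
gen hwtdtot=r(sum_w)
* wave 2 is the reference; scale the other waves to its sum of weights
gen weightscaled=b_indpxub_xw if wave==2
replace weightscaled=d_indpxub_xw*(bwtdtot/dwtdtot) if wave==4
replace weightscaled=f_indpxub_xw*(bwtdtot/fwtdtot) if wave==6
replace weightscaled=h_indpxub_xw*(bwtdtot/hwtdtot) if wave==8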
You can double check by looking at the sum of ind with weightscaled for each wave – it should be the same.
sum ind [aw=weightscaled] if wave==2
sum ind [aw=weightscaled] if wave==4
sum ind [aw=weightscaled] if wave==6
sum ind [aw=weightscaled] if wave==8
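The same scaling can be tried out on toy data where the answer is known in advance. This is only a sketch: it assumes a single weight variable xw in a long-format file rather than the wave-prefixed names used above, it uses the plain sum of xw (equivalent to the sum of weights), and all values are invented.

```stata
clear all
input wave xw
2 1.0
2 1.0
2 2.0
4 0.5
4 1.5
6 3.0
end

* Sum of weights in the reference wave (wave 2): 1 + 1 + 2 = 4
quietly summarize xw if wave == 2
scalar reftot = r(sum)

* Rescale each wave so that its weights sum to the reference total
gen weightscaled = .
levelsof wave, local(waves)
foreach w of local waves {
    quietly summarize xw if wave == `w'
    replace weightscaled = xw * (reftot / r(sum)) if wave == `w'
}

* Check: the scaled weights should have the same sum in every wave
tabstat weightscaled, by(wave) statistics(sum)
```

Looping over the values of wave avoids writing one line per wave, which may be convenient when pooling all eleven waves.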