Support #1520

Pooling cross-sectional data of UKHLS - Waves 1 to 7

Added by Marie Mueller over 1 year ago. Updated 11 months ago.

This issue relates to issue #1472.


I will use UKHLS youth data of Waves 1 to 7. My outcome of interest is the SDQ. I will use data on youth living in Greater London only. I am not interested in change in SDQ over time. My main goal is to retain as many observations as possible, as the number of observations drops dramatically due to my focus on Greater London. Therefore, I would like to use all the observations available across Waves 1 to 7. Depending on age at study entry, some participants complete the youth self-completion questionnaire only once, others twice, and a few even three times. Also, some participants complete the youth self-completion questionnaire in the earlier waves, others in the later waves. In other words, I have an unbalanced panel design in that some (but not all) participants have data at multiple time points. The main problem: finding an appropriate weight.

In #1472, we came to the conclusion that a longitudinal person enumeration weight at the last time point (here: g_psnenus_lw) would be the optimal weight to use. However, looking at the data, I find that with this approach I would lose too many observations. Of 2,056 individuals, 749 have a missing weight and 551 have a weight of zero, leaving me with only 756 individuals included in my analysis.
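For reference, this is roughly how I arrived at those counts. It is a minimal Python sketch rather than my actual Stata code: the values in `weights` are made up, and `weight_status` is a helper name I invented for illustration. Only the three categories (missing, zero, usable) mirror what I see in g_psnenus_lw.

```python
from collections import Counter

def weight_status(w):
    """Classify a longitudinal weight as 'missing', 'zero' or 'usable'."""
    if w is None:
        return "missing"
    return "zero" if w == 0 else "usable"

# Hypothetical toy values standing in for g_psnenus_lw (None = missing).
weights = [None, 0.0, 1.3, 0.8, None, 0.0, 2.1]

# Tally how many individuals fall into each category.
counts = Counter(weight_status(w) for w in weights)
print(counts)

# Only individuals with a positive weight contribute to a weighted analysis.
usable = counts["usable"]
```

Applied to my real data, the same tally is what gives 2,056 - 749 - 551 = 756 usable individuals.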

Ultimately, I want to use all observations of all 2,056 individuals. I assume the only solution would be to pool cross-sectional information of all seven waves, correct?

If so, I have seen point 12 in the Weighting FAQs, but I can't really follow it. I also had a look at #1374 and #1257, which seem to be related. From these two I take it that I do not have to worry about scaling the weights. I was wondering if you could give some advice on how to approach the pooled analysis.


1. How do I best pool my data? Do I run seven separate models (one for each wave, using the corresponding youth cross-sectional weight of that wave) and pool my estimates afterwards? Or do I combine data of all seven waves in one analysis, using a long format? From the issues linked above, I take it that the latter is the right way. Then I would have seven rows for each individual and my long weight variable would contain the corresponding cross-sectional weights (i.e. row 1 for individual 1 would contain the cross-sectional weight of individual 1 at the first wave, row 2 for individual 1 would contain the cross-sectional weight of individual 1 at the second wave...). Is this correct?

2. If the above is correct (i.e. I combine all information of all individuals across all waves in one data set, transformed into long format), how do I then analyse these data? I was planning to run a mixed model to adjust for clustering in LSOAs, households, and individuals. (Note that I may omit the household level because I have too many levels for the relatively small number of observations.) However, I already anticipate the problem that an individual will have different weights at different time points, which will lead to an error in mixed and, I assume, svyset too. How do I go about this problem? My idea would be to include only one observation per individual which would lead to loss of data but would solve the problem of inconsistent weights within the individual (and would also allow me to omit the individual level from my multilevel structure). I guess then I would not need a long format after all because I would only have one observation per individual and I do not actually need to know what wave this was taken from. Is there a better way to do this and to actually keep all observations of all individuals?
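To make the two options above concrete, here is a minimal Python sketch of the data layout I have in mind (not Stata, and not a recommendation on weighting). Everything in it is an assumption for illustration: the toy records, the field names `person_id`, `sdq` and `xw`, and the helper names `to_long` and `last_observation_per_person`. In particular, keeping the latest observation per individual is an arbitrary choice on my part, and the sketch only creates rows for waves where both an SDQ score and a cross-sectional weight exist.

```python
# Wave letters a..g stand for Waves 1..7.
WAVES = "abcdefg"

# Hypothetical wide records: SDQ scores and cross-sectional weights by wave.
# A wave that is absent means the youth questionnaire was not completed then.
wide = [
    {"person_id": 1, "sdq": {"a": 12, "b": 10}, "xw": {"a": 1.1, "b": 0.9}},
    {"person_id": 2, "sdq": {"e": 15}, "xw": {"e": 1.4}},
    {"person_id": 3, "sdq": {"c": 8, "d": 9, "e": 7},
     "xw": {"c": 0.7, "d": 1.0, "e": 1.2}},
]

def to_long(records):
    """One row per person-wave, carrying that wave's cross-sectional weight."""
    rows = []
    for rec in records:
        for wave_number, letter in enumerate(WAVES, start=1):
            sdq = rec["sdq"].get(letter)
            weight = rec["xw"].get(letter)
            if sdq is None or weight is None:
                continue  # no youth data at this wave, so no row
            rows.append({"person_id": rec["person_id"], "wave": wave_number,
                         "sdq": sdq, "weight": weight})
    return rows

def last_observation_per_person(rows):
    """Keep only the latest wave for each person (an arbitrary rule)."""
    latest = {}
    for row in rows:
        current = latest.get(row["person_id"])
        if current is None or row["wave"] > current["wave"]:
            latest[row["person_id"]] = row
    return sorted(latest.values(), key=lambda r: r["person_id"])

long_rows = to_long(wide)                          # option in point 1
one_each = last_observation_per_person(long_rows)  # fallback in point 2
```

In this toy example the long file has six rows and the weight varies within individuals 1 and 3, which is exactly the situation I expect to cause trouble; the reduced file has one row per individual with a single weight each.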

Thank you very much in advance!

Best wishes,
