Weighting linked USoc-NPD dataset
I’m exploring the wellbeing of secondary school children with a Special Educational Need (SEN) using the linked Understanding Society (USoc) – National Pupil Database (NPD) dataset. The linked dataset contains NPD information (including SEN) matched to children who were part of USoc in wave 1, were living in England (the NPD only covers English schools), and whose parent gave consent to matching. Because the actual data matching took place a couple of years after consent was given, most of the NPD data is from a couple of years after USoc wave 1 (USoc wave 1 is from 2009/10; most of the NPD data is from 2012/13).
A child’s SEN status can change year on year, so to ensure a child’s SEN status is recorded from a time period near to their wellbeing data, I’ve decided to use wellbeing data from USoc wave 3 (2011/12). This is a year before most of the NPD data, which isn’t ideal but it’s the closest wave of USoc that contains the wellbeing data we need (some of which is only collected every other year in USoc).
So my analytical sample is children who were part of USoc wave 1, who completed the youth self-completion questionnaire in wave 3 (wellbeing data) and who have matched NPD data (SEN data). I need to create a weight that accounts for the various types of non-response that can affect this sample. The stages are:
(1) The starting group is children living in England aged 8-13 in 2009/10 (USoc wave 1).
(2) Some do not respond to the initial survey (there’s a cross-sectional weight for this on the survey).
(3) Some drop out of the USoc survey by 2011/12 (wave 3, the wave we’re using the wellbeing data from) (there’s a longitudinal weight for this on the survey).
(4) Of the remaining, some don’t complete the self-completion survey (there’s a weight for this on the survey).
(5) Of the remaining, some parents refused consent to matching (there’s no weight for this).
(6) Of the remaining, some didn’t get matched to the NPD (most of the data from which is from 2012/13) (there’s no weight for this).
My approach has been to identify children up to and including stage (4) above, and create a weight that readjusts these back to the initial population (1). This is created by multiplying the wave A-C longitudinal weight (c_psnenus_lw) by the wave C youth self-completion weight (c_ythscub_xw), i.e. new weight = c_psnenus_lw × c_ythscub_xw. I then need an additional component of the weight that adjusts for children whose USoc data did not get matched to the NPD. This is achieved by running a logistic regression model for the sample at stage (4), with the dependent variable indicating whether or not they were matched to the NPD. The logistic regression uses the following predictor variables: age of child, sex of child, ethnicity of child, household income, family work status, highest educational qualification of parents, rurality, and government office region. The logistic regression is weighted by ‘new weight’ (above). The ‘NPD weight’ is calculated as the inverse of the predicted probability of being matched. This has been scaled so the mean is 1 and then trimmed. The final weight is then ‘new weight’ × ‘NPD weight’.
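To make the arithmetic concrete, here is a minimal Python sketch of how the final weight could be assembled once the predicted match probabilities are available from the weighted logistic regression. The function name final_weights, the trimming cap trim_at, the choice to give unmatched children a weight of 0, and all the input values are assumptions made up for illustration; they are not taken from the USoc data or documentation.

```python
from statistics import mean


def final_weights(lw, xw, p_match, matched, trim_at=3.0):
    """Combine survey weights with an inverse-probability NPD adjustment.

    lw, xw   : longitudinal and youth self-completion weights per child
    p_match  : predicted probability of being matched to the NPD
    matched  : True if the child was actually matched
    trim_at  : hypothetical upper cap applied to the scaled NPD weight
    Unmatched children get a final weight of 0.0 (an assumption).
    """
    # 'new weight' = longitudinal weight x self-completion weight
    new_weight = [a * b for a, b in zip(lw, xw)]

    # Inverse of the predicted probability, for matched children only
    inverse = [1.0 / p for p, m in zip(p_match, matched) if m]

    # Scaling constant so the NPD weight has mean 1 among matched children
    centre = mean(inverse)

    result = []
    for w, p, m in zip(new_weight, p_match, matched):
        if not m:
            result.append(0.0)
            continue
        npd_weight = min((1.0 / p) / centre, trim_at)  # scale, then trim
        result.append(w * npd_weight)
    return result


# Made-up values for four children, purely for illustration
lw = [1.2, 0.8, 1.0, 1.5]
xw = [1.0, 1.1, 0.9, 1.0]
p_match = [0.8, 0.5, 0.9, 0.4]
matched = [True, True, False, True]

weights = final_weights(lw, xw, p_match, matched)
print([round(w, 3) for w in weights])
```

Note that in this sketch the trimming is applied after scaling, so if any weights are capped the mean of the NPD weight will fall slightly below 1; whether to rescale again after trimming is a design choice.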
Any comments gratefully received! Thank you.