One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.
Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the makeup of the larger commercial providers.
Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t be much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist, or at least is not easily discoverable.
So if we do want to join the CQC and NHS Digital datasets, what are we to do?
One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do some exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by trying to exactly match location names sharing the same postcode.
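As a minimal sketch of that two-step idea (not the notebook's actual code), the following groups one dataset by postcode and then looks for exact name matches within each postcode. The records, identifiers, postcodes and field names are all invented for illustration, and `normalise_postcode` and `match_by_postcode_and_name` are hypothetical helper names.

```python
from collections import defaultdict

# All records below are invented for illustration; the identifiers, postcodes
# and field names are assumptions, not real CQC or NHS Digital values.
cqc_locations = [
    {"cqc_id": "CQC-1", "name": "LINTHORPE SURGERY", "postcode": "zz1 1aa"},
    {"cqc_id": "CQC-2", "name": "EXAMPLE HEALTH CENTRE", "postcode": "ZZ2 2BB"},
]
nhs_practices = [
    {"practice_code": "NHS-1", "name": "THE LINTHORPE SURGERY", "postcode": "ZZ1 1AA"},
    {"practice_code": "NHS-2", "name": "EXAMPLE HEALTH CENTRE", "postcode": "ZZ22BB"},
]


def normalise_postcode(postcode):
    """Upper-case a postcode and drop whitespace so formatting differences don't block a match."""
    return "".join(postcode.split()).upper()


def match_by_postcode_and_name(cqc_rows, nhs_rows):
    """Return (matches, unmatched): exact name matches within a shared postcode,
    and the CQC ids for which no exact match was found."""
    # Crib: group the NHS practices by normalised postcode.
    nhs_by_postcode = defaultdict(list)
    for row in nhs_rows:
        nhs_by_postcode[normalise_postcode(row["postcode"])].append(row)

    matches, unmatched = [], []
    for cqc in cqc_rows:
        candidates = nhs_by_postcode.get(normalise_postcode(cqc["postcode"]), [])
        # Exact comparison, ignoring only case and surrounding whitespace.
        exact = [n for n in candidates
                 if n["name"].strip().upper() == cqc["name"].strip().upper()]
        if exact:
            matches.append((cqc["cqc_id"], exact[0]["practice_code"]))
        else:
            unmatched.append(cqc["cqc_id"])
    return matches, unmatched


matches, unmatched = match_by_postcode_and_name(cqc_locations, nhs_practices)
print(matches)
print(unmatched)
```

The `unmatched` list is the interesting part: it holds the locations that share a postcode with something but have no identically named partner, and so need a looser comparison.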
This gets us only so far, because exact string matching is likely to return a high number of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords and stopwords sensitive to this domain (THE, for example, or “CENTRE”), but using partial or fuzzy matching techniques is likely to work better still, albeit with the risk of now introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
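To make the trade-off concrete, here is a rough sketch that strips stopwords and then scores name similarity using `difflib.SequenceMatcher` from the Python standard library. This is a stand-in for whatever fuzzy matcher you prefer, not necessarily the one used in the notebook. The stopword list contains only the two words mentioned above, the 0.85 threshold is an arbitrary assumption, and `clean_name`, `similarity` and `best_match` are hypothetical helper names.

```python
from difflib import SequenceMatcher

# Stopwords taken from the examples in the text; a real list would be longer
# and is an assumption to tune against the data.
DOMAIN_STOPWORDS = {"THE", "CENTRE"}


def clean_name(name):
    """Upper-case a name and drop stopword tokens."""
    tokens = name.upper().split()
    return " ".join(t for t in tokens if t not in DOMAIN_STOPWORDS)


def similarity(a, b):
    """Similarity score between 0 and 1 from difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(name, candidates, threshold=0.85):
    """Return (candidate, score) for the closest cleaned name at or above
    the threshold, or None if nothing scores highly enough."""
    scored = [(c, similarity(clean_name(name), clean_name(c))) for c in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Candidates would normally be the names that share a postcode with `name`.
print(best_match("LINTHORPE SURGERY", ["THE LINTHORPE SURGERY"]))
print(similarity("HIRSEL MEDICAL CENTRE", "KINGS MEDICAL CENTRE"))
print(best_match("HIRSEL MEDICAL CENTRE", ["KINGS MEDICAL CENTRE"]))
```

With the stopwords removed, the two Linthorpe names become identical. The raw HIRSEL and KINGS names score a little over 0.8 because of the shared MEDICAL CENTRE suffix, so a looser threshold applied to the raw names would accept them as a match. Dropping CENTRE and using the stricter threshold rejects the pair here, but the right cut-off really needs checking against the data.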
Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb