For the first time in way too long, I went to a data dive over the weekend, facilitated by DataKind on behalf of Global Witness, and spent a couple of days messing around with the UK Companies House Significant Control (“beneficial ownership”) register.
One of the fields in the data set is the nationality of a company’s controlling entity, where that’s a person rather than a company. The field is a free text one, which means that folk completing a return have to write their own answer into the box, rather than selecting from a specified list.
The following are the more popular nationalities, as declared…
Note that “English” doesn’t count – for the moment, the nationality should be declared as “British”…
And some less popular ones – as well as typos…:
So how can we start to clean this data?
One of the libraries I discovered over the weekend was fuzzyset, which lets you add “target” strings to a set and then do a fuzzy match retrieval from the set using a word or phrase you have been provided with.
If we find a list of recognised nationalities, we could add these to a canonical “nationality” set, and then try to match supplied nationalities against them.
The UK Foreign & Commonwealth Office register of country names, a register that lists formalised country names for use in government, also includes nationalities – so maybe we can use that?
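As a rough sketch of the idea, here is a minimal example of matching free text nationalities against a canonical list. So that it runs without installing anything, it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for fuzzyset, which means the scores will not be the same as the ones fuzzyset reports. The `CANONICAL` list and the `best_match` helper are made up for illustration; a real run would load the nationalities from the register.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for the nationality column of the FCO register.
CANONICAL = ["British", "Irish", "French", "German", "Polish", "Bangladeshi"]


def best_match(raw, candidates=CANONICAL):
    """Return (score, candidate) for the closest candidate; score is in [0, 1]."""
    cleaned = raw.strip().lower()
    scored = [
        (SequenceMatcher(None, cleaned, candidate.lower()).ratio(), candidate)
        for candidate in candidates
    ]
    # max() compares scores first, so the highest-scoring candidate wins.
    return max(scored)


for raw in ["Britsh", "Brittish", "Polsh"]:
    score, match = best_match(raw)
    print(f"{raw!r} -> {match} ({score:.2f})")
```

Every input gets a best match, however poor, so the score needs to be looked at alongside the match itself.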
Adding the FCO nationalities to a fuzzyset list, and then matching nationalities from the significant control register against those nationalities, gives a glimpse into the cleanliness (or otherwise!) of the data. For example, here’s what was matched against “British”:
British | Britsh | Bristish | Brisith | Scottish | Britsih | British/Greek | Greek/British | Briitish | British/Czech | Bitish | Brtisih | British/Welsh | Brirish | Brtish | British. | British Norfolk | British Cornish | British Subject | British English | Uk British | British/Irish | Britiah | British/Swedish | Biitish | Brititsh | British/English | Briish | British/Persian | Britiish | Brittish | French British | British/German | British/Syrian | Britihs | Briitsh | British /English | British / English | Brits | Kenyan/British | Britis | American British | Btitish | British/Bahrain Dual | Brtitish | Polish/British | Dual British/Irish | Brirtish | British- | British Uk | Brutish | Britich | British (Naturalised) | British (Canada Born) | Brithish | British Irish | British & Usa | Britisch | British/French | British/Israeli | Britrish | Britsh - English | American/British | Britisb | White British | Birtish | English / British | British/Turkish | Dual Usa/British | British/Swiss | Biritish | Britishu | Britisah | European British | British / Scottish | British & Israeli | British Swiss | Scotish | British Welsh | Britisn | Briti | Britihs & Irish | Britishi | Brfitish | Usa And British | American / British | British-United Kingdom | British Usa | Britisg | Israeli/British | Britih | Welsh British | Us & British | British Indian | British Asian | B Ritish | Emaratis | British/Bosnian | White Brtitish | British - English | Welsh/British | German/British | British & Irish | British-Israeli | British / Greek | Great British | Beitish | White Uk British | Belizean & British | Brithish English | Brituish | Britiash | Indian British | British Caribbean | Swedish/British | Britisjh | British Amercian | Britisk | Turkish/British | Brtiish | Br5itish | Brritish | Welsh, British | Brtitsh | U.K British | Britidh | Kurdish/British | English British | Brith | Irish/British | Britisj | British/Pakistan | I'M British | Britisih | American & British | British / Welsh | British / 
Swiss | Brittsh | British Icelandic | Swiss / British | Brotish | British Sikh | English/British | Britiswh | Bristsh | British European | British And Usa | British / Israeli | British Bengali | British Afghan | Brithsh | Brit6ish | British/Indian | British/Libyan | British/Polish | British Israeli | British National | Swiss British | Briritsh | Britishh | British / Irish | Brithis | Britshi | British And Thai | Britush | Britiss | British, English | Bfritish | Btritish | Brisitsh | White English | British/Mosotho | Usa & British | British/ Eu National | Finnish/British | Israeli + British | British And Polish | Bartish | Nritish | Brishish | British Manx | German And British | Britiosh | British (Bermudian) | Britishbritish | Naturalised British | English - British | Welsh - British | Dual American/British | British,Uk | British And Us | Uk Brittish | British Overseas | British & Swiss | English-British | British & Polish | Us/British | Swiss & British | British And Greek | Iraqi, British | Breitish | Black British | U.K. British | Afghan British | Brit / English | British/Asian | Awhite British | Asian British | British / Polish | Caucasian British | Britosh | Bristih | Britsish | British Libyan | Britisth | Brisish | British & Spanish | Britinsh | Britisht | Britsith | Britash | Irish / British | Brisitish | Brirtsh | Bruitish | Dutch / British | Bristis | Ritish | Welsh, Bristish | British Resident | British And French | British/ English | British (Welsh) | French/British | Dual British - French | Bristiah | Great Britain & Usa | British & Us | Uk Scottish | British Scott | Brititish | Dual: British, Usa | .British | British (Scots) | Scottish Uk | British/Scottish | Brittiish | British-Irish | Btittish | Scottish. | Britisy | Bruttish | Dual British Irish | Scottish/British
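A listing like the one above can be built by grouping each raw value under whichever canonical nationality it matches best. The sketch below again uses the standard library's `difflib` as a stand-in for fuzzyset, with a small illustrative `CANONICAL` list and hypothetical helper names, so the groupings it produces will not necessarily agree with the ones shown above.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Illustrative subset only; a real run would use the full register.
CANONICAL = ["British", "Irish", "Bangladeshi", "Emirati"]


def best_match(raw, candidates):
    """Return (score, candidate) for the closest candidate."""
    cleaned = raw.strip().lower()
    return max(
        (SequenceMatcher(None, cleaned, candidate.lower()).ratio(), candidate)
        for candidate in candidates
    )


def group_by_match(raw_values, candidates):
    """Map each canonical nationality to the raw values that matched it best."""
    groups = defaultdict(list)
    for raw in raw_values:
        _, match = best_match(raw, candidates)
        groups[match].append(raw)
    return dict(groups)


sample = ["Britsh", "Brutish", "British/Irish", "Irsh"]
for canonical, raws in group_by_match(sample, CANONICAL).items():
    print(canonical, "|", " | ".join(raws))
```

Note that a dual nationality such as “British/Irish” is forced into a single group, which is one way the false attributions creep in.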
In passing, “English” matched best with “Bangladeshi”, so we maybe need to tweak the lookup somewhere. Perhaps we could add English, Scottish, Northern Irish and Welsh (and maybe the names of UK counties) to the fuzzyset, and then map from these to British in post-processing?
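One way that tweak might look is sketched below: the home nation terms are added as extra match targets, and an alias table maps them back to British after matching. The `ALIASES` table and the `normalise` function are hypothetical names for illustration, and `difflib` from the standard library stands in for fuzzyset.

```python
from difflib import SequenceMatcher

# Illustrative subset of canonical nationalities.
CANONICAL = ["British", "Bangladeshi", "Irish", "Swedish"]

# Hypothetical alias table: extra match targets and the canonical value
# each one should be rewritten to in post-processing.
ALIASES = {
    "English": "British",
    "Scottish": "British",
    "Welsh": "British",
    "Northern Irish": "British",
}


def best_match(raw, candidates):
    """Return (score, candidate) for the closest candidate."""
    cleaned = raw.strip().lower()
    return max(
        (SequenceMatcher(None, cleaned, candidate.lower()).ratio(), candidate)
        for candidate in candidates
    )


def normalise(raw):
    """Match against canonical names plus aliases, then map aliases back."""
    targets = CANONICAL + list(ALIASES)
    score, match = best_match(raw, targets)
    return score, ALIASES.get(match, match)


for raw in ["English", "Scotish", "Welsh", "Bangladeshi"]:
    print(raw, "->", normalise(raw)[1])
```

With the aliases in place, “English” matches its own entry exactly and is then rewritten to British, rather than drifting towards the nearest unrelated nationality.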
Also by the by, word had it that Companies House didn’t consider there to be any likely significant data quality issues with this field… so that’s alright then….
PS For various fragments of code I used to have a quick look at the nationality data, see this gist. If you look through the fuzzy matchings to the FCO nationalities, you’ll see there are quite a few false attributions. It would be sensible to look at the confidence ratings on the matches, and perhaps set thresholds for automatically allocating submitted nationalities to canonical nationalities. In a learning system, it may be possible to bootstrap: add high confidence mappings to the fuzzyset (with a map to the canonical nationality), and then try again to match the nationalities that are still unmatched at a particular level of confidence?
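The thresholding and bootstrapping idea could be sketched roughly as follows. This is only one possible reading of it: the `bootstrap` function, the 0.85 threshold and the round limit are all assumptions picked for illustration, and `difflib` from the standard library stands in for fuzzyset, so the scores are not fuzzyset's confidence ratings.

```python
from difflib import SequenceMatcher


def best_match(raw, targets):
    """Return (score, target) for the closest target string."""
    cleaned = raw.strip().lower()
    return max(
        (SequenceMatcher(None, cleaned, target.lower()).ratio(), target)
        for target in targets
    )


def bootstrap(raw_values, canonical, threshold=0.85, max_rounds=5):
    """Accept matches at or above the threshold, add the accepted variants
    as new match targets, and retry whatever is still unmatched."""
    # Every known target (canonical or learned variant) -> canonical form.
    lookup = {name: name for name in canonical}
    resolved, pending = {}, list(raw_values)
    for _ in range(max_rounds):
        accepted, still_pending = {}, []
        for raw in pending:
            score, target = best_match(raw, lookup)
            if score >= threshold:
                accepted[raw] = lookup[target]
            else:
                still_pending.append(raw)
        if not accepted:
            break  # nothing new learned this round, so stop
        resolved.update(accepted)
        lookup.update(accepted)  # learned variants become targets next round
        pending = still_pending
    return resolved, pending


resolved, unmatched = bootstrap(["Britsh", "Brtsh", "Xyzzy"], ["British"])
print(resolved)
print(unmatched)
```

Here “Brtsh” falls below the threshold against “British” in the first round, but is picked up in the second round via the learned variant “Britsh”. The obvious risk is that one bad acceptance can pull in further bad matches, so the threshold would need setting conservatively and the results spot-checking.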