Forget Fake News – Worry About the Chaff…

According to the Encyclopedia Britannica (online edition) there are several sorts of electronic countermeasure used against opponents’ radar:

Electronic countermeasures (electronic warfare)

The purpose of hostile electronic countermeasures (ECM) is to degrade the effectiveness of military radar deliberately. ECM can consist of (1) noise jamming that enters the receiver via the antenna and increases the noise level at the input of the receiver, (2) false target generation, or repeater jamming, by which hostile jammers introduce additional signals into the radar receiver in an attempt to confuse the receiver into thinking that they are real target echoes, (3) chaff, which is an artificial cloud consisting of a large number of tiny metallic reflecting strips that create strong echoes over a large area to mask the presence of real target echoes or to create confusion, and (4) decoys, which are small, inexpensive air vehicles or other objects designed to appear to the radar as if they are real targets. Military radars are also subject to direct attack by conventional weapons or by antiradiation missiles (ARMs) that use radar transmissions to find the target and home in on it. A measure of the effectiveness of military radar is the large sums of money spent on electronic warfare measures, ARMs, and low-cross-section (stealth) aircraft.

These are worth bearing in mind when using Twitter and other social media, as well as keyword driven news search alerts, as your own, personal news radar. In this analogy, the things I want to detect are “true” news stories (whatever that means…); here are some countermeasures you could take to try to prevent high quality news signals, or news signals that inform me about the things you are doing that you don’t want me to know about, or that you need to spin because they paint you in an unfavourable light, getting through to me:

  • noise jamming: pollute my feed with noise that makes me filter out certain forms of traffic (your noise) and, as a side effect, legitimate news; reference me in e.g. tweets and swamp my mentions feed with noise; if I’ve subscribed to one of the accounts you control, feed that stream with random retweets, auto-generated rubbish, etc;
  • false target generation: try to get me to subscribe to an account you control, thinking it’s a legitimate news source;
  • chaff: chaff masks your current “location”, or a story about you; if I make a search or want to follow a particular topic, try to make sure all I can ever find are empty pages that attract those search terms, or your spin on the story;
  • decoys: push out your own news story or, even better, a ridiculous claim that gets widely reshared and that pulls interest away form a legitimate story breaking about you; if I’m only going to read one thing about you today, better it’s the one you put out rather than the one that shows you for what you are…

(If you can think of better examples, please share them in the comments; this was just a quick coffee break post… didn’t really try to think the examples through…)

Remember, folks, this is information war… We should all be reading up on psyops too…

Amazon Webservices Move Up a Level

Way back when, companies such as Amazon and Google realised that they could leverage the large amounts of computing infrastructure developed to support their own operations by selling their spare compute and memory capacity as self-service resources.

The engineering effort used to guarantee the high service quality levels for their core businesses could be sold on to startups, and established companies alike, who did not have the engineering expertise to develop and run their own scalable, and resilient, cloud services. (You’d know if Amazon Web Services (AWS) went down completely: so would large parts of the web that are hosted there.)

In the last couple of years, the likes of Google, Amazon and IBM have moved up a level, and now offer “commodity AI” services – recognising faces and and objects in photographs, performing entity extraction on the contents of large texts, generating speech from text and text from speech, and so on. (Facebook seems to prefer to remain inward looking.)

In a spate of announcements today, Amazon joined the part with the release of their own AI services, reviewed in a post by Amazon CTO, Werner Vogels, Bringing the Magic of Amazon AI and Alexa to Apps on AWS. (I’ll post my own summary review when I’ve had a chance to play with them…)

But it seems that AWS have been shopping too. As well as providing a range of different server sizes and base operating systems, the machine instances that Amazon provides now includes FPGAs (Field Programmable Gate Arrays; which is to say, programmable chips…) and (soon) GPUs.

The FPGA machine instance, the suitably named F1 includes one to eight [Xilinx UltraScale+ VU9P?] FPGAs dedicated to the instance, isolated for use in multi-tenant environments. to support the development the machine instance also incudes
a 2.3GHz Intel Broadwell E5 2686 v4 processors, up to 976 GiB of memory and up to 4 TB of NVMe SSD storage. So that looks alright, then… Gulp. (For more, see the product announcement, Developer Preview – EC2 Instances (F1) with Programmable Hardware.)

The pre-announcement for the GPU instances (In the Works – Amazon EC2 Elastic GPUs), which have been a long time coming, look set to offer Windows support for Open GL, followed by support for other versions of OpenGL, DirectX and Vulkan. This means you’ll be able to render and stream your own 3D models, at scale. (Anyone think this may be gearing up to support AR and VR apps, as well as online streaming games? Or support for GPU crunched Deep Learning/AI models?)

(All the new machine instance offerings are described in the summary announcement post, EC2 Instance Type Update – T2, R4, F1, Elastic GPUs, I3, C5</a.)

As well as offering more physical machine types, Amazon have also upgraded their Aurora relational database product so that it is now compliant with PostgreSQL as well as MySQL (Amazon Aurora Update – PostgreSQL Compatibility).

But it doesn’t stop there. For the consumer, just wanting to run their oiwn web hosted instance of WordPress, Amazon virtual personal servers are now available: Amazon Lightsail – The Power of AWS, the Simplicity of a VPS (though it looks a bit pricey compared to something like Reclaim Hosting…)

Back to the big commercial users, another of the benefits of using Amazon Web Services, whose resources far exceed the capacity of all but the largest technology operating companies, is that you can avail yourself of the large amounts of computing resource that might be required to analyse and process large datasets. Very large datasets. Huge datasets, in fact. Datasets so huge that you need a freight container to ship the data to Amazon because you’re unlikely to have the bandwidth to get it there via any other means. Freight containers like AWS Snowmobile (H/T Les Carr for the pointer).

According to the FAQ, each Snowmobile is a secure data truck with up to 100PB storage capacity in a 45-foot long High Cube tamper-resistant, water-resistent, temperature controlled and GPS-tracked shipping container. On arrival at your datacentre, it needs a 350KW power supply (Amazon can supply a generator, if required). Physical access to your datacentre is achieved using the supplied removable connector rack (up to two kilometers of networking cable are provided too).

Once you have completed the data transfer using your local data connect, the Snowmobile is returned to a designated AWS region datacentre. It’s not clear how the data is then uploaded – maybe they just wheel the container into a spare bay and hook it up?

This is all starting to get really silly now…

Algorithmic Truthiness

With a media who failed to hold jokers to account when they had their chance, preferring “balanced” reporting that biases news reports and gives equal measure to unequally validated ideas, and social media opting for truthiness rather than fact to generate momentum for spreading (fake) news, it seems we’re told by commentators we’re now in a “post-truth”/”post-factual” world.

As the OED define it, truthiness is The quality of seeming or being felt to be true, even if not necessarily true.

Although the definition could be debated…


Sound familiar?

A few years ago, at the dawn of the age of Big Data, the idea that segmenting and modelling large datasets in a “theory-free” way (Big data and the end of theory?) perhaps gave an inkling that truthiness was on its way in, big time. (Compare this also with anti-expert rhetoric over the last couple of years. I’m all for slamming certain classes of academic outlook and activity, but I also think there are reasons for trusting certain sorts of claims more than others…)

The fact that data processing algorithms are likely to have ever increasing power of what we read – not only in terms of selecting which stories to show us in our personalised news feeds, but also because other machines may themselves have written the stories we’re reading – means that we need to start getting a feel for what sorts of biases are likely to be baked into these algorithms.

In contrast to earlier generation of rile based expert systems that could be asked to “explain” their reasoning, today’s systems are often black box statistical machines. Whereas rule based systems used logical reasoning to come up with answers, Deep Learning algorithms and their ilk have gut reactions: rule based expert systems reasoned towards a truth associated with the logical statements asserted into them in an explainable way; black boxes have gut reactions and deal in truthiness.

But whereas we might be suspicious about a person making a truthy claim (“that doesn’t sound quite right to me…”) once we start to trust machine – because they appear to be right-ish, most of the time – we start to over-trust them. I think – I haven’t checked. Sounds truthy to me…

So with a tech news report doing the rounds at the moment that a “Neural Network Learns to Identify Criminals by Their Faces”, it seems that the paper authors “have demonstrated that via supervised machine learning, data-driven face classifiers are able to make reliable inference on criminality” as well as identifying “a law of normality for faces of noncriminals. After controlled for race, gender and age, the general law-biding public have facial appearances that vary in a significantly lesser degree than criminals”. (It’s not hard to imagine this being used a ranking factor for something…) The (best) false positive rate looked on one of the charts (figure 4 in the paper) to be around 6%. Are the decisions “true”, then, or just “truthy”? What level of false positivity makes the difference? (Bear in mind behaviourist training  – partial reinforcement can be really powerful…) I also wonder if the researchers ran the same training schedule against IQ? Or etc etc

(In passing, another recent preprint report on arXiv – Lip Reading Sentences in the Wild reports on an automated lip reading system trained on several hours of people talking on BBC television (the UK based researchers were license fee payers, I suspect, but the Google Deepmind sponsor..?!) (If you’d rather read a pop sci write up, New Scientist has one here: Google’s DeepMind AI can lip-read TV shows better than a pro.) For reference, the best word error rate the researchers report is 3.3%. So are the outputs true or truthy?)

So… I’m wondering… algorithmic truthiness: the extent to which the outputs of an algorithm feel as if they could be true, even if not necessarily true. … a useful conceit, or not?

Or maybe we need an alt definition, such as “The extent to which you believe the output of an algorithm to be true rather than what you know to be true”?!

“Local stories of national interest” – New Johnston Press (Data Journalism) Investigations Unit

Complementing the approach of Trinity Mirror, who launched a cross-group data journalism unit back in 2013, Johnston Press has pulled together a (virtual?) Investigations Unit made up from several investigative and data skilled reporters from across the Johnston Press regional titles (press release).

The unit’s first campaign is focussed on sentences awarded for causing death by dangerous driving. The campaign allows the unit to report on national datasets, as such, as well as developing local stories based on examples taken from the national dataset, bubbling up local stories to wider national interest as campaign hooks. From the press release announcing the launch of the unit, it seems as if this campaigning style of national/local investigative reporting will be underpin the unit’s activities.

“As well as carrying out investigations, and telling powerful human interest stories, the unit has a campaigning and lobbying role at its heart” – Johnston Press press release.

The use of campaigns means the same theme can be kept alive and repeatedly reported on as on ongoing series over an extended period of time, tracked nationally but reported in a local context on the one hand, promoting local campaigns and then reporting them widely on the other.

The national/local model is one that I’ve long thought makes sense, though I’ve not really considered it in terms of the local to national twist. Instead, I’ve been framing it as an opportunity to address centrally common pain points that may be experienced trying to produce a story from data at a local level, as discussed in these thoughts on a locally targeted, nationally scoped datawire.

National dataset local story

One advantage of this approach is scale: graphics communicating national level statistics can be produced centrally and reused across local titles, perhaps with local customisation; local stories can be used to provide relevance to generic “national context” inserts reused across titles; and story templates can be customised to generate local reports from the same national dataset.

Another advantage with looking at national datasets is that they can help flag the newsworthiness of a local story given its national context (for example, national rankings generate story points for the top M, bottom N rankings).

I haven’t spent much time thinking about the campaign aspect, but on quick reflection I think that campaigns can act as nice wrappers for a wider range of templated activities an outputs.

For example, I’ve written a couple of times about the notion of story templates, noting how these have been rolled out in previous years by at least the Johnston Press and Trinity Mirror (Local News Templates – A Business Opportunity for Data Journalists?).

And eighteen months or so ago, I was fortunate enough to spend a couple of days seeing how Ruby Kitchen, then of the Harrogate Advertiser, now of the Yorkshire Post / Yorkshire Evening Post and the Johnston Press Investigations Unit, worked on a Food Standards Agency story on (Data Journalism in Practice). One of the takeaways for me from that was what was involved in actually making use of leads thrown up from a data trawl and then chasing down people for comment. The work involved in putting together an investigation at a single local level may need to be repeated for other locales, but the process can be reused – the investigatory process can be templated.

On the way back home from Harrogate, I’d started fantasising about putting together a training pack based on the the Food Standards Agency food hygiene ratings data (h/t Andy Dickinson for tangentially reminding me of this a couple of days ago :-), with a dual objective in mind: firstly, to produce a training pack for demonstrating various aspects of how to practically work with national datasets at a local level; secondly, to template a data journalism investigation that could be worked through by local or hyperlocal journalists, or journalism students, to produce a feature local food hygiene ratings. (It’s still sitting on the to do pile… Maybe I should have tried kickstarter!)

(Note that it’s not just news organisations that can scale templated systems, or reuse locally developed solutions for national benefit. For example, see the post Putting Public Open Data to Work…? for several examples of online services developed by local councils and used to publish local data that can also be scaled across other council areas.)

Whilst newspaper groups such as Trinity Mirror or Johnston Press have the scale in terms of the number of local outlets to merit a co-ordinated centre reducing the pain once for working with national datasets and then scaling out the benefits across the regional and local titles, independent hyperlocals are often more resource bound when it comes to pursuing investigations (though The Bristol Cable among others repeatedly shows how hyperlocal led investigations are possible).

Whilst I keep not starting to properly scope a hyperlocal datawire service, Will Perrin’s  Local News Engine seems to have gained some traction in its development recently (Early proof of concept for Local News Engine [code]). This service “is testing the theory that story leads can be found in local data where a newsworthy person or place is engaged in a newsworthy activity”, searching local datasources (license applications, planning applications) for notable names (see for example What data are we using in Local News Engine? and Who, what and where is newsworthy for Local News Engine?). The approach taken – named entity extraction cross-referenced with the names of local notables – complements an alternative approach that I favour for the datawire that would flag local stories from national datasets based on things like top N, bottom M rankings, outliers, notable trends or dramatic change in statistics for a local area from a national dataset based on a comparison with previous data releases, other locales and national averages.

PS you can tell this is a personal blog post, not a piece of journalism – I didn’t reach out to anyone from the Johnston Press, or Trinity Mirror, or get in touch with Will Perrin to check facts or ask for comment. It’s all just my personal comment, bias, interpretation and opinion….

PPS See also Archant’s Investigations Unit (2015 announcement) – h/t Andy Dickinson.

Be Wary Of Simulations

An old (well, relatively speaking – from March this year) video has recently resurfaced on the Twitterz describing how researchers are (were) using virtual worlds to train Deep Learning systems for possible use in autonomous vehicles:

It reminded me of a demo by Karl Sims at a From Animals to Animats conference years ago in which he’d evolved creatures in a 3D world to perform various forms of movement:

One thing I remember, but not shown in the video above, related to one creature being evolved to jump as high as it could. Apparently, it found a flaw in the simulated physics of the world within which the critters were being evolved that meant it could jump to infinity…

In turn, Sims critters reminded me of a parable about neural networks getting image recognition wrong*, retold here: Detecting Tanks. In trying to track down the origins of that story, references are made to this November 1993 Fort Carson RSTA Data Collection Final Report. In passing, I note that the report (on collecting visual scene information to train systems to detect military vehicles in natural settings) refers to a Surrogate Semiautonomous Vehicle (SSV) Program; which in turn makes me think: how many fits and starts has autonomous vehicle research gone through prior to it’s current incarnation?

* In turn, this reminds me of another possibly apocryphal story – of a robot trained to run a maze being demoed for some important event. The robot ran the maze fine, but then the maze was moved to another part of the lab for the Big Important Demo. At which point, the robot messed up completely: rather than learning the maze, the robot had trained its escape based on things it could see in the lab – such as the windows – that were outside the maze. The problem with training machines is you’re never quite sure what they’re focussing on…

PS via Pete Mitton, another great simulation snafu story: the tale of the kangaroos. Anyone got any more?:-)

Datadive Reproducibility – Time for a DataBox?

Whilst at the Global Witness “Beneficial Ownership” datadive a couple of weeks ago, one of the things I was pondering  – how to make the weekend’s discoveries reproducible on the one hand, useful as a set of still working legacy tooling on the other – blended into another: how to provide an on-ramp for folk attending the event who were not familiar with the data or the way in which t was provided.

Event facilitators DataKind worked in advance with Global Witness to produce an orientation exercise based around a sample dataset. Several other prepped datasets were also made available via USB memory sticks distributed as required to the three different working groups.

The orientation exercise was framed as a series of questions applied to a core dataset, a denormalised flat 250MB or so CSV file containing just over a million or so rows, with headers. (I think Excel could cope with this – not sure if that was by design or happy accident.)

For data wranglers expert at working with raw datafiles and their own computers, this doesn’t present much of a problem. My gut reaction was to open the datafile into a pandas dataframe in a Jupyter notebook and twiddle with it there; but as pandas holds dataframes in memory, this may not be the best approach, particularly if you have multiple large dataframes open at the same time. As previously mentioned, I think the data also fit into Excel okay.

Another approach after previewing the data, even if just by looking at it on the command line with a head command, was to load the data into a database and look at it from there.

This immediately begs several questions of course  – if I have a database set up on my machine and import the database without thinking about it, how can someone else recreate that? If I don’t have a database on my machine (so I need to install one and get it running) and/or I don’t then know how to get data into the database, I’m no better off. (It may well be that there are great analysts who know how to work with data stored in databases but don’t know how to do the data engineering stuff of getting the database up and running and populated with data in the first place.)

My preferred solution for this at the moment is to see whether Docker containers can help. And in this case, I think they can. I’d already had a couple of quick plays looking at getting the Companies House significant ownership data into various databases (Mongo, neo4j) and used a recipe that linked a database container with a Jupyter notebook server that I could write my analysis scripts in (linking RStudio rather than Jupyter notebooks is just as straightforward).

Using those patterns, it was easy enough to create a similar recipe to link a Postgres database container to a Jupyter notebook server. The next step – loading the data in. Now it just so happens that in the days before the datadive, I’d been putting together some revised notebooks for an OU course on data management and analysis that dealt with quick ways of loading data into a Postgres data, so I wondered whether those notes provided enough scaffolding to help me load the sample core data into a database: a) even if I was new to working with databases, and b) in a reproducible way. The short answer was “yes”. Putting the two steps together, the results can be found here: Getting started – Database Loader Notebook.

With the data in a reproducibly shareable and “live” queryable form, I put together a notebook that worked through the orientation exercises. Along the way, I found a new-to-me HTML5/d3js package for displaying small  interactive network diagrams, visjs2jupyter. My attempt at the orientation exercises can be found here: Orientation Activities.

Whilst I am all in favour of experts datawranglers using their own recipes, tools and methods for working with the data – that’s part of the point of these expert datadives – I think there may also be mileage in providing a base install where the data is in some sort of immediately queryable form, such as in a minimal, even if not properly normalised, database. This means that datasets too large to be manipulated in memory or loaded into Excel can be worked with immediately. It also means that orientation materials can be produced that pose interesting questions that can be used to get a quick overview of the data, or tutorial materials produced that show how to work with off-the-shelf powertool combinations (Jupyter notebooks / Python/pandas / PostgreSQL, for example, or RStudio /R /PostgreSQL ).

Providing a base set up to start from also acts as an invitation to extend that environment in a reproducible way over the course of the datadive. (When working on your own computer with your own tooling, it can be way too easy to forget what packages (apt-get, pip and so on) you have pre-installed that will cause breaking changes to any outcome code you show with others who do not have the same environment. Creating a fresh environment for the datadive, and documenting what you add to it, can help with that, but testing in a linked container, but otherwise isolated, context really helps you keep track of what you needed to add to make things work!

If you also keep track of what you needed to do handle undeclared file encodings, weird separator characters, or password protected zip files from the provided files, it means that others should be able to work with the files in a reliable way…

(Just a note on that point for datadive organisers – metadata about file encodings, unusual zip formats, weird separator encodings etc is a useful thing to share, rather than have to painfully discover….)

Using tools like Docker is one way of improving the shareability of immediately queryable data, but is there an even quick way? One thing I want to explore on my to do list is the idea of a “databox”, a Raspberry Pi image that when booted runs a database server and Jupyter notebook (or RStudio) environment. The database can be pre-seeded with data for the datadive, so all that should be required is for an individual to plug the Raspberry Pi into their computer with an ethernet cable, and run from there. (This won’t work for really large datasets – the Raspberry Pi lacks grunt – but it’s enough to get you started.)

Note that these approaches scale out to other domains, such as data journalism projects (each project on its own Raspberry PI SD card or docker-compose setup…)

What Nationality Did You Say You Were, Again?

For the first time in way too long, I went to a data dive over the weekend, facilitated by DataKind on behalf of Global Witness, for a couple of days messing around with the UK Companies House  Significant Control (“beneficial ownership”) register.

One of the data fields in the data set is the nationality of a company’s controlling entity, where that’s a person rather than a company. The field is a free text one, which means that folk completing a return have to write their own answer in to the box, rather than selecting from a specified list.

The following are the more popular nationalities, as declared…


Note that “English” doesn’t count – for the moment, the nationality should be declared as “British”…

And some less popular ones  – as well as typos…:


So how can we start to clean this data?

One the libraries I discovered over the weekend was fuzzyset, that lets you add “target” strings to a set and then do a fuzzy match retrieval from the set using a word or phrase you have been provided with.

If we find a list of recognised nationalities, we could add these to a canonical “nationality” set, and then try to match supplied nationalities against them.

The UK Foreign & Commonwealth Office register of country names, a register that lists formalised country names for use in government, also includes nationalities – so maybe we can use that?

Adding the FCO nationalities to a  fuzzyset list, and then matching nationalities from the significant control register against those nationalities, gives a glimpse into the cleanliness (or otherwise!) of the data. For example, here’s what was matched against “British”:

British | Britsh | Bristish | Brisith | Scottish | Britsih | British/Greek | Greek/British | Briitish | British/Czech | Bitish | Brtisih | British/Welsh | Brirish | Brtish | British. | British Norfolk | British Cornish | British Subject | British English | Uk British | British/Irish | Britiah | British/Swedish | Biitish | Brititsh | British/English | Briish | British/Persian | Britiish | Brittish | French British | British/German | British/Syrian | Britihs | Briitsh | British /English | British / English | Brits | Kenyan/British | Britis | American British | Btitish | British/Bahrain Dual | Brtitish | Polish/British | Dual British/Irish | Brirtish | British- | British Uk | Brutish | Britich | British (Naturalised) | British (Canada Born) | Brithish | British Irish | British & Usa | Britisch | British/French | British/Israeli | Britrish | Britsh - English | American/British | Britisb | White British | Birtish | English / British | British/Turkish | Dual Usa/British | British/Swiss | Biritish | Britishu | Britisah | European British | British / Scottish | British & Israeli | British Swiss | Scotish | British Welsh | Britisn | Briti | Britihs & Irish | Britishi | Brfitish | Usa And British | American / British | British-United Kingdom | British Usa | Britisg | Israeli/British | Britih | Welsh British | Us & British | British Indian | British Asian | B Ritish | Emaratis | British/Bosnian | White Brtitish | British - English | Welsh/British | German/British | British & Irish | British-Israeli | British / Greek | Great British | Beitish | White Uk British | Belizean & British | Brithish English | Brituish | Britiash | Indian British | British Caribbean | Swedish/British | Britisjh | British Amercian | Britisk | Turkish/British | Brtiish | Br5itish | Brritish | Welsh, British | Brtitsh | U.K British | Britidh | Kurdish/British | English British | Brith | Irish/British | Britisj | British/Pakistan | I'M British | Britisih | American & British | British / Welsh | British / Swiss | Brittsh | British Icelandic | Swiss / British | Brotish | British Sikh | English/British | Britiswh | Bristsh | British European | British And Usa | British / Israeli | British Bengali | British Afghan | Brithsh | Brit6ish | British/Indian | British/Libyan | British/Polish | British Israeli | British National | Swiss British | Briritsh | Britishh | British / Irish | Brithis | Britshi | British And Thai | Britush | Britiss | British, English | Bfritish | Btritish | Brisitsh | White English | British/Mosotho | Usa & British | British/ Eu National | Finnish/British | Israeli + British | British And Polish | Bartish | Nritish | Brishish | British Manx | German And British | Britiosh | British (Bermudian) | Britishbritish | Naturalised British | English - British | Welsh - British | Dual American/British | British,Uk | British And Us | Uk Brittish | British Overseas | British & Swiss | English-British | British & Polish | Us/British | Swiss & British | British And Greek | Iraqi, British | Breitish | Black British | U.K. British | Afghan British | Brit / English | British/Asian | Awhite British | Asian British | British / Polish | Caucasian British | Britosh | Bristih | Britsish | British Libyan | Britisth | Brisish | British & Spanish | Britinsh | Britisht | Britsith | Britash | Irish / British | Brisitish | Brirtsh | Bruitish | Dutch / British | Bristis | Ritish | Welsh, Bristish | British Resident | British And French | British/ English | British (Welsh) | French/British | Dual British - French | Bristiah | Great Britain & Usa | British & Us | Uk Scottish | British Scott | Brititish | Dual: British, Usa | .British | British (Scots) | Scottish Uk | British/Scottish | Brittiish | British-Irish | Btittish | Scottish. | Britisy | Bruttish | Dual British Irish | Scottish/British

In passing, English matched best with Bangladeshi, so we maybe need to tweak the lookup somewhere, perhaps adding English, Scottish, Northern Irish, Welsh, and maybe the names of UK counties, into the fuzzyset, and then in post-processing mapping from these to British?

Also by the by, word had it that Companies House didn’t consider there to be any likely significant data quality issues with this field… so that’s alright then….

PS For various fragments of code I used to have a quick look at the nationality data, see this gist. If you look through the fuzzy matchings to the FCO nationalities, you’ll see there are quite a few false attributions. It would be sensible to look at the confidence ratings on the matches, and perhaps set thresholds for automatically allocating submitted nationalities to canonical nationalities. In a learning system, it may be possible to bootstrap – add high confidence mappings to the fuzzyset (with a map to the canonical nationality) and then try to match again the nationalities still unmatched at a particular level of confidence?