Data Ethics – Guidance and Code – And a Fragment on Machine Learning Models

By chance, I notice that the Department for Digital, Culture, Media & Sport (DCMS) have published guidance on a Data Ethics Framework, with an associated workbook, that is intended to guide the design of appropriate data use in government and the wider public sector.

The workbook is based around providing answers to questions associated with seven principles:

  1. A clear user need and public benefit
  2. Awareness of relevant legislation and codes of practice
  3. Use of data that is proportionate to the user need
  4. Understanding of the limitations of the data
  5. Using robust practices and working within your skillset
  6. Making work transparent and being accountable
  7. Embedding data use responsibly

It’s likely that different sector codes will start to appear, such as this one from the Department of Health & Social Care (DHSC): Initial code of conduct for data-driven health and care technology. In this case, the code incorporates ten principles:

1 Define the user
Understand and show who specifically your product is for, what problem you are trying to solve for them, what benefits they can expect and, if relevant, how AI can offer a more efficient or quality-based solution. Ensure you have done adequate background research into the nature of their needs, including any co-morbidities and socio-cultural influences that might affect uptake, adoption and ongoing use.

2 Define the value proposition
Show why the innovation or technology has been developed or is in development, with a clear business case highlighting outputs, outcomes, benefits and performance indicators, and how exactly the product will result in better provision and/or outcomes for people and the health and care system.

3 Be fair, transparent and accountable about what data you are using
Show you have utilised privacy-by-design principles with data-sharing agreements, data flow maps and data protection impact assessments. Ensure all aspects of GDPR have been considered (legal basis for processing, information asset ownership, system level security policy, duty of transparency notice, unified register of assets completion and data privacy impact assessments).

4 Use data that is proportionate to the identified user need (data minimisation principle of GDPR)
Show that you have used the minimum personal data necessary to achieve the desired outcomes of the user need identified in 1.

5 Make use of open standards
Utilise and build into your product or innovation, current data and interoperability standards to ensure you can communicate easily with existing national systems. Programmatically build data quality evaluation into AI development so that harm does not occur if poor data quality creeps in.

6 Be transparent to the limitations of the data used and algorithms deployed
Show you understand the quality of the data and have considered its limitations when assessing if it is appropriate to use for the defined user need. When building an algorithm be clear on its strengths and limitations, and show in a transparent manner if it is your training or deployment algorithms that you have published.

7 Make security integral to the design
Keep systems safe by integrating appropriate levels of security and safeguarding data.

8 Define the commercial strategy
Purchasing strategies should show consideration of commercial and technology aspects and contractual limitations. You should only enter into commercial terms in which the benefits of the partnerships between technology companies and health and care providers are shared fairly.

9 Show evidence of effectiveness for the intended use
You should provide evidence of how effective your product or innovation is for its intended use. If you are unable to show evidence, you should draw a plan that addresses the minimum required level of evidence given the functions performed by your technology.

10 Show what type of algorithm you are building, the evidence base for choosing that algorithm, how you plan to monitor its performance on an ongoing basis and how you are validating performance of the algorithm.
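
As an aside, principle 5's call to "programmatically build data quality evaluation into AI development" is the sort of thing that can be made concrete quite cheaply. What follows is only a minimal sketch of one way it might look, not anything the code of conduct itself prescribes: the field (heart-rate readings), the plausible range, the 5% threshold and all the function names are my own assumptions, made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QualityReport:
    total: int
    missing: int
    out_of_range: int

    @property
    def bad_fraction(self) -> float:
        # Fraction of records failing either check; an empty dataset counts as all bad.
        return (self.missing + self.out_of_range) / self.total if self.total else 1.0


def check_quality(values: List[Optional[float]], low: float, high: float) -> QualityReport:
    """Count missing and out-of-range readings in a single numeric field."""
    missing = sum(1 for v in values if v is None)
    out_of_range = sum(1 for v in values if v is not None and not (low <= v <= high))
    return QualityReport(total=len(values), missing=missing, out_of_range=out_of_range)


def safe_to_use(report: QualityReport, max_bad_fraction: float = 0.05) -> bool:
    """Gate: refuse to train or predict if too much of the data looks wrong."""
    return report.bad_fraction <= max_bad_fraction


# Hypothetical heart-rate readings (beats per minute); None marks a missing value.
readings = [72.0, 65.0, None, 80.0, 400.0, 58.0, 77.0, 69.0, 90.0, 61.0]
report = check_quality(readings, low=20.0, high=250.0)
print(report, safe_to_use(report))
```

The idea is simply that the gate runs before any training or prediction step, so a batch with too many missing or implausible values is rejected rather than quietly fed to the model. A real deployment would need checks chosen for its own data and a threshold justified by the evidence requirements in principles 9 and 10.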

One of the discussions we often have when putting new courses together is how to incorporate ethics-related issues in a way that makes sense (which is to say, in a way that can be assessed…). One way might be to apply things like the workbook or the code of conduct to a simple case study. Creating appropriate case studies can be a challenge, but via an O’Reilly post, I note that a joint project between the Center for Information Technology Policy and the Center for Human Values, both at Princeton, has recently produced a set of fictional case studies designed to elucidate and prompt discussion about issues at the intersection of AI and ethics. They cover a range of topics: an automated healthcare app (foundations of legitimacy, paternalism, transparency, censorship, inequality); dynamic sound identification (rights, representational harms, neutrality, downstream responsibility); optimizing schools (privacy, autonomy, consequentialism, rhetoric); and law enforcement chatbots (automation, research ethics, sovereignty).

I also note that a recent DCMS consultation on the Centre for Data Ethics and Innovation has just closed… One of the topics of concern that jumped out at me related to IPR:

Intellectual Property and ownership: Intellectual property rights protect – and therefore reward – innovation and creativity. It is important that our intellectual property regime keeps up with the evolving ways in which data use generates new innovations. This means assigning ownership along the value chain, from datasets, training data, source code, or other aspects of the data use processes. It also includes clarity around ownership, where AI generates innovations without human input. Finally, there are potentially fundamental questions around the ownership or control of personal data, that could heavily shape the way data-driven markets operate.

One of the things I think we are likely to see more of is a marketplace in machine learning models, either sold (or rented out?) as ‘fixed’ or ‘further trainable’, building on the shared model platforms that are starting to appear. (A major risk here, of course, is that models with built-in biases – or vulnerabilities – might be exploited if bad actors know what models you’re running…) For example:

  • Seedbank [announcement], “a [Google operated] place to discover interactive machine learning examples which you can run from your browser, no set-up required”;
  • TensorFlow Hub [announcement], “a [Google operated] platform to publish, discover, and reuse parts of machine learning modules in TensorFlow”;
  • see also this guidance on sharing ML models on Google Cloud;
  • Comet.ml [announcement], “the first infrastructure- and workflow-agnostic machine learning platform.” No, me neither…

I couldn’t offhand spot a marketplace for Amazon SageMaker models, but I did notice some instructions for how to import your Amazon SageMaker trained model into Amazon DeepLens, so if model portability is a thing, the next thing Amazon will likely do is find a way to take a cut from people selling that thing.

I wonder, too, if the export format has anything to do with ONNX, “an open format to represent deep learning models”?

ONNX

(No sign of Google there?)

How the IPR around these models will be regulated can also get a bit meta. If data is licensed to one party so they can train a model, should the license also cover who might make use of any models trained on that data, or how any derived models might be used?

And what counts as “fair use” data when training models anyway? For example, Google recently announced Conceptual Captions, “A New Dataset and Challenge for Image Captioning”. The dataset:

consist[s] of ~3.3 million image/caption pairs that are created by automatically extracting and filtering image caption annotations from billions of web pages.

So how were those image and text caption training data gathered / selected? And what license conditions were associated with those images? Or when compiling the data set, did Google do what Google always does, which is conveniently ignore copyright because it’s only indexing and publishing search results, not actually re-publishing material (allegedly…)?

Does that sort of consideration fall into the remit of the current Lords Communications Committee inquiry into The Internet: to regulate or not to regulate?, I wonder?

A recent Information Law and Policy Centre post on Fixing Copyright Reform: How to Address Online Infringement and Bridge the Value Gap starts as follows:

In September 2016, the European Commission published its proposal for a new Directive on Copyright in the Digital Single Market, including its controversial draft Article 13. The main driver behind this provision is what has become known as the ‘value gap’, i.e. the alleged mismatch between the value that online sharing platforms extract from creative content and the revenue returned to the copyright-holders.

This made me wonder, is there a “mismatch” somewhere between:

a) the data that people share about themselves, or that is collected about them, and the value extracted from it;
b) the content qua data that web search engine operators hoover up with their search engine crawlers and then use as a corpus for training models that are used to provide commercial services, or sold / shared on?

There is also a debate to be had about other ways in which the large web cos seem to feel they can just grab whatever data they want, as hinted at in this report on Google data collection research.
