Generating Fake Data – Quick Roundup

When developing or testing a data system, it often makes sense to try it out with some data that looks real, but isn’t real, just in case something goes wrong…

It also means you can test as much as you want without having to expose any real data.

According to this article — Synthetic data generation — a must-have skill for new data scientists — knowing how to create effective test data is one of those new skills folk are going to have to learn.

(We’re about to start looking at producing a new machine learning course, so stumbling across that sort of possible requirement is quite timely…)

So what data can you use?

By chance, whilst searching for something else, I spotted this article describing pydbgen, a simple Python package for generating fake data tables to test simple database systems.

A quick trawl turns up other packages for doing similar things, such as mimesis, or faker, which also inspired this more general R package, charlatan.
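To get a feel for what these table generators do under the hood, here's a minimal sketch using only the Python standard library. It is not the pydbgen, mimesis or faker API; the seed lists, field names and the `fake_person` / `fake_table` helpers are all made up for illustration, and the real packages ship far richer, locale-aware providers.

```python
import random

# Hypothetical seed lists, invented for this sketch.
FIRST_NAMES = ["Alice", "Bashir", "Chen", "Dilys", "Emeka", "Freya"]
LAST_NAMES = ["Archer", "Brown", "Khan", "Okafor", "Price", "Silva"]
# Domains reserved for documentation, so no real mailbox is ever hit.
DOMAINS = ["example.com", "example.org", "example.net"]


def fake_person(rng):
    """Return one fake record as a dict."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        "email": f"{first}.{last}@{rng.choice(DOMAINS)}".lower(),
        "age": rng.randint(18, 90),
    }


def fake_table(n_rows, seed=None):
    """Return a list of fake records; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n_rows)]


rows = fake_table(5, seed=42)
for row in rows:
    print(row)
```

Passing a seed means the same "fake" table comes back every time, which is handy if a test fails and you want to rerun it against exactly the same rows.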

You can also generate numerical data with various statistical properties. For example, you can generate test datasets using scikit-learn; and here’s one of my early attempts at generating 2D numerical data to demonstrate different correlation coefficients.
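As a rough illustration of the correlation idea (a fresh sketch, not the code from my earlier attempt, and not scikit-learn), one standard trick is to mix two independent standard normal samples: if x and z are independent, then y = r·x + √(1 − r²)·z has correlation r with x. The function names here are my own.

```python
import math
import random


def correlated_pairs(n, r, seed=None):
    """Generate n (x, y) pairs whose population correlation is r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rng.gauss(0, 1)  # independent noise term
        y = r * x + math.sqrt(1 - r * r) * z
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


for target in (-0.9, 0.0, 0.5, 0.9):
    sample = correlated_pairs(5000, target, seed=1)
    print(f"target r = {target:+.1f}, sample r = {pearson(sample):+.3f}")
```

The sample coefficient will wobble around the target rather than hit it exactly, which is itself a useful thing to show learners.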

For text strings, generators are often referred to as ‘lorem ipsum’ generators (why?). For example, loremipsum or collective.loremipsum. Searching for “sentence generator” will also turn up some handy packages: markovify or markovipy, for example. If you prefer using neural network models, there are those too: textgenrnn. If you like waffle, here’s a WaffleGenerator.
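The Markov chain approach behind packages like markovify can be sketched in a few lines of plain Python. This is a toy first-order, word-level chain of my own, not the markovify or markovipy API, and the three-sentence corpus is invented; a real generator would want a much bigger source text.

```python
import random
from collections import defaultdict

# Tiny invented corpus; "." is treated as an end-of-sentence token.
CORPUS = (
    "the quick brown fox jumps over the lazy dog . "
    "the lazy dog sleeps under the old tree . "
    "the old fox watches the quick dog ."
)


def build_chain(text):
    """Map each word to the list of words seen to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate_sentence(chain, start, max_words=20, seed=None):
    """Random-walk the chain from `start` until "." or max_words."""
    rng = random.Random(seed)
    word = start
    out = [word]
    while len(out) < max_words:
        followers = chain.get(word)
        if not followers:
            break
        word = rng.choice(followers)
        if word == ".":
            break
        out.append(word)
    return " ".join(out) + "."


chain = build_chain(CORPUS)
for seed in range(3):
    print(generate_sentence(chain, "the", seed=seed))
```

Every adjacent pair of words in the output appeared somewhere in the source, which is why the results look locally plausible even when the sentence as a whole is nonsense.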

If it’s technical waffle you want, SciGen is a classic fake computer science paper generator, though it may take a bit of digging to find the required dependencies to run it…

Sometimes a real world source document can be used to bootstrap the production of a related, fake, item. For example, this “semiautomated scientific survey writing tool” that will create a scientific review paper for you: HackingScience. (I wonder, are there educational possibilities there that may help draft materials for, or support researching, new courses?)

The state of the art in text generation was evidenced by a blog post from OpenAI that was doing the rounds this week. gpt-2 looks like a related repo, but with a smaller model and no examples in the README…

Another use of text is for testing OCR (optical character recognition) systems: TextRecognitionDataGenerator. If you need to put some text into images in order to test text extraction from images: SynthText.

If it’s faces you need, then deep learning networks may help. For example, stylegan generates the sort of faces you can see on ThisPersonDoesNotExist. And here are some tips on how to spot fake face photos…

More general image synthesis from text is still a bit ropey, at least in some of the repos I found, but some look okay: text-to-image. If you do get a grainy image, though, I wonder what happens if you then tidy it up using something like this? deep-image-prior.

If you can find a way of generating semantic image sketches, you can generate real images by letting a network fill in the detail: PhotographicImageSynthesis.

If you need text surrounding an image, there are lots of examples of generating tags from images, but how about then using that to generate more sentence-like captions? image2story.

Many of these packages can generate plausible looking data for a wide definition of data, although they won’t necessarily model the mess of the real data (any mess you build in will be a model of messy data, but not necessarily a realistic one). This is something to bear in mind when testing. You should be particularly careful with how you use them if you are testing machine learning models against them, and expect weird things to happen if you make like Ouroboros and use them to train models…

PS maybe useful in related contexts: generate your own Anscombe’s Quartet style data R code.

PPS this is handy – pandas testing module has methods for creating quite fake data dataframes… Read more on the Real Python site: Make Toy Data Structures With Pandas’ Testing Module.

PPPS another fake data generator, fakir, which seems to be based on charlatan: “create fake data in R for tutorials, suitable to introduce to the {tidyverse} and to provide examples for main functions”.

PPPPS Complementary to fake data, here’s a handy package for adding different sorts of noise, including “human error” (miskeying etc) to all sorts of dataset: noisify.
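To give a flavour of what “human error” noise might look like, here's a small standard-library sketch that miskeys a string by swapping adjacent characters, dropping characters, or hitting a neighbouring key. It is not the noisify API; the `miskey` function, the error types and the partial keyboard map are my own assumptions.

```python
import random

# Hypothetical, partial map of neighbouring keys on a QWERTY keyboard.
NEIGHBOURS = {
    "a": "qwsz", "e": "wrsd", "i": "uojk", "o": "ipkl",
    "n": "bhjm", "s": "awedxz", "t": "rfgy", "r": "edft",
}


def miskey(text, rate, rng):
    """Apply a random typo at each character with probability `rate`."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        if rng.random() < rate:
            kind = rng.choice(["swap", "drop", "neighbour"])
            if kind == "swap" and i + 1 < len(chars):
                # Transpose this character with the next one.
                out.extend([chars[i + 1], c])
                i += 2
                continue
            if kind == "drop":
                # Skip the character entirely.
                i += 1
                continue
            if kind == "neighbour" and c.lower() in NEIGHBOURS:
                # Replace with an adjacent key.
                out.append(rng.choice(NEIGHBOURS[c.lower()]))
                i += 1
                continue
        # No error applied (or the chosen error was not applicable).
        out.append(c)
        i += 1
    return "".join(out)


rng = random.Random(7)
for name in ["Tony Hirst", "Open University", "synthetic data"]:
    print(name, "->", miskey(name, 0.15, rng))
```

As with the fake data itself, this is a model of messy keying rather than a record of how people actually mistype, so the same caveats apply.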

PPPPPS from the datasette/sqlite ecosystem Simon Willison is creating: sqlite-generate, a “Tool for generating demo SQLite databases using faker”.

PPPPPPS related… DSTL: Synthetic data: Unlocking the power of data and skills for machine learning

DSTL also seem to have commissioned a review: Synthetic Data (BAe review for DSTL)

The ONS Data Science Campus have also done some work that is perhaps more relevant to ONS style datasets: Synthetic data for public good.

More generators: https://github.com/sdv-dev/SDV “The Synthetic Data Vault (SDV) is a Synthetic Data Generation ecosystem of libraries that allows users to easily learn single-table, multi-table and timeseries datasets to later on generate new Synthetic Data that has the same format and statistical properties as the original dataset.”

Creating fake noise: https://blog.ephorie.de/pseudo-randomness-creating-fake-noise

More: https://dmey.github.io/synthia/overview.html

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...