Generating Fake Data – Quick Roundup

When developing or testing a data system, it often makes sense to try it out with some data that looks real, but isn’t real, just in case something goes wrong…

It also means you can test as much as you want without having to expose any real data.

According to this article, “Synthetic data generation: a must-have skill for new data scientists”, knowing how to create effective test data is one of those new skills folk are going to have to learn.

(We’re about to start looking at producing a new machine learning course, so stumbling across that sort of possible requirement is quite timely…)

So what data can you use?

By chance, whilst searching for something else, I spotted this article describing pydbgen, a simple Python package for generating fake data tables to test simple database systems.

A quick trawl turns up other packages for doing similar things, such as mimesis, or faker, which also inspired this more general R package, charlatan.
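
To give a feel for what this sort of package does under the hood, here is a minimal standard-library sketch of a fake table generator. It does not use the API of pydbgen, mimesis or faker; the name pools, field names and the `fake_person` / `fake_table` helpers are all made up for illustration.

```python
import random

# Hypothetical name pools, invented for this sketch; real packages ship far larger ones.
FIRST_NAMES = ["Alice", "Bashir", "Chen", "Dana", "Emeka", "Freya"]
LAST_NAMES = ["Jones", "Khan", "Lopez", "Murphy", "Novak", "Okafor"]
CITIES = ["Leeds", "Milton Keynes", "Cardiff", "Glasgow"]


def fake_person(rng):
    """Return one fake record as a dict, assembled from the pools above."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "name": f"{first} {last}",
        # example.com is reserved for documentation, so these addresses reach nobody.
        "email": f"{first}.{last}{rng.randint(1, 99)}@example.com".lower(),
        "city": rng.choice(CITIES),
        "age": rng.randint(18, 90),
    }


def fake_table(n_rows, seed=None):
    """Build a list of fake records; a fixed seed makes the table reproducible."""
    rng = random.Random(seed)
    return [fake_person(rng) for _ in range(n_rows)]


if __name__ == "__main__":
    for row in fake_table(3, seed=42):
        print(row)
```

Passing a seed means the same "random" table comes back every time, which is handy when a test fails and you want to rerun it against exactly the same rows.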

You can also generate numerical data with various statistical properties. For example, you can generate test datasets using SciKit Learn; and here’s one of my early attempts at generating 2D numerical data to demonstrate different correlation coefficients.
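
As a rough illustration of the correlation case, here is one way to generate 2D data with a chosen correlation coefficient using only the standard library. This is not the scikit-learn API, nor the early attempt mentioned above; it is a sketch of one common construction (mixing a standard normal variable with independent noise), and the function names are my own.

```python
import math
import random


def correlated_pairs(n, rho, seed=None):
    """Return n (x, y) pairs whose population correlation is rho (-1 <= rho <= 1)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)
        # Mixing x with independent noise in these proportions gives corr(x, y) = rho.
        y = rho * x + math.sqrt(1 - rho ** 2) * noise
        pairs.append((x, y))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    for rho in (-0.9, 0.0, 0.5, 0.9):
        sample = correlated_pairs(2000, rho, seed=1)
        print(f"target {rho:+.2f}  sample {pearson(sample):+.3f}")
```

The sample coefficient will wobble around the target, more so for small samples, which is itself a useful thing to demonstrate.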

For text strings, generators are often referred to as ‘lorem ipsum’ generators (why?). For example, loremipsum or collective.loremipsum. Searching for “sentence generator” will also turn up some handy packages, such as markovify or markovipy. If you prefer using neural network models, there are those too: textgenrnn. If you like waffle, here’s a WaffleGenerator.
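
Packages such as markovify build Markov chain models from a source text. The sketch below shows the basic idea at the word level using only the standard library; it is not markovify's API, and the toy corpus and the `build_chain` / `generate` helpers are invented for illustration.

```python
import random
from collections import defaultdict

# Toy corpus, made up for this sketch; a real one would be much longer.
CORPUS = "the cat sat on the mat and the dog sat on the rug"


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, max_words=12, seed=None):
    """Walk the chain from start, picking a random observed follower each step."""
    rng = random.Random(seed)
    word = start
    output = [word]
    # Stop at the length limit, or early if the current word was never followed by anything.
    while len(output) < max_words and chain.get(word):
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    chain = build_chain(CORPUS)
    print(generate(chain, "the", seed=3))
```

Every adjacent pair of words in the output was seen somewhere in the source, which is why the results read as locally plausible even when the sentence as a whole is nonsense.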

If it’s technical waffle you want, SciGen is a classic fake computer science paper generator, though it may take a bit of digging to find the required dependencies to run it…

Sometimes a real world source document can be used to bootstrap the production of a related, fake, item. For example, this “semiautomated scientific survey writing tool” that will create a scientific review paper for you: HackingScience. (I wonder, are there educational possibilities there that may help draft materials for, or support researching, new courses?)

The state of the art in text generation was evidenced by a blog post from OpenAI that was doing the rounds this week. gpt-2 looks like a related repo, but with a smaller model and no examples in the README…

Another use of generated text is for testing OCR (optical character recognition) systems: TextRecognitionDataGenerator. If you need to put some text into images in order to test text extraction from images, there’s SynthText.

If it’s faces you need, then deep learning networks may help. For example, stylegan generates the sort of faces you can see on ThisPersonDoesNotExist. And here are some tips on how to spot fake face photos…

More general image synthesis from text is still a bit ropey, at least in some of the repos I found, but some look okay: text-to-image. If you do get a grainy image, though, I wonder what happens if you then tidy it up using something like this? deep-image-prior.

If you can find a way of generating semantic image sketches, you can generate real images by letting a network fill in the detail: PhotographicImageSynthesis.

If you need text surrounding an image, there are lots of examples of generating tags from images, but how about then using that to generate more sentence-like captions? image2story.

Many of these packages can generate plausible looking data for a wide definition of data, although they won’t necessarily model the mess of real data (any mess you build in will be a model of messy data, but not necessarily a realistic one). This is something to bear in mind when testing. You should be particularly careful with how you use them if you are testing machine learning models against them, and expect weird things to happen if you make like Ouroboros and use them to train models…
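
To make the point about built-in mess concrete, here is a small sketch that dirties a clean table by blanking some values and swapping adjacent characters in some strings. The `add_mess` helper and its rates are assumptions of mine, and the mess it produces is exactly the sort of model of messy data, not necessarily a realistic one, described above.

```python
import random


def add_mess(rows, missing_rate=0.1, typo_rate=0.1, seed=None):
    """Return a dirtied copy of rows (a list of dicts); the input is left untouched."""
    rng = random.Random(seed)
    messy = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if rng.random() < missing_rate:
                # Simulate a missing value.
                new_row[key] = None
            elif isinstance(value, str) and len(value) > 1 and rng.random() < typo_rate:
                # Simulate a typo by swapping two adjacent characters.
                i = rng.randrange(len(value) - 1)
                chars = list(value)
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                new_row[key] = "".join(chars)
            else:
                new_row[key] = value
        messy.append(new_row)
    return messy


if __name__ == "__main__":
    clean = [{"name": "Alice Jones", "age": 34}, {"name": "Chen Khan", "age": 51}]
    for row in add_mess(clean, missing_rate=0.2, typo_rate=0.5, seed=4):
        print(row)
```

Real mess tends to be correlated (one bad import, one careless data entry clerk), whereas this scatters errors independently, so treat anything that passes against it with due suspicion.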

PS maybe useful in related contexts: R code to generate your own Anscombe’s Quartet style data.

Author: Tony Hirst

I'm a lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
