The Four S’s of Real Data – And a Need for Data Technicians, Not Data Scientists?

“Meh” to the 4 Vs of “big data”. For most people, most of the time, real data is:

  • small: a few rows and a few columns;
  • slow: comes out rarely, often according to a trailing schedule (once a week, once a month, some time after the reported period);
  • spreadsheeted: it just is…
  • smelly: indications in the data that something is wrong with the way it’s been collected, processed or analysed. (Cf. code smells, spreadsheet smells).

At the same time, data projects, big or small, often require folk to do a whole chunk of work with the data before they can actually get round to using it. Much of the time spent on data projects is spent getting the data; cleaning it (is “J. Smith” the same as “J Smith”?); data-typing it (is “1” the number one or a character “1”? Should “12/1/17” be saved as a date, and is it day or month first? Is it the date, or is it the period corresponding to that day?); putting it into a form you can work with (which may be a database, or a well-formed spreadsheet); getting it into the right shape (that is, structured using rows and columns you can easily work with); and so on.
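To make that grunt work a bit more concrete, here is a minimal Python sketch of the three cleaning questions above, using only the standard library. The function names (normalise_name, parse_number, parse_date) and the day_first flag are made up for illustration, and the rules they apply are assumptions: whether “J. Smith” really is “J Smith”, or which way round a date goes, is something only knowledge of the data source can settle.

```python
import re
from datetime import datetime


def normalise_name(name: str) -> str:
    """Strip full stops and extra spaces so "J. Smith" and "J Smith" compare equal.

    Assumption: punctuation and spacing carry no meaning in this column.
    """
    no_dots = name.replace(".", " ")
    return re.sub(r"\s+", " ", no_dots).strip().lower()


def parse_number(value: str):
    """Return an int if the text is a whole number, otherwise the original text."""
    text = value.strip()
    return int(text) if re.fullmatch(r"[+-]?\d+", text) else value


def parse_date(text: str, day_first: bool):
    """Parse a short slash-separated date.

    The string alone cannot say whether the day or the month comes first,
    so the caller has to supply that craft knowledge explicitly.
    """
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(text, fmt).date()


same_person = normalise_name("J. Smith") == normalise_name("J Smith")
one = parse_number("1")
as_day_first = parse_date("12/1/17", day_first=True)
as_month_first = parse_date("12/1/17", day_first=False)
```

The same string “12/1/17” comes out as 12 January 2017 or 1 December 2017 depending on a flag the data itself cannot provide, which is rather the point: the code is trivial, knowing which setting is right is the skill.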

If the value you think you want from your data scientist, and what you pay them for, is the stats’n’insights’n’data mining stuff, then should they be spending most of their time doing the grunt work, much of which relies on craft knowledge and skills? How many data scientists do we actually need if they aren’t spending all their time poking around fixing the plumbing?

Don’t we need more data technicians or data tech engs (technical engineers) who can do the labour-intensive stuff well (using their craft knowledge) as well as make a bit of sense of it (getting a bit of “insight” out of it based on familiarity with it) using the real data every company has? I just don’t get this whole “data science” hype thing… More people need to fix a dripping tap than a leaking high pressure, superheated steam valve in an online nuclear power station. So why the hype about a huge skills gap in the latter when what every company needs is someone who can do the former?

One comment

  1. Joshua

    I completely agree. Databases aren’t maintained because they aren’t used <=> Databases aren’t used because they aren’t maintained.

    The issue is that most data are stored without analysis in mind. The Excel-Analysts cannot use the data in its current format (SQL tables?). ML is seen as “magic” which is format agnostic and can solve the problem. The job goes to the Data Scientist. The Data Scientist laboriously cleans the data. The Data Scientist then performs Linear Regression and gets good results, doing what an Analyst should have done. The other part of the solution is to teach more Analysts to use Python, R and SQL, not Excel… The more the Database is used, the better it will be maintained.
