Category: Anything you want

Recent Releases: Plotly Falcon SQL Client and the Remarkable Datasette SQLite2API Generator

A few years ago, Plot.ly launched an online chart hosting service that allowed users to create  – and host – charts based on their own datasets. This was followed with the release of a open source charting libraries and a Python dashboard framework (dash). Now, they’ve joined the desktop query engine’n’charting party with the Falcon SQL Client, a free electron desktop app for Mac and Windows (code).

The app (once started – my Mac complained it was unsigned when I tried to run it the first time) appears to allow you to connect to a range of databases, query engines (such as Apache Drill) and SQLite.

Unfortunately, I couldn’t find a file suffix that would work when looking for a SQLite file to try it out quickly – and whilst trying to find any docs at all for connecting to SQLite (there are none that I can find at the moment), I got the impression that SQLite is not really a first class endpoint for Plotly:

I did manage to get it running against my ergast MySQL container though:

Providing the client with a database name, it loads in the database tables and allows you to query against them. Expanding a table reveals the column names and data type:

After running a query, there’s an option to generate various charts against the result, albeit in a limited way. (I couldn’t label or size elements in the scatter plot, for example.)

The chart types on offer are… a bit meh…:

The result can be viewed as a table, but there are no sort options on the column headers – you have to do that yourself in the query, I guess?

The export options are as you might expect. CSV is there, but not a CSV data package that is bundled with metadata, for example:

All in all, if this is an entry level competitor to Tableau, it’s a very entry level… I’d probably be more tempted to use the browser based Franchise query engine, not least because that also lets you query over CSV files. (That said, from a quick check of the repo, it doesn’t look like folk are working on it much:-(.

Far more compelling in quick-query land is a beautiful component from Simon Willison (who since he’s started blogging again has, as ever before, just blown me away with the stuff he turns up and tinkers with): datasette.

This python package lets you post a SQLite file as a queryable API. (Along the way Simon also produced a handy command line routine for loading CSV files into a SQLite3 database: simonw/csvs-to-sqlite.) Out of the box, datasette lets you fire up the API as a local service, or as a remotely hosted one on Zeit Now (which I’ve yet to play with).

In part, this also reminded me of creating simple JSON APIs from a Jupyter notebook, and the appmode Jupyter extension that allows you to run a widgetised notebook as an app. In short, it got me wondering about how services/apps created that way could be packaged and distributed more easily, perhaps using something like Binderhub?

Idly Wondering… Python Packages From Jupyter Notebooks

How can we go about using Jupyter notebooks to create Python packages?

One of the ways of saving a Jupyter notebook is as a python file, which could be handy…

One of the ways of using a Jupyer notebook is to run it inside another notebook by calling it using the %run cell magic – which provides a crude way of importing the contents of one notebook into another.

Another way of using a Jupyter notebook is to treat it as as Python module using a recipe described in Importing Jupyter Notebooks as Modules that hooks into the python import machinery. (It even looks to work with importing notebooks containing code cells that include IPython commands?)

But could we mark up one or more linked notebooks in some way that would build a Python package and zip it up for distribution via pip?

I’ve no idea how it would work, but here’s something related-ish (via @simonw) that creates a command line interface from a Python file: click:

Click is a Python package for creating beautiful command line interfaces in a composable way with as little code as necessary. It’s the “Command Line Interface Creation Kit”. It’s highly configurable but comes with sensible defaults out of the box.

It aims to make the process of writing command line tools quick and fun while also preventing any frustration caused by the inability to implement an intended CLI API.

I guess it could work on Python exported from a Jupyter notebook too?

TO DO: see if I can write a Jupyter notebook that can be used to generate a CLI, (perhaps creating a Jupyter notebook extension to create a CLI from a notebook?)

See also: oschuett/appmode Jupyter extension for creating and launching a simple web app from a Jupyter notebook.

Hmm… also via @simonw, could Zeit Now be used to launch appmode apps, as in Datasette: instantly create and publish an API for your SQLite databases?

Programming, meh… Let’s Teach How to Write Computational Essays Instead

From Stephen Wolfram, a nice phrase to describe the sorts of thing you can create using tools like Jupyter notebooks, Rmd and Mathematica notebooks: computational essays that complements the “computational narrative” phrase that is also used to describe such documents.

Wolfram’s recent blog post What Is a Computational Essay?, part essay, part computational essay,  is primarily a pitch for using Mathematica notebooks and the Wolfram Language. (The Wolfram Language provides computational support plus access to a “fact engine” database that ca be used to pull factual information into the coding environment.)

But it also describes nicely some of the generic features of other “generative document” media (Jupyter notebooks, Rmd/knitr) and how to start using them.

There are basically three kinds of things [in a computational essay]. First, ordinary text (here in English). Second, computer input. And third, computer output. And the crucial point is that these three kinds of these all work together to express what’s being communicated.

In Mathematica, the view is something like this:


In Jupyter notebooks:

In its raw form, an RStudio Rmd document source looks something like this:

A computational essay is in effect an intellectual story told through a collaboration between a human author and a computer. …

The ordinary text gives context and motivation. The computer input gives a precise specification of what’s being talked about. And then the computer output delivers facts and results, often in graphical form. It’s a powerful form of exposition that combines computational thinking on the part of the human author with computational knowledge and computational processing from the computer.

When we originally drafted the OU/FutureLearn course Learn to Code for Data Analysis (also available on OpenLearn), we wrote the explanatory text – delivered as HTML but including static code fragments and code outputs – as a notebook, and then ‘ran” the notebook to generate static HTML (or markdown) that provided the static course content. These notebooks were complemented by actual notebooks that students could work with interactively themselves.

(Actually, we prototyped authoring both the static text, and the elements to be used in the student notebooks, in a single document, from which the static HTML and “live” notebook documents could be generated: Authoring Multiple Docs from a Single IPython Notebook. )

Whilst the notion of the computational essay as a form is really powerful, I think the added distinction between between generative and generated documents is also useful. For example, a raw Rmd document of Jupyter notebook is a generative document that can be used to create a document containing text, code, and the output generated from executing the code. A generated document is an HTML, Word, or PDF export from an executed generative document.

Note that the generating code can be omitted from the generated output document, leaving just the text and code generated outputs. Code cells can also be collapsed so the code itself is hidden from view but still available for inspection at any time:

Notebooks also allow “reverse closing” of cells—allowing an output cell to be immediately visible, even though the input cell that generated it is initially closed. This kind of hiding of code should generally be avoided in the body of a computational essay, but it’s sometimes useful at the beginning or end of an essay, either to give an indication of what’s coming, or to include something more advanced where you don’t want to go through in detail how it’s made.

Even if notebooks are not used interactively, they can be used to create correct static texts where outputs that are supposed to relate to some fragment of code in the main text actually do so because they are created by the code, rather than being cut and pasted from some other environment.

However, making the generative – as well as generated – documents available means readers can learn by doing, as well as reading:

One feature of the Wolfram Language is that—like with human languages—it’s typically easier to read than to write. And that means that a good way for people to learn what they need to be able to write computational essays is for them first to read a bunch of essays. Perhaps then they can start to modify those essays. Or they can start creating “notes essays”, based on code generated in livecoding or other classroom sessions.

In terms of our own learnings to date about how to use notebooks most effectively as part of a teaching communication (i.e. as learning materials), Wolfram seems to have come to many similar conclusions. For example, try to limit the amount of code in any particular code cell:

In a typical computational essay, each piece of input will usually be quite short (often not more than a line or two). But the point is that such input can communicate a high-level computational thought, in a form that can readily be understood both by the computer and by a human reading the essay.

...

So what can go wrong? Well, like English prose, can be unnecessarily complicated, and hard to understand. In a good computational essay, both the ordinary text, and the code, should be as simple and clean as possible. I try to enforce this for myself by saying that each piece of input should be at most one or perhaps two lines long—and that the caption for the input should always be just one line long. If I’m trying to do something where the core of it (perhaps excluding things like display options) takes more than a line of code, then I break it up, explaining each line separately.

It can also be useful to "preview" the output of a particular operation that populates a variable for use in the following expression to help the reader understand what sort of thing that expression is evaluating:

Another important principle as far as I’m concerned is: be explicit. Don’t have some variable that, say, implicitly stores a list of words. Actually show at least part of the list, so people can explicitly see what it’s like.

In many respects, the computational narrative format forces you to construct an argument in a particular way: if a piece of code operates on a particular thing, you need to access, or create, the thing before you can operate on it.

[A]nother thing that helps is that the nature of a computational essay is that it must have a “computational narrative”—a sequence of pieces of code that the computer can execute to do what’s being discussed in the essay. And while one might be able to write an ordinary essay that doesn’t make much sense but still sounds good, one can’t ultimately do something like that in a computational essay. Because in the end the code is the code, and actually has to run and do things.

One of the arguments I've been trying to develop in an attempt to persuade some of my colleagues to consider the use of notebooks to support teaching is the notebook nature of them. Several years ago, one of the en vogue ideas being pushed in our learning design discussions was to try to find ways of supporting and encouraging the use of "learning diaries", where students could reflect on their learning, recording not only things they'd learned but also ways they'd come to learn them. Slightly later, portfolio style assessment became "a thing" to consider.

Wolfram notes something similar from way back when...

The idea of students producing computational essays is something new for modern times, made possible by a whole stack of current technology. But there’s a curious resonance with something from the distant past. You see, if you’d learned a subject like math in the US a couple of hundred years ago, a big thing you’d have done is to create a so-called ciphering book—in which over the course of several years you carefully wrote out the solutions to a range of problems, mixing explanations with calculations. And the idea then was that you kept your ciphering book for the rest of your life, referring to it whenever you needed to solve problems like the ones it included.

Well, now, with computational essays you can do very much the same thing. The problems you can address are vastly more sophisticated and wide-ranging than you could reach with hand calculation. But like with ciphering books, you can write computational essays so they’ll be useful to you in the future—though now you won’t have to imitate calculations by hand; instead you’ll just edit your computational essay notebook and immediately rerun the Wolfram Language inputs in it.

One of the advantages that notebooks have over some other environments in which students learn to code is that structure of the notebook can encourage you to develop a solution to a problem whilst retaining your earlier working.

The earlier working is where you can engage in the minutiae of trying to figure out how to apply particular programming concepts, creating small, playful, test examples of the sort of the thing you need to use in the task you have actually been set. (I think of this as a "trial driven" software approach rather than a "test driven* one; in a trial,  you play with a bit of code in the margins to check that it does the sort of thing you want, or expect, it to do before using it in the main flow of a coding task.)

One of the advantages for students using notebooks is that they can doodle with code fragments to try things out, and keep a record of the history of their own learning, as well as producing working bits of code that might be used for formative or summative assessment, for example.

Another advantage is that by creating notebooks, which may include recorded fragments of dead ends when trying to solve a particular problem, is that you can refer back to them. And reuse what you learned, or discovered how to do, in them.

And this is one of the great general features of computational essays. When students write them, they’re in effect creating a custom library of computational tools for themselves—that they’ll be in a position to immediately use at any time in the future. It’s far too common for students to write notes in a class, then never refer to them again. Yes, they might run across some situation where the notes would be helpful. But it’s often hard to motivate going back and reading the notes—not least because that’s only the beginning; there’s still the matter of implementing whatever’s in the notes.

Looking at many of the notebooks students have created from scratch to support assessment activities in TM351, it's evident that many of them are not using them other than as an interactive code editor with history. The documents contain code cells and outputs, with little if any commentary (what comments there are are often just simple inline code comments in a code cell). They are barely computational narratives, let alone computational essays; they're more of a computational scratchpad containing small code fragments, without context.

This possibly reflects the prior history in terms of code education that students have received, working "out of context" in an interactive Python command line editor, or a traditional IDE, where the idea is to produce standalone files containing complete programmes or applications. Not pieces of code, written a line at a time, in a narrative form, with example output to show the development of a computational argument.

(One argument I've heard made against notebooks is that they aren't appropriate as an environment for writing "real programmes" or "applications". But that's not strictly true: Jupyter notebooks can be used to define and run microservices/APIs as well as GUI driven applications.)

However, if you start to see computational narratives as a form of narrative documentation that can be used to support a form of literate programming, then once again the notebook format can come in to its own, and draw on styling more common in a text document editor than a programming environment.

(By default, Jupyter notebooks expect you to write text content in markdown or markdown+HTML, but WYSIWYG editors can be added as an extension.)

Use the structured nature of notebooks. Break up computational essays with section headings, again helping to make them easy to skim. I follow the style of having a “caption line” before each input. Don’t worry if this somewhat repeats what a paragraph of text has said; consider the caption something that someone who’s just “looking at the pictures” might read to understand what a picture is of, before they actually dive into the full textual narrative.

As well as allowing you to create documents in which the content is generated interactively - code cells can be changed and re-run, for example - it is also possible to embed interactive components in both generative and generated documents.

On the one hand, it's quite possible to generate and embed an interactive map or interactive chart that supports popups or zooming in a generated HTML output document.

On the other, Mathematica and Jupyter both support the dynamic creation of interactive widget controls in generative documents that give you control over code elements in the document, such as sliders to change numerical parameters or list boxes to select categorical text items. (In the R world, there is support for embedded shiny apps in Rmd documents.)

These can be useful when creating narratives that encourage exploration (for example, in the sense of  explorable explantations, though I seem to recall Michael Blastland expressing concern several years ago about how ineffective interactives could be in data journalism stories.

The technology of Wolfram Notebooks makes it straightforward to put in interactive elements, like Manipulate, [interact/interactive in Jupyter notebooks] into computational essays. And sometimes this is very helpful, and perhaps even essential. But interactive elements shouldn’t be overused. Because whenever there’s an element that requires interaction, this reduces the ability to skim the essay."

I've also thought previously that interactive functions are a useful way of motivating the use of functions in general when teaching introductory programming. For example, An Alternative Way of Motivating the Use of Functions?.

One of the issues in trying to set up student notebooks is how to handle boilerplate code that is required before the student can create, or run, the code you actually want them to explore. In TM351, we preload notebooks with various packages and bits of magic; in my own tinkerings, I'm starting to try to package stuff up so that it can be imported into a notebook in a single line.

Sometimes there’s a fair amount of data—or code—that’s needed to set up a particular computational essay. The cloud is very useful for handling this. Just deploy the data (or code) to the Wolfram Cloud, and set appropriate permissions so it can automatically be read whenever the code in your essay is executed.

As far as opportunities for making increasing use of notebooks as a kind of technology goes, I came to a similar conclusion some time ago to Stephen Wolfram when he writes:

[I]t’s only very recently that I’ve realized just how central computational essays can be to both the way people learn, and the way they communicate facts and ideas. Professionals of the future will routinely deliver results and reports as computational essays. Educators will routinely explain concepts using computational essays. Students will routinely produce computational essays as homework for their classes.

Regarding his final conclusion, I'm a little bit more circumspect:

The modern world of the web has brought us a few new formats for communication—like blogs, and social media, and things like Wikipedia. But all of these still follow the basic concept of text + pictures that’s existed since the beginning of the age of literacy. With computational essays we finally have something new.

In many respects, HTML+Javascript pages have been capable of delivering, and actually delivering, computationally generated documents for some time. Whether computational notebooks offer some sort of step-change away from that, or actually represent a return to the original read/write imaginings of the web with portable and computed facts accessed using Linked Data?

Some Recent Noticings From the Jupyter Ecosystem

Over the last couple of weeks, I’ve got back into the speaking thing, firstly at an OU TEL show’n’tell event, then at a Parliamentary Digital Service show’n’tell.

In each case, the presentation was based around some of the things you can do with notebooks, one of which was using the RISE extension to run a notebook as an interactive slideshow: cells map on to slides or slide elements, and code cells can be executed live within the presentation, with any generated cell outputs being displayed in the slide.

RISE has just been updated to include an autostart mode that can be demo’ed if you run the RISE example on Binderhub.

Which brings me to Binderhub. Originally know as MyBinder, Binderhub takes the MyBinder idea of building a Docker image based on the build specification and content files contained in a public Github repository, and launching a Docker container from that image. Binderhub has recently moved into the Jupyter ecosystem, with the result that there are several handy spin-off command line components; for example, jupyter-repo2docker lets you build, and optionally push and/or launch, a local image from a Github repository or a local repository.

To follow on from my OU show’n’tell, I started putting together a set of branches on a single repository (psychemedia/showntell) that will eventually(?!) contain working demos of how to use Jupyter notebooks as part of “generative document” workflow in particular topic areas. For example, for authoring texts containing rich media assets in a maths subject area, or music. (The environment I used for the shown’n’tell was my own build (checks to make sure I turned that cloud machine off so I’m not still paying for it!), and I haven’t got working Binderhub environments for all the subject demos yet. If anyone would like to contribute to setting up the builds, or adding to subject specific demos, please get in touch…)

I also prepped for the PDS event by putting together a Binderhub build file in my psychemedia/parlihacks repo so (most of) the demo code would work on Binderhub. I think the only think that doesn’t work at the moment is the Shiny app demo? This includes an RStudio environment, launched from the Jupter notebooks New menu. (For an example, see the binder-examples/dockerfile-rstudio demo.)

So – long and short of that – you can create multiple demo environments in a single Github repo using a different branch for each demo, and then launch them separately using Binderhub.

What else…?

Oh yes, a new extension gives you a Shiny like workflow for creating simple apps from a Jupyter notebook: appmode. This seems to complement the Jupyter dashboards approoach, by providing an “app view” of a notebook that displays the content of markdown cells and code cell outputs, but hides the code cell contents. So if you’e been looking for a Jupyter notebook equivalent to R/shiny app development, this may get you some of the way there… (One of the nice things about the app view is that you can easily “View Source” – and modify that source…)

Possibly related to the appmode way of doing things, one thing I showed in the PDS show’n’tell was how notebooks can be used to define simple API services using the jupyter/kernel_gateway (example). These seem to run okay – locally at least – inside Binderhub, although I didn’t try calling a Jupyter API service from outside the container. (Maybe they can be made publicly available via the jupyterhub/nbserverproxy? Why’s this relevant to appmode? My thinking is architecturally you could separate out concerns, having one or more notebooks running an API that is consumed from the appmode notebook?

Another recent announcement came from Google in the form of Colaboratory, a “research project created to help disseminate machine learning education and research”. The environment is “a Jupyter notebook environment that requires no setup to use”, although it does require registration to run notebook cells, and there appears to be a waiting list. The most interesting thing, perhaps, is the ability to collaboratively work on notebooks shared with other people across Google Drive. I think this is separate from the jupyterlab-google-drive initiative, which is looking to offer a similar sort of shared working, again through Google Drive?

By the by, it’s probably also worth noting that other big providers make notebooks available, such as Microsoft (notebooks.azure.com) and IBM (eg datascientistworkbench.com, cognitiveclass.ai; digging around, cognitiveclass.ai seems to be a rebranding of bigdatauniversity.com).

There are other hosted notebook servers relevant to education too: CoCalc (previously SageMathCloud) offers a free way in, as does gryd.us if you have a .edu email address. pythonanywhere.com/ offers notebooks to anyone on a paid plan.

It also seems like there are services starting to appear that offer free notebooks as well as compute power for research/scientific computing on a model similar to CoCalc (free tier in, then buy credits for additional services). For example, Kogence.

For sharing notebooks, I also just spotted Anaconda Cloud, which looks like it could be an interesting place to browse every so often…

Interesting times…

Sharing Goes Both Ways – No Secrets Social

A long time ago, I wrote a post on Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There that describes how publishers who host third party javascript on their website allow those third parties to track your visits to those websites.

This means I can’t just visit the UK Parliament website unnoticed, for example. Google get told about every page I visit on the site.

(I’m still not clear about the extent to which my personal Google identity (the one I log into Google with), my advertising Google identity (the one that collects information about the ads I’ve been shown and the pages I’ve visited that run Google ads), and my analytics Google identity (the one that collects information about the pages I’ve visited that run Google Analytics and that may be browser specific?) are: a) reconciled? b) reconcilable? I’m also guessing if I’m logged in to Chrome, my complete browsing history in that browser is associated with my Google personal identity?)

The Parliament website is not unusual in this respect. Google Analytics are all over the place.

In a post today linked to by @charlesarthur and yesterday by O’Reilly Radar, Gizmodo describes How Facebook Figures Out Everyone You’ve Ever Met.

One way of doing this is similar to the above, in the sense of other people dobbing you in.

For example, if you appear in the contacts on someone’s phone, and they allowed Facebook to “share” their phone contact details when they install the Facebook app (which many people do), Facebook gains access firstly to my contact details and secondly to the fact that I stand in some sort of relationship to you.

Facebook also has the potential to log that relationship against my data, even if I have never declared that relationship to Facebook.

So it’s not “my data” at all, in the sense of me having informed Facebook about the fact. It’s data “about me” that Facebook has collected from wherever it can.

I can see what I’ve told Facebook on my various settings pages, but I can’t see the “shadow information” that Facebook has learned about me from other people. Other than through taunts from Facebook about what it thinks it knows about me, such as friend suggestions for people it thinks I probably know (“People You May Know”), for example…

…or facts it might have harvested from people’s interactions with me. When did you, along with others, last wish someone “Happy Birthday” using social media, for example?

Even if individuals are learning how to use social media platforms to keep secrets from each other (Secrets and Lies Amongst Facebook Friends – Surprise Party Planning OpSec), those secrets are not being held from Facebook. Indeed, they may be announcing those secrets to it. (Is there a “secret party” event type?! For example, create a secret party event and then as the first option list the person or persons who should not be party to the details so Facebook can help you maintain the secrecy…?)

Hmm… thinks… when you know everything, you can use that information to help subsets of people keep secrets from intersecting sets of people? This is just like a twist on user and group permissions on multi-user computer systems,  but rather than using the system to grant or limit access to resources, you use it to control information flows around a social graph where the users set the access permissions on the information.

This is not totally unlike targeting ads (“dark ads”) to specific user groups, ads that are unseen by anyone outside those groups. Hmmm…

 

See also: Ad-Tech – A Great Way in To OSINT

Do Special Interest Groups Reveal Their Hand in Parliamentary Debate?

Mulling over committee treemaps – code which I really need to revisit and package up somewhere – I started wondering…

…can we use the idea of treemap displays as a way in to help think about how interest groups may – or may not – reveal themselves in Parliament?

For example suppose we had a view over Parliamentary committees, or AAPGs. Add another level of structure to the treemap display showing members of each each committee with cells of equal area for each member. Now, in a debate, if any of the members of the committee speak, highlight the cell for that member.

With a view over all committees, if the members of a particular committee, or particular APPG, lit up, or didn’t light up, we might be able to to start asking whether the representation from those members was statistically unlikely.

(We could do the same for divisions, displaying how each member voted and then seeing whether that followed party lines?)

From mulling over this visually inspired insight, we wouldn’t actually need to use treemaps, of course. We could just run a query over the data and do some counting, creating “lenses” that show how particular interest groups or affiliations (committees, APPGs, etc) are represented in a debate?

Ad-Tech – A Great Way in To OSINT

Open Source Intelligence – OSINT – is intelligence that can be collected from public sources. That is to say, OSINT is the sort of intelligence that you should be able to collect using a browser and a public or academic library that also provides access to public subscription content. (For an intro to OSINT, see for example Sailing the Sea of OSINT in the Information Age; for example context, Threat Intelligence: Collecting, Analysing, Evaluating). OSINT can be used as much by corporates as by the security services. It’s also up for grabs by journalists, civil society activists and stalkers…

Looking at the syllabus for a OSINT beginners course, such as IMSL’s Basic Open Source (OSINT) Research & Analysis Tradecraft turns up the sorts of thing you might also expect to see as part of one of Phil Bradley or Karen Blakeman’s ILI search workshops:

  • Appreciation of the OS environment
    • Opportunities, Challenges and Threats
  • Legal and Ethical Guidance
  • Search Tradecraft
    • Optimising Search
    • Advanced Search Techniques
  • Profile Management and Risk Reduction
    • Technical Anonymity/Low Attribution
    • Security Tradecraft
  • Social Media exploitation
    • Orientation around the most commonly used platforms Twitter, Facebook, LinkedIn etc.
    • Identifying influence
    • Event monitoring
    • Situational Awareness
    • Emerging social media platforms
  • Source Evaluation
    • Verifying User Generated Content on Social Media

And as security consultant Bruce Schneier beautifully observed in 2014, [s]urveillance is the business model of the Internet.

What may be surprising, or what may help explain in part their dominance, is that a large part of the surveillance capability the webcos have developed is something they’re happy to share to with the rest of us. Things like social media exploitation, for example, allow you to easily identify social relationships, and pick up personal information along the way (“Happy Birthday, sis..”). You can also identify whereabouts (“Photo of me by the Eiffel Tower earlier to day”), captioned or not – Facebook and Google will both happily tag your photos for you to make them, and the information, or intelligence, they contain more discoverable.

Part of the reason that the web companies have managed to grow so large is that they operate very successful two-sided markets. As the FT Lexicon defines it, these are markets that provide “a meeting place for two sets of agents who interact through an intermediary or platform”. In the case of the web cos, “social users” who gain social benefit from interacting with each other through the platform, and the advertisers who pay the platform to advertise to the social users (Some Notes on Churnalism and a Question About Two Sided Markets).

A naive sort of social media intelligence would focus, I think, on what can be learned simply through the publicly available activity on the social user side of the platform, albeit activity that may be enriched through automatic tagging by the platform itself.

But there is the other side of the platform to consider too. And the tools on that side of the platform, the tools developed for the business users, are out and out designed to provide the business users – the advertisers – with intelligence about the social users.

Which is all to say that if surveillance is your thing, then ADINT – Adtech Intelligence – could be a good OSINT way in, as a recent paper from the Paul G. Allen School of Computer Science & Engineering, University of Washington describes: ADINT: Using Targeted Advertising for Personal Surveillance (read the full paper; Wired also picked up the story: It Takes Just $1,000 to Track Someone’s Location With Mobile Ads). Here’s the paper abstract:

Targeted advertising is at the heart of the largest technology companies today, and is becoming increasingly precise. Simultaneously, users generate more and more personal data that is shared with advertisers as more and more of daily life becomes intertwined with networked technology. There are many studies about how users are tracked and what kinds of data are gathered. The sheer scale and precision of individual data that is collected can be concerning. However, in the broader public debate about these practices this concern is often tempered by the understanding that all this potentially sensitive data is only accessed by large corporations; these corporations are profit-motivated and could be held to account for misusing the personal data they have collected. In this work we examine the capability of a different actor — an individual with a modest budget — to access the data collected by the advertising ecosystem. Specifically, we find that an individual can use the targeted advertising system to conduct physical and digital surveillance on targets that use smartphone apps with ads.

The attack is predicated in part around knowing the MAID – the Mobile Advertising ID (MAID) – of a user you want to track, and several strategies are described for obtaining that.

I haven’t looked at adservers for a long time (or Google Analytics for that matter), so I thought I’d have a quick look at what the UIs support. So for example, Google AdWords seems to offer quite a simple range of tools, that presumably let me target based on various things, like demographics:

or location:

or time:

It also looks like I can target ads based on apps a user users:

or websites they visit:

though it’s not clear to me if I need to be the owner of those apps or webpages?

If I know someone’s email address, it also looks like I can use that to vector an ad towards them? Which means Google cookies presumably associate with an email address?

This email vectoring is actually part of Google’s “Customer Match” offering, which “lets you show ads to your customers based on data about those customers that you share with Google”.

So how about Facebook? As you might expect, there’s a range of audience targeting categories that draw heavily on the information users provide to the system:

(You’ve probably heard the slogan “if you aren’t paying for the product, you are the product” and thought nothing of it. Are you starting to feel bought and sold, yet?)

Remember that fit of anger, or joy, when you changed your relationship, maybe also flagging a life event (= valuable to advertisers)?

Or maybe when you bought that thing (is there a Facebook Pay app yet, to make this easier for Facebook to track?):

And of course, there’s location:

If you fancy exploring some more, the ADINT paper has a handy table summarising what’s offered by various other adtech providers:

On the other hand, if you want to buy readymade audiences from a data aggregator, try the Oracle Data Marketplace. It looks as if they’ll happily resell you audiences derived from Experian data, for example:

So I’m wondering, what other sorts of intelligence operation could be mounted against a targeted individual using adtech more generally? And what sorts of target identification can be achieved through a creative application of adtech, and maybe some simple phishing to entice a particular user onto a web page you control and which you can use to grab some preliminary tracking information from targeted users you entice there?

Presumably, once you can get your hooks into a user, maybe by enticing them to a web page that you have set up to show your ad so that the adserver can spear the user, you can also use ad retargeting or remarketing (that follows users around the web, in the sense of continuing to show them ads from a particular campaign) to keep a tail on them?

[This post was inspired by an item on Mike Caulfield’s must read Traces weekly email newsletter. Subscribe to his blog – Hapgood – for a regular dose of digital infoskills updating. You might also enjoy his online book Web Literacy for Student Fact-Checkers.]