From Elsewhere: Archiving Twitter

Via an Inkdroid post on The Ferguson Principles, this handy suite of tools for archiving and normalising Twitter streams:

  • twarc – a command line tool for collecting tweets from Twitter’s search and streaming APIs; it can also collect threaded conversations and user profile information. It also comes with a kitchen sink of utilities contributed by members of the community.
  • Catalog – a clearinghouse of Twitter identifier datasets that live in institutional repositories around the web. These have been collected by folks like the University of North Texas, George Washington University, UC Riverside, University of Maryland, York University, Society of Catalan Archivists, University of Virginia, University of Puerto Rico, North Carolina State University, University of Alberta, Library and Archives Canada, and more.
  • Hydrator – A desktop utility for turning tweet identifier datasets (from the Catalog) back into structured JSON and CSV for analysis. It was designed to be able to run for weeks on your laptop, to slowly reassemble a tweet dataset, while respecting Twitter’s Terms of Service, and users’ right to be forgotten.
  • unshrtn – A microservice that makes it possible to bulk normalize and extract metadata from a large number of URLs.
  • DiffEngine – a utility that tracks changes on a website using its RSS feed, and publishes these changes to Twitter and Mastodon. As an example see whitehouse_diff, which announces changes to the Executive orders made on the White House blog.
  • DocNow – An application (still under development) that allows archivists to observe Twitter activity, do data collection, analyze referenced web content, and optionally send it off to the Internet Archive to be archived.
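
To make the "hydration" idea a little more concrete, here is a minimal Python sketch of the general pattern: read a file of tweet identifiers, group them into batches, and pass each batch to a lookup function. This is not how Hydrator or twarc are actually implemented. The function names (read_ids, batches, hydrate) and the lookup callable are hypothetical stand-ins, and the batch size of 100 is an assumption based on the per-request limit of Twitter's v1.1 lookup endpoint. The point it illustrates is that only tweets still available at lookup time come back, which is how hydration ends up honouring deletions.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def read_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield tweet IDs from lines of text, skipping blanks and duplicates."""
    seen = set()
    for line in lines:
        tweet_id = line.strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def batches(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Group IDs into lists of at most `size` items (100 is an assumed limit)."""
    batch: List[str] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly short, batch
        yield batch


def hydrate(
    ids: Iterable[str],
    lookup: Callable[[List[str]], List[Dict]],
    size: int = 100,
) -> Iterator[Dict]:
    """Call the (hypothetical) `lookup` once per batch and yield what it returns.

    IDs that `lookup` does not return (deleted or protected tweets, say)
    are simply absent from the output.
    """
    for batch in batches(ids, size):
        yield from lookup(batch)


if __name__ == "__main__":
    # Stand-in for a real API call: pretends tweet "2" has been deleted.
    def fake_lookup(batch: List[str]) -> List[Dict]:
        return [{"id": i} for i in batch if i != "2"]

    sample = ["1\n", "2\n", "\n", "2\n", "3\n"]
    for tweet in hydrate(read_ids(sample), fake_lookup, size=2):
        print(tweet)
```

In a real workflow the lookup callable would wrap an authenticated API client and respect rate limits, which is why a large dataset can take days or weeks to reassemble.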

The post further remarks:

These tools emerged as part of doing work with social media archives. Rather than building one tool that attempts to solve some of the many problems of archiving social media, we wanted to create small tools that fit particular problems, and could be composed into other people’s projects and workflows.

Handy…

And what of the principles mentioned in the original post’s title?

  1. Archivists must engage and work with the communities they wish to document on the web. Archives are often powerful institutions. Attention to the positionality of the archive vis-à-vis content creators, particularly in the case of protest, is a primary consideration that can guide efforts at preservation and access.
  2. Documentation efforts must go beyond what can be collected without permission from the web and social media. Social media collected with the consent of content creators can form a part of richer documentation efforts that include the collection of oral histories, photographs, correspondence, and more. Simply telling the story of what happens in social media is not enough, but it can be a useful start.
  3. Archivists should follow social media platforms’ terms of service only where they are congruent with the values of the communities they are attempting to document. What is legal is not always ethical, and what is ethical is not always legal. Context, agency and (again) positionality matter.
  4. When possible, archivists should apply traditional archival practices such as appraisal, collection development, and donor relations to social media and web materials. It is hard work adapting these concepts to the collection of social media content, but they matter now, more than ever.

These arise from trying to address several challenges associated with [p]reserving web and social media content in ethical ways that protect already marginalized people (Documenting the Now Ethics White Paper):

  1. User awareness (or informed consent) of how social media platforms use their data or how it can be collected and accessed by third parties.
  2. Potential for fraudulent use and manipulation of social media content.
  3. Heightened potential of harm for members of marginalized communities when those individuals participate in activities such as protests and other forms of civil disobedience that are traditionally heavily monitored by law enforcement.
  4. Difficulty of applying traditional archival practices to social media content given the sheer volume of data and complicated logistics of interacting with content creators.

The white paper can be found here: Documenting The Now White Paper — Ethical Considerations for Archiving Social Media Content Generated by Contemporary Social Movements: Challenges, Opportunities, and Recommendations [PDF].