Jared L. Howland

Small Data Management: Master Data for Better Collection Analysis

Electronic Resources and Libraries 2018 Conference

Given March 2018 in Austin, TX

Download: PowerPoint | PDF | Handout

Slide 001

Good afternoon! Thank you all for being here. I’m Jared Howland, and I’m the collection development coordinator at Brigham Young University.

Slide 002

Before I begin, I want to make sure that everyone watching the live stream can access the files I’ll be using today. You can get them by going to the conference scheduler, logging in, and downloading the files from the session page.

Slide 003

Slide 004

Slide 005

The second and third parts constituted the actual military diploma, which was typically made from two bronze tablets and presented to the retired soldier, possibly in a ceremony.

Slide 006

The front of the first plate would declare the veteran a legal Roman citizen.

Slide 007

The reverse side of the first plate would carry an exact copy of the text of the outer engraving.

Slide 008

The inside of the second plate included the names of 7 witnesses, which were sealed with organic material and then covered and protected by metal strips.

Slide 009

After that, the two plates were bound with wire.

Slide 010

And then they were sealed together.

Slide 011

Slide 012

  1. First, people have been thinking about how to store and verify information for a very, very long time. The Roman military diploma is an ancestor of the principle of master data management, which we will discuss in a bit. Though this is not a new problem, the scale and scope of the technology for handling data continue to change drastically.
  2. Second, librarians, like the archivists at the military archive in Rome, have also been thinking about data management problems for a very long time. The progression, and increasing importance, of ideas such as “big data” and “data management” over the last decade has demanded that librarians learn how to manage researchers’ data.

Slide 013

Slide 014

Slide 015

  1. Have you ever had to wait for a colleague to get back from vacation to get data or numbers from them to finish a report you were working on?
  2. Or similarly, have you ever had to contact a colleague for a data set that would be very useful if it were made available to everyone working in the library?

Slide 016

Slide 017

Slide 018

If you answered yes to any of those questions, then you have encountered a data management problem. These may have been rhetorical questions, but they certainly are not hypothetical. I have encountered each of these, among many others.

Slide 019

Slide 020

Slide 021

Slide 022

Slide 023

Other features:

  1. There are both public and private data repositories
  2. Fine-grained controls for who can and cannot read and write to the data repository are available
  3. You can add comments and documentation to data sets and even to specific data points
  4. There is a built-in troubleshooting ticket system
  5. You can create different, parallel, versions of the same data set depending on the audience and how the data was analyzed

Slide 024

Other features:

  1. There are both public and private data repositories
  2. You can tag, group, and place data into organizations
  3. Faceted searching is available for the entire data repository
  4. CKAN has an accessible and understandable user interface
  5. Extensions are available to greatly expand the default capabilities of the system, and if you have the technical expertise, you can build your own to meet a specific need

Slide 025

Let’s take a look at how we have used each of these systems. First, let’s review GitHub and how we have experimented with it as an MDM tool.

Slide 026

We created a repository that includes all of our data sets. In this case, there are only 2 data sets: ‘arl’ and ‘serial-prices’.

Slide 027

There are 3 branches for the data sets:

  1. Normalized: data that has been cleaned up and normalized to make sure it is ready to be analyzed
  2. Projects: a parallel data set that presents the data in different ways depending on the audience. A trivial example is a line chart vs. a bar graph. In practice, we use the same data in different contexts all the time. By having branches for a single data set, you can create as many variations and interpretations of the data as you need without having to wonder where the data for an analysis came from.
  3. Raw: the raw data as you got it from the source

Below the branches you see documentation on how the GitHub repository is set up and organized for use as a rudimentary master data management system.
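For anyone who wants to try the same layout, the three branches described above can be created with a handful of git commands. The Python snippet below is only a minimal sketch, not the setup we actually used: the helper names `branch_setup_commands` and `run_all` and the empty initial commit are assumptions made for illustration, and only the three branch names come from the talk.

```python
import shlex
import subprocess
from typing import List

# Branch names taken from the talk; everything else is a hypothetical example.
BRANCHES = ["raw", "normalized", "projects"]


def branch_setup_commands(branches: List[str]) -> List[List[str]]:
    """Return the git commands that create one branch per data stage."""
    commands = [
        ["git", "init"],
        # A branch needs a commit to point at, so start with an empty one.
        ["git", "commit", "--allow-empty", "-m", "Initial commit"],
    ]
    commands += [["git", "branch", name] for name in branches]
    return commands


def run_all(commands: List[List[str]], repo_dir: str) -> None:
    """Run each command inside repo_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=repo_dir, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so nothing is changed on disk.
    for command in branch_setup_commands(BRANCHES):
        print(shlex.join(command))
```

Calling `run_all(branch_setup_commands(BRANCHES), "path/to/empty/folder")` would execute the commands; it assumes git is installed and a committer name and email are already configured.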

Slide 028

So let’s take a look at our ARL data set.

Slide 029

Slide 030

Slide 031

Let’s say we are interested in the ‘collection-expenditures’ data. We click on that…

Slide 032

Slide 033

Slide 034

So, that’s GitHub. Let’s talk a little about how we have been experimenting with CKAN.

Slide 035

CKAN includes a lot of features:

  1. Keyword searching across all data sets
  2. Tags that you can create for each data set
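The keyword and tag searching listed above is also reachable from a script through CKAN's Action API, whose `package_search` action accepts a `q` keyword query and an `fq` filter query. The following is a minimal sketch under stated assumptions: the host `ckan.example.org`, the tag `byu`, and the helper names are hypothetical, and your own CKAN instance may require an API key for private data sets.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, keyword, tag=None, rows=10):
    """Build a package_search URL for a keyword, optionally filtered by tag."""
    params = {"q": keyword, "rows": rows}
    if tag:
        # fq is CKAN's filter query; here it restricts results to one tag.
        params["fq"] = f"tags:{tag}"
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{urlencode(params)}"


def dataset_titles(response):
    """Pull data set titles out of a decoded package_search response."""
    if not response.get("success"):
        raise RuntimeError("CKAN reported a failed search")
    results = response["result"]["results"]
    # Fall back to the URL-safe name when a data set has no title.
    return [item.get("title") or item["name"] for item in results]


def search(base_url, keyword, tag=None):
    """Query a live CKAN instance and return the matching titles."""
    with urlopen(build_search_url(base_url, keyword, tag)) as reply:
        return dataset_titles(json.load(reply))


if __name__ == "__main__":
    # Hypothetical host and tag; only the URL is printed, no request is sent.
    print(build_search_url("https://ckan.example.org", "class schedule", tag="byu"))
```

Splitting the URL building and the response parsing from the network call keeps the pieces easy to check without access to a running CKAN server.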

Slide 036

Slide 037

Slide 038

Slide 039

And let’s say we are interested in the “BYU Class Schedule” data set, so let’s click on that one.

Slide 040

Slide 041

Slide 042

If you click on explore…

Slide 043

Slide 044

Slide 045

Slide 046

Slide 047

Slide 048