A DDI Primer: An overview and examples of DDI in action

[Barry] Good afternoon everybody and thank you
for joining our webinar about the Data Documentation Initiative. The title of
the presentation is a DDI Primer: An Overview and Examples of DDI in Action. My name is Barry Radler, I’ll be joined
at various points of the webinar by Jared Lyle and Jon Johnson. To begin with
a very brief overview of the webinar. We’re going to start by discussing
barriers to sharing. This is really the whole reason for being of a metadata
standard, to facilitate sharing in a networked environment. We’ll briefly
describe DDI. We’ll illustrate how DDI is exploited in two general use cases. One is
more germane to the Data Fair audience, I think, and that use case is how an
archive like ICPSR uses DDI. The other example is how DDI benefits individual
research projects and we’ll wrap up with a summary of takeaway points and then
open the floor to questions. So beginning with barriers to sharing of data and
metadata. First, the simple premise of the importance of metadata is that data
are largely meaningless without metadata. Metadata are critical not only for
secondary use or sharing with others, but they’re just as important for the data
producer to document the provenance of their own data and research projects. The
fundamental importance of good metadata is illustrated with a visual that I
really like and one that I attribute to co-presenter Jon Johnson. Metadata are like
punctuation… next slide… for your data. Metadata clarify relationships, they
provide structure and meaning to research information that isn’t apparent
in its rawer form. Now while metadata are critical to understanding and sharing,
barriers still prevent the wide production and use of metadata. In the
research pipeline various stakeholders use different systems that don’t effectively translate metadata across those systems. Metadata doesn’t always travel
well between systems and organizations. The next slide is an illustration of
these barriers. It shows the different stakeholders and their specific
responsibilities or roles throughout the research data life cycle. At each
of these transitions there’s a potential for a bad hand off of metadata and these
gaps result in information loss, again due to systems and software not
efficiently talking to one another. Even within organizations, within each of
those stakeholders, clear communication can still be a challenge. Different
individuals can be talking about the same thing without realizing it. And
again the next slide is an example that might be somewhat far afield, but it
illustrates how language and descriptive norms can be very
discipline-specific in research. Displayed are 14 synonyms for
multi-level models, very briefly, multi-level models are statistical
models, typically regression ones, of parameters that vary at more than one
level. This example is from my own personal experience. I was discussing MLM analysis with a colleague from a medical background and he kept referring
to clustering models. In contrast, I kept referring to nested models which is how
I understood MLMs. We were talking about the same thing. So enter DDI, the Data
Documentation Initiative is an international standard for describing
social science metadata in distributed network environments. DDI is not a panacea,
however, but it is the metadata standard for social sciences and it’s truly an
international standard. It’s being used by major studies, research consortiums, national statistical agencies, and
archives like ICPSR. So why is DDI so cool? Well, first of all it’s a free and open
standard, it’s most often expressed in XML, and it introduces or provides a
common communication protocol for research processes. DDI also plays
nice with other protocols and established standards which really expands its reach
and applicability to different studies. So these characteristics of DDI make
research data independently understandable and machine actionable.
Independently understandable means that secondary users can understand the data
without help from the data provider. Machine actionable means software and
systems can understand the metadata, and interpret it, and manipulate it, and
search it, intelligently. The next two slides illustrate these benefits
in the context of a simple codebook describing a survey instrument. This is the before DDI shot with a
description of a question in ASCII text format and this is the same information
after markup in DDI. Question elements are identified in standard DDI tags and even
descriptive statistics are displayed. Once such metadata elements are tagged
in a standard way, they become machine actionable and can be displayed in any number of ways, and this is summed up in a phrase we use a lot:
one document, many uses. DDI is designed to describe the research data life cycle, so tools can utilize existing metadata, and it can be reused from project planning, to instrument design, to dataset production, to documentation, and even reused again in the next project. Now moving on to concrete
examples of DDI in action. I’m going to turn it over to Jared to discuss DDI from
the perspective of an archive. [Jared] Thanks Barry. So my name is Jared and I work at ICPSR.
I’m also the Director of the DDI Alliance and I’m going to talk about DDI
and how it’s used at ICPSR, a social and behavioral science data archive. Several of these examples originated with Mary Vardigan, who is our Assistant Director and the previous Director of the DDI Alliance. Archives, as we know, are driven by metadata standards: they allow information to be consistently described, they enable search and discovery, the same information can be reused, and the metadata helps ensure the information is transportable so it can be used by different organizations. Those are the
examples I’ll show with ICPSR in our use of metadata. Specifically, we use a flavor of DDI called DDI Codebook. We have over 8,000 studies, each with study-level and variable-level metadata, and I’ll show examples of that.
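As a small, hedged illustration of what standardized study-level fields enable, here is a sketch of a crosswalk from DDI Codebook elements to Dublin Core terms. The DDI field names (titl, AuthEnty, abstract, distrbtr) follow Codebook conventions, but the mapping is simplified and the study values are invented; this is not ICPSR’s actual export pipeline:

```python
# Sketch: crosswalk study-level DDI Codebook fields to Dublin Core.
# Field names follow DDI Codebook 2.x conventions; the study itself is
# invented for illustration.

DDI_TO_DC = {
    "titl": "dc:title",          # study title
    "AuthEnty": "dc:creator",    # authoring entity / principal investigator
    "abstract": "dc:description",
    "distrbtr": "dc:publisher",  # distributor
    "keyword": "dc:subject",
}

def ddi_to_dublin_core(study: dict) -> dict:
    """Map study-level DDI fields to Dublin Core, dropping fields
    (like sampling or weighting notes) that have no DC slot here."""
    return {DDI_TO_DC[f]: v for f, v in study.items() if f in DDI_TO_DC}

study = {
    "titl": "Example Survey of Well-Being, 1995",
    "AuthEnty": "Example Research Team",
    "abstract": "A hypothetical study description.",
    "sampProc": "National probability sample",  # dropped: no DC equivalent
}

print(ddi_to_dublin_core(study))
```

The same idea extends to MARC or any other target schema: once the fields are tagged consistently, the export is a mechanical transformation.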
Really DDI, the metadata drives much of our functionality within our archive and
within our site. So when something comes in through our deposit form, it’s described using DDI fields. That information, along with the information in the data files, is used to generate study descriptions. So overviews
of the studies or projects, as well as individual variable level information or
metadata. So here’s the deposit form where people fill out information about
a study or a collection and then they submit that to us. We then use that
information to populate our study level metadata. These are some of the elements:
title, summary, specific things like universe, sampling, weighting. The study
level metadata is then leveraged in a variety of ways: through search; in repurposing records through different visual displays and different archives; in interoperability and sharing with other archives around the world; and in creating study overviews. So with search, when you go to the ICPSR search
page all of the results are driven by these DDI metadata. And so you can see
we did a search within our catalog, you can drill down and view a study overview,
this is the study level metadata. We can then repurpose that metadata. So as you saw this is on the main ICPSR
site. We have topical collections which have different displays and also
functionality, but they all use that same underlying metadata. The interoperability,
so we share metadata with others who want to grab it. These are two examples: one is ODESI out of Canada and the other is Dataverse; both have grabbed our DDI files
and populated their catalogs with them. Then the study overviews. When
someone downloads a data collection, we provide them with the study description
file that they can open in a PDF viewer and this also is generated from that
study level DDI metadata. We can export that study description as DDI, and from that file we can create different flavors or versions of it, for instance Dublin Core or MARC records. Moving from study level metadata down to the
variable level DDI elements. So for variable level information within data
collections we can describe variable groupings, question text, summary
statistics, notes. And this is really the rich metadata that makes us unique as an
archive, as a disciplinary repository, where you can not only find high-level
information, but variable specific details about a data file that you can
only grab through that DDI metadata. So this variable level DDI metadata is leveraged for search, searching for variables, searching across the entire collection, and also generating codebooks with frequencies. So with the
search, within a study level homepage there is the capability to view and also
search within the variables of that particular study. You can see here an example of that, these are
variables from a study called Sit-ins and Desegregation in the U.S. South in the
Early 1960s. You can also search across all of the variables that have been
described in our collection, over 4 million of them. This is our Search/Compare Variables
page on our site and as you see if you search for heroin, you get hundreds or
even thousands of results for variables that contain that term. Once you’ve
located variables of interest you can compare them, as here. So you can view not only the variable labels, but also the question text, even the response codes and frequencies and further information. That variable level information can
also be used for instance to show comparability of variables across
studies that have multiple waves or multiple parts. And so it’s another way
of making better sense of a collection and finding previously hidden
information or information that isn’t always obvious. This is from a crosswalk
of data from the American National Election Study and the General Social
Survey and that variable level DDI also helps us generate codebooks or frequency
listings for all of our studies. That’s just another way of showing the users
the specific information in the data files. The DDI helps with unified
searching across all of our collections. So this is a topical archive within
ICPSR where you can, again, search across variables. So it’s a really nice way
for us as an archive to describe and then power search and browse and use
functionality for people who access and use our data. With that, I’m going to
turn it back over to Barry, who’s going to go through a use case having to do with research projects. So, Barry. [Barry] Thanks Jared. The particular use case that I’m going to be discussing is the MIDUS use case… and Jared, can you advance to the next slide? So I’m actually a researcher from MIDUS
and MIDUS is the Midlife in the United States study, a longitudinal multidisciplinary study of health and well-being that began in 1995. Many of the
key strengths of MIDUS that make it a rich data source are the same ones that
pose challenges to managing its data and metadata. And these strengths are:
multiple samples being surveyed longitudinally going forward in time, so
it’s always expanding. MIDUS is based on a comprehensive and integrated theory of
aging, which means a variety of data capture protocols and data types.
Finally, the production of many research outputs. From a data perspective,
MIDUS has experienced tens of thousands of data downloads from ICPSR, which is our
official archive. From a scientific productivity perspective, we’ve generated over 700 scholarly publications. For MIDUS, adequate metadata
is really crucial: for discovery and search across datasets, waves, and disciplines; for harmonization, combining waves and relating equivalent measures.
While DDI doesn’t directly support data extraction or access, we’ve
piggybacked a data extract function on our DDI infrastructure. Some of these functions that I’m going to demonstrate are similar to the
ones that Jared was talking about, but the MIDUS DDI infrastructure is based
on the flavor of DDI called DDI Life Cycle which is very appropriate for
longitudinal or linked studies. And I’ll illustrate that infrastructure in a DDI
based portal that we developed using Colectica software. Here I’ve performed
a search of variables related to smoking, next slide Jared… This function, the search function, searches metadata fields such as question text, variable labels, or assigned concepts. So you don’t need to know MIDUS variable names, you just need to
know basically what your research needs are. Next slide… So the MIDUS Colectica portal provides information on equivalent longitudinal variables, linking them and detailing their comparability or concordance. This provides the user with the information necessary
to reconcile any differences among equivalent variables. And here we see a
table of variables related to smoking. It shows where cross-time versions
exist, where there are gaps in coverage, and why related variables might not be
strictly equivalent, in which case we use notes on comparability
to give the user a heads up. At any point in exploring the portal, a user
can click on the shopping cart icon next to the variable and add it to a variable
basket for download. This function creates a wide-formatted dataset and automatically merges variables from different MIDUS datasets. The cool
thing is that each dataset is accompanied by a customized DDI codebook
and the codebook includes important metadata about versioning, provenance, and harmonization that isn’t included in the dataset’s metadata. So
the MIDUS Colectica portal really demonstrates how rich standardized
metadata under DDI can describe complicated studies in some dynamic ways
and ultimately DDI strengthens the relationship between data and metadata,
thereby improving all types of research activities. Now I’m going to hand off
the reins to my colleague Jon Johnson, who will discuss how he’s using DDI to describe some important longitudinal studies in the UK. [Jon] Yeah, hi, I’m Jon Johnson. I’m from the Centre for Longitudinal Studies, which is one of the partners in CLOSER. I’m also
a Technical Lead on CLOSER. So CLOSER is a project which involves eight UK
longitudinal studies, primarily birth cohorts, but one panel study. It’s a
very different problem to the one that MIDUS faces because it’s a consortium and it’s going across
several different studies. It’s also trying to capture data that’s
going back… what… back to 1930 or so. So, as you can imagine, the way data is collected and stored over those years has changed, and that presents additional problems. One of the aims of the project is to provide access to the best quality metadata we can extract from these
studies and to present them in as consistent a way as possible so there isn’t any sort of cognitive dissonance trying to go from one study to another. The
other challenge of CLOSER has been that it crosses both biomedical and social
sciences. So that’s been interesting in the ways that different domains have collected data. And also, as Barry pointed out earlier, different terms being used for the same thing. So we have to resolve some of those issues in terms of presenting back to researchers. So in terms of what
CLOSER is, it’s eight studies and so far we’re about halfway to two-thirds
of the way through the project. And we’ve got 81 sweeps of data from the various
studies, 50,000 variables, and quite a lot of instruments and questions. Now one of the main things about CLOSER that differentiates it from many other data portals is we really focus on the questions. So instead of just capturing
the question text and the filters, we’ve actually tried to present them in a way that is
contextual. So if you view a question, you see the next and previous questions. So you have a much better understanding of the flow of the questions and also how that relates to the data that’s collected. So if you click on a question, it should be able to take you to the data. Just go to the next slide… and
as you saw in the example Barry was talking about, you get a very
similar presentation of data to both the MIDUS portal and what you see at ICPSR. The difference is that CLOSER is trying to give you a much tighter relationship between the data and, if you like, the context in which it was asked. We try to do that, of course, across
the studies in as seamless a way as possible. We, like MIDUS, are using the Colectica portal, but we use it in a slightly different way. Our workflows are
slightly different, but it’s interesting how, in spite of the differences between
the various studies, we’ve managed to generate processes that allow the studies to contribute metadata that’s consistent, so we get a consistent product for the researcher. Next slide, please. So in terms of the takeaway from
the presentation so far… I think what we’re able to do using a standard like DDI is to improve the usability of data. So in the CLOSER example, you saw you can reuse questions, and you can reuse variables and code schemes from other data. One of the things that MIDUS, ourselves, and also the archive do is use DDI to reduce manual processes; otherwise these things take a very, very long time. I think one of the lessons both MIDUS and CLOSER have learned is that aligning on a standard and getting good processes in place strips out a lot of manual processes, increases accuracy, and reduces costs at the same time, and that’s been very, very valuable for the studies in the short term, but also in the longer term. In terms of distributed data collection, these things are at pretty early stages, but we have found already that DDI
has made a difference in terms of aligning between different data
collection agencies and between different studies. The other thing which you can see from the presentations on the websites of ICPSR, CLOSER, and MIDUS is just the general quality of documentation being produced. It’s of a very, very high quality and, you know, that’s not to be sniffed at in the twenty-first century. Researchers have very good resources they can draw on, and the presentation of documentation is a very important part of engaging researchers, the users, in the data of the studies that we’re involved in. It’s enabled us to build new tools which have helped ease the production process internal to the studies, and I think it has helped the archive to share within the archive, but also to share tools with other data users to enhance data visibility and search capability. Next slide. So if you go to the DDI Alliance website at that web address, there is
a whole range of resources in terms of specifications and training and how to
find other experts on DDI implementation in your area. For more information, you can contact us directly or via the DDI website. Do you want to pick up, Jared? [Jared] Yes, so with that we’ll open it up to
questions or comments. [Dory] Okay so I am looking at the questions
panel. We’re waiting for them to roll in. In the meantime, I’ll tell you what we have
up next, this is our last presentation for today, but tomorrow at noon eastern time
we’re going to start with Biomedical Data: What is it? Who’s involved? What data are
available? Then we’ll move on to Many Disciplines, One Topic: CivicLEADS
and the Potential for Multidisciplinary Research Data for Archiving. Next you
won’t want to miss our session: Cultural Participation of US Adults Featuring NADAC Data Highlights. NADAC is our arts and culture data archive. Okay, it looks like we do have a question… “As you mentioned, DDI is
for Social Sciences, but is anyone using it for other research areas?” [Jon] Yes, so CLOSER is a consortium of both social science studies and primarily medically driven studies. So it has been used in… can you hear me? [Dory] We can hear you. [Jon] So it has been used in biomedical studies to a limited extent. Do you want to pick up on some of those points,
Barry? [Barry] Yeah, I concur with Jon. It is primarily a social science and survey metadata standard and, much like Jon and the CLOSER studies, MIDUS has some biomedical and some neuroscience variables and data as well. What I and a few other people, I think, are trying to do is to look
at ways in which DDI can extend itself or possibly map to other standards. It’s a really good solid social science and survey research base from which
to kind of grow the expression of metadata. So we are looking at ways to
better represent our non-survey data through DDI. [Dory] Okay, next question, “Which version of DDI is being used by ICPSR?” [Jared] So, for our current version, we’re using Codebook, so DDI 2.5, although
we’re looking at moving to Life Cycle. [Dory] Okay it looks like that… well just when I thought it was our
last one, we have quite a few questions. “Can you discuss how Colectica and
NESSTAR are used with DDI and if they are needed in order to apply DDI?” [Barry] I’ll take a
first stab at that. This is my own experience. I began using DDI with NESSTAR, this was I think 11 years ago, when DDI 2, DDI Codebook, was the only game in town. NESSTAR is DDI 2 compliant, but not DDI 3 compliant. I used NESSTAR a bit, but it didn’t have some of the reuse and linking capabilities that I
wanted for a longitudinal study. When DDI Life Cycle or DDI 3 came along, it
offered some of those functions and since NESSTAR didn’t offer
them, it wasn’t, you know, conversant with DDI Life Cycle, and that’s when I started using Colectica. The principals in that company have been involved with the development of DDI Life Cycle, so they
know it inside and out, and they’ve also been very responsive. I can’t speak for
Jon, but they’ve been responsive to my particular needs in expressing our metadata in DDI Life Cycle. [Dory] “As a follow-up to the first question,
what in your opinion makes the DDI standard particularly appropriate for
the social sciences compared with other metadata standards? Is it the
range of elements covered or is there something else, as well?” [Barry] I’ll take a stab at that one. I think it’s well suited for the social sciences because that was the intent in its development, and its development included not only developers and programmers but also scientists and researchers who actually use it, so I think it’s well-informed from a social science methodology standpoint. [Dory] “What kind of relationship might DDI have with efforts
to take datasets and express them as linked data?” [Jon] So there is some work being done to integrate DDI with linked data. There’s a standard called Disco, which is an RDF-based expression of a subset of DDI, and that is currently out for review. So yeah, I
mean, DDI is keeping up with the rapidly changing world of the web. [Dory] “Can you discuss
some of the other tools that leverage DDI?” [Barry] We have a list, I think it’s a comprehensive list, of DDI tools going back to, I think, even some of the earlier beta versions of DDI. It’s available and I think Jared just pulled it up. The DDI Alliance itself is, I think, tool agnostic. My experience has been with Colectica, and from what I know, they’re pretty much the only game in town when it comes to really
exploiting the Life Cycle standard. [Jon] Slight disagreement there, I mean there are other companies. There’s a company called MTNA which does some work with DDI, and that’s quite widely used. There are a lot of community-built tools, and sometimes they work for you and sometimes they don’t, like any community-built tools. But there are certainly developing community tool sets coming out, typically for Life Cycle, which is only really about three or four years old in maturity. So those tools are now really being built, and there is an interesting range of different tools to meet different needs coming out of the community at the moment. You’ll find those on that webpage, and more will appear over the next few months, I suspect. [Dory] Okay, it looks like this might be… I don’t think I
read this one yet. “It seems that DDI is created mainly by archives. How can smaller data creators create DDI as
part of their research process?” [Jon] So I think it depends on what your use cases are. There are some questionnaire capture tools, so if you’re specifying questionnaires, there are some tools around that that are usable by researchers. There are plenty of tools which allow you to extract metadata from your SAS files, or your Stata files, or your SPSS files and push it out into different flavors of DDI. One tool like that is called SledgeHammer. There’s a reasonable number of tools around to create DDI. The main issue really is that you need something like Colectica or one of the other tools to really hold it in a repository and manage it. And that is one of the bigger challenges for a standard, because it’s
just a very large amount of information you need to manage. [Barry] I would add to what Jon said. You don’t necessarily have to invest in a portal yourself, you can submit it to
a repository that is DDI compliant like ICPSR and, you know, an archive that speaks DDI can ingest it. [Dory] Okay, “Since many federally funded studies
require SAS datasets, are you considering a converter to make SAS data
set ingestable into DDI?” [Jon] There are already several tools that will allow you to do that. There’s Colectica, there’s SledgeHammer, and there’s the widely available Stat/Transfer, which I think will support SAS export to DDI, and there are
community tools, as well. [Dory] Okay, just want to make sure I didn’t miss any questions. Let’s see, okay, so it looks like we’ve answered all of our questions. That will
be it for me, unless presenters have something else to say… I’ll give you a moment. Okay,
well just want to say thank you for joining this session. We will put this
and all the others from our Data Fair up on YouTube and we hope that
you join us again tomorrow. Take care.
