The Quest for Data: Acquisitions Activities and Archiving Options at ICPSR

Good afternoon and welcome back to Data
Fair 2016. My name is Piper Simmons and I am here today with my colleagues Jessi
Carrothers and Justin Noble. Today we’re going to talk about the Quest for
Data: Acquisitions Activities and Archiving Options at ICPSR. The
Acquisitions Team works closely with data producers to acquire, preserve, and
distribute social science data collections that are of interest and
benefit to the ICPSR membership and the user community at large. Along with my
colleagues, Justin and Jessi, our team is led by Dr. Amy Pienta and we are the
Acquisitions Team. We’re going to start today by talking a little bit about
selection criteria, the things that we look for in a dataset that we’re
seeking to acquire. We’ll also talk a little bit about acquisition strategies
and then end with a review of our latest archiving options. So what do we archive? Well a lot of that is determined by the
scope. This is a social science data archive, so we’re interested in data that has a
social or behavioral science focus. We also look at the ICPSR collection
development policy. We also periodically establish several areas that we call
high-priority focal areas, or high-priority topics, which we home in on to
seek and acquire data. We also look at methodology and population groups as well.
We typically seek social and behavioral science data in the major social
sciences like sociology, political science, psychology, economics, and
anthropology. But we also look at other disciplines as well, things like
archaeology, rural studies and urban studies, demography, communication and
media studies, environmental studies, arts and humanities, and gerontology. Now this is not an exhaustive list, but these are some of the disciplines in which
people are sometimes surprised to find that we carry data. As
always you can contact us if you have a dataset that you’re interested in
acquiring and we will work with you to see if it fits within our scope. Our
current high-priority focal areas include sexual orientation. We’ve
noticed a growing interest in issues surrounding this topic so we’re seeking
out relevant data to include in our collection in the area of sexual
orientation. Bullying, its consequences and effects, as well as types of
intervention, is another area of note. We’re interested in research surrounding
cyberbullying, as well as bullying in schools and workplaces. Social media has not only changed social and communication landscapes in this country
and elsewhere, but it has also played a role in research methodology, and we’re
seeking data collections that focus on issues surrounding social media, internet
usage, search behavior, and the like. Some of our other focal areas include
immigration. Immigration and immigration policy continue to be
leading topics of national debate, and as such we continue to seek data about
immigrant populations, their issues and experiences, and their impact on society.
Many disciplines recognize the value and the strength of
longitudinal data. And of all of these topics, I think longitudinal data will
probably always be in our list of high priorities. International data, among
other things, gives us the opportunity to provide a platform for cross national
and comparative research. In addition, it broadens our horizons in being able
to collect data from other countries and, maybe as a secondary benefit,
helps in cases where countries do not have policies for national
archiving of their datasets. Because we feel that data sharing in the area of
individual well-being is limited, we are focused on acquiring data surrounding
topics of social and psychological well-being, happiness, depression,
demoralization, and items such as these. There are a variety of methodologies that
we accept, a lot of them new and on the horizon, such as
online survey techniques, experiments, field research and observational
techniques, clinical trials and interventions, data mining and data
visualization, audio and video, geospatial, and biomarker data. Now, again, this is by no
means an exhaustive list, but I do want to note that, with the advent of our new
Archonnex platform, a lot of data in these areas are data that we may not
have been able to accept, or accept well, if you wish, a few years ago. So the Archonnex
platform has opened up the types of data that we’re able to accept, and it gives us an avenue with which to handle them better. Our populations of focus outside of the
United States are international, cross-national, and comparative. We’re
interested in state, local, and regional data, criminal justice populations,
children and adolescents, adults and the elderly, racial and ethnic groups,
families, couples and households. And there are things we do not accept. Of
course, our scope being one of the most important criteria, we do not
accept data that does not have a social or behavioral research focus. We also tend
to not have datasets that are purchased. We typically do not purchase
data and we also do not distribute data that is linked to a third party in terms
of our being able to distribute it freely. So we don’t take data for which we have to
adhere to third-party standards in order to disseminate it to our population.
Data that are available elsewhere are generally not archived here. We tend to
desire to be the archive of note. However, many datasets are linked by our ICPSR
collection catalog, our catalog entries, where we can link users directly to the
source of data. We also do not typically archive direct identifying data. There
are times, of course, when certain identifiers are necessary for successful
analysis of the data. In those cases, we can provide the data in a restricted
format, but we actually do prefer our data to be public access, if at all
possible. Now we’ll talk a little bit about
acquisition strategies. We’ve been toying with different methods by which
we contact PIs and others for the purpose of acquiring data. Some
opportunity strategies that we use are our contacts with
principal investigators, end users, and staff. And often they have recommended
data that we ultimately archived. Even in our general correspondence with
principal investigators and others, we often ask them to let us know if they
have other data that are unarchived that they’d be willing to share or if they
could suggest someone or a particular dataset that would be a good fit for
ICPSR. We always use conferences, seminars, and lectures, and personal
contacts as an opportunity to get information and leads for data. The ICPSR website has a Suggest Data to Archive page and we do
get a lot of suggestions from this form, where users and visitors to our
website can also suggest data that we can research for the possibility of
archiving. Several outreach strategies that we have used in the past included
reaching out to grant recipients. We feel that grant recipients are a good
audience because, well, they just received the grant, and we want to establish a rapport
with these grantees at the earliest stage, possibly before they have
started to collect data. We feel that this gives us an opportunity to share
ICPSR and the advantages of archiving, as well as to assist them through the archiving process. We also search academic
journals, and we also have strategic campaigns where we reach out for specific
kinds of data. One example is a group that is called
Women Who Also Know Stuff. This group was discovered on Twitter and
they are a group of political science focused researchers. However, because of
the collaborative nature of research, their research encompasses quite a few areas,
several of which are among our focal areas. This was round one, and contacting this group is a continuing
process. Of our first 120 contacts, we have so far received 2 deposits and 15 commitments to deposit, and another 7 people are
checking whether they are able to archive the data. We have 12 that
produced no data, but again these are contacts. Some of them are working on
projects that they would be interested in archiving in the future. And then we have
a couple that have declined to deposit, mostly because of agreements and
limitations from their sources. To talk about some other
campaigns, I’d like to refer you to my colleague Jessi Carrothers. [Jessi] So there are several
strategies that we’ve used to identify good researchers to reach out to and inquire about their data. Certainly, we could always use a simple Google Scholar search which I’ll
talk about in a little bit. As Piper mentioned, we can have some specific
targeted campaigns to find research that’s been valuable to the scientific
community or that is widely used. As Piper mentioned, we can always use word
of mouth, and I’ll be expanding on that topic in a few
slides. When we reach out to researchers to inquire about their data, we can
categorize the outcomes in several ways. Certainly some researchers never get
back to us for whatever reason. Sometimes researchers may not have data ready to
deposit, but they have questions about the process such as the use of
restricted-use datasets or data management plans or other topics.
Sometimes researchers will say they’re not ready to deposit their data
now, but they’d like to formally commit to deposit at a certain time in the future.
With all of these interactions we are, of course, always working toward the
outcome of having data actually deposited for our network. One specific
campaign that we’ve used to find valuable data is to look for lists of
highly cited researchers. There are many ways to do this. Google Scholar has a
function with which you can look for highly cited studies, but a list that I found a
little bit more useful has been collected by Thomson Reuters. It’s
updated periodically and it encompasses pretty much any realm of
science. So we did restrict this list to only social and behavioral scientists. These researchers are cited a lot: an average of over 300 citations, and most of
them have many more than that, in the thousands, for each study that they
produce. I went through this list and segmented it in several different ways.
We’re only looking for original data and not secondary analysis so that culls the herd by quite a bit. We’re looking for high-quality
methodology, nationally representative data, large-scale studies, and we’re
looking for data that is touching on one of our high priority areas, such as
bullying. Using these criteria, I narrowed the list to about 15
researchers to inquire about their data on this initial pass. We actually had a pretty good response
rate considering that these researchers are incredibly busy and tell us they have
incredibly full mailboxes. So we had five researchers get back to
us and one researcher even committed to deposit his study. This researcher is
Roman Jaeschke, who was traveling back from Poland at the time that I sent out my
original email. He was so interested in depositing data that he sent me some
questions at 3:30 am his local time. So that showed a significant interest in our
work and the process that we have. Another way that I found research that’s
highly valuable and useful to other scientists is through a database
called the Social Science Research Network. This is a multidisciplinary, free
database, where you can search for pretty much any social science topic. So again, I
narrowed the herd looking only for original data, high-quality methodology,
et cetera and I came up with about 30 researchers to contact on the initial
pass. Again, we had a pretty good response rate considering that these
researchers are always quite busy and have a number of projects going on at
the same time. We have four commitments to deposit data out of this pass which is a
really good outcome. As Piper mentioned, sometimes we look for specific journals
that touch on one of our high priority areas. One of these journals is
New Media and Society, which is a relatively new
journal. It’s been around for about 10 years now and looking through the
various issues I identified five researchers to contact. So far two of
those researchers have committed to deposit data. So this is a really good
strategy to find good original data studies. I put up a section of one of
these abstracts so that you can see that we are looking for high-quality data,
high-quality methodologies. This particular study is a multi-wave panel
data study that’s nationally representative which is
relatively uncommon to find for a study about social media. Certainly we can
always use Google Scholar and that can be a useful strategy and sometimes
that can be a not so useful strategy. I found that when looking for data about
youth or preteens, it was pretty hard to find original data using Google Scholar. This is because a lot of the research in
this area is secondary analysis of studies like the Growing Up Today Study
and the Youth Risk Behavior Surveillance System. A topic for which it was
easier to find original data studies is that of social media when
coupled with other criteria such as international studies or cyberbullying
studies. Obesity and diabetes is not an official high priority area for ICPSR,
but it’s an area in which we’ve seen a lot of activity in the context of
people searching for data through ICPSR’s website. So I was looking for studies
about obesity and diabetes and I found two large-scale longitudinal studies
about obesity and migration, which had been an area that was hard to find research
about in the past. So Google Scholar was really helpful in that respect. One of
the obesity researchers that I contacted had a large-scale study that involved
long-haul truckers and looked specifically at their obesity outcomes.
This researcher got back to me and said he couldn’t necessarily share that data
because it was proprietary and it belongs to the trucking companies, but he
said let me think about this because I know that archiving data is really
important, so let me see if anyone else would be interested. I thought that he might get back to me
at some time in the future or maybe not at all, but within two hours he contacted
me again and said, “okay so I asked around and here are some people who might be
interested” and he copied 17 researchers on the email. This was surprising, but
it was even more surprising that within a couple weeks four of those individuals
reached out to me and they committed to deposit their data. So that word-of-mouth
strategy was incredibly helpful and productive. Another example where
word-of-mouth was productive is in the area of sexual orientation and
gender research. One of our contacts in ICPSR’s network passed on a list of
15 or 16 new leads in the area of sexual orientation. These researchers were really interested
in learning more about the process, so we had several commitments to deposit and
we actually had one researcher who said “Okay I’m really interested in the
process, here are my questions.” We went back and
forth for a little while and then she said, “You know I would like to deposit
this data and it’s not a good time for me to do it now, I don’t really
have the energy or the time to do it, but let me get back to you at some time in
the future.” So I wasn’t really sure when that would be, but I answered her
questions and I decided to wait for a while. Within two weeks, she had actually
deposited not one study but two different studies in the area of gender
research. So that was a really productive way of using word of mouth to find some
good studies to deposit in the archive. Now it was surprising that we
had two studies from this one researcher which were deposited really
quickly, but Rebekah Herrick, the researcher in question from Oklahoma
State University, was actually so interested in the process and in the
value of archiving data that she said, “So here’s this other researcher I know of,
my colleague who has this study and he is interested in depositing
it and is looking for an archive.” Kyle Knight and Andrea Flores, who are Rebekah Herrick’s colleagues, have this really amazing study which touches on a number of our high priority areas, and they’ve committed to deposit the
study. It’s a longitudinal large-scale study
in Nepal, which is actually cross-national, so it’s international data,
which we’re really interested in getting. It’s high-quality methodology, and it
covers the areas of bullying, cyberbullying, social media, and new media,
youth studies, obesity and diabetes, and gender research. So it’s an incredibly
valuable study and we only found out about it by word of mouth. So that was a
really profitable strategy to use. So now my partner, Justin Noble, will be
talking about our collection development policy and the ways that we archive data. [Justin] Good afternoon. First I wanted to talk about a social
media campaign that we did to promote our collection development policy. So
from September 2015 to February 2016, we promoted our revised collection
development policy which was updated in the summer of 2015 and also
promoted these high priority data areas on our various social media channels.
What we did was share various statistics about the number of downloads for
studies in these areas and the number of people that are searching for these
topics on our website, and by sharing these interesting facts and statistics
we coupled that with the idea of encouraging data sharing. This was a
great collaborative effort between the Acquisitions Team, as well as the ICPSR
Editor, our Multimedia Designer, and the Membership Director. Here are
examples of some of the graphics and tidbits that we promoted. You can
see we promoted the collection development policy, as well as longitudinal data,
the topic of immigration, and happiness. “Are you happy?” being one of
the top subject searches. Also searches about social media, LGBT
issues, international data including China, and then also bullying data. So all
of these were great graphics that we compiled, coupled with download
statistics and other information that we gathered from Google Analytics
that demonstrate the high user demand in these areas. As a
result of this social media campaign, we had 8 different posts that went out
over the course of that time period. On average, each of these posts that went
up on Facebook reached over 1000 people. They averaged 15 likes, comments, or
shares for each post and then each post was also clicked on an average of 37
times. We had similar numbers with Twitter, where
each of these individual Tweets was reaching over 1000 people on average, and
on average there were 14 engagements, an engagement being some type of interaction with
the particular Tweet, whether that’s a share or a like or some other
interaction, clicking a link, for example. Next we’re going to switch gears here
and talk about the different archiving options at ICPSR. So there are really three
main archiving options for an individual researcher who is looking to
archive a particular dataset or even multiple datasets at ICPSR. The first
one is an openICPSR self-deposit. Another option is to donate the data to
the ICPSR membership and a third option is a paid professional curation package. So here I just want to introduce those three options and I’m going to
talk about each of them in more detail and do a comparison of these three
options. The easiest consideration that most people think of
in order to decide what the best archiving option is for their
project really comes down to two things: the
desired level of access and the desired level of curation, or the extent
of processing that ICPSR data processing staff and archival specialists will put
into the data. So with respect to the level of access with our different
archiving options we can either make the data freely available to the general
public or it can be freely available to the ICPSR membership. In terms of
curation, we have an archiving option in which the data are released “as is” and
then we also have two archiving options in which the data can be fully processed by
the ICPSR dedicated data processing team and data processing staff. Thinking
about the combination of these two factors can really get you going in
the right direction. These two factors also impact the time
to release for a particular data collection as well as whether there’s
any potential cost to the data depositors. First I want to talk a
little bit more about the level of curation and what I mean by processing
or curating a study. What I mean by data curation is the process of
reviewing and enhancing an entire data collection. And this is really what ICPSR
is known for. It’s hard to do it in
just one slide, but I want to convey that this processing and enhancement
is done on both the actual raw data and the accompanying study
documentation. It also affects the ICPSR study description, or metadata,
which helps facilitate discovery of a collection. So with respect to working
with the data, data processing and curation involves doing a disclosure review and
also a usability review for things like variable labels, value labels, and missing
data standardization and declaration, and also looking for undocumented and
out-of-range codes. It also involves doing a stat package
conversion to a variety of software packages, as well as converting the data
to a variety of formats to ensure long-term preservation. Also
depending on the study, we also set it up in our online analysis system. With respect
to curating and enhancing the documentation, this involves a detailed
review comparing the data against the documentation, creating the ICPSR
codebook and other PDF documentation, as well as organizing the materials into
a complete package that allows a secondary user to independently
understand and use these data collections. Lastly, with respect to the study
description, this involves creating both study level metadata as well as variable
level metadata in the DDI standard which enables users to easily find studies on
the ICPSR website as well as gather information about a particular
collection before even downloading the data to see if it might meet their needs.
So now going back to the different archiving options, with respect to the
openICPSR self-deposit option, this is where the researcher simply goes
to the openICPSR website, uploads their data, and describes their data
themselves. The data are released “as is”, but they are released to the general
public. They are freely available in open access and they’re not just restricted
to ICPSR members. Also the time to release for these studies is immediate
and currently an openICPSR deposit, a self-deposit, is free for a collection up
to 2 gigabytes in size. So the use case for openICPSR is really a
researcher who is looking to make their data freely available to the general
public while still having them preserved at ICPSR and listed in ICPSR’s catalog of holdings. The main caveat for this option
is that ICPSR staff do not curate the files. So we’re not doing that detailed
review and enhancement of the files to ensure their usability. So it’s
entirely the depositor’s responsibility to make sure that the
data are usable and that the documentation and metadata sufficiently
describe the project. Another option is to donate the data to
the ICPSR membership. In this case, if the data are accepted and meet the
specifications in the ICPSR collection development policy, they will receive
a full professional data curation and enhancement. Then the resulting enhanced data and documentation will be made freely available to ICPSR members.
Non-members, so those who are not affiliated with an ICPSR member
institution, would have to pay a $500 fee to access the
data. The cost for this, donating to the ICPSR membership, is free. The time to release really depends on the number of studies that have been
submitted to ICPSR for member-funded curation. A particular study will be placed in a processing queue, and generally it
takes at least a month or two for a study to be reviewed and
curated, but it could take longer; it just depends on the number of studies that
are in the queue at that time. Donating to ICPSR is really a great
option for ICPSR members because ICPSR data processing specialists organize,
describe, clean, and enhance the data at no cost to the data depositor and we’re
doing this as a benefit to the ICPSR membership. The main limitation with
this option is that some agencies that fund the collection of research data
require the data to be open access in that they’re free to all users and not
just ICPSR members. So depending on the funding agencies that are associated with a particular project
this option may or may not meet the data sharing requirements of a particular
agency. The last option is the professional curation package.
This really entails a lot of the same things that I just described in the
ICPSR member funded curation, except that if a researcher is able to write ICPSR
into a grant and budget for these professional curation services,
the resulting enhanced and processed data will be released to the
general public and not just ICPSR members. So you still get that full
professional curation, but then the level of access, instead of being the membership, is the
general public. As for the time to release in these instances, the study
will be prioritized in the processing queue. The cost of the service really depends on the number of
variables and the complexity of the data. So the Acquisitions Team has a formula
that we use where we will consult with a potential data depositor about the
characteristics of the data that they have to archive. And you can email us
at [email protected] for more information or if you have a particular
study and would like to receive a quote for this service. In general, for smaller straightforward studies that are ranging between 100 and
a couple of hundred variables, the cost for this service has been averaging
in the $3000 to $8000 range, but it can go higher depending on
if there’s a greater number of variables or greater complexity and restricted-use
issues associated with the data and disclosure risk. So the professional
curation package is really designed for those researchers who are able
to budget for ICPSR services in a grant proposal. The summary
of this is that ICPSR does the full curation and makes this enhanced data
and documentation freely available to the general public and not just the ICPSR membership. For any of these archiving options, we just wanted to let
everyone know that we are happy to write letters of support to grant applicants
saying that if the data are in scope ICPSR will be happy to archive them.
These are letters of support or letters of commitment. It’s
not a requirement but oftentimes this really helps clarify the roles of the
research project versus the archive and we also have a lot of resources on our
website that describe creating effective data management plans. Lastly, one other question that I
wanted to address is the question of “What should my deposit include?” and really
this is for all of the archiving options that I just talked about.
Ideally deposits should include all the data and documentation necessary to
independently read and interpret the data collection. So we’re talking about the
actual data files, any accompanying data documentation files, such as data
collection instruments and codebooks, and then also a description that describes study methodology in detail and also the sampling and
other information such as weighting and other details that a secondary user
would need to know in order to understand and independently interpret
that data collection. The litmus test that we encourage researchers to
think about when depositing data is to ask yourself, “Is the data collection
complete, accurate, and well-documented?” Thinking of it in terms of the
three parts, the data, the documentation, and the description or the
metadata that accompanies the study, can help you answer that question. So that wraps up our presentation. We are
happy to field some questions now. If you have questions about a particular study
that you’re considering archiving and would like to know whether it’s a good
candidate for archiving at ICPSR, you’re welcome to reach out to us at [email protected]. I also provided information here for our Facebook
account, and then also Twitter. You can
reach both the ICPSR Twitter account and my account there, where I also
promote a lot of data sharing. [Dory] So we do have several questions. (If you can
pass me that, it has a long cord. Then I’ll hand it back to you.) Okay, first question I see is, “I thought you could accept limited access or restricted data. Were you only referring to
commercially restricted or proprietary data?” [Piper] Mostly I was referring to
commercially distributed or proprietary data, data which we could not redistribute
unless we got permission from someone else. Examples like that. [Justin] ICPSR definitely has
the capability of accepting data with confidentiality issues. So indirect
identifiers and restricted data, in that sense of confidentiality, are
something that we have expertise in, and we are able to distribute restricted
data collections, but, you know, proprietary data we do
not generally accept. [Dory] “Where do you see libraries helping with
data acquisition? It seems to me that your strategies for finding researchers and data are very similar to the way libraries are trying to make contacts with their
local researchers. Can you see a collaboration with librarians?” [Justin] Yes, most definitely. We on the ICPSR Acquisitions Team coordinate with a lot of librarians and
official representatives. And we encourage the librarians to be as active
as possible in encouraging researchers to deposit with ICPSR. That can range
from being very involved, even depositing data with ICPSR on behalf of
the investigator, to fielding some preliminary questions, to
something like just directing researchers so that they have a direct contact
with one of us on the Acquisitions Team. I think that that’s definitely
something where we try to establish those relationships and we do
coordinate a lot with official representatives and librarians who have
questions and email us about our different deposit options. I think that’s
something that going forward into the future, we want to continue to engage
with that audience. [Piper] I also want to add that if something comes across your
radar, maybe something you’re not directly
working with, there is our Suggest Data to Archive form. You can always let us know what you know, the PI, the name of the study, and we will follow up. [Dory] The next two questions seem to be related. The first one is “three to eight thousand per annum or one time?” and the next one is “was the estimate for archiving a data set an annual fee or one-time charge?” [Justin] Thank you for those questions, and yes, those are for a one-time fee. So
then we will archive the data in perpetuity. The number one
factor that drives cost, because we do a variable level review of the data and
curate the data at the variable level, is going to be the number of variables
across all of the data sets. And so that three
to eight thousand dollar figure is generally the range in which we have
been giving out quotes to individuals and we’re continually refining our
cost estimate procedures, but that number is for relatively straightforward
datasets that are in the range of 100 variables to possibly 500
variables. But it really depends on the number of variables as well as whether
they are restricted or not, the format of the data collection, and also
the number of files as well. So whether they’re in one file or they’re in a
bunch of different files and that kind of speaks to the complexity issue
of the data collection. [Dory] It looks like that’s the last question, so
thanks again for attending this session. I want to make a plug for the next session that
starts at 4 pm Eastern time, Managing Your Team’s Data, Attach Metadata, and Publish to ICPSR Using
SEAD. Thanks again and have a good day!

