The Mandate to go Public & the Pressure to Publish First, Big, & Fast

Okay, thank you. This is exciting. As Dory mentioned, I am
Amy Pienta, Acquisitions Director of ICPSR and I think this is actually my
first Data Fair presentation. I don’t remember doing this in the past, maybe I
have. I’ve certainly given webinars for ICPSR. I’m excited to share with you some
of my thoughts and some of ICPSR’s views on this question that Linda
asked me to talk about, which is the mandate to go public and the
pressure to Publish First, Big, and Fast. So it’s kind of an interesting moment, as
I was putting together my comments for this slide today to think about what it
has meant for this increased amount of commitment that the highest offices in
the US have made to making data more open and government more open and many
of you will be familiar with a lot of these policies. I’m going to just spend
really a few minutes talking about them. If you go to the
website today, you’ll get this blue flash of screen which is advertising that the
culmination of Obama’s initiatives on open data are happening in Washington DC
this week at the White House Open Data Innovation Summit. Actually ICPSR
is going to be there as one of the invited guests, so I’m excited about that.
I’ll talk a little bit more about that as we go forward too. So the Obama
Administration’s commitment to open government and to open data came very early
in his presidency. In the first full day in office, Obama issued
the Transparency and Open Government memorandum which stated that his
administration was committed to creating an unprecedented level of openness,
fostering a sense of transparency, public participation and collaboration amongst the government and the American people. What this has meant for the
scientific community in the social science data community has been a couple
of really exciting things. So of course has been around for a
few years, as I mentioned talking about the White House summit that’s
happening later this week and that has really opened up access to a lot of
government datasets. Some of that’s good, some of it’s bad, certainly a lot of
it is messy, but there’s certainly more data available today than there has been
in the past from government agencies. Another exciting thing that has happened…
I see a typo on the slide… coming from the OSTP memo. The Office of Science and Technology Policy issued in February of 2013 a memo that directed agencies to develop a plan for making sure that results from the research that
they funded will be accessible. It was clear in this memo that it wasn’t
just the publications that needed to be accessible that came from the grants
of these agencies, but also the data and so that was a really exciting moment for
those of us in the data space and thinking through how a community
resistant to sharing their data might be compelled to do so going forward. Of course this idea, though it
received increased attention in 2013, this idea of the government commitment
to sharing data has been around for much longer. One of the moments, in relatively recent
history, that we like to point to is the NIH data-sharing requirement that predated
that OSTP memo. In February of 2003 NIH put into effect a
requirement that large grants, grants that were seeking more than five hundred
thousand dollars in direct costs in a single year, so something that had a
really big single year, at least, would be expected to submit a plan for sharing
their data or at least state why data sharing would not be possible. Of
course this doesn’t apply to all NIH funding, that has happened since, but became sort of the
basis for changing the discussion in the NIH world about complying with this data
sharing requirement for at least some of the projects and that trickled down to even
some of the smaller projects, as well. Of course we can’t leave off a
discussion about this changing environmental context around how
scientists might view open data or the need to share their data without thinking
about the National Science Foundation. A couple of years before the OSTP
memo, the National Science Foundation released a new requirement at that time requiring that all of their research grants, applications for research
grants, would have to include a data management plan. The nice thing that NSF has thought
through, of course, is that plan is part of the peer review process so it had more teeth than some
of the policies that predated it. Not just at NIH, but in other places as well. It became clear that NSF scientists would have to think through, in
a very serious way, what their commitment would be to data
sharing and how to make their data more transparent. That sort of begs the question of
what about the scientific community: has it just gotten in step and joined with this
change in climate from the funding agencies or has there been
reluctance? We knew in the… sort of in the repository world that researchers
have not always wanted to share their data and I think some of
these things on this slide probably get back to my start at ICPSR 13 years ago
where people said, “Oh, exciting, Acquisitions Director… it’s gonna be hard
to get data for these reasons.” and these reasons go back then and they certainly
go back even further than that. So scientists would, I think, be happier to
share their data if some of these barriers weren’t there. Concerns about
confidentiality of their study participants. Concern that other people
might publish before they publish something they wanted to publish from
their data. Certainly a lot of concerns and this one I think is plaguing even
today, concerns about errors in the data and in the documentation. Not having the
resources to get the data ready so that they could be shared with the public. Of
course, repositories answer many of these challenges, but it doesn’t make the
challenges themselves go away. ICPSR itself has been committed to
studying the data sharing culture that has permeated the Social and
Behavioral Sciences. We conducted, in 2004, a study of NSF and NIH PIs asking
them about why they might not share data from their research project. What the
challenges were from their point of view. So this graph represents what we’ve come
to know as the things that are the most troubling and because this is a
discussion of how on board the scientific community has gotten with
sharing, this shows some of the persistent areas where scientists report
challenges and that we need to continue to think about not just ways to address
these challenges but communicating with the scientific community so that they
know that there are solutions to some of these challenges that can help data
enter the public domain or at least conditions where access controls might
be in place. So for things like concerns about confidentiality obviously there’s
lots of ways to distribute confidential data and making sure that scientists
know what those are and how the data will be protected as well as the
organization’s where they work, for example. So turning back to this question
of what I was assigned, the question of how imperative is it for
scientists to… how are they going to get their publications completed from a
research grant if they have to share their data? If you look at that group
from our survey, we actually saw statistically significant differences by a
couple of variables and probably a few more, but the two that I’m going to talk about
are rank and gender. Probably not surprisingly, looking across rank there
is more concern among Assistant and Associate Professors about making their data available for sharing because they’re afraid or concerned that somebody might publish the results from their study before they
will. Full Professors, on the other hand, are more confident that they… are
probably more confident and secure in where they are in their careers and are more
willing to share their data. There are differences by gender as well: only 17 percent
of male scientists report that as a problem, compared with 23 percent of female
scientists. This is sort of where we’re at, right? So even just five years
ago because a PI study from a
little while ago, my impressions when I got to ICPSR a decade ago about why people
wouldn’t share their data. But does the publish or perish imperative… has it
changed in any way that the data sharing imperative is becoming as
important? I think probably has but first let me talk about what… these
are more… so this isn’t ICPSR studying itself, but rather a collection
of qualitative assessments of what we’re seeing in acquisitions at ICPSR. So some
of the things that we see repeatedly over time are requests for these kinds of
things. So if we talk to a team about depositing their data with ICPSR or
perhaps another repository, because we are not even just guided by
getting the data to ICPSR, but rather getting the data in a way that it’s more
permanent and accessible by a broader audience. These are the things that people tend to
talk about. They ask, well, is it possible to have an embargo period? Could I
have a 10-year embargo period on the data because I’m still publishing? I’d love
to give you my data, but can I… but I need to be co-author on all the
publications that come from the data. I’d be happy to put the data out there for
sharing, but I need to review all the
publications before they go out to make sure that it’s consistent with the
methodology that I used and the messages that my project is trying to share. The
PI sometimes wants approval of the users of the data, who’s going to get access
to their data. These are the kinds of things that we
are asked all the time. Another really common one, even still today, is offers of really
outdated data. You can have my 1985 data because I’m really really really
publishing from the 1990 iteration of that dataset. It’s 16 years, you know, down
the road. So sometimes we get a very very early part of the collection, but
not the later part of the collection that users might be especially
interested in getting access to. So these are the things that say, you know,
the publish or perish imperative is alive and well and while people might be
interested in data sharing a bit, they’re not willing to give up the glory that
might come from the publications. But at the same time there’s some causes… I
would say there’s some cause for thinking that things are changing, as
well. And these are some of the things that we see that are emerging. So
ICPSR Acquisitions has calls with potential depositors, so scientists, PIs, their managers, about depositing data every single week. That’s not been the
case in the past. So there’s plenty of people to talk to. So if we ask, make a
simple request for a dataset to be deposited, it almost always leads to
a protracted exchange of information usually leading to a call. A lot of the PIs who are just formulating their plans, even for new research, wanted to
talk about what a data management plan would look like so that they might
deposit one day. There’s a lot of requests that we are fielding for
archiving new data formats. And so not just quantitative data or survey data, but a lot of contacts from researchers looking to archive data, qualitative data
or mixed method data and they’re interested in archiving for the first
time, that’s an emerging trend. Another is what I started with, which is this increase in one-on-one support. So the Acquisitions Team fielding, and this probably happens across the topical archives at ICPSR, as well, certainly the
ones that I know familiarly because I direct them. Questions about how do I
archive the data? Our data are complex, they don’t quite fit with, you know, simply
deposit, upload, and curation, and release. It’s not just sort of a simple path for
that dataset, it’s complex in some ways. Many of them have disclosure issues with
restricted access conditions. Questions about, “Do I need to merge the datasets or
leave them unmerged?” “Does ICPSR want raw variables or summary variables?”
All those kinds of discussions are things that we talk about in an
ongoing basis with research teams who are looking to archive their data with
ICPSR. I’d say this is new. That happened occasionally five years ago, but
in the last five years there’s been an increase in those kinds of calls. One of the things that has been a
fortunate development in ICPSR’s history, when I think about this, has been that
ICPSR… openICPSR, the model of self-published data at ICPSR gives
researchers, at least some researchers another outlet for getting data to ICPSR.
So what I talked about from the acquisitions point of view was mainly,
you know, donations to either a topical archive or data donations made to the
General Archive, where you’re considering taking the data in and curating them. openICPSR though, of course, is a less intensive way for data to come to ICPSR.
And so by allowing for openICPSR submissions, I would say the ICPSR staff
have, acquisition staff in particular, have more bandwidth for those lengthy
discussions and negotiations with the PIs that we do talk to. So for example
here’s a little acronym for PRA, that was ICPSR’s Publications Related Archive
and that was the most analogous thing we had to self-publish data before openICPSR and that was tied to a publication. Scientists could deposit a dataset and
we wouldn’t curate the data, it would be a very low touch endeavor on the ICPSR
staff’s part and we would release the study in the condition that it came in.
The volume of that was pretty low. It’s really just, you know, one or two a
month that would come in through the PRA. So now openICPSR is easily five times
the volume of that PRA system and growing. And so a lot of deposits coming
into ICPSR are coming in through openICPSR, at least a non-trivial amount. In answer
to some of the things that the PIs, you know, have asked historically it allows
the PI to do some things like set the access conditions based on the
sensitivity of the data and the needs of the data. It allows the PI to determine an embargo period and how long the embargo period is to some limits and also it allows the PI to make decisions about how they
want the data and the documentation arranged. So in some ways that has
given more bandwidth. I think really interesting things are
arriving to ICPSR. A lot of them, I just pulled a recent example from this
week to share with you today on an article that was published in PLOS ONE,
and the underlying data were made available through openICPSR. ICPSR
itself, as a whole, is a recommended repository option on PLOS ONE. So we
certainly see some things coming in from PLOS ONE to the openICPSR archive based on the
deposit questions, journal related publications like that typically go into
openICPSR, unless for some reason the researcher might be interested in
having the data curated. So, again, a lot of change happening in the space
of how scientists are sharing data in the social sciences and many of you in
the audience will be familiar with those. Open data itself then has sort of
served its purpose that I started with at the beginning of increasing transparency, increasing the speed with which publications and data can be linked to
one another. It’s increased capacity for replication and it certainly raised the
academic and the public dialogue about the role of open data in ensuring research
integrity. This brings me to one of the stories that we wanted to tell
and share through this afternoon’s webinar, which is the story of the LaCour data. Most of the public and those who work in academia will be familiar
with the Michael LaCour data that have been called into question as being
fabricated because it hit the popular press in a really big way. Essentially what this is, was a
2014 publication based on a study that Michael LaCour had done looking at whether
you could change somebody’s opinion or attitude or political persuasion. Of
course, hot topic today, right? Many of us probably watched the
presidential debate last night. Can you really change or influence your peers
on Facebook who don’t agree with you just by making a really creative, awesome post or putting forward a meme that supports your point of view. Are you going
to change anybody’s opinion? Nobody thinks that they are. So
what was really fantastic about this LaCour data was that by sending
political canvassers to the doors of… gay political canvassers to the doors of
people who maybe haven’t thought much about their views on gay rights, the canvassers, in sharing their story and who they were as individuals,
were able to persuade the potential voter that gay equality kinds of
issues were something to consider and something that they might
buy into. That had a relatively durable effect for a little bit of time, I think. So
that was the study and that was, you know, obviously a really big finding. It was picked up in the popular press. It was all over, it was in
the New York Times, This American Life on NPR did a… what am I trying to call it?… an oral post. Did a story on the impact of this study and
lots of other outlets, as well. Because this was such a big study, of
course, there were many who wanted to emulate its methods and use it for similar questions. Two Berkeley graduate students had that in mind. David Broockman and Joshua Kalla, two UC Berkeley graduate students were studying…not sturdying, but studying political persuasion. So they wanted to use the same methodology for their work and they did a lot of things the same way. They used
the same political organization: the LA LGBT Center. And they were trying very
hard to emulate this work and so one of the things that happened at some
point along the way, the results weren’t encouraging and so they contacted even
the same vendor that LaCour had said that he used to collect the data, called uSamp. And
that’s where, I think, things began to unravel and it became clear that
the data that were purported to have been collected maybe had not been collected.
So the data that existed underlying the publication that I showed at
the beginning were then claimed to be falsified. These two graduate students
put out a report in May of 2015 claiming this discovery of fabricated data, and
since then there’s been no evidence saying that didn’t happen, so it does look
like they were fabricated data. One of the things, a couple of the things,
that I sort of see as interesting and potentially ironic is that they’ve been
touted as having a lot… they actually, in the end, in working to replicate their
findings and continuing on the path of working in this area, they were
eventually able to show a finding with real data that the LaCour study had
claimed in the first place. Not only that, but they were able to do it in
ways that could really change the methodology of how political persuasion
research might be done and lower the cost of that really, potentially expensive kind of research and work. So they’ve had a really large impact not
just with their work and their findings, but also on the methodologies that are
being used. So the Nate
Silver blog talks about how the interesting thing here is that
they’ve been very transparent in their research. They published their data and their code. Others have pointed to the same
thing. One of the maybe lesser-known facts is that ICPSR actually has the LaCour data, the fabricated LaCour data, the allegedly fabricated LaCour data. It was deposited to openICPSR in 2014. And it had some sort of, you know,
history that followed, but it remains accessible and available from openICPSR’s
website today and you can certainly go to take a look at that if you’re
interested. One of the questions that’s raised for ICPSR, though, has been “Is it
ever a repository’s right to take down data because it’s perhaps fabricated?” “In
20 years will somebody use these data in a way
that they think the data are real when they’re not?” You know, how do we provide sort of an
even broader context around the data about the scientific debate about its
truth in this context? You know, at this point, I think we’ve, you know, settled
on a pretty simple way that says these data… the data in this collection
point to an article itself that’s been retracted, so we’re able to say it in
that way. Anyway that’s the LaCour story
that we wanted to share today as, you know, we talked about thinking through
the imperatives on scientists to be transparent and why they want to be
transparent when they have a lot on their plate otherwise. ICPSR certainly is committed to the
goals of science and to the goals of scientists. Removing barriers to data
sharing, as I’ve been talking about today, I think is an essential step to making
sure that there is transparency in not just the data, but in the methods as well.
Other ICPSR presentations in the Data Fair will talk about the role of giving
credit to the data collector through the data citation. If we want to
encourage data sharing and transparency for all of the right reasons, making sure that there’s a system where
researchers get credit for that work is essential. Another way that
repositories can be committed to these goals is to help new communities build a
culture of sharing. I think we’re fortunate in the social sciences because
we’ve been thinking about, in most of the disciplines, sharing data for decades and
ICPSR continues to work with new communities who have not sort of
answered this challenge for themselves. So these things like, you know, finding
ways to give credit across these communities and removing barriers across these communities are ways to ensure that we’re supporting the
goals of science and the goals of the scientists. Then as I mentioned in
the last slide, you know thinking through ways to
document even the broader context around the data with this LaCour example,
thinking through how we can represent some of these other tangential pieces of
information about data that are essential to users not just today, but in
the future. So I’m thinking that probably maybe questions have been coming in, if not
make sure that you send Dory any questions that you might have. I wanted
to share with you sort of the opening slide for ICPSR’s presentation at the
Open Data Innovation Summit in Washington tomorrow. We’re actually talking about
Open Data Flint, which is a project being conducted in conjunction with the
Healthy Flint Research Coordinating Center. It is an effort to make sure that all of
the research that’s ongoing in Flint, Michigan related to the water crisis of
the past year, and the elevated lead levels in the water, and the failure of
government to keep people safe. All of those things have spurred a lot of
research in Flint right now. And the Healthy Flint Research Coordinating Center which
is comprised of U of M-Flint, the University of Michigan, and Michigan
State University are all committed to building open and ethical science
in the Flint community through this initiative. One of the things they’ve
asked ICPSR to do has been to create an Open Data Flint repository or portal. We’re a fledgling website at this point, but interesting enough that we
were invited to talk about this at the White House Open Data Innovation Summit tomorrow. So that’s what we will be talking about. So
a happy ending to my presentation today is sharing sort of the next
generation of things that ICPSR is doing to ensure transparency and broader
access to data, and a great example is Open Data Flint, where we’re working with
different universities as well as the community itself, to identify data and
develop capacity to use the data within the community. So with that I’ll see if
Dory has any questions, I think she’s saying she might. [Dory] Yes, so I have a question that says, “Is there a
citation for the ICPSR Perceived Barrier study? Where can we find it?” [Amy] Yes so there is. There’s a publication that is in
Deep Blue at the University of Michigan… and I don’t know do we make notes
attached to presentations in the webinar? [Dory] We can, we do. [Amy] Perhaps we can release that in this way. [Dory] Okay, so far that’s the only question. We’ll give
folks a few moments to see if we have any more come in. In the meantime, I want to plug
our next two sessions that we have going on later today. We have a Thoroughly
Gentle Introduction to Methods Metadata and that is at 2 p.m. Eastern time. And then we have one called Data Karma: How to Deposit Data
that Stand the Test of Time. That one is at 3 p.m. It looks like that’s going to be it. So again, we just want to
thank you for coming to the session and we will post this on YouTube. It will also
have access to the presentation slides.

About the Author: Oren Garnes
