Archonnex at ICPSR – Data Science Management for All

[Tom Murphy] Thank you, Dory, and welcome to Archonnex at ICPSR: Data Science Management for All. Today we’re going to discuss the new architecture that we have put in here at ICPSR. Many of you know the mission for ICPSR is about expanding social and behavioral research. We are considered one of the global leaders in that field since we’ve been involved in it for over 50 years now. We’re an international consortium of over 760 academic and research organizations. We provide over 500,000 files of research in our archives, with 16 specialized collections of data currently, ranging from education and aging to substance abuse and terrorism, and many other disciplines as well. One of the key things you’ll find as we go through this is a pattern of what’s happening to data science in general, which is the accumulation of many different types of data disciplines: not just behavioral sciences, social sciences, and political science, but the adoption of some of the other, harder sciences and the mixing of those together. We’re starting to get different types of outcomes, which is one of the premises behind Archonnex today at ICPSR. Archonnex is a digital asset management system that we have designed and built here specifically for ICPSR, but on the premise that ICPSR is a data science institution and we manage that data the same way regardless of discipline. Because of that, we’ve been able to support many more types of flexible data structures as well as flexible components in our system. You’ll find as we go through this presentation that many of those components are widely available and deeply integrated in the platform that we built here at ICPSR. Again, it’s based upon our expertise, over 50 years in the business, but also on open source technologies that are proven and well supported within the open source communities.
The guiding principles we had were fairly high level but very specific to what we wanted to achieve with the product, which is released and in production today. As we mentioned, we have 16 different archives today, with many more knocking on the door to be a part of the ICPSR family, as well as wanting to integrate and use those systems and datasets with their own archives. So we have multi-tenancy; we are looking at many different types of datasets, archives, and repositories, from journal and institutional repositories to general archival systems. We have based it on web standards, W3C standards, as well as the Open Archival Information System, also known as the OAIS model. The product we have now released is OAIS compliant and takes care of many of the standards that have been set forth in that model. The key to note here is that these are federal standards that we adhere to at ICPSR. Again, we wanted to make it a service-oriented, componentized product that was modular, so we could plug in many different types of solutions if we found the best of breed in another product or another solution that maybe even another repository had. Following the service-oriented model that is out there today in the industry would allow them to plug into our product as well as our product to plug into theirs. Based on open source technologies, we’ve wired these various open source technologies together with the expertise of ICPSR and what we do best, which is the collection, aggregation, curation, and dissemination of data, and delivery back out in several reusable formats for various researchers. We’ve also made some cohesive technology choices that allow these various components to work together, with very flexible UI components as well; and scalability and the ability to handle these large datasets and peak activity were key long-term for what we’re looking for.
As I mentioned, most of these data science disciplines are starting to evolve into the combination or the collaboration of various different disciplines. Because of that, some of these disciplines have very, very large datasets that need to be integrated, with varying types of data: media, video, audio, things like that. All these things were part of the original thought process, our guiding principles in developing a comprehensive digital asset management platform for use at ICPSR and eventually by other repositories or brokers that might need this type of service. So we’re following a message-based integration platform. Again following the open-source mindset, we’re using Apache ActiveMQ as the messaging server, and Apache Camel provides a simplified integration of the most common Enterprise Integration Patterns (EIP) out there. If you look at the diagram on the screen, what you’ll see is how Camel routes many of the various types of messages back and forth through the ActiveMQ system. We have various processes for each type of message that we may want to leverage, and then we also have some content-based router processes as well. The components associated with this, as you’ll see at the bottom part of the screen, are the file component, JMS, and HTTP. They provide some very uniform endpoint interfaces to connect with other various resources. The diagram you see in front of you now is the base, the standard for the Open Archival Information System, broken down into one simple diagram. On the left you have the producers; on the right you have the folks who consume the research data that has been deposited. Many things happen to that data once it comes out of the researcher’s hand and gets submitted to ICPSR.
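[Editor’s illustration] The content-based routing mentioned in this passage can be sketched in a few lines. This is a plain-Python stand-in, not Apache Camel or ActiveMQ code and not ICPSR’s implementation; the message fields ("type", "filename") and the queue names are assumptions made up for the example.

```python
from collections import defaultdict, deque


class ContentBasedRouter:
    """Send each message to a named queue chosen by inspecting the message."""

    def __init__(self, default_queue="dead-letter"):
        self.rules = []                    # (predicate, queue name), checked in order
        self.queues = defaultdict(deque)   # in-memory stand-in for broker queues
        self.default_queue = default_queue

    def when(self, predicate, queue_name):
        """Register a rule; returns self so rules can be chained."""
        self.rules.append((predicate, queue_name))
        return self

    def route(self, message):
        """Deliver to the first matching queue, or to the default queue."""
        for predicate, queue_name in self.rules:
            if predicate(message):
                self.queues[queue_name].append(message)
                return queue_name
        self.queues[self.default_queue].append(message)
        return self.default_queue


# Hypothetical rules: new deposits go to scanning, SPSS files to SPSS processing.
router = (
    ContentBasedRouter()
    .when(lambda m: m.get("type") == "deposit.received", "virus-scan")
    .when(lambda m: m.get("filename", "").lower().endswith(".sav"), "spss-processing")
)

routed = [
    router.route({"type": "deposit.received", "filename": "survey.zip"}),
    router.route({"type": "file.stored", "filename": "wave1.sav"}),
    router.route({"type": "unknown"}),
]
print(routed)
```

The point of the pattern is that producers never name a destination; adding a new processing step means adding a rule and a consumer, not changing the sender.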
Generally speaking, what we want to make sure we do is capture the provenance of that data as it moves through the system, both upon ingestion and when it gets back out to the end consumer. The Open Archival Information System has the concepts of the SIP, the AIP, and the DIP: the SIP being the Submission Information Package, the AIP being the Archival Information Package, and the DIP the Dissemination Information Package. Many things happen, again, upon ingest. We want to make sure we have the data management portion of that data taken care of, as well as making sure we have sustainability through the archival storage. And then determining what type of access that data is going to have is predicated on the type of content of that data as well. Inside the larger box we’ll perform our preservation planning, at the top bar, as we go through those various steps, and in that slim line is the administration associated with that: how we track the data, guarantee authentication, access, things like that. Obviously that’s the overall management. This is a very large standard that was put out by the NASA engineering group. They actually developed the OAIS reference model. And we have been adhering to it as many more of the federal obligations are coming… federal researchers are asking us to follow a standard model for archival information systems. Today what we have… we affectionately refer to this in the Computing and Networking Resources team as our architecture doughnut. The outer ring, or the outer flavor of that doughnut, is all the different archives that can plug in there, and you can have various other types of archives as well. The doughnut is built with several layers inside. The central layer is infrastructure: what are the hardware components and the software infrastructure components that are necessary to sit there?
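[Editor’s illustration] The SIP, AIP, and DIP hand-offs described earlier in this passage, with provenance carried along at each step, might look roughly like the sketch below. The class, function, and agent names are hypothetical and do not come from Archonnex; a real OAIS implementation records far more than this.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InformationPackage:
    kind: str                      # "SIP", "AIP", or "DIP"
    files: dict                    # filename -> bytes
    provenance: list = field(default_factory=list)


def record(package, event, agent):
    """Append one provenance event to the package's trail."""
    package.provenance.append(
        {"event": event, "agent": agent, "at": datetime.now(timezone.utc).isoformat()}
    )


def ingest(sip, agent):
    """SIP -> AIP: copy the content and carry the provenance trail forward."""
    aip = InformationPackage("AIP", dict(sip.files), list(sip.provenance))
    record(aip, "ingested", agent)
    return aip


def disseminate(aip, wanted, agent):
    """AIP -> DIP: package only the requested files for a consumer."""
    dip = InformationPackage(
        "DIP", {name: aip.files[name] for name in wanted}, list(aip.provenance)
    )
    record(dip, "disseminated", agent)
    return dip


sip = InformationPackage("SIP", {"data.csv": b"id,age\n1,34\n", "codebook.txt": b"age: years"})
record(sip, "submitted", "producer")
aip = ingest(sip, "deposit-manager")
dip = disseminate(aip, ["data.csv"], "access-service")
events = [entry["event"] for entry in dip.provenance]
print(events)
```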
Then, outside of that inner circle, you have the very next layer: the various application components that will sit on top of the architectural components. Architectural components may be something like a relational database management system, such as Oracle or MySQL or others that are out there. Again, because we’re component-based, we have abstracted each one of these components to be interchangeable. If we decide to use a different database, for example, or even add an additional database, our architecture is designed and developed to just plug that database in and follow the standard communication mechanisms and integration points that are standard with both their product and ours. Going through some of those different components in the center, you’ll see the metadata manager and the repository engine. We have a separate piece of software for the repository, which is specifically designed for data repositories, as opposed to a relational database system, which is much broader in scope; the repository software is really geared around the Open Archival Information System standard and delivers many of those features and functionalities itself. Search engines: many of you are familiar with various ones. We currently employ two different types of search engines. Cloud storage: obviously we can tie in various cloud storage. For the most part we’re currently using Amazon Web Services. You’ll see some of the different components we’ve laid out here, and some of the names we’ve associated with the various components. Fedora, not to be confused with Red Hat Fedora, for the more technical folks on this call. Fedora is an acronym in and of itself; it is basically a product that tracks all files with its file-level access mechanism, as well as the various permissions, data types, and all the associated metadata with each file as it comes in.
And also how they are aggregated up into packages or deliveries of certain types of publications. Solr is one of the key open-source search engines available today. We also use another search engine called Elasticsearch, which does many of our analytics for us. So a file that comes into our system today, in our architecture, in Archonnex, will be tracked, and every activity that happens is captured through log files that we can generate further information from as well. We’ll talk a little bit about why this architecture at its high level is as important as the low level as we get further into the discussion. The main reason for the architecture doughnut is to give you a visible representation of what we’re doing. So if you took a slice, if you will, through this doughnut coming in from openICPSR, you’re going to see openICPSR leverage the deposit manager. There’s a SEAD agent you’ll see on the left side of your screen over here, and you also see various web components. They’re using a smattering of some of these components because they’re not required to use all of them. It depends on the type of archive or the implementation that you want for your particular repository or archive here at ICPSR. The virus scanner: I’ll give you an example of why architecture is so important and its implementation critical. We have had datasets come in with over 200,000 files. One of the strategic things we employ at ICPSR is that everything gets virus scanned, even if we know it’s from a trusted source. So when we bring in a 200,000-file archive or deposit, it takes a very, very long time to archive.
With the new infrastructure that we’ve built and the platform that we have in place with Archonnex sitting on top of that, that virus scan, which used to take 42 days for that same dataset, now finishes in a matter of a few hours because it’s distributable across multiple systems as well as across multiple types of virus scanners. It’s not uncommon today for organizations to have multiple virus scanners to catch various types of viruses. There is no one-size-fits-all in the virus scanning and malware world. Because that’s the case, many times at ICPSR we will scan multiple times on various scanners. Because of that, we need to run those through very fast, and the way we do that is by having the ability to dynamically expand our service availability, our service processing, across multiple hardware instances as well as multiple software instances for that product. The same goes for all of our components. They have that type of expandability or scalability. So let’s talk a little bit now about some of the various components that we are building in the system. I’ll give you an update, too: even since we put this presentation together, we have identified probably another eight or nine different components that ICPSR will be adding to our base product, to help segregate and collaborate better; that is, to segregate various components so that we can collaborate and integrate more easily with some of the other systems that are out there. Single Sign-On and ID Management: what this really helps us do is give the end user, the consumer of the data that we have made available, a simple way to come in, whatever is simplest for them. Many times in the existing ICPSR platform you would just have a MyData account, which is common inside of ICPSR.
But we’ve expanded it quite a bit now to say: use whatever ID you feel most comfortable with, and we will link that internally and make sure you are who you say you are through regular authentication mechanisms. Today we handle most of the social IDs that are out there. ORCID ID for sure. We also integrate that into our MyData as well. We’re using it in openICPSR as well. The authorization management supports role-based access controls as well. Why is this important? Because many systems give you your own ID, and remembering all those passwords is, frankly, just very difficult. It’s very commonplace today to use your Facebook or your LinkedIn login. It’s very common to tie in your Google email, and that works very, very well. So we picked those commonalities up and applied them to our products as well. The deposit manager supports data ingest, so you’re pulling in the data from the actual producer. It takes care of the ingest and storage of that data initially. It coordinates all the virus scanning, does some of the variable extraction if possible, and some of the simple, basic metadata extraction. We now will have some image processing going on because of the way we’ve restructured and reconfigured our architecture to meet some of the more pressing needs of data as data has evolved over the last 50 years. We are now looking at components to do various types of image processing. One of the components that I looked at Friday, which will eventually begin moving into our product as well, is a product that offers, I believe the number from this particular service provider is, over a hundred different video playback mechanisms. So if you have a specific type of format and we have it stored with our system, because the new system no longer requires… currently it restricts any type of file type you give us.
It will take them all; whether we can do certain things with them is a different issue, but at least we have that data stored in our archive for the long haul. As we move forward and continue to evolve the Archonnex platform, you’ll see more and more of these types of capabilities grow into what we offer here at ICPSR. Again, standard protocols for ingest: HTTP, SFTP. Even email will integrate with a workflow engine, which is called Activiti. We’ll talk a little bit about that as we go forward and what that could mean to various archives or institutional repositories. The Preservation Manager: obviously we have to have that from the perspective of, how do we get at the data consistently? How do we package that data up and archive it? How do we actually disseminate that package for distribution later? So we’re replicating all the archival information packages for long-term preservation, and obviously performing fixity checks and things like that. One of the things we do on the preservation side, which again is not a component of Archonnex but is a process or a methodology that benefits from Archonnex, is that we have a preservation process that goes out and keeps six different backup copies of the things that we put into archival information. Typically there is a deep archive, where we just keep things to go back and get them should we need them, while the active archive is handled inside of ICPSR. With the deep archive we have our backup copies and our replication copies of the data we’ve kept throughout the process. On the Search Manager, what we’ve been able to do is work real closely with Solr. We have an existing relationship with Solr today for our current, older ICPSR implementations. We are now expanding that in Solr to cover some of the various keywords and metadata.
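[Editor’s illustration] The fixity checks mentioned above for the Preservation Manager amount to recomputing a checksum for every stored copy and comparing it with the value recorded at ingest. The sketch below uses SHA-256 from the Python standard library; the replica location names and the choice of hash are assumptions, not details given in the talk.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Checksum of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def fixity_report(expected_digest, replicas):
    """Map each replica location to True if its copy still matches the recorded digest."""
    return {
        location: sha256_of(content) == expected_digest
        for location, content in replicas.items()
    }


original = b"respondent_id,score\n1,42\n"
recorded = sha256_of(original)          # digest stored alongside the AIP at ingest

replicas = {
    "active-archive": original,
    "deep-archive": original,
    "offsite-copy": b"respondent_id,score\n1,4X\n",   # simulated silent corruption
}

report = fixity_report(recorded, replicas)
damaged = sorted(location for location, ok in report.items() if not ok)
print(damaged)
```

With several independent copies, a failed check on one replica can be repaired from any replica that still passes.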
Prior to this platform coming in, we struggled with key… certain file information, because we were taking in studies, which many times were made up of multiple files. Today that’s kind of been set aside. We can deal with data on a file-level basis, and because of that we don’t need many more types of data. We can address various types of [unintelligible] and even keep a controlled vocabulary, where that is necessary. We’re also exploring GeoBlacklight for search and dissemination of geographical data. You may recall that on the architecture doughnut there were some thumbnails and also a geospatial or geo-tagging component. We’re looking at expanding the capabilities of this technology in Archonnex. The main reason for that, as I mentioned before, is that many times when we do social research we will have various types of mapping data, or data that has various types of geo-tags such as a ZIP code or latitude and longitude. What we can eventually do then is take that data, map out the results of research, and potentially see patterns, visually, on a map from the results of the data that we got. Additionally, the other thing we can look at is geo-tagging other numerical data and doing overlays of this type of data as well. That doesn’t exist today, but that’s the vision ICPSR has looking forward as it pertains to data, and it’s why the title of this is Data Science Management for All: not just the social and political sciences, but how all the other disciplines are tying in with social research as well. I’ve already mentioned the virus scanner and why that has obviously become very important for us. In our particular case, we’re scanning with two different types of virus scanners today, Sophos and ClamAV. If we decide to add a third or replace one, it’s not so much a coding change as it is just an interface change for Archonnex.
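[Editor’s illustration] The idea of scanners sitting behind a common interface, with a large deposit fanned out across workers, can be sketched as follows. The "scanners" here are toy byte-signature checks with invented names; they do not call Sophos or ClamAV, and a thread pool on one machine only stands in for the distribution across multiple systems described in the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Scanner = Callable[[bytes], bool]   # returns True when the content looks clean


def signature_scanner(signature: bytes) -> Scanner:
    """Build a toy scanner that flags content containing the given bytes."""
    return lambda content: signature not in content


# Adding or replacing a scanner is one entry here; the scan logic does not change.
SCANNERS: Dict[str, Scanner] = {
    "scanner-a": signature_scanner(b"EVIL"),
    "scanner-b": signature_scanner(b"MALWARE"),
}


def scan_file(item):
    """Run every registered scanner on one file; return the names that flagged it."""
    name, content = item
    failed = [label for label, scan in SCANNERS.items() if not scan(content)]
    return name, failed


def scan_deposit(files: Dict[str, bytes], workers: int = 4) -> Dict[str, List[str]]:
    """Scan all files of a deposit in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scan_file, files.items()))


deposit = {"a.csv": b"1,2,3", "b.bin": b"xxMALWARExx", "c.txt": b"EVIL MALWARE"}
results = scan_deposit(deposit)
flagged = {name: hits for name, hits in results.items() if hits}
print(flagged)
```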
Again, we could add a third one if we deem it appropriate for a different type of scanning. Those types of things are valuable, but what adds additional value is that when we get a large data deposit in, that deposit no longer backlogs everything else from getting done. It runs on its own, and we can still run multiple ingestions at the same time. So performance-wise, we’ve been able to see linear performance as it pertains to the processing of data from ingest into the archival information package of the OAIS model. The SPSS Processor is another feature that we’ve added. We’re taking advantage of IBM’s SPSS libraries to do some SPSS processing on the data that comes in, as necessary, assuming we have SPSS files: analyzing for potential missing variables as well as inconsistencies, pulling out whatever variables and metadata we can for use in future online analysis tools, and just doing some initial pre-processing of the file electronically, as opposed to waiting for certain things to be done by the submitter of the data and/or by our processors here at ICPSR. Again, the more we can take care of at the system level, the quicker we can get our talented processors in front of these files with better work to do, as opposed to working on things we can systematize. The key component of this entire architecture, I think, is really the Open API. This is the access, if you will, to all the services that I have already discussed and the ones we will discuss, as well as the RESTful services that we’ve implemented to enable metadata harvesting and pull abstracts out. We’ll talk a little bit about how we’re using this API with one specific product researchers are using today that was funded by the NSF, but this type of approach allows us to open up capabilities to the various other repositories and software systems out there as well.
We’ll talk a little bit about that as we move on through the presentation. Standard formats typically are RDF, JSON-LD, and DDI XML. XML: we have various types of formats we can pull and communicate with as well. Those will be provided with the various services that you have access to from the API. Our Workflow Engine, as I mentioned before, is another component: the ability to start assigning business processes in a very standard, cohesive way using a workflow engine, what’s called a business process management tool. Business process management software has been out there for a while. We’re using the open source one called Activiti. Our chosen technology works very, very well by ensuring common features and functionalities are executed the same way all the time, as opposed to depending on human… not human interaction, but human dependence to get things done in a very specific way. The workflow makes sure the process is followed; it doesn’t necessarily do it for you, but it helps guide the processor. No matter how experienced or inexperienced they are, they can leverage this, and the workflow should do the same thing. Additionally, one of the things the workflow can do for you is take the same dataset and process it in several different ways without losing the capability in between those dataset processes. Many opportunities here with this particular product. ICPSR is just beginning to scratch the surface with it, but we’re looking at some very different ways to help enable and energize our processors to do more with less using the technology this product brings. Leveraging Elasticsearch and Kibana will allow us to do quite a bit on reporting and analytics. It gives us consolidated storage for all the logs that we have, and we can take those logs and build specific types of reports out of them.
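[Editor’s illustration] Of the exchange formats named at the start of this passage, JSON-LD is the easiest to show briefly. The sketch below serializes a study record using the public schema.org Dataset vocabulary; the talk does not say which vocabulary or field names the Archonnex API uses, so the record layout, identifier, and values are hypothetical.

```python
import json


def to_json_ld(study: dict) -> str:
    """Serialize a minimal study record as a schema.org Dataset in JSON-LD."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "identifier": study["id"],
        "name": study["title"],
        "description": study["abstract"],
        "keywords": study.get("keywords", []),
        "distribution": [
            {"@type": "DataDownload", "name": f["name"], "encodingFormat": f["format"]}
            for f in study.get("files", [])
        ],
    }
    return json.dumps(doc, indent=2)


# Made-up study record for illustration only.
example = {
    "id": "example-0001",
    "title": "Example Household Survey",
    "abstract": "A made-up study used to illustrate the record shape.",
    "keywords": ["survey", "households"],
    "files": [{"name": "wave1.sav", "format": "application/x-spss-sav"}],
}

serialized = to_json_ld(example)
parsed = json.loads(serialized)
print(serialized)
```

Because the output is ordinary JSON with a shared context, a harvesting repository can index it without knowing anything about the producing system.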
Different types of discovery, and even some of the Google Analytics side of things can be fed into here as well. Because of the way we’ve integrated the system, it’s a natural flow from wiring all these various components up. Stand-alone, they’re pretty good components, but when we get them wired up into the architecture we have, a simple download becomes two or three pages of different types of analytics generated on behalf of that data, for the benefit of the processors, for the benefit of the management systems, even for the benefit of Network, showing how we do certain things. Before, we just didn’t have access to that data because we simply didn’t capture it. With the new architecture under Archonnex, we now have that capability, and it allows us to grow and build various types of components, not to mention brand-new systems, off of the product. So with add-on modules we can derive and extract various custom attributes. As with specific processors like the one for SPSS, we could build certain metadata processors, or certain variable data type processors, even certain security-level or sensitive-data pre-processors and things like that as well. These types of components allow us to run the data through various modules; security modules, for example, are a real good one. Being able to handle geospatial data is another one. When geospatial data comes in for mapping, a particular mapping file or what have you, we have these processes that can gather that information and provide it back to the various users, or even as reusable data that we can provide to the secondary users of the data, beyond those who ingested it. Security again: disclosure risk.
We can follow some various standards with disclosure risk, or at least the pre-processors give us a lot more capability and, I think, allow us to generate quite a bit more information about the data we’re getting in. At the end of the day, that’s really one of the key missions of ICPSR: generating that data so it’s much more valuable and reusable for the other researchers out there. Talking a little bit about the Geo Tagger: by deriving geographic information from the inputs, from an address or an IP address, we can put markers on a map so that you can visually see different patterns in that information. We can also take that data and feed it into the next level of geospatial processing, and potentially do some other online analysis with that as well. One of the things I mentioned earlier was our integration with a particular product. In this particular case, we’ve already built an integration for an external data producer, SEAD. SEAD is an NSF-funded project that I actually helped work on, as did Harsha. Building interfaces for these types of systems is extremely valuable for both ICPSR and those systems. We started off with SEAD. For those who don’t know, SEAD stands for Sustainable Environment Actionable Data. Its primary focus is to give researchers a chance to leverage the infrastructure of this particular software product, build their own collaborative workspace with other researchers, go through the various types of management of their research processes from beginning to end, and actually do a push-button publish to another archive. openICPSR was the very first archive available for them. There is also another one at Indiana, and I’m drawing a blank on that one. I believe it is called SDA, but I don’t recall off the top of my head.
And there are a couple more that have been looked at so that researchers, once they finish their work, are able to publish/archive, which is really the intent of the Federal Government: now that you’ve done the work that we paid you for, we want the data to become available and be sustainable throughout the life of the product. openICPSR is that product; SEAD is the mechanism to do the research; and we built the interface to just transform that data into an already existing archive that’s sustained by a long-term organization. And it’s really helped tremendously for the researchers on SEAD as well as some various other organizations [unintelligible] product. In addition to SEAD, we’re entertaining interfaces with Dataverse from Harvard. I talked to Mercè over there. At a minimum, we’re going to be building those metadata exchange points so that SEAD and Dataverse can pick up our metadata at ICPSR and use it with some of the searches that they do. Again, the world is really at the hands of the researchers, but if we can’t get them access to that data, or at least tell them where to find it, that becomes very problematic. So that is one of the reasons, again, for Archonnex: its ability to at least discover data, or the points where data is, for the researcher and for secondary reuse or even tertiary reuse as well. OSF, the Open Science Framework, is very popular [unintelligible] out there as well. I’ve had conversations with those folks too. They’re on our horizon to build an interface with, and to integrate their framework into what we do so we can share back and forth. And of course, many people are already familiar with Figshare. It’s very limited, more of a storage area that you can build a lot of things on.
What we’ve done with ICPSR is we’ve already built those things for you, and we’ll put Figshare in as another collaboration point or integration point for our product here at ICPSR. Our technology stack… I’m not sure if you can read it, but I’ll try to go through it real simply for you. We have a web UI, we have our desktop UI, and we have our batch automation UI, which are kind of the core components of what we do and how we work here at ICPSR. We’ll have some web applications, and there might be some desktop applications; if so, any desktop components are going to be built with Java Swing and Java Web Start. We’ll have certain protocols; the protocols are pretty similar across these, depending on the type of work you need to do. For the technology folks out there, our batch automation is typically going to be Java RMI and SSH, while typical web is your standard REST with JSON and JSON-LD. It’ll do some XML if necessary. The RESTful interface has been able to handle all the things we’ve needed so far. There might be some security things where we want to clamp down a little bit and we might want to go to rest, but I think the way we’re doing security now is probably going to suffice. And I’ll let my architect, Harsha, yell at me if he thinks I’m wrong. He does that a lot now and then. On the bottom side of the diagram you’ll see the various high-level components that kind of tie up what we’re doing: how we move back and forth between the web, the desktop, and the batch automation. We’re using Git, which is, again, another open source source control management system. Interestingly enough, Figshare is built on top of Git; I’m not sure if everybody knew that. So Figshare is kind of a wrapper around Git, which is our source code management system. We have some build tools: Ant, Maven, and Bamboo is another one. Bamboo is also a productivity tool. Again, we talked about our messaging ESB (Enterprise Service Bus) and our message broker with Apache ActiveMQ.
We do have some RabbitMQ as well. Storage, quite a bit of it: we have the Amazon cloud and the DuraSpace cloud, as well as our own internal network file storage, and we’re using Oracle, MySQL, and PostgreSQL. The data stores are really wide open from our perspective. We’ll use whatever we need to for the benefit of that particular archive or repository. Because it’s componentized, we just need to build the interface for that; most of those interfaces exist, so we just need to plug into them for the standard storage systems. All of our servers are currently on Linux. On the desktop we support Windows, Mac, and Linux for web-based items as well. And again, it’s all Java-based. Because it’s Java-based, it’s pretty much platform-independent, with some minor tweaks depending on how old your platform is and whether it is supported by the latest levels of Java. So that’s the overall technology stack, if you will: the various components we’re using inside each one of the breakdowns in that stack. What you see here now are the various technology components we used to build our architecture, or partners. ORCID, for example, is a partner, through the ORCID ID, which is a very standard ID for researchers. We’re using a couple of different JavaScript libraries: we’ve used React, we’ve used Bootstrap. For the [unintelligible], because we’re using a Spring interface. We’re using Isilon as well for disk storage. We talked about [unintelligible] and ClamAV. jQuery is part of the work, is the main work we’re doing with the database. Elasticsearch and Kibana give us all of the various analytics that we’re looking for. We’re moving to the latest version of Apache Solr, which will significantly improve the capabilities we use.
Again, the standard web consortium that we’re using today, the W3C standard, along with the OAIS standards, have allowed us to take these various open source components and wire them all up to start building, to start providing services on a more open basis. What this does for us quite a bit, frankly, is open up a lot of integration points for various other datasets coming in and/or repositories and/or researchers. So we’re fairly wide open to allow a lot of connectivity via component-based development. openICPSR, which is a component, as you may remember on that architecture doughnut, is a component that sits on top of the outer band there. We released that August of this year on the new Archonnex platform. It has been, so far, widely received. We’re building more and more capabilities on it. As a matter of fact, we’re about to release a second product on there which actually wasn’t on the original doughnut, for a new product that came that we were approached on. And we’re building that new product on this platform and because of that this platform, we introduce that product much quicker now for the vendor is certainly capable. So a lot of interesting things have transpired over the last year here at ICPSR regarding technology: regarding our approach to technology, and then the technology and its impact on data, the way it’s handled, the way it’s managed, the way it’s processed, the way it’s stored. As well as the ability, internally on our operations, to see where we can find our improvements and make our processors more efficient and effective at what they do by letting the system do some of the things better, faster, and more efficiently as well. That makes our already valuable processors that much more valuable to the processes we have in place for operations. At this point, it’s the end of the presentation. So I’ll open it up for questions if there are any.
And if not, I’ve certainly enjoyed having you guys listen to the presentation and the opportunity to present to you the new things happening at ICPSR regarding not just our systems, but data science in general. And how the management of that data is starting to transform the way we do business here, not just at ICPSR but for researchers as well.
[Dory] Okay. So we have one question. Is there a file size limitation when using the SPSS processor?
[Tom] The only file limitation we have with the current openICPSR product is a download of 2 gig [gigabytes]. I’m not aware, I’m going to turn this over to Harsha, but I’m not aware of any file limit we have currently that is preventing us from doing something large. Harsha, do you have anything on that?
[Harsha] So we have not defined any limits on the size of the file, but I think we may have issues if it’s extremely large files. We may have to allocate more memory for the processing for that specific instance. I think we have sufficiently allocated at this point that we can handle most of the cases that we have come across in ICPSR’s archive.
[Tom] Okay, great. So not an official one, we’ll just have to see how big it gets. The good news though… I think that brings up another interesting point about the architecture, which is that in the past it would be very, very problematic. Today we can recognize it, restart, rebuild, do what we have to do, and address those problems much more efficiently in the computing network world.
[Dory] Can you speak more to how your new system shares record information/metadata with other archive products such as Dataverse?
[Tom] I can speak to it in general terms because we haven’t built the interface yet. The intent is to probably move that data in one of two formats back and forth with Dataverse specifically. Either DDI and/or XML at this point. The third option would be JSON. That’s something that I’m working with the folks at Harvard to clarify what they’re looking at.
Today Harvard picks up our data through an OAI-PMH server, and that’s how we’re exchanging the data currently. But longer term we’ll be using that as an interface as opposed to a batch process that gets done.
[Dory] We can wait a few more moments to see if any more questions come in.
[Tom] Sure.
[Dory] Well, looks like there’s a… Okay, so there’s one more question that just came in. It says: I would like to track usage by asking if the download will be used for classroom training. Can you add a question?
[Tom] … would like to track the uses by asking if the download will be used for classroom training, can you add a question? Oh, I see what you’re saying. So when they download the data from ICPSR, I think what they’re saying is, when they download, can we ask specific questions about what they’re doing? It’s not something that we’re doing today; it’s certainly a request we can add to the Product Management team and see what they think about adding that or maybe having specific type of datasets should that come out. I think it’s a good idea. It isn’t something we’ve really thought about, collecting metadata about the data being downloaded from the downloader. It isn’t really something we’ve looked at. We’ll collect IP addresses and things like that, but not necessarily asking specific questions of what the use of the data was for. So that’s something we could consider in product.
[Dory] Should we assume that this is strictly an in-house application with no plans for public distribution?
[Tom] No, I’ll expand on that. Actually I was in a meeting today and we are looking at more ability to distribute it, but it will not be at the… if the question is, will this be public source or open source? No, that’s not the intention of ICPSR to do that. There might be some components though, I think, that we may look at making open source that could benefit the generalized community.
[Dory] Okay, so it looks like the next one says… It would be great to know for Cal State University?
[Tom] And I’m not sure if that’s John’s previous question or if that’s [unintelligible]’s previous question to know about the question to add to the download. So I’ll answer them both. I think, yeah, I do think there’s going to be an opportunity to license the product from us if they desire. I know we presented this product at IASSIST in Norway and again at Open Repositories in Ireland this past summer. Very widely… a huge amount of interest was generated, primarily because we built the product for the intended use of ICPSR, but when we designed it, it was so that it could be used by anyone. So the intent is to try to make this something more valuable for anybody using data science. We’re a public university, so we don’t do a lot of software development in terms of for-profit here. So I do think there’s some things we’ll probably be working out from a licensing perspective across various institutes. Specifically those in our Consortium.
[Dory] Okay, well it looks like that’s the end of our questions. I just want to make a quick plug for our next session at 2pm, Orientation to ICPSR with a Fresh New Look, and we just want to thank you for attending the ICPSR Data Fair.
[Tom] Thank you again. It’s been great spending time with you folks out there and I hope it was valuable for you.

About the Author: Oren Garnes
