PROGRAM: P-23

Title:

LONG-TERM ARCHIVE SYSTEM FOR UNIVERSITY-WIDE RESEARCH DATA PRESERVATION
Takaaki AOKI1*, Shoji KAJITA1, Hirokazu AKASAKA2 and Hagane TAKEDA2
*1Institute for Information Management and Communication, Kyoto University, Sakyo, Kyoto, 6068501, JAPAN
2Planning and Information Management Department, Kyoto University, Sakyo, Kyoto, 6068501, JAPAN

Abstract:

In 2013 and 2014, Japanese academia was shocked by high-profile incidents of scientific misconduct. Members of the academy at research institutes, together with their colleagues at government offices and academic societies, were called upon to help with the urgent reconstruction and development of policies, guidelines, and procedures to ensure research integrity. In particular, a mandate to preserve research data for more than 10 years was issued to both researchers and research institutes. Many researchers naturally assume that their research data should be kept for as long as possible, protected from loss and corruption whether accidental or deliberate. However, this becomes extremely difficult to achieve when data preservation is mandated for every researcher. Ensuring the availability and integrity of research data for more than 10 years goes beyond an individual researcher’s personal IT skills. For this reason, Kyoto University and its central IT division, the Institute for Information Management and Communication (IIMC), decided to develop and provide a long-term data preservation system.

IIMC designed a stable and cost-effective research data archiving system in FY2016. This system consists of an enterprise content management (ECM) system and an optical disc storage system. The schematic concept is shown in Fig. 1. The ECM provides a user interface for document management functions such as "access control and auditing", "metadata tagging", "revision management" and "searchable content and metadata". However, it is difficult to secure research data for the long term with an ECM alone, because the lifetime of the ECM system's hardware, software and database structure is shorter than the required preservation period. This problem can be solved by connecting the ECM to a separate long-term preservation system in which the retrieved data is archived in classical, open data formats and file systems. For this system, the IIMC uses Oracle WebCenter Content (OWCC) and the FUJITSU Eternus DA700 data archiver. OWCC is ECM software that provides the requisite functions mentioned above. The DA700 is a disc array system built from Archival Discs. The Archival Disc is a 'write once read many (WORM)' medium and guarantees more than 50 years of data preservation time. Moreover, the discs are assembled in a cartridge and may be configured as RAID 5 or RAID 6 to improve redundancy.

The typical scenario for data operation and archiving is as follows.

  1. Users create folders and upload their research data to OWCC. They may organize their research data using OWCC functionality, such as tagging metadata, using revision control, or sharing data with collaborators for local use.
  2. A user can issue the 'archive' command on any folder under his/her administration. The archive command retrieves all content within the given folder and its descendants. This content is copied to the DA700 together with additional information such as metadata and the access control list. If content has several revisions on OWCC, the content’s owner can choose to copy either the latest revision only or all revisions to the DA700.
  3. When the data copy from OWCC to the DA700 is finished, the index information from the DA700 is attached to the source content as metadata. Additionally, access to the source content on OWCC is set to 'read only' for everyone, including the owner of the content. This ensures that the content on OWCC and on the DA700 is the same. The content owner may regain write or administrative access through several steps on OWCC and then make another copy on the DA700. This feature enables the user to keep successive archive revisions on the DA700 while reducing frequent copying to the DA700.
Under the system operation policy, no user can access the data archived on the DA700. This means that the data on the DA700 is treated as a 'dark archive', which also ensures fairness in the research data preservation procedure.
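
The archiving scenario above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the OWCC or DA700 interface: the names `Content`, `Folder`, `DarkArchive`, `archive_folder` and the `archive_index` metadata key are all hypothetical and chosen for illustration. It shows the three behaviours described in the steps: recursive collection of a folder's content, a choice between copying the latest revision or all revisions, and locking the source to read-only once the archive index has been recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Content:
    """Hypothetical stand-in for one content item on the ECM side."""
    name: str
    revisions: List[bytes]                      # oldest first, latest last
    metadata: Dict[str, str] = field(default_factory=dict)
    acl: List[str] = field(default_factory=list)
    read_only: bool = False


@dataclass
class Folder:
    """Hypothetical folder holding content items and subfolders."""
    name: str
    contents: List[Content] = field(default_factory=list)
    subfolders: List["Folder"] = field(default_factory=list)

    def walk(self, prefix: str = "") -> Iterator[Tuple[str, Content]]:
        """Yield (path, content) for this folder and all its descendants."""
        path = f"{prefix}/{self.name}"
        for item in self.contents:
            yield f"{path}/{item.name}", item
        for sub in self.subfolders:
            yield from sub.walk(path)


class DarkArchive:
    """Stand-in for the write-once archive: records are only ever appended,
    and no user-facing read method is offered (the 'dark archive' policy)."""

    def __init__(self) -> None:
        self._records: List[dict] = []

    def write(self, path: str, revisions: List[bytes],
              metadata: Dict[str, str], acl: List[str]) -> int:
        self._records.append({
            "path": path,
            "revisions": list(revisions),
            "metadata": dict(metadata),
            "acl": list(acl),
        })
        return len(self._records) - 1           # index of the new record

    def record_count(self) -> int:
        return len(self._records)


def archive_folder(folder: Folder, archive: DarkArchive,
                   all_revisions: bool = False) -> List[int]:
    """Copy every item under `folder` to the archive, then lock the source."""
    indexes = []
    for path, item in folder.walk():
        # Step 2: copy either the latest revision only or every revision.
        revs = item.revisions if all_revisions else item.revisions[-1:]
        idx = archive.write(path, revs, item.metadata, item.acl)
        # Step 3: attach the archive index to the source and set it read-only.
        item.metadata["archive_index"] = str(idx)
        item.read_only = True
        indexes.append(idx)
    return indexes


if __name__ == "__main__":
    root = Folder(
        "project",
        contents=[Content("data.csv", [b"v1", b"v2"])],
        subfolders=[Folder("raw", contents=[Content("scan.tif", [b"img"])])],
    )
    dark = DarkArchive()
    print(archive_folder(root, dark))           # archive indexes, one per item
    print(root.contents[0].read_only)           # source is now locked
```

In this model, re-archiving after the owner regains write access would simply append new records, which mirrors how a write-once medium keeps earlier archive revisions intact rather than overwriting them.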