Canadian Museum of History’s Digital Collection Inventory: A Case Study Applying the Digital Preservation Toolkit

This document recounts the digital inventory process undertaken by the Canadian Museum of History (CMH) as a first step in developing a digital preservation plan and policy. It provides a case study of how the Canadian Heritage Information Network’s (CHIN) Digital Preservation Toolkit was applied in a larger museum. The document was produced by Paul Durand of the CMH in collaboration with CHIN over the course of 2015 and was edited for web publication by CHIN.

Introduction of the Digital Preservation Toolkit

In 2013 CHIN introduced the Digital Preservation Toolkit intended to “offer concrete steps to identify digital material found in your museum, the potential risk and impact of lost material, and how to get started in the development of Preservation Policies, Plans and Procedures” ^{Footnote 1}. In feedback provided after the launch of this toolkit, museum staff requested more information on how to use the tools and examples of the final products.

In response to this request, and to ensure the toolkit worked as intended, CHIN began searching for institutions to test the tools and to share the results as case studies with the cultural heritage community. CHIN approached institutions that had helped create the toolkit, such as the Canadian Museum of History (CMH), which happened to be starting a review of some of its archival practices and procedures, with a special focus on digital collections. The plan included an inventory and a subsequent digital preservation policy. CMH agreed to use CHIN’s Digital Preservation Inventory Template ^{Footnote 2} to inventory its archival collection and to provide feedback on the template.

CHIN case studies with smaller institutions

Up until this point, CHIN had collaborated on digital preservation work with the 8th Hussars Museum in Sussex, New Brunswick, and the Medalta Museum of Medicine Hat, Alberta. The 8th Hussars is a small, volunteer-run organization whose only paid staff are interns in the summer months. Medalta is mid-sized. Both organizations are representative in their size and structure of the majority of Canadian museums, and neither has the resources to manage a system based on the Open Archival Information System (OAIS) reference model recommended by the professional digital archiving community. In each case, some of the toolkit resources were pared down to decrease the amount of work required, and simplified solutions were adopted that did not observe the OAIS model but still respected as many digital preservation best practices as possible, given the resources available.

These two case studies are useful resources for the majority of Canada’s museums. However, CHIN was still interested in seeing how the tools might be taken up by a larger institution. The Canadian Museum of History’s offer to participate as a case study was timely, and the fact that the first phase alone, the digital assets inventory, was sufficient for this study is testimony to the thoroughness with which CMH approached the task.

Archival reorganization at the Canadian Museum of History

In 2013, the CMH Library, Archives and Documentation Services department underwent a reorganization, which prompted a review of archival practices. Previously, each archival collection had operated as its own section. As part of the reorganization, the disparate sections were brought together and all archival and library collections fell under the care of the Information Management (IM) section. The six collections in the IM section now include Audio-Visual, Photos, Textual, Corporate Records, Library and a brand new collection called New Media for assets including websites and interactive content.

The reorganization has opened the door to a more holistic information management approach. As archival deposits and accessions come from a number of different internal and external sources, which do not always fit into one archival collection, it is challenging to maintain the integrity of the intellectual and technical organization of the deposit or accession. The goal is to streamline processes, procedures and practices so that multi-format and multi-collection deposits may be better managed together, not just intellectually, but also technically.

The timing of the launch of CHIN’s Digital Preservation Inventory Template coincided with CMH’s archival reorganization and provided the museum with a good opportunity, and tool, to start the inventory process.

The first step for CMH in re-evaluating its practices and procedures and creating a digital preservation policy was grasping the scope of the collection. This included not only the assets themselves, but also the practices, procedures and tools employed to manage the assets. The challenge consisted of scoping a digital collection that dated back to the 1970s and included a multitude of formats, managed according to the information management standards of the day. To be all inclusive, and because many existing digital collections are digitized objects, CMH conducted a concurrent analogue collections inventory using the same template and processes outlined in this document.

Purpose of the CMH digital collection inventory

Key goals of carrying out this inventory were not only to identify the quantity and nature of the content, but also to better understand the risk of losing access to it (i.e. the probability of losing access and the impact of losing access). This knowledge is important because it helps determine what investment should be made in a digital preservation solution.

The project consists of the following objectives and work phases:

Objectives

To determine the scope (quantity, format, digital size and location) of the digital assets under IM management
To identify areas of risk, including damage, loss of access and obsolescence

Phases

Completed
- Phase 1 – Provide an inventory that details the quantity, formats and digital size of CMH’s archival collection
- Phase 1 – Identify areas of risk, including damage, loss of access and obsolescence
To Come
- Phase 2 – Conduct a corporation-wide inventory of digital content requiring special consideration outside of the CMH archives, and integrate preservation practices at a corporate/information producer level
- Phase 3 – Develop and implement a preservation plan for the collections under IM management, along with recommended archival standards and formats

Details of the inventory process

Although it is described below in more detail, the inventory process can be broken down into four fundamental steps. In preparation, CMH adjusted and prepared the CHIN inventory template and identified its main groups of digital assets. With the template ready to be filled in, teams of staff conducted counts of both digital assets and the physical media housing digital assets. Next, the staff most familiar with each asset group met to discuss and complete the survey section of the inventory template. With the count and survey complete, the final step was cleaning the information collected and interpreting it as a set of risks on which CMH could act.

Formatting the Digital Preservation Inventory Template for Museums

The online Digital Preservation Inventory Template for Museums is formatted for the web and outlines the information to be gathered. However, CHIN provided the template as an Excel spreadsheet that is formatted as a form that can be filled out. The spreadsheet contains exactly the same information as the web version, but it does not conform to online accessibility standards.

Obtain a copy of the Excel template by sending an email request to CHIN Client Services

To better capture the statistics of CMH’s collection, the template was altered and further formatted to add some additional inventory requirements.

Case example: Amongst other additions, a “Networked Storage” section was added to the Location and Environmental Conditions table since much of the digital storage at CMH is server-based. This section was added to identify the various server locations where the asset groups are stored. A “Server/Network Security and Access” section was also added to the Security table to identify the levels of access granted to the networked locations.

Collection asset groups

The inventory starts by splitting all the material into asset groups (in short, digital assets that share enough similarities to be inventoried and surveyed together). Grouping assets together not only makes the counting and surveying easier but also provides more decisive results for preservation and risk assessment purposes.

CMH designated seven asset groups, which were based on its existing collections and are shown in Table 1. CMH further subdivided each asset group into processed and unprocessed material. In general, processed materials are those that have a reference number assigned and have been registered in a database or, in the case of older material, an Excel spreadsheet. Unprocessed material has no formal reference number and may be recorded in inventories but not in a database. Unprocessed materials include large accessions currently being ingested, materials that are due to be ingested but are backlogged, and a few special instances of assets in the care of the archives that will never be formally accessioned.

Case example: Special unprocessed material may include leftover images from internal photographers that were not deposited with the “chosen” images but are kept on hard drives with retention schedules in case more images from a particular project are requested.

Table 1: breakdown of the CMH asset groups
Processed	Unprocessed	Special
1a – Photos Processed	1b – Photos Unprocessed	N/A
2a – Library Processed	2b – Library Unprocessed	2c – Library - Digitized Publications Unprocessed
3a – Audio Processed	3b – Audio Unprocessed	N/A
4a – Visual Processed	4b – Visual Unprocessed	N/A
5a – Textual Processed	5b – Textual Unprocessed	N/A
6a – New Media Processed	6b – New Media Unprocessed	N/A
7a – Corporate Records Processed	7b – Corporate Records Unprocessed	N/A

Some asset groups required additional considerations:

The Library collection was further split to distinguish the collection of digital CMH publications from commercially available digital publications. Unlike other collections, Library houses commercially available publications such as CD-ROMs, which affected some survey responses such as “Years required to preserve.”
Although the Audio and Visual collections appear separate, they are in fact managed as one. Where necessary, the inventory results and survey answers were recorded either for the Audio and Visual collections separately or for the two collections together.
Corporate Records are managed with some different information management industry practices since the vast majority of assets are destined for either deposit to the textual archives or destruction once they have met the end of their retention schedule.

Tools

A variety of tools were used in the inventory process. An overview of each tool, along with its uses, is provided for reference.

Excel
- Not only is the template formatted in an Excel spreadsheet, but Excel was used to manipulate data outputted from other tools and to tally results.
- Most software that manages files allows for exported reports in formats that can be rendered in Excel, like .csv files.
- Many collections had previous information in Excel spreadsheets.
DROID ^{Footnote 3}
- DROID was used to inventory some collections that were stored either on servers or on hard drives. It was not used on collections stored on CDs and other physical carriers as it would be too time-consuming.
- DROID can create .csv reports that list information from directories, including but not limited to file format, file extension, location and size. The report is a “listing” and required us to manipulate the data in Excel to get counts of file formats.
- Although not used during this process, DROID also provides a link to the PRONOM ^{Footnote 4} registry of formats.
MS-DOS
- The “dir /s” command provides a listing of the files in a given directory and all of its subdirectories, including each file’s name (with extension) and size.
- It is much less resource intensive than DROID and is quick.
- It creates a .txt file that requires further manipulation in Excel to extract the desired data.
KE EMu
- The CMH image collection is partially managed in KE EMu, so it was used to help count assets.
Vubis
- The CMH textual and AV archives are registered in Vubis, so it was used to help count assets.
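The text listing produced by “dir /s” can also be reduced to per-extension counts with a short script rather than Excel. The following Python sketch is illustrative only and was not part of the CMH process. It assumes the US English layout of the “dir” output (date, time, size with thousands separators, file name), which varies with regional settings; the function name `tally_dir_listing` and the sample lines are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import PureWindowsPath

# Matches file lines such as "01/15/2015  10:23 AM         1,234,567 photo001.tif".
# Assumption: US English regional settings; adjust the pattern for other locales.
FILE_LINE = re.compile(
    r"^\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}(?:\s[AP]M)?\s+([\d,]+)\s+(.+)$"
)


def tally_dir_listing(lines):
    """Return {extension: [file_count, total_bytes]} from "dir /s" output lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        match = FILE_LINE.match(line.rstrip("\r\n"))
        if not match:
            continue  # skips headers, "<DIR>" entries and summary lines
        size = int(match.group(1).replace(",", ""))
        ext = PureWindowsPath(match.group(2)).suffix.lower() or "(none)"
        totals[ext][0] += 1
        totals[ext][1] += size
    return dict(totals)


# Hypothetical sample lines, invented for illustration.
SAMPLE = [
    r" Directory of H:\images",
    r"01/15/2015  10:23 AM    <DIR>          .",
    r"01/15/2015  10:23 AM         1,234,567 photo001.tif",
    r"01/16/2015  09:05 AM           100,000 photo002.TIF",
    r"01/16/2015  09:06 AM             2,048 notes.txt",
    r"               3 File(s)      1,336,615 bytes",
]

result = tally_dir_listing(SAMPLE)
for ext, (count, size) in sorted(result.items()):
    print(ext, count, size)
```

For a real listing, the lines of the .txt output file would be passed to the function in place of `SAMPLE`.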

Inventory: Overall approach

Although the various digital assets are managed with the same “preserve and make accessible” intent, archival practices can be different for each asset group. Various formats, collection structures, locations, databases, collection sizes, tools specific to formats, industry practices and legacy practices mean that one blanket inventory method could not be used for all of the asset groups. To ensure the inventory provided comparable numbers, one staff member was assigned to assist with the counts, tally the numbers and enter the results. The same staff member worked through the survey questions alongside the archivists responsible for each asset group so that the language used in the results was also consistent.

Inventory: Counting

The wide variety of file formats and storage types, whether server-based or on physical carriers, necessitated various counting techniques. Where possible, the techniques that could provide the most detailed inventory were used. However, certain collections could only be approximated.

Case example: Parts of the Audio-Visual asset group allowed for the most granular count by using DROID on the hard drives where some of the assets are stored. Conversely, the Textual asset group posed a challenge due to the quantity and variety of physical carriers. It would be unrealistic to expect a granular inventory of the file formats, file quantities and digital size of every diskette, ZIP drive, CD, etc. Instead, estimates were made by listing the file formats known to be in use during the era of each physical carrier along with the maximum capacity of the carrier. File quantities could not be counted. As a result, the older physical carriers in the Textual asset group were identified as an area of risk.

The complexity of the collection meant that several different combinations of tools were used. The choice was usually dictated by factors that allowed varying degrees of accuracy and detail: whether the collection was on a server or on physical carriers, whether existing metadata for the collection was in a database or an Excel spreadsheet, etc.

The methods used can generally be broken down into the following:

Using DROID for a granular and accurate count
- DROID is a free and portable file identification tool created by The National Archives in the UK. Amongst other information, it identifies file formats and produces a link to the PRONOM technical registry which provides information about the file format.
- DROID can be used to create file listings that include information such as file size, file type and file extensions. The file listing can be exported to a .csv file and the data manipulated to strip out any data that should not be counted. Once the rows containing only the information for the files to be counted are left, it is easy to sort by format, to count formats, to tally file sizes, etc.
- DROID identifies file formats a number of different ways that are very useful for digital preservation, including “signature identification” which identifies signature patterns of file formats and versions.
- This method provides the most granular inventory and was used for some smaller server-stored collections and hard drive–stored collections. Network bandwidth limitations prevented larger collections from being inventoried with the tool.
- DROID is faster on collections made up of a small number of large files (audio-visual) than on collections made up of a large number of small files (images, textual).
- It was used for parts of the following collections: AV, Photos, Textual, New Media, Library.
- Case example: We used DROID on 16 hard drives containing audio-visual assets. DROID took approximately one minute to create a file listing for each 2 TB drive. From the file listing, we created an exact count of file formats, file quantities and digital size.
Using the MS-DOS “dir” command
- The MS-DOS “dir” command is a simple command that provides a list of contents found in a directory.
- Here is an example of a command that will create a .txt file called “listing.txt” on the C drive/temp folder with a list of the contents of the images folder on the H drive:
  - H:\images>dir /s > c:\temp\listing.txt
  - [drive:][path]>dir /s > [location to save .txt output file]
- It is useful where DROID would be too taxing on a network or server.
- It requires good knowledge of Excel functions such as “Sort and Filter,” “Replace” and “Text to Columns” to distill text-formatted data into useful spreadsheet data.
- It was used for parts of the following collections: AV, Photos, Textual.
- Case example: We used the “dir /s” command to inventory some server spaces where DROID would be too resource-intensive, including the Audio-Visual and Textual asset groups.
Using “Properties” for a general count
- Often referred to as “right click – Properties,” this is a common way to find the size of a folder’s contents and the number of files located in the folder. It is neither granular nor exact since you cannot exclude files you don’t wish to count (such as an Adobe Bridge sort file or supporting documentation such as a finding aid). However, the numbers are accurate enough to support decisions and to give a fairly good idea of file numbers and collection size.
- It is useful for preliminary inventories to get a rough idea of file count and digital size of collections about to be inventoried by more powerful software such as DROID.
- This method was initially used for some larger server-stored collections and hard drive–stored collections where limited network bandwidth or quantity prevented DROID from being used.
- It was used for parts of the following collections: AV, Photos, Textual.
- Case example: The high file counts and digital size of the Audio-Visual and Textual asset groups stored on the server were discovered using “right click – Properties.” This indicated that DROID would be inappropriate as it is resource-intensive across the server; therefore, the “dir /s” command would be a better tool.
Using databases, finding aids and existing inventories
- We found these tools the most useful method for counting carrier types and quantities. This information is often recorded upon ingest, so it already exists for many of CMH’s collections. Since different collections were managed by different systems at different times, information needed to be pulled from multiple resources then put together. Even where numbers were not recorded, it at least pointed us in the direction of physical carriers to be counted.
- This method was used for some larger collections stored on physical carriers where information already existed.
- Systems and software used were KE EMu, Vubis and Excel.
- It was used for parts of the following collections: Textual, AV, Photos, Library
- Case example: This was most useful in counting the physical carriers in the Textual collection. In the past, physical carriers such as diskettes and CDs were deposited and stored in folders with the paper assets. To check every folder and box would not be a good use of time. Instead, Vubis and finding aids provided the locations, and staff physically confirmed the numbers for the bigger groups of physical carriers.
Taking a manual count for physical carriers and digital assets
- Manual counts were used in two instances: for physical carriers and for assets with source files.
- For some of the smaller collections of physical carriers, or where numbers pulled from existing datasets were outdated, we had to physically count the carriers. This was done primarily for the textual archives which had some older catalogue records that recorded the existence of digital carriers but not the quantity. Although very small, CMH’s collection of hard drives was also counted manually.
- Manual counts were used for some collections where an “asset” is many files together.
- Manual counts were also used where automated tools would count an asset’s source files but not the asset itself.
- Case example: CMH’s very small websites collection has thousands of files, but they make up only a handful of websites. We used DROID for the source files, but the quantity of websites was counted manually.
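The DROID method described above (export a file listing to .csv, strip out rows that should not be counted, then count formats and tally sizes) can be sketched in Python instead of Excel. This is a minimal sketch, not the process CMH used. It assumes the export contains columns named TYPE, EXT, SIZE and FORMAT_NAME and that folder rows carry the TYPE value “Folder”; check the header row of your own export before relying on these names. The function name `tally_droid_export` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tally_droid_export(csv_file, skip_extensions=()):
    """Count files and sum sizes per format from a DROID-style CSV export.

    Assumed column names: TYPE, EXT, SIZE, FORMAT_NAME.
    """
    counts, sizes = Counter(), Counter()
    for row in csv.DictReader(csv_file):
        if row.get("TYPE") == "Folder":
            continue  # folder rows appear in the listing but are not assets
        if (row.get("EXT") or "").lower() in skip_extensions:
            continue  # e.g. supporting documentation such as finding aids
        fmt = row.get("FORMAT_NAME") or "(unidentified)"
        counts[fmt] += 1
        sizes[fmt] += int(row.get("SIZE") or 0)
    return counts, sizes


# Hypothetical rows, invented for illustration; a real export has more columns.
SAMPLE = """NAME,TYPE,EXT,SIZE,FORMAT_NAME
drive01,Folder,,,
interview01.wav,File,wav,1000,Waveform Audio
interview02.wav,File,wav,2000,Waveform Audio
clip01.mov,File,mov,5000,Quicktime
finding_aid.docx,File,docx,300,Microsoft Word
"""

counts, sizes = tally_droid_export(io.StringIO(SAMPLE), skip_extensions={"docx"})
for fmt in sorted(counts):
    print(fmt, counts[fmt], sizes[fmt])
```

For a real export, the file would be opened with `open(path, newline="", encoding="utf-8")` and passed in place of the `io.StringIO` sample.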

Each asset group required a combination of the five methods to achieve the highest degree of accuracy possible. If an asset group could be counted with granular detail, it was. Although it was generally avoided, some collections stored on physical digital media formats such as MiniDisc or 3.5-inch diskette only have an estimate based on the maximum capacity of the physical media. In such cases, no file count or file formats are listed.

As a result of having to use different tools and techniques, notes were added to the inventory to give context to the numbers produced by these tools. Additionally, to provide commensurate numbers, the summary sheet of the inventory was divided between files that could be counted in detail and files that could not.

Simple manual counts and “right click – Properties” should only be considered a first step, or triage step, that provides imprecise numbers yet still draws attention to collection risks. Sometimes there is no added value in counting every object. Instead, a general number is sufficient; more precise numbers will be generated when the risk is remedied through a project, such as migrating from physical carriers to server storage.
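Where a triage count needs to leave out files that “right click – Properties” cannot exclude, a short script can walk the folder instead. The Python sketch below is illustrative and was not part of the CMH process. The function name `count_folder` is hypothetical, and the skipped file names (an Adobe Bridge sort file assumed to be named “.BridgeSort” and a finding aid) are assumptions chosen to mirror the examples mentioned earlier.

```python
import os
import tempfile


def count_folder(root, skip_names=(), skip_extensions=()):
    """Return (file_count, total_bytes) for root and all of its subfolders."""
    file_count, total_bytes = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            if name in skip_names or ext in skip_extensions:
                continue  # leave out files that are not assets
            file_count += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return file_count, total_bytes


# Self-contained demonstration on a temporary folder with invented files.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "project01"))
    samples = {
        "image001.tif": 5000,
        os.path.join("project01", "image002.tif"): 7000,
        os.path.join("project01", ".BridgeSort"): 120,
        "finding_aid.pdf": 900,
    }
    for rel_path, size in samples.items():
        with open(os.path.join(root, rel_path), "wb") as f:
            f.write(b"\0" * size)

    # Equivalent of "right click - Properties": everything is counted.
    everything = count_folder(root)
    # Same folder with the non-asset files excluded.
    assets_only = count_folder(
        root, skip_names={".BridgeSort", "finding_aid.pdf"}
    )

print(everything, assets_only)
```

Like DROID, a script of this kind reads every directory entry, so it may still be too demanding for large collections on a fragile server or a limited network.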

Case example: We have 1321 CDs in the combined Audio-Visual asset group. We know the maximum capacity of each CD is 700 MB, so the maximum digital size of the CD contents is approximately 1 TB. Inventorying the contents of each CD would be impractical. These numbers were added in the Audio-Visual asset group column of the inventory sheet with a note. Since these estimates would skew the overall inventory totals, they were added to the “A count of each file is not possible” section of the summary along with other assets where only approximate numbers could be gathered. However, we now know that about 1 TB would be needed to transfer the data to a server.
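The arithmetic in this case example, and the practice of keeping estimates apart from detailed counts, can be written out as follows. This Python sketch is illustrative only. It uses decimal units (1 GB = 1000 MB) as an assumption, takes the CD figures from the case example and the AV server figures from Table 2, and the names `InventoryLine` and `max_capacity_gb` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InventoryLine:
    label: str
    size_gb: float
    file_count: Optional[int]  # None when a count of each file is not possible


def max_capacity_gb(carrier_count, capacity_mb):
    """Upper bound on content size, assuming 1 GB = 1000 MB."""
    return carrier_count * capacity_mb / 1000


# 1321 CDs x 700 MB = 924,700 MB, i.e. about 925 GB or roughly 1 TB.
cd_estimate_gb = max_capacity_gb(1321, 700)

lines = [
    InventoryLine("AV Server/HDD", 31179, 253376),        # counted in detail
    InventoryLine("AV CDs (maximum)", cd_estimate_gb, None),  # estimate only
]

# Keep the two kinds of numbers in separate totals so that estimates
# do not skew the totals for files that were counted in detail.
counted_gb = sum(l.size_gb for l in lines if l.file_count is not None)
estimated_gb = sum(l.size_gb for l in lines if l.file_count is None)

print(f"Counted in detail: {counted_gb} GB")
print(f"Estimated maximum: {estimated_gb} GB")
```

The estimate is an upper bound: the CDs are unlikely to be full, so the space actually needed on a server may be smaller.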

Inventory: Survey

Just as format variation presented a significant challenge while conducting the inventory, the application of industry practices by the diverse staff responsible for individual asset groups presented challenges in the survey section. The complexity of industry standards applied to different formats and collections, combined with the diversity of roles of individual staff members, means that survey questions can have varying levels of relevance and applicability, can be interpreted differently and can be answered with different language and terminology. The differences between the standards and practices of a Library collection, an AV collection and a New Media collection are great. Consequently, a survey question may not be applicable to certain collections. Although one staff member was appointed to coordinate the survey, it still required considerable attention to the variations in collection formats and to the interpretations of responding staff.

The effects of collection standards and practices variation: Format-specific standards and practices

Although there are common archival standards and practices, specialized collections require specialized standards and practices. This is true for both the technical aspects of ingesting, managing, preserving and accessing digital content as well as the intellectual management aspects including cataloguing and metadata standards. A digitized PDF library publication, a born-digital film and an archived website are all digital objects which share some common standards and practices but also require collection-specific standards and practices.

Case example: Access to a Library collection is generally very open—staff and the public can search and pull assets—whereas Audio-Visual collections require an archivist to help search, pull and often transfer media to an access format. Differences such as this had to be taken into account when discussing the survey questions on access, frequency of access and security. The same two asset groups required some discussion for the question “Estimated years required to preserve” because library assets are generally commercially available and CMH’s audio-visual assets are generally created in-house or collected for uniqueness.

The effects of collection standards and practices variation: Evolving standards and practices

Archival standards and practices over the past 35 years have evolved at a very fast pace as they try to accommodate the new information mediums that are increasing in both volume and diversity. The first digital assets in CMH’s collection appeared around the early 1980s but didn’t become common until the mid to late 1990s. Additionally, the software used to manage the collections evolved. To capture accurate information, answers to the survey questions were sometimes split to address certain situations within an asset group that were not representative of the whole asset group.

Case example: The question “How are file names constructed for digital assets in this group?” was difficult to answer for the Photos asset group because different physical carriers have dictated how file names could be constructed. For example, Kodak Photo CDs only allowed for sequential numbering, but once archival CDs came into practice, the image control number was used for the file name.

The effects of collection standards and practices variation: Project initiatives and special collections

This challenge presents itself in projects or collections where segments of collections can be considered different or distinct enough that answering a survey question representing the whole asset group would not provide an adequate picture of the collection.

Case example: The answer to the “Ease of Replacement” section for digitized 35mm slides is much different from the answer for born-digital images. Likewise, digitization projects performed on parts of asset groups can result in parts of collections that have better organization, structure and documentation, which skews the answers to the “Directory Structures” and “File Naming” sections.

The effect of staff diversity

To balance the effects that staff diversity could have on the survey, one staff member was appointed to conduct the survey. By doing this, the survey questions could be asked verbatim, but the intent of each question and how it applies to the collection could also be discussed. This method worked well. It helped ensure that the intent of the question was understood, that the question was answered with the other collections in mind and that the vocabulary was consistent.

To help ensure that survey questions were answered from a common perspective, examples from other asset groups were provided. This had an unexpected yet welcome benefit of informing staff responsible for different collections about the standards, practices, techniques and tools being employed elsewhere in the archives. Additionally, different staff perspectives raised questions and pointed out differences in collection standards and practices not previously realized. These two benefits not only helped identify where practices could be putting collections at risk, but also helped make staff aware of where standards, practices, procedures and workflows could be improved and streamlined.

Case example: Terminology was partly distinctive to each collection. For example, terms like “accession,” “acquisition” and “lot” are used differently depending on the asset group, as are “control number,” “catalogue number” and “shelf mark,” which are all used to describe unique identifiers in different asset groups. Likewise, in the Library collection, if an item cannot be searched and pulled by the client, it is not accessible; however, in the AV archives, items are often searched for and pulled by an archivist and are considered accessible.

Table 2: completed Digital Preservation Inventory Template for the CMH
Asset Group	Number of Items - Files	File Space	File Formats	Storage - Carriers	Notes
Photos	730,000 images	14,051 GB	TIFF, JPEG (PCD)	Server, CD (PCD)	PCD format is noted but has already been migrated to TIFF
Library	641 publications (237,000 files)	820 GB	PDF/A, TIFF, JPEG	Server, HDD	CMH digitized publications
Library (continued)	910 physical carriers (commercially available publications on physical carriers)	2010 GB max	N/A	CD-R, DVD-R, Audio CD	These items are commercially available publications
Audio Physical	1453 physical carriers	2140 GB max	N/A	DAT, CD-R, MiniDisc	N/A
Visual Physical	1504 physical carriers	6570 GB max	N/A	DVD-R, MiniDV, DVCPRO, DVCAM, Digital Betacam, IMX, HDCAM	N/A
AV Server/ HDD	253,376 files	31,179 GB	Audio: WAV, MP3 Visual: MP4, AVI, MOV, MPEG, WMV	Server, HDD	The Audio and Visual collections are stored on the server as one
Textual	235,960 files	370 GB	Unknown due to network fragility	Server	The CMH collection is being transferred off an older, fragile server and could not be itemized at the time of the count
Textual (continued)	2135 physical carriers	2518 GB	N/A	CD-R, DVD-R, 3.5 disk, ZIP, 5.25 disk	N/A
New Media	8 webpages (61,638 files)	3.16 GB	PHP, HTML, PM, HTM are our most numerous file formats.	Server	N/A
Corporate Records	N/A	N/A	N/A	Likely CD-R, DVD-R, 3.5 disk	N/A

Results of the digital collection inventory

As stated, the two objectives of this project were to grasp the scope of the collection and identify areas of risk.

Collection scope

Due to size limitations, only a condensed summary of the collection scope can be provided here. To explore the collection more, the inventory itself must be consulted.

Some statistics for “unprocessed material” were left out where the intention is never to accession the material, for instance if the material had already been transferred to preservation formats, or if it was not selected for accession and was to be destroyed. Example: deposited CDs of JPEGs or RAW images are transferred to TIFF and migrated to the server, after which the CD is put on a retention schedule and placed in a cabinet in the event that originals are required in the near future.

Areas of risk

The following are identified as potential areas of risk:

Photos
- The “unprocessed” collections pose a risk of loss due to corruption and obsolescence of both physical assets and file formats.
- The aging Kodak PCD collection should be re-evaluated as new tools would allow for better file format migrations than those previously performed.
Library
- Commercial products (CD-ROMs, etc.) pose a risk as they often require obsolete software and operating systems to be accessed.
Audio-Visual
- Legacy duplicates, versions, obsolete formats, inconsistent numbering and organization in the past pose the greatest risk to the collection as they hamper management of the assets. (Numbering convention and collection/folder structure instituted in 2016.)
- Large video collection stored on HDDs has no backup (will be transferred to server in 2016).
- Physical carriers pose a risk of obsolescence, and the digital size of their contents is unknown.
Textual
- Many obsolete physical carriers dispersed through collection, contents unknown.
New Media
- Collection standards and practices are not formalized for the collection yet (it is a very young collection).
Corporate Records
- Has limited digital capacity.
General
- The absence of a digital asset management system is putting the digital assets at risk because it hinders preservation practices and access. (CMH will be acquiring a DAM in 2016.)
- Multiple metadata schemas, both standardized (MARC 21, Dublin Core) and unstandardized (created in-house), hinder cross-platform sharing, slow workflows and make discovery and access complex. (New DAM has prompted CMH to map and streamline metadata.)
- Dispersed collection with no central repository makes preservation, management and access difficult. (CMH will be transferring all external hard drive–stored assets to the server and consolidating all server-stored digital assets in 2016. Digital collections repositories for ingest, storage and access will be created for each collection.)
- For diverse physical carriers, inconsistent/historical practices have the potential to carry forward. (CMH is reviewing older practices and procedures and adjusting where necessary. This includes recording past practices so they are known in the future.)
- Not all collections have formal ingest, management, preservation or access standards and procedures. (CMH is reviewing practices and procedures and adjusting where necessary to fill gaps.)
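
The metadata mapping mentioned in the list above can be pictured as a simple crosswalk. The following is a minimal sketch only, not CMH's actual mapping: the MARC 21 tags shown are a commonly used subset, while the in-house field names and the function name `to_dublin_core` are hypothetical and invented for illustration.

```python
# Illustrative crosswalks from source fields to Dublin Core elements.
# ASSUMPTION: a small, commonly used subset of MARC 21 tags.
MARC21_TO_DC = {
    "245$a": "title",    # title statement
    "100$a": "creator",  # main entry, personal name
    "260$c": "date",     # date of publication
    "650$a": "subject",  # topical subject heading
}

# ASSUMPTION: these in-house field names are invented for this sketch.
IN_HOUSE_TO_DC = {
    "Caption": "title",
    "Photographer": "creator",
    "Date Taken": "date",
}


def to_dublin_core(record, crosswalk):
    """Map a flat record onto Dublin Core elements.

    Returns (mapped, unmapped). Fields with no crosswalk entry are kept
    in `unmapped` so that nothing is silently dropped during streamlining.
    """
    mapped = {}
    unmapped = {}
    for field, value in record.items():
        element = crosswalk.get(field)
        if element is None:
            unmapped[field] = value
        else:
            # Dublin Core elements are repeatable, so values are collected in lists.
            mapped.setdefault(element, []).append(value)
    return mapped, unmapped
```

Keeping the unmapped fields visible, rather than discarding them, makes it easier to see which in-house practices still need a decision before records are shared across platforms.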

Challenges identified during the inventory process

The following challenges were identified during this inventory process:

A large IT infrastructure project, including new server equipment, large data transfers and IT workflow changes, prevented larger collections (digital size and/or quantity) from being itemized using DROID at the time of the inventory.
At first, some staff had concerns with the monumental task of trying to inventory every object in the collection. However, we explained that one of the main objectives was to identify areas of risk and that a collection that could not be counted due to its size, dispersal, obsolescence, etc. would be identified as an area of risk.

Case example: The textual archives have many disks, diskettes, CDs and DVDs mixed throughout a large collection. Opening each box and each folder, counting the media, then opening the media to count the files and inventory the formats would be a massive project on its own.

Inventorying files and digital size was very difficult on physical carriers. In many cases, the file count was impossible and the digital size was calculated based on the maximum capacity of the carrier. However, many variables can affect this, including how many files are stored on the carrier, the format of those files and, in the case of AV, the variables of carrier/product version, length, compression, etc.
The distinction between processed and unprocessed material was required because it could significantly affect the count and cause issues with the survey. In some cases, the line between processed and unprocessed was not clear.
For the “File Format/File Type” section, it was very difficult for us to answer the questions:
- Name and versions of software
  - We could guess the most predominant software used to create and modify the files, but the software versions would be impossible to determine.
- Still readable?
  - It would be impossible for us to find out for all of the formats and all of the files.
The website and new media presented a challenge for the “File Format/File Type” section as websites are aggregations of many source files.
The applicability and relevance of some survey questions were lost on some asset groups. This was especially true when the asset groups were split into three main categories: archival (Photos, AV, Textual), library and corporate records. There are many differences in the standards and practices between those categories, so the relevance and applicability of questions were not consistent.
“Access Permission” and “Frequency of Access” were the two most difficult survey sections to complete.
- “Access Permission” asks: Who should have access to these digital assets? As a national institution, the answer should be “everybody.” However, it is not that simple. This question led to more questions such as:
  - Is it asking about collections management access, or public reference access or internal staff access?
  - Direct access or third party access, i.e. the client has access to the catalogue records for discovery, and we provide the physical access?
  - Access to a master/full resolution version or an access copy?
  - Access through a database or direct access to the server or physical carrier?
- “Frequency of Access” asks: How often are digital assets within this group accessed? This often raised the question: direct access by archives staff for management purposes or indirect access through archives staff or a database system?

For the “File Formats” section, it was difficult to answer questions for some asset groups because in many cases only a portion of the asset group could be accurately inventoried. If answering the questions literally, there would have been many “unknown” entries. Instead the numbers often represent what is known with a note about what is unknown.
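
The capacity-based estimate described above, where the digital size of a physical carrier is taken as its maximum capacity and unknowns are noted rather than guessed, can be sketched as follows. This is an illustrative sketch, not the method CMH used: the nominal capacities are assumed typical values (real carriers vary by product version), and the function names are hypothetical.

```python
# ASSUMPTION: nominal maximum capacities in bytes; actual carriers vary.
NOMINAL_CAPACITY_BYTES = {
    "3.5 disk": 1_474_560,      # high-density 1.44 MB diskette
    "CD-R": 700 * 1024**2,      # 700 MB disc
    "DVD-R": 4_700_000_000,     # single-layer 4.7 GB disc
}


def estimate_upper_bound(carrier_counts, capacities=NOMINAL_CAPACITY_BYTES):
    """Return (total_bytes, unknown_carriers) for a count of carriers.

    The total is an upper bound: each carrier is assumed to be full.
    Carrier types with no known capacity are listed separately so the
    inventory records what is unknown instead of guessing.
    """
    total = 0
    unknown = []
    for carrier, count in carrier_counts.items():
        capacity = capacities.get(carrier)
        if capacity is None:
            unknown.append(carrier)
            continue
        total += capacity * count
    return total, sorted(unknown)


def to_gb(size_bytes):
    """Convert bytes to gigabytes, using 1 GB = 2**30 bytes."""
    return size_bytes / 2**30
```

For example, `estimate_upper_bound({"CD-R": 2, "MiniDV": 5})` sums the two CD-Rs at full capacity and reports `MiniDV` as unknown, since tape length and compression would be needed to size it.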

Conclusion

CMH’s digital collections inventory exercise has provided a bird’s eye view of the collection. The inventory document is already being used as a decision-making tool that is allowing CMH to manage the collection holistically, while also focusing on specific areas of risk. Not only does CMH now have a good grasp of the collections in its care, but through the survey questions it also has better information on how the collection is managed.

The tool provided by CHIN guided CMH through the process with ease. CMH made only minor additions to the template and would consider the template usable “out of the box.” Some criteria were added to reflect the nature of CMH’s large and varied digital collection. Chiefly, we added criteria to capture information on the server spaces in the same way the tool asks for details on digital physical carriers. This directly reflects CMH’s situation since, as a large institution, it stores its digital assets primarily on servers. In other institutions this could be different: data could be stored on RAIDs, in data centres or in the cloud.

Moving forward, CMH plans to maintain the inventory with scheduled updates approximately every two to three years. There is now a plan to use parts of the template to assess large quantities of digital assets in the institution that are not under the care of the IM section but are expected to one day be deposited into the archives. Additionally, CMH plans to use CHIN’s Digital Preservation Toolkit to guide the process of developing a digital preservation policy and accompanying procedures.

Glossary

.csv: A plain text file format in which data elements are separated by commas.
3.5 disk: A 3 1/2 inch diskette is a removable and portable physical electronic storage medium, popular in the 1990s.
5.25: A 5 1/4 inch floppy diskette is a removable, flexible magnetic diskette used to load and store data and applications to and from a computer. It was commonly used on personal computers in the early 1980s.
access copy: A copy of a digital asset in a format that allows for easy distribution and/or interoperability. It is a user-friendly copy of an original, master or preservation format, which in many cases is difficult to distribute or access because of its size or industry-specific format.
access format: A file format into which a digital asset is put for easy viewing, or access. This is in contrast to a preservation format, which may involve compression, the bundling of information, or the inclusion of preservation metadata.
AVI: Audio Video Interleave is a multimedia file format introduced by Microsoft that allows the storage of audio and video content in a single file format.
CD-R: A compact disk-recordable is a compact disk on which information can be written (recorded) once only.
DAM: Digital asset management includes the tasks and decisions surrounding the ingestion, annotation, cataloguing, storage, retrieval and distribution of digital assets.
DIR: An MS-DOS operating system command. It lists the names of any file or sub-directory found within the current working directory.
DROID: A file format identification software developed by The National Archives, UK.
DVCAM: A digital video camera storage format developed by Sony.
DVCPRO: A digital video camera storage format developed by Panasonic.
DVD: A digital video disc/digital versatile disc is a digital optical disk storage format co-developed by Philips, Sony, Toshiba and Panasonic.
DVD-R: A digital video disc/digital versatile disc - recordable is a blank DVD disk onto which data, including music, movies or any other digital format, can be permanently recorded.
Excel: A spreadsheet software developed by Microsoft.
GB: A gigabyte is 2^{30} (or approximately one billion) bytes of information.
HDCAM: A high-definition digital video camera storage format developed by Sony.
HDD: A hard disk drive is a storage device used for storing and retrieving information using one or more rigid rapidly rotating disks (or platters) coated with magnetic material. Sometimes referred to as a hard drive or a fixed disk.
HTML: Hypertext Markup Language is the standard markup language used to create web pages.
IMX: Integrated MultiMedia Exchange is a digital video camera storage format developed by Sony.
JPEG: An image file format (and file extension name) developed by the Joint Photographic Experts Group.
KE EMu: A collection management system (developed predominantly for museums) by KE Software.
MARC 21: A set of codes and content designators defined for encoding machine-readable records.
MiniDisc: A small-format magneto-optical disk developed by Sony Corporation (no longer manufactured).
MiniDV: A camcorder recording media format for the storage of digital information on magnetic tape.
MOV: A movie file extension. It denotes the QuickTime file format developed by Apple, on which the MPEG-4 file format was based.
MP3: MPEG-1 or MPEG-2 Audio Layer III is the audio component of a digital file format developed by the Moving Picture Experts Group.
MPEG: Moving Picture Experts Group, unless otherwise specified, refers to MPEG-1 or MPEG-2 formats for audio and video content.
MS-DOS: Microsoft's disk operating system, the precursor to MS-Windows. DOS commands can still be accessed via the Windows command prompt (DOS shell).
MP4: A multimedia file format developed by the Moving Picture Experts Group that allows the storage of content such as audio, video, still images and subtitles in a single file.
PCD: Photo CD is an image file format developed by Kodak for storing scanned photographs on compact discs. The PCD file format is commonly encountered where 2D images have been produced using Kodak scanners.
PDF/A: Portable Document Format/archival is an open standard format suitable for archiving text and images. The PDF/A format was established by the International Organization for Standardization and based on the originally proprietary Adobe PDF format.
PHP: Hypertext Preprocessor is a scripting language whose code can be embedded into a standard web page and that runs on the web server, not in a client's browser.
physical carrier: Any physical media on which digital information is stored (for instance, a CD, a hard drive or a USB stick).
PRONOM: An online file format registry developed and maintained by The National Archives, UK: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx.
properties: In the context of "right click," the selection of the "Properties" option from a pull-down menu that appears when right clicking on a file name listed in a Windows operating system.
RAW: RAW image data refers to a variety of proprietary image formats unique to the image sensor chips found in digital cameras.
right click: The action of clicking on the right button of a mouse connected to a Windows operating system computer.
Vubis: Library software developed at Vrije Universiteit Brussel.
WAV: Waveform Audio file format is a Microsoft and IBM file format for storing audio data on a personal computer.
WMV: Windows media video refers to a video file format or the codec (coder-decoder) software used to store or read these formats.
ZIP: A disk drive that stores data on high-capacity removable magnetic disks, often used for data backup.

Page details

2017-08-27
