
Activity 3: Gathering of functional and non-functional requirements for a community format registry

From GDFR Wiki



Lead: Stephen Abrams (CDL), Steve Knight (NLNZ)

Additional Participants: Open to all

Deadline: January 15, 2009

Subtask 1: The lead group will look at the GDFR wiki page that has been set up for this activity and will determine if any changes to the wiki page are needed to support this activity. Any wiki page changes can be made directly by this group or can be requested from Andrea Goethals. This group will determine how best to solicit participation, for example through targeted invitations or more openly through an email list.

Subtask 2: Participants will compile a set of functional and non-functional requirements for the format registry. At a minimum participants should document the name of their institution and rate the importance of the requirement to their institution.

Deliverables: Completed compilation of functional and non-functional requirements for the format registry on the GDFR wiki.


Existing GDFR Use Cases and Functional Requirements:

2003
The original use cases for GDFR were submitted in 2003 by the Bibliothèque nationale de France, Harvard University, JISC, JSTOR, MIT, NYU, OCLC, UK National Archives, and the University of Pennsylvania. The full text of these use cases is on this website at http://gdfr.info/docs.html#usecases_2003.

A list of services that GDFR should support was compiled by Mackenzie Smith (MIT) based on the above use cases. This service model is documented in: "Global Format Registry Service Architecture". MS; March 11, 2003 which is on this website at http://gdfr.info/docs.html#service_model

2006-7
OCLC developed a set of use cases for GDFR in 2006-7. They are contained within this document: "GDFR Analysis Model", v. 2.0; October 1, 2007, which is on this website at http://gdfr.info/docs.html#analysis_model. This document is the GDFR functional requirements specification.


2008/9 Format Registry Use Cases and Requirements:

This section will be used to compose a complete set of functional and non-functional requirements / use cases for a format registry.

Introduction

An understanding of digital formats is fundamental to the effective long-term curation and preservation of digital assets. To date, two large-scale efforts – the PRONOM project in the UK, and the GDFR (Global Digital Format Registry) project in the US – have been focused on the need to collect, manage, and preserve significant representation information about digital formats. However, the stakeholder communities addressed by these projects have expressed concern over the necessity to provide intellectual, technical, administrative, and financial support for two independent, and undoubtedly duplicative, registries. An informal group of international library, archive, and preservation institutions has agreed to support the efforts of an ad hoc Format Registry Working Group (FRWG) to determine the best way forward in establishing a single common Community Format Registry (CFR).

CFR Functional [FR] and Non-Functional [NFR] Requirements

Note: In the following requirements the terms “will”, “must”, “shall”, and their grammatical cognates indicate a requirement; the terms “should”, “ought”, and their cognates indicate a recommendation; and the terms “can”, “may”, and their cognates indicate an option.

The requirements are presented in a hierarchical fashion, proceeding from the general to the specific in order to permit evaluation and acceptance/rejection at an arbitrary level of granularity.

Requirement priorities can be rated on the following scale: Very important, Important, Average, Slightly important, Not important.


0. Definition. A format is a class of digital object whose members all share a common set of syntactic and semantic rules controlling the mapping from an abstract information model [ISO 14721] to serialized byte streams, and in many useful instances, back again from serialized bytes to an abstract model. [NFR]

1. The CFR will manage a controlled namespace for the unambiguous persistent public identification of digital formats. [FR]

2. The CFR will support the binding of various typed representation information [ISO 14721] to format identifiers. [FR]

3. The CFR will follow evolving best practices for the secure, sustainable management of format representation information. [NFR]
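
Requirements 1 and 2 can be illustrated with a minimal in-memory sketch. Everything in it is hypothetical: the `cfr:format/` namespace prefix, the class and method names, and the type labels are assumptions for illustration only, not a proposed CFR design.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RepresentationInfo:
    """One typed piece of representation information (requirement 2)."""
    info_type: str   # hypothetical type label, e.g. "specification"
    value: str


@dataclass
class FormatRegistry:
    """Toy in-memory registry; the namespace prefix is an assumption."""
    namespace: str = "cfr:format/"
    _records: dict = field(default_factory=dict)
    _next_id: int = 1

    def register(self, name: str) -> str:
        """Mint a new identifier that is never reused (requirement 1)."""
        identifier = f"{self.namespace}{self._next_id}"
        self._next_id += 1
        self._records[identifier] = {"name": name, "info": []}
        return identifier

    def bind(self, identifier: str, info: RepresentationInfo) -> None:
        """Attach typed representation information to an existing identifier."""
        if identifier not in self._records:
            raise KeyError(f"unknown format identifier: {identifier}")
        self._records[identifier]["info"].append(info)

    def lookup(self, identifier: str, info_type: str) -> list:
        """Return all bound values of the given type for one format."""
        return [i.value for i in self._records[identifier]["info"]
                if i.info_type == info_type]
```

A real registry would also need persistence, versioning and access control (requirement 3), none of which this sketch attempts.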



Other Requirements
Please indicate below each requirement the importance of it to your institution using these words (in decreasing order of importance):
Very Important, Important, Average, Slightly Important, Not Important

Use Cases

Use Case Source: Bibliothèque nationale de France

Use Case ID: PRES_1

Description: Assemble Information Representation metadata

Summary: The goal of this use case is to allow a preservation expert to add information representation on formats (using an external format registry) so that this information is kept in perpetuity and can be referenced in the SPAR system itself.

Actors: Preservation expert, Format registry, Ingest module

Assumptions: The information representation metadata comes in the form of a SIP. The data-object is made of the description of the information representation.

Pre-conditions: Request stored in historical database.

Primary Functional Path:

  1. Actor authenticates with the system.
  2. Elaboration of a SIP with the information coming from the format registry
  3. Ingest of this SIP in the system (ING_2).
  4. Receipt of the external identifier of this information representation for later use (policy, process, ...).

Primary Result: AIP of the information representation metadata

Post-conditions:

Exceptional paths:

  1. The SIP is not valid according to the QA process: a non-conforming receipt is sent to the actor.
  2. Failure in the storage transaction: a failure receipt is sent to the actor and a trap is sent to the administration.
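
The primary and exceptional paths of PRES_1 might be sketched as follows. This is an illustration under stated assumptions: `Receipt`, `StorageError`, `ingest_sip` and the callables passed to it are hypothetical names, the SIP is reduced to a plain dictionary, and authentication (step 1) is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Receipt:
    status: str                       # "accepted", "non-conforming" or "failure"
    external_id: Optional[str] = None


class StorageError(Exception):
    """Raised by the hypothetical storage layer when a transaction fails."""


def ingest_sip(sip: dict,
               is_valid: Callable[[dict], bool],
               store: Callable[[dict], str],
               send_trap: Callable[[str], None]) -> Receipt:
    """Ingest one SIP and return the receipt sent back to the actor."""
    if not is_valid(sip):                      # exceptional path 1
        return Receipt("non-conforming")
    try:
        external_id = store(sip)               # primary path, step 3
    except StorageError as exc:                # exceptional path 2
        send_trap(f"storage failure: {exc}")
        return Receipt("failure")
    return Receipt("accepted", external_id)    # primary path, step 4
```

Passing the QA check, the storage call and the trap channel in as functions keeps the sketch independent of any particular repository implementation.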

Use Case Source: Bibliothèque nationale de France

Use Case ID: PRES_4

Description: Harvest Information Representation metadata

Summary: This use case allows information representation metadata to be collected from a format registry on a regular basis.

Actors: Format registry, Ingest module

Assumptions: Frequency of the collection

Pre-conditions: The AIP of the information representation already exists.

Primary Functional Path:

  1. Connect to the format registry
  2. Collect the new information representation metadata from the format registry.
  3. Look up the corresponding AIP
  4. For each new format with a corresponding modified,
  5. Build a SIP
  6. Submit the updated SIP (ING_2)

Primary Result: Information representation AIP updated

Post-conditions:

Exceptional paths:

  1. SIP refused: send a trap to the administration.
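
The harvesting loop of PRES_4 could look roughly like this. The function and parameter names and the record layout are hypothetical. Because the pre-condition says the AIP already exists, the sketch assumes that records for formats without an AIP are simply skipped; the use case itself does not say what should happen to them.

```python
from typing import Callable, Iterable


def harvest(new_records: Iterable[dict],
            find_aip: Callable[[str], object],
            submit_sip: Callable[[dict], bool],
            send_trap: Callable[[str], None]) -> int:
    """Submit an updated SIP for each harvested record that has an AIP.

    Returns the number of SIPs accepted by ingest.
    """
    accepted = 0
    for record in new_records:                     # step 2: collected metadata
        aip = find_aip(record["format_id"])        # step 3: look up the AIP
        if aip is None:
            continue                               # assumption: skip formats with no AIP
        sip = {"format_id": record["format_id"],   # step 5: build a SIP
               "metadata": record["metadata"]}
        if submit_sip(sip):                        # step 6: submit (ING_2)
            accepted += 1
        else:                                      # exceptional path 1
            send_trap(f"SIP refused for {record['format_id']}")
    return accepted
```

Connecting to the registry (step 1) and scheduling the harvest at the assumed frequency are outside the scope of the sketch.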

Use Case Source: Bibliothèque nationale de France

Use Case ID: ADM_4

Description: Assemble process metadata

Summary: The process metadata are stored in the archive as AIPs, so they are imported like any other SIP. The goal of this use case is to allow an IT staff member to add information on available processes so that this information can be made persistent and can be referred to in the SPAR system itself.

Actors: IT staff, Ingest module

Assumptions: The process metadata comes in SIP format. The data object can be the specification of the process or even the source code of the program that realizes this process.

Pre-conditions: Request is stored in historical database.

Primary Functional Path:

  1. Actor authenticates with the system.
  2. Receive the process metadata in SIP format.
  3. Import this SIP in the system (ING_2)
  4. Send receipt of the import to the actor: the receipt contains the external identifier of the process for later reference or use (transformation packages, AIP generation, DIP generation, …).
  5. Store record of the operation in historical database

Primary Result: AIP of a particular process

Post-conditions:

Exceptional paths:

  1. The SIP is not valid according to the QA for a process: a non-conforming receipt is delivered to the actor.
  2. Failure in the storage transaction: a failure receipt is delivered and a trap is sent to the Administration

National Library of New Zealand Thoughts

Our understanding of what the CFR will need to do is informed by our current thinking on the concept of format obsolescence and how to analyse the risk of obsoleteness coming to pass. At the National Library of New Zealand, we have written some documents on this. One is an informal Risk discussion paper that outlines some of our current world view.

Some highlights of the document:

  1. NLNZ is obliged to take in content regardless of its conformance to format standards (i.e. if it fails DROID and JHOVE, it still has to come in).
  2. Our initial risk assessment is whether we as an institution can render it or not. That is, is the format associated with an application within our tech profile?
  3. Characteristics within a format (that is, certain compression methods, colourspaces, etc) can prohibit an application from rendering the content.
  4. To achieve this, there should be a Format Library, an Application Library and a Characteristics Library, all interrelated, telling us what we can and cannot render successfully.
  5. Formats, Applications and Characteristics will all have 'sustainability information' associated with them -- is there a vendor support end date, how stable is it, etc. This information is used to assess a) when we should move content from a certain format; and b) what format/application/characteristic profile we could move the content into.
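
The interrelation of the Format, Application and Characteristics libraries described in points 2 to 4 can be sketched as a simple lookup. The `Application` class, the `renderers` function and the example application names are all hypothetical; sustainability information (point 5) is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    name: str
    formats: set = field(default_factory=set)        # formats it can open
    unsupported: set = field(default_factory=set)    # characteristics it cannot handle


def renderers(format_id: str, characteristics: set, tech_profile: list) -> list:
    """Names of applications in the profile that can render the object."""
    return [app.name for app in tech_profile
            if format_id in app.formats
            and not (characteristics & app.unsupported)]


# Made-up technical profile with two viewers.
profile = [
    Application("viewer-a", {"tiff"}, {"cielab"}),
    Application("viewer-b", {"tiff", "jpeg2000"}),
]
print(renderers("tiff", {"cielab"}, profile))
```

An empty result would correspond to the initial risk assessment in point 2: no application in the institution's technical profile can render the object.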

Comment on section above:

Point 0 above: "Definition. A format is a class of digital object whose members all share a common set of syntactic and semantic rules controlling the mapping from an abstract information model [ISO 14721] to serialized byte streams, and in many useful instances, back again from serialized bytes to an abstract model. [NFR]"

This statement describes formats (and the files that represent their encodings) from the view of the format originator or creator. It does not appear to account for the fact that application developers (file encoders) interpret these syntactic and semantic rules and frequently, either wittingly or unwittingly, alter them to suit their own purposes when writing byte streams.

So, to a repository preservation manager, while the definition of the format in its idealized state is of course important, equally important are the quirks regarding how different applications actually encode files. We've got a few RTFs that don't fully comply with the spec for the format, but we can happily open them in a number of applications. They are not versions of the format, rather badly encoded files where a number of applications are "forgiving" enough to still open them properly. It seems that we would want to record this information somewhere like the GDFR and gather together the community's knowledge in this area as well.

This relates to the NLNZ validation routine and our institutional requirement that we accept files even if they don't conform with a format standard. From our point of view it would be desirable to have JHOVE(2) capture what technical metadata it can even if a file does not meet the requirements around "validity" or "well-formedness".

CFR Scope

In terms of CFR requirements, the actual scope has to be defined first (apologies if this has already been defined in previous meetings). How much/little of the list below is envisaged as being within the remit of the CFR? Will it:

  1. store information that can be used to help identify how files are encoded (their format & version)?
  2. store information that can be used to understand the ‘significant properties’ of objects (we prefer the term ‘characteristics’)?
  3. describe the creating application of the content (and the related computing environment (with any hardware or software dependencies))?
  4. list applications that can render formats and their variations, taking account of characteristics that can hinder this?
  5. supply risk analysis of formats/applications/characteristics?
  6. describe appropriate (and inappropriate) preservation strategies?
  7. hold information to be used in the evaluation of preservation actions?

First and foremost we see the primary objective as being point 1. The unequivocal description of an object in terms of its format and version is the ideal goal. To get to any of the other parts (risk analysis, preservation strategies) this information has to be complete, reliable and absolutely comprehensive.

Our point of view is that all major characteristics of an object must be known before any sort of action can be taken on it (for example, a TIFF image has CIELab encoding -- run this blindly through a TIFF-JPEG2000 converter and you get a lovely bright green result). While we understand that such depth of detail (point 2) may be a step too far for the CFR, trustworthy information at the format and version level will help us fill in half the picture (no mean feat). [We are working on this characteristics aspect with great vigour.]

In addition, points 3 and 4 could be achievable too. This means we would have a baseline of rendering information for the content (even ignoring for the moment the relationship with characteristics of the objects that could stop it being rendered 'properly').

In terms of risk, we're nervous about community consensus on this. Current models seem to favour numeric analysis, with certain results then triggering planning or actions. We do not see the underlying information being comprehensive enough, nor do we see consistent interpretation of this information into numerals (for example, how exactly do you put 'Market Share' onto a scale of 1-5?). At the moment, we believe that this should be tackled on an institutional level. This is not to say that we do not think this information should be shared/discussed/argued. It should be, and as widely as possible. However, the CFR may not be the place to do that while such analysis is still in its infancy.
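
The interpretation problem can be made concrete with a small sketch. The two sets of cut points and the market share figure are invented purely for illustration and do not come from any real risk model.

```python
from bisect import bisect_right


def to_scale(value: float, thresholds: list) -> int:
    """Map a value to 1..len(thresholds)+1 using ascending cut points."""
    return bisect_right(thresholds, value) + 1


# Two made-up institutional binnings of market share (percent) onto 1-5.
INSTITUTION_A = [5, 15, 30, 50]
INSTITUTION_B = [1, 5, 10, 25]

share = 12.0  # hypothetical market share, not real data
score_a = to_scale(share, INSTITUTION_A)
score_b = to_scale(share, INSTITUTION_B)
print(score_a, score_b)  # prints "2 4"
```

The same underlying figure yields different scores under the two binnings, so any action triggered by a numeric threshold would fire at one institution and not at the other.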

Point 6 is the community dream: towards full automation of mature, tried and tested preservation actions. Could work on this part stifle the more immediate concerns of point 1? If possible, placeholders could be made in the CFR structure for this.

Point 7 is related to characteristics (significant properties) and could be a step too far at this point with this level of effort. We are working on this though (as are others).

Retrieved from "http://gdfr.info/wiki/index.php/Activity_3:_Gathering_of_functional_and_non-functional_requirements_for_a_community_format_registry"

This page has been accessed 3,812 times. This page was last modified 02:20, 16 January 2009.

