Case Study: Archival Information

SWUI 2009 at ISWC: Sharing Ideas for Complex Problems in User Interaction

Case study to support SWUI 2009 workshop activities and discussion, prepared by Duane Degler, 6/2009.

Summary

This case study illustrates challenges and opportunities facing the archival community amid the rapid growth of large-scale electronic archives and online library collections. This domain provides a rich opportunity for semantic web and linked data researchers, practitioners, and software developers to explore. From the public/researcher user perspective, it offers highly heterogeneous data sources, a wide mix of structured and unstructured data, diverse user populations and goals, requirements for a range of query/search/retrieval strategies, and unique needs for visualizing/rendering both results and individual objects. From the archivist/records manager perspective, it offers a growing tidal wave of data in a wide range of formats that needs to be reviewed, processed, described, tagged, cataloged, shared, and enhanced over time – in an environment where users expect more and more while the available staff may remain static or even decrease.

Increasingly, both web tools and archival management are embedding RDF and linked data capabilities into their architectures, and overlaying standards-based ontologies to supplement older proprietary vocabularies. However, key questions remain:

Why is archival information an important topic for the semantic web community? Ultimately, it is a highly multi-dimensional, contextual, authoritative, and evolving problem space with an extremely diverse set of user needs and expectations that at the same time has to provide a service today and plan for service tens or even hundreds of years hence. That sounds like the kind of complex environment where the rich potential of linked data and the power of the semantic web should play a key role!

The Challenge

Online and electronic archival resources have many interesting and complex aspects with a SWUI component. For example, government archivists face:

Applicable Domains

Research and designs from a range of disciplines are applicable to the challenges described above, including:

Possible Roles for Linked Data and the Semantic Web to Support Interaction

How do users articulate their goals?

In many cases, the user’s experience with an archival information site is not a single-encounter, single-item experience. It is an exploration, often over many sessions. Sometimes that exploration is quite focused (for example, gathering information about a specific relative or family tree, which is focused but not necessarily quick) and sometimes it is quite broad and extensive (for example, analyzing a wide range of documents, reports, and statistics as research for a book on the effect of international trade patterns on particular local economies).

Users need to find all relevant information throughout the archive, while incrementally filtering out things that prove not to be relevant. This exploration and decision-making should be facilitated by the environment, so that the user’s focus is on their goal and not on the process of managing the software.

Findability for heterogeneous data

Searching and browsing archival records is a different experience from searching and browsing other types of information and sites. As a result, interactions may need to differ from those in other searching/browsing environments in order to be effective. The linked data approaches and technologies that are driving the potential of the semantic web are a key part of the solution.

The interesting attributes from a searching and browsing perspective include:

Self-description: carrying “meaning” far into the future

The primary purpose for archiving information is to preserve it for future generations, in a way that it can be accessed, used, and understood by people who may be far removed from the context in which the information was created. The information must not only remain readable by humans and machines as technology evolves over hundreds of years, but must also be interpretable in context and point users to associated records. This means:

Supporting the disambiguation and identity challenge

The modeling and describing of information needs to address ambiguity of what is being sought. Here is a particularly famous example: “John Kennedy” – for which there are many documents, images, and data records for each of the following:

Reflecting the authoritative nature of the archival repository

People who use a national archive expect it to be the authoritative source for the information it holds. The interaction experience between users and the information should help users feel that sense of trust. The data, its various formats and transformations over time, and any overall site architecture for accessing and viewing information should reflect a transparent authority – modeling that authority for users and other systems could benefit greatly from semantic components and architectures.

Archival information is often used outside the Archives’ web presence or physical buildings. Materials are shared with other collections. People use images, documents, audio, and statistical data from archives as raw materials for other works. It is important now, and will be increasingly important as electronic data is available more widely, that the path back to the provenance of that data is sound. This starts with providing clear source provenance data in a standard, machine-readable form, and providing services that are trusted when that provenance is requested by users of other sites and repositories. Provenance itself should be modeled in a way that is easy to interpret and aligns with provenance representations from other information sources.
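As a rough illustration of provenance carried in a machine-readable form, the sketch below serializes a small provenance record to JSON. Everything in it is an assumption made for illustration: the field names, the identifiers, the institution names, and the example.org URI are hypothetical, and a real archive would more likely align with an established provenance vocabulary than with this ad hoc structure.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProvenanceEvent:
    """One step in a record's custody or transformation history (hypothetical shape)."""
    date: str    # ISO 8601 date
    agent: str   # who performed the action
    action: str  # e.g. "transferred", "digitized"


@dataclass
class ProvenanceRecord:
    """Provenance that travels with an item when it is reused elsewhere."""
    item_id: str
    holding_institution: str
    source_uri: str
    history: List[ProvenanceEvent] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() converts nested dataclasses, so the history is included.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Hypothetical example data, not drawn from any real archive.
record = ProvenanceRecord(
    item_id="example-item-001",
    holding_institution="Example National Archives",
    source_uri="https://archives.example.org/items/example-item-001",
    history=[
        ProvenanceEvent("1975-03-01", "Example Agency", "transferred"),
        ProvenanceEvent("2008-11-15", "Example National Archives", "digitized"),
    ],
)

print(record.to_json())
```

Because the record is plain structured data, another site that reuses the item could store it alongside the copy and follow `source_uri` back to the holding institution when a user asks where the item came from.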

Codifying policies and rules for finding, accessing, and using data

There is a range of rules around accessing and using information, depending on the nature of the information and the person (or, in the future, the system) that is seeking it. These rules need to be carried in the system in a way that allows them to be understood, reviewed, audited, communicated, and changed over time. They should participate in the data relationships, rather than be coded into applications. Rules, logic, and approaches to proof are all important, and must be transparent and manageable by responsible users.
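One way to picture rules that live in the data rather than in application code is the minimal sketch below. The rule fields, record classes, roles, and rationale strings are all hypothetical, chosen only to show that a rule expressed as data can be listed, reviewed, and replaced without changing the function that evaluates it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class AccessRule:
    """A declarative access rule (hypothetical fields)."""
    rule_id: str
    record_class: str    # kind of record the rule covers
    requester_role: str  # kind of person or system asking
    decision: str        # "allow" or "deny"
    rationale: str       # human-readable reason, for review and audit


# Hypothetical rule set; in practice this would be loaded from managed data.
RULES: List[AccessRule] = [
    AccessRule("R1", "restricted", "public", "deny",
               "Restricted records are closed to the public."),
    AccessRule("R2", "restricted", "archivist", "allow",
               "Archivists process restricted records."),
    AccessRule("R3", "open", "public", "allow",
               "Open records are available to everyone."),
]


def decide(record_class: str, requester_role: str,
           rules: List[AccessRule]) -> Optional[AccessRule]:
    """Return the first matching rule so callers can show decision and rationale."""
    for rule in rules:
        if rule.record_class == record_class and rule.requester_role == requester_role:
            return rule
    return None  # no rule applies; the caller must choose a default


match = decide("restricted", "public", RULES)
print(match.decision if match else "no rule")
```

Returning the whole rule, not just a yes/no answer, keeps the rationale available for the transparency and audit needs described above.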

In a service-oriented architecture, and in services made available on the web, rules for agents and services also need to be created, managed, reviewed, and changed over time. As the semantic web becomes a more complex ecosystem for agents and services, the rules and management of interactions need to be understood and able to be changed flexibly by responsible users.

Keeping up with evolving descriptions and language

XML metadata structures and thesaurus syntax are widely used throughout the archival community. At the same time, there is increasing use of RDF within metadata, description, and linking architectures, and of OWL vocabularies as part of the formal controlled vocabularies. One reason for this is clear: language and structure are both going to evolve, and modeling rules and relationships must be flexible and orthogonal (not simply hierarchical). Semantic web formats allow for more flexible modeling and maintenance, and the ability to map new vocabularies over time. However, many of the current implementations focus on back-end enabling technologies. Approaches for user interaction that allow this richness and flexibility to be captured and maintained are not yet as evident, but are vitally needed.
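To make the idea of mapping vocabularies over time a little more concrete, here is a small sketch that resolves an older term to its current equivalent by following a chain of "superseded by" links. The terms and the mapping are hypothetical, and plain Python dictionaries stand in for what would more plausibly be expressed as RDF relationships in a real system.

```python
from typing import Dict, Set

# Hypothetical mapping: older term -> term that superseded it.
SUPERSEDED_BY: Dict[str, str] = {
    "aeroplanes": "airplanes",
    "airplanes": "aircraft",
}


def current_term(term: str, mapping: Dict[str, str]) -> str:
    """Follow 'superseded by' links until reaching a term with no successor."""
    seen: Set[str] = set()
    # The 'seen' set guards against accidental cycles in the mapping data.
    while term in mapping and term not in seen:
        seen.add(term)
        term = mapping[term]
    return term


print(current_term("aeroplanes", SUPERSEDED_BY))  # aircraft
print(current_term("maps", SUPERSEDED_BY))        # maps (no mapping, unchanged)
```

Keeping the links as data means a record described with an older term stays findable under the newer one without rewriting the original description.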

Coping with the volume of information

The sheer volume of information that is collected and can be made available is staggering. Traditional mechanisms for finding and displaying information can break down at this scale. And yet it is important to guard against creating silos of data that aren’t flexible enough to accommodate both current and future use. One semantic web approach that may help is to create overlay models that allow information to be managed in segments for particular purposes without compromising the integrity of the whole.

Note also that the information held is not necessarily very well described in subject terms. Archival records, unlike library collections, go through a lifecycle where the majority of the information that is captured relates to the process in which the materials are identified, handled, transferred, and preserved. The management context, physical attributes, and handling instructions/restrictions may be better described than the subjects or the context in which the records were originally used. Subject-related description can later be applied over an extended period of time by people who handle and review the archived materials – as the volume grows, this becomes increasingly challenging. Concept and entity analysis/extraction tools can help improve the quality of subject identification, but it is important to find ways to identify subjects without increasing the “noise” that may appear in final search results.

Creating an “Enhancement Ecosystem”

In this Web 2.0 world, users expect to be able to tag, annotate, cross-link, and share information. There is value not only to users in being able to do this, but also to the archives as a whole. While the idea of “crowd-sourcing” descriptions may not be appropriate for long-term, authoritative archival materials, it seems feasible to use the aggregated content of user descriptions as raw material for an archivist’s more detailed and refined archival description.

There will be patterns in tagging that can be aligned with organizational ontologies, and annotations that can help archivists understand what users feel to be important information about the records being described. However, to make this effective for the archivist, the interfaces that aggregate and present user content and patterns need to be well designed and easy to use.
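The kind of aggregation described above could start from something as simple as the sketch below, which counts user tags, aligns the ones it recognizes with ontology terms, and leaves the rest for an archivist to review. The tag list and the tag-to-concept mapping are hypothetical, and the normalization is deliberately naive (lowercasing and whitespace cleanup only).

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical mapping from normalized user tags to ontology terms.
TAG_TO_CONCEPT: Dict[str, str] = {
    "ww2": "World War II",
    "wwii": "World War II",
    "second world war": "World War II",
}


def normalize(tag: str) -> str:
    """Lowercase a tag and collapse runs of whitespace."""
    return " ".join(tag.lower().split())


def summarize_tags(
    tags: Iterable[str], tag_to_concept: Dict[str, str]
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return (aligned concept counts, unaligned tag counts), most frequent first."""
    aligned: Counter = Counter()
    unaligned: Counter = Counter()
    for raw in tags:
        tag = normalize(raw)
        if not tag:
            continue  # skip empty or whitespace-only tags
        concept = tag_to_concept.get(tag)
        if concept is None:
            unaligned[tag] += 1
        else:
            aligned[concept] += 1
    return aligned.most_common(), unaligned.most_common()


# Hypothetical tags applied by users to one record.
user_tags = ["WWII", "ww2", "Second World War",
             "ration books", "Ration  Books", "home front"]

aligned, unaligned = summarize_tags(user_tags, TAG_TO_CONCEPT)
print(aligned)    # [('World War II', 3)]
print(unaligned)  # [('ration books', 2), ('home front', 1)]
```

The unaligned list is the part most likely to interest an archivist: frequent tags with no matching ontology term suggest subjects that users care about but that the formal description does not yet cover.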