Case Study: Archival Information
SWUI 2009 at ISWC: Sharing Ideas for Complex Problems in
User Interaction
Case study to support SWUI 2009 workshop activities and discussion,
prepared by Duane Degler, 6/2009.
Summary
This case study illustrates challenges and opportunities in the
archival community, which faces rapid growth of large-scale electronic
archives and
library collections online. This domain provides a rich opportunity for
semantic web and linked data researchers, practitioners, and software
developers to explore. From the public/researcher user perspective, it
offers
highly heterogeneous data sources, a wide mix of structured and
unstructured
data, diverse user populations and goals, requirements for a range of
query/search/retrieval strategies, and unique needs for
visualizing/rendering
both results and individual objects. From the archivist/records manager
perspective, it offers an increasing tidal wave of data in a wide range
of
formats that needs to be reviewed, processed, described, tagged,
cataloged,
shared, and enhanced over time – in an environment where users expect
more and
more while the available staff may remain static or even decrease.
Increasingly, both web tools and archival management systems are
embedding RDF and linked data capabilities into their architectures,
and
overlaying standards-based ontologies to supplement older proprietary
vocabularies. However, key questions remain:
- What are the most effective ways to harness these
capabilities to support high quality user interaction by researchers
and the
public?
- To achieve successful interaction with archival data, how
do we support archivists and records managers in the creation and
maintenance
of rich, linked, semantically-enabled data?
Why is archival information an important topic for the
semantic web community? Ultimately, it is a highly multi-dimensional,
contextual, authoritative, and evolving problem space with an extremely
diverse
set of user needs and expectations that at the same time has to
provide a
service today and plan for service tens or even hundreds of years
hence. That
sounds like the kind of complex environment where the rich potential of
linked
data and the power of the semantic web should play a key role!
The Challenge
Online and electronic archival resources have many
interesting and complex aspects with a semantic web user interaction (SWUI) component. For
example,
government archivists face:
- High demand for access to records electronically.
- A widely diverse user population, with a wide range of goals and
expectations. This includes:
- Genealogists from all over the world looking to “link up”
people,
and information about people, dating back hundreds of years and
spanning more
than one country.
- Military service people looking for supporting documentation
to
manage administrative or benefits issues.
- Military historians looking to connect people and groups to
historic locations and events.
- Academics and PhD students performing exploratory research
and
doing complex analysis that enriches the historical corpus by creating
new
papers that are themselves published electronically and need to be
linked.
- Authors and historians gathering source material for
electronic
and printed publications.
- Journalists and government staff researchers who need to
connect
current events with historic events very quickly when a new topic
arises.
- Members of the public and students world-wide, who are
curious to
discover the wide range of information available in texts, photos,
maps,
illustrations, videos, audios, and databases.
- Users with special needs, such as physical disabilities,
cognitive impairments, or environmental circumstances that make access
to the
information online both vital and difficult.
- Massive amounts of extremely heterogeneous "records"
data (e.g. petabytes of every type of file and database you can
imagine, including
“born digital” electronic materials and scanned/digitized materials
that may be
hundreds of years old).
- Interesting mixtures of structured metadata, structured data,
unstructured data, ontologies, flat vocabulary lists, audio, video,
images.
This diverse data exhibits a range of organizational relationships,
geographical
relationships, time relationships, and event relationships that users
will
interact with in a range of ways.
- The need to provide context for the files from many dimensions,
both from their origination (source, time period, etc.) and their
evolving
subject relevance over time. This presents a particular challenge: the
uses to
which data will be put cannot be known in advance, because much of the
historical material of government is referenced in the context of a
current
event which is distant in time from earlier cataloging and description.
The
historic information develops new links and context relationships (as
well as,
sometimes, interpretations) through its use over time.
- The need to provide clear, usable access for a variety of user
groups, from casual general public users (survey searching) to academic
researchers (exploratory searching) to historians (cataloging and
targeted
research) to interaction between systems creating larger historical
record sets
(API, mashup, etc.).
- The need for "meta representations" and visualizations
of search results, collections of materials, and relationships between
large
numbers of items, because there will be so many "hits" for most
searches, and the idea of "ranking" individual results is not aligned
with the problem space.
- The need to support both formal and informal description
(summary,
elaboration, annotation, classification, linking, structuring) of
materials and
collections of materials, and allow review of descriptions by
archivists to
improve formal descriptions over time.
- The need for individual users to manage their descriptions as
notes in their research, but also to allow those notes to enhance the
whole and
to determine which are appropriate to persist over the long term as part
of
the historic record.
- Data sharing between disparate partners, both government and
commercial, each of which has its own interaction designs and
approach to the
data. NARA (the U.S. National Archives and Records Administration) has partnership agreements with organizations such
as the
Library of Congress, ancestry.com, and footnote.com, among many others.
The
aggregate can be thought of as the total accessible "holdings" in the
collection. The linked data challenges and opportunities are
significant.
- Social interactions and collaborations between communities of
users, and between users and archivists/historians, that support the
communities’ use of the information and as a by-product enhance the
description
and discoverability of the material.
Applicable Domains
Research and designs from a range of disciplines are
applicable to the challenges described above, including:
- Semantic Web
- Linked Data
- Machine Learning
- NLP, concept and entity extraction, and linguistic analysis
- Search and indexing of structured and unstructured data, images
and multimedia
- Visualization
- Text entry, tagging and annotation
- Ontology and taxonomy modeling and integration
- Social and community interaction
- Multimedia
- Grid and cloud computing
- Agents
Possible Roles for Linked Data and the Semantic Web to Support
Interaction
How do users articulate their goals?
In many cases, the user’s experience with an archival
information site is not a single-encounter, single-item experience. It
is an
exploration, often over many sessions. Sometimes that exploration is
quite
focused (for example, gathering information about a specific relative
or family
tree, which is focused but not necessarily quick) and sometimes it is
quite
broad and extensive (for example, analyzing a wide range of documents,
reports,
and statistics as research for a book on the effect of international
trade
patterns for particular local economies).
Users need to find all relevant information
throughout the archive, while incrementally filtering out things that
prove not
to be relevant. This exploration and decision-making should be
facilitated by
the environment, so that the user’s focus is on their goal and not on
the
process of managing the software.
Findability for heterogeneous data
Searching and browsing archival records is a different
experience from searching and browsing other types of information and sites. As a result,
interactions
potentially need to differ from those in other
searching/browsing
environments in order to be effective. Linked data approaches and
technologies
that are driving the potential of the semantic web are a key part of
the
solution.
The interesting attributes from a searching and browsing
perspective include:
- There is an almost unlimited number of information types,
multiplied by a large number of data formats used. For example:
- Individual records of people (immigration, birth/death,
military
service, census, medical, political, employment)
- Documents, on paper, scanned, and “born electronic”
(treaties,
contracts, reports, analyses, communications, memos, specifications,
policies,
procedures); this might also include communication formats like e-mail
which
have to be handled in three or more ways: as collections of messages (a
single
person’s e-mail history), individual files, and threads
- Images, photographs, illustrations, and many other forms of
visual information, some scanned and some digital, which may or may not
have
descriptive metadata
- Maps and cartographic illustrations, which convey visual
information, geographic information, and structured data about places
and events
- Statistical data files and databases (economic data,
population
analyses, research study data)
- Raw scientific databases and flat file data sets
(environmental,
oceanographic, geological, space-based data capture, medical study
data,
biological/genomic data)
- Audio files (speeches, music, proceedings, office records
such as
the recently released Nixon tapes)
- Video and film materials
- Possibly going forward, user-generated content and newer
forms of
electronic information like Twitter messages and other electronic
messaging
- The records reflect the widely heterogeneous and overlapping
nature of the information creators. For example, in the United States
you may
find responsibility and involvement in nuclear-related issues in many
departments, such as the Nuclear Regulatory Commission, the Environmental
Protection Agency, the State Department, and the Department of Defense,
but
also in places like the National Science Foundation, the National
Institutes of
Health (part of Health and Human Services), the National Institute of
Standards
and Technology, and of course responsible groups within each of the 50
states
and other government levels. Which group is responsible for what? How
is their
information related? Are their categorization schemes aligned?
- Network information models, designed in the web world, can
struggle against the data’s lack of linked relationships. A large
volume of
documents, images, etc. that were stored on a department’s file server
will not
have the same level of deliberate (and intact) coded relationships that
you
would find for information that was created from the beginning with the
intention of operating in a linked environment.
- Semantic analysis/extraction models for cataloging and searching
can struggle against the broad, heterogeneous nature of the data and
inconsistent use of language that arises, since the information is
coming from
so many different sources over a significant span of time.
- Even if the appropriate information is retrieved, there are
challenges with how the information is ranked when presented to
the
user. If results are presented in lists, and users are not likely to
page
through many pages of results, then there is a risk that relevant and
appropriate information is never surfaced to the user.
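One alternative to a long ranked list is to summarize a result set by facet, so that users see the shape of what was found before paging through items. The following is a minimal sketch of that idea in Python. The sample hits, the field names (`type`, `decade`), and the helper functions `facet_counts` and `narrow` are all hypothetical and only illustrate the approach.

```python
from collections import Counter

# Hypothetical search hits; the field names and values are illustrative only.
HITS = [
    {"title": "Ship manifest", "type": "document", "decade": "1900s"},
    {"title": "Harbor photograph", "type": "image", "decade": "1900s"},
    {"title": "Census sheet", "type": "document", "decade": "1910s"},
    {"title": "Port survey map", "type": "map", "decade": "1900s"},
    {"title": "Immigration form", "type": "document", "decade": "1900s"},
]


def facet_counts(results, facet):
    """Count results per value of one facet, most frequent first."""
    return Counter(r[facet] for r in results if facet in r).most_common()


def narrow(results, facet, value):
    """Keep only the results whose facet has the given value."""
    return [r for r in results if r.get(facet) == value]


if __name__ == "__main__":
    print(facet_counts(HITS, "type"))
    print(len(narrow(HITS, "decade", "1900s")))
```

A summary of this kind lets a user narrow thousands of hits by type or period instead of relying on the order of a single list.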
Self-description: carrying “meaning” far into the future
The primary purpose for archiving information is to preserve
it for future generations, in a way that it can be accessed, used and
understood by people who may be far removed from the context in which
the
information is created. The information must not only be readable by
humans and
machines as technology evolves over hundreds of years, but the
information must
also be able to be interpreted contextually and prompt users to
associated
records. This means:
- Data about relationships and context should be coded using
standards that will allow it to be read independent of current
technology tools.
- That data should be contextually rich, so that meaning is
enhanced rather than lost or clouded.
- Context and meaning should be machine discoverable, as the volume
of information and the variety of media formats will continue to grow
exponentially and thus users will need assistance in locating,
relating, and
interpreting information.
- Context and meaning should be clearly available to users in a way
that is understandable and detailed, and yet does not distract the user
from
their primary goal in using the information.
- New interpretations and relationships should be able to be
created easily to enhance the value of historical records, and the
format
should be consistent over time to reduce the amount of data
transformation
that records and their metadata have to go through.
Supporting the disambiguation and identity challenge
The modeling and describing of information needs to address
ambiguity
about what is being sought. Here is a particularly famous example: “John
Kennedy”
– for which there are many documents, images, and data records for each
of the
following:
- John F Kennedy, the President (a person)
- John F Kennedy Jr., the President’s son (a person)
- John Kennedy (many people with this name, and many more with
either “John” or “Kennedy”), serving in the military, emigrating to the
US, acting as contracts agents in procurements, featuring in
legislation, etc.
- The Kennedy Presidential Library
- USS John F Kennedy, the Naval aircraft carrier (not to be
confused with the USS Joseph P Kennedy, for those searching for “USS
Kennedy”)
- Kennedy Airport
- The John F Kennedy Center for the Performing Arts
- The Kennedy Foundation
- Etc.
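The list above can be pictured as a lookup from one search string to several typed entities. The sketch below is a minimal Python illustration, assuming each entity carries an identifier, a label, and a kind. The `ex:` identifiers, the kind labels, and the `candidates` function are hypothetical and do not describe any archive's actual system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str    # hypothetical identifier, not a real URI
    label: str
    kind: str   # e.g. "person", "ship", "organization", "place"


# Illustrative entries drawn from the list above; identifiers are made up.
ENTITIES = [
    Entity("ex:jfk", "John F Kennedy", "person"),
    Entity("ex:jfk-jr", "John F Kennedy Jr.", "person"),
    Entity("ex:uss-jfk", "USS John F Kennedy", "ship"),
    Entity("ex:uss-jpk", "USS Joseph P Kennedy", "ship"),
    Entity("ex:jfk-airport", "Kennedy Airport", "place"),
    Entity("ex:jfk-library", "The Kennedy Presidential Library", "organization"),
]


def candidates(query, entities=ENTITIES):
    """Group entities whose label contains every query word, keyed by kind."""
    words = query.lower().split()
    grouped = defaultdict(list)
    for e in entities:
        label_words = e.label.lower().replace(".", "").split()
        if all(w in label_words for w in words):
            grouped[e.kind].append(e)
    return dict(grouped)


if __name__ == "__main__":
    for kind, found in candidates("USS Kennedy").items():
        print(kind, [e.label for e in found])
```

Grouping the candidates by kind gives an interface something to ask the user ("a person, a ship, or a place?") rather than mixing all matches into one list.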
Reflecting the authoritative nature of the archival repository
People who use a national archive expect it to be the
authoritative source for the information it holds. The interaction
experience between users and the information should help users feel
that sense
of trust. The data, its various formats and transformations over time,
and any
overall site architecture for accessing and viewing information should
reflect
a transparent authority – modeling that authority for users and other
systems
could benefit greatly from semantic components and architectures.
Archival information is often used outside the Archives’ web
presence or physical buildings. Materials are shared with other
collections.
People use images, documents, audio, and statistical data from archives
as raw
materials for other works. It is important now, and will be
increasingly
important as electronic data is available more widely, that the path
back to
the provenance of that data is sound. This starts with
providing clear
source provenance data in a standard, machine-readable form, and
providing
services that are trusted when that provenance is requested by users of
other
sites and repositories. Provenance itself should be modeled in a way
that is
easy to interpret and aligns with provenance representations from other
information sources.
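As a rough illustration of a machine-readable path back to the source, the sketch below models each copy of an item as a record that points at the copy it was derived from. It is written in Python under simple assumptions: the `ProvenanceRecord` fields, the sample identifiers, and the `trace_to_source` function are hypothetical, and a real system would more likely use a shared provenance vocabulary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    item_id: str                        # hypothetical identifier
    holder: str                         # who holds or published this copy
    derived_from: Optional[str] = None  # item_id of the source copy, if any


# Illustrative chain: an archive original, a partner copy, and a later reuse.
RECORDS = [
    ProvenanceRecord("archive:photo-123", "national archive"),
    ProvenanceRecord("partner:img-9", "partner site", "archive:photo-123"),
    ProvenanceRecord("blog:img-1", "third-party site", "partner:img-9"),
]


def trace_to_source(item_id, records=RECORDS):
    """Follow derived_from links back to the original; return the path of ids."""
    by_id = {r.item_id: r for r in records}
    path = []
    seen = set()
    current = item_id
    while current is not None:
        if current in seen:
            raise ValueError(f"provenance cycle at {current}")
        if current not in by_id:
            raise KeyError(f"no provenance record for {current}")
        seen.add(current)
        path.append(current)
        current = by_id[current].derived_from
    return path


if __name__ == "__main__":
    print(trace_to_source("blog:img-1"))
```

The function fails loudly on a missing or circular link, since a broken chain is exactly the case where the provenance can no longer be trusted.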
Codifying policies and rules for finding, accessing, and using data
There is a range of rules around accessing and using
information, depending on the nature of the information and the person
(or in
future, system) that is seeking the information. These rules need to be
carried
in the system in a way that allows them to be understood, reviewed,
audited,
communicated, and changed over time. They should participate in the
data
relationships, rather than be coded into applications. Rules, logic,
and
approaches to proof are all important, and must be transparent and
manageable
by responsible users.
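To make the contrast with rules "coded into applications" concrete, the sketch below keeps access rules as plain data that a generic function evaluates. It is a minimal Python illustration. The record classes, requester roles, rule contents, and the deny-by-default behavior are all assumptions made for the example, not a description of any archive's policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRule:
    rule_id: str
    record_class: str    # hypothetical classification of the record
    requester_role: str  # hypothetical role of the person or system
    allow: bool
    reason: str          # human-readable, so the rule can be reviewed and audited


# Illustrative rules only; real policies would be far richer.
RULES = [
    AccessRule("r1", "census", "public", True, "open after release date"),
    AccessRule("r2", "medical", "public", False, "privacy restriction"),
    AccessRule("r3", "medical", "archivist", True, "processing access"),
]


def decide(record_class, requester_role, rules=RULES):
    """Return (allowed, rule_id, reason); deny by default when no rule matches."""
    for rule in rules:
        if rule.record_class == record_class and rule.requester_role == requester_role:
            return rule.allow, rule.rule_id, rule.reason
    return False, None, "no matching rule"


if __name__ == "__main__":
    print(decide("medical", "public"))
```

Because every decision returns the rule that produced it, the outcome can be explained to the user and audited later, and changing a policy means editing data rather than application code.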
In a service-oriented architecture, and also in services
made available on the web, rules for agents and services also need to
be
created, managed, reviewed, and changed over time. As the semantic web
becomes
a more complex ecosystem for agents and services, the rules and
management of
interactions need to be understood and able to be changed flexibly by
responsible users.
Keeping up with evolving descriptions and language
XML metadata structures and thesaurus syntax are widely used
throughout the archival community. At the same time, there is an
increasing use
of RDF within metadata, description, and linking architectures, and OWL
vocabularies as part of the formal controlled vocabularies. One reason
for this
is clear: language and structure are both going to evolve, and modeling
rules
and relationships must be flexible and orthogonal (not simply
hierarchical).
Semantic web formats allow for more flexible modeling and maintenance,
and the
ability to map new vocabularies over time. However, many of the current
implementations are focused on back-end enabling technologies.
Approaches for
user interaction so that this richness and flexibility can be captured
and
maintained are not yet as evident, but are vitally needed.
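The ability to map vocabularies over time can be sketched with a handful of subject-predicate-object statements. The Python example below is illustrative only: the term names, the `mapsTo` and `subject` predicates, and the `documents_about` function are hypothetical, and plain tuples stand in for an RDF store.

```python
# Each triple is (subject, predicate, object); all terms are hypothetical.
TRIPLES = {
    ("legacy:Aeroplanes", "mapsTo", "vocab:Aircraft"),
    ("vocab2009:Airplanes", "mapsTo", "vocab:Aircraft"),
    ("doc:1", "subject", "legacy:Aeroplanes"),
    ("doc:2", "subject", "vocab2009:Airplanes"),
    ("doc:3", "subject", "vocab:Aircraft"),
}


def documents_about(concept, triples=TRIPLES):
    """Find documents tagged with the concept or with any term mapped to it."""
    equivalent = {concept} | {
        s for s, p, o in triples if p == "mapsTo" and o == concept
    }
    return sorted(s for s, p, o in triples if p == "subject" and o in equivalent)


if __name__ == "__main__":
    print(documents_about("vocab:Aircraft"))
```

Documents cataloged under an older term are found through the mapping without being re-described, which is the maintenance benefit the paragraph above points to.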
Coping with the volume of information
The sheer volume of information that is collected and can be
made available is staggering. Traditional mechanisms for finding and
displaying
information can break down at this scale. And yet it is important to
guard
against creating silos of data that aren’t flexible to accommodate both
current
and future use. One way the semantic web may help here is to
create
overlay models that allow information to be managed in segments for
particular
purposes but do not limit the integrity of the whole.
Note also that the information held is not necessarily very
well described in subject terms. Archival records, unlike
library collections,
go through a lifecycle where the majority of the information that is
captured
relates to the process in which the materials are identified,
handled,
transferred, and preserved. The management context, physical
attributes, and
handling instructions/restrictions may be better described than the
subjects or
the context in which the records were originally used. Subject-related
description can be later applied over an extended period of time by
people who
handle and review the archived materials – as the volume grows, this
becomes
increasingly challenging. Concept and entity analysis/extraction tools
can help
extend the quality of subject identification, but it is important to
find ways
that identify subjects without increasing the “noise” that may result
in final
search results.
Creating an “Enhancement Ecosystem”
In this Web 2.0 world, users expect to be able to tag,
annotate, cross-link and share information. There is value not only to
users in
being able to do this, but also value to the archives as a whole. While
the
idea of “crowd-sourcing” descriptions may not be appropriate for
long-term,
authoritative archival materials, it seems feasible to use the
aggregated
content of user descriptions as raw material for an archivist’s more
detailed
and refined archival description.
There will be patterns in tagging that can be aligned with
organizational ontologies, and annotations that can help archivists
understand
what users feel to be important information about the records being
described.
However, to make this effective for the archivist, the interfaces that
aggregate and present user content and patterns need to be well
designed and
easy to use.
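A first step toward such an interface might be a simple aggregation of user tags against the controlled vocabulary. The Python sketch below assumes a hypothetical tag-to-term mapping (`TAG_TO_TERM`) and a hypothetical threshold (`min_count`) for surfacing unmapped tags. It is an illustration of the idea, not a recommended design.

```python
from collections import Counter

# Hypothetical mapping from free-text tags to controlled vocabulary terms.
TAG_TO_TERM = {
    "ww2": "World War II",
    "wwii": "World War II",
    "navy": "Naval operations",
}


def summarize_tags(user_tags, min_count=2, tag_to_term=TAG_TO_TERM):
    """Count normalized tags; return (aligned term counts, frequent unaligned tags)."""
    counts = Counter(t.strip().lower() for t in user_tags if t.strip())
    aligned = Counter()
    unaligned = {}
    for tag, n in counts.items():
        term = tag_to_term.get(tag)
        if term is not None:
            aligned[term] += n       # fold tag variants into one vocabulary term
        elif n >= min_count:
            unaligned[tag] = n       # recurring tags with no vocabulary match
    return dict(aligned), unaligned


if __name__ == "__main__":
    tags = ["WW2", "wwii", "Navy", "convoy", "convoy", "misc"]
    print(summarize_tags(tags))
```

The recurring unaligned tags are the interesting output for an archivist: they are candidates for new vocabulary terms or for richer formal description.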