Moving Getty.edu “404-ward” With Help From The Internet Archive API

By admin
This is a guest post from Teresa Soleau (Digital Preservation Manager), Anders Pollack (Software Engineer), and Neal Johnson (Senior IT Project Manager) from the J. Paul Getty Trust.

Project Background


Getty pursues its mission in Los Angeles and around the world through the work of its constituent programs (Getty Conservation Institute, Getty Foundation, J. Paul Getty Museum, and Getty Research Institute), serving the general interested public and a wide range of professional communities to promote a vital civil society through an understanding of the visual arts.

In 2019, Getty began a website redesign project, changing the technology stack and updating the way we interact with our communities online. The legacy website contained more than 19,000 web pages, and we knew many were no longer useful or relevant and should be retired, possibly after being archived. This led us to leverage the content we’d captured using the Internet Archive’s Archive-It service.

We’d been crawling our site since 2017, but had treated the results more as a record of institutional change over time than as an archival resource to be consulted after deletion of a page. We needed to direct traffic to our Wayback Machine captures, ensuring that deleted pages remain accessible when a user requests a deprecated URL. We decided to dynamically display a link to the archived page on our site’s 404 “Page not found” error page.

Getty.edu 404 error “Page not found” message, including the dynamically generated instructions and Internet Archive page link.

The project to audit all existing pages required us to educate content owners across the institution about web archiving practices and purpose. We developed processes for completing human reviews of large amounts of captured content. This work is described in more detail in a 2021 Digital Preservation Coalition blog post that mentions the Web Archives Collecting Policy we developed.

In this blog post we’ll discuss the work required to use the Internet Archive’s data API to add the necessary link on our 404 pages pointing to the most recent Wayback Machine capture of a deleted page.


Technical Underpinnings

Implementation of our Wayback Machine integration was very straightforward from a technical point of view. The first example on the Wayback Machine APIs documentation page gave us the technical guidance we needed for our use case: displaying a link to the most recent capture of any page deleted from our website. With no requirements for authentication, key management, or platform-specific software development kit (SDK) dependencies, our development process was simplified. We chose to incorporate the Wayback API using Nuxt.js, the web framework used to build the new Getty.edu site.
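To illustrate the kind of lookup described above, here is a minimal TypeScript sketch of a query against the Wayback Machine availability endpoint. It is not our production code: the function and interface names are hypothetical, and the response shape shown is our reading of the public availability API, in which the most recent capture is returned when no timestamp is supplied.

```typescript
// Shape of the JSON returned by the Wayback Machine availability endpoint
// (assumed from the public API documentation).
interface WaybackSnapshot {
  available: boolean;
  url: string;       // link to the archived copy
  timestamp: string; // capture time as YYYYMMDDhhmmss
  status: string;    // HTTP status recorded for the capture, e.g. "200"
}

interface WaybackAvailability {
  // "closest" is absent when the archive holds no capture of the page.
  archived_snapshots: { closest?: WaybackSnapshot };
}

// Hypothetical helper: build the query URL for a page removed from the live site.
function buildAvailabilityUrl(deletedPageUrl: string): string {
  const query = new URLSearchParams({ url: deletedPageUrl });
  return `https://archive.org/wayback/available?${query.toString()}`;
}

// Hypothetical helper: return the snapshot link, or null if there is no usable capture.
function extractSnapshotUrl(payload: WaybackAvailability): string | null {
  const closest = payload.archived_snapshots.closest;
  return closest && closest.available ? closest.url : null;
}
```

Keeping URL construction and response parsing in small pure functions makes it easy to check the "no capture" case, where the 404 page should simply omit the archive link.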

Since the Wayback Machine API is highly performant for simple queries, with typical response times measured in milliseconds, we are able to query the API before rendering the page, using a Nuxt route middleware module. We added API error handling and a request timeout so that edge cases such as API failures or network timeouts do not block rendering of the 404 response page.
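The error handling and timeout described above could look roughly like the following sketch. It is framework-agnostic rather than actual Nuxt middleware, and the function name, the injected `fetchImpl` parameter, and the 2000 ms default are all assumptions made for illustration. Any failure resolves to `null`, so the caller can render the 404 page without an archive link.

```typescript
// Minimal subset of the fetch API that this sketch relies on.
type FetchLike = (
  url: string,
  init?: { signal?: AbortSignal },
) => Promise<{ ok: boolean; json(): Promise<unknown> }>;

// Hypothetical helper: look up the latest capture, never throwing and never
// waiting longer than timeoutMs (an assumed default, not a documented value).
async function findArchivedUrl(
  deletedPageUrl: string,
  fetchImpl: FetchLike,
  timeoutMs = 2000,
): Promise<string | null> {
  const controller = new AbortController();
  // Abort the request if the API has not answered in time.
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const endpoint =
      "https://archive.org/wayback/available?url=" + encodeURIComponent(deletedPageUrl);
    const response = await fetchImpl(endpoint, { signal: controller.signal });
    if (!response.ok) return null; // API error: fall back to a plain 404 page
    const body = (await response.json()) as {
      archived_snapshots?: { closest?: { available?: boolean; url?: string } };
    };
    const closest = body.archived_snapshots?.closest;
    return closest?.available && closest.url ? closest.url : null;
  } catch {
    // Network failure, timeout (abort), or malformed JSON.
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```

In a browser or a recent Node.js runtime, the global `fetch` can be passed as `fetchImpl`; injecting it keeps the function easy to exercise with a stub.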

The only Internet Archive API feature missing from our initial list of requirements was access to snapshot page thumbnails in the JSON data payload received from the API. Access to these images would allow us to enhance our 404 page with a visual cue of archived page content.

Results and Next Steps

Our ability to include a link to an archived version of a deleted web page on our 404 response page helped ease the tough decisions content stakeholders were obliged to make about what content to archive and then delete from the website. We could guarantee availability of content in perpetuity without incurring the long-term cost of maintaining the information ourselves.

The API brings back the most recent Wayback Machine capture by default, which is sometimes not one we created and hasn’t necessarily passed through our archive quality assurance process. We intend to develop our application further so that we privilege the display of Getty’s own page captures. This will ensure we’re delivering the highest-quality capture to users.
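As a rough illustration of the selection rule we have in mind, the sketch below prefers the newest capture from our own crawls and falls back to the newest capture overall. This is planned work rather than something we have built: the `Capture` shape and the `fromOwnCrawl` flag are hypothetical, and how a capture would be identified as ours is left open.

```typescript
// Hypothetical description of one archived capture of a page.
interface Capture {
  url: string;
  timestamp: string;     // 14-digit YYYYMMDDhhmmss, so string order matches time order
  fromOwnCrawl: boolean; // assumed flag: true if the capture came from our own crawl
}

// Prefer the newest capture from our own crawls; otherwise use the newest capture overall.
function pickPreferredCapture(captures: Capture[]): Capture | null {
  const newestFirst = [...captures].sort((a, b) =>
    a.timestamp < b.timestamp ? 1 : a.timestamp > b.timestamp ? -1 : 0,
  );
  return newestFirst.find((capture) => capture.fromOwnCrawl) ?? newestFirst[0] ?? null;
}
```

With this rule, an older capture that passed our quality assurance review would be shown ahead of a newer capture made by someone else.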


Google Analytics has been configured to report on traffic to our 404 pages and will track clicks on links pointing to Internet Archive pages, providing useful feedback on what portion of archived page traffic is referred from our 404 error page.

To work around the challenge of providing navigational affordances to legacy content, and to ensure that web page titles of old content remain accessible to search engines, we intend to provide an up-to-date index of all archived getty.edu pages.

As we continue to retire obsolete website pages and complete this monumental content archiving and retirement effort, we’re grateful for the Internet Archive API, which supports our goal of making archived content accessible in perpetuity.