Testing ChatGPT-4 for ‘UX Audits’ Shows an 80% Error Rate & 14–26% Discoverability Rate

By admin

Key Takeaways

At Baymard we’ve tested ChatGPT-4’s ability to conduct a UX audit of 12 different webpages, by comparing the AI model’s UX suggestions to those of a qualified human UX professional.

Our tests show that ChatGPT-4 has an 80% false-positive error rate and a 20% accuracy rate in the UX suggestions it makes.

When compared to the human experts, ChatGPT-4 discovered 26% of the UX issues in the screenshot of the webpage, and just 14% of the UX issues actually present on the live webpage (as interaction-related UX issues cannot be ascertained from an image).

On the 12 webpages tested, GPT-4 on average correctly identified 2.9 UX issues, but then overlooked 18.5 UX issues on the live webpage and 9.4 UX issues in the screenshot of the webpage, came up with 1.3 suggestions that are likely harmful to UX, and made 10.6 suggestions that are a waste of time (when compared to a human UX professional).

The human experts used in our testing were 6 different highly trained UX benchmarkers working at Baymard (relying on our 130,000+ hours of large-scale UX research).

Why This Test?

OpenAI recently opened up access for image upload in ChatGPT-4. This allows anyone to upload a screenshot of a webpage and ask, “What UX improvements can be made to this page?” The response is seemingly impressive: the results are clearly tailored to the uploaded webpage screenshot and delivered with a tone of high confidence.

We decided to test how accurate GPT-4 actually is when it comes to discovering UX issues on a webpage — to get a better understanding of how far the AI model is from a qualified human conducting a UX audit of the same webpage.

Test Methodology


1 of 12 webpage screenshots analyzed (full version here). GPT-4’s response for the uploaded webpage screenshot was compared to the results of a human UX professional spending 2–10 hours analyzing the same page.

We uploaded screenshots of 12 different e-commerce webpages to GPT-4 and asked, “What UX improvements can be made to this page?” We then manually compared the responses of GPT-4 with 6 human UX professionals’ results for the very same 12 webpages.
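For reference, our test was run through ChatGPT’s interface, but the same screenshot-plus-prompt setup can also be reproduced programmatically. The sketch below is an illustration, not Baymard’s actual tooling; it assumes the OpenAI Python SDK and the vision-capable model name available at the time of writing.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt4_for_ux_review(screenshot_path: str) -> str:
    """Send one webpage screenshot to GPT-4 with vision and ask the test prompt."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed vision-capable model name
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What UX improvements can be made to this page?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```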

The humans are 6 different highly trained UX benchmarkers working at Baymard, all relying on the results of Baymard’s 130,000+ hours of large-scale UX testing with more than 4,400 real end users (see more at baymard.com/research).

The humans spent 2–10 hours on each of the 12 webpages (this was performed as part of past UX benchmarking work). We spent an additional 50 hours on a detailed line-by-line comparison of the humans’ 257 UX suggestions against ChatGPT-4’s 178 UX suggestions.

The 12 webpages tested were a mix of product pages, product listings, and checkout pages at: Lego, Cabelas, Argos, Overstock, Expedia, REI, Thermo Fisher, TireRack, Zalando, Sigma Aldrich, Northern Tool, and Crutchfield.

(Note: other prompts were also tested but they gave largely the same responses.)

The Results

Below are the results of analyzing the 12 pages, 257 UX issues identified by humans, and 178 UX issues identified by ChatGPT-4.

We’ve uploaded our raw test data here and calculations here.

The results give the following GPT-4 discovery, accuracy, and error rates:

14.1% UX discovery rate overall (on the live webpage)

25.5% UX discovery rate for only issues seen in the screenshot

19.9% accuracy rate of the GPT suggestions

80.1% false-positive error rate of GPT suggestions (overall)

8.9% false-positives where GPT suggestions are likely harmful

71.1% false-positives where GPT suggestions are a waste of time

(The above percentage stats are rounded in the rest of the article.)
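As a sanity check, these rates can be roughly reconstructed from the totals and per-page averages quoted in this article. The sketch below is a back-of-the-envelope calculation (not Baymard’s published spreadsheet); the small deviations from the published rates come from rounding in the per-page averages and from per-page vs. pooled averaging.

```python
# Totals from the test (12 pages).
pages = 12
human_issues = 257     # UX issues identified by the human UX professionals
gpt_suggestions = 178  # UX suggestions made by ChatGPT-4

print(human_issues / pages)     # ~21.4 human-found issues per page
print(gpt_suggestions / pages)  # ~14.8 GPT-4 suggestions per page

# Per-page averages reported in this article.
correct = 2.9                # GPT-4 suggestions confirmed as real UX issues
overlooked_live = 18.5       # live-page issues GPT-4 missed
overlooked_screenshot = 9.4  # screenshot-visible issues GPT-4 missed
harmful = 1.3                # suggestions likely harmful to UX
waste = 10.6                 # suggestions that are a waste of time

total = correct + harmful + waste  # ~14.8, consistent with 178 / 12
print(f"accuracy:          {correct / total:.1%}")            # ~19.6% (published: 19.9%)
print(f"error rate:        {(harmful + waste) / total:.1%}")  # ~80.4% (published: 80.1%)
print(f"discovery, live:   {correct / (correct + overlooked_live):.1%}")        # ~13.6% (published: 14.1%)
print(f"discovery, screen: {correct / (correct + overlooked_screenshot):.1%}")  # ~23.6% (published: 25.5%)
```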

GPT-4 Discovers 26% of UX Issues in the Screenshot, and 14% of UX Issues on the Webpage

Our tests show that GPT-4 discovered 26% of the UX issues verified to be present in the screenshot of the webpage, when compared to a human UX professional.

If we want to understand how a human UX professional compares to the method of ‘giving ChatGPT-4 a screenshot of a webpage’, then we need to instead consider all UX issues the human identified using the live webpage. Here, ChatGPT-4 discovered 14% of the UX issues actually present on the live webpage, since GPT-4 only analyzed screenshots while the human UX professionals used the live, interactive website.

Uploading screenshots of a webpage and asking any AI model to assess it will always be a greatly limiting method, as a very large portion of website UX issues are interactive in nature. Discovering many of the UX issues present on a given webpage requires interacting with the webpage (e.g., clicking buttons, hovering over images), but it also requires navigating between pages and taking information from other pages into account when assessing the current page.

Being able to truly navigate and click around websites will likely lift the discoverability rate of UX issues from 14% to close to 26%. That of course still means overlooking 3 out of 4 UX issues on the given webpages (or, in absolute numbers, overlooking 10.6 UX issues per page).

(In comparison, heuristic evaluations are often said to have a 60% inter-rater reliability, and at Baymard we’ve found that this can be pushed to 80–90% with the right heuristics, training, and technologies.)
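To make the idea of “navigating and clicking around” concrete, here is a minimal, hypothetical sketch using Playwright to capture a few interactive states of a page before handing them to a vision model. This is not part of Baymard’s test setup; the selectors and flow are illustrative assumptions.

```python
# Hypothetical illustration: capture a page's default and hover states so a
# vision model sees more than one static screenshot. Not Baymard's tooling.
from playwright.sync_api import sync_playwright

def capture_states(url: str) -> list[str]:
    """Save screenshots of a few page states and return their file paths."""
    paths = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1440, "height": 900})
        page.goto(url, wait_until="networkidle")

        page.screenshot(path="state_default.png", full_page=True)
        paths.append("state_default.png")

        # Example interaction: hover the first image (placeholder selector).
        images = page.locator("img")
        if images.count() > 0:
            images.first.hover()
            page.screenshot(path="state_hover.png")
            paths.append("state_hover.png")

        browser.close()
    return paths
```

Each captured state could then be sent through the same screenshot-plus-prompt call sketched in the methodology section.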


The 80% Error Rate: 1/8 Is Harmful, and 7/8 Is a Waste of Time

ChatGPT-4 had an 80% false-positive error rate.

Of these errorful suggestions, around 1/8 were advice that would likely be harmful to UX. For example:

GPT suggested that LEGO, which already has a simplified footer, simplify it further (essentially removing it)

GPT suggested that Overstock, which uses pagination, should instead “either use infinite scrolling or ‘load more’” (infinite scrolling is observed to be harmful to UX, whereas ‘load more’ is a good suggestion)

The vast majority of the errorful UX suggestions made by GPT-4 (7/8) were, however, not harmful but simply a waste of time. Typical examples observed:

GPT very often made the same overly generic suggestions based on things not viewable in the screenshot. For example, for all 12 webpages it suggested: “Make the site mobile responsive…” despite clearly being provided a desktop screenshot

GPT made several suggestions for things the site already had implemented, e.g., suggesting to implement breadcrumbs for REI’s product page, when the site clearly already has them

GPT made suggestions for unrelated pages, e.g., for Argos’ product page one of the suggestions was “While not visible in the current screenshot, ensuring the checkout process is as streamlined and straightforward as possible can further improve the user experience”

Some of the suggestions by GPT were also so inaccurate that it’s unclear what is meant. E.g., GPT suggested for TireRack’s product details page that: “Instead of static images, an image carousel for the tire and its features might be more engaging and informative.” (The site already has a PDP image gallery, so this advice is prone to be interpreted as an auto-rotating carousel, which would be harmful.)


GPT also made some outdated suggestions based on UX advice that used to be true but in 2023 is no longer valid due to an observed change in general user behavior (e.g., our 2023 UX research shows that users’ understanding of checkout inputs has improved greatly compared to what we observed over the past 10 years).

In Summary: ChatGPT-4 Is Not (Yet) Useful for UX Auditing

In the UX world, large language models like ChatGPT are increasingly proving to be indispensable work tools, e.g., for analyzing large datasets of qualitative customer support emails, brainstorming sales page copy ideas, and transcribing videos. It’s also impressive that this technology can now interpret a screenshot of a webpage.

However, when it comes to having GPT-4 help with performing UX audits, we would caution against it because:

Of the 12 webpages tested, GPT-4 on average correctly identified 2.9 UX issues, but then overlooked 18.5 UX issues on the live webpage and 9.4 UX issues in the screenshot of the webpage, made 1.3 suggestions that would be harmful to UX, and gave 10.6 suggestions that would be a waste of time.

This means that ChatGPT-4 is very far from human performance — assuming the human is skilled in UX, has 2–10 hours per page type, and uses a comprehensive database of UX heuristics.

It’s especially the combination of ChatGPT-4’s low discoverability rate of UX issues and the low accuracy rate of its suggestions that’s problematic. Even as a “quick free UX check”:

It simply will not find enough of the UX issues present on the live webpage or in a screenshot of the webpage (a 14% and 26% discoverability rate, respectively)

Additionally, work is required to parse its response to identify the correct UX suggestions among the errorful ones. With an accuracy rate of 20%, this is not “a useful supplemental tool that will give you a fifth of the UX issues for free”, as it will still require a human UX professional to do the parsing

Furthermore, considering the cost of implementing any website change (design and development), basing website changes on a UX audit of this low quality is bound to yield a poor overall return on investment.

Getting access: our research findings are available in the 650+ UX guidelines in Baymard Premium – get full access to learn how to create a “State of the Art” e-commerce user experience.

If you want to know how your desktop site, mobile site, or app performs and compares, then learn more about getting Baymard to conduct a UX Audit of your site or app.