Testing ChatGPT-4 for ‘UX Audits’ Shows an 80% Error Rate & 14–26% Discoverability Rate

November 12, 2023

Key Takeaways

At Baymard we’ve tested ChatGPT-4’s ability to conduct a UX audit of 12 different webpages, by comparing the AI model’s UX suggestions to those of a qualified human UX professional

Our tests show that ChatGPT-4 has an 80% false-positive error rate and a 20% accuracy rate in the UX suggestions it makes

When compared to the human experts, ChatGPT-4 discovered 26% of the UX issues in the screenshot of the webpage, and just 14% of the UX issues actually present on the live webpage (as interaction-related UX issues cannot be ascertained from an image)

On the 12 webpages tested, GPT-4 on average correctly identified 2.9 UX issues, but then overlooked 18.5 UX issues on the live webpage and 9.4 UX issues in the screenshot of the webpage, came up with 1.3 suggestions that are likely harmful to UX, and made 10.6 suggestions that are a waste of time (when compared to a human UX professional)

The human experts used in our testing were 6 different highly trained UX benchmarkers working at Baymard (relying on our 130,000+ hours of large-scale UX research)

Why This Test?

OpenAI recently opened up access for image upload in ChatGPT-4. This allows anyone to upload a screenshot of a webpage and ask, “What UX improvements can be made to this page?”. The response seems impressive: the results are clearly tailored to the uploaded screenshot and are delivered with a tone of high confidence.

We decided to test how accurate GPT-4 actually is when it comes to discovering UX issues on a webpage — to get a better understanding of how far the AI model is from a qualified human conducting a UX audit of the same webpage.

Test Methodology

1 of 12 webpage screenshots analyzed (full version here), alongside GPT-4’s response for the uploaded webpage screenshot. GPT-4’s response was compared to the results of a human UX professional spending 2–10 hours analyzing the same page.

We uploaded screenshots of 12 different e-commerce webpages to GPT-4 and asked, “What UX improvements can be made to this page?”. We then manually compared GPT-4’s responses with 6 human UX professionals’ results for the very same 12 webpages.
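For readers who want to reproduce this kind of test programmatically rather than through the ChatGPT web interface, here is a minimal sketch using OpenAI’s Python client and the vision-enabled model OpenAI offered at the time of writing (gpt-4-vision-preview). The screenshot filename and the max_tokens value are placeholder choices, not part of our methodology; only the prompt text matches the one used in our test.

```python
import base64

from openai import OpenAI  # OpenAI Python client, v1.x

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local webpage screenshot as a base64 data URL.
# ("screenshot.png" is a placeholder filename.)
with open("screenshot.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-enabled GPT-4 model
    messages=[
        {
            "role": "user",
            "content": [
                # The same prompt used in our test:
                {"type": "text",
                 "text": "What UX improvements can be made to this page?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            ],
        }
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
```

Running this once per screenshot yields the kind of free-text suggestion list we then compared, line by line, against the human benchmarkers’ findings.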

The humans are 6 different highly trained UX benchmarkers working at Baymard, who are all relying on the results of Baymard’s 130,000+ hours of large-scale UX testing with more than 4,400 real end users (see more at baymard.com/research).

The humans spent 2–10 hours on each of the 12 webpages (this was performed as part of past UX benchmarking work). We spent an additional 50 hours on a detailed line-by-line comparison of the humans’ 257 UX suggestions against ChatGPT-4’s 178 UX suggestions.

The 12 webpages tested were a mix of product pages, product listings, and checkout pages at: Lego, Cabelas, Argos, Overstock, Expedia, REI, Thermo Fisher, TireRack, Zalando, Sigma Aldrich, Northern Tool, and Crutchfield.

(Note: other prompts were also tested but they gave largely the same responses.)

The Results

Below are the results of analyzing the 12 pages, the 257 UX issues identified by humans, and the 178 UX issues identified by ChatGPT-4.

We’ve uploaded our raw test data here and calculations here.

The results give the following GPT-4 discovery, accuracy, and error rates:

14.1% UX discovery rate overall (on the live webpage)

25.5% UX discovery rate for only issues seen in the screenshot

19.9% accuracy rate of the GPT suggestions

80.1% false-positive error rate of GPT suggestions (overall)

8.9% false positives where GPT suggestions are likely harmful

71.1% false positives where GPT suggestions are a waste of time

(These percentages are rounded in the rest of the article.)
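To make the derivation of these rates concrete, here is a small illustrative sketch of how such figures fall out of per-page counts. The per-page numbers below are made-up placeholders, not our raw data (which is linked above); only the formulas reflect how the rates are defined.

```python
# Illustrative only: placeholder per-page counts, not Baymard's raw data.
# Each tuple: (correct, missed_on_live_page, missed_in_screenshot,
#              harmful_suggestions, waste_of_time_suggestions)
pages = [
    (3, 19, 10, 1, 11),
    (2, 18, 9, 2, 10),
    (4, 18, 9, 1, 11),
]

correct       = sum(p[0] for p in pages)
missed_live   = sum(p[1] for p in pages)
missed_screen = sum(p[2] for p in pages)
harmful       = sum(p[3] for p in pages)
waste         = sum(p[4] for p in pages)
suggestions   = correct + harmful + waste  # all GPT-4 suggestions

# Discovery rate: share of the human-identified issues that GPT-4 also found.
discovery_live   = correct / (correct + missed_live)
discovery_screen = correct / (correct + missed_screen)

# Accuracy rate: share of GPT-4's suggestions that were correct; the
# remainder is the false-positive error rate, which splits into
# harmful advice and waste-of-time advice.
accuracy   = correct / suggestions
error_rate = (harmful + waste) / suggestions

print(f"discovery (live):       {discovery_live:.1%}")
print(f"discovery (screenshot): {discovery_screen:.1%}")
print(f"accuracy:               {accuracy:.1%}")
print(f"false-positive rate:    {error_rate:.1%}")
```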

GPT-4 Discovers 26% of UX Issues in the Screenshot, and 14% of UX Issues on the Webpage

Our tests show that GPT-4 discovered 26% of the UX issues verified to be present in the screenshot of the webpage, when compared to a human UX professional.

If we want to understand how a human UX professional compares to the method of ‘giving ChatGPT-4 a screenshot of a webpage’, then we need to instead consider all UX issues the human identified using the live webpage. Here, ChatGPT-4 discovered only 14% of the UX issues actually present on the live webpage, because GPT-4 only analyzed screenshots, whereas the human UX professionals used the live interactive website.

Uploading screenshots of a webpage and asking any AI model to assess it will always be a greatly limiting method, as a very large portion of website UX issues are interactive in nature. Discovering many of the UX issues present on a given webpage requires interacting with the page (e.g., clicking buttons, hovering over images, etc.), and also requires navigating between pages and taking information from other pages into account when assessing the current page.

Being able to truly navigate and click around websites will likely lift the discoverability rate of UX issues from 14% to close to 26%. That of course still means overlooking 3 out of 4 UX issues on the given webpages (or, in absolute numbers, overlooking 10.6 UX issues per page).

(In comparison, heuristic evaluations are often said to have a 60% inter-rater reliability, and at Baymard we’ve found that this can be pushed to 80–90% with the right heuristics, training, and technologies.)

The 80% Error Rate: 1/8 Is Harmful, and 7/8 Is a Waste of Time

ChatGPT-4 had an 80% false-positive error rate.

Of these erroneous suggestions, around 1/8 were advice that would likely be harmful to UX. For example:

GPT suggested that LEGO, which already has a simplified footer, should simplify it further (essentially removing it)

GPT suggested that Overstock, which uses pagination, should instead “either use infinite scrolling or ‘load more’” (infinite scrolling is observed to be harmful to UX, whereas ‘load more’ is a good suggestion)

The vast majority of the erroneous UX suggestions made by GPT-4 (7/8) were, however, not harmful but simply a waste of time. Typical examples observed:

GPT very often made the same overly generic suggestions based on things not viewable in the screenshot. For example, for all 12 webpages it suggested “Make the site mobile responsive…”, despite clearly being provided a desktop screenshot

GPT made several suggestions for things the site had already implemented, e.g., suggesting to implement breadcrumbs for REI’s product page, when the site clearly already has them

GPT made suggestions for unrelated pages, e.g., for Argos’ product page, one of the suggestions was: “While not visible in the current screenshot, ensuring the checkout process is as streamlined and straightforward as possible can further improve the user experience”

Some of GPT’s suggestions were also so inaccurate that it’s unclear what is meant, e.g., GPT suggested for TireRack’s product details page that “Instead of static images, an image carousel for the tire and its features might be more engaging and informative” (the site already has a PDP image gallery, so this advice is prone to be interpreted as an auto-rotating carousel, which would be harmful)

GPT also made some outdated suggestions based on UX advice that used to be true but is no longer valid in 2023, due to an observed change in general user behavior (e.g., our 2023 UX research shows that users’ understanding of checkout inputs has improved greatly compared to what we have observed over the past 10 years)

In Summary: ChatGPT-4 Is Not (Yet) Useful for UX Auditing

In the UX world, large language models like ChatGPT are increasingly proving to be indispensable work tools, e.g., for analyzing large datasets of qualitative customer support emails, brainstorming sales page copy ideas, and transcribing videos. It’s also impressive that this technology can now interpret a screenshot of a webpage.

However, when it comes to having GPT-4 help with performing UX audits, we would caution against it because:

Of the 12 webpages tested, GPT-4 on average correctly identified 2.9 UX issues, but then overlooked 18.5 UX issues on the live webpage and 9.4 UX issues in the screenshot of the webpage, made 1.3 suggestions that would be harmful to UX, and gave 10.6 suggestions that would be a waste of time.

This means that ChatGPT-4 is very far from human performance (assuming the human is skilled in UX, has 2–10 hours per page type, and uses a comprehensive database of UX heuristics).

It’s especially the combination of ChatGPT-4’s low discoverability rate of UX issues and the low accuracy rate of its suggestions that’s problematic. Even as a “quick free UX check”:

It simply will not find enough of the UX issues present on the live webpage or in the screenshot of the webpage (a 14% and 26% discoverability rate, respectively)

Additionally, work is required to parse its response to identify the correct UX suggestions among the erroneous ones. With an accuracy rate of 20%, this is not “a useful supplemental tool that will give you a fifth of UX issues for free”, as it will still require a human UX professional to do the parsing

Furthermore, considering the cost of implementing any website change (design and development), basing website changes on a UX audit of this low quality is bound to yield a poor overall return on investment.

Getting access: our research findings are available in the 650+ UX guidelines in Baymard Premium – get full access to learn how to create a “State of the Art” e-commerce user experience.

If you want to know how your desktop site, mobile site, or app performs and compares, then learn more about getting Baymard to conduct a UX Audit of your site or app.
