ChatGPT, Bard, or Bing Chat? Differences Among 3 Generative-AI Bots

November 12, 2023

Summary: Participants rated Bing Chat as less helpful and trustworthy than ChatGPT or Bard. These results can be attributed to Bing’s richer yet imperfect UI and to its poorer information aggregation.

One of the benefits of the generative-AI bots is that they shortcut the task of information foraging. They aggregate pertinent information from multiple sources for users — saving them the effort of inspecting different web pages, extracting relevant information, and then combining it into a coherent answer.

In a diary study conducted with three bots, we found that people rated the conversations with these bots as highly helpful and trustworthy. However, there were some differences in the ratings for the three bots, due to their different capabilities and interfaces.


Our Research

We ran a diary study with 18 participants: 8 used the newest version of ChatGPT (4.0), 5 used Bard, and 5 used Bing Chat. The participants had various levels of experience with the chatbots: some had used them before, some had used one bot but tested another in the study, and others had heard about them but had not used them.

Participants logged all their conversations with the bots over a period of approximately 2 weeks. At the end of the diary study, 14 participants were invited for in-depth interviews. The study was conducted in May and June 2023.

The Three Bot Interfaces

The three bots we studied had different user interfaces and capabilities.

ChatGPT

ChatGPT did not have access to the Internet and provided primarily textual information as output. It automatically saved conversation history, allowing users to revisit previous interactions with the bot. At the time of the study, none of the other bots provided this capability in a consistent manner. (Bing Chat’s history was available only to some users.)

Bard and Bing Chat

Unlike ChatGPT, Bard and Bing Chat were able to return multimedia in their responses, which included links and images. In addition, Bing Chat was capable of embedding videos directly in its responses.

Bing Chat also provided sources for some of its answers and suggested additional followup questions to the users. At the time of the study, it was also the only bot that had image-generation capabilities.

Functionality and UI Features of the 3 Generative-AI Bots

Feature                      | Bard                      | Bing Chat                                  | ChatGPT
Text generation              | Yes                       | Yes                                        | Yes
Image generation             | No                        | Yes                                        | No
Output format                | Images, links, text       | Images, links, text, videos                | Text
Access to Internet           | Yes                       | Yes                                        | No
References                   | No                        | In-text footnotes/links & Learn more links | No
Suggested followup questions | No                        | Yes                                        | No
Chat history                 | No (at the time of study) | Limited users                              | Yes
Ads                          | No                        | Yes                                        | No

Helpfulness and Trustworthiness Ratings

Bing Chat’s helpfulness rating was significantly lower than those of Bard (p < 0.001) and ChatGPT (p = 0.006). Bard was also rated as more helpful than ChatGPT (p = 0.03; however, with a Bonferroni correction, this difference is only marginally significant).

The bots also had some differences in trustworthiness ratings: Bard and ChatGPT were perceived as more trustworthy than Bing Chat (p < 0.002). There was no difference in the trustworthiness perception between Bard and ChatGPT.
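For readers unfamiliar with the Bonferroni correction mentioned above, the arithmetic behind the "marginally significant" caveat is simple: with three pairwise comparisons, the conventional 0.05 threshold is divided by three. The short Python sketch below illustrates only that arithmetic; it is not the analysis script used in this study, and the assumption of three pairwise comparisons is ours, based on the three bots compared.

    # Illustrative sketch of a Bonferroni correction (not the study's analysis code)
    alpha = 0.05                    # conventional significance threshold
    n_comparisons = 3               # assumed: Bard vs. Bing, ChatGPT vs. Bing, Bard vs. ChatGPT
    corrected_alpha = alpha / n_comparisons   # 0.05 / 3, roughly 0.0167

    p_bard_vs_chatgpt = 0.03        # reported p-value for the Bard vs. ChatGPT helpfulness difference
    print(p_bard_vs_chatgpt < alpha)            # True: clears the uncorrected threshold
    print(p_bard_vs_chatgpt < corrected_alpha)  # False: does not clear the corrected threshold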

Why Was Bing Chat Rated Lower?

It is surprising that Bing Chat was rated lower than ChatGPT and Bard, especially since Bing Chat and ChatGPT both use OpenAI’s GPT.

We believe there are two big reasons for Bing’s poorer ratings:

Poor information foraging: Broad answers that did not always perform information aggregation or performed it only at the surface level

User-interface issues: A UI with potentially useful but poorly executed elements that did not support users well enough and sometimes distracted them from the task at hand

Poor Information Foraging

Information foraging is the behavior that users engage in whenever they need to satisfy an information need on the web. It involves:

finding potential sources of information (often with the help of a search engine)

evaluating them and picking the most promising ones

aggregating the information from those sources and making sense of it

That last step, aggregation of information, is present in many (but not all) user tasks. In simple tasks such as finding an address or specific website, that step may be absent. But in many other tasks, from shopping online to researching a new technology or device, information aggregation is essential.

For example, when shopping, we often see people save multiple candidate products (sometimes in different browser tabs) and then review all of these to decide which are best for their needs. Or, in research tasks, users often go to multiple sites, extract information from each (often by copying and pasting it into a file or some other form of external memory), then revisit and combine all the gathered information in order to make a decision or reach a conclusion.

One of the major advantages of AI bots over traditional search engines is that they can do the entire task of information foraging (including the aggregation of information) for the user. Much of Bing Chat’s lower rating is explained by the fact that it does not always perform information aggregation, or performs it only at the surface level.

Several users complained that, instead of providing solid answers to their questions, Bing Chat sent them to webpages where they could look up the answers for themselves. Thus, it was still the user’s job to combine the different pieces of information — which is exactly what search engines require. Participants felt that Bing Chat’s response was no better than what a search engine would provide.

For instance, one participant looking for chainsaw recommendations complained that Bing Chat’s response contained no detail:

I feel like it took some prodding. When I said I wanted to buy a chainsaw, its first response was ‘here are four chainsaws from consumer reports’ with no additional information. I feel like it could have tried to gather more information, like price or features I was looking for.

Another study participant asked Bing Chat for the best way to cook a steak. He had hoped that the bot would aggregate the best methods and give him the pros and cons of each one. Instead, it provided a bland list of four methods, with no additional information. He had to navigate to a Learn more link provided in the answer. That page answered his question perfectly. The participant rated the helpfulness of this answer as 4 (out of 7) and commented:

I understand that is a subjective question, but it responded with four answers without giving the pros and cons to them. It also did not explain how to do anything […] It just included links to different websites to go read. One of the links was what I was expecting the answer to be. Four best ways to cook a steak with how to do it and the pros and cons. It was a good Bing search result, but not great chat experience.

In contrast, when asked a similar “the best way to do…” question, Bard did a better job of aggregating relevant information.

For example, when a participant asked for the best ways to tread water, Bard provided her with several methods and included images showing the movements. It concluded with tips for treading water effectively. She was highly satisfied with the improvement in efficiency provided by Bard:

It gave me sufficient information with all the tips I needed. It gave me a quick answer to simple question without digging through internet for the best information.

When other general issues (such as inaccurate links and broad answers that ignored the context) were intertwined with poor information aggregation, people became extremely frustrated with the bot, as illustrated by the following quote from one of our participants:

Here’s the answer to your question, but you’re gonna have to go over here to get the specifics of what you want. And [Bing Chat] doesn’t put it all in front of you. It sends you someplace else to get the answer. It’s […] like if you ask a librarian […] ‘What is the book Seven Wonders of the World about?’ And she says, ‘Okay, go, go down the second aisle, look up on the top left shelf, you, you’re gonna see the book, […] as well as the other works from that author.’

People tried to come up with theories about what may cause the bot to perform poorly. One person decided that Bing was not good at finding current or local information but was okay with more general queries. Another participant described Bing Chat as unpredictable. He was especially frustrated when he found that Bing Chat performed even worse than a search engine on some occasions. He commented:

Bing Chat is, I’d say, hit or miss because I can never really predict what I’m going to get […] There were some chats that I thought it did a very good job and it […] even got me to ask questions that originally […] I wasn’t going to ask. […] But then there were others where it really didn’t. And the funny thing was I couldn’t predict.

User-Interface Issues

Bing Chat had the richest user interface: it had a lot of features (e.g., references, suggested followup queries) that were not present in the other bots’ interfaces. We believe that, ironically, that fact contributed to its lower ratings.

Whereas, in theory, many of these elements could be useful additions, they were often imperfectly executed and, instead of helping the user, they got in the way. This result emphasizes the importance of user experience in the design of AI bots.

Across all bots, people interacted with some of the other UI elements available (other than the chat) in 33.64% of conversations (95% confidence interval: 29.32% to 38.28%). ChatGPT and Bard had relatively sparse interfaces compared with Bing Chat, so it is not surprising that those few additional features were not used much (24.74% of conversations for ChatGPT and 31.86% for Bard); these percentages were both significantly lower than Bing Chat’s, which was 50% (p < 0.004). Many of the interactions with ChatGPT’s or Bard’s UI involved the thumbs-up or thumbs-down buttons, which gave feedback on a conversation.
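The article does not state which method produced that confidence interval or the exact number of logged conversations. For readers who want to compute a similar interval for their own data, the sketch below shows one common approach (the Wilson score interval) with assumed, illustrative counts; it is a minimal example, not a reproduction of this study’s computation.

    import math

    def wilson_ci(successes, n, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        p_hat = successes / n
        denom = 1 + z**2 / n
        center = (p_hat + z**2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
        return center - half, center + half

    # Assumed, illustrative counts; the study's actual conversation totals are not given here.
    successes, n = 150, 450
    lo, hi = wilson_ci(successes, n)
    print(f"{successes / n:.2%} of conversations (95% CI: {lo:.2%} to {hi:.2%})")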

In what follows, we discuss the issues that participants encountered with Bing-specific interface elements.

References

At the time of the study, Bing Chat was the only bot that provided sources for the different pieces of information in its answers. Sources were linked in the text and also listed in the Learn more section below the answer.

References are an extremely valuable feature for AI chatbots. They help users understand where the synthesized information originated, which is necessary to determine how much it should be trusted. However, in Bing’s case, the presence of sources sometimes contributed to the lower ratings: if the sources did not seem relevant or specific enough, they reflected poorly on the answer as a whole.

For instance, one participant was annoyed that the first source that Bing Chat provided for the question what should I know about having a baby was from a Canadian source. He said:

I did not like that the advice in first answer was to consult the Public Health Agency of Canada. I live in the United States so I would want to hear advice from a US agency or site as there could be differences in healthcare or services or policies.

Another person, who wanted to learn about the Chichijima incident and George H.W. Bush, was annoyed that the sources were not specific enough:

Although I found the links provided for followup, I can’t give it a higher rating because it led me to sites that were more about the War, not the incident [Chichijima Incident] in itself.

Sources can be less important for users when the question is simple and has a clear, unique answer. One user noted that she was more interested in sources and links for broad, research-like questions, where she did not know the knowledge space well (for example, learning about clouds with her kids), but she was less likely to consult them for specific questions that had a clear answer (e.g., the address of a business or the author of a book).

Our finding does not mean that designers should remove sources from their AI interfaces — they’re necessary for users to verify answers and find more information. It only means that sources (like all other UI elements) need to be well tested and designed, so that they are displayed in a way that is easily accessible and people can find them when needed.

In-Answer Links

Aside from references, Bing Chat’s answers (as well as Bard’s) could also include links to other websites, in response to queries that asked for such links (for example, product or site recommendations).

In 36.84% of the Bing Chat conversations, participants clicked on a provided link, compared to only 14.68% for Bard. This difference was statistically significant (p < 0.0001).
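The article does not say which statistical test was used for comparisons like this one. One common choice for comparing two proportions is a two-proportion z-test; the sketch below is a generic illustration of that test with assumed, illustrative counts, not a reproduction of the study’s analysis.

    import math

    def two_proportion_z_test(x1, n1, x2, n2):
        """Two-sided two-proportion z-test (illustrative only)."""
        p1, p2 = x1 / n1, x2 / n2
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal distribution
        return z, p_value

    # Assumed counts chosen only to illustrate the calculation.
    z, p = two_proportion_z_test(x1=74, n1=200, x2=30, n2=200)
    print(f"z = {z:.2f}, p = {p:.4g}")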

Both Bing Chat and Bard occasionally provided incorrect links that were either no longer current or did not contain the information they claimed to contain. (For Bing, incorrect or broken links also caused some of the dissatisfaction with references that was discussed above.)

One participant was looking for things to do on a Friday night in Nashville. Bing Chat failed to provide any results at first, only listing a few websites with no information about any of them. She rephrased the question several times and asked for free events instead. The bot finally provided her with a few free event names and links to various sites. When she followed the links, she discovered that the events were, in fact, not free. At that point, she gave up chatting with the bot.

Similarly, a Bard participant looking for perfume suggestions discovered that all the stores where the bot said she could find the perfumes either did not exist or were closed.

There were many such examples for both Bing Chat and Bard. However, by the sheer fact that Bing participants tended to click on links more, they were more likely to encounter issues.

Suggested Followup Queries

Bing Chat also offered users suggestions for followup questions. Generally, users found these helpful because they bridged the articulation barrier and helped them speed up the process of satisfying their information need. These questions were especially useful when they helped the user understand the structure of the information space: what they didn’t know they should know. As one user put it:

[Bing Chat] provided follow up questions that […were] either […] word for word what I was going to ask next […] or, even better, […] a question that I hadn’t thought of but really wanted to know.

For example, one user who was expecting his first child asked what to do to support his wife during labor. Bing Chat helped him discover several things he was unaware of or he hadn’t thought of:

I asked what to do to support your wife during labor. I was picturing or thinking of the actual delivery, and this answer seemed to focus on when she goes into labor at home and what to do. I hadn’t really thought about that, so that was very helpful. I liked the provided followup question ‘What to bring to the hospital?’ That was more along the lines of what I was originally thinking, and it provided a good list. It then gave two follow up questions that I liked. ‘What should I pack for the baby’ and ‘What is a birth plan?’ I chose what is a birth plan, because I had absolutely no idea. The next response provided two questions I was interested in. Pain-management options and postpartum options. I thought this thread was very informative and gave great options for continuing the conversation and discovery.

A particularly helpful type of followup question is one that requests that the answer be made shorter or longer. This supports the accordion-editing behavior, especially for creation tasks, in which the bot must come up with a text or a list of items.

In one instance though, where […] it said like make it shorter, I was like, ‘oh that’s actually a helpful button to have’ […]. And, also, kind of prompted [by] my response ‘I have to make it shorter’ was ‘oh could you just make that kind of something in between’ and it was able to do that in that instance. So that was helpful.

Followup questions were generally well received, but unfortunately they had a few major issues. Respondents reported that they were sometimes:

Too basic

Too similar to the original question

Not persistent

Too basic. Sometimes suggested followup questions picked up on terms in the users’ question but not on their real information need. For example, they would suggest asking for the definition of a word or for something that was only tangential to the topic of the conversation.

When a study participant used Bing Chat to help her refine her resume, she commented that she didn’t need a followup question for a definition of a medical term in her resume. This happened to other participants, as well:

But sometimes it would just be like, basically what is the definition of this word? And that, I don’t know, I feel like is a waste of a followup question for me. […] You could Google that on your phone. I dunno, I didn’t need the definition ones.

I did use some of the [suggested followup] questions at the bottom, which were, a lot of times […] something silly […] like, could you tell me about the pyramids or something? And, I’m like, that’s not relevant to this conversation but thank you.

Too similar to the original question. Such questions did not broaden the scope of the conversation and yielded almost identical answers. For example, the participant looking for events in Nashville tried one of Bing Chat’s followup questions (What are some popular free events in Nashville?). This question was very similar to her previous prompt, Are there any free events happening in Nashville this weekend, and gave her links to the same sources.

Not persistent. The followup questions changed after each new response and the user was not able to return to them and select a suggested question from a previous list.

Sometimes the bot offered really good questions, but people could only select one. If they wanted to come back to another question that they had seen before, that was no longer available. The user would have to remember and type the whole followup question.

For example, one participant who was trying to figure out why his cat was coughing up hair balls recalled:

It did a good job providing follow up questions and I clicked on the first one provided. After the response to that question it provided two good follow up questions, "What kind of diet should I feed my cat?" and "How often should I groom my cat?". I clicked on the first one, and after I read the answer, I went back to click on the second and it was not there anymore. I asked the question myself anyway, but maybe those provided possible questions should stay in case someone wants to go back and get that answered as well.

One participant liked two suggested followup queries provided by Bing Chat, but the second question disappeared after he selected the first. He tried to scroll back to refind the second question. Unable to find it anywhere, he ended up typing the question on his own.

Multimedia Components

Unlike ChatGPT, whose answers were text-only, 89.56% of the Bing conversations and 46.01% of the Bard conversations included multimedia elements, such as:

Videos

Pictures

Contextual information panels (e.g., news articles, maps, products)

(This difference was statistically significant at p < 0.0001.) The multimedia elements were generally perceived positively. For instance, the videos often supplemented the text answer and were particularly useful when the queries requested instructions about a particular process (e.g., how to serve at volleyball).

However, multimedia components presented in different formats can sometimes cause the following issues:

Aggravate the fear of losing the context

Prevent users from quickly getting to the main point (especially true for media content, such as long videos)

Don’t translate well on mobile devices

Lose the Context

Losing the chat is a fear that many people have when chatting on any website — whether the chatbot is powered by generative AI or not. A participant summarized this feeling for us:

I don’t know if you feel the same way: it’s one of the most annoying things when you click on something and it opens a new page for you and it’s like, I don’t wanna lose where I am, but I also don’t want to be directed to like 30 other places when I’m trying to accomplish something.

While rich external links invite users to perform more followup actions on Bing Chat (70% for Bing vs. 51% for Bard and 43% for ChatGPT; p < 0.002 for both comparisons), they can increase the fear of losing the chat, especially when users don’t know which links will direct them to a different site and which will open in an overlay.

One participant was reluctant to play the videos within the answers provided by Bing Chat at the very beginning, because she didn’t know how the video would be displayed and whether she would lose her conversation. She was relieved when she discovered that the video player was contained within the chat:

[The video interface] was almost too simple. I worried about navigating away from the chat would be like, okay, if I go back it’s gonna have lost its place and where it was talking to me and especially with the video feature. So I, I did enjoy that it was like kind of self-contained within the chat.

Fail to Support Scanning

Videos require users to process information sequentially, which prevents them from scanning the main content quickly, as they would with text or images. Thus, while it’s nice that Bing Chat would provide a list of videos below its answers, it could be hard for users to decide which ones were the most helpful to their questions based solely on the names of the videos. (The list of videos is another example of failed information aggregation — instead of summarizing and pointing to a single video, the user must go through each of them as they would on a search-engine results page.)

Don’t Translate Well on Mobile

Richer elements can challenge users more when presented on mobile because the screen space is limited. One participant described the mobile interface of Bing as cluttered, because there were too many things competing for her attention and too many buttons placed close to the input field (the broom button, the voice-input button, the input field, and suggested followup queries). Sometimes, specific components would not load properly on mobile.

Furthermore, the overall experience of using Bing Chat on mobile was more error-prone, as people could accidentally submit the query before they had finished typing. When this happened, the participant would have to resubmit the query.

Occasionally, the bot would assume that the participant had initiated a new conversation and it would lose context:

And there were quite a few instances […] where I […] accidentally sent something before I was […] ready to have it sent and I wanted to provide more context. And, then, when I provided more context, it was like, […] ‘oh you’re starting a new topic,’ and it doesn’t connect back to the previous message.

Ads

One other element that impaired the experience of Bing Chat participants was ads. Overall, 15.65% of its answers contained an ad. (None of the other chatbots included ads.)

Participants had a mixed attitude towards the ads. They were okay with them when they searched for a product or when the ads were highly relevant to their queries. They were annoyed when the ads were irrelevant, too prevalent, or too intrusive, even as they understood their purpose and acknowledged their legitimacy.

For instance, one participant used Bing Chat as a way to explore nursery-lamp options he could buy for his wife. He was satisfied with the whole experience (including the ads) because the chat helped him find and purchase a beautiful white floor lamp from Pottery Barn. Similarly, another participant, who researched bullet-train ticket prices to prepare for her upcoming Japan trip, was okay with the promoted ticket-purchase links below the answer.

While ads were okay when people were searching for specific information, they were generally perceived as annoying in broader research-oriented activities. This finding is consistent with our previous studies of research-oriented information-seeking activities.

One participant asked for life-insurance advice and received 2 ads from AARP (American Association of Retired Persons). He was annoyed by the ads, especially because he knew his age would not qualify him for AARP insurance. What’s worse, the bot failed to provide a helpful answer. He commented:

It didn’t really provide any in depth help [with life insurance advice], and instead of providing alternative places to read, gave me ads for insurance companies.

Another participant asked about the website Clutch (a site for finding agencies that specialize in a variety of website-related services), but the promoted ads were about clutch kits, which she didn’t like at all.

During our study, the mostly misplaced and overwhelming ads displayed by Bing Chat generally left a negative impression on the participants. They commented:

Well, I, I mean, so I asked a question about […] vegan food or about coding; [it] doesn’t mean I necessarily now want to be pummeled with opportunities to buy vegan food or coding courses, but that that is in fact the outcome of the Bing interface.

I mean, it, it’s, it disappointing […] it’s clear that they’re presenting me with things that are in fact relevant to me and it’s a prequalifier for their revenue model to send me to places where they’ve got people paying for ads […] It’s quite […] a marketing operation; as a utility, as a resource [it is] kind of distasteful.

User Experience: Essential for the Design of Successful AI

Overall, Bing Chat had the poorest helpfulness and trustworthiness ratings compared with ChatGPT and Bard. There were two big reasons why: its poorer information aggregation and its faultier interface.

The poor information aggregation is something that AI researchers can and should fix. But the faultier interface concerns us — UX professionals.

Bing had the most complex interface, with the most features, yet it got dinged for it. Does that mean that we are better off if the AI design includes no or very few UI elements (like that of ChatGPT)?

The answer is a resounding no. References, in-answer links, suggested followup questions, and multimedia components (like videos, images, and other types of information panels) are all good, necessary features. They help users make sense of the answer received from the bot. They also help them act upon the information. As AI becomes pervasive and people use it to engage in more complex and varied tasks, these features will become indispensable.

What our finding means is that these additional UI elements need to be well designed and tested with many different users and tasks, so that they do not get in the way. The idea is good, but the execution needs to improve.

Best Practices and Design Recommendations for Generative-AI Chatbots

Designers of generative-AI bots can learn from Bing Chat’s experience and follow these best practices:
