ChatGPT, Bard, or Bing Chat? Differences Among 3 Generative-AI Bots
One of the benefits of generative-AI bots is that they shortcut the task of information foraging. They aggregate pertinent information from multiple sources for users — saving them the effort of inspecting different web pages, extracting relevant information, and then combining it into a coherent answer.
In a diary study conducted with three bots, we found that people rated the conversations with these bots as highly helpful and trustworthy. However, there were some differences in the ratings for the three bots, due to their different capabilities and interfaces.
We ran a diary study with 18 participants: 8 used the newest version of ChatGPT (4.0), 5 used Bard, and 5 used Bing Chat. The participants had various levels of experience with the chatbots: some had used them before, some had used one bot but tested another in the study, and others had heard about them but had not used them.
Participants logged all their conversations with the bots over a period of approximately 2 weeks. At the end of the diary study, 14 participants were invited for in-depth interviews. The study was conducted in May and June 2023.
The Three Bot Interfaces
The three bots we studied had different user interfaces and capabilities.
ChatGPT
ChatGPT did not have access to the Internet and provided primarily textual information as output. It automatically saved conversation history, allowing users to revisit previous interactions with the bot. At the time of the study, none of the other bots provided this capability in a consistent manner. (Bing Chat’s history was available only to some users.)
Bard and Bing Chat
Unlike ChatGPT, Bard and Bing Chat were able to return multimedia in their responses, which included links and images. In addition, Bing Chat was capable of embedding videos directly in its responses.
Bing Chat also provided sources for some of its answers and suggested additional followup questions to the users. At the time of the study, it was also the only bot that had image-generation capabilities.
Functionality and UI Features of the 3 Generative-AI Bots

| Feature | Bard | Bing Chat | ChatGPT |
|---|---|---|---|
| Text generation | Yes | Yes | Yes |
| Image generation | No | Yes | No |
| Output format | Images, links, text | Images, links, text, videos | Text |
| Access to Internet | Yes | Yes | No |
| References | No | In-text footnotes/links & Learn more links | No |
| Suggested followup questions | No | Yes | No |
| Chat history | No (at the time of study) | Limited users | Yes |
| Ads | No | Yes | No |
Helpfulness and Trustworthiness Ratings
Bing Chat’s helpfulness rating was significantly lower than those of Bard (p < 0.001) and ChatGPT (p = 0.006). Bard was also rated as more helpful than ChatGPT (p = 0.03; however, with a Bonferroni correction, this difference is only marginally significant).
The bots also had some differences in trustworthiness ratings: Bard and ChatGPT were perceived as more trustworthy than Bing Chat (p < 0.002). There was no difference in the trustworthiness perception between Bard and ChatGPT.
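To see why the Bard-vs.-ChatGPT difference becomes only marginal after correction, here is a minimal Python sketch of a Bonferroni adjustment applied to the three pairwise helpfulness p-values reported above. The p-values come from our results; treating the reported "p < 0.001" as 0.001 is a conservative stand-in, and the 0.05 alpha level is the conventional assumption.

```python
# Bonferroni correction for the three pairwise helpfulness comparisons.
# Raw p-values are taken from the study results reported in the text.
raw_p = {
    ("Bing Chat", "Bard"): 0.001,     # reported as p < 0.001
    ("Bing Chat", "ChatGPT"): 0.006,
    ("Bard", "ChatGPT"): 0.03,
}

alpha = 0.05          # conventional significance level
m = len(raw_p)        # number of comparisons (3)

for (bot_a, bot_b), p in raw_p.items():
    # Bonferroni: multiply each raw p-value by the number of tests
    # (capped at 1.0), then compare against the original alpha.
    adjusted = min(p * m, 1.0)
    verdict = "significant" if adjusted < alpha else "not significant"
    print(f"{bot_a} vs. {bot_b}: raw p = {p}, adjusted p = {adjusted:.3f} ({verdict})")
```

After adjustment, the two Bing Chat comparisons remain well below 0.05, while Bard vs. ChatGPT becomes 0.03 × 3 = 0.09, which is why that difference is described as only marginally significant.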
Why Was Bing Chat Rated Lower?
It is surprising that Bing Chat was rated lower than ChatGPT and Bard, especially since Bing Chat and ChatGPT both use OpenAI’s GPT.
We believe there are two big reasons for Bing’s poorer ratings:

Poor information foraging: Broad answers that did not always perform information aggregation or performed it only at the surface level
User-interface issues: A UI with potentially useful but poorly executed elements that did not support users well enough and sometimes distracted them from the task at hand
Poor Information Foraging
Information foraging is the behavior that users engage in whenever they need to satisfy an information need on the web. It involves:
finding potential sources of information (often with the help of a search engine)
evaluating them and picking the most promising ones
aggregating the information from those sources and making sense of it
That last step, aggregation of information, is present in many (but not all) user tasks. In simple tasks such as finding an address or specific website, that step may be absent. But in many other tasks, from shopping online to researching a new technology or device, information aggregation is essential.
For example, when shopping, we often see people save multiple candidate products (sometimes in different browser tabs) and then review all of these to decide which are best for their needs. Or, in research tasks, users often go to multiple sites, extract information from each (often by copying and pasting it into a file or some other form of external memory), then revisit and combine all the gathered information in order to make a decision or reach a conclusion.
One of the major advantages of AI bots over traditional search engines is that they can do the entire task of information foraging (including the aggregation of information) for the user. Much of Bing Chat’s lower rating is explained by the fact that it did not always perform information aggregation, or performed it only at the surface level.
Several users complained that, instead of providing solid answers to their questions, Bing Chat sent them to webpages where they could look up the answers for themselves. Thus, it was still the user’s job to combine the different pieces of information — which is exactly what search engines require. Participants felt that Bing Chat’s response was no better than what a search engine would provide.
For instance, one participant looking for chainsaw recommendations complained that Bing Chat’s response contained no detail:
I feel like it took some prodding. When I said I wanted to buy a chainsaw, its first response was ‘here are four chainsaws from consumer reports’ with no additional information. I feel like it could have tried to gather more information, like price or features I was looking for.
Another study participant asked Bing Chat for the best way to cook a steak. He had hoped that the bot would aggregate the best methods and give him the pros and cons of each one. Instead, it provided a bland list of four methods, with no additional information. He had to navigate to a Learn more link provided with the answer. That page answered his question perfectly. The participant rated the helpfulness of this answer as 4 (out of 7) and commented:
I understand that is a subjective question, but it responded with four answers without giving the pros and cons to them. It also did not explain how to do anything […] It just included links to different websites to go read. One of the links was what I was expecting the answer to be. Four best ways to cook a steak with how to do it and the pros and cons. It was a good Bing search result, but not great chat experience.
In contrast, when asked a similar “the best way to do…” question, Bard did a better job of aggregating relevant information.
For example, when a participant asked the best ways to tread water, Bard provided her with several methods and included images showing the movements. It concluded with tips for treading water effectively. She was highly satisfied with the improvement in efficiency provided by Bard:
It gave me sufficient information with all the tips I needed. It gave me a quick answer to simple question without digging through internet for the best information.
When other general issues (such as inaccurate links and broad answers that ignored the context) were intertwined with poor information aggregation, people became extremely frustrated with the bot, as illustrated by the following quote from one of our participants:
Here’s the answer to your question, but you’re gonna have to go over here to get the specifics of what you want. And [Bing Chat] doesn’t put it all in front of you. It sends you someplace else to get the answer. It’s […] like if you ask a librarian […] ‘What is the book Seven Wonders of the World about?’ And she says, ‘Okay, go, go down the second aisle, look up on the top left shelf, you, you’re gonna see the book, […] as well as the other works from that author.’
People tried to come up with theories about what might cause the bot to perform poorly. One person decided that Bing was not good at finding current or local information but was okay with more general queries. Another participant described Bing Chat as unpredictable. He was especially frustrated when he found that Bing Chat performed even worse than a search engine on some occasions. He commented:
Bing Chat is, I’d say, hit or miss because I can never really predict what I’m going to get […] There were some chats that I thought it did a very good job and it […] even got me to ask questions that originally […] I wasn’t going to ask. […] But then there were others where it really didn’t. And the funny thing was I couldn’t predict.
User-Interface Issues
Bing Chat had the richest user interface: it had many features (e.g., references, suggested followup queries) that were not present in the other bots’ interfaces. We believe that, ironically, this fact contributed to its lower ratings.
Whereas, in theory, many of these elements could be useful additions, they were often imperfectly executed and, instead of helping the user, they got in the way. This result emphasizes the importance of user experience in the design of AI bots.
Across all bots, people interacted with some of the UI elements available (other than the chat) in 33.64% of conversations (95% confidence interval: 29.32% to 38.28%). ChatGPT and Bard had relatively sparse interfaces compared with Bing Chat, so it is not surprising that their few additional features were not used much (24.74% of conversations for ChatGPT and 31.86% for Bard; both percentages were significantly lower than Bing Chat’s 50%, p < 0.004). Many of the interactions with ChatGPT’s or Bard’s UI involved the thumbs-up or thumbs-down buttons, which gave feedback on a conversation.
In what follows, we discuss the issues that participants encountered with Bing-specific interface elements.
References
At the time of the study, Bing Chat was the only bot that provided sources for the different pieces of information in its answers. Sources were linked in the text and also listed in the Learn more section below the answer.
References are an extremely valuable feature for AI chatbots. They help users understand where the synthesized information originated, which is necessary to determine how much it should be trusted. However, in Bing’s case, the presence of sources sometimes contributed to the lower ratings: if the sources did not seem relevant or specific enough, they reflected poorly on the quality of the answer.
For instance, one participant was annoyed that the first source that Bing Chat provided for the question what should I know about having a baby was Canadian. He said:
I did not like that the advice in the first answer was to consult the Public Health Agency of Canada. I live in the United States so I would want to hear advice from a US agency or site as there could be differences in healthcare or services or policies.
Another person who wanted to learn about the Chichijima incident and George H.W. Bush was annoyed that the sources were not specific enough:
Although I found the links provided for followup, I can’t give it a higher rating because it led me to sites that were more about the War, not the incident [ Chichijima Incident] in itself.
Sources can be less important for users when the question is simple and has a clear, unique answer. One user noted that she was more interested in sources and links for broad, research-like questions, where she did not know the knowledge space well (for example, learning about clouds with her kids), but she was less likely to consult them for specific questions that had a clear answer (e.g., the address of a business or the author of a book).
Our finding does not mean that designers should remove sources from their AI interfaces — they’re necessary for users to verify answers and find more information. It only means that sources (like all other UI elements) need to be well designed and tested, so that they are easily accessible and people can find them when needed.
In-Answer Links
Aside from references, Bing Chat’s answers (as well as Bard’s) could also include links to other websites, in response to queries that asked for such links (for example, product or site recommendations).
In 36.84% of the Bing Chat conversations, participants clicked on a provided link, compared to only 14.68% for Bard. This difference was statistically significant (p < 0.0001).
Both Bing Chat and Bard occasionally provided incorrect links that were either no longer current or did not contain the information they claimed to contain. (For Bing, incorrect or broken links also caused some of the dissatisfaction with references discussed above.)
One participant was looking for things to do on a Friday night in Nashville. At first, Bing Chat failed to provide any results, only listing a few websites with no information about any of them. She rephrased the question several times and asked for free events instead. The bot finally provided her with a few free event names and links to various sites. When she followed the links, she discovered that the events were, in fact, not free. At that point, she gave up chatting with the bot.
Similarly, a Bard participant looking for perfume suggestions discovered that all the stores where the bot said the perfumes could be found either did not exist or were closed.
There were many such examples for both Bing Chat and Bard. However, because Bing participants tended to click on links more, they were more likely to encounter these issues.
Suggested Followup Queries
Bing Chat also offered users suggestions for followup questions. Generally, users found these helpful because they bridged the articulation barrier and helped them speed up the process of satisfying their information need. These questions were especially useful when they helped the user understand the structure of the information space: what they didn’t know they should know. As one user put it:
[Bing Chat] provided follow up questions that […were] either […] word for word what I was going to ask next […] or, even better, […] a question that I hadn’t thought of but really wanted to know.
For example, one user who was expecting his first child asked what to do to support his wife during labor. Bing Chat helped him discover several things he was unaware of or hadn’t thought of:
I asked what to do to support your wife during labor. I was picturing or thinking of the actual delivery, and this answer seemed to focus on when she goes into labor at home and what to do. I hadn’t really thought about that, so that was very helpful. I liked the provided followup question ‘What to bring to the hospital?’ That was more along the lines of what I was originally thinking, and it provided a good list. It then gave two follow up questions that I liked. ‘What should I pack for the baby’ and ‘What is a birth plan?’ I chose what is a birth plan, because I had absolutely no idea. The next response provided two questions I was interested in. Pain-management options and postpartum options. I thought this thread was very informative and gave great options for continuing the conversation and discovery.
A particularly helpful type of followup question is one that requests that an answer be made shorter or longer. This type supports the accordion-editing behavior, especially for creation tasks, in which the bot must come up with a text or a list of items.
In one instance though, where […] it said like make it shorter, I was like, ‘oh that’s actually a helpful button to have’ […]. And, also, kind of prompted [by] my response ‘I have to make it shorter’ was ‘oh could you just make that kind of something in between’ and it was able to do that in that instance. So that was helpful.
Followup questions were generally well received, but unfortunately they had a few major issues. Respondents reported that they were sometimes:
Too basic. Sometimes suggested followup questions picked up on terms in the users’ question but not on their real information need. For example, they would suggest asking for the definition of a word or for something that was only tangential to the topic of the conversation.
When a study participant used Bing Chat to help her refine her resume, she commented that she didn’t need a followup question for a definition of a medical term in her resume. This happened to other participants, as well:
But sometimes it would just be like, basically what is the definition of this word? And that, I don’t know, I feel like is a waste of a followup question for me. […] You could Google that on your phone. I dunno, I didn’t need the definition ones.
I did use some of the [suggested followup] questions at the bottom, which were, a lot of times […] something silly […] like, could you tell me about the pyramids or something? And, I’m like, that’s not relevant to this conversation but thank you.
Too similar to the original question. Such questions did not broaden the scope of the conversation and yielded almost identical answers. For example, the participant looking for events in Nashville tried one of Bing Chat’s followup questions (What are some popular free events in Nashville). This question was very similar to her previous prompt (Are there any free events happening in Nashville this weekend) and gave her links to the same sources.
Not persistent. The followup questions changed after each new response and the user was not able to return to them and select a suggested question from a previous list.
Sometimes the bot offered really good questions, but people could select only one. If they wanted to come back to another question that they had seen before, it was no longer available. The user would have to remember and type the whole followup question.
For example, one participant who was trying to figure out why his cat was coughing up hairballs recalled:
It did a good job providing follow up questions and I clicked on the first one provided. After the response to that question it provided two good follow up questions, "What kind of diet should I feed my cat?" and "How often should I groom my cat?". I clicked on the first one, and after I read the answer, I went back to click on the second and it was not there anymore. I asked the question myself anyway, but maybe those provided possible questions should stay in case someone wants to go back and get that answered as well.
One participant liked two suggested followup queries provided by Bing Chat, but the second question disappeared after he selected the first. He tried to scroll back to find the second question. Unable to find it anywhere, he ended up typing the question on his own.
Multimedia Components
Unlike ChatGPT, whose answers were text-only, 89.56% of the Bing conversations and 46.01% of the Bard conversations included multimedia elements, such as images, videos, and contextual information panels (e.g., news articles, maps, products). This difference was statistically significant (p < 0.0001). The multimedia elements were generally perceived positively. For instance, the videos often supplemented the text answer and were particularly useful when the queries requested instructions about a particular process (e.g., how to serve in volleyball).
However, multimedia components presented in different formats can sometimes cause the following issues:
Aggravate the fear of losing the context
Prevent users from quickly getting to the main point (especially true for media content, such as long videos)
Don’t translate well on mobile devices
Lose the Context
Losing the chat is a fear that many people have when chatting on any website — whether the chatbot is powered by generative AI or not. A participant summarized this feeling for us:
I don’t know if you feel the same way: it’s one of the most annoying things when you click on something and it opens a new page for you and it’s like, I don’t wanna lose where I am, but I also don’t want to be directed to like 30 other places when I’m trying to accomplish something.
While rich external links invited users to perform more followup actions on Bing Chat (70% of conversations for Bing vs. 51% for Bard and 43% for ChatGPT; p < 0.002 for both comparisons), they could increase the fear of losing the chat, especially when users didn’t know which links would direct them to a different site and which would open in an overlay.
One participant was initially reluctant to play the videos within the answers provided by Bing Chat, because she didn’t know how the video would be displayed and whether she would lose her conversation. She was relieved when she discovered that the video player was contained within the chat:
[The video interface] was almost too simple. I worried about navigating away from the chat would be like, okay, if I go back it’s gonna have lost its place and where it was talking to me and especially with the video feature. So I, I did enjoy that it was like kind of self-contained within the chat.
Fail to Support Scanning
Videos require users to process information sequentially, which prevents them from scanning the main content quickly as they would with text or images. Thus, while it’s nice that Bing Chat would provide a list of videos below the answers, it could be hard for users to decide which ones were the most helpful to their questions based solely on the names of the videos. (The list of videos is another example of failed information aggregation — instead of summarizing and pointing to a single video, the user must go through each of them like they would on a search-engine results page.)
Don’t Translate Well on Mobile
Richer elements can challenge users more when presented on mobile, because the screen space is limited. One participant described the mobile interface of Bing as cluttered, because there were too many things competing for her attention and too many buttons placed close to the input field (the broom button, the voice-input button, the input field, and suggested followup queries). Sometimes, specific components would not load properly on mobile.
Furthermore, the overall experience of using Bing Chat on mobile was more error-prone, as people could accidentally submit the query before they had finished typing. When this happened, the participant would have to resubmit the query.
Occasionally, the bot would assume that the participant had initiated a new conversation and it would lose context:
And there were quite a few instances […] where I […] accidentally sent something before I was […] ready to have it sent and I wanted to provide more context. And, then, when I provided more context, it was like, […] ‘oh you’re starting a new topic,’ and it doesn’t connect back to the previous message.
Ads
One other element that impaired the experience of Bing Chat participants was ads. Overall, 15.65% of its answers contained an ad. (None of the other chatbots included ads.)
Participants had a mixed attitude towards the ads. They were okay with them when they searched for a product or when the ads were highly relevant to their queries. They were annoyed when the ads were irrelevant, too prevalent, or too intrusive, even as they understood their purpose and acknowledged their legitimacy.
For instance, one participant used Bing Chat as a way to explore nursery-lamp options he could buy for his wife. He was satisfied with the whole experience (including the ads), because the chat helped him find and purchase a beautiful white floor lamp from Pottery Barn. Similarly, another participant, who researched bullet-train ticket prices to prepare for her upcoming Japan trip, was okay with the promoted ticket-purchase links below the answer.
While ads were okay when people were searching for specific information, they were generally perceived as annoying in broader research-oriented activities. This finding is consistent with our previous studies of research-oriented information-seeking activities.
One participant asked about life-insurance advice and received 2 ads from AARP (American Association of Retired Persons). He was annoyed by the ads, especially because he knew his age would not qualify him for AARP insurance. What’s worse, the bot failed to provide a helpful answer. He commented:
It didn’t really provide any in depth help [with life insurance advice], and instead of providing alternative places to read, gave me ads for insurance companies.
Another participant asked about the website Clutch (a site for finding agencies that specialize in a variety of website-related services), but the promoted ads were about clutch kits, which she didn’t like at all.
During our study, the mostly misplaced and overwhelming ads displayed by Bing Chat generally left a negative impression on the participants. They commented:
Well, I, I mean, so I asked a question about […] vegan food or about coding; [it] doesn’t mean I necessarily now want to be pummeled with opportunities to buy vegan food or coding courses, but that that is in fact the outcome of the Bing interface.
I mean, it, it’s, it disappointing […] it’s clear that they’re presenting me with things that are in fact relevant to me and it’s a prequalifier for their revenue model to send me to places where they’ve got people paying for ads […] It’s quite […] a marketing operation; as a utility, as a resource [it is] kind of distasteful.
User Experience: Essential for the Design of Successful AI
Overall, Bing Chat received poorer helpfulness and trustworthiness ratings than ChatGPT and Bard. There were two big reasons why: its poorer information aggregation and its faultier interface.
The poor information aggregation is something that AI researchers can and should fix. But the faultier interface concerns us — UX professionals.
Bing had the most complex interface, with the most features, yet it got dinged for it. Does that mean that we are better off if the AI design includes no or very few UI elements (like that of ChatGPT)?
The answer is a resounding no. References, in-answer links, suggested followup questions, and multimedia components (like videos, images, and other types of information panels) are all good, necessary features. They help users make sense of the answer received from the bot. They also help them act upon the information. As AI becomes pervasive and people use it to engage in more complex and varied tasks, these features will become indispensable.
What our finding means is that these additional UI elements need to be well designed and tested with many different users and tasks, so that they do not get in the way. The idea is good, but the execution needs to improve.
Best Practices and Design Recommendations for Generative-AI Chatbots
Designers of generative AI bots can learn from Bing Chat’s experience and follow these best practices: