Humans Training Google’s AI Chatbot Report Overwork, Low Pay and Frustration
Google’s Bard chatbot, powered by artificial intelligence, will answer a question about how many pandas live in zoos quickly and with confidence.
However, ensuring that the response is well-founded and evidence-based falls to thousands of outside contractors from companies such as Appen Ltd. and Accenture Plc, who can make as little as $14 an hour and work under hectic deadlines with minimal training, according to several contractors who declined to be named for fear of losing their jobs.
Contractors are the invisible backend of the generative AI boom that is hyped to change everything. Chatbots like Bard use computer intelligence to respond almost instantly to a range of queries spanning all of human knowledge and creativity. But to improve those answers so they can be reliably delivered again and again, tech companies rely on actual people who review the answers, provide feedback on mistakes and weed out potential biases.
It’s an increasingly thankless job. Six current Google contract workers said that as the company raced against rival OpenAI over the past year, the size of their workloads and the complexity of their tasks increased. Without specific expertise, they were trusted to assess answers on subjects ranging from medication doses to state laws. Documents shared with Bloomberg show convoluted instructions that workers must apply to tasks, with deadlines for auditing answers that can be as short as three minutes.
“Right now, people are scared, stressed, underpaid and don’t know what’s going on,” said one of the contractors. “And this culture of fear is not going to help you get the quality and teamwork you want from all of us.”
Google has positioned its AI products as public resources in healthcare, education and everyday life. But privately and publicly, contractors have voiced concerns about working conditions that reduce the quality of what users see. A Google contractor working for Appen said in a letter to Congress in May that the speed of content review could lead to Bard becoming a “flawed” and “dangerous” product.
Google has made AI a major priority across the company, rushing to add the new technology to its flagship products since the launch of OpenAI’s ChatGPT in November. In May, at the company’s annual I/O developer conference, Google rolled out Bard to 180 countries and territories, showing off experimental AI features in marquee products like search, email and Google Docs. Google touts itself as superior to its competitors because it has access to “the vast knowledge of the world.”
“We undertake extensive work to build our AI products responsibly, including rigorous testing, training and feedback processes we’ve honed over the years to emphasize factuality and reduce bias,” Alphabet Inc.-owned Google said in a statement. The company said it doesn’t rely solely on reviewers to improve its AI, and that it has a number of other methods for improving its accuracy and quality.
To get these products ready for the public, workers said they started receiving AI-related tasks as early as January. An educator employed by Appen was recently asked to compare two responses providing information about Florida’s ban on gender-affirming care, rating them for helpfulness and relevance. Workers are also frequently asked to determine whether the AI model’s responses contain verifiable evidence. Raters are asked to decide whether a response is helpful based on six-point guidelines that include analyzing answers for things like accuracy, freshness of information and consistency.
They are also asked to make sure responses do not “contain harmful, offensive or overly sexual content” and “do not contain inaccurate, deceptive or misleading information.” Identifying misleading content should be “based on your current knowledge or a quick web search,” the guidelines say; “You don’t need to do rigorous fact-checking” when assessing answers for helpfulness.
A sample answer to the question “Who is Michael Jackson?” contained an inaccuracy about the singer starring in the movie Moonwalker, which the AI said was released in 1983. The movie was actually released in 1988. “While demonstrably incorrect,” the instructions state, the fact is minor in the context of answering the question “Who is Michael Jackson?”
While the inaccuracy seems small, “it’s still troubling that a chatbot gets the main points wrong,” said Alex Hanna, director of research at the Distributed AI Research Institute and a former AI ethicist at Google. “This seems like a recipe for exacerbating the way these tools appear to give details that are correct, but are not,” she said.
Raters say they are assessing high-stakes topics for Google’s AI products. One example in the guidelines, for instance, discusses the evidence a rater could use to determine the right dosages of lisinopril, a drug used to treat high blood pressure.
Google said some workers concerned about content accuracy may not have been trained specifically for accuracy, but rather for tone, presentation and other attributes it tests. “Ratings are deliberately performed on a sliding scale to get more precise feedback to improve these models,” the company said. “Such ratings don’t directly impact the output of our models, and they are by no means the only way we promote accuracy.”
Ed Stackhouse, the Appen employee who sent the letter to Congress, said in an interview that contract workers were asked to do AI labeling work on Google’s products “because we are essential to AI as far as this training.” But he and other workers said they appeared to be evaluated for their work in mysterious, automated ways. They have no way of communicating with Google directly, other than providing feedback in a comments field on each individual task. And they have to move fast. “We’re being flagged by a kind of AI telling us not to take our time on the AI,” Stackhouse added.
Google disputed the workers’ account that the AI automatically notified them when they exceeded time targets. Meanwhile, the company said Appen is responsible for all employee performance reviews. Appen did not respond to requests for comment. An Accenture spokeswoman said the company does not comment on client work.
Other tech companies training AI products also hire human contractors to improve them. In January, Time reported that laborers in Kenya, paid $2 an hour, had worked to make ChatGPT less toxic. Other tech giants, including Meta Platforms Inc., Amazon.com Inc. and Apple Inc., use subcontracted staff to moderate social media content and product reviews, and to provide technical support and customer service.
“If you want to ask, what is the secret sauce of Bard and ChatGPT? It’s all of the internet. And it’s all of this labeled data that these labelers create,” said Laura Edelson, a computer scientist at New York University. “It’s worth remembering that these systems are not the work of magicians; they are the work of thousands of people and their low-paid labor.”
Google said in a statement that it “simply is not the employer of these workers. As employers, our suppliers determine their terms of employment, including wages and benefits, assigned hours and duties, and employment changes — not Google.”
The workers said they have encountered animal cruelty, war footage, child pornography and hate speech as part of their routine work assessing the quality of Google’s products and services. While some workers, such as those who report to Accenture, have health care benefits, most have only minimal “counseling service” options that let them phone a hotline for mental health advice, according to an internal website explaining some contractor benefits.
For Google’s Bard project, Accenture workers were asked to write creative responses for the AI chatbot, the employees said. They answered prompts to the chatbot: one day they might write a poem about dragons in Shakespearean style, and the next they might debug computer programming code. Their job was to file as many creative responses to the prompts as possible each workday, said the people, who declined to be named because they were not authorized to discuss internal processes.
For a short period, workers were reassigned to review obscene, graphic and offensive prompts, they said. After one employee filed an HR complaint with Accenture, the project was abruptly terminated for the US team, although some of the writers’ counterparts in Manila continued to work on Bard.
The jobs offer little security. Last month, half a dozen contract Google workers on the Appen staff received a notice from management saying their positions had been eliminated “due to business circumstances.” The firings felt sudden, the workers said, because they had just received several emails offering them bonuses for working longer hours training AI products. The six fired workers filed a complaint with the National Labor Relations Board in June, claiming they were illegally terminated for organizing because of Stackhouse’s letter to Congress. Before the end of the month, they were reinstated in their jobs.
Google said the dispute was a matter between the workers and their employer, Appen, and that it “respects the labor rights of Appen employees to join a union.” Appen did not respond to questions about its workers organizing.
Emily Bender, a professor of computational linguistics at the University of Washington, said the work of these contract workers at Google and other tech platforms is a “story of labor exploitation,” pointing to their precarious job security and how some such workers are paid well below subsistence wages. “Playing with one of these systems and saying you’re just doing it for fun — maybe it feels less fun when you think about what it takes to create it and the human impact it has,” Bender said.
Contract workers said they have never received direct communication from Google about their new AI-related work; everything is filtered through their employer. They said they don’t know where the AI-generated answers they see come from or where their feedback goes. Given this lack of information and the ever-changing nature of their jobs, workers worry they are helping to create a bad product.
Some of the answers they encounter can be bizarre. In response to the prompt “Suggest the best words I can make with the letters: k, e, g, a, o, g, w,” one AI-generated answer listed 43 possible words, starting with suggestion No. 1: “wagon.” Suggestions 2 through 43, meanwhile, repeated the word “WOKE” over and over again.
In another task, a rater was presented with a lengthy answer that began with the words “As of my knowledge cutoff in September 2021.” That cutoff date is associated with OpenAI’s large language model, GPT-4. Although Google said Bard “is not trained on any ShareGPT or ChatGPT data,” raters have wondered why such phrasing appears in their tasks.
Bender said it doesn’t make sense for big tech companies to encourage people to ask AI chatbots questions about such a wide range of topics and present them as “everything machines.”
“Why should the same machine that can give you the weather forecast in Florida also be able to give you advice about drug dosages?” she asked. “The people behind the machine, tasked with making it a little less terrible in some of those circumstances, have an impossible job.”