A Southeast Asian large language model (LLM) called SEA-LION has been created by a Singapore government-led initiative to provide better representation for the region. (REUTERS)

Singapore develops AI model to better represent Southeast Asians, raising concerns about potential bias

Millions of people in Southeast Asia, like others around the globe, have been experimenting with models such as Meta’s Llama 2 and Mistral AI. But when they prompt them in native languages such as Bahasa Indonesia or Thai rather than English, the output is often nonsensical.

That leaves them at a disadvantage, tech experts warn, as generative AI transforms education, work and governance worldwide.

The Singapore government-led initiative seeks to redress the imbalance with SEA-LION (Southeast Asian Languages in One Network), the first in a family of LLMs built in Southeast Asia and trained on the region’s languages and cultural norms.

The open-source model has been trained on data from 11 Southeast Asian languages, including Vietnamese, Thai and Indonesian, and is a cheaper and more efficient option for companies, governments and academia in the region, said Leslie Teo of AI Singapore.

“Do we want to force every Southeast Asian to adapt to the machine, or do we want to make it easier so that people in the region can take full advantage of the technology without having to be English speakers?” he said.

“We’re not trying to compete with the big LLM companies, we’re trying to complement them so we can be better represented,” Teo, head of AI products, told the Thomson Reuters Foundation.

More than 7,000 languages are spoken in the world. Yet LLMs, including OpenAI’s GPT-4 and Meta’s Llama 2, which are used to build AI systems such as chatbots and other tools, are largely developed and trained in English.

Governments and tech companies are trying to bridge this gap: India is creating datasets in local languages, an LLM in the United Arab Emirates is powering generative AI tools in Arabic, and AI models in local languages are being built in China, Japan and Vietnam.

These models can help local populations participate more equitably in the global AI economy, which is largely dominated by big tech companies, said Nuurrianti Jalli, an assistant professor at Oklahoma State University’s School of Communication.

“Regional LLM companies are also needed because they support technology self-sufficiency,” she said. “Less reliance on Western LLMs could provide better privacy for the local population and also be more in line with the national or regional interest.”

REVIEW AND FILTER

Multilingual language models, which are trained on text from several languages at once, can infer semantic and grammatical connections between high-resource languages, which have more data available, and low-resource ones, researchers say.

These models can be used in many applications, from translation to customer service chatbots and content moderation on social media platforms that struggle to recognize hate speech in low-resource languages such as Burmese or Amharic.

About 13% of SEA-LION’s training data comes from Southeast Asian languages — a larger share than in any other major LLM, Teo said. More than 9% of its data is Chinese text and about 63% is English.

Multilingual language models are often trained on translated text and other low-quality data that may contain errors, so AI Singapore is “cautious” about the data used to train SEA-LION, Teo said in his office at the National University of Singapore.

“The age of raw data is over – a lot of the material on the internet is now produced by LLMs, so we have to check and filter,” he said.

“We can’t be perfect, but we also can’t take away everything we think is bad,” he added.

More governments are sharing data, and companies are testing SEA-LION, which, thanks to its smaller size, is faster and cheaper to fine-tune and deploy, Teo said.

At Indonesian e-commerce company Tokopedia, the majority of customer interactions are in Bahasa Indonesia, so the models “use this local fluency to improve our ability to connect with customers and enhance their experience,” said Paul Condylis, Tokopedia’s vice president of data science.

BIAS IN KNOWLEDGE

As more countries and regions build their own LLMs, digital and human rights experts fear that they will only be echoing dominant views expressed online, which can be particularly problematic in countries with authoritarian governments or strict media censorship, or without a strong civil society.

For example, Chinese social media platforms censor references to the Tiananmen Square uprising and criticism of the government, while several Southeast Asian countries have enacted laws to curb content that the authorities deem misleading.

“Training models with such data risks perpetuating biased, prejudiced, incomplete and even misleading narratives,” Jalli said.

“The models may not address important socio-political issues such as human rights abuses, corruption or valid criticism of political forces,” she said.

In response to a query about former Indonesian President Suharto, for example, Llama 2 and GPT-4 mentioned his uneven human rights record, while SEA-LION’s response largely focused on his achievements.

If a model is trained only on positive articles about the government, it “is likely to adopt a worldview where the government is completely positive and leaves dissenting views behind,” said Aliya Bhatia, a policy analyst at the Center for Democracy and Technology, a US non-profit organization.

“Regional LLMs may better reflect the linguistic and cultural nuances of local language speakers, but they may also have less knowledge of the world in general,” she added.

“There is a real risk that government-sponsored models will instill a revisionist view of history and undermine democratic values.”

But the alternative – relying entirely on Western LLM firms with “disproportionate influence” from rich, liberal, Western democracies – means perpetuating a different set of biases related to cultural values, political beliefs and social norms, according to AI Singapore.

“These LLMs have a very particular West Coast American bias — they’re very woke. They don’t represent us,” said Teo.

“We’re not saying our perspective is the only perspective — we’re just trying to balance it out.”

