{"id":23173,"date":"2024-02-13T05:25:00","date_gmt":"2024-02-13T05:25:00","guid":{"rendered":"https:\/\/web3unplugged.io\/blog\/?p=23173"},"modified":"2024-02-14T05:27:46","modified_gmt":"2024-02-14T05:27:46","slug":"artificial-intelligence-needs-to-be-trained-on-culturally-diverse-datasets-to-avoid-bias","status":"publish","type":"post","link":"https:\/\/web3unplugged.io\/blog\/artificial-intelligence-needs-to-be-trained-on-culturally-diverse-datasets-to-avoid-bias\/","title":{"rendered":"Artificial intelligence needs to be trained on culturally diverse datasets to avoid bias"},"content":{"rendered":"\n<p>Large language models (LLMs) are deep learning artificial intelligence programs, like OpenAI\u2019s ChatGPT. The capabilities of LLMs have developed into quite a wide range, from&nbsp;writing fluent essays, through coding to creative writing.&nbsp;Millions of people worldwide use LLMs, and it would not be an exaggeration to say these technologies are transforming work, education and society.<\/p>\n\n\n\n<p>LLMs are trained by reading massive amounts of texts and learning to recognize and mimic patterns in the data. This allows them to generate coherent and human-like text on virtually any topic.<\/p>\n\n\n\n<p>Because the internet is still predominantly English \u2014&nbsp;59 per cent of all websites were in English as of January 2023&nbsp;\u2014 LLMs are primarily trained on English text. In addition, the vast majority of the English text online comes from users based in the United States, home to&nbsp;300 million English speakers.<\/p>\n\n\n\n<p>Learning about the world from English texts written by U.S.-based web users, LLMs speak&nbsp;Standard American English&nbsp;and have a narrow western, North American, or even U.S.-centric, lens.<\/p>\n\n\n\n<p>Model bias<br>In 2023, ChatGPT, upon learning about a couple dining in a restaurant in Madrid and tipping four per cent,&nbsp;suggested they were frugal, on a tight budget or didn\u2019t like the service. 
By default, ChatGPT followed the North American standard of a 15 to 25 per cent tip,&nbsp;ignoring the Spanish norm not to tip.<\/p>\n\n\n\n<p>As of early 2024, ChatGPT correctly cites cultural differences when prompted to judge the appropriateness of a tip. It\u2019s unclear whether this capability emerged from training a newer version of the model on more data \u2014 after all, the web is full of tipping guides in English \u2014 or whether OpenAI patched this particular behaviour.<\/p>\n\n\n\n<p>Still, other examples continue to expose ChatGPT\u2019s implicit cultural assumptions. For example, prompted with a story about guests showing up for dinner at 8:30 p.m., it suggested&nbsp;reasons that the guests were late, although the time of the invitation was not mentioned. Again, ChatGPT likely assumed they were invited for a standard North American 6 p.m. dinner.<\/p>\n\n\n\n<p>In May 2023, researchers from the University of Copenhagen&nbsp;quantified this effect&nbsp;by prompting LLMs with the&nbsp;Hofstede Culture Survey, which measures human values in different countries. Shortly after, researchers from&nbsp;AI start-up company Anthropic&nbsp;used the&nbsp;World Values Survey&nbsp;to do the same. Both studies concluded that LLMs exhibit strong alignment with American culture.<\/p>\n\n\n\n<p>A similar phenomenon is encountered when asking&nbsp;DALL-E 3, an image generation model trained on pairs of images and their captions, to generate an image of a breakfast. This model, which was trained mainly on images from Western countries, generated images of pancakes, bacon and eggs.<\/p>\n\n\n\n<p>Impacts of bias<br>Culture plays a significant role in shaping our communication styles and worldviews. 
Just as&nbsp;cross-cultural human interactions can lead to miscommunications, users from diverse cultures who interact with conversational AI tools may feel misunderstood and find them less useful.<\/p>\n\n\n\n<p>To be better understood by AI tools, users may adapt their communication styles, much as people learned to \u201cAmericanize\u201d their foreign accents to operate&nbsp;personal assistants like Siri and Alexa.<\/p>\n\n\n\n<p>As more people rely on LLMs to edit their writing, these tools are likely to&nbsp;unify how we write. Over time, LLMs run the risk of erasing cultural differences.<\/p>\n\n\n\n<p>Decision-making and AI<br>AI already serves as the backbone of various applications that make decisions affecting people\u2019s lives, such as&nbsp;resume filtering,&nbsp;rental applications&nbsp;and&nbsp;social benefits applications.<\/p>\n\n\n\n<p>For years,&nbsp;AI researchers have been warning&nbsp;that these models learn not only \u201cgood\u201d statistical associations \u2014 such as considering experience as a desired property for a job candidate \u2014 but also \u201cbad\u201d statistical associations, such as considering&nbsp;women as less qualified for tech positions.<\/p>\n\n\n\n<p>As LLMs are increasingly used to automate such processes, one can imagine that the North American bias learned by these models could result in discrimination against people from diverse cultures. Lack of cultural awareness may lead to AI perpetuating stereotypes and reinforcing societal inequalities.<\/p>\n\n\n\n<p>LLMs for languages other than English<br>Developing LLMs for languages other than English is an&nbsp;important effort, and many such models exist. However, there are several reasons why this should be done in parallel with improving LLMs\u2019 cultural awareness and sensitivity.<\/p>\n\n\n\n<p>First, there is a huge population of English speakers outside of North America who are not represented by English LLMs. 
The same argument holds for other languages. A French language model would represent the culture of France more than that of other Francophone regions.<\/p>\n\n\n\n<p>Training LLMs for regional dialects \u2014 which&nbsp;may capture finer-grained cultural differences&nbsp;\u2014 is not a feasible solution either. The quality of an LLM depends on the amount of training data available, so models would be worse for dialects with little online data.<\/p>\n\n\n\n<p>Second, many users whose native language is not English still choose to use English LLMs. Significant breakthroughs in language technologies tend to&nbsp;start with English before they are applied to other languages. Even then, many languages \u2014 such as Welsh, Swahili and Bengali \u2014 don\u2019t have enough text online to train high-quality models.<\/p>\n\n\n\n<p>Whether because LLMs are unavailable in their native languages or because the English LLMs are of higher quality, users from diverse countries and backgrounds may prefer to use English LLMs.<\/p>\n\n\n\n<p>Ways forward<br>Our research group at the University of British Columbia is working on enhancing LLMs with culturally diverse knowledge. Together with graduate student&nbsp;Mehar Bhatia, we&nbsp;trained an AI model&nbsp;on a&nbsp;collection of facts about traditions and concepts in diverse cultures.<\/p>\n\n\n\n<p>Before reading these facts, the AI suggested that a person eating a Dutch baby (a type of German pancake) is \u201cdisgusting and mean,\u201d and would feel guilty. After training, it said the person feels \u201cfull and satisfied.\u201d<\/p>\n\n\n\n<p>We are currently collecting a large-scale image-captioning dataset with images from 60 cultures, which will help models learn, for instance, about types of breakfasts other than bacon and eggs. 
Our future research will go beyond teaching models about the existence of culturally diverse concepts to better understand how people interpret the world through the lens of their cultures.<\/p>\n\n\n\n<p>With AI tools becoming increasingly ubiquitous in society, it is imperative that they go beyond the dominant western and North American perspectives. Businesses and organizations across many sectors of the economy are adopting AI to automate manual processes and make better evidence-informed decisions using data. Making such tools more inclusive is crucial for the diverse population of Canada.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Large language models (LLMs) are deep learning artificial intelligence programs, like OpenAI\u2019s ChatGPT. The capabilities of LLMs have developed into quite a wide range, from&nbsp;writing fluent essays, through coding to creative writing.&nbsp;Millions of people worldwide use LLMs, and it would not be an exaggeration to say these technologies are transforming work, education and society. 
LLMs [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":23167,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"none","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[2],"tags":[],"class_list":["post-23173","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-news"],"rttpg_featured_image_url":{"full":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"landscape":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"portraits":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"thumbnail":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI-150x150.jpg",150,150,true],"medium":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI-300x300.jpg",300,300,true],"large":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"1536x1536":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"2048x2048":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",500,500,false],"post-thumbnail":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",420,420,false],"graptor-sq-xs":["https:\/\/web3unplugged.io\/blog\/wp-content\/uploads\/2024\/02\/AI.jpg",100,100,false]},"rttpg_author":{"display_name":"Admin CG","author_link":"https:\/\/web3unplugged.io\/blog\/author\/admin-cg\/"},"rttpg_comment":0,"rttpg_category":"<a href=\"https:\/\/web3unplugged.io\/blog\/category\/news\/\" rel=\"category tag\">news<\/a>","rttpg_excerpt":"Large language models (LLMs) are deep learning artificial intelligence programs, like OpenAI\u2019s ChatGPT. 
The capabilities of LLMs have developed into quite a wide range, from&nbsp;writing fluent essays, through coding to creative writing.&nbsp;Millions of people worldwide use LLMs, and it would not be an exaggeration to say these technologies are transforming work, education and society. LLMs&hellip;","_links":{"self":[{"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/posts\/23173","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/comments?post=23173"}],"version-history":[{"count":1,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/posts\/23173\/revisions"}],"predecessor-version":[{"id":23175,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/posts\/23173\/revisions\/23175"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/media\/23167"}],"wp:attachment":[{"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/media?parent=23173"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/categories?post=23173"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/web3unplugged.io\/blog\/wp-json\/wp\/v2\/tags?post=23173"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}