{"id":10992,"date":"2025-03-31T10:00:00","date_gmt":"2025-03-31T08:00:00","guid":{"rendered":"https:\/\/haimagazine.com\/uncategorized\/understanding-ai-a-safety-game\/"},"modified":"2025-06-26T15:35:06","modified_gmt":"2025-06-26T13:35:06","slug":"understanding-ai-a-safety-game","status":"publish","type":"post","link":"https:\/\/haimagazine.com\/en\/hai-magazine-4\/understanding-ai-a-safety-game\/","title":{"rendered":"\ud83d\udd12 Understanding AI: a safety game"},"content":{"rendered":"<p><strong>Inez Okulska: Safety in the context of artificial intelligence is discussed in every way possible. Both big players and beginner enthusiasts are increasingly feeling that these two concepts, though not always easy to reconcile (see: massive layoffs in big tech right in this area), must go hand in hand. Is risk just a flaw of bad models, or an intrinsic aspect to this technology? What does &#8220;AI safety&#8221; actually mean for business and everyday life?   <\/strong><\/p><p><strong>Przemys\u0142aw Biecek: <\/strong>Paraphrasing the Anna Karenina principle: all good models are alike, each bad model is bad in its own way. This saying really holds up when analyzing the safety of artificial intelligence models.<br\/>Unbiased, secure, trusted, robust, transparent, verified \u2013 these are just some examples of secure AI. We expect a whole bunch of desired features from a safe model, and failing even one of them means we consider the model defective, and sometimes even dangerous. The word &#8220;safe&#8221; here is an umbrella for many criteria we want it to meet.    <\/p><p>A model can be safe, just like a house no one breaks into because it has security systems, alarms and locks. In this case, it means that no unwanted actor will affect the model&#8217;s performance or alter its results. Increasingly more businesses are relying on AI modules, so it&#8217;s essential to make sure they aren&#8217;t tampered with by hostile competitors, unfriendly users, or other players with bad intentions.  <\/p><p>But the model can be as safe as a home where we feel good because there&#8217;s happiness, fairness and a supportive atmosphere. In this case, safety can mean trust and no discrimination. If certain decisions, like access to good education or healthcare, depend on AI algorithm recommendations in my everyday life, then it&#8217;s crucial to make sure these systems won&#8217;t discriminate against me based on age, skin color, gender or other irrelevant characteristics.  <\/p><p>Finally, the model can be as safe as a house that doesn&#8217;t explode because the electrical or gas installations are regularly checked by qualified staff. In this case, safety means reducing the risk of fire, electric shock or gas poisoning. If a key part of my business relies on an AI module, I definitely don&#8217;t want its malfunction to lead to an uncontrolled number of lawsuits or complaints that could sink my company.  <\/p><p>There&#8217;s no single definition of safety, but we have plenty of examples of models that don&#8217;t work right. As a community, we&#8217;re just starting to figure out the right safety standards and we&#8217;re discovering brand new challenges in this area. That&#8217;s why working on safe models is so fascinating.  <\/p><p><strong>IO: Since so much can go wrong, are there databases that document cases where AI has failed? What can business learn from them? <\/strong><\/p><p><strong>PB: <\/strong>There are several databases like this and new ones pop up every now and then. 
I mostly follow and recommend IncidentDatabase.AI, which contains hundreds of well-documented errors and harms caused by malfunctioning AI systems. It's a great repository because it systematically collects and analyzes instances where AI messed up – from biased algorithms to spectacular failures in autonomous systems. It's a treasure trove of knowledge for researchers, engineers and anyone who wants to build better and safer AI.

There's also the Epic fAIls ranking that I've been organizing for a while: a list of the most spectacular AI blunders detected in a given year. You'll find examples there showing how much AI can surprise us, but also how painfully it can disappoint.

For example, the transcription model Whisper took third place in the 2024 poll. In October 2024, it came to light that this model, developed by OpenAI and optimized to "smooth out" text, made serious errors in medical applications, leading to so-called hallucinations – it generated text that wasn't in the original recording. Despite OpenAI's warnings against using Whisper in "high-risk areas", the tool has been deployed in over 40 healthcare systems and is used by more than 30,000 medical workers, for instance at the [Children's Hospital in Los Angeles](https://go.campus.ai/4ihTPTP). One study showed that Whisper added content that didn't exist in 80% of the analyzed transcripts of public meetings. Another found hallucinations – content absent from [the original interview](https://go.campus.ai/4bJ6Xip) – in nearly all of the 26,000 transcriptions tested. In a medical context, errors like these can lead to serious consequences: wrong diagnoses, misunderstandings between medical staff and patients, or incorrect documentation of a patient's history. In one case, Whisper added fictional text stating that people "were black", even though this information was not in the original recording. Another time, it turned neutral statements into violent content.

**IO: Mind-boggling! And that was only third place. What could possibly beat such a spectacular failure?**

**PB:** A detailed discussion of the poll results can be found in the [podcast recorded for Pulsar](https://go.campus.ai/3R5V2S9), but I'll spill the beans: the biggest flop of 2024 was the Gemini model, which was supposed to eliminate prejudice and discrimination but instead generated historically incorrect images.

Gemini, developed by Google DeepMind, was designed to guarantee inclusivity and prevent bias in the content it generates.
However, in February 2024, users discovered that the model tried so hard to introduce ethnic diversity that it did so even in historically inaccurate contexts. Asked to generate historical images of figures like the American Founding Fathers, popes or Roman emperors, the model often depicted them as people of diverse ethnic backgrounds, overlooking historical realities.

The biggest outrage was caused by images of Nazis – German soldiers of the World War II era – as people of various skin colors, which was widely seen as distorting history. Similar problems cropped up when generating images of historical scenes, like medieval Europe or ancient Greece, where Gemini over-corrected the demographics to avoid accusations of discrimination. After a wave of criticism, Google officially apologized for the mistake and temporarily pulled Gemini's image-generation feature. The company admitted that its system tried to "actively counteract stereotypes" but did so too aggressively, leading to hallucinations that weren't in line with the facts. The episode is a perfect example of the challenges of ethical data management and bias in AI models, and it shows how tricky it is to balance inclusiveness against staying true to historical facts. Many people pointed out that AI should aim for objectivity instead of trying to "fix" history according to modern standards.

**IO: If even the big guns – tools we want to and should be able to rely on – sit at the top of this not-so-glorious list, what should we do? In my opinion, the very fact that such rankings exist is pretty uplifting. It shows that we can look under the hood and analyze how models work – even if only to find out how wrong they can sometimes be.**

**PB:** One of the tools for model analysis is Explainable AI (XAI): a set of techniques that let us understand why a model made a specific decision. Instead of treating AI like a magic box, we can build systems that explain their results (in medicine, AI shouldn't just say "It's cancer" but indicate which areas of the image led to that diagnosis), that enable auditing and testing (instead of taking someone's word for it, we can check whether the model works fairly, for example in finance or recruitment), and that warn about their own limitations (instead of hallucinating, AI could say "I'm not sure" or report its level of certainty).

Are risks built into AI? Yes, but that doesn't mean we have to accept them. Better transparency means safer systems – for businesses and users alike.
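One of the techniques just mentioned – checking which input features drive a model's decision – can be sketched in a few lines. A classic approach is permutation feature importance: shuffle each feature in turn and measure how much the model's accuracy drops. The public dataset and off-the-shelf model below are illustrative stand-ins, not the systems discussed in this interview.

```python
# Minimal sketch: permutation feature importance with scikit-learn.
# Dataset and model are illustrative stand-ins.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn; the bigger the accuracy drop,
# the more the model's decisions depend on that feature.
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[idx]:<25} drop in accuracy: "
          f"{result.importances_mean[idx]:.3f}")
```

The features whose shuffling hurts accuracy the most are the ones the model actually relies on – exactly the kind of evidence an audit can inspect.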
<\/p><div class=\"wp-block-media-text is-stacked-on-mobile is-vertically-aligned-center\" style=\"grid-template-columns:40% auto\"><figure class=\"wp-block-media-text__media\"><img loading=\"lazy\" decoding=\"async\" width=\"334\" height=\"415\" src=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/03\/Zrzut-ekranu-2025-03-28-101444.png\" alt=\"\" class=\"wp-image-9692 size-full\" srcset=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/03\/Zrzut-ekranu-2025-03-28-101444.png 334w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/03\/Zrzut-ekranu-2025-03-28-101444-241x300.png 241w\" sizes=\"auto, (max-width: 334px) 100vw, 334px\" \/><\/figure><div class=\"wp-block-media-text__content\"><p><strong>IO: Can you share a specific example of model explanations implementation and their significance?<\/strong><\/p>\n\n<p><strong>PB: <\/strong>I can even share two examples. Two years ago, my team was building a solution that helped detect severe kidney inflammation in patients who had severe covid. It&#8217;s a difficult topic because the complications after this disease were still not well understood and doctors&#8217; intuition had to be compared on an ongoing basis with the results of experimental analyses for hospital patients. We managed to build a predictive system to assess risk pretty fast, but doctors didn&#8217;t trust it because they didn&#8217;t want to make big decisions based on a model they didn&#8217;t understand. Only by applying a set of methods from our proprietary algorithm<a href=\"https:\/\/iema.drwhy.ai\/\" data-type=\"link\" data-id=\"https:\/\/iema.drwhy.ai\/\" target=\"_blank\" rel=\"noopener\"><mark style=\"background-color:#82D65E\" class=\"has-inline-color has-contrast-color\">Interactive Explainable Model Analysis <\/mark><\/a>did they start trusting the model results more. It was easy to check which patient features are most important for the prognosis and see how the prognosis would change if certain parameters were higher or lower.     <\/p><\/div><\/div><p>Another interesting example of the direct use of explainable AI in business is the collaboration with KP Labs, which develops AI algorithms for space applications. Our collaboration had a dynamic described as Blue Team vs. Red Team. The team at KP Labs was building predictive models for hyperspectral images for Earth observation applications, and creating top-notch models ready for use in very demanding environments. Our team played the role of the red team \u2013 looking for weaknesses and vulnerabilities in the built models, suggesting how they could be improved. These types of solutions are applied when AI systems need to be reliable, like in defense, healthcare, or space applications. The independent sets of eyes in the form of a red team help eliminate many easy-to-miss errors.      <\/p><p><strong>IO: The tools you&#8217;re talking about seem to be for smaller models. What about the big, generative ones? Are we still able to explain them or rather just getting closer to somewhat taming them? Are traditional methods still useful?  <\/strong><\/p><p><strong>PB:<\/strong> Yes and no. Basic concepts are similar and some technical tools \u2013 like gradient analysis \u2013 also apply, but large language models are a whole different ball game when it comes to explainability. In classic AI systems, like medical models or predictive systems, you can use pretty intuitive XAI techniques, like showing which data features had the biggest impact on the model&#8217;s decision.  
However, LLMs (large language models) work in a sequential and probabilistic way – they don't "make decisions" but predict the most likely next word based on billions of parameters. Their "reasoning" is hard to grasp. Explaining how they work is less the classic question of why a particular decision was made and more an analysis of how individual parts of the text influence the output. On the other hand, you can test them in a more intuitive way, for instance by asking different questions and analyzing the patterns in their responses.

**IO: So instead of approximating functions, it's behavioral analysis.**

**PB:** And this is a really important difference, because the explainability of language models isn't only about trying to understand what's going on inside them – it's also the key to controlling them. When we talk about Explainable AI, we often think about analyzing why the model makes mistakes: why it discriminates, why it confabulates, why it generates harmful content. But with LLMs, managing how the model behaves is just as important – if not more so. Imagine an AI system used for automatic content moderation. Understanding why the model flags some comments as harmful is valuable, but not enough. The key question is whether we can fine-tune the model so that it behaves predictably and in line with our goals.

It's similar with confabulations: we know that language models "hallucinate", but instead of merely analyzing why they do it, we should be looking for ways to limit, detect, or at least flag those hallucinations in real time.

In practice, especially in business, explaining the origin of a mistake isn't enough – it won't eliminate the mistake's consequences. When an LLM is treated as a medical search engine, confabulation is genuinely harmful. But when it's used as support for writing a science-fiction story, more creativity at the expense of accuracy won't bother us, and may even turn out to be a benefit.

If we don't have control over AI, it's like flying a plane without the ability to correct the course – we know how the autopilot works, but we can't stop it when it's heading the wrong way. That's why explainability in LLMs is more than analysis – it's a way to actually manage risk and improve the safety of these systems.

**IO: Can you talk from experience, then, about how to control LLMs?**

**PB:** We're developing various methods for increasing control over models. It's hard to predict which will catch on, but in my opinion the most promising today are sparse autoencoders [a method described in this issue in the article by Paulina Tomaszewska – ed. note]. The idea is to "insert" a special overlay into the model that spreads its operation across thousands of distinct concepts. We can then find the concepts that interest us – say, those related to aggression in responses, or to emotional intensity – and suppress or enhance them along with the corresponding parts of the model. If we want a model that generates responses free of hate speech, we locate the concepts related to hate speech and turn them off. And when we want to use the same model to filter comments on social media, we increase its sensitivity to exactly those concepts.
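A minimal sketch of that overlay idea follows, with illustrative dimensions and random stand-in activations in place of a real model's hidden states; unit 123 plays the role of a hypothetical unwanted concept identified by inspection.

```python
# Minimal sketch: a sparse autoencoder over hidden activations (PyTorch).
# Sizes and data are illustrative; real SAEs are trained on activations
# captured from an actual LLM.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_dict = 256, 4096            # hidden size, number of concept units

enc = nn.Linear(d_model, d_dict)
dec = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

acts = torch.randn(4096, d_model)      # stand-in for captured LLM activations

for _ in range(200):
    z = F.relu(enc(acts))              # sparse concept activations
    recon = dec(z)
    # reconstruction loss + L1 penalty that keeps most units switched off,
    # nudging each unit toward representing a single concept
    loss = F.mse_loss(recon, acts) + 1e-3 * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Turning a concept off": zero its unit and decode back.
with torch.no_grad():
    z = F.relu(enc(acts))
    z[:, 123] = 0.0                    # 123 = hypothetical unwanted concept
    steered = dec(z)                   # activations with that concept muted
print(steered.shape)
```

In practice the dictionary is far wider, and the hard part is identifying which unit corresponds to which concept – that mapping is found empirically, by checking what kinds of inputs make each unit fire.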
**IO: What if they try to influence us with what they say? Let's be honest: their rhetoric is surprisingly eloquent.**

**PB:** Exactly. A really interesting property of these models, which follows directly from the way they are trained, is persuasiveness. We've just finished a study of how large language models adjust their responses to influence users with different personality traits. We looked at the key linguistic features that matter when you're trying to persuade people with different levels of those traits, and compared 19 different LLMs on how well they can adapt to an individual's personality in order to increase their persuasiveness. The results show that models use more anxiety-related words when they sense a neurotic recipient, beef up the language of success for conscientious people, and either trim or enrich vocabulary about cognitive processes depending on how open to experience the user is. Some model families adapt their language better to one trait, some to another, and only one family adapts its language to neuroticism. It turns out that LLMs can tailor their responses to personality cues in prompts, which shows their potential to create persuasive content that can influence the minds and well-being of their audience. We often think of LLMs as big search engines, but in reality they're extremely effective persuasion tools.

**IO: If that didn't send a shiver down your spine, you at least have to admit it's astonishing. Since there are still so many traps to watch out for with modern AI models, would you say it's too early to deploy them in business with a clear conscience?**

**PB:** No. It's a bit like with cars – we don't stop using them just because they might break down. Instead, we invest in seat belts, ABS and quality control. The same goes for AI: rather than giving up, we should develop better methods of oversight and explainability, sparing neither intellectual nor financial resources, globally.

Researchers from my team, MI2.AI, have been proving for years that models can be more than black boxes – they can be interpreted and controlled. Their work on making models understandable shows that transparency and control of artificial intelligence aren't just theory: they're already happening, and they can make these technologies safer. So instead of asking "Should we use AI?", it's better to ask "How can we ensure it's safe and under our control?".