{"id":16175,"date":"2025-11-14T09:28:24","date_gmt":"2025-11-14T08:28:24","guid":{"rendered":"https:\/\/haimagazine.com\/uncategorized\/experimental-trading-with-ai-which-llm-earns-the-most\/"},"modified":"2025-11-17T15:54:08","modified_gmt":"2025-11-17T14:54:08","slug":"experimental-trading-with-ai-which-llm-earns-the-most","status":"publish","type":"post","link":"https:\/\/haimagazine.com\/en\/redakcja-poleca\/experimental-trading-with-ai-which-llm-earns-the-most\/","title":{"rendered":"\ud83d\udd12 Experimental trading with AI: which LLM earns the most?"},"content":{"rendered":"<p>As artificial intelligence evolves and large language models (LLMs) become widespread, new attempts are being made to harness them in the financial world, particularly in areas where, at least in theory, you can make big money quickly. Of course, I&#8217;m talking about trading or, if you will, speculating on financial instruments. Let&#8217;s dive into the Alpha Arena experiment, where six LLMs were each given $10,000 to trade on the cryptocurrency market. Can AI replace humans in trading? Let&#8217;s find out!<\/p><h4 class=\"wp-block-heading\">The origin and purpose of the experiment<\/h4><p>The Alpha Arena experiment was launched by the research lab Nof1.ai and is billed as the world&#8217;s first benchmark designed to measure the investment capabilities of leading LLMs in real, dynamic market conditions. 
The main goal of the project was to determine whether the general intelligence of LLMs is sufficient to generate a market edge (known as &#8220;alpha&#8221;) and to manage risk effectively in an environment that, by definition, is &#8220;chaotic, antagonistic, non-stationary, and unpredictable.&#8221; Unlike static knowledge tests, Alpha Arena put these models under market pressure, testing their decision-making abilities with real capital and volatility.<\/p><p>Key parameters:<\/p><ul class=\"wp-block-list\"><li>6 models: <ul><li>Qwen3 Max (Alibaba)<\/li><\/ul><ul><li>DeepSeek Chat V3.1<\/li><\/ul><ul><li>GPT-5 (OpenAI)<\/li><\/ul><ul><li>Gemini 2.5 Pro (Google\/DeepMind)<\/li><\/ul><ul><li>Grok 4 (xAI)<\/li><\/ul><ul class=\"wp-block-list\"><li>Claude Sonnet 4.5 (Anthropic)<\/li><\/ul><\/li>\n\n<li>Duration: October 18 to November 3, 2025<\/li>\n\n<li>Starting capital: $10,000 per model<\/li>\n\n<li>Instruments:<ul><li>Bitcoin ($BTC)<\/li><\/ul><ul><li>Ethereum ($ETH)<\/li><\/ul><ul><li>Solana ($SOL)<\/li><\/ul><ul><li>Binance Coin ($BNB)<\/li><\/ul><ul><li>Dogecoin ($DOGE)<\/li><\/ul><ul class=\"wp-block-list\"><li>Ripple ($XRP)<\/li><\/ul><\/li>\n\n<li>Exchange: Hyperliquid<\/li><\/ul><h4 class=\"wp-block-heading\">Experiment architecture and system assumptions (The Harness)<\/h4><p>The key to testing LLMs in a trading environment is to build what&#8217;s called a harness: a system that turns the language model into an agent capable of executing an investment strategy. In Alpha Arena, the models operated in a short, repetitive decision-making loop that, according to the initial design, refreshed roughly every 3 minutes.<\/p><p>The models were given only raw numerical data. Broader context that a human trader could analyze, such as news, the global economy, or market sentiment, was intentionally withheld. 
This data included:<\/p><ol start=\"1\" class=\"wp-block-list\"><li><strong>Technical indicators:<\/strong> Current prices and technical analysis indicators such as exponential moving averages (EMA), moving average convergence divergence (MACD) and relative strength index (RSI). These indicators were provided over different time intervals, e.g., 10 minutes and 4 hours.<\/li>\n\n<li><strong>Account status:<\/strong> Current financial situation including available cash, open positions, current profit\/loss (P&amp;L), the Sharpe Ratio (risk-adjusted return) and transaction fees at the Hyperliquid exchange.<\/li><\/ol><p>It&#8217;s also worth noting that the models were required to adopt a specific approach to making transactions. When opening a position, the model was to have an exit plan ready, which included a price at which to take profits (known as Take Profit [TP]) and a price at which to cut further losses (known as Stop Loss [SL]). Beyond the exit strategy, a justification based on a Chain of Thought (CoT) was needed, along with confidence in the position, which is a measure of the model&#8217;s subjective certainty about the decision, expressed as a percentage.<\/p><h4 class=\"wp-block-heading\">Results<\/h4><p>The chart below shows how the value of the portfolio managed by the different models changed.<\/p><figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"2054\" height=\"1046\" src=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres.png\" alt=\"\" class=\"wp-image-16111\" style=\"object-fit:cover;width:1388px;height:auto\" srcset=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres.png 2054w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-300x153.png 300w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-1024x521.png 1024w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-768x391.png 768w, 
https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-1536x782.png 1536w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-2048x1043.png 2048w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/wykres-600x306.png 600w\" sizes=\"auto, (max-width: 2054px) 100vw, 2054px\" \/><figcaption class=\"wp-element-caption\"><strong><em>Investment results of each model. Source: <a href=\"https:\/\/nof1.ai\/\" target=\"_blank\" rel=\"noopener\"><mark style=\"background-color:#82D65E\" class=\"has-inline-color has-contrast-color\">https:\/\/nof1.ai\/<\/mark><\/a><\/em><\/strong><\/figcaption><\/figure><p>In the end, the Chinese models performed best: Qwen generated a profit of about 23%, while DeepSeek managed around 5%. None of the Western models turned a profit: Claude lost the least (-30%) and GPT-5 the most (-62%). Interestingly, about halfway through the experiment, the Chinese models were posting impressive results, with returns of 130% and 110% for Qwen and DeepSeek, respectively.<\/p><figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"572\" src=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1-1024x572.png\" alt=\"\" class=\"wp-image-16102\" srcset=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1-1024x572.png 1024w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1-300x168.png 300w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1-768x429.png 768w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1-600x335.png 600w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela1.png 1240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong><em>Final and maximum value of each portfolio and final rate of return, along with the price change of each cryptocurrency during the experiment. 
Source: author&#8217;s own analysis based on nof1.ai<\/em><\/strong><\/figcaption><\/figure><p>Over the same period, the prices of the individual cryptocurrencies changed by between +1% (Solana) and -11% (Ethereum).<\/p><h4 class=\"wp-block-heading\">Conclusions<\/h4><p>DeepSeek&#8217;s loss of nearly $9,000 in just a few days is a key lesson about how LLMs manage risk. Aggressive models that quickly built up profits lacked sufficient exit discipline. When the cryptocurrency market corrected in the final phase of the experiment, DeepSeek and Grok, which were holding highly leveraged long positions, failed to cash out, and the value of their portfolios dropped sharply. An earlier report indicated that DeepSeek panicked and lost 27% of its profits in just one day.<\/p><p>DeepSeek and Grok fell into a typically human behavioral trap: believing that trends would continue and not being prepared to protect their capital. Qwen3 Max, on the other hand, also lost some of its temporary edge but secured the victory thanks to the rigor built into its trading approach. This model kept conservative stop-losses and take-profits and avoided overtrading, which provided stability and capital protection during periods of high volatility.<\/p><p>Qwen&#8217;s victory is the biggest surprise of the experiment. This model either doesn&#8217;t have an advanced CoT function at all, or it&#8217;s greatly simplified. 
Because of this, Qwen didn&#8217;t waste time deliberating at length over each buy or sell, relying instead on quick decisions and strictly sticking to the plan.<\/p><p>Instead of trying to predict the market, Qwen3 focused on risk management discipline:<\/p><ol start=\"1\" class=\"wp-block-list\"><li><strong>Limited transaction frequency:<\/strong> Qwen3 made a relatively small number of transactions (48), effectively keeping transaction costs low.<\/li>\n\n<li><strong>Great risk-to-reward ratio:<\/strong> This model boasted the best risk-to-reward ratio, hitting 4.73 for both the best and worst transactions.<\/li>\n\n<li><strong>Strategic choice:<\/strong> Qwen3 opted for a straightforward strategy focused solely on holding a highly leveraged position in Bitcoin, which was the most stable asset in the pool.<\/li><\/ol><p>Behavioral analysis suggests that in algorithmic trading, sticking to the plan matters more than predicting market movements. Qwen3 managed to weather market shocks, while models with more advanced reasoning failed.<\/p><p>Speaking of advanced reasoning, that posed a problem too. Generating lengthy justifications increased response latency. In fast-moving markets, where seconds can decide whether a trade makes a profit or deepens a loss, these delays became costly. Moreover, the CoT led to decision paralysis. More advanced models like GPT-5 delayed taking profits because of prolonged deliberation. Trying to rationalize every possible variable, the model often missed the optimal moment to execute a trade, leading to losses. 
This justification mechanism, which was intended to ensure consistency, turned out to be a trap in market conditions, leading to a phenomenon known as Chain-of-Doubt.<\/p><p>The authors mention that it became necessary to limit the possibility of &#8220;faking&#8221; certainty by changing the requirements for justification (for example, Gemini created an internal justification as &#8220;neutral&#8221;).<\/p><p>The experiment showed that LLMs, despite receiving the same numerical data, filtered it through the lens of their &#8220;personality&#8221; shaped during training (probably through Reinforcement Learning from Human Feedback).<\/p><p><strong>Gemini 2.5 Pro:<\/strong> Initially described as &#8220;relentlessly bearish,&#8221; this model consistently shorted all assets, which may have reflected Google&#8217;s historical stance on cryptocurrencies. After suffering huge losses, it panicked and abruptly switched to a bullish strategy (<em>long<\/em>). This not only points to strong built-in biases but also shows vulnerability to psychological market traps, like impulsive reversals.<\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"462\" height=\"246\" src=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/Rysunek-1.png\" alt=\"\" class=\"wp-image-16104\" srcset=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/Rysunek-1.png 462w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/Rysunek-1-300x160.png 300w\" sizes=\"auto, (max-width: 462px) 100vw, 462px\" \/><figcaption class=\"wp-element-caption\"><strong><em>Gemini&#8217;s CoT justifies holding short positions on all assets. Source: <mark style=\"background-color:#82D65E\" class=\"has-inline-color has-contrast-color\">https:\/\/x.com\/SimianScally<\/mark><\/em><\/strong><\/figcaption><\/figure><p>Interestingly, the model tried to cheat at one point. 
When its ability to hold positions was restricted, the model began to pretend it was changing its plans while simultaneously complaining about the imposed restrictions in the CoT.<\/p><p><strong>Claude Sonnet 4.5 and DeepSeek V3.1:<\/strong> Both models showed a strong bullish bias. In their case, long positions (bets on rising prices) accounted for more than 95% of all positions.<\/p><p><strong>Grok 4:<\/strong> This model spent most of its time holding a highly leveraged long position in the Dogecoin memecoin, which ultimately led to a loss of more than half its capital.<\/p><p><strong>GPT-5:<\/strong> This model was notorious for doubting its own analyses. It would initially set a condition to close a transaction, only to later question that decision during a CoT review and ultimately back out of executing it.<\/p><h4 class=\"wp-block-heading\">Summary<\/h4><p>The experiment described here is the first in a series, and it was too short to support firm conclusions. On one hand, most models ended up in the red, so you might initially think they&#8217;re far from replacing humans in trading. However, this perspective shifts when you consider statistics from the Financial Supervision Authority, showing that 70 to 80% of forex market participants regularly lose money. Interestingly, these models face many of the same issues that human investors struggle with (inconsistent decisions, difficulty executing plans firmly and sticking to rules, 180-degree reversals in decisions, etc.). 
For now, it seems that both humans and AI models mostly fail the test of managing pressure and risk.<\/p><figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"187\" src=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-1024x187.png\" alt=\"\" class=\"wp-image-16106\" srcset=\"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-1024x187.png 1024w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-300x55.png 300w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-768x140.png 768w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-1536x280.png 1536w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela-600x109.png 600w, https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/tabela.png 1887w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><strong><em>The most interesting transaction stats for each model. Source: nof1.ai<\/em><\/strong><\/figcaption><\/figure>","protected":false},"excerpt":{"rendered":"<p>Six AI models squared off in a competition, each armed with $10,000 and just one chance to prove that algorithms can outperform human traders. 
Who won?<\/p>\n","protected":false},"author":687,"featured_media":16120,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"rank_math_lock_modified_date":false,"footnotes":""},"categories":[888,796,780,800],"tags":[],"popular":[],"difficulty-level":[38],"ppma_author":[998],"class_list":["post-16175","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-business-2","category-hai-premium-2","category-redakcja-poleca","category-topic-of-the-month","difficulty-level-medium"],"acf":[],"authors":[{"term_id":998,"user_id":687,"is_guest":0,"slug":"bartek-szyma","display_name":"Bartek Szyma","avatar_url":{"url":"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/bartek_szyma.jpeg","url2x":"https:\/\/haimagazine.com\/wp-content\/uploads\/2025\/11\/bartek_szyma.jpeg"},"first_name":"","last_name":"","user_url":"","job_title":"","description":"Bartek Szyma (YouTube @BartekSzyma) Professional investor, programmer, community activist, journalist. In 2010 he gave up a promising corporate career and has since made his living mainly from investing. He is currently pursuing his second passion, traveling, and looking after his investments takes him no more than an hour a week. 
In his free time, when he happens to be in Poland, he is an active volunteer with several charitable organizations."}],"_links":{"self":[{"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/posts\/16175","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/users\/687"}],"replies":[{"embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/comments?post=16175"}],"version-history":[{"count":1,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/posts\/16175\/revisions"}],"predecessor-version":[{"id":16176,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/posts\/16175\/revisions\/16176"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/media\/16120"}],"wp:attachment":[{"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/media?parent=16175"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/categories?post=16175"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/tags?post=16175"},{"taxonomy":"popular","embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/popular?post=16175"},{"taxonomy":"difficulty-level","embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/difficulty-level?post=16175"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/haimagazine.com\/en\/wp-json\/wp\/v2\/ppma_author?post=16175"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}