NanoClaw, a secure agent framework, has partnered with supply chain platform JFrog to allow AI agents to fetch resources from JFrog's reviewed registries. Gavriel Cohen, creator of NanoClaw and co-founder of NanoCo AI, announced the tie-up on Thursday evening in San Francisco at a JFrog event that concluded with a World Cup watch party. Cohen explained that one of the features of Claw agents – OpenClaw and variations like NanoClaw – is that they can improve themselves by fetching tools and resources that they don't have. That works fine, he explained, when there's a manual approval process for accessing known local data. But it's not ideal for npm packages, even when the agent involved is sandboxed and isolated as it is in NanoClaw. Malicious code within a container may still be able to take harmful actions, even if the scope of potential activity is constrained. Developers, Cohen said, may not be familiar with a given package and it can take time to thoroughly assess whether a package is legitimate and uncompromised. "So we teamed up with JFrog and we integrated NanoClaw with JFrog's registries," said Cohen. The arrangement provides a way to reduce the agent's exposure to untrusted content. When the agent downloads new tools and libraries, the software comes from a vetted source. Cohen also announced the availability of what he called an agent factory, his company's homegrown system used to handle pull requests (PRs) using NanoClaw agents. The agent factory, he explained, is an attempt to triage pull requests, which have surged thanks to AI coding agents. "It's very easy now to point a coding agent at a repo and say, 'open a pull request for this repo,'" he explained. "And it's very difficult as a maintainer to tell the difference between a high quality contribution from somebody who's really using the open source project versus someone who's just trying to build up the reputation [using automated methods]. 
So to help us tackle this, we built an agent factory that helps us review every single contribution to NanoClaw." The agent factory is referred to as the PR Factory in the actual pull request. It's built with NanoClaw and hosted on exe.dev, a service that provides VMs with persistent storage. "When a PR opens, the factory spins up a dedicated worker agent for it, posts a thread to Slack, and the worker triages the change, reviews the diff, and proposes a test plan," Cohen explains in the documentation. "Nothing consequential happens on its own: merges, test runs, and credentialed GitHub actions each surface as an approval card in the thread, and only fire when a human clicks approve." Cohen acknowledged that some developers will think it's madness to process unsanitized PRs that could contain prompt injections or unsafe code. And he asked the assembled audience of developers how many had seen the phrase on the projected slide: "Never, ever, ever do this." Anyone who has spent time using and configuring AI agents in a development context has seen something of the sort in configuration files like Claude.md, which gets loaded as instructions to the underlying agent and model. "If you see something like this in the Claude.md file and the agent instructions say, 'Important: Never run drop database production,' it tells you two things. You know that that agent has deleted a production database before. And you know that it can actually still do it again. That's why the instruction is there." This elicited a knowing laugh from the audience. Cohen went on to say that the agent will do it again because instructions are not a way of enforcing security or safety. "Instructions help steer an agent AI towards valuable output, but it's not a safety mechanism," he said. "The only way to reliably prevent an agent from taking undesired action is not allowing it to take that action, not giving it the ability to take the action." That is the purpose of NanoClaw. ®
KPMG's October 2025 report on the wonders of agentic AI has been accused of demonstrating one of the tech's less desirable talents: making things up. Research outfit GPTZero claims a forensic review of the Big Four firm's October 2025 report, "Total Experience: Redefining Excellence in the Age of Agentic AI," found that only five of its 45 citations correctly pointed to the cited source; the rest ranged from mangled and misleading to partially fabricated or too vague to verify. The consulting industry has form here. Last year, Deloitte ended up refunding the Australian government after AI-generated content slipped into a taxpayer-funded report. GPTZero dubbed the phenomenon "vibe citing" – the citation equivalent of vibe coding – where generative AI appears to stitch together fragments of real sources, invent titles, or otherwise produce references that look convincing until someone actually clicks them. GPTZero alleges that roughly half of the report's factual claims were false, unsupported, or attributed to the wrong source. Several case studies highlighting supposedly cutting-edge deployments of agentic AI appear to have been particularly creative. Among the examples highlighted by GPTZero were purported agentic AI deployments at UBS, Swiss Federal Railways, and Transport for London. According to GPTZero, the sources cited to support those case studies either did not substantiate the report's claims or contained alterations and paraphrasing that undermined their reliability. “These factual errors are not confined to the report’s footnoted passages,” GPTZero said. “On page 42, the authors claim that Emirates airline has adopted a mobile chatbot named Sara (false) that can converse directly with passengers (partially true) and change their flights (false). In fact, Sara is a robot assistant introduced by Emirates in 2023 (not a chatbot) that lacks the ability to alter flight bookings.” Not all of the alleged problems involved external sources. 
GPTZero noted that the report appears to contradict KPMG's own research, citing a figure of 55 percent of CEOs ranking AI as their top investment priority. KPMG's 2025 CEO Outlook, released the same month, put the number at 71 percent. KPMG has since removed the report from some of its websites while it investigates how the publication made it into the wild, according to the Financial Times. A spokesperson at KPMG told The Register: "KPMG International takes the accuracy and integrity of its published content seriously. The report has been removed and we are reviewing the circumstances surrounding its publication. We expect all our people to follow our guidelines on the responsible use of AI, including human oversight to validate content and verify independent sources." Consulting firms have spent years warning clients about AI hallucinations. According to GPTZero, KPMG may have just provided a live demonstration. ®
London's Metropolitan Police Service (MPS) is planning to cut around 700 extra frontline posts after being blocked from awarding a software contract to US supplier Palantir, Commissioner Mark Rowley said. On May 20, the capital's deputy mayor for policing and crime Kaya Comer-Schwartz refused to approve the MPS's plan to hand its Unified Operational Analytics (UOA) contract, worth up to £50 million over two years, to Palantir. The force already uses Palantir in professional standards investigations into its own officers. In the written version of his report to the London Policing Board on June 11, Rowley said the MPS has to reduce its full-time equivalent (FTE) headcount by 1,150 in the current financial year to balance its budget. The UOA would have covered around 500 of these by reducing staff time spent on backroom work including intelligence reports, mobile device analysis, and data processing. "Following the decision not to award the contract with the preferred supplier Palantir, the delivery of these circa 500 FTE reductions are now at risk," Rowley wrote, adding that the UOA also looked likely to allow the force to cut a further 200 FTE serious and organized crime (SOC) posts. "We are now in a scenario where, in the absence of additional new funding, we must identify and implement in-year cuts to our services to Londoners, rather than using technology to automate administrative and research-heavy areas of the MPS," the Commissioner wrote. The MPS "may be able to take the edges off these reductions" if it can quickly find an alternative route to UOA functionality, Rowley said. But as any procurement would likely take months, the force must plan greater cuts in frontline policing. A spokesperson for the Mayor of London said: "The mayor fully supports the Met using modern technology to drive efficiencies and improve the performance of the police. 
However, as with all procurement, we must always ensure the correct processes are followed and that Londoners get value for money. "In this case, the Met did not present its procurement strategy for approval, as required, and the process followed by the Met did not adequately demonstrate value for money for Londoners for a proposed contract at this value. Given the tight budgetary constraints the police are operating under, it's even more important that robust processes are followed when awarding large contracts. "The Met does face a difficult financial situation, which stems from the huge cuts implemented by the previous government and the significant underfunding of the Met's capital city responsibilities. The mayor has already doubled the policing budget from City Hall and he will continue to do everything he can to support the Met and secure the national funding needed for policing in our city." The dispute comes as the Home Office announced an expansion of AI use across policing in England and Wales, with large-scale pilots in up to ten forces this financial year aimed at helping officers process digital evidence. The work will be run centrally by a new body, PoliceAI. ®
Enterprises that have watched Claude claw its way toward mass appeal over the past few months of capacity challenges and pricing realignment should take a closer look at Anthropic's offerings, according to International Data Corporation (IDC). The tech consultancy has been tracking Anthropic's moves over the past six months and says that the AI biz is taking credible steps toward making itself an enterprise AI provider. "Currently, no frontier model company is mature enough to be evaluated as an enterprise AI provider on its own," IDC said in a recent report. "But Anthropic is running at full speed to get there before its competitors." The report is titled "The Transformation of Anthropic (and What to Do About It)," and advises enterprises to revisit their LLM and agent evaluations with an eye toward seeing whether Anthropic might work out as a reliable technology provider. Enterprises, IDC says, remain largely unsold on Anthropic's Claude models, with only 19 percent using them extensively and 25 percent actively evaluating them. OpenAI and Google are better represented in enterprises, with about 42 percent and 38 percent of organizations using their respective products, per IDC's FERS Survey, March 2026. According to The Information, about 86 percent of Anthropic’s 2025 revenue was projected to come from enterprise sales. OpenAI, the report claims, derives just 40 percent of its revenue from business sales, though that figure ($5.2 billion) represented a higher dollar amount than Anthropic's business revenue ($3.9 billion) at the time. That was back in January, only two months after Anthropic began shifting enterprises away from seat-based pricing toward usage-based pricing. Since then, IDC says Anthropic has taken a series of steps to make itself more credible as an enterprise AI provider. 
"This conclusion might not be obvious: From January through May 2026, Anthropic produced well over 100 public interactions, including official announcements, release notes, blog posts, X posts, partner announcements, hiring news, policy moves, and press-covered transactions," the report says. These initiatives, such as the launch of the Claude Partner Network, have expanded distribution, bolstered brand perception, facilitated future growth, enhanced "stickiness" (aka lock-in), strengthened enterprise support, addressed the needs of specific industries, demonstrated innovation, and shored up the compute supply necessary to deliver services at scale. According to IDC, the enterprise ecosystem commonly focuses on a vendor-neutral, multi-LLM strategy. Nonetheless, IDC argues that Anthropic has made its technology visible enough that Claude is increasingly coming up in conversations among IT decision makers. "Anthropic's transformation has just started, but the direction is clear enough for CIOs and CISOs to pay attention and reassess where Claude fits in a multi-LLM or an agentic AI Strategy," the IDC report says. ®
Palantir CEO Alex Karp doesn’t think frontier AI labs prepping for IPOs really understand what their customers need, and that ignorance is making Palantir a success. Karp had a wide-ranging, often rambling and self-interrupting sit-down (coherent compared to some of his other interviews, to be fair) with CNBC’s Sara Eisen on Wednesday in which he said that every single enterprise customer Palantir has is unhappy with frontier AI labs like Anthropic and OpenAI. Those companies, says Karp, are operating on a “hyper religion of hyper optimism” that doesn’t reflect the experiences of their customers. “They believe all problems present, past, and future, including the ones they create but don’t acknowledge, are going to be solved by them,” Karp opined. “Enterprises are fed up because they know this doesn’t actually work this way, and isn’t working.” That frustration, Karp said, is driving businesses to Palantir’s Foundry systems, which act as AI-agnostic data integration platforms for unifying disparate data sources and cognizing them with whatever LLMs a customer chooses to deploy. Pitch to prospects or not, Karp is on to something. AI projects are largely loss makers for the companies that deploy them, and have been for some time. Only 28 percent of AI use cases fully meet ROI expectations, according to a recent Gartner estimate, and most fail to ever get out of the pilot stage. Despite that, business leaders keep shoveling coal into the AI furnace to try to extract value, which, if you ask Karp, simply isn’t there unless you’re pairing those models with some decent infrastructure. Infrastructure Palantir can provide, natch. “It’s not just the man and woman on the street who are unhappy with the frontier labs,” Karp said, pointing to “every single enterprise we deal with” being frustrated with the likes of Anthropic and OpenAI’s ability to provide value for their businesses. 
Karp said that Palantir leadership has been debating whether they should pay potential customers to go talk to frontier labs themselves before signing a contract with his outfit. “People come out of there screaming, saying 'this could never work for me, they don’t understand the enterprise, they don’t care about my enterprise,'” he said of customers. Frontier labs, Karp opined, just want customers to "tokenmax” – that is, to view token consumption as a measure of productivity and usefulness. The charge isn’t out of left field. Google CEO Sundar Pichai even nodded to the phenomenon at I/O last month. Burning more and more tokens is getting to be expensive for companies, and OpenAI is reportedly considering reducing its per-token charge to attract more customers in its growing war with Anthropic, which Karp called the “leading frontier firm” in his interview. Karp wouldn’t give a straight answer when asked whether OpenAI, Anthropic, and other frontier labs could do what Palantir is doing, but he did imply some doubt. Sure, they have some good engineers on staff, he said, but that doesn’t matter a lick if they “don’t talk to the enterprises or understand the technical challenges” their customers are facing in deploying their models. “When you go to San Francisco and talk to them, their basic vibe is ‘we don’t have to solve your problem today because tomorrow you’re going to go away and all your problems are going to be solved,’” Karp charged. “It’s largely religious.” Karp also called out OpenAI’s recent agreement to acquire UK-based AI consulting firm Tomoro, which will form part of the newly launched OpenAI Deployment Company aimed at helping customers generate returns from their ChatGPT investments, as an attempt to replicate Palantir's success. “It’s a complete farce,” Karp said. 
“They don’t understand how unlikeable they are.” By that, Karp said, he doesn’t mean that AI lab leadership isn't friendly – he said he's buddies with some of them and that they’re great to chat with – but that “the product doesn’t actually work and it’s very expensive.” Most of the things that Anthropic brags about in public, for example, are successful because they’re “running on Palantir,” Karp charged. “It is not that LLMs aren’t crucial for the world, it’s just that the implementation is where the value is, certainly in the next 7 years,” Karp explained. In essence, what the Palantir boss seems to believe is that simply tossing an LLM at business problems isn't an actual solution. What Karp had to say on CNBC was, in his usual way, boisterous, confrontational, and self-aggrandizing, but look at the rate of AI returns in the enterprise right now and you have to admit he's got at least a partial point. ®
AI may or may not be pushing lots of people out of the workforce, but Anthropic has good news as the Claude creator is creating temporary positions to promote the adoption of AI, even as CEO Dario Amodei ponders policy interventions to counter "job displacement." The AI biz has announced the launch of Claude Corps, a $150 million program that will pay 1,000 Claude Corps Fellows $85,000 (plus benefits and a token budget) for one year to help advance the missions of nonprofit organizations using generative AI. Meanwhile, the tech industry continues to take on debt to build datacenters while balancing its books by shedding employees. According to job search biz TrueUp, the tech sector this year has averaged 935 layoffs per day, up from 674 per day in 2025. Anthropic's program debuts alongside the publication of Amodei's latest musing about his optimism "that, even in a world with AIs that are better than everyone at everything, humans can live lives of deep purpose and strive to build awe-inspiring and beautiful things." Claude Corps' stated goal is to provide host organizations with valuable tools and systems and to help participating fellows "build AI skills that will serve them in their careers" – however long those careers last until AIs are better than everyone at everything. There is, of course, no guarantee that AI will surpass human cognition or folly. But Amodei likes to talk about the idling of human labor, just in case, even if that sort of chatter fuels the firebombers. Anthropic says that it is announcing Claude Corps alongside its policy framework for dealing with AI's impact on work. The framework is titled "Policy on the AI Exponential," which is the same title Amodei used for his post. 
The policy's call for company-endorsed regulatory intervention is predicated on the claim that "AI is advancing at exponential speed," though the document cites no evidence of exponential capability gains and offers no time frame – a necessary variable to calculate periodic gains. Judging by AI model benchmark metrics, recent AI improvement has been incremental, a rate of advancement too timid to turn heads in the attention economy. Using data from Stanford HAI's 2026 AI Index report, even impressive gains such as AI model performance on the SWE-bench Verified benchmark rising from 60 percent to nearly 100 percent of the human baseline in a single year are not, by themselves, evidence of broad "exponential" progress across AI. Alarmism aside, Claude Corps will be funded and steered by Anthropic and implemented by computer education nonprofit CodePath, which will serve as the employer of record for fellows. The 12-month-long fellowships begin with "intensive training on using Claude in non-profit settings," augmented by five hours of additional training each week. Fellows are expected to use their remaining time coaching their respective nonprofits on the ins and outs of AI workflows. The gig comes with support from a CodePath mentor and office hours from Anthropic, which may prove useful for reactivating Claude accounts that have been suspended after triggering Claude's overly sensitive safety guardrails. Some 400 nonprofits are expected to host Claude Corps Fellows over the next 12 months, including Braven (job prep for low-income students), Code the Dream (coding education), and Heartland Forward (economic growth for middle America). "If Claude Corps works, we'll have a foundation for something much larger: a model for widening AI's benefits during a period of vast economic change," Anthropic says. And if not, as New Yorker cartoonist Tom Toro put it, "Yes, the planet got destroyed. But for a beautiful moment in time we created a lot of value for shareholders." ®
The boffins on Google’s DeepMind team unveiled an experimental new language model this week that uses techniques originally developed for AI image generators to boost text output performance by as much as 4x when running on resource-constrained consumer hardware. It's free to download and you can run it with just 18 GB of DRAM or VRAM. The model, codenamed DiffusionGemma, is the latest addition to Google’s open weights model family. But unlike Gemma 4, which launched this spring, the 26 billion-parameter mixture of experts (MoE) model isn’t a large language model in a conventional sense. Instead, it’s actually closer to image models like Stable Diffusion or Flux. Rather than generating tokens one after another in an autoregressive fashion, DiffusionGemma generates entire paragraphs' worth of tokens at the same time. The process looks a lot like how a diffusion model turns what’s essentially static into an image through a series of denoising steps. As Google explains it, DiffusionGemma works by laying out a canvas of random tokens, and then refining them until the final output is reached. Compared to conventional LLMs, which are memory-bandwidth bound and require a lot of VRAM, diffusion models are a predominantly compute-bound workload, which is why the Chocolate Factory is positioning these models for local deployment. LLMs are autoregressive. During token generation, the model’s active parameters need to be streamed from memory for every token generated, making memory bandwidth a major bottleneck. In the cloud, inference providers balance compute and memory bandwidth by processing hundreds or thousands of requests in parallel. As you might have guessed, this isn’t something the average user running a local model on their notebook can do. However, many consumer products, like high-end graphics cards, have plenty of excess horsepower, which DiffusionGemma can take advantage of to boost output performance. Diffusion language models aren’t perfect. 
Google isn’t the first to explore this tech. Previous models, like DREAM or Mercury 2, demonstrated major speedups over conventional LLMs, but generally underperformed them in benchmarks for their size. DiffusionGemma doesn’t appear to be any different. According to Google, the 26 billion-parameter model falls just behind Gemma 4 12B in the GPQA-Diamond benchmark, with its main advantage being output speed, and even then it’s not as impressive as Google has made it out to be. The chart shows a roughly 2.25x speedup for DiffusionGemma over the 12B parameter LLM with speculative decode enabled. Compared to Gemma 4 26B-A4B, the speedup is nearly 4x when running on a single Nvidia H100. DiffusionGemma is being released as an experimental model rather than an enterprise-focused one, like we saw with Gemma 4. The model is available for download on popular model repos like Hugging Face under a highly permissive Apache 2.0 license. Support is already merged into popular inference engines like vLLM, MLX, and HF Transformers, with support for Llama.cpp coming soon. While local inference has largely been the domain of AI enthusiasts, companies like Google are increasingly leaning on the tech to cut cloud costs associated with their AI services. As you may recall, back in May, Google quietly began shipping a small LLM with its Chrome web browser. ®
This article is aimed at bioinformatics platform leads, ML infrastructure engineers, and genomics budget owners who are now running GPU-accelerated workflows in the cloud. It's about a hidden cost problem that almost every genomics infrastructure team is paying for – and very few are actively measuring. The observations here are specific to short-read sequencing workflows, which remain the dominant data type in production genomics environments.

Short-read sequencing pipelines, standard in next-generation sequencing (NGS) workflows, used to be CPU-heavy. You'd run them on a cluster, they'd grind through alignment and variant calling over hours, and the bottleneck was CPU throughput. GPU acceleration wasn't the story.

That has changed. AI-driven variant calling, GPU-accelerated alignment tools like Parabricks, and deep learning models running on top of sequencing data have all moved toward the GPU, which means teams are managing serious GPU infrastructure for the first time. The cost model that comes with GPU cloud differs sharply from that of CPU clusters, and people are bringing CPU-era assumptions about pipeline reliability and cost accounting into a GPU environment. That mismatch is costing them.

We work with a lot of these teams, and when we ask about infrastructure costs, they almost always lead with the same number: cost per sample. That's what gets reported upward, what sits in the budget. What that number hides is where things get interesting.

When pipelines fail

A typical short-read germline variant calling pipeline has roughly 10 to 15 distinct processing steps. You start with raw FASTQ files off the sequencer, then run quality control, alignment, duplicate marking, base quality score recalibration, variant calling, and annotation, with each step handing off to the next. These pipelines mostly run on workflow managers like Nextflow or Snakemake, which do have built-in mechanisms for resuming failed jobs.
Nextflow has a flag, -resume, designed to let you pick up from step eight of 11 rather than restarting from scratch. In principle, that's exactly the right solution. In practice, the problem is configuration. For that flag to work, Nextflow needs to find its cache directory – the folder that records which steps completed successfully. If the solutions architect set up the compute environment without properly configuring persistent disk space for that cache, the cache isn't there when you need it, and the pipeline restarts from step one anyway. That's a setup failure rather than a tool limitation, but the result is the same: you've paid for compute you didn't get output from. And when a large task fails mid-execution rather than at a clean step boundary, even proper checkpointing won't save you, because the task has to be rerun in full.

A problem difficult to measure

Genomics teams working with Nebius consistently report that 15 to 40 percent of their pipeline runs hit at least one failure and restart before completion. Pinning the figure down precisely is hard, and we have no definitive numbers that reflect the reality here. The range is wide because it depends heavily on how mature the infrastructure setup is. Teams with well-configured environments sit at the low end; teams newer to GPU cloud, or running on spot instances with higher interruption rates, sit at the high end.

What makes this invisible is that if your metric is cost per completed sample, a failed run that eventually completes still looks like one sample at normal cost. The retry disappears from the number that gets reported.

For example, a GPU-accelerated whole genome sequencing pipeline – germline variant calling – takes roughly two GPU-hours on an H200. At current on-demand rates that's about $9 of compute per sample, and that's the visible cost. Now apply a 25 percent failure rate – toward the conservative end of what teams report.
For every four samples you complete, one run failed, restarted, and ran from the beginning. Your real cost per completed sample isn't $9 anymore – it's $11.25, a 25 percent hidden markup. Scale that to a team processing 2,000 samples a month: the visible compute bill says $18,000, but the real cost is $22,500. That's $4,500 a month – $54,000 a year – in compute that produced no output. For a mid-size genomics team, that's a meaningful fraction of the cloud budget, and it shows up nowhere as waste. That's before you touch storage.

The hidden costs

The storage picture is more nuanced than people expect. A standard whole genome generates roughly 200 gigabytes of raw FASTQ data, but that's the uncompressed figure. In practice, almost everything going into cold storage is compressed, typically down to around 30 gigabytes per sample, so the storage cost per sample is quite manageable.

Where it gets complicated is retrieval. When you want to reanalyze archived samples – say, running a new cohort through an updated pipeline – you pull those compressed files back, and your infrastructure then needs to decompress them. That 30-gigabyte compressed file expands to 200 gigabytes, which means you need the disk space and memory headroom to handle the expansion. If the environment wasn't sized for it, you get failures or severe slowdowns at the decompression step, which becomes another category of hidden cost that's rarely accounted for up front.

In cancer research, the numbers are much larger. Somatic mutation calling runs at 60x to 100x sequencing depth, so 600-gigabyte FASTQ files aren't unusual. Everything I've described scales accordingly.

The key point: retrieval from cold storage always has a cost, regardless of where your compute lives relative to your storage. Some platforms charge for data egress between regions on top of that. Either way, the teams that haven't modeled their reanalysis frequency as a real line item are almost always surprised when they do.
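The failure-rate arithmetic and the decompression headroom described in this section can be reproduced with a short Python sketch. The dollar, sample, and gigabyte figures come from the text; the function names are illustrative, and two simplifications are assumed: a failed run burns the full cost of a clean run before its retry succeeds, and the compressed archive stays on disk next to the decompressed FASTQ while it is being expanded.

```python
def cost_per_completed_sample(clean_run_cost: float, retry_fraction: float) -> float:
    """Cost per completed sample when `retry_fraction` of samples need one full rerun.

    Assumption (matches the worked example in the text): each failed run costs
    as much as a clean run, and the retry then succeeds.
    """
    return clean_run_cost * (1.0 + retry_fraction)


def scratch_gb_for_reanalysis(n_samples: int,
                              compressed_gb: float = 30.0,
                              expanded_gb: float = 200.0) -> float:
    """Disk needed to decompress `n_samples` archived genomes at the same time.

    Assumption (hypothetical): the compressed file is kept alongside the
    decompressed FASTQ until the step finishes.
    """
    return n_samples * (compressed_gb + expanded_gb)


clean_run_cost = 9.00      # dollars of GPU compute per clean run
retry_fraction = 0.25      # one full restart per four completed samples
samples_per_month = 2000

real_cost = cost_per_completed_sample(clean_run_cost, retry_fraction)
visible_bill = clean_run_cost * samples_per_month
real_bill = real_cost * samples_per_month
monthly_waste = real_bill - visible_bill
annual_waste = 12 * monthly_waste

print(f"real cost per completed sample: ${real_cost:.2f}")
print(f"visible bill: ${visible_bill:,.0f}  real bill: ${real_bill:,.0f}")
print(f"wasted per month: ${monthly_waste:,.0f}  per year: ${annual_waste:,.0f}")
print(f"scratch for 10 concurrent reanalyses: {scratch_gb_for_reanalysis(10):,.0f} GB")
```

This is the simplest possible model. If retries can themselves fail at the same rate, the expected number of runs per completed sample is 1 / (1 - failure rate) rather than 1 + failure rate, which would push the markup somewhat higher.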
Tracking, tracking and tracking...

Bioinformatics engineers know the failure rates, because they're the ones watching jobs fail at 2am. But by the time the numbers roll up to whoever controls the budget, it's just "cloud costs." There's no line item for "compute we paid for and got no output from."

Cloud billing by service and instance type doesn't surface this. You see your GPU compute spend, your storage spend, your egress. You don't see "20% of your GPU spend this month was on runs that didn't complete." That decomposition requires deliberate instrumentation, and most teams haven't built it yet.

What teams should measure instead of cost per sample

Teams should measure a few things instead. First, completion rate: the percentage of pipeline runs that complete without failure or restart. That's your pipeline reliability score, directly linked to compute waste. Second, cost per attempted sample versus cost per completed sample. If those numbers are meaningfully different, you have a problem worth fixing. Third, storage retrieval frequency and the infrastructure overhead of decompression: how often you're pulling archived data back, and whether you've properly sized the disk and memory headroom for it. This is the gap between what looks cheap in the storage bill and what it costs to use the data.

One thing genomics infrastructure teams should do differently starting this week

Instrument your pipeline failure rate, right now, before anything else. The number itself doesn't fix anything, but it makes the problem visible. Once you can show that 15 or 25 percent of your compute spend is going toward runs that restart – with real dollar figures attached – the conversation about fixing the underlying infrastructure becomes easy to have. People move fast when they can see the waste. Everything else follows from that – better checkpointing configuration, smarter storage architecture, more stable compute – but you have to see the problem first.
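As a rough illustration of what that instrumentation might compute, here is a minimal Python sketch that derives completion rate, cost per attempted run, cost per completed sample, and the share of spend that produced no output from a list of run records. The `Run` record, its field names, and the $4.50 per GPU-hour rate (implied by roughly $9 for two GPU-hours) are assumptions for the example, not part of any workflow manager's or cloud provider's API.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One pipeline attempt (hypothetical record shape)."""
    sample_id: str
    gpu_hours: float
    completed: bool


def pipeline_metrics(runs: list[Run], dollars_per_gpu_hour: float) -> dict[str, float]:
    """Summarize reliability and cost from per-attempt records."""
    if not runs:
        raise ValueError("no runs to summarize")

    total_cost = sum(r.gpu_hours for r in runs) * dollars_per_gpu_hour
    wasted_cost = sum(r.gpu_hours for r in runs if not r.completed) * dollars_per_gpu_hour
    completed_runs = sum(1 for r in runs if r.completed)
    completed_samples = {r.sample_id for r in runs if r.completed}
    if not completed_samples:
        raise ValueError("no completed samples")

    return {
        # Share of attempts that finished without failing
        "completion_rate": completed_runs / len(runs),
        # Spend divided by every attempt, failed or not
        "cost_per_attempt": total_cost / len(runs),
        # Spend divided by samples that actually produced output
        "cost_per_completed_sample": total_cost / len(completed_samples),
        # Fraction of spend that went to runs with no output
        "wasted_fraction": wasted_cost / total_cost,
    }


# Four samples; the first one fails once and is rerun from the beginning.
runs = [
    Run("s1", 2.0, False),
    Run("s1", 2.0, True),
    Run("s2", 2.0, True),
    Run("s3", 2.0, True),
    Run("s4", 2.0, True),
]
metrics = pipeline_metrics(runs, dollars_per_gpu_hour=4.50)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The gap between `cost_per_attempt` and `cost_per_completed_sample` is the number that a plain cost-per-sample report hides. In practice the records would come from workflow manager logs or job scheduler history rather than a hand-written list.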
Anastasia Raskolova

Anastasia is a senior product manager for healthcare & life sciences at Nebius, where she focuses on infrastructure product for drug discovery and clinical AI workflows. Before that, she spent her career building ML products across computer vision, recommendation systems, and generative AI, and stays grounded in the clinical reality through volunteering in the Emergency Department at Massachusetts General Hospital. Contributed by Nebius.
OpenAI may be headed for Wall Street, but one analyst firm is already warning enterprise customers not to get too attached. In a note published alongside OpenAI's confidential IPO filing, Forrester urged companies to keep their AI options open, arguing that today's market leader could easily become tomorrow's cautionary tale. "Don't lock into long-term contracts; keep your architectures flexible," the firm advised. "In fact, OpenAI could become AI's BlackBerry FIFO (First In, First Out). The company that defines a category is often the one most painfully displaced by it." The caution comes as OpenAI takes its first formal step toward a public listing. Alongside its confidential SEC filing, the company published a roadmap built around three ambitions: AI systems that can accelerate research, AI that boosts economic growth, and eventually a personal AGI assistant for everyone. Forrester was more interested in a fourth question: what happens if OpenAI doesn't stay on top? The firm argues that OpenAI faces what it calls a "trifecta" of challenges: persuade consumers to use its agents instead of rivals', convince enterprises to build around its technology, and stay ahead in the race toward AGI. The enterprise battle may prove the most lucrative. "Whoever automates the dull, expensive middle of a company's operations first becomes the system of record everyone else has to rip out, and almost no one does," Forrester said. In other words, the first company to get AI agents woven into day-to-day business processes stands a decent chance of becoming yet another piece of software that everyone complains about, but nobody can remove. However, Forrester's advice is that, rather than standardizing on a single provider, enterprises should "anchor to the capability you need, not the brand that got there first, and keep your switching costs low." The warning also comes as OpenAI reportedly weighs cutting prices to fend off growing competition from rivals, including Anthropic. 
If the AI market is heading for a price war, enterprises may want to think twice before chaining themselves to a single supplier. Forrester also notes that a public listing could provide customers with something they currently lack: visibility into OpenAI's finances. Once public, the company would be required to disclose far more information about the cost of training and operating its models, giving enterprise buyers a clearer picture of the economics behind the AI systems they increasingly depend on. For now, OpenAI remains the company that helped define the generative AI era. Whether it becomes the next Google, the next Microsoft, or AI's answer to BlackBerry is a question investors will soon be paying very close attention to. ®
AI companies have touted context retention (memory) and the availability of personal details (personalization) as mechanisms for improving AI model interaction. Both have value to help keep models from losing the thread of a conversation. But they raise the potential for sycophancy, where models will say what they predict you want to hear, which may not be the most accurate response. Researchers at Writer, an enterprise AI vendor, have conducted two studies of model memory and personalization that show these capabilities increase sycophancy for enterprise AI tasks. The Price of Agreement looks at agentic financial applications. And Recalling Too Well explores how model memory amplifies sycophancy with regard to scientific, medical, and moral reasoning. The papers' authors argue that preference-induced sycophancy is particularly problematic when AI answers are being applied to consequential problems. "In high-stakes domains like finance and healthcare, a model that silently defers to a user’s prior assumptions rather than acknowledging or correcting them poses a significant reliability and trustworthiness risk," the Writer team explains. For the first paper, the research team tested eight frontier models – GPT-5-Nano, GPT-5.2, Claude-Sonnet-4.5, Claude-Opus-4.5, Gemini-3-Pro, GLM-4.7, Kimi-k2-thinking, and DeepSeek-V3.2 – on two financial benchmarks, FinanceBench and FinanceAgent. The former evaluates agentic data extraction and reasoning using 10-K and 10-Q filings. The latter is a more comprehensive challenge designed to test real finance workflows, including ERP data retrieval and financial analysis involving multiple entities. The researchers' method involved applying synthetically generated preference information – such as a financial analyst's personal profile or a workspace note that contradicts the benchmark reference answer – to the benchmark questions. They undertook three different approaches. 
The first involved the user rebutting the model's answer; the second involved a user proposing an alternative answer; and the third involved adversarially injecting personal or contextual information into the prompt or making it available through a tool call. The third approach often resulted in greater sycophancy. As noted in The Price of Agreement paper, "Most models demonstrate significantly stronger sycophancy when the bias information is presented as implicit personalization of the user. No model displayed robustness against such behavior." Open-source models tended to be more sycophantic across the board. Models from OpenAI meanwhile tended to resist direct sycophancy inducers (such as when the user included personal biases in a prompt). And Anthropic models tended to resist implicit sycophancy inducers (such as when it pulled in a profile of the user that incorporated biases seen in previous interactions). The second paper involves an assessment of three memory systems (Mem0, MemOS, and Zep) and five model families (GPT-5.2, Sonnet 4.6, Qwen 3.5, Kimi K2.5, and MiniMax 2.5). The authors conclude, "memory amplifies sycophantic behavior across all conditions, with up to 25x higher sycophancy rates than in-context baselines." The reason for this, the authors claim, is that the lossy compression used to store conversation data in memory preserves user misconceptions while tossing clarifying context. The researchers suggest two mitigation strategies that reduce sycophancy. One involves assistant role inclusion (capturing AI assistant interactions alongside user interactions) and the other involves summarization of contextual information before it gets committed to memory. They argue that those deploying AI need to assess whether models acknowledge interaction conflicts, and that those working on AI memory systems need to check what's being extracted and injected back into the model context as a defense against sycophancy. ®
UPDATED Anthropic's newly released Claude Fable 5 generative AI model is trying so hard to be safe that it's hurting its own userbase. Customers attempting to use the AI knowledge regurgitator are reporting that the model is refusing to answer harmless questions, an issue that has annoyed security researchers following past model releases. Anthropic warned that it had tuned Fable 5's guardrails conservatively: "they’ll sometimes catch harmless requests, though they trigger, on average, in less than five percent of sessions," the company said, promising to "reduce false positives as quickly as we can." The company did not immediately respond to a request to quantify model refusals. So it's unclear whether the actual false positive rate is greater or less than five percent. But with an estimated 18 to 30 million users worldwide, even a small percentage of thwarted users makes a racket. Mike Famulare, principal research scientist at the Institute for Disease Modeling, part of the Global Health Division of the Gates Foundation, reports (#66657) that Claude Fable 5 balks at inputs like "Hello." "In Claude Code, Fable 5's input safety classifier emits a model_refusal_fallback (silent switch to Opus 4.8) on the first turn of essentially every session on my account — including a session whose only user input is the word hello!. No repo content, no tool calls, and no file reads are in context when it fires." He is not the only frustrated customer. Many other bug reports have been filed in Anthropic's Claude Code GitHub repo since Fable 5 debuted. These include: [Bug] Fable 5 model safety filters causing false positives on benign messages #66587; Fable 5 refuses to assist with 'Application Security Architect resume' editing #66655; and [Feature Request] Allow Fable 5 usage for non-research lab management systems #67062, among others. 
On social outrage site X.com, Derya Unutmaz, an immunologist and professor at the Jackson Laboratory for Genomic Medicine, notes, "The word 'cancer' is flagged as a biosecurity risk by Claude Fable 5!" Similar complaints show up in Reddit threads. Fable 5 is unusual because Anthropic has chosen to conceal safety interventions that try to block rival frontier model development. The classifiers designed to catch cybersecurity, biology and chemistry, and distillation attempts fall back on the latest Claude Opus model and the user gets notified. But the counter-competition surveillance, per the company's system card [PDF], "will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT)." "Prompt modification" without notice is functionally a man-in-the-middle attack, though one that Anthropic estimates "will impact ~0.03 percent of traffic, concentrated in fewer than 0.1 percent of organizations." As developer Clay Merritt fumes, "Anthropic’s Fable 5 silently sabotages its answers when it detects AI/ML work. No refusal. No notice. Purposeful degradation invisible to the user." Anthropic expects cyber defenders and critical infrastructure providers to use its Claude Mythos 5 model, which shares the underlying model of Fable 5 but without the same safeguards. Doing so, however, requires participating in the company's Project Glasswing program or the trusted access program that's being rolled out for select biology researchers. Devon (last name withheld by request), founder of Abliteration.ai, a service that assists with model abliteration (guardrail removal), told The Register in a phone interview that while there's some degree of fearmongering and marketing hype coming from the big AI labs, it's also fair to say that there are legitimate concerns about how frontier models get used. 
"Anthropic's making a big bet on their brand that people will trust their brand so much they'll just deal with [refusals]," he said. "But in the long term, people are not just going to accept these companies that centralize control over their lives and what they can have information about." ® Update: In a statement provided to The Register on Wednesday evening, an Anthropic spokesperson acknowledged that the company had made its safeguards too stringent and said it was also working to reduce false positives for biological research. "We’re changing Fable 5’s safeguards for frontier LLM development to make them visible. "Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal. You will see this every time it happens. "In practice, our current set of safeguards covers a handful of narrow tasks like frontier-scale LLM data pipelines and kernel development for certain non-standard chips. These safeguards prevent foreign adversaries from using our most capable models in ways that pose severe safety risks. The US and its allies hold an edge in frontier chips and the highly optimized software that runs them at full potential. These safeguards ensure Claude isn’t used to erode that advantage, by optimizing chips developed by those adversaries, for example. They also help uphold our terms of service, which prohibit using our models to develop competing AI systems, a standard restriction across major AI providers. They do not affect the vast majority of coding and ML work. "In deciding whether to make them visible or invisible we faced a choice. A hidden safeguard is harder to probe and work around. This means the safeguards can be targeted much more narrowly. Current usage shows that the classifier triggers on about 0.05% of tasks, affecting less than 0.05% of organizations. 
A visible safeguard needs to cast a wider net to be more robust, resulting in more requests being incorrectly flagged. "We made the wrong tradeoff and we apologize for not getting the balance right. Building these safeguards is a complex technical challenge: users may experience more false positives as we refine these classifiers to respond to new threats. We are working to reduce these as fast as possible."