A free model matched GPT-5.2. No fine-tuning. It rewrote its own skill files until it got there

Source: DEV Community
Kimi-K2.5 scores 21.4% on a difficult benchmark. GPT-5.2 scores 41.1%. A team gave Kimi-K2.5 a system that rewrites its own skill files after every failure. No fine-tuning. No new training data. After a few rounds, it hit 40.6%. That is GPT-5.2 territory. The paper is MetaClaw (arXiv:2603.17187). It is one of five papers published between March 15 and March 20 that all arrived at the same idea: let agents evolve their own skills.

A second team, from UCL and HKUST Guangzhou, published Memento-Skills (arXiv:2603.18743). They froze Gemini-3.1-Flash and only let it edit external Markdown skill files. On Humanity's Last Exam (HLE), a set of 2,500 expert-level questions, accuracy went from 17.9% to 38.7%. More than double. The model never changed. The skill files did.

EvoSkill, Automating Skill Acquisition, and AgentFactory shipped the same week. Five teams. Different methods. Same direction.

Why rewriting skill files works better than fine-tuning

Fine-tuning is expensive. You need data, GPUs, and a pipeline
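A skill-evolution loop, by contrast, is little more than a file write inside an agent loop. Here is a minimal sketch of the idea in Python; it is not taken from any of the five papers, and `call_model`, `run_task`, and the skill file path are hypothetical placeholders, not real APIs.

```python
from pathlib import Path

# Sketch of a skill-evolution loop: the model weights never change;
# only the Markdown skill file on disk is rewritten after each failed attempt.
# call_model, run_task, and SKILL_FILE are illustrative placeholders,
# not an interface from MetaClaw, Memento-Skills, or the other papers.

SKILL_FILE = Path("skills/web_research.md")  # hypothetical skill file

def call_model(prompt: str) -> str:
    """Placeholder for a call to a frozen LLM (any chat-completion backend)."""
    raise NotImplementedError

def run_task(task: dict, skills: str) -> tuple[bool, str]:
    """Attempt one task with the current skill text injected into the prompt."""
    answer = call_model(f"Skills:\n{skills}\n\nTask: {task['question']}")
    return answer.strip() == task["expected"], answer

def evolve_skills(tasks: list[dict], rounds: int = 3) -> None:
    for _ in range(rounds):
        skills = SKILL_FILE.read_text()
        for task in tasks:
            ok, answer = run_task(task, skills)
            if ok:
                continue
            # On failure, ask the same frozen model to rewrite its own skill file.
            revised = call_model(
                "You failed this task.\n"
                f"Task: {task['question']}\nYour answer: {answer}\n"
                f"Current skills:\n{skills}\n\n"
                "Rewrite the skill file so a future attempt succeeds. "
                "Return only the new Markdown."
            )
            SKILL_FILE.write_text(revised)
            skills = revised  # later tasks in this round see the updated skills
```

The whole "training" step is the `write_text` call: no gradients, no GPUs, just a new version of the Markdown the agent reads on its next attempt.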