[X.com] by @iruletheworldmo

it’s over

turns out the rl victory lap was premature. new tsinghua paper quietly shows the fancy reward loops just squeeze the same tired reasoning paths the base model already knew. pass@1 goes up, sure, but pass@k at big k drops below the base model: the set of problems it can ever solve actually shrinks. feels like teaching a kid to ace flash cards and calling it wisdom.

so the grand “self-improving llm” dream? basically crib notes plus a roulette wheel: keep sampling long enough and the base spits the same proofs the rl champ brags about, minus the entropy tax. it’s compression, not discovery.
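for anyone who wants the mechanics behind "keep sampling long enough": a rough python sketch of the standard unbiased pass@k estimator (the codex-paper formula, not anything specific to the tsinghua paper), with attempt counts that are made up purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """unbiased pass@k: chance at least one of k draws (without
    replacement) from n sampled attempts is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# made-up toy: base model solves both problems rarely per sample;
# rl model nails problem A but has lost problem B entirely.
attempts = 256
base = [8, 8]      # correct attempts per problem, base model
rl   = [200, 0]    # correct attempts per problem, rl model

for k in (1, 128):
    base_score = sum(pass_at_k(attempts, c, k) for c in base) / len(base)
    rl_score   = sum(pass_at_k(attempts, c, k) for c in rl) / len(rl)
    print(f"k={k:3d}  base={base_score:.3f}  rl={rl_score:.3f}")
```

with those fabricated counts the rl model wins pass@1 and loses pass@128, which is the shape of the crossover being dunked on here.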

maybe the endgame isn’t better agents, just sharper funnels. we’ve been coaching silicon parrots to clear increasingly useless olympiad hurdles while mistaking overfit for insight. hard not to wonder if we’re half a decade into the world’s most expensive curve-fitting demo.
