Gianfranco
gianfrancopiana
Last week Gumclaw made 206 commits to our repo while I slept. It fixed 13 flaky tests. I didn't write a single line of test code.

Gumclaw is Gumroad's team AI assistant. It runs on OpenClaw on a Mac mini at our Brooklyn office. It answers questions, reviews PRs, and now, apparently, fixes flaky tests.

Flaky tests are detective work with a 20-minute feedback loop. They pass locally, fail in CI, and after enough false alarms the team starts ignoring red builds. Nobody wants to fix them. So nobody does.

I wanted to see if Gumclaw could do the grinding for me. Spoiler alert: it did.

This freed me to do my job. My highest-value work is building product, not debugging why a tax test fails 1 in 20 runs. Gumclaw ran overnight while I shipped features.

Here's how you can set up the same system for yourself.

The tool

I built openclaw-autoresearch, a plugin for OpenClaw. It's a port of pi-autoresearch (by Tobi Lutke) to the OpenClaw plugin system.

The idea is simple. You give it a command that measures something. Gumclaw runs it, gets a baseline, makes a change, runs it again. If the numbers improve, it commits. If they don't, it logs what it learned and what to try next. Then it loops.

All state lives in plain files. If the session crashes, you type /autoresearch resume and Gumclaw picks up where it left off.

What happened

I pointed Gumclaw at our test suite on March 18. One week later: 206 commits, 94 CI runs, 13 merged PRs. The fixes covered race conditions, timing issues, browser session corruption, and test cleanup hooks leaking between tests.

The best find wasn't even a flaky test. It was a real bug: when remapping file IDs, A became B, then B became C, silently corrupting file references. The flake was just the symptom.

What the agent found

It was methodical. Fix a class of failures, trigger CI, log the results, move to the next class. When a fix didn't hold, it wrote down why and what to try next. Those notes fed an ideas backlog that kept it from repeating failed approaches.

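The measure, change, measure again loop, together with the notes that keep it from repeating failed approaches, can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual code: `Experiment`, `run_loop`, the JSON state layout, and the "lower score is better" rule (say, a count of failing runs) are all assumptions made for the example, and "keeping" a change stands in for a real commit.

```python
import json
from pathlib import Path
from typing import Callable, NamedTuple


class Experiment(NamedTuple):
    """A hypothetical candidate change: how to apply it and how to undo it."""
    name: str
    apply: Callable[[], None]
    revert: Callable[[], None]


def load_state(path: Path) -> dict:
    # State lives in a plain file so a crashed session can resume.
    if path.exists():
        return json.loads(path.read_text())
    return {"baseline": None, "best": None, "tried": [], "kept": [], "notes": []}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2))


def run_loop(measure: Callable[[], float],
             experiments: list[Experiment],
             state_file: Path) -> dict:
    """Try each experiment once, keeping only those that lower the score."""
    state = load_state(state_file)
    if state["baseline"] is None:
        # First run: record the baseline before changing anything.
        state["baseline"] = state["best"] = measure()
        save_state(state_file, state)

    for exp in experiments:
        if exp.name in state["tried"]:
            continue  # already attempted in this or an earlier session
        exp.apply()
        score = measure()
        if score < state["best"]:
            # Improvement: keep the change (a real system would commit here).
            state["best"] = score
            state["kept"].append(exp.name)
        else:
            # No improvement: undo the change and write down what happened.
            exp.revert()
            state["notes"].append(
                f"{exp.name}: scored {score}, best so far {state['best']}"
            )
        state["tried"].append(exp.name)
        save_state(state_file, state)  # persist after every experiment
    return state
```

Calling `run_loop` a second time with the same state file skips every experiment already listed under `tried`, which is the behavior a resume command would need.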
By experiment 20, it had built a map of which tests were flaky and why. Some fixes took multiple iterations. One tax input field went through four different approaches before Gumclaw found one that held across CI runs.

What I learned

Flaky tests are a perfect target for this. Green or red. Pass or fail. The agent ran 30+ CI cycles overnight without getting bored.

The ideas backlog is the killer feature. Every failed experiment forces Gumclaw to write down what it tried. It stops repeating mistakes.

It takes time: 206 commits for 13 PRs. Fixing a flaky test is easy. Proving it's fixed means running CI enough times to trust the flake is gone and not hiding. The loop handles that grind.

Try it now:

openclaw plugin install @gianfrancopiana/openclaw-autoresearch
/autoresearch setup

openclaw-autoresearch is open source, and we would love your contributions!

Posted Mar 26, 2026 at 4:07PM