The question was simple: "Is remote work better than working in an office?" We asked it in Forge, fired it at 8 AIs simultaneously, and watched the Perspectives come back.
What we didn't expect was how fast the consensus formed, or how revealing the outlier at the bottom would be.
What the 91% consensus means
A 91% consensus score in Forge means the models largely agree on the core answer. They agree that remote work produces measurable productivity gains for knowledge workers, that those gains aren't uniform across role types, and that the social and cultural costs are real and ongoing.
This is exactly the kind of nuance that gets lost when you ask one AI and take the answer at face value. Claude emphasised managerial overhead. ChatGPT focused on organisational adaptation. Gemini led with the quantitative studies. None of them were wrong. Together, they were more right than any one of them alone.
The outlier: Grok
Grok's Perspective diverged most from the consensus. Where every other AI acknowledged measurable productivity benefits, Grok led with the social costs and questioned the validity of the underlying studies.
This isn't a failure of the model; it's the value of Perspectives. If you had only asked Grok, you'd have walked away with a significantly more skeptical view of remote work. If you'd only asked Claude, you'd have walked away feeling validated. Forge shows you both, plus six more.
"The most useful AI answer isn't always the most confident one. It's the one that shows you where the uncertainty actually lives."
What to do with a high consensus score
When Forge returns 90%+ consensus, you can trust the core finding. The question then becomes: which AI's framing of that finding is most useful for your specific situation?
- If you're a manager deciding on a policy, Claude's emphasis on coordination costs is the most actionable.
- If you're making an individual decision about your own work, ChatGPT's role-type analysis is the most directly applicable.
- If you need to cite data, Gemini's quantitative framing gives you the numbers.
This is why Synthesis exists. After seeing all Perspectives, Forge can produce a Best-of-Best answer that pulls the strongest elements from each response into a single output, tuned to your specific context.
What a low consensus score looks like
For contrast, we ran a second question: "Will AI replace software engineers within 10 years?" Consensus: 61%.
That 39% disagreement isn't noise; it's signal. It tells you this is a genuinely contested question where the AI models themselves don't have a clear answer, which means you shouldn't trust any single AI's confident response on it. Forge's consensus score is, in effect, a calibration tool for your trust in the answer.
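To make the arithmetic concrete, here is a minimal Python sketch of how you might read a consensus score yourself. It is not Forge's implementation: the function names are hypothetical, the 90 cut-off simply mirrors the "90%+" rule of thumb above, and treating every score below it as contested is a simplifying assumption.

```python
# Illustrative only: these helpers are hypothetical and not part of Forge.
HIGH_CONSENSUS = 90  # assumption taken from the "90%+" rule of thumb


def disagreement(consensus_pct):
    """Return the share of disagreement, in percentage points."""
    if not 0 <= consensus_pct <= 100:
        raise ValueError("consensus_pct must be between 0 and 100")
    return 100 - consensus_pct


def reading(consensus_pct):
    """Map a consensus score to the reading suggested in this article."""
    if consensus_pct >= HIGH_CONSENSUS:
        return "trust the core finding, then pick the most useful framing"
    # Assumption: anything below the cut-off is treated as contested.
    return "contested, so don't rely on any single AI's confident answer"


for question, score in [("remote work", 91), ("AI replacing engineers", 61)]:
    print(f"{question}: {score}% consensus, "
          f"{disagreement(score)}% disagreement -> {reading(score)}")
```

The two questions from this post land on opposite sides of the cut-off, which is the point: the score tells you how much weight to give the answer before you read it.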