Wise AI Advisors at the Hinge of History
Coordinating through Shared Truth
Related: Large Knowledge Models, The Choice Transition
If we create AI advisors with reliable and shared methods for finding truth, they would likely identify existential risks (potentially including risks from superintelligent AIs) and enable humanity to coordinate a response.
I have a set of recent posts (1,2,3) exploring the idea of AI advisors - the trust they have, the trust they will need, and the potential they offer to apply orders of magnitude more intelligence towards our collective knowledge.
And while this is an exciting topic to write about in and of itself, in truth the reason I keep returning to this idea is that I predict that listening to AI advisors could avert the apocalypse.
Once AI systems can themselves design and build even more capable AI systems, progress in AI might accelerate, leading to a rapid increase in AI capabilities. This is known as an intelligence explosion (“IE”).
An intelligence explosion would, as Leopold Aschenbrenner describes, “likely provide a decisive military advantage, and unfold untold powers of destruction. We will be faced with one of the most intense and volatile moments of human history.”
Cards on the table, this blog is opposed to uncontrolled intelligence explosions. In my opinion it’s kind of right there in the name - ‘explosions’ are rarely good. But reasonable people disagree! And there are real questions about whether this is an actual risk and whether there are ways to coordinate globally to avoid it.
As AI is increasingly used for all manner of intellectual work, these same systems can help answer exactly these types of questions, and give *everyone* the answers.
I posit that if:
We have trusted AI systems that people turn to as advisors
And the trust in the AI advisors is well placed because they have good epistemics
Where “good epistemics” roughly means they consistently use reliable methods to figure out what is true and avoid self-deception
And the AI advisors have shared epistemics
Where “shared epistemics” roughly means there is a shared foundation that allows different AI advisors and people to trust one another’s reasoning. [1]
Then those AI advisors would advise their principals to avoid an intelligence explosion - if and only if this is in fact a real danger - and humanity could coordinate around this advice.
Working backwards, here are two near-future scenarios that paint a picture of an optimistic path of global coordination, where AI advisors aid in extrapolating the public will and in creating stable strategic policies.
Public Preference Cascades: Every person in the world has an AI advisor that is trusted and truth-oriented. AI model providers have largely solved sycophancy and related biases - the average user might still prefer to be flattered on how insightful their questions are, but responsible development practices have shaped AI models to have a strong fiduciary attitude, where they provide advice that is in the user’s higher-order, reflective best interest.
As a global dialogue on AI and its implications unfolds, the AI advisors provide important perspectives to members of the public on the benefits and risks. Many (all?) of these users stand to lose out from an uncontrolled intelligence explosion, and the AI advisors highlight this.
Moreover, the AI advisors correctly reason that, given almost everyone in the world has access to similarly good reasoning systems, everyone will arrive at similar conclusions.
This shared common knowledge - “we all know that we all know” - triggers a preference cascade among a supermajority of people from all different backgrounds. Stronger than a poll or petition, the broad shared knowledge that this belief is endorsed by a global cross-section of people would halt the race toward recursively self-improving superintelligence.
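One way to make the cascade intuition concrete is a toy threshold model, in the spirit of Granovetter-style models of collective behavior. This is my own illustrative sketch, not something the scenario depends on: the function name `cascade_size`, the `common_knowledge` flag, and all the threshold numbers are hypothetical. Each person speaks up only once they know that at least some number of others already have.

```python
def cascade_size(thresholds, common_knowledge=True):
    """Return how many people end up publicly voicing a preference.

    thresholds[i] is the (hypothetical) number of other people that
    person i must know are already speaking up before joining in.
    If common_knowledge is False, nobody can see what others think,
    so only people with a threshold of zero ever speak.
    """
    speaking = 0
    while True:
        # What each person can observe about everyone else.
        observed = speaking if common_knowledge else 0
        updated = sum(1 for t in thresholds if t <= observed)
        if updated == speaking:
            # Nobody new joined this round: the cascade has settled.
            return speaking
        speaking = updated


# Hypothetical population of 100 with smoothly spread thresholds 0..99.
smooth = list(range(100))
full = cascade_size(smooth)                           # everyone joins
hidden = cascade_size(smooth, common_knowledge=False) # only the first mover
# A gap in the chain (nobody with threshold 1) stalls the cascade.
stalled = cascade_size([0] + [2] * 99)
```

The same private preferences produce a full cascade or almost nothing, depending only on whether people can see that others agree. In this toy picture, trusted advisors play the role of making that agreement visible.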
Aside: Our current global conversation on AI is the start of a broad, civilization-scale process of orientation and debate on the right direction for AI. But the scale, speed, and shared reasoning of that process might not be sufficient without AI advisors.
Stable Game Theory: Every government in the world is listening to AI advisors. These run the gamut from tool-shaped systems that augment existing intelligence gathering processes to more rigorously tested and secured agents akin to the advisors the general public is using.
Given how many actors - governments and corporations - have access to frontier AI systems, finding agreements that promote stability and are to everyone’s benefit is challenging. There are complex game-theoretic strategies at play; for example, Mutually Assured Malfunction, where every player has an incentive to sabotage or strike at other groups’ AI systems to prevent any one group from achieving a decisive strategic advantage.
Just like in the Cold War’s MAD, finding stable good equilibria requires that the different actors understand the game they are in.
The use of trusted AI advisors ensures that all the decision makers know both the state of play and the implications of the moves they might make. It reduces the risk that uncertainty or a trembling hand causes catastrophe, and the shared common knowledge that other actors are reflecting on the same strategic advice promotes stability. The advisors also advise on the types of hardware and institutions that can create verifiable, peaceful certainty.
The shared access to excellent strategic advice turns an adversarial game into a cooperation game, where everyone benefits from preventing runaway intelligence and harnessing safer-yet-powerful tool AIs.
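As a hedged illustration of how shared advice could change the game, here is a minimal two-player sketch. Every payoff number below is made up for illustration, and the function `pure_nash_equilibria` and the two payoff tables are my own hypothetical names. The sketch assumes that the advice works by correcting each side's beliefs about what racing actually yields, which only applies if racing is in fact dangerous.

```python
from itertools import product

ACTIONS = ("restrain", "race")


def pure_nash_equilibria(payoffs):
    """List the pure-strategy Nash equilibria of a two-player game.

    payoffs[(a, b)] is a pair: (payoff to player A, payoff to player B)
    when A plays a and B plays b.
    """
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        # A cannot do better by switching, holding B's action fixed.
        a_best = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in ACTIONS)
        # B cannot do better by switching, holding A's action fixed.
        b_best = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in ACTIONS)
        if a_best and b_best:
            equilibria.append((a, b))
    return equilibria


# Hypothetical beliefs WITHOUT shared advice: each side thinks racing
# alone wins a decisive advantage (a prisoner's dilemma).
misperceived = {
    ("restrain", "restrain"): (3, 3),
    ("restrain", "race"): (0, 4),
    ("race", "restrain"): (4, 0),
    ("race", "race"): (1, 1),
}

# Hypothetical beliefs WITH shared advice: both sides accept that racing
# is worse than restraint even for the racer, and mutual racing is ruinous.
advised = {
    ("restrain", "restrain"): (3, 3),
    ("restrain", "race"): (0, 2),
    ("race", "restrain"): (2, 0),
    ("race", "race"): (-5, -5),
}

before = pure_nash_equilibria(misperceived)  # both race
after = pure_nash_equilibria(advised)        # both restrain
```

Under the first table the only stable outcome is mutual racing; under the second it is mutual restraint. Nothing about the players' goals changed between the two tables, only their shared picture of the consequences.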
I’m not saying that either of these two scenarios is exactly what would happen if we had good trusted AI advisors; I’m not even saying that wise AI advisors would necessarily agree with halting superintelligence development! This is Not - no no bad JFC no - a suggestion that AI advisors should be created to advocate a particular policy view. The moment you write the answer at the bottom of the page for the AI system to read out, the trust and the ability to coordinate are gone.
Rather, I’m saying that we are thinking about gigantic, complex, hard-to-reason-about challenges in steering technology. We can get better answers through AI improving human reasoning, and the ubiquity of AI advisors and the potential for trust in them offer an opportunity to coordinate around those answers.
There are many reasons to be skeptical:
Sycophancy/‘Person-Specific-Epistemics’: AI advisors might end up simply entrenching personal belief systems, providing convincing but not wise arguments.
Validation: I have taken it as a given that AI advisors will produce good answers, but it might be too hard to validate that the advice and policy suggestions are in fact wise and accurate.
Bias: Different ideological biases, especially those pertaining to AI development, might creep into the advice. For instance, many of the companies developing these systems would have a strong incentive for subtly pushing against catastrophic risk concerns.
Truth-washing: Many different groups will be incentivized to say they’re developing truthful AI, and it might be too difficult to select those that are actually truthful and trustworthy.
Disempowerment: While I’ve been painting a picture of AI systems that are wise but neither superhumanly intelligent nor fully autonomous, it’s not hard to imagine these systems being the start of a slippery slope towards disempowering human decision makers and human agency.
Speed: Building trusted AI advisors with good epistemics might be so much slower than just building smarter ones that our capabilities outpace our wisdom.
All of these are challenges that will need to be overcome, and it’s not clear we’ll be able to do so.
But, on the other hand, there’s a good possibility we can make it work. At the heart of this idea I feel like there’s an underlying terrifying and yet exciting yin/yang: the blend of technological determinism and human choice.
Technological determinism: the AI models are getting better and better; there are tremendous economic incentives for this to continue.
Human choice: the trust and shared epistemics are downstream of socially constructed practices - like evaluations, auditing, and shared institutions - that are not predetermined and where different starting spots could bring us to very different places.
It seems wise to design future AI advisors with both of these forces in mind.
Written in my personal capacity. Thank you to Rafe Kennedy, Vaniver Gray, Abram Demski, Owen Cotton-Barratt, Justis Mills, and Georgia Ray for comments on earlier drafts.
[1] It’s possible that good epistemics implies shared epistemics, à la Aumann’s agreement theorem. But given that we’re not assuming the advisors are perfectly rational agents, and that shared epistemics matters for coordination, it’s worth highlighting as a necessary prerequisite.



