
Joe Carlsmith

Featured in: substack.com

Articles

  • 2 months ago | lesswrong.com | Joe Carlsmith

    (Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.) “There comes a moment when the children who have been playing at burglars hush suddenly: was that a real footstep in the hall?” - C.S. Lewis. Sometimes, my thinking feels more “real” to me; and sometimes, it feels more “fake.” I want to do the real version, so I want to understand this spectrum better. This essay offers some reflections. I give a bunch of examples of this “fake vs.

  • Nov 12, 2024 | lesswrong.com | Joe Carlsmith

    (This is the fourth in a series of four posts about how we might solve the alignment problem. See the first post for an introduction to this project and a summary of the series as a whole.) In the first post in this series, I distinguished between “motivation control” (trying to ensure that a superintelligent AI’s motivations have specific properties) and “option control” (trying to ensure that a superintelligent AI’s options have specific properties).

  • Nov 4, 2024 | lesswrong.com | Joe Carlsmith

    (This is the third in a series of four posts about how we might solve the alignment problem. See the first post for an introduction to this project and a summary of the posts that have been released thus far.) In the first post in this series, I distinguished between “motivation control” (trying to ensure that a superintelligent AI’s motivations have specific properties) and “option control” (trying to ensure that a superintelligent AI’s options have specific properties).

  • Oct 30, 2024 | lesswrong.com | Joe Carlsmith

    (This is the second in a series of four posts about how we might solve the alignment problem. See the first post for an introduction to this project and a summary of the posts that have been released thus far.) In my last post, I laid out the ontology I’m going to use for thinking about approaches to solving the alignment problem (that is, the problem of building superintelligent AI agents, and becoming able to elicit their beneficial capabilities, without succumbing to the bad kind of AI takeover).

  • Oct 28, 2024 | lesswrong.com | Joe Carlsmith

    In my last post, I laid out my picture of what it would even be to solve the alignment problem. In this series of posts, I want to talk about how we might solve it. To be clear: I don’t think that humans necessarily need to solve the whole problem – at least, not on our own.
