
Jonas Hallgren

Featured in: bmj.com, jech.bmj.com

Articles

  • Jan 10, 2025 | lesswrong.com | Alvin Ånestrand | Jonas Hallgren

    The Alignment Mapping Program: Forging Independent Thinkers in AI Safety - A Pilot Retrospective. The AI safety field faces a critical challenge: we need researchers who can not only implement existing solutions but also forge new, independent paths. In 2023, inspired by John Wentworth's work on agency and learning from researchers like Rohin Shah and Adam Shimi who have highlighted the limitations of standard AI safety education, we launched the Alignment Mapping Program (AMP).

  • Jan 7, 2025 | lesswrong.com | Jonas Hallgren

    Okay, so after engaging a lot with Steven Byrnes' post on [Intuitive self-models] 6. Awakening / Enlightenment / PNSE, I’ve tried to think about the most important framing shift between my view and what Steven has written. We will be going through a generative model argument from an agency and predictive processing perspective. This will then lead to an understanding of meditation insights and “awakening” as changes in the self-perceived information boundary.

  • Dec 5, 2024 | lesswrong.com | Jonas Hallgren | Seth Herd

    EVERYONE, CALM DOWN! Meaning Alignment Institute just dropped their first post in basically a year and it seems like they've been up to some cool stuff. Their perspective on value alignment really grabbed my attention because it reframes our usual technical alignment conversations around rules and reward functions into something more fundamental - what makes humans actually reliably good and cooperative?

  • Apr 2, 2024 | lesswrong.com | Jonas Hallgren

    This post wouldn't have happened without this post, for which I'm forever grateful. In order to be able to steer asteroids away from Earth, we first have to be able to model them properly. This is a post on how to do this safely. Coincidentally, this would be perfect technology for Open Asteroid Impact to improve its direction models, yet as it is on the EA forum, it is obviously meant for good and can't be misused in any way.
