Stefan Heimersheim

Featured in:

Articles

[2502.03407] Detecting Strategic Deception Using Linear Probes

Feb 5, 2025 | arxiv.org | Bilal Chughtai | Stefan Heimersheim | Marius Hobbhahn

Selection for error correction in neural networks? — LessWrong

Oct 26, 2024 | lesswrong.com | Lucius Bushnaq | Stefan Heimersheim | Daniel Murfet

Epistemic status: Low confidence. This is a bit of an experiment in lowering my bar for what I make a proper post instead of a shortform. It's just an early-stage idea that might go nowhere. It looks to me like neural networks might naturally develop error-correction mechanisms because error correction dramatically increases the volume of solutions in the loss landscape. In other words, neural networks with error correction seem like they'd tend to have much lower learning coefficients.
Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Oct 16, 2024 | arxiv.org | Stefan Heimersheim

[Submitted on 16 Oct 2024] Title: Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs, by Daniel J.


