
Mike Vaiana
Articles
-
Jul 30, 2024 | lesswrong.com | Steve Byrnes | Marc Carauleanu | Mike Vaiana | Gunnar Zarncke
Many thanks to Bogdan-Ionut Cirstea, Steve Byrnes, Gunnar Zarncke, Jack Foxabbott and Seong Hah Cho for critical comments and feedback on earlier and ongoing versions of this work. Summary: In this post, we introduce self-other overlap training: optimizing for similar internal representations when the model reasons about itself and others, while preserving performance.
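As a rough illustration of that objective (a sketch under assumptions, not the implementation from the post): the training loss can combine the original task loss with a term that pulls the model's internal activations on matched self- and other-referencing inputs toward each other. All names below (`encode`, `self_batch`, `other_batch`, `alpha`) are hypothetical.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a self-other overlap loss term, not the authors' code.
# `encode` stands for whatever map extracts the model's internal activations;
# `self_batch` / `other_batch` are matched inputs where the model reasons
# about itself vs. about another agent.

def self_other_overlap_loss(encode, self_batch, other_batch, task_loss, alpha=0.5):
    h_self = encode(self_batch)    # internal activations on "self" prompts
    h_other = encode(other_batch)  # activations on matched "other" prompts
    # Overlap term: penalize the distance between paired representations.
    overlap = F.mse_loss(h_self, h_other)
    # Keep the original task loss in the objective to preserve performance.
    return task_loss + alpha * overlap

# Toy usage: a linear layer and random tensors stand in for a real model.
encode = torch.nn.Linear(16, 8)
self_batch, other_batch = torch.randn(4, 16), torch.randn(4, 16)
loss = self_other_overlap_loss(encode, self_batch, other_batch,
                               task_loss=torch.tensor(1.0))
loss.backward()
```

The weight `alpha` governs the trade-off: too low and representations stay distinct, too high and the overlap term can degrade task performance.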
-
Jul 10, 2024 | lesswrong.com | Mike Vaiana
Many thanks to Diogo de Lucena, Cameron Berg, Judd Rosenblatt, and Philip Gubbins for support and feedback on this post. TL;DR: Reinforcement Learning from Human Feedback (RLHF) is one of the leading methods for fine-tuning foundational models to be helpful, harmless, and honest. But it's complicated, and the standard implementation requires a pool of crowdsourced workers to provide feedback.
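To make that "standard implementation" concrete: the usual recipe trains a reward model on pairwise preference labels collected from those workers, then optimizes the policy against it with RL. A minimal sketch of the reward-model loss (the standard Bradley-Terry pairwise objective, with hypothetical tensor names, not the post's code) looks like this:

```python
import torch
import torch.nn.functional as F

# Bradley-Terry pairwise loss commonly used to train RLHF reward models:
# maximize the log-probability that the response a worker preferred
# scores higher than the response they rejected.

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: random scalars stand in for a reward model's scores on a
# batch of eight (chosen, rejected) response pairs.
chosen, rejected = torch.randn(8), torch.randn(8)
print(preference_loss(chosen, rejected))
```

Each (chosen, rejected) pair here corresponds to one human comparison, which is why the standard pipeline depends on a pool of labelers.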