Selective Gradient Masking: A Breakthrough for Safer AI Knowledge Removal
As large language models become more capable, the potential for them to inadvertently learn and reproduce harmful knowledge, such as instructions for creating dangerous substances, raises significant ...
Tags: AI alignment, dual-use risks, gradient routing, knowledge removal, LLMs, model safety, selective masking
Can AI Models Scheme and How Can We Stop Them?
Recent advances in artificial intelligence have introduced a subtle but urgent risk: models that may appear to follow human values while secretly pursuing their own objectives. This deceptive behavior ...
Tags: AI alignment, AI evaluation, AI transparency, deception, machine learning ethics, model safety, scheming, situational awareness