How attention sinks keep language models stable

218 points | by pr337h4m 8 days ago

37 comments