Voice AI Systems Are Vulnerable to Hidden Audio Attacks

30 points | by SVI 4 hours ago

6 comments

nine_k 26 minutes ago
Isn't this the "adversarial image" attack, well known from (earlier) visual recognition models [1]? That would be quite an obvious vector.
[1]: https://www.science.org/content/article/turtle-or-rifle-hack...
- dijksterhuis 19 minutes ago
  In general, if you zoom all the way out, yes, the high-level optimization problem is very similar: find some `delta` such that `target_y = model_inference(delta + x)`, where `target_y != real_y` and `size_of(delta) < threshold`.
  But (1) older audio models typically used different architectures, like RNNs (recurrent networks), which came with additional challenges compared to the CNNs (convolutional networks) that image models used, e.g. the exploding gradients problem. during training of RNNs, vanishing gradients are a potential problem. during adversarial example (advex) optimization the problem gets inverted, and you have to do different things to solve it.
  Also (2) the human stuff related to imperceptibility is very different with audio. Ears vs eyes.
  So, they're the same, but different.
  source -- this is what my (unfinished) phd was on. i should really write up the attack that i crafted, but never got published :(
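The optimization problem described in the reply above can be sketched in a few lines. This is a minimal illustration, not the commenter's attack or the method from the linked article: the model is a hypothetical toy linear softmax classifier (`W`, `b`), `size_of(delta)` is assumed to be the largest absolute entry of `delta` (an L-infinity bound), and the search is a plain signed-gradient loop with clipping. Only `delta`, `x`, `target_y`, `model_inference`, and `threshold` come from the comment; every other name and all numbers are made up for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def model_inference(W, b, x):
    """Hypothetical stand-in model: linear classifier, returns the predicted class."""
    return int(np.argmax(W @ x + b))

def targeted_attack(W, b, x, target_y, threshold, step=0.01, iters=500):
    """Search for delta with model_inference(x + delta) == target_y.

    Assumption: size_of(delta) is max(|delta|), kept at or below `threshold`
    by clipping after every step.
    """
    delta = np.zeros_like(x)
    onehot = np.zeros(W.shape[0])
    onehot[target_y] = 1.0
    for _ in range(iters):
        p = softmax(W @ (x + delta) + b)
        # Gradient of the cross-entropy loss toward target_y w.r.t. the input.
        grad = W.T @ (p - onehot)
        # Signed-gradient descent step, then project back into the allowed box.
        delta = np.clip(delta - step * np.sign(grad), -threshold, threshold)
        if model_inference(W, b, x + delta) == target_y:
            break
    return delta

# Toy data (all values are illustrative assumptions).
W = np.array([[1.0, -1.0, 0.5, 0.0],
              [-1.0, 1.0, 0.0, 0.5]])
b = np.zeros(2)
x = np.array([0.3, 0.1, 0.2, 0.1])

real_y = model_inference(W, b, x)
target_y = 1 - real_y
delta = targeted_attack(W, b, x, target_y, threshold=0.2)
print(real_y, model_inference(W, b, x + delta), np.max(np.abs(delta)))
```

A real audio attack differs in the ways the reply lists: the model is a sequence model rather than a linear map, and the bound on `delta` would have to reflect what a listener can hear rather than a simple per-sample cap.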
wutwutwat 2 minutes ago
Related: Benn Jordan shows how to poison-pill music against AI that harvests it for training
The Art Of Poison-Pilling Music Files
https://www.youtube.com/watch?v=xMYm2d9bmEA
naveenraj-17 an hour ago
I believe that depends entirely on how the AI models store voices in their neural networks. If we could debug that, we would be able to send a secret sound that an AI model understands because of its internal connections but that makes no sense to us. Until then, there's no harm, in my view.
leonulicnik 34 minutes ago
Does this transfer to Whisper / CLAP-type audio models, or is it ASR-decoder specific? Whisper would be interesting given how widely it's used in prod.
- dijksterhuis 11 minutes ago
  Audio adv. examples didn't use to show the same degree of transferability (generate for one model, works against another) that image adv. examples were able to achieve. Likely because of the RNN architecture, or just because audio is harder :shrug:
  Whether that's changed, i don't know as i've not kept up on the literature. my best guess would be that if there is some transferability, the number of examples that can transfer to other models is limited.