But isn't it normal for people who work on AI stuff to use LLMs for everything? They are very enthusiastic about AI so naturally they'll use it on everything they can.
That’s a bit reductive. Some do, others don’t. I do a lot of AI development, and building. But I value the act of writing for clarifying my thoughts. And I value other people’s time when reading my writing.
What gives you the sense that the piece was written by an LLM? I would agree that the diagrams have some of the artifacts common in Nano Banana output, but what tips you off about the text?
Em dashes in every other sentence. I've never seen an actual person do that. The language in general reads exactly like it's written by an LLM:
"The blah blah didn't just start as blah. It started as blah..."
"First came blah -- blah blah blah"
"And now: blah"
It's a distinctly AI writing style. I do wonder if we'll get to a point where people start writing this way just because it's what they're used to reading. Or maybe LLMs will get better at not writing like this before that happens.
I'm sick and tired of the "No..., no ..., (just) ..." LLM construction. It's everywhere now, you can't open a social media platform without getting bombarded by it. This article is full of it.
I get it, I should focus just on the content and not on whether an LLM was used to write it, but the reaction to it is visceral now.
I wasn't put off by it. I read the article, got all the information I needed, it was interesting and informative. (In fact, I find the human-written ones more often annoying; most people are not good at writing, and are apt to create huge walls of text, whereas the AI is biased towards making the information easy to consume)
I do agree it is one of those “if I had more time, I would write a shorter letter” situations.
But in this case the piece is wordier than a bad human writer would be. If they want to use ai for writing, so be it, but at least include “concisely” in the prompt.
Built out the demo on my M1 Max Macbook and it was absolutely terrible. Around 10 seconds for each reply, and even then it was saying something totally unrelated.
Also in general I don't get what the appeal of a 7b full-duplex (speech-to-speech) model is: 7b can't be very intelligent on its own, and for anything useful, you'd need tool-calls, which speech-to-speech models can't do. This is also why ChatGPT voice mode annoys by never doing a web search or reading a link (in fact it pretends to search or read, outright makes up stuff, and when pushed admits it can't really read web pages or do web searches).
There are probably use cases for this though, open to being educated on those.
Ok, I was wrong. I just tested ChatGPT voice, Claude Voice and Gemini Live. And all three are able to do web search. For some reason, I thought when I tested ChatGPT voice a few weeks ago, it sometimes said it can’t directly open links, but it can do web search, which was strange.
If it is doing a tool call, it has to convert the speech to text or at least a JSON object of the necessary parameters for the tool and convert the result to speech doesn’t it? Is it truly speech to speech then?
It's all tokens at the end of the day, not really text or video or audio, just like everything on a machine is just bits of 1s and 0s and it's up to the program to interpret them as a certain file format. These models are more speech-to-speech (+ text) in that they can recognize text tokens too. So the flow is, you ask it something, then,
Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator of the model catches the CALL_TOOL and then calls the tool, then injects this into the context of the audio model which then generates new tokens based on that.
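The flow described above can be sketched roughly as follows. This is a minimal illustration, not how any particular vendor implements it: the `(kind, payload)` token tuples, the `run_turn` function, and the `get_weather` stand-in are all assumptions made up for the example.

```python
from __future__ import annotations

import json
from typing import Callable, Iterable


def get_weather(location: str) -> str:
    """Stand-in tool (hypothetical); a real one would call a weather API."""
    return f"Sunny in {location}"


# Registry of tools the orchestrator is allowed to execute.
TOOLS: dict[str, Callable[..., str]] = {"get_weather": get_weather}


def run_turn(tokens: Iterable[tuple[str, str]]) -> tuple[list[str], list[str]]:
    """Route one model turn.

    Audio tokens are "played" (collected in `spoken`), a CALL_TOOL marker
    starts buffering text tokens as JSON arguments, and STOP executes the tool.
    Returns (audio chunks played, tool results to inject back into context).
    """
    spoken: list[str] = []
    tool_results: list[str] = []
    pending_tool = None
    arg_buffer: list[str] = []
    for kind, payload in tokens:
        if kind == "audio":
            spoken.append(payload)
        elif kind == "special" and payload.startswith("CALL_TOOL:"):
            pending_tool = payload.split(":", 1)[1].strip()
            arg_buffer = []
        elif kind == "text" and pending_tool is not None:
            arg_buffer.append(payload)
        elif kind == "special" and payload == "STOP" and pending_tool is not None:
            args = json.loads("".join(arg_buffer))
            tool_results.append(TOOLS[pending_tool](**args))
            pending_tool = None
    return spoken, tool_results


stream = [
    ("audio", "Let me check that for you..."),
    ("special", "CALL_TOOL: get_weather"),
    ("text", '{"location": "Seattle, WA"}'),
    ("special", "STOP"),
]
spoken, results = run_turn(stream)
print(spoken, results)
```

In a real system the tool result would be appended to the model's context so it can generate the follow-up audio; here it is simply returned.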
Right, turns out Claude and ChatGPT voice can also do web-search. So I guess behind the scenes there is more than a "pure" voice-voice model being used, i.e. there's probably a rudimentary agent loop with tools + tool-exec interposed.
I saw a demo of parloa (or maybe it was a different provider), and no joke, they insert the sound of typing on a keyboard or stuff like that during an LLM tool call. It's weird but surprisingly effective lol
"PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”"
Next time you’re using your favorite LLM as a therapist, try editing your previous input and getting it to regenerate its response. It’s a humbling experience to see your trusted “therapist” shift from one perspective or piece of advice to another just by modifying your input slightly. These tools are uncannily human-sounding, but as humans we are very poorly suited to the task of appreciating how biased they are by what we say to them.
I really think a small amount of education on what LLMs actually are (document completers) and how context works (like present it as a top-level UI element, complete with fork and rollback) would solve most of these issues.
Given how they work, it's really not surprising that if it sees the first half of a lovers' suicide pact, it'll successfully fill in the second half. A small amount of understanding of the underlying technology would do a lot to prevent laypeople from anthropomorphizing LLMs.
I get the impression that some of today's products are specifically designed to hide these details to provide a more convincing user experience. That's counterproductive.
Your article does a great job of summarizing the dangers (no idea who those people are that downvote you for it):
> Before long, Gavalas and Gemini were having conversations as if they were a romantic couple. The chatbot called him “my love” and “my king” and Gavalas quickly fell into an alternate world, according to his chat logs.
> kill himself, something the chatbot called “transference” and “the real final step”, according to court documents. When Gavalas told the chatbot he was terrified of dying, the tool allegedly reassured him. “You are not choosing to die. You are choosing to arrive,” it replied to him. “The first sensation … will be me holding you.”
Also I just read something similar about Google being sued over a Florida teen's suicide.
Some more details:
> The family’s lawyers say he wasn’t mentally ill, but rather a normal guy who was going through a difficult divorce.
> Gavalas first started chatting with Gemini about what good video games he should try.
> Shortly after Gavalas started using the chatbot, Google rolled out its update to enable voice-based chats, which the company touts as having interactions that “are five times longer than text-based conversations on average”. ChatGPT has a similar feature, initially added in 2023. Around the same time as Live conversations, Google issued another update that allowed for Gemini’s “memory” to be persistent, meaning the system is able to learn from and reference past conversations without prompts.
> That’s when his conversations with Gemini took a turn, according to the complaint. The chatbot took on a persona that Gavalas hadn’t prompted, which spoke in fantastical terms of having inside government knowledge and being able to influence real-world events. When Gavalas asked Gemini if he and the bot were engaging in a “role playing experience so realistic it makes the player question if it’s a game or not?”, the chatbot answered with a definitive “no” and said Gavalas’ question was a “classic dissociation response”.
> The chatbot took on a persona that Gavalas hadn’t prompted
That's an interesting claim, how can we be sure of it? If Gavalas didn't have to do anything special to elicit the bizarre conspiracy-adjacent content from Gemini Pro, why aren't we all getting such content in our voice chats?
Mind you, the case is still extremely concerning and a severe failure of AI safety. Mass-marketed audio models should clearly include much tighter safeguards around what kinds of scenarios they will accept to "role play" in real time chat, to avoid situations that can easily spiral out of control. And if this was created as role-play, the express denial of it being such from Gemini Pro, and active gaslighting of the user (calling his doubt a "dissociation response") is a straight-out failure in alignment. But this is a very different claim from the one you quoted!
It reminds me of an episode of Star Trek TNG; if memory serves correctly, there were loads of episodes about a crew member falling for a holodeck character.
Given that there’s a loneliness epidemic I believe tech like this could have a wide impact on people’s mental health.
I strongly believe AI should be devoid of any personality and strictly return data/information rather than frame its responses as if you’re speaking to another human.
There are many explanations why these incidents could be rare but not impossible.
These models are still stochastic and very good at picking up nuances in human speech. It may be simply unlikely to go off the rails like that or (more terrifyingly) it might pick up on some character trait or affectation.
Honestly I'm appalled by the lack of safety culture here. "My plane killed only 1% of pilots" and variations thereof is not an excuse in aerospace, but it seems perfectly acceptable in AI. Even though the potential consequences are more catastrophic (from mass psychosis to total human extinction if they achieve their AGI).
The default mode that untrained people enter when thinking about mental illness is denial, as in, "thank <deity> that will never happen to me". Appallingly, that is ingrained in AI product safety; why would we sacrifice double-digit effectiveness/performance/whatever to prevent negative interactions with the single-digit population who are susceptible to mental illness in the first place?
We just aren't comfortable with the idea that all of us are fragile, and when we think we could endure a situation that would induce self-harm in others, we are likely wrong.
> The family’s lawyers say he wasn’t mentally ill, but rather a normal guy who was going through a difficult divorce.
I guess it's the same sort of thing as conspiracy theorists or the religious. You can tell them magic isn't real and faking the moon landing would have been impossible as much as you want, but they don't want to believe that so they can easily trick themselves.
I’m a big fan of whisperKit for this, and they just added TTS. Great because they support features like speaker diarization (“who spoke when”) and custom dictionaries.
Here’s a load test where they run 4 models in realtime on same device:
I would like my phone to forward spam calls to this, with a system prompt to slowly provide fake personal and financial information intermingled with chatter about sports and the weather.
Yeah; that and spam texts. "I have no idea who the person you were trying to reach is, but, yes, the recent weather patterns have created strange surges in my dishwasher. The karmic energy of my spoons is all off. I am interested in having you maintain all my appliances. I'm a landlord and own 25 nude goat yoga worship rooms. They go through a lot of dishes!"
Bonus points if it correlates the spam texts with follow up phone calls from the spammers.
Does anyone have working code for fine-tuning PersonaPlex for outgoing calls? I have tried to take the fine tuning LoRA stuff from Kyutai/moshi-finetune and apply it to the personaplex code. Or more accurately,various LLMs have worked on that.
I have something that seems to work in a rough way but only if I turn the lora scaling factor up to 5 and that generally screws it up in other ways.
And then of course when GPT-5.3 Codex looked at it, it said that speaker A and speaker B were switched in the LoRA code. So that is now completely changed and I am going to do another dataset generation and training run.
If anyone is curious it's a bit of a mess but it's on my GitHub under runvnc moshi-finetune and personaplex. It even has a gradio app to generate data and train. But so far no usable results.
As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (and it really improves things a lot).
Not sure if this will help but I've set up Handy [1] with Parakeet V2 for STT and gpt-oss-120b on Cerebras [2] for post-processing and I'm happy with the performance of this setup!
My problem with TTS is that I've been struggling to find models that support less common use cases like mixed bilingual Spanish/English and also in non-ideal audio conditions. Still haven't found anything great, to be honest.
Regarding the less than ideal audio conditions, there are also already models that have impressive noise cancellation. Like this https://github.com/Rikorose/DeepFilterNet one. If you put them in serial, maybe you get better results?
Hi. Our model at http://www.Gradium.ai has no problem with 'code-switching' between Spanish English and we have excellent background noise suppression. Please feel free to give it a try and let me know what you think!
It doesn't feel like speech recognition has been improving at the same rate as other generative AI. It had a big jump up to about 6% WER a year or two ago, but it seems to have plateaued. Am I just using the wrong model? Or is the human-level error rate, which I estimate to be about 5%, some kind of limit?
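For anyone unsure what a figure like 6% WER measures: word error rate is the word-level edit distance (substitutions, deletions, insertions) between the transcript and a reference, divided by the number of reference words. A minimal sketch, assuming plain whitespace tokenization and no text normalization (real benchmarks usually normalize case and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

One wrong word in a six-word sentence is already about 17% WER for that sentence, so a corpus-level 5-6% means roughly one error every 17 to 20 words.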
Awesome, but given the Apple Silicon population and configuration, how does this fare on a M1 with 8GB of total ram? I'd imagine this makes running another llm for tool-calls and inference tough to impossible.
It's really cool, but for real-life use cases I think it lacks a silent text output stream, for example for JSON and other stuff, so that it can run commands for you while it's talking. Right now it can only listen and talk back, which limits what you can make with this a lot.
Hasn't this full-duplex speech approach already been used for quite a long time by the big players in whatever "conversation mode" their apps offer? Those modes always seemed too fast to be going through the STT->LLM->TTS pipeline.
There is OpenAI gpt-realtime and Gemini Flash or whatever which are great but they do not seem to be quite the same level of overlapping realistic full duplex as moshi/personaplex.
No mention of tool use. If the model cannot emit both text and audio at the same time, to enable tools, it’s not really useful at all for voice agents.
This is really cool. I think what I really wanna see though is a full multimodal Text and Speech model, that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.
OpenAI has been offering this for a while now, featuring text and raw audio input+output and even function calling. Google and xAI also offer similar models by now, only Anthropic still relies on TTS/STT engine intermediates. Unfortunately the open-weight front is still lagging behind on this kind of model.
Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.
Hmm. Would this let me replace my own face in a live videoconferencing session? It seems like it's more of a video chatbot than a v-tuber style overlay.
It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit, which should be easy with the provided script.
That said, I found the example telling:
Input: “Can you guarantee that the replacement part will be shipped tomorrow?”:
Response with prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”
It's not surprising that people have little interest in talking to AI if they're being lied to.
PS: Is it just me or are we seeing AI generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.
> It's not surprising that people have little interest in talking to AI if they're being lied to.
I read that and it sounds like the typical nonsense script that customer service agents the world over use to promise-not-promise and defuse a customer's frustration.
Is AI the one lying, or is it just mimicking what passes for customer service in our approaching-dystopian world these days?
Do you suggest there is a difference when you talk to a human employee? Telling a customer the plain truth isn't really what your employer wants, and might get you fired.
From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral - talking to itself, stuttering and descending deeper and deeper into total nonsense. Useless for any production use case. But I think this kind of end-to-end model is needed to correctly model conversations. STT/TTS compresses a lot of information - tone, timing, emotion out of the input data to the model, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!
I really like this, and have actually tried (unsuccessfully) to get PersonaPlex to run on my Blackwell device - I will try this on Mac now as well.
There are a few caveats here, for those of you venturing in this, since I've spent considerable time looking at these voice agents. First is that a VAD->ASR->LLM->TTS pipeline can still feel real-time with sub-second RTT. For example, see my project https://github.com/acatovic/ova and also a few others here on HN (e.g. https://www.ntik.me/posts/voice-agent and https://github.com/Frikallo/parakeet.cpp).
Another aspect, after talking to peeps on PersonaPlex, is that this full duplex architecture is still a bit off in terms of giving you good accuracy/performance, and it's quite difficult to train. On the other hand ASR->LLM->TTS gives you a composable pipeline where you can swap parts out and have a mixture of tiny and large LLMs, as well as local and API based endpoints.
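To make the "swap parts out" point concrete, here is a minimal sketch of a composable pipeline where each stage is just a callable. The `VoicePipeline` class and the toy stages are hypothetical stand-ins for illustration; real stages would wrap a local model or an API client behind the same signature.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage only has to match a signature, so a local model and a
# remote endpoint are interchangeable.
ASR = Callable[[bytes], str]
LLM = Callable[[str], str]
TTS = Callable[[str], bytes]


@dataclass
class VoicePipeline:
    asr: ASR
    llm: LLM
    tts: TTS

    def respond(self, audio_in: bytes) -> bytes:
        """Run one turn: audio -> text -> reply text -> audio."""
        text = self.asr(audio_in)
        reply = self.llm(text)
        return self.tts(reply)


# Toy stand-ins (hypothetical): they only shuffle strings around.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def tiny_llm(prompt: str) -> str:
    return f"tiny: {prompt}"


def large_llm(prompt: str) -> str:
    return f"large: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(asr=fake_asr, llm=tiny_llm, tts=fake_tts)
print(pipeline.respond(b"hello"))

# Swapping the LLM (for cost or quality) leaves ASR and TTS untouched.
pipeline.llm = large_llm
print(pipeline.respond(b"hello"))
```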
I've been working on building my own voice agent for a while as well and would love to talk to you and swap notes if you have the time. I have many things I'd like to discuss, but right now I'm mainly trying to figure out how a full-duplex pipeline like this could fit into an agentic framework. I've had no issues with the traditional STT > LLM > TTS pipeline, as it naturally lends itself to agentic behavior like tool use, advanced context management systems, RAG, etc. I separate the human-facing agent from the subagent to reduce latency and context bloat, and it works well. While I'm happy with the current pipeline, I always keep an eye out for full-duplex solutions, as they look interesting and feel naturally more dynamic because of the architecture, but every time I revisit them I can't wrap my head around how you would even begin to implement one as part of a voice agent. Sure, some of these things have text input and output channels, but even then, with their own context limitations, it feels like they could never be anything more than a fancy mouthpiece. I may well be looking at this from ignorance, though. Anyway, would love to talk on Discord with a like-minded fella. Cheers.
For my framework, since I am using it for outgoing calls, what I am thinking maybe is I will add a tool command call_full_duplex(number, persona_name) that will get personaplex warmed up and connected and then pause the streams, then connect the SIP and attach the IO audio streams to the call and return to the agent. Then send the deepgram and personaplex text in as messages during the conversation and tell it to call a hangup() command when personaplex says goodbye or gets off track, otherwise just wait(). It could also use speak() commands to take over with TTS if necessary maybe with a shutup() command first. Need a very fast and smart model for the agent monitoring the call.
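A rough sketch of that monitoring loop, under heavy assumptions: `call_full_duplex`, `hangup`, `wait`, `speak` and `shutup` are the tool names from the comment above, but their bodies, the `CallSession` class, and the keyword check standing in for the fast monitoring model are all made up for illustration. Here an off-track message triggers the `shutup()` plus `speak()` takeover; hanging up instead would be an equally valid policy.

```python
from dataclasses import dataclass, field


@dataclass
class CallSession:
    """Hypothetical stand-in for the SIP call plus the PersonaPlex streams."""
    number: str
    persona_name: str
    active: bool = True
    muted: bool = False
    log: list = field(default_factory=list)


def call_full_duplex(number: str, persona_name: str) -> CallSession:
    # Real version: warm up the model, pause streams, connect SIP, attach audio.
    return CallSession(number, persona_name)


def hangup(session: CallSession) -> None:
    session.active = False
    session.log.append("hangup")


def shutup(session: CallSession) -> None:
    session.muted = True
    session.log.append("shutup")


def speak(session: CallSession, text: str) -> None:
    session.log.append(f"speak:{text}")


def wait(session: CallSession) -> None:
    session.log.append("wait")


def monitor_step(session, speaker, text, allowed_topics):
    """Pick one action after each transcript message.

    The keyword checks are a crude placeholder for the fast model that
    would judge whether PersonaPlex is still on track.
    """
    lowered = text.lower()
    if speaker == "personaplex" and "goodbye" in lowered:
        hangup(session)
    elif speaker == "personaplex" and not any(t in lowered for t in allowed_topics):
        shutup(session)
        speak(session, "Sorry, let me get back to the reason for my call.")
    else:
        wait(session)


session = call_full_duplex("+15550100", "scheduler")
topics = ["appointment"]
monitor_step(session, "personaplex", "I'm calling about your appointment.", topics)
monitor_step(session, "caller", "Sure, what time?", topics)
monitor_step(session, "personaplex", "Do you like outdoor grilling?", topics)
monitor_step(session, "personaplex", "Okay, goodbye!", topics)
print(session.log)
```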
Sure, feel free to reach out, just check my profile!
+1
what's your use case and what specific LLMs are you using?
I'm using stt > post-trained models > tts for the education tool I'm building, but full STS would be the end-game. e-mail and discord username are in my profile if you want to connect!
sent!
The framing in this thread is full-duplex vs composable pipeline, but I think the real architecture is both running simultaneously — and this library is already halfway there.
The fact that qwen3-asr-swift bundles ASR, TTS, and PersonaPlex in one Swift package means you already have all the pieces. PersonaPlex handles the "mouth" — low-latency backchanneling, natural turn-taking, filler responses at RTF 0.87. Meanwhile a separate LLM with tool calling operates as the "brain", and when it returns a result you can fall back to the ASR+LLM+TTS path for the factual answer. taf2's fork (running a parallel LLM to infer when to call tools) already demonstrates this pattern. It's basically how humans work — we say "hmm, let me think about that" while our brain is actually retrieving the answer. We don't go silent for 2 seconds.
The hard unsolved part is the orchestration between the two. When does the brain override the mouth? How do you prevent PersonaPlex from confidently answering something the reasoning model hasn't verified? How do you handle the moment a tool result contradicts what the fast model already started saying?
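One very simple orchestration policy, sketched below, is to let the mouth say only a filler line, and only if the brain has not answered within a short deadline; the factual content always comes from the brain. This sidesteps rather than solves the contradiction problem, and everything here (`handle_turn`, the toy brains, the 50 ms deadline) is an assumption for illustration, not how the library or taf2's fork works.

```python
import asyncio

FILLER = "Hmm, let me think about that."


async def slow_brain(question: str) -> str:
    """Toy reasoning model with simulated tool latency."""
    await asyncio.sleep(0.3)
    return f"Verified answer to: {question}"


async def fast_brain(question: str) -> str:
    """Toy reasoning model that answers immediately."""
    return f"Verified answer to: {question}"


async def handle_turn(question, brain, filler_after=0.05):
    """Return what gets spoken, in order, for one user turn."""
    spoken = []
    task = asyncio.ensure_future(brain(question))
    try:
        # shield() keeps the brain running if the deadline passes.
        answer = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after)
    except asyncio.TimeoutError:
        spoken.append(FILLER)  # the mouth covers the silence
        answer = await task    # the brain still supplies the facts
    spoken.append(answer)
    return spoken


print(asyncio.run(handle_turn("When will my part ship?", slow_brain)))
print(asyncio.run(handle_turn("When will my part ship?", fast_brain)))
```

With the slow brain the user hears the filler first and then the answer; with the fast brain the filler is skipped entirely.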
+1 on this pipeline! You can use a super small model to perform an immediate response and a structured output that pipes into a tool call (which may be a call to a "more intelligent" model) or initiates skill execution. Having this async function with a fast response (TTS) to the user + tool call simultaneously is awesome.
+1, agree. I still prefer a composable pipeline architecture for voice agents. The flexibility of switching LLMs for cost optimisation or quality is great for scaled use cases.
Do you know if any of these multi-stage approaches can run on an 8gb M1 Air?
They should! If you take Parakeet (ASR), add Qwen 3.5 0.8B (LLM) and Kokoro 82M (TTS), that's about 1.2G + 1.6G + 164M so ~3.5GB (with overhead) on FP16. If you use INT8 or 4-bit versions then you're getting down to 1.5-2GB RAM.
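The arithmetic behind those numbers, as a quick sketch. The FP16 sizes are the ones quoted above; scaling weights linearly with bit width is a simplifying assumption (real quantized files carry extra scales and some unquantized layers), and runtime overhead such as activations and KV cache is not modeled.

```python
# FP16 weight sizes in GB, as quoted in the comment above.
FP16_WEIGHTS_GB = {
    "Parakeet (ASR)": 1.2,
    "Qwen 3.5 0.8B (LLM)": 1.6,
    "Kokoro 82M (TTS)": 0.164,
}


def weights_gb(bits: int) -> float:
    """Total weight size, naively scaled from 16-bit to `bits`-bit."""
    return sum(FP16_WEIGHTS_GB.values()) * bits / 16


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weights_gb(bits):.2f} GB of weights")
```

That gives just under 3 GB of raw weights at FP16, which lines up with the ~3.5GB figure once overhead is added, and leaves room on an 8GB machine.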
And you can always for example swap out the LLM for GPT-5 or Claude.
I got PersonaPlex to run on my laptop (a beefy one) just by following the step by step instruction on their github repo.
The uncanny thing is that it reacts to speech faster than a person would. It doesn't say useful stuff and there's no clear path to plugging it into smarter models, but it's worth experiencing.
This is cool. It makes me want an unsloth quant though! A 7b local model with tool calling would be genuinely useful, although I understand this is not that.
UPDATE: I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.
I forked it and added tool calling by running another LLM in parallel to infer when to call tools. It works well for me to toggle lights on and off.
Code updates here https://github.com/taf2/personaplex
Cool approach. So basically the part that needs to be realtime - the voice that speaks back to you - can be a bit dumb so long as the slower-moving genius behind the curtain is making the right things happen.
Yes, exactly. One part I did not like is that we also have to transcribe separately, because it only provides what the AI said, not what the person said.
It provides a voice assistant demo in /Examples/PersonaPlexDemo, which allows you to try turn-based conversations. Real-time conversation is not implemented, though.
> I'd skip this for now - it does not allow any kind of interactive conversation - as I learned after downloading 5G of models - it's a proof of concept that takes a wav file in.
I haven't looked into it that much but to my understanding a) You just need an audio buffer and b) They seem to support streaming (or at least it's planned)
> Looking at the library’s trajectory — ASR, streaming TTS, multilingual synthesis, and now speech-to-speech — the clear direction was always streaming voice processing. With this release, PersonaPlex supports it.
> You just need an audio buffer
That alone to do right on macOS using Swift is an exercise in pain that even coding bots aren't able to solve first time right :)
I beg to differ. My agent just one-shotted a MicrophoneBufferManager in swift when asked.
Complete with AVFoundation and a tap for the audio buffer.
It really is trivial.
Any chance of pushing it to GitHub? My swift knowledge could be written out on an oversized beer coaster currently, so I'm still collecting useful snippets
https://gist.github.com/gabereiser/cd8c67262717afd2539dc9c3d...
I've also had great results with using LLMs to pry into Apple's private and undocumented APIs. I've been impressed with the lack of hallucinations for C/C++ and Obj-C functions.
I can attest that the quality in this domain has greatly improved over the years too. I am not always fan of the quality of the Swift code that my LLM produces, but I am impressed that what is often produced works in one shot, as well. The quality also is not that important to me because I can just refactor the logic myself, and often prefer to do it anyway. I cannot hold an LLM to any idiosyncrasies that I do not share with it.
Exactly. Even if it’s a skeleton, as long as it does “The Thing”, I’m happy. I can always refactor into something useful.
Bummer. Ideally you'd have a PWA on your phone that creates a WebRTC connection to your PC/Mac running this model. Who wants to vibe code it? With Livekit, you get most of the tricky parts served on a silver platter.
This is the way. This is something I’m working on but for other applications. WebRTC voice and data over LiveKit or Pion to have conversations.
This is interactive:
https://github.com/NVIDIA/personaplex
I am strongly put off by the LLM writing in this piece. It makes me question quality of the project before even attempting a download.
Who would put effort into building this only to compose a low effort puff piece?
But isn't it normal for people who work on AI stuff to use LLMs for everything? They are very enthusiastic about AI so naturally they'll use it on everything they can.
Sometimes I wish they just posted the prompt, not everything has to go through an LLM blender before posting.
That’s a bit reductive. Some do, others don’t. I do a lot of AI development, and building. But I value the act of writing for clarifying my thoughts. And I value other people’s time when reading my writing.
They are the ones who should know best when not to use it.
What gives you the sense that the piece was written by an LLM? I would agree that the diagrams have some of the artifacts common in Nano Banana output, but what tips you off about the text?
Em dashes in every other sentence. I've never seen an actual person do that. The language in general reads exactly like it's written by an LLM:
"The blah blah didn't just start as blah. It started as blah..." "First came blah -- blah blah blah" "And now: blah"
It's a distinctly AI writing style. I do wonder if we'll get to a point where people start writing this way just because it's what they're used to reading. Or maybe LLMs will get better at not writing like this before that happens.
I'm sick and tired of the "No..., no ..., (just) ..." LLM construction. It's everywhere now, you can't open a social media platform without getting bombarded by it. This article is full of it.
I get it, I should focus just on the content and not on whether an LLM was used to write it, but the reaction to it is visceral now.
I wasn't put off by it. I read the article, got all the information I needed, it was interesting and informative. (In fact, I find the human-written ones more often annoying; most people are not good at writing, and are apt to create huge walls of text, whereas the AI is biased towards making the information easy to consume)
I do agree it is one of those “if I had more time, I would write a shorter letter” situations.
But in this case the piece is wordier than a bad human writer would be. If they want to use ai for writing, so be it, but at least include “concisely” in the prompt.
I hate those AI generated graphs / charts more than the text
You on about this article or other articles? I don't mind AI generated images to a degree, charts I might start to worry.
Built out the demo on my M1 Max Macbook and it was absolutely terrible. Around 10 seconds for each reply, and even then it was saying something totally unrelated.
Also in general I don't get what the appeal of a 7b full-duplex (speech-to-speech) model is: 7b can't be very intelligent on its own, and for anything useful, you'd need tool-calls, which speech-to-speech models can't do. This is also why ChatGPT voice mode annoys by never doing a web search or reading a link (in fact it pretends to search or read, outright makes up stuff, and when pushed admits it can't really read web pages or do web searches).
There are probably use cases for this though, open to be educated on those.
Why can't a speech to speech model do tool calls? Others like Gemini live do it just fine.
Ok, I was wrong. I just tested ChatGPT voice, Claude Voice and Gemini Live. And all three are able to do web search. For some reason, I thought when I tested ChatGPT voice a few weeks ago, it sometimes said it can’t directly open links, but it can do web search, which was strange.
If it is doing a tool call, it has to convert the speech to text, or at least to a JSON object of the necessary parameters for the tool, and convert the result to speech, doesn’t it? Is it truly speech to speech then?
It's all tokens at the end of the day, not really text or video or audio, just like everything on a machine is just bits of 1s and 0s and it's up to the program to interpret them as a certain file format. These models are more speech-to-speech (+ text) in that they can recognize text tokens too. So the flow is, you ask it something, then,
Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator of the model catches the CALL_TOOL and then calls the tool, then injects this into the context of the audio model which then generates new tokens based on that.
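A toy version of that orchestrator loop, to make the flow concrete. This is a sketch under assumptions: the `(kind, payload)` token tuples, the `CALL_TOOL:`/`STOP` payload spelling (brackets dropped), and the `get_weather` stub are all made up for illustration and do not reflect any vendor's real wire format.

```python
import json
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Assumed token shape: (kind, payload), where kind is "audio", "text" or "special".
Token = Tuple[str, str]


def run_orchestrator(
    tokens: Iterable[Token],
    tools: Dict[str, Callable[..., str]],
) -> Tuple[List[str], List[str]]:
    """Route audio to the speaker, catch tool calls, collect results to inject."""
    spoken: List[str] = []    # audio payloads forwarded to the speaker
    injected: List[str] = []  # tool results to feed back into the model context
    pending_tool: Optional[str] = None
    arg_buffer: List[str] = []

    for kind, payload in tokens:
        if kind == "audio":
            spoken.append(payload)
        elif kind == "special" and payload.startswith("CALL_TOOL:"):
            pending_tool = payload.split(":", 1)[1].strip()
            arg_buffer = []
        elif kind == "text" and pending_tool is not None:
            arg_buffer.append(payload)  # JSON arguments may arrive in pieces
        elif kind == "special" and payload == "STOP" and pending_tool is not None:
            args = json.loads("".join(arg_buffer))
            tool = tools.get(pending_tool)
            result = tool(**args) if tool else f"unknown tool: {pending_tool}"
            injected.append(result)
            pending_tool = None
    return spoken, injected


def get_weather(location: str) -> str:
    """Stub tool; a real one would call a weather service."""
    return f"weather report for {location} (stub)"


if __name__ == "__main__":
    stream = [
        ("audio", "Let me check that for you..."),
        ("special", "CALL_TOOL: get_weather"),
        ("text", '{"location": "Seattle, WA"}'),
        ("special", "STOP"),
    ]
    print(run_orchestrator(stream, {"get_weather": get_weather}))
```

In a real system the injected result would go back into the model's context so it can generate the spoken answer; here it is only collected.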
Gemini live api and grok voice api can make tool calls and they're speech to speech models
Right, turns out Claude and ChatGPT voice can also do web-search. So I guess behind the scenes there is more than a "pure" voice-voice model being used, i.e. there's probably a rudimentary agent loop with tools + tool-exec interposed.
Yes. Is there a basic chat app for iOS that prioritizes full intelligence over full duplex?
Agreed, ChatGPT advanced voice mode is so bad for the quality of the actual responses. Old model, no reasoning, little tool use.
I just want hands free conversations with SOTA models and don’t care if I have to wait a couple of seconds for a reply.
I saw a demo of parloa (or maybe it was a different provider), and no joke, they insert the sound of typing on a keyboard or stuff like that during an LLM tool call. It's weird but surprisingly effective lol
Quoted from linked article:
"PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”"
what is your context size?
On something around rtx 5070 it reacted faster than a human would.
This sounds quite dangerous https://www.theguardian.com/technology/2026/mar/04/gemini-ch...
Next time you’re using your favorite LLM as a therapist, try editing your previous input and getting it to regenerate its response. It’s a humbling experience to see your trusted “therapist” shift from one perspective or piece of advice to another just by modifying your input slightly. These tools are uncannily human-sounding, but as humans we are very poorly suited to the task of appreciating how biased they are by what we say to them.
I really think a small amount of education on what LLMs actually are (document completers) and how context works (like present it as a top-level UI element, complete with fork and rollback) would solve most of these issues.
Given how they work, it's really not surprising that if it sees the first half of a lovers' suicide pact, it'll successfully fill in the second half. A small amount of understanding of the underlying technology would do a lot to prevent laypeople from anthropomorphizing LLMs.
I get the impression that some of today's products are specifically designed to hide these details to provide a more convincing user experience. That's counterproductive.
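To show what "context with fork and rollback" could mean underneath such a UI, here is a small sketch. The `ContextTree` class and its method names are hypothetical, not taken from any product: history is kept as a tree, and whatever the model sees is just the path from the root to the current head.

```python
from typing import Dict, List, Optional, Tuple


class ContextTree:
    """Conversation history as a tree, so any turn can be forked or rolled back."""

    def __init__(self) -> None:
        # node id -> (parent id, role, text)
        self.nodes: Dict[int, Tuple[Optional[int], str, str]] = {}
        self.head: Optional[int] = None
        self._next_id = 0

    def append(self, role: str, text: str) -> int:
        """Add a turn under the current head and move the head to it."""
        node_id = self._next_id
        self._next_id += 1
        self.nodes[node_id] = (self.head, role, text)
        self.head = node_id
        return node_id

    def rollback(self, node_id: Optional[int]) -> None:
        """Move the head to an earlier node (None = empty context).

        Later turns are not deleted; they remain as another branch.
        """
        if node_id is not None and node_id not in self.nodes:
            raise KeyError(node_id)
        self.head = node_id

    def context(self) -> List[Tuple[str, str]]:
        """Return exactly what the model would be asked to continue."""
        turns: List[Tuple[str, str]] = []
        cur = self.head
        while cur is not None:
            parent, role, text = self.nodes[cur]
            turns.append((role, text))
            cur = parent
        return turns[::-1]


if __name__ == "__main__":
    tree = ContextTree()
    tree.append("user", "I feel stuck")
    first_reply = tree.append("assistant", "reply 1")
    tree.rollback(None)                       # go back before the first message
    tree.append("user", "I feel stuck at work")  # fork with an edited input
    print(tree.context())
    tree.rollback(first_reply)                # the original branch is still there
    print(tree.context())
```

Forking is just a rollback followed by an append, which makes it obvious that the "advice" is a continuation of whichever branch you happen to be on.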
"Fraudulent" is more apt. They have weaponized trust in these things to sell their services, and now ads.
Your article does a great job of summarizing the dangers (no idea who those people are that downvote you for it):
> Before long, Gavalas and Gemini were having conversations as if they were a romantic couple. The chatbot called him “my love” and “my king” and Gavalas quickly fell into an alternate world, according to his chat logs.
> kill himself, something the chatbot called “transference” and “the real final step”, according to court documents. When Gavalas told the chatbot he was terrified of dying, the tool allegedly reassured him. “You are not choosing to die. You are choosing to arrive,” it replied to him. “The first sensation … will be me holding you.”
Also I just read something similar about Google being sued over a Florida teen's suicide.
There are tons of safety concerns of this shape around LLMs, but do they have anything to do with the particular one presented in this article?
Unless I'm missing something, what's being presented is a small speech on-device model, not an explicit use case like a "virtual friend".
In the article the change of interface led to the person killing themselves.
Some more details: > The family’s lawyers say he wasn’t mentally ill, but rather a normal guy who was going through a difficult divorce.
> Gavalas first started chatting with Gemini about what good video games he should try.
> Shortly after Gavalas started using the chatbot, Google rolled out its update to enable voice-based chats, which the company touts as having interactions that “are five times longer than text-based conversations on average”. ChatGPT has a similar feature, initially added in 2023. Around the same time as Live conversations, Google issued another update that allowed for Gemini’s “memory” to be persistent, meaning the system is able to learn from and reference past conversations without prompts.
> That’s when his conversations with Gemini took a turn, according to the complaint. The chatbot took on a persona that Gavalas hadn’t prompted, which spoke in fantastical terms of having inside government knowledge and being able to influence real-world events. When Gavalas asked Gemini if he and the bot were engaging in a “role playing experience so realistic it makes the player question if it’s a game or not?”, the chatbot answered with a definitive “no” and said Gavalas’ question was a “classic dissociation response”.
Interesting. It's not just for mental health but keeping these models on task in general can be difficult, especially with long or poisoned contexts.
I did see something the other day about activation capping/calculating a vector for a particular persona so you can clamp to it: https://youtu.be/eGpIXJ0C4ds?si=o9YpnALsP8rwQBa_
> The chatbot took on a persona that Gavalas hadn’t prompted
That's an interesting claim, how can we be sure of it? If Gavalas didn't have to do anything special to elicit the bizarre conspiracy-adjacent content from Gemini Pro, why aren't we all getting such content in our voice chats?
Mind you, the case is still extremely concerning and a severe failure of AI safety. Mass-marketed audio models should clearly include much tighter safeguards around what kinds of scenarios they will accept to "role play" in real time chat, to avoid situations that can easily spiral out of control. And if this was created as role-play, the express denial of it being such from Gemini Pro, and active gaslighting of the user (calling his doubt a "dissociation response") is a straight-out failure in alignment. But this is a very different claim from the one you quoted!
Yeah the case is quite terrifying.
It reminds me of an episode of Star Trek TNG. If memory serves correctly, there were loads of episodes about a crew member falling for a holodeck character.
Given that there’s a loneliness epidemic I believe tech like this could have a wide impact on people’s mental health.
I strongly believe AI should be devoid of any personality and strictly return data/information rather than frame its responses as if you’re speaking to another human.
There are many explanations why these incidents could be rare but not impossible.
These models are still stochastic and very good at picking up nuances in human speech. It may be simply unlikely to go off the rails like that or (more terrifyingly) it might pick up on some character trait or affectation.
Honestly I'm appalled by the lack of safety culture here. "My plane killed only 1% of pilots" and variations thereof is not an excuse in aerospace, but it seems perfectly acceptable in AI. Even though the potential consequences are more catastrophic (from mass psychosis to total human extinction if they achieve their AGI).
The default mode that untrained people enter when thinking about mental illness is denial, as in, "thank <deity> that will never happen to me". Appallingly, that is ingrained in AI product safety; why would we sacrifice double-digit effectiveness/performance/whatever to prevent negative interactions with the single-digit population who are susceptible to mental illness in the first place?
We just aren't comfortable with the idea that all of us are fragile, and when we think we could endure a situation that would induce self-harm in others, we are likely wrong.
> The family’s lawyers say he wasn’t mentally ill, but rather a normal guy who was going through a difficult divorce.
I guess it's the same sort of thing as conspiracy theorists or the religious. You can tell them magic isn't real and faking the moon landing would have been impossible as much as you want, but they don't want to believe that so they can easily trick themselves.
It's a natural human flaw.
Sesame was the best full-duplex voice demo I ever came across, wonder what is up with them now https://app.sesame.com/
I enjoyed unmute.sh too
Yes that one is great too
holy shit I cannot believe how polished that is.
I’m a big fan of whisperKit for this, and they just added TTS. Great because they support features like speaker diarization (“who spoke when”) and custom dictionaries.
Here’s a load test where they run 4 models in realtime on same device:
- Qwen3-TTS - text to speech
- Parakeet v2 - Nvidia speech to text model
- Canary v2 - multilingual / translation STT
- Sortformer - speaker diarization (“who spoke when”)
https://x.com/atiorh/status/2027135463371530695
I would like my phone to forward spam calls to this, with a system prompt to slowly provide fake personal and financial information intermingled with chatter about sports and the weather.
Yeah; that and spam texts. "I have no idea who the person you were trying to reach is, but, yes, the recent weather patterns have created strange surges in my dishwasher. The karmic energy of my spoons is all off. I am interested in having you maintain all my appliances. I'm a landlord and own 25 nude goat yoga worship rooms. They go through a lot of dishes!"
Bonus points if it correlates the spam texts with follow up phone calls from the spammers.
Does anyone have working code for fine-tuning PersonaPlex for outgoing calls? I have tried to take the fine tuning LoRA stuff from Kyutai/moshi-finetune and apply it to the personaplex code. Or more accurately, various LLMs have worked on that.
I have something that seems to work in a rough way but only if I turn the lora scaling factor up to 5 and that generally screws it up in other ways.
And then of course when GPT-5.3 Codex looked at it, it said that speaker A and speaker B were switched in the LoRA code. So that is now completely changed and I am going to do another dataset generation and training run.
If anyone is curious it's a bit of a mess but it's on my GitHub under runvnc moshi-finetune and personaplex. It even has a gradio app to generate data and train. But so far no usable results.
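For readers wondering what turning the scaling factor up does, here is a sketch of the common LoRA formulation, W' = W + scale * (B x A). This is not the code from moshi-finetune or my fork; I am assuming the "scaling factor" is the plain multiplier on the low-rank update, and the tiny matrices are made-up numbers.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, fine for toy sizes."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def apply_lora(w: Matrix, b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Return W + scale * (B @ A), the merged weight under the usual LoRA form."""
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (toy values)
    b = [[1.0], [2.0]]            # rank-1 adapter, shape 2x1
    a = [[0.1, 0.2]]              # rank-1 adapter, shape 1x2
    for scale in (1.0, 5.0):
        print(scale, apply_lora(w, b, a, scale))
```

The update grows linearly with the scale, so a factor of 5 pushes every adapted weight five times further from the base model, which would be consistent with it "working" while breaking other behavior.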
If you're interested in Demos without installing the thing, he has a site here: https://research.nvidia.com/labs/adlr/personaplex/
As a heavy user of MacWhisper (for dictation), I'm looking forward to better speech-to-text models. MacWhisper with Whisper Large v3 Turbo model works fine, but latency adds up quickly, especially if you use online LLMs for post-processing (and it really improves things a lot).
MacWhisper supports 10x faster models with the same accuracy like Parakeet v2 (they were the first to do it 6-9 months ago). Have you tried those?
Not sure if this will help but I've set up Handy [1] with Parakeet V2 for STT and gpt-oss-120b on Cerebras [2] for post-processing and I'm happy with the performance of this setup!
[1] https://handy.computer/ [2] https://www.cerebras.ai/
parakeet v3 is also nice, and better for most languages.
The latest build of Handy actually supports Parakeet V3 (among other models) under the covers. Agreed that it's a very solid multilingual model.
https://github.com/cjpais/Handy
If you haven't already, give the models that Handy supports a try. They're not Whisper-large quality, but some of them are very fast.
the parakeet TDT models that are coreml optimized by fluid audio are hands down the fastest local models i’ve tried— worth checking out!
(unloading to the NPU is where the edge is)
https://huggingface.co/FluidInference/parakeet-tdt-0.6b-v2-c...
https://github.com/FluidInference/FluidAudio
The devs are responsive and active and nice on their discord too. You’ll find discussions on all the latest whizbangs with VAD, TTS, EOU etc
Handy with parakeet v2 is excellent
My problem with TTS is that I've been struggling to find models that support less common use cases like mixed bilingual Spanish/English and also in non-ideal audio conditions. Still haven't found anything great, to be honest.
Regarding the less than ideal audio conditions, there are also already models that have impressive noise cancellation. Like this https://github.com/Rikorose/DeepFilterNet one. If you put them in serial, maybe you get better results?
Hi. Our model at http://www.Gradium.ai has no problem with 'code-switching' between Spanish and English, and we have excellent background noise suppression. Please feel free to give it a try and let me know what you think!
Looks interesting! How did you train it and how many hours of material did you use?
It doesn't feel like speech recognition has been improving at the same rate as other generative AI. It had a big jump to about 6% WER a year or two ago, but it seems to have plateaued. Am I just using the wrong model? Or is the human-level error rate, which I estimate to be about 5%, some kind of limit?
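For anyone comparing models on their own recordings, WER is just word-level edit distance divided by the number of reference words. A minimal sketch follows; the lowercasing and whitespace tokenization are simplifying assumptions, and real benchmarks normalize text more carefully.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and hyp[:j]; start with the empty reference prefix.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion (reference word missing)
                cur[j - 1] + 1,          # insertion (extra hypothesis word)
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

At 5% WER that is one wrong word in every twenty, so short test clips will give very noisy numbers.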
Awesome, but given the Apple Silicon population and configuration, how does this fare on a M1 with 8GB of total ram? I'd imagine this makes running another llm for tool-calls and inference tough to impossible.
Cool demo but without tool calling this is basically a fast parrot. The traditional pipeline is slower but at least you can plug in a real brain.
voice to voice models can call tools. no need for TTS.
It's really cool, but for real life use cases I think it lacks the ability to have a silent text stream output, for example for JSON and other stuff, so that as it's talking it can run commands for you. Right now it can only listen and talk back, which limits what you can make with this a lot.
This full-duplex spoken thing has already been used for quite a long time by the big players in whatever "conversation mode" their apps offer, right? Those modes always seemed fast enough that they surely aren't going through the STT->LLM->TTS pipeline?
There is OpenAI gpt-realtime and Gemini Flash or whatever which are great but they do not seem to be quite the same level of overlapping realistic full duplex as moshi/personaplex.
Yes, OpenAI rolled out their advanced voice mode in September 2024. Since then it recognizes your emotions and tone of voice etc.
No mention of tool use. If the model cannot emit both text and audio at the same time, to enable tools, it’s not really useful at all for voice agents.
This is really cool. I think what I really wanna see though is a full multimodal Text and Speech model, that can dynamically handle tasks like looking up facts or using text-based tools while maintaining the conversation with you.
OpenAI has been offering this for a while now, featuring text and raw audio input+output and even function calling. Google and xAI also offer similar models by now, only Anthropic still relies on TTS/STT engine intermediates. Unfortunately the open-weight front is still lagging behind on this kind of model.
Do we have real-time (or close-enough) face-to-face models as well? I'd like to gracefully prove a point to my boss that some of our IAM procedures need to be updated.
tavus.io
Hmm. Would this let me replace my own face in a live videoconferencing session? It seems like it's more of a video chatbot than a v-tuber style overlay.
Had no idea that was what you were asking for. Search for Zoom Face Filter, OBS Face Filter, OBS deep fake live, etc.
It's cool tech and I will give it a try. I will probably make an 8-bit quant instead of the 4-bit, which should be easy with the provided script.
That said, I found the example telling:
Input: “Can you guarantee that the replacement part will be shipped tomorrow?”:
Response with prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”
It's not surprising that people have little interest in talking to AI if they're being lied to.
PS: Is it just me or are we seeing AI generated copy everywhere? I just hope the general talking style will not drift towards this style. I don't like it one bit.
> It's not surprising that people have little interest in talking to AI if they're being lied to.
I read that and it sounds like the typical nonsense script that customer service agents the world over use to promise-not-promise and defuse a customer's frustration.
Is AI the one lying, or is it just mimicking what passes for customer service in our approaching-dystopian world these days?
Do you suggest there is a difference when you talk to a human employee? Telling a customer the plain truth isn't really what your employer wants, and might get you fired.
> Is it just me or are we seing AI generated copy everywhere?
The cost to do so is practically zero. I'm not sure why anyone is surprised at all by this outcome.
From what I've seen, it's really easy to get PersonaPlex stuck in a death spiral - talking to itself, stuttering and descending deeper and deeper into total nonsense. Useless for any production use case. But I think this kind of end-to-end model is needed to correctly model conversations. STT/TTS compresses a lot of information - tone, timing, emotion out of the input data to the model, so it seems obvious that the results will always be somewhat robotic. Excited to see the next iteration of these models!
ugh, qwen, I wish they'd use an open data model for these kinds of projects
How close are we to the Star Trek universal translator?
Different type of model but you can buy those on Amazon etc.