This is interesting! I once trained a T5 model by removing newlines from Wikipedia text, and it worked surprisingly well; at the time, the context length was the biggest issue.
Another issue, and one not so easy to solve, was conversational dialogue-type data, which wasn’t well represented in the training data.
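For anyone curious, the setup was roughly the following; a minimal sketch of how the training pairs were built, reconstructed from memory rather than the original code (the function name is mine):

    def make_training_pair(paragraphs):
        # Target: the original Wikipedia text with its newlines intact.
        target = "\n".join(paragraphs)
        # Input: the same text with newlines collapsed to spaces, so the
        # model learns to put the paragraph breaks back.
        source = " ".join(p.strip() for p in paragraphs)
        return {"input": source, "target": target}

    pair = make_training_pair([
        "First paragraph of the article.",
        "Second paragraph, which should end up on its own line.",
    ])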
I’ve always wanted to come back to working on this problem, because I think it’s very interesting, and we’re going to have a lot of unstructured text coming out of STT models like Whisper, which do a great job of transcribing/translating but generally don’t format anything.
In case you need conversational data for the experiment you want to try, I developed an open-source CLI tool [1] that creates transcripts from voice chats on Discord. Feel free to try it out!
[1] https://github.com/naveedn/audio-transcriber
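To make the “generally don’t format anything” point concrete, here is a minimal sketch of pulling a raw transcript out of Whisper, assuming the openai-whisper Python package (the audio file name is made up):

    import whisper

    # Load a small Whisper model and transcribe a recording.
    model = whisper.load_model("base")
    result = model.transcribe("voice_chat.mp3")

    # result["text"] is a single unformatted string: no paragraphs,
    # no speaker turns, which is exactly what makes chunking hard.
    print(result["text"])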
Took me a minute to realize this is not about Chonkie. I would be interested in how this compares to Chonkie's semantic chunking approach.
You can read the labels: this one (-y) uses ModernBERT, and its GitHub repo even has an eval comparison against the other one (-ie), so you can see the measured improvement. Although if vanilla rules-based chunking is what your data needs for whatever reason, then (-ie) is still good.
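For context, "vanilla rules-based chunking" means something like the sketch below: split on sentence boundaries, then pack sentences into fixed-size chunks. This is my own illustration, not code from either library:

    import re

    def rule_based_chunks(text, max_chars=500):
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], ""
        for sentence in sentences:
            # Start a new chunk once the character budget would be exceeded.
            if current and len(current) + len(sentence) + 1 > max_chars:
                chunks.append(current)
                current = sentence
            else:
                current = (current + " " + sentence).strip()
        if current:
            chunks.append(current)
        return chunks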
That example looks terribly useless. Maybe there's an actually useful application you had in mind? I don't know, say:
Chonk("Hey I forgot my password, this is Tom from X Company") = ("Hey", "I forgot my password", "this is Tom from X Company")
Even then it doesn't quite look helpful.
This is absolutely useless. I tried a few examples yesterday using the HF demo. Completely broken.
It literally split the text in the middle of related passages while keeping unrelated passages together, even though the embedding limit was nowhere near reached.
I genuinely wanted this to work. I mean that. But nope, it did not work at all.
RAG is still broken because of chunking issues. GraphRAG doesn't work correctly either, unless you are willing to throw a lot of money at ingestion time.