What logic is used by "Chunk Separators"?

The “Chunk String” transform block in Pipeline Builder comes with an option to add separators. The text below explains that Pipeline Builder will TRY to keep paragraphs, sentences, and words together.

What is the logic behind TRYING? Is this some badness measure, or just “split on the nearest separator”?

We use an implementation based on the algorithm used by java-langchain, found here. We don’t follow it to the letter, but the idea is the same.

So basically, you find the chunk separator nearest to the size limit that still keeps the chunk under it. There is no hierarchy of separators; each is as good as the next one.
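
To make that concrete, here is a minimal sketch of the idea in Python. It is not the Pipeline Builder or java-langchain code: the function name `chunk_string`, its parameters, the choice to keep each separator at the end of its chunk, and the hard cut at the limit when no separator fits are all assumptions made for illustration.

```python
def chunk_string(text: str, max_size: int, separators: list[str]) -> list[str]:
    """Greedy split: cut at the last separator that keeps the chunk within max_size.

    Hypothetical helper for illustration only, not the actual implementation.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")

    chunks = []
    start = 0
    while len(text) - start > max_size:
        window_end = start + max_size
        cut = -1
        # Every separator is treated equally: whichever one ends closest
        # to the limit wins, regardless of its position in the list.
        for sep in separators:
            if not sep:
                continue  # skip empty separators, they would match everywhere
            idx = text.rfind(sep, start, window_end)
            if idx != -1:
                cut = max(cut, idx + len(sep))
        if cut == -1:
            # Assumption: with no separator inside the window, cut at the limit.
            cut = window_end
        chunks.append(text[start:cut])
        start = cut

    if start < len(text):
        chunks.append(text[start:])
    return chunks


print(chunk_string("one. two three", 10, ["\n\n", ". ", " "]))
# ['one. two ', 'three']
```

In this example the plain space wins over the sentence separator `". "` because it sits closer to the 10-character limit, which is what "no hierarchy" means in practice: a sentence boundary is not preferred over a word boundary.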
