This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
TextChunker doesn't handle Markdown Tables #371
Comments
hi @Licantrop0, it's not that simple. What if a table is too big for the embedding model? I think a more advanced chunker would split the table by row, keeping the header row in each chunk. Even then, a single row might be too big, so more complex logic is needed.
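The row-splitting idea above can be sketched roughly as follows. This is a hypothetical illustration, not Kernel Memory code (the project itself is C#): split an oversized Markdown table into chunks of whole rows, repeating the header and separator rows in every chunk so each embedding stays self-describing. Token counting here is a crude whitespace approximation; a real chunker would use the embedding model's tokenizer.

```python
def split_markdown_table(table: str, max_tokens: int) -> list[str]:
    """Split a Markdown table into chunks of whole rows, each chunk
    prefixed with the original header and separator rows."""
    lines = table.strip().splitlines()
    header, separator, rows = lines[0], lines[1], lines[2:]

    def count_tokens(text: str) -> int:
        # Crude stand-in for the embedding model's real tokenizer.
        return len(text.split())

    # Budget left for data rows after accounting for the repeated header.
    budget = max_tokens - count_tokens(header) - count_tokens(separator)

    chunks, current, used = [], [], 0
    for row in rows:
        cost = count_tokens(row)
        if current and used + cost > budget:
            chunks.append("\n".join([header, separator, *current]))
            current, used = [], 0
        current.append(row)
        used += cost
    if current:
        chunks.append("\n".join([header, separator, *current]))
    return chunks
```

A row larger than the budget still becomes its own chunk here; per the point above, truly oversized rows would need a further cell-level split.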
I described how to approach the problem when the entire table doesn't fit the embedding model, down to the single cell. It's also difficult with the current Memory APIs to retrieve the nearby partitions, as was done in this old example: https://github.com/Azure-Samples/semantic-kernel-rag-chat/blob/cc51e164ac1e559e80437918c671ab6257e7c873/src/chapter2/Chapter2Function.cs#L45 For reference, the current approach is this:
Thanks for the details. The existing chunker is a sample with known limitations, and we welcome improvements. The behavior with markdown files is a bare-minimum implementation, and there are improvements that could be made with regard to tables, lists, headers, etc. If someone wants to work on the feature described above or other improvements, we'd be open to helping with PR reviews.
Context / Scenario
TextChunker doesn't have any code to properly handle Markdown Tables.
In fact, when using Memory to search a markdown file containing tables, I often find unusable, truncated tables in the recalled partitions.
What happened?
Tables in markdown need to be chunked as a single embedding; it doesn't make sense to split their content strictly on a token limit.
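One way to express this requirement is a pre-pass that detects contiguous Markdown table blocks so the chunker can treat each one as an atomic unit instead of cutting it mid-row. The sketch below is illustrative only (not the TextChunker implementation): a line counts as a table row if it starts and ends with `|`, which is a simplification of the full GFM table grammar.

```python
import re

# A line is treated as a table row if it is wrapped in pipes.
# This is a simplified check, not the full GFM table grammar.
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")

def split_preserving_tables(text: str) -> list[str]:
    """Split text into blocks, keeping each Markdown table whole.
    Returns alternating non-table and table blocks, in order."""
    blocks, current, in_table = [], [], False
    for line in text.splitlines():
        is_row = bool(TABLE_ROW.match(line))
        if current and is_row != in_table:
            # Transition between table and non-table: flush the block.
            blocks.append("\n".join(current))
            current = []
        in_table = is_row
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks
```

A chunker could then apply its token-based splitting only to the non-table blocks, and handle each table block with table-aware logic (whole, or row-split with a repeated header).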
Here is an example of a recalled partition:
The top line "no | no | yes |" is a partial row of this full table:
Embedding part of table rows out of context will significantly reduce the grounding ability of the LLM to answer questions.
Here is how I would expect Markdown table embeddings to be chunked:
Importance
I cannot use Kernel Memory
Platform, Language, Versions
I'm using C# with the following packages:
Here is the code snippet:
Relevant log output
No response