
TextChunker doesn't handle Markdown Tables #371

Closed
Licantrop0 opened this issue Mar 18, 2024 · 3 comments


Licantrop0 commented Mar 18, 2024

Context / Scenario

TextChunker doesn't have any code to properly handle Markdown tables.
In fact, when searching a markdown file with tables using Memory, I often get unusable, truncated tables in the recalled partitions.

What happened?

Tables in markdown need to be chunked into a single embedding; it doesn't make sense to split the content strictly based on the token limit.

Here is an example of a recalled partition:

    no   | no   | yes   |
    #### Pros and cons
    Following are a few pros and cons to decide which pattern to use:
    [..other text...]

The top line "no | no | yes |" is a partial row of this full table:

The following table shows a summary of the main qualities for each pattern and can help you select a pattern fit for your use case.
| API qualities\patterns  | Properties and behavior described in metadata | Supports combinations of properties and behaviors | Simple query construction |
|-------------------------|-----------------------------------------------|---------------------------------------------------|---------------------------|
| Type hierarchy          | yes                                           | no                                                | no                        |
| Facets                  | partially                                     | yes                                               | yes                       |
| Flat bag                | no                                            | no                                                | yes                       |

Embedding partial table rows out of context significantly reduces the LLM's ability to ground its answers.

Here is how I would expect Markdown tables to be chunked for embedding (a rough sketch of this logic follows the list):

  1. if the embedding token limit allows, ingest the whole table plus the non-empty line above it for context
  2. if the table doesn't fit in the embedding token limit, split whole rows across multiple embeddings, repeating the header and the line above in each embedding
  3. if an entire table row doesn't fit in the embedding token limit, embed as many cells as possible, each with its column header, in one chunk
  4. if even a single cell doesn't fit in the embedding token limit, split by paragraph, but always prepend the column header.
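
As a rough illustration, here is a minimal sketch of that fallback logic. This is not Kernel Memory code: `MarkdownTable`, `TableChunker` and the `countTokens` delegate are hypothetical stand-ins, and real code would first need to parse the table out of the surrounding markdown.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Sketch of the four fallback levels above; all names here are hypothetical.
    public record MarkdownTable(string ContextLine, string HeaderRow, string SeparatorRow, List<string> Rows);

    public static class TableChunker
    {
        public static IEnumerable<string> Chunk(MarkdownTable t, int maxTokens, Func<string, int> countTokens)
        {
            // 1. Whole table + the non-empty line above it, if it fits.
            var whole = string.Join("\n", new[] { t.ContextLine, t.HeaderRow, t.SeparatorRow }.Concat(t.Rows));
            if (countTokens(whole) <= maxTokens) { yield return whole; yield break; }

            // 2. Split by whole rows, repeating the context line + header in every chunk.
            var prefix = $"{t.ContextLine}\n{t.HeaderRow}\n{t.SeparatorRow}";
            var rows = new List<string>();
            foreach (var row in t.Rows)
            {
                if (rows.Count > 0 && countTokens($"{prefix}\n{string.Join("\n", rows)}\n{row}") > maxTokens)
                {
                    yield return $"{prefix}\n{string.Join("\n", rows)}";
                    rows.Clear();
                }

                if (countTokens($"{prefix}\n{row}") <= maxTokens) { rows.Add(row); continue; }

                // 3./4. This row alone is too big: flush pending rows, then fall back to cells.
                if (rows.Count > 0) { yield return $"{prefix}\n{string.Join("\n", rows)}"; rows.Clear(); }
                foreach (var chunk in ChunkCells(t.HeaderRow, row, maxTokens, countTokens)) yield return chunk;
            }
            if (rows.Count > 0) yield return $"{prefix}\n{string.Join("\n", rows)}";
        }

        private static IEnumerable<string> ChunkCells(string headerRow, string row, int maxTokens, Func<string, int> countTokens)
        {
            var headers = headerRow.Trim('|').Split('|').Select(h => h.Trim()).ToArray();
            var cells = row.Trim('|').Split('|').Select(c => c.Trim()).ToArray();

            // 3. Pack as many "header: cell" pairs as fit into each chunk.
            var buffer = new List<string>();
            for (int i = 0; i < cells.Length && i < headers.Length; i++)
            {
                var labeled = $"{headers[i]}: {cells[i]}";
                if (buffer.Count > 0 && countTokens(string.Join("\n", buffer) + "\n" + labeled) > maxTokens)
                {
                    yield return string.Join("\n", buffer);
                    buffer.Clear();
                }

                if (countTokens(labeled) <= maxTokens) { buffer.Add(labeled); continue; }

                // 4. Even a single cell is too big: split by paragraph, prepending the column header.
                foreach (var para in cells[i].Split(new[] { "\n\n" }, StringSplitOptions.RemoveEmptyEntries))
                    yield return $"{headers[i]}: {para}";
            }
            if (buffer.Count > 0) yield return string.Join("\n", buffer);
        }
    }

A production version would also need to handle tables without a separator row, escaped pipes, and cells spanning multiple lines.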

Importance

I cannot use Kernel Memory

Platform, Language, Versions

I'm using C# with the following packages:

| Name | Version |
|------|---------|
| Microsoft.KernelMemory.Core | 0.34.240313.1 |
| Microsoft.KernelMemory.MemoryDb.AzureAISearch | 0.34.240313.1 |

Here is the code snippet:

    var openAIConfig = new OpenAIConfig()
    {
        TextModel = "gpt-4-turbo-preview",
        TextModelMaxTokenTotal = MaxTokens.Input,
        EmbeddingModel = "text-embedding-3-large",
        // note: 3072 is also this model's output dimension; its max input is 8191 tokens
        EmbeddingModelMaxTokenTotal = 3072,
        APIKey = _configuration["OpenAIKey"]!
    };

    var azureSearchConfig = new AzureAISearchConfig()
    {
        Auth = AzureAISearchConfig.AuthTypes.APIKey,
        Endpoint = _configuration["AzureSearchEndpoint"]!,
        APIKey = _configuration["AzureSearchApiKey"]!
    };

    var memoryBuilder = new KernelMemoryBuilder()
        .WithOpenAI(openAIConfig)
        .WithAzureAISearchMemoryDb(azureSearchConfig);
    memoryBuilder.Services.AddLogging(l => l.AddDebug().SetMinimumLevel(LogLevel.Trace));
    var _skMemory = memoryBuilder.Build<MemoryServerless>();

    // Ingest this only the first time. Original document: https://github.com/microsoft/api-guidelines/blob/graph/graph/GuidelinesGraph.md
    var graphGuidelinesPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Resources", "GraphGuidelines.md");
    await _skMemory.ImportDocumentAsync(graphGuidelinesPath, index: "GraphGuidelines");

    var queryResults = await _skMemory.SearchAsync(query, index: "GraphGuidelines", limit: 3);
    foreach (var partition in queryResults.Results.SelectMany(r => r.Partitions))
    {
        // handle partition.Text
    }

Relevant log output

No response

@Licantrop0 Licantrop0 added bug Something isn't working triage labels Mar 18, 2024
dluc (Collaborator) commented Mar 18, 2024

> Tables in markdown need to be chunked into a single embedding; it doesn't make sense to split the content strictly based on the token limit.

hi @Licantrop0, it's not that simple. What if a table is too big for the embedding model? I think a more advanced chunker would split by row, keeping the table header in each chunk. Even then a single row might be too big, so more complex logic is needed.
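
For example, a per-row chunk built from the table above would repeat the header, e.g.:

    | API qualities\patterns  | Properties and behavior described in metadata | Supports combinations of properties and behaviors | Simple query construction |
    |-------------------------|-----------------------------------------------|---------------------------------------------------|---------------------------|
    | Flat bag                | no                                            | no                                                | yes                       |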

@dluc dluc added feature request and removed bug Something isn't working triage labels Mar 18, 2024
Licantrop0 (Author) commented

I described how to approach the problem when the entire table doesn't fit in the embedding model, down to the single cell.
The current chunking logic makes table data completely unusable, as it breaks the table structure and loses the context (the table headers).

It's also difficult with the current Memory APIs to retrieve the nearby partitions, as was done in this old example: https://github.com/Azure-Samples/semantic-kernel-rag-chat/blob/cc51e164ac1e559e80437918c671ab6257e7c873/src/chapter2/Chapter2Function.cs#L45

For reference, the current approach is this:

    foreach (Citation.Partition partition in result.Partitions)
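
A rough sketch of what retrieving neighbors could look like with the current object model, assuming `Citation.Partition` exposes a `PartitionNumber` ordinal (`queryResults` is the `SearchResult` from the snippet above):

    // Sketch only: reconstruct neighborhood ordering from partition numbers.
    foreach (Citation result in queryResults.Results)
    {
        var ordered = result.Partitions.OrderBy(p => p.PartitionNumber).ToList();
        for (int i = 0; i < ordered.Count; i++)
        {
            // Adjacent partition numbers identify "nearby" chunks of the same document;
            // a gap means the neighbor was not recalled and would need a second lookup.
            bool hasNextNeighbor = i + 1 < ordered.Count
                && ordered[i + 1].PartitionNumber == ordered[i].PartitionNumber + 1;
            Console.WriteLine($"#{ordered[i].PartitionNumber} (next neighbor recalled: {hasNextNeighbor})");
        }
    }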

dluc (Collaborator) commented Mar 19, 2024

Thanks for the details. The existing chunker is a sample with its limitations, and we welcome improvements. The behavior with markdown files is a bare-minimum implementation, and there are improvements that could be made with regard to tables, lists, headers, etc. If someone wants to work on the feature described above or other improvements, we'd be happy to help with PR reviews.

@dluc dluc changed the title [Bug] TextChunker doesn't handle Markdown Tables correctly TextChunker doesn't handle Markdown Tables correctly Mar 19, 2024
@dluc dluc changed the title TextChunker doesn't handle Markdown Tables correctly TextChunker doesn't handle Markdown Tables Mar 19, 2024
@microsoft microsoft locked and limited conversation to collaborators Jun 5, 2024
@dluc dluc converted this issue into discussion #637 Jun 5, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
