Performance issues with large datasets #291

Open
somegooser opened this issue Jun 29, 2023 · 6 comments

@somegooser

Hi,

I have performance issues when indexing large datasets with 50,000 records. It takes 30+ minutes.

The indexed content is not even long. It is approximately 50 characters per row.

This also happens with other datasets with only 500 rows of long text.

Any information on how to boost performance?

@stokic
Contributor

stokic commented Jun 29, 2023 via email

@somegooser
Author

Thanks for the reply.

I am using a simple dataset with 50,000+ company names, and the only customization is a custom tokenizer.

@stokic
Contributor

stokic commented Jun 29, 2023 via email

@somegooser
Author

Hi,

This is my tokenizer

```php
<?php

namespace Search;

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class Tokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // Split on any run of characters that is not a letter, digit, or hyphen.
    protected static $pattern = '/[^\p{L}\-\p{N}]+/u';

    public function tokenize($text, $stopwords = [])
    {
        if ($text === null) {
            return [];
        }

        $text = mb_strtolower($text, 'UTF-8');
        // Treat hyphens and underscores as word separators; tildes become hyphens.
        $text = str_replace(['-', '_', '~'], [' ', ' ', '-'], $text);
        $text = strip_tags($text);
        $split = preg_split($this->getPattern(), $text, -1, PREG_SPLIT_NO_EMPTY);

        return array_diff($split, $stopwords);
    }
}
```
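For illustration, here is roughly what this tokenizer produces; a minimal sketch, assuming the class above is autoloadable (the sample input is made up):

```php
$tokenizer = new \Search\Tokenizer();

// Hyphens and underscores become separators, then the text is split on
// anything that isn't a letter, digit, or hyphen:
var_dump($tokenizer->tokenize('Acme-Corp International B.V.'));
// => ['acme', 'corp', 'international', 'b', 'v']
```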

My query is super simple:

```php
$indexer->query('SELECT id, name FROM companies');
```
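For context, the full indexing setup around that query would look roughly like this (a sketch following the TNTSearch README; the connection details and index name are placeholders):

```php
use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;
$tnt->loadConfig([
    'driver'    => 'mysql',
    'host'      => 'localhost',
    'database'  => 'mydb',       // placeholder credentials
    'username'  => 'user',
    'password'  => 'pass',
    'storage'   => __DIR__ . '/storage/',
    'tokenizer' => \Search\Tokenizer::class, // the custom tokenizer above
]);

$indexer = $tnt->createIndex('companies.index');
$indexer->disableOutput = true; // suppress per-row console output while indexing
$indexer->query('SELECT id, name FROM companies');
$indexer->run();
```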

@ultrono

ultrono commented Jul 7, 2023

I'm using this package with a result set of 1.5 million records. Indexing from scratch takes ~5 minutes.

@somegooser
Author

That's crazy...

Something weird is going on anyway.

Indexing 10,000 rows takes about 20 seconds on my server, but indexing 100,000 rows of similar data (no difference in form or length) takes about 30 minutes.
Even updating the existing index is very slow, not just a complete reindex.

Could it have something to do with the size of the index file?
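One way to narrow this down would be to index growing subsets and time each run, to see whether indexing cost grows superlinearly with row count. A rough diagnostic sketch, reusing the `$tnt` config from above; the subset sizes and index names are arbitrary:

```php
foreach ([10000, 25000, 50000, 100000] as $limit) {
    $indexer = $tnt->createIndex("companies-{$limit}.index");
    $indexer->disableOutput = true;
    $indexer->query("SELECT id, name FROM companies LIMIT {$limit}");

    $start = microtime(true);
    $indexer->run();

    // If the time per run grows much faster than $limit, the slowdown is in
    // the index build itself rather than in fetching the rows.
    printf("%6d rows indexed in %.1f s\n", $limit, microtime(true) - $start);
}
```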
