Performance issues with large datasets #291

Open
somegooser opened this issue Jun 29, 2023 · 6 comments

@somegooser

Hi,

I have performance issues when indexing large datasets with 50,000 records. It takes 30+ minutes.

The indexed content is not even long. It is approximately 50 characters per row.

This also happens with other datasets with only 500 rows of long text.

Any information on how to boost performance?

@stokic
Contributor

stokic commented Jun 29, 2023 via email

@somegooser
Author

Thanks for the reply.

I am using a simple dataset with 50,000+ company names, and the only customization is a custom tokenizer.

@stokic
Contributor

stokic commented Jun 29, 2023 via email

@somegooser
Author

Hi,

This is my tokenizer

```php
<?php

namespace Search;

use TeamTNT\TNTSearch\Support\AbstractTokenizer;
use TeamTNT\TNTSearch\Support\TokenizerInterface;

class Tokenizer extends AbstractTokenizer implements TokenizerInterface
{
    // Split on any run of characters that is not a letter, digit, or hyphen.
    protected static $pattern = '/[^\p{L}\-\p{N}]+/u';

    public function tokenize($text, $stopwords = [])
    {
        if ($text === null) {
            return [];
        }

        $text = mb_strtolower($text, 'UTF-8');
        // Treat hyphens and underscores as word separators; tildes become hyphens.
        $text = str_replace(['-', '_', '~'], [' ', ' ', '-'], $text);
        $text = strip_tags($text);
        $split = preg_split($this->getPattern(), $text, -1, PREG_SPLIT_NO_EMPTY);

        return array_diff($split, $stopwords);
    }
}
```
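For illustration, here is roughly what this tokenizer produces; a minimal sketch, assuming the class above is autoloadable (the sample input is made up):

```php
$tokenizer = new \Search\Tokenizer();

// Hyphens and underscores become separators, then the text is split on
// anything that isn't a letter, digit, or hyphen:
var_dump($tokenizer->tokenize('Acme-Corp International B.V.'));
// => ['acme', 'corp', 'international', 'b', 'v']
```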

My query is super simple:

```php
$indexer->query('SELECT id, name FROM companies');
```
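For context, the full indexing setup around that query would look roughly like this (a sketch following the TNTSearch README; the connection details and index name are placeholders):

```php
use TeamTNT\TNTSearch\TNTSearch;

$tnt = new TNTSearch;
$tnt->loadConfig([
    'driver'    => 'mysql',
    'host'      => 'localhost',
    'database'  => 'mydb',       // placeholder credentials
    'username'  => 'user',
    'password'  => 'pass',
    'storage'   => __DIR__ . '/storage/',
    'tokenizer' => \Search\Tokenizer::class, // the custom tokenizer above
]);

$indexer = $tnt->createIndex('companies.index');
$indexer->disableOutput = true; // suppress per-row console output while indexing
$indexer->query('SELECT id, name FROM companies');
$indexer->run();
```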

@ultrono

ultrono commented Jul 7, 2023

I'm using this package with a result set of 1.5 million records. Indexing from scratch takes ~5 minutes.

@somegooser
Author

That's crazy...

Something weird is going on anyway.

Indexing 10,000 rows takes about 20 seconds on my server, but indexing 100,000 rows of similar data (no difference in form or length) takes about 30 minutes.
Even updating the existing index is very slow, not just a complete reindex.

Could it have something to do with the size of the index file?
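One way to narrow this down would be to index growing subsets and time each run, to see whether indexing cost grows superlinearly with row count. A rough diagnostic sketch, reusing the `$tnt` config from above; the subset sizes and index names are arbitrary:

```php
foreach ([10000, 25000, 50000, 100000] as $limit) {
    $indexer = $tnt->createIndex("companies-{$limit}.index");
    $indexer->disableOutput = true;
    $indexer->query("SELECT id, name FROM companies LIMIT {$limit}");

    $start = microtime(true);
    $indexer->run();

    // If the time per run grows much faster than $limit, the slowdown is in
    // the index build itself rather than in fetching the rows.
    printf("%6d rows indexed in %.1f s\n", $limit, microtime(true) - $start);
}
```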
