
Major optional performance boost suggestion when operating on url input #129

Open
guylando opened this issue Aug 28, 2022 · 1 comment

@guylando

This is relevant when operating on URL input.

socid-extractor sends a request to the URL and only then tries to parse the response against its list of supported websites.

On the one hand, this allows handling generic platforms such as vBulletin, which can appear under different domains and URLs. On the other hand, most if not all of the other supported websites have a specific domain/URL, so there could be a check for whether the website is supported before sending the request, avoiding an unnecessary request for an unsupported website.

So by sacrificing vBulletin support and adding a pre-request URL support check, you get a major performance improvement.

To keep this optional for those who do not want to sacrifice vBulletin support, the behavior can be gated behind a new flag.

For a list of around 180 URLs, of which 25 are supported, this can lower execution time from around 400 seconds to around 200 seconds.

However, for the URL support check to work, each key in the dictionary of supported websites needs to contain some word that appears in the URL, so the dictionary names also need to be fixed (or a domain property added for those websites which have a specific domain).

So the following needs to be added (as a temporary solution that avoids adding a domain property to every supported website with a specific domain):

  1. in cli.py (this assumes the schemes dictionary from schemes.py is available there):
    def check_url_relevance(url):
        """Return True if any word from a supported scheme name appears in the URL."""
        lowercase_url = url.lower()
        for scheme_name in schemes:
            for name_part in scheme_name.lower().split():
                # Skip one-letter and generic words that would match almost any URL.
                if len(name_part) > 1 and name_part not in ['api', 'user', 'profile', 'group', 'page', 'file', 'html'] and name_part in lowercase_url:
                    return True
        return False

  2. in cli.py's run method, after "print(f'Analyzing URL {url}...')", put everything inside the following conditional check (see the sketch after this list):
    if check_url_relevance(args.url):

  3. in schemes.py change dictionary keys:
    'Linktree' -> 'Linktree linktr.ee'
    'Odnoklassniki' -> 'Odnoklassniki ok.ru'
    'Habrahabr HTML (old)' -> 'Habrahabr HTML (old) habra'
    'Habrahabr JSON' -> 'Habrahabr JSON habra'
    'Telegram' -> 'Telegram t.me'

  4. an optional parameter which triggers this behavior and which can be added to the "if check_url_relevance(args.url):" condition (also shown in the sketch below)
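
To make steps 2 and 4 concrete, here is a minimal sketch of how the check and the flag could fit together in cli.py's run method. The flag name --check-url-relevance, the argument parsing shown, and the parse/extract calls are assumptions for illustration; the real cli.py may be structured differently:

    import argparse

    from socid_extractor import extract, parse  # cli.py's actual imports may differ


    def run():
        parser = argparse.ArgumentParser()
        parser.add_argument('--url', help='URL to analyze')
        # Step 4: hypothetical opt-in flag so vBulletin users keep the old behavior.
        parser.add_argument('--check-url-relevance', action='store_true',
                            help='skip the request if the URL matches no supported website')
        args = parser.parse_args()

        print(f'Analyzing URL {args.url}...')
        # Step 2: send the request only if the check is off or the URL looks supported.
        if not args.check_url_relevance or check_url_relevance(args.url):
            page, _ = parse(args.url)
            print(extract(page))

With this in place, a run over a large URL list with --check-url-relevance would skip the request for every URL that matches no supported scheme, which is where the roughly 2x speedup in the numbers above comes from.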

@soxoj
Owner

soxoj commented Sep 11, 2022

@guylando thank you for the good idea, can you tell a little bit more about your use case for socid-extractor? I was sure that in the case of checking a massive URL list, somebody would already have the HTTP responses anyway (as long as that list does not contain random links).

Can you also make a draft PR with the supposed changes?
