
Unable to run the scraper on a local URL #574

Open
rubai99 opened this issue Mar 8, 2023 · 0 comments
Hey, I am trying to run the scraper locally. In my config, when I give a URL that is live, the scraper indexes my documentation. But when I change it to a local URL, it doesn't scrape the documentation. I converted the local-port URL to an ngrok URL and it still doesn't scrape or index, while the live production URL works fine (though in some places the search hits return the same result more than once).

Locally it indexes only plain HTML content and some script and CSS, but not the content of my documentation that is built by JavaScript.
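To make the symptom concrete, here is a minimal illustration (a hypothetical helper, not part of the scraper or my setup): content that JavaScript injects at runtime never appears in the raw HTML the server returns, so a scraper that only sees the static response misses it entirely.

```python
# Hypothetical static response from a JS-built docs site: the server ships an
# empty mount point plus a script bundle; the actual documentation text is
# injected into #app by JavaScript in the browser.
STATIC_HTML = """
<html><body>
  <div id="app"></div>
  <script src="/bundle.js"></script>
</body></html>
"""


def visible_without_js(raw_html: str, expected_text: str) -> bool:
    """True if the text is already present in the server-rendered HTML."""
    return expected_text in raw_html


# The documentation body is built client-side, so a plain fetch misses it:
print(visible_without_js(STATIC_HTML, "Pre-requisites"))  # False
# Only the static scaffolding is visible:
print(visible_without_js(STATIC_HTML, "bundle.js"))       # True
```

This is exactly why `js_render: true` matters: without a real browser rendering the page, only the scaffolding above gets indexed.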

Here is my config file:
```json
{
  "index_name": "payment-page",
  "js_render": true,
  "js_wait": 10,
  "use_anchors": false,
  "user_agent": "Custom Bot",
  "start_urls": [
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/overview/pre-requisites",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/session",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/order-status-api",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/getting-sdk",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/initiating-sdk",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/processing-sdk",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/handle-payment-response",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/base-sdk-integration/life-cycle-events",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/resources/error-codes",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/resources/transaction-status",
    "https://a65e-103-159-11-202.in.ngrok.io/payment-page/android/resources/sample-payloads"
  ],
  "sitemap_alternate_links": false,
  "selectors": {
    "lvl0": "h1, .screen2 h2, .heading-text",
    "lvl1": "h3, .label",
    "lvl2": ".key-header, .step-card-header-text, .th-row",
    "text": ".screen2 p:not(:empty), .hero-welcome, .screen2 li, .main-screen, .only-steps p:not(:empty), td"
  },
  "strip_chars": " .,;:#",
  "custom_settings": {
    "separatorsToIndex": "_",
    "attributesForFaceting": [
      "language",
      "version",
      "type",
      "docusaurus_tag"
    ],
    "attributesToRetrieve": [
      "hierarchy",
      "content",
      "anchor",
      "url",
      "url_without_anchor",
      "type"
    ],
    "synonyms": [
      [
        "js",
        "javascript"
      ],
      [
        "es6",
        "ECMAScript6",
        "ECMAScript2015"
      ]
    ]
  }
}
```
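As a sanity check on my side (my own sketch, not something the scraper provides): the `CONFIG` environment variable is produced by `jq -r tostring`, which only works if the file is strict JSON, so a quick parse catches problems like trailing commas left over from blank slots in `start_urls`.

```python
import json

# Hypothetical check: strict JSON forbids trailing commas, so a leftover
# blank entry in "start_urls" would make both jq and the scraper choke.
# A shortened stand-in for the real config file is used here.
snippet = '{"index_name": "payment-page", "js_render": true, "js_wait": 10}'

config = json.loads(snippet)  # raises json.JSONDecodeError on invalid JSON
print(sorted(config))  # ['index_name', 'js_render', 'js_wait']
```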

I used this command to run the scraper:

```sh
docker run -it --env-file=/my/clone/scraper/located/path/.env \
  -e "CONFIG=$(cat config.json | jq -r tostring)" \
  d2ebdc22bee2a9f6513e68457d9a3825850f325449a225bc6cde1a1f7339e1e4
```

My changes to `browser_handler.py` (I had to make these changes to render the JS content of my documentation; before them I faced the same issue even for the live URL):
```python
import re
import os
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from ..custom_downloader_middleware import CustomDownloaderMiddleware
from ..js_executor import JsExecutor


class BrowserHandler:
    @staticmethod
    def conf_need_browser(config_original_content, js_render):
        # A browser is needed when the raw config contains named-group URL
        # placeholders such as (?P<lang>...), or when js_render is enabled
        group_regex = re.compile(r'\(\?P<(.+?)>.+?\)')
        results = re.findall(group_regex, config_original_content)

        return len(results) > 0 or js_render

    @staticmethod
    def init(config_original_content, js_render, user_agent):
        driver = None

        if BrowserHandler.conf_need_browser(config_original_content,
                                            js_render):
            chrome_options = Options()
            chrome_options.add_argument('--no-sandbox')
            chrome_options.add_argument('--headless')
            chrome_options.add_argument('user-agent={0}'.format(user_agent))
            chrome_options.add_argument('--disable-dev-shm-usage')

            # CHROMEDRIVER_PATH = os.environ.get('CHROMEDRIVER_PATH',
            #                                    "/usr/bin/chromedriver")
            # if not os.path.isfile(CHROMEDRIVER_PATH):
            #     raise Exception(
            #         "Env CHROMEDRIVER_PATH='{}' is not a path to a file".format(
            #             CHROMEDRIVER_PATH))
            driver = webdriver.Remote(
                command_executor='http://host.docker.internal:4444',
                options=chrome_options)
            CustomDownloaderMiddleware.driver = driver
            JsExecutor.driver = driver
        return driver

    @staticmethod
    def destroy(driver):
        # Quit the browser if one was started
        if driver is not None:
            driver.quit()
            driver = None

        return driver
```
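For context, my understanding of `conf_need_browser` (treat this as an assumption about the upstream code, not documented behavior): it only forces a browser when the raw config text contains a named-group placeholder like `(?P<lang>...)` in a start URL, or when `js_render` is true. A quick check of the regex in isolation:

```python
import re

# The same pattern as in conf_need_browser above: it finds named-group
# placeholders such as (?P<lang>en|fr) inside the raw config text.
group_regex = re.compile(r'\(\?P<(.+?)>.+?\)')

with_placeholder = '"start_urls": ["https://example.com/(?P<lang>en|fr)/docs"]'
without_placeholder = '"start_urls": ["https://example.com/docs"]'

print(group_regex.findall(with_placeholder))     # ['lang']
print(group_regex.findall(without_placeholder))  # []
```

Since my config has no such placeholders, the browser is started purely because of `"js_render": true`.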

Output after running the scraper:

(screenshot of the scraper output, taken 2023-03-09)
