Merge pull request #1 from FelixMertin/poetry_linters
[add] poetry as dependency management and linters for code quality
Suleman-Elahi committed Oct 7, 2022
2 parents c1c2b42 + d8c2f7a commit 0e9ae06
Showing 7 changed files with 742 additions and 39 deletions.
163 changes: 163 additions & 0 deletions .gitignore
@@ -0,0 +1,163 @@
# Ignore all generated CSV files
*.csv

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
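
As a rough illustration of how a few of the glob patterns above behave, the sketch below uses Python's `fnmatch` module. This is only an approximation: Git applies its own matching rules (directory-only patterns, negation, anchoring), so `fnmatch` is assumed here purely for demonstration, and the file names are made up.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitignore above.
patterns = ["*.csv", "*.py[cod]", "*$py.class"]

# Hypothetical file names, chosen only for illustration.
candidates = ["report.csv", "module.pyc", "module.py", "Foo$py.class"]

# Keep every name that matches at least one pattern.
ignored = [
    name for name in candidates
    if any(fnmatchcase(name, pattern) for pattern in patterns)
]
print(ignored)
```

Note that `module.py` is not matched: `*.py[cod]` requires one more character (`c`, `o`, or `d`) after `.py`.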
26 changes: 26 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,26 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.3.0
    hooks:
      - id: trailing-whitespace
      - id: check-merge-conflict
      - id: check-yaml
        args: [--unsafe]
      - id: check-json
      - id: detect-private-key
      - id: end-of-file-fixer

  - repo: https://github.com/timothycrosley/isort
    rev: 5.10.1
    hooks:
      - id: isort

  - repo: https://github.com/psf/black
    rev: 22.8.0
    hooks:
      - id: black

  - repo: https://gitlab.com/pycqa/flake8
    rev: 3.9.2
    hooks:
      - id: flake8
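
To give a feel for what the `trailing-whitespace` and `end-of-file-fixer` hooks configured above are meant to achieve, here is a minimal sketch in Python. It is not the hooks' actual implementation; `fix_text` is a hypothetical helper, and the real hooks work on files and handle more cases.

```python
def fix_text(text):
    """Toy model: strip trailing whitespace and end with exactly one newline."""
    # Remove trailing spaces and tabs from every line.
    lines = [line.rstrip() for line in text.splitlines()]
    # Drop blank lines at the end of the file.
    while lines and lines[-1] == "":
        lines.pop()
    # An empty file stays empty; otherwise end with a single newline.
    return "\n".join(lines) + "\n" if lines else ""


cleaned = fix_text("import csv  \nimport sys\t\n\n\n")
print(repr(cleaned))
```

In practice the hooks are run by pre-commit itself, so nothing like this needs to live in the repository.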
17 changes: 14 additions & 3 deletions README.md
@@ -5,9 +5,10 @@ At the end, it saves a CSV file in the current working directory. By default, al

## Running:
1. Install Python
2. Run `pip install requests bs4`
3. Run as `python WpBrokenCheck.py [Domain] [CSV_FileName]`
4. Example: `python WpBrokenCheck.py example.com example.csv`
2. Install [Poetry](https://python-poetry.org/docs/#installation)
3. Run `poetry install`
4. Run as `poetry run python WpBrokenCheck.py [Domain] [CSV_FileName]`
5. Example: `poetry run python WpBrokenCheck.py example.com example.csv`
<p align="center">
<img src="https://res.cloudinary.com/suleman/image/upload/v1665055858/WpBrokenCheck.png">
</p>
@@ -16,6 +17,16 @@ At the end, it saves a CSV file in the current working directory. By default, al

**Tip**: If the target website has a large number of posts, change `max_workers` from 5 to 10 at line 60.
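
As a hedged sketch of what the tip refers to, the thread pool size can be passed as a parameter rather than edited in place. `check_links` and `fake_check` are hypothetical names introduced here; `fake_check` stands in for the real network call so the example runs offline.

```python
import concurrent.futures
from concurrent.futures import as_completed


def check_links(links, check, max_workers=5):
    """Run `check` on every link using a pool of `max_workers` threads."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(check, link) for link in links]
        # Results arrive in completion order, not submission order.
        return [future.result() for future in as_completed(futures)]


def fake_check(link):
    # Hypothetical stand-in for the HTTP request; always reports "200".
    return link, "200"


results = check_links(
    ["https://example.com/a", "https://example.com/b"],
    fake_check,
    max_workers=10,
)
print(sorted(results))
```

A larger pool mainly helps when most of the time is spent waiting on slow servers; it also sends more simultaneous requests to the sites being checked.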

## Linters:

The project uses the following Python linters:
- black for code formatting
- flake8 for code style and line breaks (PEP 8)
- isort for reordering imports

They run via pre-commit whenever you commit code to the repository. You can also run them manually on all files with:
`pre-commit run --all-files`
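
To illustrate the kind of reordering isort performs, the toy model below reproduces the import order seen in this commit's `WpBrokenCheck.py` diff. It is a simplified sketch, not isort's algorithm: `sort_imports` is a hypothetical helper and the `STDLIB` set is hand-written for this example.

```python
# Hand-written for this example; real tools detect the standard library themselves.
STDLIB = {"concurrent", "csv", "sys"}


def sort_imports(lines):
    """Toy model of import sorting: stdlib section first, then third-party."""

    def module(line):
        # The second token is the module path for both import styles.
        return line.split()[1]

    def key(line):
        # Plain "import x" lines sort before "from x import y" lines.
        return (line.startswith("from "), module(line).lower())

    def is_stdlib(line):
        return module(line).split(".")[0] in STDLIB

    stdlib = sorted((line for line in lines if is_stdlib(line)), key=key)
    third_party = sorted((line for line in lines if not is_stdlib(line)), key=key)
    # A blank line separates the two sections.
    return stdlib + [""] + third_party


before = [
    "import requests",
    "import csv",
    "import concurrent.futures",
    "from concurrent.futures import as_completed",
    "import sys",
    "import bs4",
]
after = sort_imports(before)
print("\n".join(after))
```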

### To Do:
- Make it filter specific codes.
- Make it run on customized WP sites
94 changes: 58 additions & 36 deletions WpBrokenCheck.py
@@ -1,73 +1,95 @@
import requests
import csv
import concurrent.futures
from concurrent.futures import as_completed
import csv
import sys
from concurrent.futures import as_completed

import bs4
import requests

domain = sys.argv[1]
csv_file = sys.argv[2]
sess = requests.Session()
links404 = []

headers = {
    'authority': 'www.'+domain,
    'referer': 'https://'+domain,
    'user-agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Mobile Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-dest': 'document',
    'accept-language': 'en-US,en;q=0.9,tr;q=0.8',
    "authority": "www." + domain,
    "referer": "https://" + domain,
    "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Mobile Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "sec-fetch-dest": "document",
    "accept-language": "en-US,en;q=0.9,tr;q=0.8",
}
pages = int(sess.get('https://' + domain + '/wp-json/wp/v2/posts', headers=headers).headers['X-WP-TotalPages'])
pages = int(
    sess.get("https://" + domain + "/wp-json/wp/v2/posts", headers=headers).headers[
        "X-WP-TotalPages"
    ]
)


def prepare_csv_data(id, post_link, data):
    for i in data:
        links404.append({
    for i in data:
        links404.append(
            {
                "Post ID": id,
                "Post Link": post_link,
                "Broken Link": i[0][0],
                "Status Code":i[1],
                "Link Text":i[0][1]
        })

                "Status Code": i[1],
                "Link Text": i[0][1],
            }
        )


def generate_csv_report(csv_file, csv_data):
    with open(csv_file, 'w+',encoding="utf-8") as file:
        csvwriter = csv.DictWriter(file, fieldnames=list(csv_data[0].keys()))
        csvwriter.writeheader()
        csvwriter.writerows(csv_data)


    if csv_data:
        with open(csv_file, "w+", encoding="utf-8") as file:
            csvwriter = csv.DictWriter(file, fieldnames=list(csv_data[0].keys()))
            csvwriter.writeheader()
            csvwriter.writerows(csv_data)

        print("Report saved in file: ", csv_file)

    if not csv_data:
        print("There were no broken links!")


def getLinks(rendered_content):
    soup = bs4.BeautifulSoup(rendered_content, 'html.parser')
    return [(link['href'],link.text) for link in soup('a') if 'href' in link.attrs]

    soup = bs4.BeautifulSoup(rendered_content, "html.parser")
    return [(link["href"], link.text) for link in soup("a") if "href" in link.attrs]


def getStatusCode(link, headers, timeout=5):
    print(" checking: ", link[0])
    try:
        r = sess.head(link[0], headers=headers, timeout=timeout)
    except (requests.exceptions.SSLError,
            requests.exceptions.HTTPError,
            requests.exceptions.ConnectionError,
            requests.exceptions.MissingSchema,
            requests.exceptions.Timeout,
            requests.exceptions.InvalidSchema
            ) as errh:
    except (
        requests.exceptions.SSLError,
        requests.exceptions.HTTPError,
        requests.exceptions.ConnectionError,
        requests.exceptions.MissingSchema,
        requests.exceptions.Timeout,
        requests.exceptions.InvalidSchema,
    ) as errh:
        print("Error in URL, ", link)
        return link, errh.__class__.__name__
    else:
        return link, str(r.status_code)



def executeBrokenLinkCheck(links):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(getStatusCode, link, headers) for link in links]
        return [future.result() for future in as_completed(futures)]


for i in range(pages):
    post_data = sess.get('https://' + domain + '/wp-json/wp/v2/posts?page='+str(i+1), headers=headers).json()
    post_data = sess.get(
        "https://" + domain + "/wp-json/wp/v2/posts?page=" + str(i + 1), headers=headers
    ).json()
    for data in post_data:
        print("Checking post: ",data["link"])
        print("Checking post: ", data["link"])
        post_links = getLinks(data["content"]["rendered"])
        checked_urls = executeBrokenLinkCheck(post_links)
        prepare_csv_data(data["id"], data["link"], checked_urls)

generate_csv_report(csv_file, links404)
print("Report saved in file: ", csv_file)
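
The main behavioral change in this diff is the guard around report writing: the CSV file is only written when there is data. A minimal standalone sketch of that idea follows. `write_report` is a hypothetical name, the sample row is made up, and `newline=""` is an addition not present in the commit (the `csv` module documentation recommends it when opening files for a writer).

```python
import csv
import os
import tempfile


def write_report(path, rows):
    """Write `rows` (a list of dicts) to `path`; return False if there is nothing to write."""
    if not rows:
        return False
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return True


# Made-up row using the same column names as the script.
rows = [
    {
        "Post ID": 1,
        "Post Link": "https://example.com/post",
        "Broken Link": "https://example.com/missing",
        "Status Code": "404",
        "Link Text": "missing page",
    }
]
path = os.path.join(tempfile.mkdtemp(), "report.csv")
wrote = write_report(path, rows)

# Read the file back to confirm the header and row were written.
with open(path, newline="", encoding="utf-8") as file:
    read_back = list(csv.DictReader(file))
```

Returning a boolean instead of printing keeps the function easy to test; the caller can decide which message to show.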
