PDF parsing toolkit for preparing text corpus

Introduction

This repo contains a PDF parsing toolkit for preparing text corpora by converting PDF to Markdown. It builds on PDF Parser ToolKits, which gathers the most commonly used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. On that basis we developed sciparser 1.0 for text corpus pre-processing. In recent works such as K2 and GeoGalactica, we used this tool with an upgraded Grobid backend to pre-process the text corpus. An online demo is also publicly available.

Try DEMO

In this repo and demo, we share only the secondary processing solution built on Grobid. In the near future, we will share a solution that combines multiple PDF parsing backends.

Requirements

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt

Usage

python

First, clone the whole repo.

git clone https://github.com/davendw49/sciparser.git

Then import the pipeline function from pipeline.py to do the parsing.

from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')
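To process more than one document, the call above can be wrapped in a small batch helper. The following is a minimal sketch, not part of sciparser: `parse_directory` is a hypothetical name, and it assumes the parsing function accepts the path of a single PDF and returns something that can be written as JSON (values that are not JSON-native are stringified). Neither assumption is confirmed by this README, so adjust it to what `pipeline` actually accepts and returns.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict


def parse_directory(pdf_dir: str,
                    parse_fn: Callable[[str], Any],
                    out_dir: str) -> Dict[str, str]:
    """Run parse_fn on every *.pdf in pdf_dir and save one JSON file per PDF.

    parse_fn is passed in rather than imported here, so the helper does not
    depend on how sciparser's pipeline is implemented (hypothetical helper).
    Returns a mapping from PDF file name to the path of the written JSON file.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: Dict[str, str] = {}

    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            result = parse_fn(str(pdf_path))
        except Exception as exc:
            # One unparseable PDF should not abort the whole batch.
            print(f"skipped {pdf_path.name}: {exc}")
            continue

        target = out / (pdf_path.stem + ".json")
        # default=str stringifies anything json cannot serialize natively.
        target.write_text(
            json.dumps(result, ensure_ascii=False, default=str),
            encoding="utf-8",
        )
        written[pdf_path.name] = str(target)

    return written
```

With the repo cloned, this would be called as `parse_directory('/path/to/your/pdf/', pipeline, 'parsed')` after `from pipeline import pipeline`. Passing the parser as an argument also makes it easy to swap in a different backend later.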

gradio

python main.py

Citation

@misc{sciparser,
  author = {Cheng Deng},
  title = {Sciparser: PDF parsing toolkit for preparing text corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davendw49/sciparser}},
}

Reference

PDF Parser ToolKits: https://github.com/Acemap/pdf_parser
TEI-XML Parser (grobid_tei_xml): https://gitlab.com/internetarchive/grobid_tei_xml
