davendw49/sciparser


PDF parsing toolkit for preparing text corpus

Introduction

This repo contains a PDF parsing toolkit that converts PDFs to Markdown for text corpus preparation. sciparser 1.0 builds on PDF Parser ToolKits, which gathers the most commonly used PDF OCR tools for academic papers, and is inspired by grobid_tei_xml, an open-source PyPI package. In recent works such as K2 and GeoGalactica, we used this tool, together with an upgraded Grobid backend, to pre-process the text corpus. An online demo is also publicly available.

This repo and the demo share only the secondary processing built on top of Grobid. We plan to share the multi-backend combination solution for PDF parsing in the near future.
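To give a rough idea of what secondary processing on Grobid output can look like, here is a minimal sketch that turns a TEI XML string (the format Grobid emits) into Markdown using only the Python standard library. It is not sciparser's actual implementation: the function name `tei_to_markdown` and the sample document are hypothetical, and real Grobid output contains many more element types than the title, section heads, and paragraphs handled here.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents, the XML format Grobid emits.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_markdown(tei_xml: str) -> str:
    """Convert a TEI XML string into simple Markdown (hypothetical helper)."""
    root = ET.fromstring(tei_xml)
    lines = []

    # The paper title lives in the TEI header.
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    if title is not None and title.text:
        lines.append(f"# {title.text.strip()}")

    # Each <div> in the body is a section with an optional <head> and <p> children.
    for div in root.findall(".//tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        if head is not None and head.text:
            lines.append(f"## {head.text.strip()}")
        for para in div.findall("tei:p", TEI_NS):
            # itertext() flattens inline elements such as <ref> citations;
            # split/join collapses runs of whitespace.
            text = " ".join("".join(para.itertext()).split())
            if text:
                lines.append(text)

    return "\n\n".join(lines)


# Made-up sample document, for illustration only.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>A Sample Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>Prior work <ref type="bibr">[1]</ref> studied this.</p></div>
  </body></text>
</TEI>"""

print(tei_to_markdown(SAMPLE_TEI))
```

The sketch keeps inline citation text in place and drops all other markup. A production converter would also need to handle figures, formulas, tables, and the bibliography.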

Requirements

git clone https://github.com/Acemap/pdf_parser.git
cd pdf_parser
pip install -r requirements.txt
python setup.py install

git clone https://github.com/davendw49/sciparser.git
cd sciparser
pip install -r requirements.txt

Usage

  • python

First, clone the whole repo.

git clone https://github.com/davendw49/sciparser.git

Then import the pipeline module to run the parsing.

from pipeline import pipeline
data = pipeline('/path/to/your/pdf/')

  • gradio

python main.py
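For a corpus with many files, the Python usage above can be wrapped in a small batch helper. The sketch below is an assumption-laden illustration, not part of sciparser: `parse_folder` is a hypothetical name, it assumes the parsing function accepts the path of a single PDF file, and it assumes the returned data can be written as JSON (anything `json` cannot serialize falls back to `str`). The parsing function is passed in as an argument so the helper does not depend on how `pipeline` is actually implemented.

```python
import json
from pathlib import Path
from typing import Any, Callable, List


def parse_folder(pdf_dir: str, out_dir: str, parse_fn: Callable[[str], Any]) -> List[Path]:
    """Run parse_fn on every PDF in pdf_dir and write one JSON file per PDF."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            result = parse_fn(str(pdf))
        except Exception as exc:
            # Keep going if a single PDF fails to parse.
            print(f"skipped {pdf.name}: {exc}")
            continue
        target = out_path / f"{pdf.stem}.json"
        # default=str is a fallback for values json cannot serialize natively.
        target.write_text(
            json.dumps(result, ensure_ascii=False, default=str),
            encoding="utf-8",
        )
        written.append(target)
    return written


# Hypothetical usage from the repo root, assuming pipeline() takes one PDF file path:
# from pipeline import pipeline
# parse_folder("pdfs/", "parsed/", pipeline)
```

Catching the exception per file means one malformed PDF does not abort a long run. The list of written paths makes it easy to check which inputs were skipped.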

Citation

@misc{sciparser,
  author = {Cheng Deng},
  title = {Sciparser: PDF parsing toolkit for preparing text corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/davendw49/sciparser}},
}
