Demonstrating PDF text and image extraction with correct bounds

TomasHubelbauer/pdf-scrape

  1. Print demo.html to demo.pdf or use your own document
  2. Go to https://mozilla.github.io/pdf.js/getting_started
  3. Download the Stable build
  4. Extract pdf.js and pdf.worker.js and their corresponding *.map files into this directory
  5. Create index.html and reference pdf.js:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
  </head>
  <body>

  </body>
</html>
  6. Create index.js and reference it from index.html:

index.js

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>

  </body>
</html>
  7. Update index.js with code to load the document and get its first page:

index.js

void async function () {
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
}()
  8. Add a canvas element to index.html where the page will be rendered:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>PDF Scrape</title>
    <script src="pdf.js"></script>
    <script src="index.js"></script>
  </head>
  <body>
    <canvas id="pageCanvas"></canvas>
  </body>
</html>
  9. Extend the code to render the page to the canvas context:

index.js

window.addEventListener('load', async () => {
  const document = await pdfjsLib.getDocument('demo.pdf').promise;
  const page = await document.getPage(1);
  const viewport = page.getViewport({ scale: 1 });
  const canvas = window.document.getElementById('pageCanvas');
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  const context = canvas.getContext('2d');
  // render() returns a task; await its promise so rendering has finished
  await page.render({ canvasContext: context, viewport }).promise;
});
  10. Hook up code to extract text and highlight text and images (see this repo)
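
The last step can be sketched roughly as follows. This is a minimal sketch, not the code in this repo: the helper names (`multiply`, `applyToPoint`, `boxAround`, `textItemBounds`, `imageBounds`) are made up for illustration. It assumes unrotated text, treats the baseline as the bottom of each text box, and only tracks `save`, `restore`, `transform` and `paintImageXObject` operators, so images inside form XObjects or painted by other image operators would be missed.

```javascript
// Multiply two affine matrices in PDF order [a, b, c, d, e, f].
// The result applies m2 first, then m1.
function multiply(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map the point (x, y) through an affine matrix.
function applyToPoint(m, x, y) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Axis-aligned box around a set of [x, y] points.
function boxAround(points) {
  const xs = points.map(p => p[0]);
  const ys = points.map(p => p[1]);
  const left = Math.min(...xs);
  const top = Math.min(...ys);
  return { left, top, width: Math.max(...xs) - left, height: Math.max(...ys) - top };
}

// Canvas-space box of one text item (assumption: unrotated text).
// item.transform[4] and [5] hold the baseline origin in PDF space.
function textItemBounds(item, viewportTransform) {
  const x = item.transform[4];
  const y = item.transform[5];
  return boxAround([
    applyToPoint(viewportTransform, x, y),
    applyToPoint(viewportTransform, x + item.width, y + item.height),
  ]);
}

// Canvas-space boxes of images, found by replaying the transform stack.
// An image is painted into the unit square of the current transform.
function imageBounds(operatorList, ops, viewportTransform) {
  const stack = [];
  const boxes = [];
  let ctm = viewportTransform;
  operatorList.fnArray.forEach((fn, i) => {
    const args = operatorList.argsArray[i];
    if (fn === ops.save) {
      stack.push(ctm);
    } else if (fn === ops.restore) {
      ctm = stack.pop() || ctm;
    } else if (fn === ops.transform) {
      ctm = multiply(ctm, args);
    } else if (fn === ops.paintImageXObject) {
      const corners = [[0, 0], [1, 0], [0, 1], [1, 1]];
      boxes.push(boxAround(corners.map(([x, y]) => applyToPoint(ctm, x, y))));
    }
  });
  return boxes;
}
```

In index.js, after the page has rendered, the text items would come from `(await page.getTextContent()).items` and each one would be passed to `textItemBounds(item, viewport.transform)`. The image boxes would come from `imageBounds(await page.getOperatorList(), pdfjsLib.OPS, viewport.transform)`. Each returned box can then be outlined with `context.strokeRect(left, top, width, height)`.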
