
How to Read and Parse PDFs with PDF.js and Create PDFs with PDF Lib in Node.js

If you’re building LLM and AI-powered chatbots like I am, you might need to read and parse PDFs or create PDFs in Node.js. Here’s how to do it with PDF.js and PDF Lib.

You probably picked up from the title that we’re going to cover two different npm packages for working with PDF files in Node.js. That’s because pdf-lib, the popular option for generating PDFs, is unfortunately unable to read data from PDF files. It only allows creating new PDFs or modifying existing ones by adding pages, images, forms and such.

So we’re going to split this write-up into those two parts and the respective npm packages that will be used for each task:

  • pdf-lib - for creating and modifying PDFs. This npm package regularly receives more than 500,000 downloads a week and is maintained by a single developer, who last updated it around 3 years ago, in July 2021.
  • pdfjs-dist - this is the popular PDF.js library, which is maintained by the Mozilla team and used by many other projects. It is more widely downloaded, at about 2 million weekly downloads, and is updated more frequently. Its releases also include provenance, a great security feature for library maintainers to adopt.

How to Create PDFs with PDF Lib in Node.js

PDF creation with PDF Lib is very granular. You can create a new PDF document, add pages of a specific width and height, draw text and images, and create or fill in form elements.

A simple PDF creation program in Node.js is as follows:

import { PDFDocument, StandardFonts, rgb } from 'pdf-lib'
import fs from 'node:fs/promises'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

// Embed the Times Roman font
const timesRomanFont = await pdfDoc.embedFont(StandardFonts.TimesRoman)

// Add a blank page to the document
const page = pdfDoc.addPage()

// Get the width and height of the page
const { width, height } = page.getSize()

// Draw a string of text toward the top of the page
const fontSize = 30
page.drawText('Creating PDFs in JavaScript is awesome!', {
  x: 50,
  y: height - 4 * fontSize,
  size: fontSize,
  font: timesRomanFont,
  color: rgb(0, 0.53, 0.71),
})

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

// Write the bytes to a file (the ./uploads directory must already exist)
await fs.writeFile('./uploads/mal.pdf', pdfBytes)

The API is simple and intuitive. You can refer to the library’s GitHub repository for more examples and use cases.
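One detail worth spelling out is the `height - 4 * fontSize` expression in the example above: PDF coordinates start at the bottom-left corner of the page, so a position "near the top" is computed by subtracting from the page height. Here’s a minimal sketch of that arithmetic. The helper name `yFromTop` and the numbers (a page 842 points tall, a 36-point line height) are made up for illustration and are not part of pdf-lib.

```javascript
// Hypothetical helper: convert a "distance from the top of the page"
// into the bottom-left-origin y value that PDF drawing calls expect.
function yFromTop(pageHeight, topMargin, lineHeight, lineIndex) {
  return pageHeight - topMargin - lineHeight * lineIndex;
}

// Illustrative values only: an 842pt-tall page, a 120pt top margin
// (the same as 4 * fontSize with fontSize = 30), and 36pt between lines.
const ys = [0, 1, 2].map((lineIndex) => yFromTop(842, 120, 36, lineIndex));

console.log(ys); // [ 722, 686, 650 ]
```

Each value could then be passed as the `y` option of a separate `page.drawText` call to stack several lines of text downward from the top of the page.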

How to Read and Parse PDFs with PDF.js in Node.js

Extracting text from PDF files is a bit more involved than it seems at first.

The reason is that you can’t just read the entire contents of a PDF as one giant text string, because there’s granularity involved: pages. The PDF.js library is built around the concept of pages, so you have to iterate over each page in the PDF file to extract its text.

Here’s how you extract all the text from all pages in a PDF file:

import fs from "node:fs/promises";
import path from "node:path";
import * as pdfjsLib from "pdfjs-dist/legacy/build/pdf.mjs";

const filePath = "uploads/document.pdf";

async function extractTextFromPDF(pdfPath) {

  // Read the PDF file into a buffer and then
  // parse it with PDF.js
  const pdfData = await fs.readFile(pdfPath);
  const pdfDataArray = new Uint8Array(pdfData);
  const pdfDocument = await pdfjsLib.getDocument({
    data: pdfDataArray,
    standardFontDataUrl: path.join(
      import.meta.dirname,
      "node_modules/pdfjs-dist/standard_fonts/"
    ),
  }).promise;

  let extractedText = "";

  // Iterate through all the pages in the PDF file
  // and extract the text from each page, then append it
  // to an accumulator variable
  for (let pageNum = 1; pageNum <= pdfDocument.numPages; pageNum++) {
    const page = await pdfDocument.getPage(pageNum);
    const textContent = await page.getTextContent();
    const pageText = textContent.items.map((item) => item.str).join(" ");
    extractedText += pageText + "\n";
  }

  return extractedText;
}

async function main() {
  const text = await extractTextFromPDF(filePath);
  console.log(text);
}

main().catch(console.error);
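The function above joins every text item on a page with a single space, which flattens each page into one long line. If you want to keep line breaks, one option is to use the `hasEOL` flag that PDF.js sets on text items. The following is only a sketch: `joinTextItems` is a hypothetical helper, the sample items are made up, and how well it works depends on how a given PDF was produced.

```javascript
// Hypothetical helper: join items shaped like `textContent.items`
// (objects with a `str` string and an optional `hasEOL` boolean)
// into text that keeps the line breaks PDF.js detected.
function joinTextItems(items) {
  const lines = [];
  let current = [];

  for (const item of items) {
    // Defensively skip anything that has no string content
    if (typeof item.str !== "string") continue;

    current.push(item.str);

    // An item flagged with hasEOL closes the current line
    if (item.hasEOL) {
      lines.push(current.join(" "));
      current = [];
    }
  }

  // Flush whatever is left after the last item
  if (current.length > 0) lines.push(current.join(" "));

  // Collapse runs of spaces or tabs caused by whitespace-only items
  return lines.map((line) => line.replace(/[ \t]+/g, " ").trim()).join("\n");
}

// Made-up sample data standing in for a real page's text items
const sampleItems = [
  { str: "Hello", hasEOL: false },
  { str: " ", hasEOL: false },
  { str: "world", hasEOL: true },
  { str: "Second line", hasEOL: false },
];

console.log(joinTextItems(sampleItems));
```

To try it in the extraction loop, you would replace the `map(...).join(" ")` expression with `joinTextItems(textContent.items)`.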

How to fix “warning standardFontDataUrl” error with PDF.js in Node.js

Lastly, you might run into a font-related console warning when using Mozilla’s PDF.js library in Node.js, like I did.

You’ll notice that when we passed the PDF document data to the pdfjsLib.getDocument function, I also specified a standardFontDataUrl option. This is because PDF.js needs to be told where to find the standard font files it may need for the PDF document.

If you omit this option, you’ll get a console warning such as the following when you run the Node.js PDF parsing program:

Warning: fetchStandardFontData: failed to fetch file

To solve that, I specified the location of the fonts that are shipped in the pdfjs-dist package:

standardFontDataUrl: path.join(
  import.meta.dirname,
  "node_modules/pdfjs-dist/standard_fonts/"
),
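Note that the snippet above assumes your script sits in the project root, next to node_modules. If it lives elsewhere, you could build the path from a known root directory instead. Here’s a minimal sketch: `standardFontDir` is a hypothetical helper, and using `process.cwd()` as the project root is an assumption that only holds if you start Node.js from that directory. As far as I can tell, PDF.js appends the font file name directly to this value, so the sketch keeps a trailing separator.

```javascript
import path from "node:path";

// Hypothetical helper: build a value for `standardFontDataUrl`.
// Assumption: pdfjs-dist is installed under <projectRoot>/node_modules.
function standardFontDir(projectRoot) {
  const dir = path.join(
    projectRoot,
    "node_modules",
    "pdfjs-dist",
    "standard_fonts"
  );
  // Keep a trailing separator so a file name can be appended directly
  return dir.endsWith(path.sep) ? dir : dir + path.sep;
}

// Assumption: the process was started from the project root
const fontDir = standardFontDir(process.cwd());
console.log(fontDir);
```

The resulting `fontDir` can then be passed as the `standardFontDataUrl` option to `pdfjsLib.getDocument`.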

That’s it. Happy hacking!

P.S. To riff on my hacking blessings to you, here’s a crazy idea: change the color of the text added to the PDF from black (or whatever you chose) to white on a white background, making it invisible, throw in some classic prompt injection instructions, pass that over to an LLM-powered service, and see what happens :-)