
PDF Extract Data

Extract structured fields from invoice / receipt PDFs — vendor, invoice number, date, total, subtotal, tax, and line items. Pure heuristic: regex over the extracted text, no LLM, runs in your browser. Output is JSON.

About PDF Extract Data

Drop in an invoice or receipt PDF and get back structured JSON: vendor, invoice number, date, total, subtotal, tax, and a list of line items with their amounts. The approach is purely heuristic: pdf.js extracts the text in your browser, then a pass over labelled money amounts, dates, and line items identifies the fields. No model download, no upload, no cost per call. The PDF never leaves your device.

Category: pdf
Input: application/pdf
Output: application/json
Cost: Free, runs in your browser
Memory: medium
Privacy: PDF Extract Data runs entirely on your device. Files you provide never leave your browser — no uploads, no server, no tracking. The page works offline once loaded.

Common uses

  • Pull totals and dates out of a folder of receipts for expense reporting.
  • Bootstrap a small bookkeeping pipeline without paying per-invoice to a hosted API.
  • Verify what a vendor charged you against what your accounting software shows.
  • Quickly diff two invoices from the same vendor by extracting both and comparing JSON.
  • Feed structured invoice fields into a Wyreup chain (e.g., extract → csv-template → render report).
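
The "diff two invoices" use can be sketched in a few lines once both extractions are saved as JSON. This is only an illustration: the key names (`vendor`, `date`, `total`) are assumed for the example and may not match the tool's actual output keys, and `diff_fields` is a hypothetical helper.

```python
import json

def diff_fields(a: dict, b: dict) -> dict:
    """Return {key: (a_value, b_value)} for each top-level key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Two extraction results, with assumed (hypothetical) key names.
january = json.loads('{"vendor": "Acme", "date": "2024-01-31", "total": 120.0}')
february = json.loads('{"vendor": "Acme", "date": "2024-02-29", "total": 135.0}')

changes = diff_fields(january, february)
for field, (before, after) in changes.items():
    print(f"{field}: {before} -> {after}")
```

Only top-level fields are compared here; comparing line items row by row would need a matching rule (for example, by description) that this sketch leaves out.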

Frequently asked questions

Will it work on any invoice?

It works best on text-based PDFs (the kind your accounting software exports). Scanned image-only PDFs need OCR first — chain through `ocr-pro` or `pdf-vision-ocr`. The heuristics are tuned for English-language invoices using `$`, `£`, `€`, or similar currency symbols.

How accurate is the total detection?

Detection runs in two passes: the first looks for labelled amounts ("Total:", "Amount Due", "Grand Total"), and the second falls back to the largest currency value in the document. On well-structured invoices accuracy is high; on unusual layouts (e.g., totals embedded in narrative text) results may be off. The `confidence` field and `warnings` list flag low-confidence extractions.
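
The two-pass idea can be sketched as follows. This is a minimal Python illustration, not the tool's browser code: the regex, the `detect_total` name, the string values used for `confidence`, and the warning messages are all assumptions made for the example.

```python
import re

# Labels quoted in the answer above, lowercased for matching.
TOTAL_LABELS = ("grand total", "amount due", "total")
# Assumed amount format: "$" followed by digits, optional commas, optional decimals.
AMOUNT = r"\$\s*(\d[\d,]*(?:\.\d{1,2})?)"

def to_number(raw: str) -> float:
    return float(raw.replace(",", ""))

def detect_total(text: str) -> dict:
    # Pass 1: a line carrying a total-like label and a currency amount.
    for line in text.splitlines():
        lowered = line.lower()
        if "subtotal" in lowered or "sub-total" in lowered:
            continue  # "subtotal" contains "total" but is a different field
        if any(label in lowered for label in TOTAL_LABELS):
            match = re.search(AMOUNT, line)
            if match:
                return {"value": to_number(match.group(1)),
                        "confidence": "high", "warnings": []}
    # Pass 2: fall back to the largest currency value anywhere in the text.
    amounts = [to_number(raw) for raw in re.findall(AMOUNT, text)]
    if amounts:
        return {"value": max(amounts), "confidence": "low",
                "warnings": ["no labelled total found; used largest amount"]}
    return {"value": None, "confidence": "none",
            "warnings": ["no currency amounts found"]}
```

The fallback explains the failure mode described above: when a total sits in narrative text without a recognised label, the largest amount wins, which may be a deposit or a prior balance rather than the amount owed.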

Does it support non-USD currencies?

Yes — set the currency symbol parameter to `£`, `€`, `¥`, or any short prefix. The detection logic works the same way for any single-symbol currency.
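
As a rough sketch of why one detection routine can serve any prefix symbol, the amount pattern can be built from the symbol at run time. The function names and the `symbol` argument below are hypothetical, and the pattern assumes `1,234.56`-style formatting (comma for thousands, point for decimals).

```python
import re

def amount_pattern(symbol: str = "$") -> re.Pattern:
    # re.escape keeps symbols such as "$" from being read as regex syntax.
    return re.compile(re.escape(symbol) + r"\s*(\d[\d,]*(?:\.\d{1,2})?)")

def find_amounts(text: str, symbol: str = "$") -> list:
    """Return every amount in `text` that is prefixed by `symbol`, as floats."""
    return [float(raw.replace(",", ""))
            for raw in amount_pattern(symbol).findall(text)]

print(find_amounts("Total: £1,200.50", symbol="£"))
print(find_amounts("Fee CHF 80.00 plus CHF 5", symbol="CHF"))
```

Amounts written with a different symbol than the one configured are simply not matched, so a mixed-currency document would need one pass per symbol under this approach.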

Is anything sent to a server?

No. PDF parsing (pdf.js), heuristics, and the entire extraction run in your browser. You can disconnect your network mid-extraction and it still finishes.

Can it extract line items?

Yes — any row that ends in a currency value and isn't the total/subtotal/tax becomes a line item. The description is everything before the amount; the amount is parsed into a structured `{ value, raw }` object.
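
The rule above can be sketched like this. It is an illustrative Python approximation, not the tool's implementation: the regexes, the `parse_line_items` name, and the `description` / `amount` keys are assumptions, while the `{ value, raw }` shape follows the description above.

```python
import re

# A row qualifies if it ends in a currency amount (assumed "$" prefix).
TRAILING_AMOUNT = re.compile(
    r"^(?P<desc>.*?)\s*(?P<raw>\$\s*\d[\d,]*(?:\.\d{1,2})?)\s*$"
)
# Rows whose description starts with a summary label are not line items.
SUMMARY_ROW = re.compile(
    r"^(sub-?total|grand total|total|tax|amount due)\b", re.IGNORECASE
)

def parse_line_items(text: str) -> list:
    items = []
    for line in text.splitlines():
        match = TRAILING_AMOUNT.match(line)
        if not match:
            continue
        description = match.group("desc").strip()
        if not description or SUMMARY_ROW.match(description):
            continue
        raw = match.group("raw")
        value = float(raw.lstrip("$").strip().replace(",", ""))
        items.append({"description": description,
                      "amount": {"value": value, "raw": raw}})
    return items

invoice_text = """Acme Corp
Widget A   $40.00
Shipping $5.50
Subtotal $45.50
Tax $4.55
Total: $50.05"""
items = parse_line_items(invoice_text)
```

Keeping `raw` next to `value` lets a reviewer see exactly what text produced each number, which helps when a heuristic misreads a row.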

Keywords

  • pdf
  • invoice
  • receipt
  • extract
  • data
  • structured
  • fields
  • total
  • parse
