PDF Extract Data
Extract structured fields from invoice / receipt PDFs — vendor, invoice number, date, total, subtotal, tax, and line items. Pure heuristic: regex over the extracted text, no LLM, runs in your browser. Output is JSON.
About PDF Extract Data
Drop an invoice or receipt PDF and get back structured JSON — vendor, invoice number, date, total, subtotal, tax, and a list of line items with their amounts. Pure heuristic: pdf.js extracts the text in your browser, then a labelled-money + date + line-item pass identifies the fields. No model download, no upload, no cost per call. The PDF never leaves your device.
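pdf.js reports text as positioned fragments rather than finished rows, so a row-based pass needs the rows rebuilt first. The sketch below shows one way to do that. It is an assumption about the approach, not the tool's actual code: `PositionedFragment`, `groupIntoRows`, and the 2-unit tolerance are illustrative names and values.

```typescript
// One text fragment with its page position. With pdf.js text items these
// would come from item.str, item.transform[4] (x) and item.transform[5] (y).
interface PositionedFragment {
  str: string;
  x: number;
  y: number;
}

interface FragmentRow {
  y: number;
  items: PositionedFragment[];
}

// Group fragments whose baselines lie within `tolerance` units of the first
// fragment in the row. Rows come out top to bottom (PDF y grows upward),
// and each row is read left to right.
function groupIntoRows(
  fragments: PositionedFragment[],
  tolerance: number = 2
): string[] {
  const sorted = fragments
    .filter((f) => f.str.trim().length > 0) // filter() copies, so the input is not mutated
    .sort((a, b) => b.y - a.y);

  const rows: FragmentRow[] = [];
  let current: FragmentRow | undefined;
  for (const frag of sorted) {
    if (current !== undefined && Math.abs(current.y - frag.y) <= tolerance) {
      current.items.push(frag);
    } else {
      current = { y: frag.y, items: [frag] };
      rows.push(current);
    }
  }

  return rows.map((row) =>
    row.items
      .sort((a, b) => a.x - b.x)
      .map((f) => f.str.trim())
      .join(" ")
  );
}
```

The resulting strings are what a labelled-money or line-item pass would then scan, one row at a time.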
- Category
- Input: application/pdf
- Output: application/json
- Cost: free, runs in your browser
- Memory: medium
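For orientation, the JSON output could be typed roughly as follows. This is a guess at the shape, not a published schema: only `confidence`, `warnings`, `value`, and `raw` are names taken from this page; every other property name, the nullability, and the numeric `confidence` are assumptions, and the sample values are made up.

```typescript
// Hypothetical shape of the extraction result (see the caveats above).
interface AmountField {
  value: number; // parsed numeric amount
  raw: string; // the amount text as it appeared in the PDF
}

interface LineItemField {
  description: string;
  amount: AmountField;
}

interface ExtractionResult {
  vendor: string | null;
  invoiceNumber: string | null;
  date: string | null;
  total: AmountField | null;
  subtotal: AmountField | null;
  tax: AmountField | null;
  lineItems: LineItemField[];
  confidence: number; // assumed here to be a 0..1 score
  warnings: string[];
}

// Sanity check on a result: do the line items add up to the subtotal?
function lineItemsMatchSubtotal(result: ExtractionResult): boolean {
  if (result.subtotal === null) return false;
  let sum = 0;
  for (const item of result.lineItems) sum += item.amount.value;
  return Math.abs(sum - result.subtotal.value) < 0.005;
}

// Made-up example data, for illustration only.
const sampleResult: ExtractionResult = {
  vendor: "Acme Supplies",
  invoiceNumber: "INV-1001",
  date: "2024-03-01",
  total: { value: 52.5, raw: "$52.50" },
  subtotal: { value: 48, raw: "$48.00" },
  tax: { value: 4.5, raw: "$4.50" },
  lineItems: [
    { description: "Widget x2", amount: { value: 40, raw: "$40.00" } },
    { description: "Shipping", amount: { value: 8, raw: "$8.00" } },
  ],
  confidence: 0.9,
  warnings: [],
};

console.log(JSON.stringify(sampleResult, null, 2));
```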
Common uses
- Pull totals and dates out of a folder of receipts for expense reporting.
- Bootstrap a small bookkeeping pipeline without paying per-invoice to a hosted API.
- Verify what a vendor charged you against what your accounting software shows.
- Quickly diff two invoices from the same vendor by extracting both and comparing JSON.
- Feed structured invoice fields into a Wyreup chain (e.g., extract → csv-template → render report).
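Comparing two extractions, as in the diff use case above, can be sketched by flattening each JSON result into path/value pairs and reporting the paths that differ. `flattenJson` and `diffJson` are hypothetical helpers written for this example, not part of the tool.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Record every leaf value under a dotted path such as "total.value"
// or "lineItems[0].amount.raw".
function flattenJson(
  value: Json,
  path: string,
  out: { [key: string]: string }
): void {
  if (value === null || typeof value !== "object") {
    out[path] = JSON.stringify(value);
    return;
  }
  if (Array.isArray(value)) {
    let index = 0;
    for (const item of value) {
      flattenJson(item, path + "[" + index + "]", out);
      index++;
    }
    return;
  }
  for (const key of Object.keys(value)) {
    const child = value[key];
    if (child !== undefined) {
      flattenJson(child, path === "" ? key : path + "." + key, out);
    }
  }
}

// Return one line per leaf path whose value differs between a and b.
function diffJson(a: Json, b: Json): string[] {
  const left: { [key: string]: string } = {};
  const right: { [key: string]: string } = {};
  flattenJson(a, "", left);
  flattenJson(b, "", right);

  const paths = Object.keys(left);
  for (const p of Object.keys(right)) {
    if (!(p in left)) paths.push(p);
  }
  paths.sort();

  const changes: string[] = [];
  for (const p of paths) {
    const l = p in left ? left[p] : "(missing)";
    const r = p in right ? right[p] : "(missing)";
    if (l !== r) changes.push(p + ": " + l + " -> " + r);
  }
  return changes;
}
```

Passing the parsed JSON of two invoices from the same vendor to `diffJson` yields a short list of changed fields, which is easier to scan than two full documents side by side.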
Frequently asked questions
Will it work on any invoice?
It works best on text-based PDFs (the kind your accounting software exports). Scanned image-only PDFs need OCR first — chain through `ocr-pro` or `pdf-vision-ocr`. The heuristics are tuned for English-language invoices using `$`, `£`, `€`, or similar currency symbols.
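One way to decide whether a PDF needs OCR first is to look at how much text its pages actually yield. The check below is an illustrative assumption, not something the tool is documented to do; `looksImageOnly` and the threshold of 20 characters per page are made up for the example.

```typescript
// Heuristic: treat a PDF as image-only when its pages yield almost no text.
// `pageTexts` holds the extracted text of each page, one string per page.
function looksImageOnly(
  pageTexts: string[],
  minCharsPerPage: number = 20
): boolean {
  if (pageTexts.length === 0) return true;
  let chars = 0;
  for (const text of pageTexts) {
    chars += text.replace(/\s+/g, "").length; // ignore whitespace
  }
  return chars / pageTexts.length < minCharsPerPage;
}
```

A caller could route the file through an OCR step when this returns `true` and go straight to extraction otherwise.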
How accurate is the total detection?
Detection runs in two passes. The first looks for labelled amounts ("Total:", "Amount Due", "Grand Total"); if none is found, the second falls back to the largest currency value in the document. On well-structured invoices accuracy is high; on unusual layouts (e.g., totals embedded in narrative text) the result may be wrong. The `confidence` field and `warnings` list flag low-confidence extractions.
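The labelled-then-largest strategy can be sketched as below. This is a minimal illustration under stated assumptions, not the tool's actual patterns: the regular expressions, the `detectTotal` name, and the `source` field are all invented for the example, and only `$`, `£`, and `€` prefixes are handled.

```typescript
interface DetectedTotal {
  value: number;
  raw: string;
  source: "label" | "largest"; // which pass produced the result
}

// A currency symbol followed by digits, optional thousands separators,
// and optional decimals, e.g. "$1,200.00" or "€99.90".
const TOTAL_MONEY_PATTERN = "[$£€]\\s?\\d+(?:,\\d{3})*(?:\\.\\d{1,2})?";

function amountToNumber(raw: string): number {
  return Number(raw.replace(/[^\d.]/g, ""));
}

function detectTotal(text: string): DetectedTotal | null {
  // Pass 1: an amount that directly follows a total-style label.
  // The leading \b keeps "Subtotal" from matching as "total".
  const labelled = new RegExp(
    "\\b(?:grand\\s+total|amount\\s+due|total)\\b\\s*:?\\s*(" +
      TOTAL_MONEY_PATTERN +
      ")",
    "i"
  );
  const hit = labelled.exec(text);
  if (hit !== null) {
    const raw: string | undefined = hit[1];
    if (raw !== undefined) {
      return { value: amountToNumber(raw), raw, source: "label" };
    }
  }

  // Pass 2: fall back to the largest currency value anywhere in the text.
  const anyMoney = new RegExp(TOTAL_MONEY_PATTERN, "g");
  let best: DetectedTotal | null = null;
  let m: RegExpExecArray | null = anyMoney.exec(text);
  while (m !== null) {
    const raw: string | undefined = m[0];
    if (raw !== undefined) {
      const value = amountToNumber(raw);
      if (best === null || value > best.value) {
        best = { value, raw, source: "largest" };
      }
    }
    m = anyMoney.exec(text);
  }
  return best;
}
```

Recording which pass produced the value is one plausible input to a confidence signal: a fallback hit deserves less trust than a labelled one.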
Does it support non-USD currencies?
Yes. Set the currency symbol parameter to `£`, `€`, `¥`, or any other short prefix. Detection works the same way for any currency written as a prefix before the amount.
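A configurable symbol typically ends up inside a regular expression, where `$` is a metacharacter and has to be escaped. The sketch below shows that step; `escapeRegExp`, `amountMatcher`, and `findAmounts` are hypothetical names, and the pattern is an assumption rather than the tool's own.

```typescript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a matcher for "<symbol><amount>", capturing only the digits.
function amountMatcher(symbol: string): RegExp {
  return new RegExp(
    escapeRegExp(symbol) + "\\s?(\\d+(?:,\\d{3})*(?:\\.\\d{1,2})?)",
    "g"
  );
}

// Return every amount in `text` that is prefixed by `symbol`.
function findAmounts(text: string, symbol: string): number[] {
  const matcher = amountMatcher(symbol);
  const values: number[] = [];
  let m = matcher.exec(text);
  while (m !== null) {
    const digits: string | undefined = m[1];
    if (digits !== undefined) values.push(Number(digits.replace(/,/g, "")));
    m = matcher.exec(text);
  }
  return values;
}

console.log(findAmounts("Total £1,250.00, deposit £250", "£"));
console.log(findAmounts("CHF 80.00 plus $5 handling", "CHF"));
```

Without the escaping step, a `$` prefix would be read as an end-of-input anchor and the pattern would never match an amount.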
Is anything sent to a server?
No. PDF parsing (pdf.js) and the extraction heuristics both run entirely in your browser. You can disconnect from the network mid-extraction and it still finishes.
Can it extract line items?
Yes — any row that ends in a currency value and isn't the total/subtotal/tax becomes a line item. The description is everything before the amount; the amount is parsed into a structured `{ value, raw }` object.
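The row rule described above can be sketched like this. The patterns and the `extractLineItems` name are assumptions made for illustration; only the `{ value, raw }` amount shape comes from this page.

```typescript
interface ParsedAmount {
  value: number;
  raw: string;
}

interface ParsedLineItem {
  description: string;
  amount: ParsedAmount;
}

function extractLineItems(rows: string[]): ParsedLineItem[] {
  // Everything before the final currency value is the description.
  const trailingAmount =
    /^(.*?)\s*([$£€]\s?\d+(?:,\d{3})*(?:\.\d{1,2})?)\s*$/;
  // Rows that start with a summary label are not line items.
  const summaryLabel =
    /^\s*(?:grand\s+total|sub\s*-?\s*total|total|tax|amount\s+due)\b/i;

  const items: ParsedLineItem[] = [];
  for (const row of rows) {
    if (summaryLabel.test(row)) continue;
    const m = trailingAmount.exec(row);
    if (m === null) continue;
    const description: string | undefined = m[1];
    const raw: string | undefined = m[2];
    if (description === undefined || raw === undefined) continue;
    if (description.trim() === "") continue; // a bare amount has no description
    items.push({
      description: description.trim(),
      amount: { value: Number(raw.replace(/[^\d.]/g, "")), raw },
    });
  }
  return items;
}
```

Keeping `raw` next to the parsed `value` lets a reader check the number against the exact text that appeared in the PDF.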
Keywords
- invoice
- receipt
- extract
- data
- structured
- fields
- total
- parse