Extract structured data from a document

curl --request POST \
  --url https://api.lighton.ai/api/v3/extract \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: multipart/form-data' \
  --form 'file=@invoice.pdf' \
  --form 'schema={
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string"
    }
  }
}'

{
  "id": "ext_0196e4b2a3c14d5e8f7a9b2c1d0e3f4a",
  "status": "completed",
  "created_at": "2026-03-31T10:00:00+00:00",
  "completed_at": "2026-03-31T10:00:04+00:00",
  "processing_time_ms": 3200,
  "document": {
    "filename": "invoice.pdf",
    "page_count": 3,
    "file_size_bytes": 245120,
    "mime_type": "application/pdf"
  },
  "result": {
    "data": [
      {
        "invoice_number": "INV-2026-001",
        "total": null,
        "line_items": null
      },
      {
        "invoice_number": null,
        "total": 1250,
        "line_items": [
          {
            "description": "Widget A",
            "quantity": 10,
            "unit_price": 50
          },
          {
            "description": "Widget B",
            "quantity": 5,
            "unit_price": 150
          }
        ]
      },
      {
        "invoice_number": null,
        "total": null,
        "line_items": null
      }
    ],
    "pagination": {
      "page": 1,
      "page_size": 15,
      "total_items": 3,
      "total_pages": 1,
      "has_next": false,
      "has_prev": false
    }
  },
  "usage": {
    "pages_processed": 3
  },
  "progress": {
    "percentage": 100,
    "pages_processed": 3
  }
}
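
In the sample response above, each entry in result.data fills in only some of the fields and leaves the rest null. A minimal Python sketch of collapsing those entries into a single record follows. It assumes that the first non-null value per field should win, which this page does not specify, and merge_entries is a hypothetical helper name, not part of the API.

```python
from typing import Any


def merge_entries(entries: list[dict[str, Any]]) -> dict[str, Any]:
    """Collapse result.data entries, keeping the first non-null value per field."""
    merged: dict[str, Any] = {}
    for entry in entries:
        for key, value in entry.items():
            # Record the key even when null so missing values stay visible.
            if merged.get(key) is None:
                merged[key] = value
    return merged


# Trimmed copy of result.data from the sample response.
sample = [
    {"invoice_number": "INV-2026-001", "total": None, "line_items": None},
    {"invoice_number": None, "total": 1250, "line_items": [{"description": "Widget A"}]},
    {"invoice_number": None, "total": None, "line_items": None},
]
record = merge_entries(sample)
```

Whether a merge like this is appropriate depends on the schema; fields that legitimately repeat across entries would need a different policy.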

Extract

Extract structured data from a document

Pull specific fields from a document into a typed schema.

Accepts either a file upload (multipart/form-data) or a document URL (JSON body), plus a JSON Schema (the schema field) describing what to extract.
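
For the document URL variant, the request is a plain JSON POST. The sketch below prepares such a request with the Python standard library without sending it. The helper name build_json_request and the token value are placeholders; only the endpoint URL and the document and schema field names come from this page.

```python
import json
import urllib.request

API_URL = "https://api.lighton.ai/api/v3/extract"


def build_json_request(token: str, document_url: str, schema: dict) -> urllib.request.Request:
    """Prepare (but do not send) a JSON-body extract request."""
    body = json.dumps({"document": document_url, "schema": schema}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_json_request(
    "my-token",  # placeholder, replace with a real auth token
    "https://example.com/report.pdf",
    {"type": "object", "properties": {"title": {"type": "string"}}},
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```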

Sync mode (default)

Blocks until extraction completes and returns 200 with the full result.

curl -X POST https://api.lighton.ai/api/v3/extract \
  -H "Authorization: Bearer $TOKEN" \
  -F file=@invoice.pdf \
  -F 'schema={"type":"object","properties":{"invoice_number":{"type":"string"}}}'

Async mode (`options.async = true`)

Returns 202 immediately with an ext_<token> job id. Poll GET /api/v3/extract/{id} with that same id until status is completed or failed.

curl -X POST https://api.lighton.ai/api/v3/extract \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"document": "https://example.com/report.pdf", "schema": {"type": "object", "properties": {"title": {"type": "string"}}}, "options": {"async": true}}'
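
The polling step can be sketched in Python as a loop around a caller-supplied function. Here fetch_job is a hypothetical callable that performs GET /api/v3/extract/{id} and returns the decoded JSON body. The polling interval, the attempt cap, and the intermediate "processing" status used in the demo are assumptions, since this page names only completed and failed.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_job: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Call fetch_job(job_id) until status is completed or failed."""
    for _ in range(max_attempts):
        job = fetch_job(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")


# Stand-in for a real GET /api/v3/extract/{id} call, for demonstration only.
_responses = iter([{"status": "processing"}, {"status": "completed"}])
final = poll_until_done(lambda _id: next(_responses), "ext_example", interval_s=0)
```

Passing the fetch function in keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.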

For multipart uploads, pass options as a JSON-encoded form field: -F 'options={"async":true}'.
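
A small Python sketch of preparing the non-file form fields for a multipart upload, with schema and options each serialized to a JSON string as described above. The function name multipart_text_fields is illustrative only.

```python
import json
from typing import Optional


def multipart_text_fields(schema: dict, options: Optional[dict] = None) -> dict:
    """Return the non-file form fields, each JSON-encoded as a string."""
    fields = {"schema": json.dumps(schema)}
    if options is not None:
        fields["options"] = json.dumps(options)
    return fields


fields = multipart_text_fields(
    {"type": "object", "properties": {"invoice_number": {"type": "string"}}},
    {"async": True},
)
```

The resulting strings can be sent alongside the file part with whichever multipart-capable HTTP client you use.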

Supported file types: .pdf, .png, .jpg, .jpeg, .pptx, .ppt, .odp, .docx, .odt, .doc, .html

Sync limits: 20 MB file size, 15 pages.

Async limits: 100 MB file size, 1000 pages.
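
The file type list and the two sets of limits can be turned into a client-side pre-check. The Python sketch below makes two assumptions that this page does not confirm: that 1 MB means 2^20 bytes, and that the limits are inclusive. The function name choose_mode is hypothetical.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".pptx", ".ppt",
    ".odp", ".docx", ".odt", ".doc", ".html",
}
MB = 1024 * 1024  # assumption: the page does not say whether MB is 10^6 or 2^20 bytes
SYNC_LIMITS = (20 * MB, 15)      # (max bytes, max pages)
ASYNC_LIMITS = (100 * MB, 1000)  # (max bytes, max pages)


def choose_mode(filename: str, size_bytes: int, page_count: int) -> str:
    """Return "sync" or "async", or raise ValueError if the file cannot be submitted."""
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {suffix or filename}")
    if size_bytes <= SYNC_LIMITS[0] and page_count <= SYNC_LIMITS[1]:
        return "sync"
    if size_bytes <= ASYNC_LIMITS[0] and page_count <= ASYNC_LIMITS[1]:
        return "async"
    raise ValueError("file exceeds the async limits")
```

For example, choose_mode("invoice.pdf", 245120, 3) selects sync mode, while a 200-page PDF would be routed to async mode.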

POST /api/v3/extract

Authorizations

Authorization: string, header, required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

Body for POST /api/v3/extract.

schema is the JSON Schema that drives extraction. Send it as a JSON object on JSON requests and as a JSON-encoded string on multipart requests; both forms are coerced to a dict.

options is a free-form dict; currently supports {"async": bool}.
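
To illustrate the coercion described above, here is a Python sketch of a function that accepts either form of schema and always returns a dict. It is an illustration of the described behavior, not the service's actual code, and coerce_schema is a hypothetical name.

```python
import json
from typing import Union


def coerce_schema(raw: Union[dict, str]) -> dict:
    """Accept a dict (JSON request) or a JSON-encoded string (multipart request)."""
    schema = json.loads(raw) if isinstance(raw, str) else raw
    if not isinstance(schema, dict):
        raise TypeError("schema must decode to a JSON object")
    return schema


as_dict = coerce_schema({"type": "object"})
as_text = coerce_schema('{"type": "object"}')
```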

schema: Schema · object, required
document: string | null
options: Options · object

Response

Extraction completed (sync mode).

id: string, required
status: string, required
created_at: string<date-time> | null
completed_at: string<date-time> | null
processing_time_ms: integer | null
document: ExtractDocument · object | null
result: ExtractResult · object | null
usage: ExtractUsage · object | null
progress: JobProgress · object | null

Live progress of a long-running async job while it is in flight.

Shared by the parse (GET /api/v3/parse/<id>) and extract (GET /api/v3/extract/<id>) polling envelopes: pages_processed is the count of pages done so far and percentage is the completion percentage [0, 100] derived from it.
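
One plausible way to derive percentage from pages_processed is sketched below in Python. The page does not give the formula, so the use of the document's page count as the denominator and the rounding down are both assumptions, and completion_percentage is a hypothetical name.

```python
def completion_percentage(pages_processed: int, page_count: int) -> int:
    """Derive a completion percentage in [0, 100] from a page counter."""
    if page_count <= 0:
        # Nothing to measure against yet, so report no progress.
        return 0
    ratio = pages_processed / page_count
    # Round down and clamp so the result always stays within [0, 100].
    return max(0, min(100, int(ratio * 100)))
```

With the sample response's values (3 pages processed out of 3), this yields 100, which matches the progress object shown there.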
