From documents to answers

Have you ever asked a chatbot a question and gotten a confident, well-written answer that was simply wrong? You’ve met the core limitation of a language model on its own: it can only draw on what it absorbed during training. It doesn’t know your contracts, your wiki, last quarter’s report, or anything published after its cutoff. Retrieval-augmented generation (RAG) is the standard fix, and it’s what most of the LightOn API is built to power.

The two halves of RAG

RAG combines two very different kinds of models. Understanding the split is the key to using the API well.

Retrieval models

Given a query, they find the most relevant pieces of text in a corpus. They don’t write anything; they rank existing passages by how well they match the meaning of the query. They are fast, factual, and grounded in your actual documents.

Generative models

Given a prompt, they write new text that is fluent, coherent, and in natural language. On their own they have no access to your data and can hallucinate. Their strength is synthesis and phrasing, not recall.

A retrieval model can tell you which paragraph answers a question but not phrase the answer. A generative model can write a beautiful answer but doesn’t know your documents. RAG bolts them together: retrieve the relevant passages first, then hand those passages to the generative model as context so its answer is grounded in real source material instead of its training-time memory.
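To make the "hand those passages to the generative model as context" step concrete, here is a minimal sketch of how retrieved passages might be folded into a prompt. The function name, the prompt wording, and the sample passages are all made up for illustration; this is not the prompt Ask uses internally.

```python
def build_grounded_prompt(question, passages):
    """Assemble a prompt that asks the model to answer only from the passages."""
    # Number each passage so the model can cite it as [1], [2], ...
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the passages below. "
        "Cite passages by number, and say so if they do not contain the answer.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}"
    )


# Invented sample passages, standing in for what retrieval would return.
passages = [
    "Access tokens expire after 15 minutes.",
    "Refresh tokens are valid for 30 days.",
]
prompt = build_grounded_prompt("When do access tokens expire?", passages)
print(prompt)
```

The resulting string is what you would send to a generative model of your choice. Numbering the passages is one simple way to make the answer traceable back to its sources.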

RAG flow: a question goes through retrieval over your indexed documents to fetch relevant passages, which generation turns into a grounded answer. Ask does both halves in one call.

The payoff: answers that cite real sources, stay current as you add documents, and can be traced back to where they came from.

Retrieval quality is the ceiling on answer quality. A generative model can only be as accurate as the passages it’s given. If retrieval surfaces the wrong context, even the best model will write a confident, wrong answer. That’s why LightOn invests heavily in the retrieval pipeline: hybrid semantic + lexical search with a reranker on top.

How retrieval works under the hood

When you ingest a document, LightOn turns it into a retrieval-ready index for you. Behind a single API call sits a full document-understanding pipeline: it reads the layout and structure of the document, breaks the content into meaningful units, builds rich semantic representations of them, and indexes everything for fast retrieval, with a lot of careful work in between to keep results accurate across messy, real-world files. You never have to run a vector database, an OCR model, or an embedding pipeline yourself. At query time, retrieval runs a hybrid lookup (vector search for meaning and lexical search for exact terms), then a reranker scores every candidate against the full query and returns the best passages.
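As a rough illustration of what a hybrid lookup means, the toy sketch below ranks passages twice (once by exact term overlap, once by a similarity score) and merges the two rankings. Everything here is an assumption made for teaching purposes: bag-of-words cosine similarity stands in for real embeddings, reciprocal rank fusion is one common way to merge rankings and is not confirmed to be what LightOn does, and the reranking step is left out entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_score(query, passage):
    """Count how many distinct query terms appear verbatim in the passage."""
    return len(set(tokenize(query)) & set(tokenize(passage)))


def cosine_score(query, passage):
    """Cosine similarity over bag-of-words counts (a stand-in for embeddings)."""
    q, p = Counter(tokenize(query)), Counter(tokenize(passage))
    dot = sum(q[t] * p[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def hybrid_rank(query, passages, k=60):
    """Fuse both rankings with reciprocal rank fusion: sum of 1 / (k + rank)."""
    fused = Counter()
    for scorer in (lexical_score, cosine_score):
        ranked = sorted(
            range(len(passages)),
            key=lambda i: scorer(query, passages[i]),
            reverse=True,
        )
        for rank, idx in enumerate(ranked, start=1):
            fused[idx] += 1.0 / (k + rank)
    return [passages[i] for i, _ in fused.most_common()]


# Invented sample corpus.
corpus = [
    "Refresh tokens are valid for 30 days.",
    "JWT tokens expire after 15 minutes.",
    "The cafeteria opens at noon.",
]
ranking = hybrid_rank("when do JWT tokens expire", corpus)
print(ranking[0])
```

In a production pipeline the fused candidates would then go to a reranker, which scores each one against the full query before the best passages are returned.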

Mapping use cases to endpoints

LightOn exposes each piece of this stack as an endpoint, so you can use as much or as little of the RAG pattern as you need.

Build a searchable knowledge base

Ingest documents once into a persistent, indexed corpus, then retrieve over it as often as you like.

Files

Ingestion. Upload documents and LightOn turns them into a searchable index automatically, running the whole document-understanding pipeline for you. This is the “your documents” box in the diagram above.

Search

Retrieval only. Send a natural-language query, get back ranked passages with scores and sources. Use this when you want the raw material and intend to do your own ranking, display, or generation on top.

Ask

Retrieval + generation. Full RAG in a single call: it retrieves and generates a grounded answer with citations. The fastest path from question to cited answer.

The choice between Search and Ask is the choice between owning the generation step or not:

Reach for Ask for straightforward, single-turn question answering. One retrieval, one generation, with a fixed prompt: minimal code on your side.
Reach for Search when you want to drive generation yourself: multi-step retrieval, query rewriting, conversational memory, custom prompts, or your own choice of model. Call Search inside your own agentic loop, then feed the passages to whatever generative model you prefer.

Process documents on the fly

Sometimes you don’t want a persistent corpus at all; you just want to turn one document into something machine-readable for your own pipeline. Nothing is stored.

Parse

Document to clean Markdown. The parsing step from the ingestion pipeline, exposed on its own. Useful when you want to feed text to your own LLM, store it, or display it.

Extract

Document to typed fields. Give it a JSON Schema and it pulls those fields out of every page. Built for mechanical, repetitive processing: stacks of invoices, batches of forms.
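For readers who have not written a JSON Schema before, here is a small sketch of what one for invoices could look like, expressed as a Python dict. The field names are hypothetical, and how the schema is passed to Extract is not shown here; check the Extract reference for the actual request shape. The helper at the end is likewise an illustrative assumption: a quick check you might run on whatever record comes back.

```python
import json

# Hypothetical invoice schema: the field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "issue_date": {"type": "string", "format": "date"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount"],
}


def missing_required(record, schema):
    """Return the required fields that an extracted record does not contain."""
    return [name for name in schema["required"] if name not in record]


# JSON Schema is plain JSON, so it serializes directly for a request body.
print(json.dumps(INVOICE_SCHEMA, indent=2))

# An invented record that lacks one required field.
print(missing_required({"invoice_number": "INV-001"}, INVOICE_SCHEMA))
```

Marking only the fields you truly depend on as required keeps the schema tolerant of documents where optional details are absent.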

Demystifying agentic RAG: it’s just a loop!

Agentic RAG loop: a question goes to an agent (LLM) that repeatedly sends a query to Search over your indexed documents and reads the returned passages, looping until it has enough to write the grounded answer.

“Agentic RAG” sounds like it needs a framework. It doesn’t. At its core it’s a while loop where a model decides what to look up, reads what comes back, and chooses whether to search again or write the answer. Search is the only retrieval tool you need inside that loop.

Plain Ask runs one retrieval and one generation. That covers a direct question like “What is the JWT token expiry policy?”. It falls short when a question needs several lookups, or when the right query only becomes clear after you’ve seen the first results. Take “How does our token expiry compare to the OAuth refresh window, and which one expires first?”. Answering it well means retrieving the token policy, separately retrieving the refresh window, then reasoning over both.

The loop has three moving parts:

The model proposes a search query.
You call Search and hand the passages back to the model.
The model either proposes another query or decides it has enough to answer.

Give the model Search as a tool and let it drive:

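import os

import requests

# Read the key from the environment; "$LIGHTON_API_KEY" inside a Python
# string would be sent literally instead of being expanded.
HEADERS = {"Authorization": f"Bearer {os.environ['LIGHTON_API_KEY']}"}

def search(query, max_results=5):
    response = requests.post(
        "https://api.lighton.ai/api/v3/search",
        headers=HEADERS,
        json={"query": query, "max_results": max_results},
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return response.json()["results"]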

# `llm` is any chat model that supports tool calling.
SEARCH_TOOL = {
    "name": "search_documents",
    "description": "Search the document corpus and return relevant passages.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

messages = [{
    "role": "user",
    "content": "How does our token expiry compare to the OAuth refresh window, and which one expires first?",
}]

while True:
    reply = llm.chat(messages=messages, tools=[SEARCH_TOOL])

    if reply.tool_call:
        results = search(reply.tool_call.input["query"])
        passages = "\n\n".join(r["content"] for r in results)
        messages.append(reply.message)
        messages.append({"role": "tool", "content": passages})
        continue

    print(reply.text)  # no more searches needed: this is the grounded answer
    break

That is the whole pattern. On each turn the model reads the passages it has gathered so far and decides the next query, so multi-step retrieval falls out of the loop on its own. When the model stops calling the tool, its final message is the answer, generated over everything it retrieved. Because you own the loop, you also own the parts Ask keeps fixed: the system prompt, which model runs, how many rounds you allow before forcing an answer, and any reranking or filtering you apply to the passages before they go back to the model.
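Two of the knobs mentioned above, the round limit and passage filtering, can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: `run_agent`, `llm_step`, `max_rounds`, and `min_score` are hypothetical names, the `"score"` key on each result is assumed, and the model and search calls are replaced by fakes so the control flow can be run on its own.

```python
def run_agent(question, llm_step, search, max_rounds=3, min_score=0.0):
    """Run the search loop, forcing an answer after max_rounds lookups.

    llm_step(messages, allow_search) returns ("search", query) or ("answer", text).
    search(query) returns a list of dicts with "content" and "score" keys.
    """
    messages = [{"role": "user", "content": question}]
    for round_number in range(max_rounds + 1):
        # On the final pass the model may no longer search, so it must answer.
        allow_search = round_number < max_rounds
        action, payload = llm_step(messages, allow_search)
        if action == "answer":
            return payload
        # Drop weak passages before they reach the model.
        kept = [
            r["content"] for r in search(payload) if r.get("score", 0.0) >= min_score
        ]
        messages.append({"role": "assistant", "content": f"search: {payload}"})
        messages.append({"role": "tool", "content": "\n\n".join(kept)})
    raise RuntimeError("llm_step kept searching after searches were disabled")


def fake_search(query):
    """Stand-in for the Search call: one strong and one weak passage."""
    return [
        {"content": f"passage about {query}", "score": 0.9},
        {"content": "unrelated", "score": 0.1},
    ]


def fake_llm_step(messages, allow_search):
    """Stand-in for the model: search twice if allowed, then answer."""
    lookups = sum(1 for m in messages if m["role"] == "tool")
    if allow_search and lookups < 2:
        return "search", ["token expiry", "refresh window"][lookups]
    return "answer", f"answered after {lookups} lookups"


answer = run_agent("Which expires first?", fake_llm_step, fake_search, min_score=0.5)
print(answer)
```

Passing the model and the search function in as arguments keeps the loop itself free of network calls, which makes the stopping rule easy to test before wiring in a real model.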

Reach for this when a single query can’t capture the question, or when one answer depends on the result of an earlier lookup. For straightforward, single-turn question answering, Ask does the same retrieve-then-generate step in one call.

Of course, advanced deep-search use cases may call for more than this: specialized sub-agents, planning steps, parallel retrieval. At that point you genuinely do want a framework. That’s out of scope for this tutorial, but the loop above is the foundation it all builds on.

Ship your first RAG

Uploading & managing files

Build the corpus everything else retrieves over.

Searching documents

See retrieval in action and tune it to your needs.

Asking questions

Get a grounded, cited answer in one call.
