Inceptron is a platform for hosting and serving optimized open models (Llama, Kimi, MiniMax, GLM, and others) with best-in-class price-performance, exposed through an OpenAI-compatible endpoint. The infrastructure is enterprise-ready (ISO 27001 and GDPR compliant).

Pair these models with LightOn search to build retrieval-augmented generation (RAG) pipelines in which retrieval stays on LightOn’s infrastructure and generation runs on Inceptron. The flow is:
  1. Search LightOn for the passages most relevant to the user’s question.
  2. Pack those passages into the model’s context window.
  3. Call the Inceptron model to generate an answer grounded in the retrieved content.

Prerequisites

  • A LIGHTON_API_KEY, available in the Console → API Keys section.
  • An Inceptron API key, available from the Inceptron console. Store it as INCEPTRON_API_KEY.
  • At least one workspace with indexed documents on LightOn.
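
Both examples below read the keys from the environment and fail with a bare KeyError if one is missing. A minimal sketch of an up-front check follows; the helper name missing_keys is illustrative and not part of either SDK.

```python
import os

# The two variables the examples in this guide read from the environment.
REQUIRED_KEYS = ("LIGHTON_API_KEY", "INCEPTRON_API_KEY")


def missing_keys(env=None) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]
```

Calling missing_keys() at startup and exiting with a readable message when the list is non-empty tends to be easier to debug than a KeyError raised at import time.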

Installation

pip install requests openai
The openai package is used here only for its client; Inceptron’s endpoint is fully compatible with it.

Full example

import os
import requests
from openai import OpenAI

LIGHTON_API_KEY = os.environ["LIGHTON_API_KEY"]
INCEPTRON_API_KEY = os.environ["INCEPTRON_API_KEY"]

inceptron = OpenAI(
    base_url="https://api.inceptron.io/v1",
    api_key=INCEPTRON_API_KEY,
)


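def search(query: str, workspace_id: list[int] | None = None, max_results: int = 5) -> list[dict]:
    """Return the passages LightOn ranks as most relevant to the query."""
    payload = {"query": query, "max_results": max_results}
    # Only send workspace_id when the caller wants to restrict the search.
    if workspace_id:
        payload["workspace_id"] = workspace_id

    response = requests.post(
        "https://api.lighton.ai/api/v3/search",
        headers={"Authorization": f"Bearer {LIGHTON_API_KEY}"},
        json=payload,
        timeout=30,  # requests waits indefinitely unless a timeout is set
    )
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error body
    return response.json()["results"]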


def answer(question: str, workspace_id: list[int] | None = None, model: str = "nvidia/llama-3.3-70b-instruct-fp8") -> str:
    results = search(question, workspace_id=workspace_id)

    context = "\n\n".join(
        f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        for r in results
        if r["content"]
    )

    completion = inceptron.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question using only "
                    "the provided context. If the context does not contain enough information, "
                    "say so.\n\nContext:\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content


print(answer("What is our data retention policy?"))
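
Step 2 of the flow, packing passages into the context window, is done above with a plain join, which can overflow the model's context if passages are long or max_results is raised. The sketch below adds a size cap. It assumes the same result shape used above (source.filename, source.page_start, content) and uses a character count as a rough stand-in for tokens; the helper name build_context and the 12000 default are illustrative choices, not values taken from either API.

```python
def build_context(results: list[dict], max_chars: int = 12000) -> str:
    """Join passages in ranked order, stopping before the budget is exceeded."""
    blocks = []
    used = 0
    for r in results:
        if not r.get("content"):
            continue  # skip results that carry no text
        block = f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        # Account for the two-character "\n\n" separator between blocks.
        cost = len(block) + (2 if blocks else 0)
        if used + cost > max_chars:
            break  # results are ranked, so drop the tail rather than the head
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)
```

In answer, context = build_context(results) would take the place of the join expression. A tokenizer-based budget would be more precise, but which tokenizer applies depends on the model you choose.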

LightOn search as a tool

Instead of always searching before calling the model, you can expose LightOn search as a tool and let the model decide when to call it. The model issues a lighton_search tool call when it needs context; your code executes the search and feeds the results back; the model then produces a final answer.
import json
import os
import requests
from openai import OpenAI

LIGHTON_API_KEY = os.environ["LIGHTON_API_KEY"]
INCEPTRON_API_KEY = os.environ["INCEPTRON_API_KEY"]

inceptron = OpenAI(
    base_url="https://api.inceptron.io/v1",
    api_key=INCEPTRON_API_KEY,
)

SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "lighton_search",
        "description": (
            "Search the company knowledge base for passages relevant to a query. "
            "Returns ranked excerpts with their source filename and page numbers."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural-language search query.",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Number of passages to return (1–50, default 5).",
                    "default": 5,
                },
            },
            "required": ["query"],
        },
    },
}


def run_search(query: str, max_results: int = 5) -> str:
    response = requests.post(
        "https://api.lighton.ai/api/v3/search",
        headers={"Authorization": f"Bearer {LIGHTON_API_KEY}"},
        json={"query": query, "max_results": max_results},
        timeout=30,  # requests waits indefinitely unless a timeout is set
    )
    response.raise_for_status()
    results = response.json()["results"]
    passages = [
        f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        for r in results
        if r["content"]
    ]
    return "\n\n".join(passages) if passages else "No results found."


def answer(question: str, model: str = "nvidia/llama-3.3-70b-instruct-fp8") -> str:
    messages = [{"role": "user", "content": question}]

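    # Cap the number of round trips so a model that keeps calling the tool
    # cannot loop forever.
    for _ in range(5):
        completion = inceptron.chat.completions.create(
            model=model,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        choice = completion.choices[0]

        # No tool calls means the model has produced its final answer.
        if not choice.message.tool_calls:
            return choice.message.content

        # Keep the assistant turn in the history, then answer each tool call.
        messages.append(choice.message)
        for call in choice.message.tool_calls:
            args = json.loads(call.function.arguments)
            result = run_search(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })

    raise RuntimeError("The model did not produce a final answer after 5 rounds.")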


print(answer("What is our data retention policy?"))
The loop handles the case where the model issues multiple search calls in sequence before producing a final answer.
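
The tool arguments are generated by the model, so they may be malformed JSON, include keys that run_search does not accept, or ask for a max_results outside the 1–50 range stated in the tool description. A hedged sketch of a validation step follows; parse_search_args is a hypothetical helper, and falling back to the default of 5 for a non-integer max_results is one possible policy, not documented behavior.

```python
import json


def parse_search_args(raw_arguments: str) -> dict:
    """Turn model-supplied tool arguments into safe keyword arguments for run_search."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        raise ValueError("tool arguments are not valid JSON")
    if not isinstance(args, dict) or not isinstance(args.get("query"), str):
        raise ValueError("tool arguments must be an object with a string 'query'")

    # Copy only the known keys; anything else the model invented is dropped.
    max_results = args.get("max_results", 5)
    if isinstance(max_results, bool) or not isinstance(max_results, int):
        max_results = 5  # fall back to the default on a wrong type
    return {"query": args["query"], "max_results": max(1, min(50, max_results))}
```

In the loop, args = json.loads(call.function.arguments) could become args = parse_search_args(call.function.arguments), with any ValueError message sent back to the model as the tool message content so it can retry with corrected arguments.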

Scoping retrieval to a workspace

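Pass workspace_id, a list of one or more workspace IDs, to limit search to those workspaces. This is useful in multi-tenant products where each customer’s data lives in a dedicated workspace. The call below uses the answer function from the full example; the tool-calling variant does not take a workspace_id parameter.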
answer("Summarize the onboarding checklist", workspace_id=[42])

Choosing a model

Inceptron’s catalog includes several hosted models. Pass the model name to the model parameter:
Model                               Notes
nvidia/llama-3.3-70b-instruct-fp8   Strong general-purpose model, good default choice
zai-org/GLM-5.1-FP8                 Capable multilingual model
MiniMaxAI/MiniMax-M2.5              Long-context model
moonshotai/Kimi-K2.6                Reasoning model with strong agentic abilities
moonshotai/Kimi-K2.6-Fast           Faster variant of Kimi K2.6
moonshotai/Kimi-K2.7-Code           Tuned for code generation
You can list the models available to your key at any time:
print([m.id for m in inceptron.models.list().data])
See the Inceptron models catalog for the current list, context limits, and pay-as-you-go pricing.
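
Because the catalog can change and not every model may be enabled for your key, it can help to resolve the model name at startup instead of hard-coding a single value. A minimal sketch, assuming you pass in the IDs returned by the models listing; pick_model is a hypothetical helper.

```python
def pick_model(preferred: list[str], available: list[str]) -> str:
    """Return the first preferred model that the key can actually use."""
    available_set = set(available)
    for name in preferred:
        if name in available_set:
            return name
    raise LookupError(f"None of the preferred models are available: {preferred}")
```

For example, pick_model(["moonshotai/Kimi-K2.6", "nvidia/llama-3.3-70b-instruct-fp8"], [m.id for m in inceptron.models.list().data]) would prefer Kimi K2.6 and fall back to the Llama model.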

A note on reasoning models

The Kimi models (moonshotai/Kimi-K2.6, moonshotai/Kimi-K2.6-Fast, moonshotai/Kimi-K2.7-Code) are reasoning models: they spend tokens thinking before producing an answer, and they return that thinking under a separate reasoning field. If you set max_tokens too low, the model can exhaust the budget while still reasoning, so the request finishes with finish_reason="length" and message.content is None. Give reasoning models a generous token budget (a couple of thousand tokens or more) to leave room for the final answer. The non-reasoning models in the table above are not affected.
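
The failure mode described above is easy to miss, because returning message.content directly hands None to the caller. The sketch below turns it into an explicit error. extract_answer and fake_completion are hypothetical helpers; fake_completion only mimics the attribute shape of a chat completion so the check can be exercised offline.

```python
from types import SimpleNamespace


def extract_answer(completion) -> str:
    """Return the final answer, or raise with a hint if the token budget ran out."""
    choice = completion.choices[0]
    content = choice.message.content
    if content:
        return content
    if choice.finish_reason == "length":
        raise RuntimeError(
            "The response hit the token limit before any answer text was produced; "
            "retry with a larger max_tokens."
        )
    raise RuntimeError(f"Empty response (finish_reason={choice.finish_reason!r}).")


def fake_completion(content, finish_reason):
    """Offline stand-in with the same attribute shape as a chat completion."""
    message = SimpleNamespace(content=content)
    choice = SimpleNamespace(message=message, finish_reason=finish_reason)
    return SimpleNamespace(choices=[choice])
```

With a real response, return extract_answer(completion) would replace return completion.choices[0].message.content.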

Streaming responses

Inceptron’s endpoint supports streaming. Enable it by passing stream=True and iterating over the response:
stream = inceptron.chat.completions.create(
    model="nvidia/llama-3.3-70b-instruct-fp8",
    messages=[...],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
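
If you also need the complete text after streaming finishes, for logging or caching for instance, the deltas can be accumulated while they are printed. This is a sketch under two assumptions: chunks have the attribute shape used above, and the server may send chunks with an empty choices list (the OpenAI API does so for usage-only chunks; whether Inceptron does is not stated here). collect_stream and fake_chunk are hypothetical helpers.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Print deltas as they arrive and return the concatenated text."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # nothing to print in a chunk without choices
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            print(delta, end="", flush=True)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_chunk(text=None, empty=False):
    """Offline stand-in with the same attribute shape as a streamed chunk."""
    if empty:
        return SimpleNamespace(choices=[])
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])
```

With a real response, full_text = collect_stream(stream) would replace the for loop.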