OVHcloud AI Endpoints expose hosted open models (Qwen, Mistral, Llama, gpt-oss, and others) through an OpenAI-compatible endpoint, served from OVHcloud’s European infrastructure. Pair them with LightOn search to build RAG pipelines where the retrieval stays on LightOn’s infrastructure and the generation runs on OVHcloud. The flow is:
  1. Search LightOn for the passages most relevant to the user’s question.
  2. Pack those passages into the model’s context window.
  3. Call the OVHcloud model to generate an answer grounded in the retrieved content.

Prerequisites

  • A LIGHTON_API_KEY, available in the Console → API Keys section.
  • An OVHcloud AI Endpoints access token, available in the OVHcloud Control Panel under Public Cloud → AI Endpoints. Store it as OVH_AI_ENDPOINTS_ACCESS_TOKEN.
  • At least one workspace with indexed documents on LightOn.
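Before running the examples, it can help to check that both credentials are actually set. The following is a minimal sketch; the helper name missing_env is hypothetical and not part of either SDK:

```python
import os

# The two variable names come from the prerequisites above.
REQUIRED_ENV = ("LIGHTON_API_KEY", "OVH_AI_ENDPOINTS_ACCESS_TOKEN")


def missing_env(names=REQUIRED_ENV, env=None) -> list:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Typical use at the top of a script:
#   missing = missing_env()
#   if missing:
#       raise SystemExit("Missing environment variables: " + ", ".join(missing))
```

Failing early with an explicit message is usually easier to diagnose than the bare KeyError raised by os.environ["..."].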

Installation

pip install requests openai
The openai package is used here only for its client; OVHcloud’s /chat/completions endpoint is fully compatible with it.

Full example

import os
import requests
from openai import OpenAI

LIGHTON_API_KEY = os.environ["LIGHTON_API_KEY"]
OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ["OVH_AI_ENDPOINTS_ACCESS_TOKEN"]

ovh = OpenAI(
    base_url="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1",
    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN,
)


def search(query: str, workspace_id: list[int] | None = None, max_results: int = 5) -> list[dict]:
    payload = {"query": query, "max_results": max_results}
    if workspace_id:
        payload["workspace_id"] = workspace_id

    response = requests.post(
        "https://api.lighton.ai/api/v3/search",
        headers={"Authorization": f"Bearer {LIGHTON_API_KEY}"},
        json=payload,
        timeout=30,  # fail instead of hanging if the search API is unreachable
    )
    response.raise_for_status()
    return response.json()["results"]


def answer(question: str, workspace_id: list[int] | None = None, model: str = "Meta-Llama-3_3-70B-Instruct") -> str:
    results = search(question, workspace_id=workspace_id)

    context = "\n\n".join(
        f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        for r in results
        if r["content"]
    )

    completion = ovh.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question using only "
                    "the provided context. If the context does not contain enough information, "
                    "say so.\n\nContext:\n" + context
                ),
            },
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content


print(answer("What is our data retention policy?"))
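The example above joins every returned passage into the prompt. With a larger max_results or long passages, step 2 of the flow (packing passages into the context window) may need a size limit. The sketch below is one possible approach, assuming the result shape used above (content, source.filename, source.page_start); the pack_context helper and its character budget are hypothetical, and a character count is only a rough stand-in for a token count:

```python
def pack_context(results: list, max_chars: int = 12_000) -> str:
    """Join passages in ranked order until the character budget is reached."""
    blocks = []
    used = 0
    for r in results:
        if not r.get("content"):
            continue  # skip empty passages, as the full example does
        block = f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        cost = len(block) + (2 if blocks else 0)  # 2 chars for the "\n\n" separator
        if used + cost > max_chars:
            break  # results are ranked, so stop at the first passage that does not fit
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks)
```

In answer, the "\n\n".join(...) expression could then be replaced by pack_context(results).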

LightOn search as a tool

Instead of always searching before calling the model, you can expose LightOn search as a tool and let the model decide when to call it. The model issues a lighton_search tool call when it needs context; your code executes the search and feeds the results back; the model then produces a final answer.
import json
import os
import requests
from openai import OpenAI

LIGHTON_API_KEY = os.environ["LIGHTON_API_KEY"]
OVH_AI_ENDPOINTS_ACCESS_TOKEN = os.environ["OVH_AI_ENDPOINTS_ACCESS_TOKEN"]

ovh = OpenAI(
    base_url="https://oai.endpoints.kepler.ai.cloud.ovh.net/v1",
    api_key=OVH_AI_ENDPOINTS_ACCESS_TOKEN,
)

SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "lighton_search",
        "description": (
            "Search the company knowledge base for passages relevant to a query. "
            "Returns ranked excerpts with their source filename and page numbers."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Natural-language search query.",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Number of passages to return (1–50, default 5).",
                    "default": 5,
                },
            },
            "required": ["query"],
        },
    },
}


def run_search(query: str, max_results: int = 5) -> str:
    response = requests.post(
        "https://api.lighton.ai/api/v3/search",
        headers={"Authorization": f"Bearer {LIGHTON_API_KEY}"},
        json={"query": query, "max_results": max_results},
        timeout=30,  # fail instead of hanging if the search API is unreachable
    )
    response.raise_for_status()
    results = response.json()["results"]
    passages = [
        f"[{r['source']['filename']}, p.{r['source']['page_start']}]\n{r['content']}"
        for r in results
        if r["content"]
    ]
    return "\n\n".join(passages) if passages else "No results found."


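def answer(question: str, model: str = "Meta-Llama-3_3-70B-Instruct", max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": question}]

    # Bound the loop so a model that keeps requesting searches cannot spin forever.
    for _ in range(max_rounds):
        completion = ovh.chat.completions.create(
            model=model,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        message = completion.choices[0].message

        # No tool call requested: the model has produced its final answer.
        if not message.tool_calls:
            return message.content

        # Keep the assistant turn, then answer each tool call it contains.
        messages.append(message)
        for call in message.tool_calls:
            args = json.loads(call.function.arguments)
            result = run_search(**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": result,
            })

    raise RuntimeError(f"No final answer after {max_rounds} rounds of tool calls")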


print(answer("What is our data retention policy?"))
The loop handles the case where the model issues multiple search calls in sequence before producing a final answer.
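Tool-call arguments are generated by the model, so they can contain unexpected keys, out-of-range values, or malformed JSON. A minimal sketch of validating them before they reach run_search follows; the parse_search_args helper is hypothetical, and the 1-50 clamp simply mirrors the range stated in the tool description:

```python
import json


def parse_search_args(raw_arguments: str) -> dict:
    """Turn a tool call's JSON arguments into safe keyword arguments."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        args = {}
    if not isinstance(args, dict):
        args = {}

    query = args.get("query")
    if not isinstance(query, str) or not query.strip():
        raise ValueError("tool call did not include a usable query")

    try:
        max_results = int(args.get("max_results", 5))
    except (TypeError, ValueError):
        max_results = 5  # fall back to the documented default
    max_results = max(1, min(50, max_results))  # clamp to the advertised 1-50 range

    # Only the two known parameters are returned; extra keys are dropped.
    return {"query": query.strip(), "max_results": max_results}
```

Inside the loop, this could be used as run_search(**parse_search_args(call.function.arguments)), with a ValueError reported back to the model as the tool result.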

Scoping retrieval to a workspace

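Pass workspace_id (a list of workspace IDs) to limit search to specific workspaces. This is useful in multi-tenant products where each customer’s data lives in a dedicated workspace. The call below uses the answer function from the full example; the tool-based variant does not accept this parameter.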
answer("Summarize the onboarding checklist", workspace_id=[42])
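In a multi-tenant product, the workspace list would typically be derived from the authenticated tenant rather than passed in by the caller. The sketch below illustrates the idea; TENANT_WORKSPACES, workspaces_for, and search_payload are hypothetical names with made-up IDs, and only the payload keys (query, max_results, workspace_id) come from the search function above:

```python
# Hypothetical mapping from tenant name to LightOn workspace IDs.
TENANT_WORKSPACES = {
    "acme": [42],
    "globex": [7, 8],
}


def workspaces_for(tenant: str) -> list:
    """Return the workspace IDs for a tenant, refusing unknown tenants."""
    try:
        return list(TENANT_WORKSPACES[tenant])
    except KeyError:
        # Fail closed: never fall back to an unscoped search.
        raise PermissionError(f"unknown tenant: {tenant!r}") from None


def search_payload(query: str, tenant: str, max_results: int = 5) -> dict:
    """Build a search request body that is always scoped to the tenant."""
    return {
        "query": query,
        "max_results": max_results,
        "workspace_id": workspaces_for(tenant),
    }
```

Raising on an unknown tenant matters because the search function above omits workspace_id when it is empty, which presumably leaves the query unscoped.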

Choosing a model

OVHcloud’s catalog includes several hosted models. Pass the model name to the model parameter:
Model                                  Notes
Meta-Llama-3_3-70B-Instruct            Strong reasoning, good default choice
Llama-3.1-8B-Instruct                  Faster and cheaper, suitable for simpler queries
Mistral-Small-3.2-24B-Instruct-2506    Compact Mistral, low latency
Qwen3-32B                              Strong multilingual reasoning model
Qwen2.5-VL-72B-Instruct                Vision-language model, accepts image input
You can list the models available to your token at any time:
print([m.id for m in ovh.models.list().data])
Check the OVHcloud AI Endpoints documentation for the current model list and regional availability.
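Because the catalog can change, a deployment might check its preferred models against the listed IDs instead of hard-coding a single name. A minimal sketch, with pick_model as a hypothetical helper:

```python
def pick_model(available: list, preferred: list) -> str:
    """Return the first preferred model ID that appears in the available list."""
    available_ids = set(available)
    for name in preferred:
        if name in available_ids:
            return name
    raise LookupError("none of the preferred models are available: " + ", ".join(preferred))


# Typical use with the client from the examples above:
#   available = [m.id for m in ovh.models.list().data]
#   model = pick_model(available, ["Meta-Llama-3_3-70B-Instruct", "Llama-3.1-8B-Instruct"])
```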

Streaming responses

OVHcloud’s endpoint supports streaming. Enable it by passing stream=True and iterating over the response:
stream = ovh.chat.completions.create(
    model="Meta-Llama-3_3-70B-Instruct",
    messages=[...],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
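When the full text is also needed after streaming (for logging or caching, say), the deltas can be accumulated as they arrive. The sketch below wraps the loop above in two hypothetical helpers. It also skips chunks with an empty choices list, which OpenAI's API can send as a final usage-only chunk; whether OVHcloud's endpoint ever does so is an assumption, and the guard is harmless if it does not:

```python
from typing import Iterable, Iterator


def iter_text(stream: Iterable) -> Iterator[str]:
    """Yield the non-empty text deltas from a chat completion stream."""
    for chunk in stream:
        if not chunk.choices:
            continue  # defensive: tolerate chunks that carry no choices
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


def collect_text(stream: Iterable) -> str:
    """Consume the stream and return the concatenated answer."""
    return "".join(iter_text(stream))
```

To print while collecting, iterate over iter_text(stream), print each piece, and append it to a list that is joined at the end.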

A note on reasoning models

Some OVHcloud models (for example Qwen3-32B and Qwen3.6-27B) are reasoning models. When called through the raw HTTP API they may return their chain of thought under a reasoning field and the final answer under content. The openai client used in the examples above surfaces the final answer in choices[0].message.content as usual, so no special handling is needed; read message.reasoning only if you want to inspect the thinking trace.
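If the thinking trace is needed, the reasoning attribute is best read defensively, since it is only present when the server returns that field. A minimal sketch, with split_reasoning as a hypothetical helper and the field name reasoning taken from the paragraph above:

```python
from typing import Optional, Tuple


def split_reasoning(message) -> Tuple[str, Optional[str]]:
    """Return (final_answer, reasoning_trace_or_None) for a chat message."""
    final_answer = message.content or ""
    # getattr avoids an AttributeError for models that return no reasoning field.
    reasoning = getattr(message, "reasoning", None)
    return final_answer, reasoning


# Typical use:
#   text, trace = split_reasoning(completion.choices[0].message)
```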