Reduce Oracle APEX AI Costs with Prompt Cache

Natural Language to IR in Oracle APEX 26.1 is great. It is also expensive.

When I looked at the requests going out for NL2IR, a normal call was around 17k tokens. The important part is that roughly 15k of those tokens are not really changing between calls. So if we can cache that stable part of the prompt, we stop paying full price for it every single time.

That is exactly what this post is about.

We are going to route NL2IR to a dedicated Anthropic service, enable prompt caching on the static part of the prompt, and use a request handler package to move the variable structured-data section out of the system prompt.

One important thing up front: this specific implementation is for Anthropic.

The high-level idea can work with other providers too, but not in exactly the same way. This version depends on Anthropic prompt caching and on being able to shape the request so that Anthropic sees a stable prefix it can reuse.

What we will build:

A dedicated AI service for NL2IR
A static system prompt stored in the service Additional Parameters
Anthropic prompt caching enabled on that static prompt
A request handler package that rewrites the NL2IR request into a cache-friendlier shape

The result is that after the first request pays the initial cost, the vast majority of the prompt gets cached and reused on every subsequent call. In practice, that means every NL2IR call across your reports can benefit from the same cache.

What Anthropic prompt caching actually does

If you have not used Anthropic prompt caching before, the short version is simple: Anthropic can cache an identical prompt prefix and reuse it on follow-up calls.

So instead of sending the same big static prompt over and over again at full price, you pay once to write it into cache, and then the next requests can read that cached prefix much more cheaply.

The important part is this: the cached prefix must stay identical.

If the stable part of the prompt keeps changing, you will keep missing the cache and you will not save much. By default, the cache lives for 5 minutes, which is perfectly fine for a feature like NL2IR where users often try several queries one after another.

There is also an option to extend it to 1 hour, although you pay a little bit more, if you expect your usage patterns to be more sporadic.

If you want the full details, Anthropic has a good prompt caching guide here: https://platform.claude.com/docs/en/build-with-claude/prompt-caching

That is why prompt shape matters here.

Why NL2IR is a good candidate for caching

NL2IR sends a lot of prompt data. And most of that prompt is not the user question.

The user might type something short like:

show me orders from last month

But the request also contains a large system prompt, instructions, metadata, and structured information about the report. That is where the token cost really comes from.

In my tests, the static portion of the instructions made up almost 90% of that massive payload.

That makes NL2IR a very good fit for prompt caching, as long as we can keep the reusable part stable.

The problem with the default prompt shape

The problem is that the APEX-generated system prompt does not contain only the static instructions we want to cache.

It also contains a ## Structured Data section. That section is the variable part we do not want glued into the static prefix.

So if we just cache the raw system prompt as-is, we are not really controlling the prompt shape. What we want instead is:

Keep the static instructions in one place
Move the structured data out of that static prompt
Keep the final request stable enough for Anthropic caching to do useful work

That is what the package below does.

Step 1: Create a dedicated NL2IR service

Just like in my previous NL2IR post, create a dedicated AI service for Natural Language to IR.

I am using Anthropic and routing NL2IR to a cheaper model. The exact model is up to you, but NL2IR is a very good place to use a lower-cost model because the input is huge and the output is usually small.

The important part is the static ID of the service. In my package I use:

haiku-for-ir

You can call it whatever you want, but remember to update the package if you use a different static ID.

Step 2: Put the static prompt into Additional Parameters

This time the dedicated service is not just about model routing. It is also where we keep the static system prompt and enable Anthropic prompt caching.

Wait, where do I get the static prompt?

Since APEX generates the NL2IR prompt under the hood, you need to grab it first. The easiest way is probably to add logging to the package in the next section and log p_result.request.system_prompt, or just check your provider logs. Copy that massive system prompt, delete the ## Structured Data section from it, and use the rest as your static instructions.

In the service Additional Parameters, put that static part of the NL2IR prompt and enable caching.

{
  "system": [
    {
      "type": "text",
      "cache_control": { "type": "ephemeral" },
      "text": "<system prompt>"
    }
  ]
}

The key idea is that this system prompt block should contain only the stable instructions you want Anthropic to cache.

Do not include the variable ## Structured Data section there. The package will take care of that part during request handling.

Step 3: Deploy the request handler package

Now deploy this package:

create or replace package ai_handler_cache as
  procedure ai_request (
    p_param  in            apex_ai.t_chat_request_handler_param,
    p_result in out nocopy apex_ai.t_chat_request_handler_result );
end ai_handler_cache;
/

create or replace package body ai_handler_cache as

  g_target_service_static_id constant varchar2(255) := 'haiku-for-ir';

  function get_service_id(
    p_service_static_id in varchar2
  ) return number
  as
    l_service_id number;
  begin
    select remote_server_id
      into l_service_id
      from apex_workspace_ai_services
     where upper(remote_server_static_id) = upper(p_service_static_id)
     fetch first 1 rows only;

    return l_service_id;
  end get_service_id;

  function is_interactive_report(
    p_region_id in number
  ) return boolean
  as
    l_dummy number;
  begin
    select 1
      into l_dummy
      from apex_application_page_ir
     where region_id = p_region_id
     fetch first 1 rows only;

    return true;
  exception
    when no_data_found then
      return false;
  end is_interactive_report;

  function least_positive(
    p_left  in pls_integer,
    p_right in pls_integer
  ) return pls_integer
  as
  begin
    if nvl(p_left, 0) = 0 then
      return p_right;
    elsif nvl(p_right, 0) = 0 then
      return p_left;
    else
      return least(p_left, p_right);
    end if;
  end least_positive;

  function next_top_level_heading_pos(
    p_source       in clob,
    p_search_start in pls_integer
  ) return pls_integer
  as
    l_lf_heading_pos   pls_integer;
    l_crlf_heading_pos pls_integer;
  begin
    l_lf_heading_pos := dbms_lob.instr(p_source, chr(10) || '## ', p_search_start);
    l_crlf_heading_pos := dbms_lob.instr(p_source, chr(13) || chr(10) || '## ', p_search_start);

    if l_lf_heading_pos > 0 then
      l_lf_heading_pos := l_lf_heading_pos + 1;
    end if;

    if l_crlf_heading_pos > 0 then
      l_crlf_heading_pos := l_crlf_heading_pos + 2;
    end if;

    return least_positive(l_lf_heading_pos, l_crlf_heading_pos);
  end next_top_level_heading_pos;

  function slice_clob(
    p_source in clob,
    p_start  in pls_integer,
    p_length in pls_integer
  ) return clob
  as
    l_result clob;
  begin
    if p_source is null or p_length <= 0 then
      return null;
    end if;

    dbms_lob.createtemporary(l_result, true);
    dbms_lob.copy(
      dest_lob    => l_result,
      src_lob     => p_source,
      amount      => p_length,
      dest_offset => 1,
      src_offset  => p_start );

    return l_result;
  end slice_clob;

  function extract_structured_data_section(
    p_system_prompt in clob
  ) return clob
  as
    c_heading_marker constant varchar2(20) := '## Structured Data';
    l_heading_pos      pls_integer;
    l_next_heading_pos pls_integer;
    l_section_end_pos  pls_integer;
    l_prompt_length    pls_integer;
  begin
    if p_system_prompt is null then
      return null;
    end if;

    l_heading_pos := dbms_lob.instr(p_system_prompt, c_heading_marker);
    if l_heading_pos = 0 then
      return null;
    end if;

    l_prompt_length := dbms_lob.getlength(p_system_prompt);
    l_next_heading_pos := next_top_level_heading_pos(
                            p_source       => p_system_prompt,
                            p_search_start => l_heading_pos + length(c_heading_marker) );

    if l_next_heading_pos > l_heading_pos then
      l_section_end_pos := l_next_heading_pos - 1;
    else
      l_section_end_pos := l_prompt_length;
    end if;

    return slice_clob(
             p_source => p_system_prompt,
             p_start  => l_heading_pos,
             p_length => l_section_end_pos - l_heading_pos + 1 );
  end extract_structured_data_section;

  function has_structured_data_message(
    p_messages in apex_ai.t_chat_messages
  ) return boolean
  as
    l_idx pls_integer;
  begin
    l_idx := p_messages.first;
    while l_idx is not null loop
      if p_messages(l_idx).chat_role = apex_ai.c_role_user then
        return false;
      elsif p_messages(l_idx).chat_role = apex_ai.c_role_assistant
        and p_messages(l_idx).message is not null
        and dbms_lob.instr(p_messages(l_idx).message, '## Structured Data') > 0
      then
        return true;
      end if;

      l_idx := p_messages.next(l_idx);
    end loop;

    return false;
  end has_structured_data_message;

  function assistant_message(
    p_message in clob
  ) return apex_ai.t_chat_message
  as
    l_message apex_ai.t_chat_message;
  begin
    l_message.chat_role := apex_ai.c_role_assistant;
    l_message.message_type := apex_ai.c_chat_message_type_text;
    l_message.message := p_message;
    return l_message;
  end assistant_message;

  procedure insert_assistant_message_before_first_user(
    p_messages in out nocopy apex_ai.t_chat_messages,
    p_message  in            clob
  )
  as
    l_existing_messages apex_ai.t_chat_messages := p_messages;
    l_new_messages      apex_ai.t_chat_messages;
    l_source_idx        pls_integer;
    l_target_idx        pls_integer := 0;
    l_inserted          boolean := false;
  begin
    if p_message is null then
      return;
    end if;

    l_source_idx := l_existing_messages.first;
    while l_source_idx is not null loop
      if not l_inserted and l_existing_messages(l_source_idx).chat_role = apex_ai.c_role_user then
        l_target_idx := l_target_idx + 1;
        l_new_messages(l_target_idx) := assistant_message(p_message);
        l_inserted := true;
      end if;

      l_target_idx := l_target_idx + 1;
      l_new_messages(l_target_idx) := l_existing_messages(l_source_idx);
      l_source_idx := l_existing_messages.next(l_source_idx);
    end loop;

    if not l_inserted then
      l_target_idx := l_target_idx + 1;
      l_new_messages(l_target_idx) := assistant_message(p_message);
    end if;

    p_messages := l_new_messages;
  end insert_assistant_message_before_first_user;

  procedure ai_request (
    p_param  in            apex_ai.t_chat_request_handler_param,
    p_result in out nocopy apex_ai.t_chat_request_handler_result )
  as
    l_structured_data_section clob;
  begin
    if p_param.component.type != 'APEX_APPLICATION_PAGE_REGIONS' then
      return;
    end if;

    if not is_interactive_report(p_param.component.id) then
      return;
    end if;

    l_structured_data_section := extract_structured_data_section(p_result.request.system_prompt);

    if l_structured_data_section is not null
       and not has_structured_data_message(p_result.request.messages)
    then
      insert_assistant_message_before_first_user(
        p_messages => p_result.request.messages,
        p_message  => l_structured_data_section );
    end if;
        -- add logging here to capture the system prompt
        -- example: insert into my_logs (clob_column) values (p_result.request.system_prompt)
    p_result.request.system_prompt := null;
    p_result.request.service_id := get_service_id(g_target_service_static_id);
  end ai_request;

end ai_handler_cache;
/

What the package is doing

The package does not try to reinvent NL2IR. It just changes the request shape before it goes out.

The flow is:

Check that the request is really coming from an Interactive Report.
Look inside the APEX-generated system prompt for the ## Structured Data section.
Extract that section.
Insert it into the message history as an assistant message, before the first user message.
Clear the request system prompt.
Route the request to the dedicated Anthropic service.

That one change is the whole trick.

Instead of sending one big changing system prompt every time, we separate the stable part from the variable part.

The stable instructions now live in the service configuration, where Anthropic can cache them. The structured data is still sent to the model, but it is no longer glued into the same system prompt that we want to keep stable.

Why the structured data is moved into a message

This is the part that matters for caching.

The package extracts only the ## Structured Data section and injects it as an assistant message before the first user question.

That means the model still gets the structured data it needs, but the static system instructions can stay static.

And static prompt prefixes are exactly what Anthropic caching likes.

The package also checks whether that structured-data message is already present, so it does not keep inserting duplicates into the conversation.

Step 4: Configure the request handler in APEX

Once the package is deployed, go to your application:

Edit Application Definition
AI tab
Request Handler Procedure

Enter:

ai_handler_cache.ai_request

That is it.

From that point on, NL2IR requests for Interactive Reports will be rewritten and routed through the dedicated cached service.

Everything else is left alone, because the package exits unless the request is really coming from an Interactive Report.

What changes in practice

From the user point of view, nothing really changes. They still type natural language into the report and get the same feature.

From the request point of view, quite a lot changes.

Before:

Around 17k tokens per request
Big prompt repeated every time
No real benefit from prompt reuse

After:

First request still pays the full setup cost
Roughly 15k tokens can be cached
Those cached tokens are reused on each NL2IR call

That is where the savings come from.

A few important notes

This package is built around the current shape of the APEX NL2IR prompt, so there are a few assumptions worth calling out.

It only targets Interactive Reports
It expects the generated prompt to contain a ## Structured Data heading
It routes requests to a dedicated service with static ID haiku-for-ir
It assumes your static system prompt is already stored in the dedicated service Additional Parameters

So if Oracle changes the internal NL2IR prompt structure in a future release, this package may need a small update.

What about other providers?

This post uses Anthropic for the concrete implementation.

If your provider of choice is OpenAI, caching is fully automatic, so it will probably cache something even without changes. If you want to make it work better, you would need to modify my package so it removes the Structured Data section from the system prompt, but does not null out the system_prompt attribute. Instead, you would populate it with the rest of the system prompt. Also, because the system prompt is part of the message array rather than a separate attribute in OpenAI, you would want to modify how the package injects the structured data into the message array.

With Google, caching is also automatic, and the same general approach as OpenAI could work. If it does not, Google also has a separate endpoint where you define the cache, populate it with the content you want cached, and then refresh the TTL every hour or so. That could become a separate job to schedule and maintain.

I did not test those approaches, but I would be very happy to consult on them if someone dares to do it.

A few closing thoughts on the APEX AI hooks

I actually like that Oracle gave us request and response handler injection points this early. Without them, this whole approach would not really be possible.

That said, I do think the current request-building model works against optimization.

Right now Oracle builds the request such that messages land in the middle of it, between the system prompt, tools, and anything that we add in service configuration. For cost optimization, caching, and even debugging, that is a bit backwards.

What I would really like is a lower-level injection point just before the request is sent to the LLM service, after Oracle has finished building the final payload. That would make it much easier to do provider-specific optimizations cleanly instead of reverse-engineering the generated prompt shape.

I would also like better visibility into the actual provider request and response payloads. Right now that is one of the frustrating parts of working with caching. Anthropic exposes useful usage information, including cache-related fields, but Oracle does not really expose that final response data back to us in a way that lets us inspect cache hits properly.

So yes, the hooks Oracle gave us are already useful. But if Oracle exposed the final outbound payload and the real provider usage details, building and validating optimizations like this would be much easier.

If this were a production project, I would probably suggest not relying too heavily on the standard Oracle AI services directly. In that kind of setup, it would probably be better to configure one standard Ollama or OpenAI-compatible service in APEX, put a small proxy in front of the real provider, and let that proxy intercept the calls, optimize them properly, and then pass them on to the actual LLM service or at least use a service like OpenRouter.

And that is it

The interesting part here is that the big saving does not come from only switching to a cheaper model.

It comes from changing the request shape so the expensive, stable part of the prompt can actually be cached.

Instead of paying for a massive system prompt on every single turn, you turn it into a flow where the heavy lifting is cached and much cheaper on follow-up calls.

That is a much better place to be.

If you want to build on top of this, the obvious next steps are moving the service static ID into configuration, adding more defensive error handling around service lookup, and hoping Oracle eventually exposes enough provider-level usage data to make cache-hit validation much easier.

Reducing Oracle Apex AI Costs

What Anthropic prompt caching actually does

Why NL2IR is a good candidate for caching

The problem with the default prompt shape

Step 1: Create a dedicated NL2IR service

Step 2: Put the static prompt into Additional Parameters

Step 3: Deploy the request handler package

What the package is doing

Why the structured data is moved into a message

Step 4: Configure the request handler in APEX

What changes in practice

A few important notes

What about other providers?

A few closing thoughts on the APEX AI hooks

And that is it

Comments

More from this blog

Apex 26.1 Nice token limit error

Custom LLM Service for Natural Language to IR

Apex 26.1 Per User Token Management

SQLCL liquibase tags explained #JoelKallmanDay

Command Palette

What Anthropic prompt caching actually does

Why NL2IR is a good candidate for caching

The problem with the default prompt shape

Step 1: Create a dedicated NL2IR service

Step 2: Put the static prompt into Additional Parameters

Step 3: Deploy the request handler package

What the package is doing

Why the structured data is moved into a message

Step 4: Configure the request handler in APEX

What changes in practice

A few important notes

What about other providers?

A few closing thoughts on the APEX AI hooks

And that is it

Comments

More from this blog