Reducing Oracle Apex AI Costs

Oracle APEX & PL/SQL Developer with 10 years of experience. Working mainly with manufacturing industry. My work is my passion. Avid gamer.
Natural Language to IR in Oracle APEX 26.1 is great. It is also expensive.
When I looked at the requests going out for NL2IR, a normal call was around 17k tokens. The important part is that roughly 15k of those tokens are not really changing between calls. So if we can cache that stable part of the prompt, we stop paying full price for it every single time.
That is exactly what this post is about.
We are going to route NL2IR to a dedicated Anthropic service, enable prompt caching on the static part of the prompt, and use a request handler package to move the variable structured-data section out of the system prompt.
One important thing up front: this specific implementation is for Anthropic.
The high-level idea can work with other providers too, but not in exactly the same way. This version depends on Anthropic prompt caching and on being able to shape the request so that Anthropic sees a stable prefix it can reuse.
What we will build:
A dedicated AI service for NL2IR
A static system prompt stored in the service Additional Parameters
Anthropic prompt caching enabled on that static prompt
A request handler package that rewrites the NL2IR request into a cache-friendlier shape
The result is that after the first request pays the initial cost, the vast majority of the prompt gets cached and reused on every subsequent call. In practice, that means every NL2IR call across your reports can benefit from the same cache.
What Anthropic prompt caching actually does
If you have not used Anthropic prompt caching before, the short version is simple: Anthropic can cache an identical prompt prefix and reuse it on follow-up calls.
So instead of sending the same big static prompt over and over again at full price, you pay once to write it into cache, and then the next requests can read that cached prefix much more cheaply.
The important part is this: the cached prefix must stay identical.
If the stable part of the prompt keeps changing, you will keep missing the cache and you will not save much. By default, the cache lives for 5 minutes, which is perfectly fine for a feature like NL2IR where users often try several queries one after another.
There is also an option to extend it to 1 hour, although you pay a little bit more, if you expect your usage patterns to be more sporadic.
If you want the full details, Anthropic has a good prompt caching guide here: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
That is why prompt shape matters here.
Why NL2IR is a good candidate for caching
NL2IR sends a lot of prompt data. And most of that prompt is not the user question.
The user might type something short like:
show me orders from last month
But the request also contains a large system prompt, instructions, metadata, and structured information about the report. That is where the token cost really comes from.
In my tests, the static portion of the instructions made up almost 90% of that massive payload.
That makes NL2IR a very good fit for prompt caching, as long as we can keep the reusable part stable.
The problem with the default prompt shape
The problem is that the APEX-generated system prompt does not contain only the static instructions we want to cache.
It also contains a ## Structured Data section. That section is the variable part we do not want glued into the static prefix.
So if we just cache the raw system prompt as-is, we are not really controlling the prompt shape. What we want instead is:
Keep the static instructions in one place
Move the structured data out of that static prompt
Keep the final request stable enough for Anthropic caching to do useful work
That is what the package below does.
Step 1: Create a dedicated NL2IR service
Just like in my previous NL2IR post, create a dedicated AI service for Natural Language to IR.
I am using Anthropic and routing NL2IR to a cheaper model. The exact model is up to you, but NL2IR is a very good place to use a lower-cost model because the input is huge and the output is usually small.
The important part is the static ID of the service. In my package I use:
haiku-for-ir
You can call it whatever you want, but remember to update the package if you use a different static ID.
Step 2: Put the static prompt into Additional Parameters
This time the dedicated service is not just about model routing. It is also where we keep the static system prompt and enable Anthropic prompt caching.
Wait, where do I get the static prompt?
Since APEX generates the NL2IR prompt under the hood, you need to grab it first. The easiest way is probably to add logging to the package in the next section and log p_result.request.system_prompt, or just check your provider logs. Copy that massive system prompt, delete the ## Structured Data section from it, and use the rest as your static instructions.
In the service Additional Parameters, put that static part of the NL2IR prompt and enable caching.
{
"system": [
{
"type": "text",
"cache_control": { "type": "ephemeral" },
"text": "<system prompt>"
}
]
}
The key idea is that this system prompt block should contain only the stable instructions you want Anthropic to cache.
Do not include the variable ## Structured Data section there. The package will take care of that part during request handling.
Step 3: Deploy the request handler package
Now deploy this package:
create or replace package ai_handler_cache as
procedure ai_request (
p_param in apex_ai.t_chat_request_handler_param,
p_result in out nocopy apex_ai.t_chat_request_handler_result );
end ai_handler_cache;
/
create or replace package body ai_handler_cache as
g_target_service_static_id constant varchar2(255) := 'haiku-for-ir';
function get_service_id(
p_service_static_id in varchar2
) return number
as
l_service_id number;
begin
select remote_server_id
into l_service_id
from apex_workspace_ai_services
where upper(remote_server_static_id) = upper(p_service_static_id)
fetch first 1 rows only;
return l_service_id;
end get_service_id;
function is_interactive_report(
p_region_id in number
) return boolean
as
l_dummy number;
begin
select 1
into l_dummy
from apex_application_page_ir
where region_id = p_region_id
fetch first 1 rows only;
return true;
exception
when no_data_found then
return false;
end is_interactive_report;
function least_positive(
p_left in pls_integer,
p_right in pls_integer
) return pls_integer
as
begin
if nvl(p_left, 0) = 0 then
return p_right;
elsif nvl(p_right, 0) = 0 then
return p_left;
else
return least(p_left, p_right);
end if;
end least_positive;
function next_top_level_heading_pos(
p_source in clob,
p_search_start in pls_integer
) return pls_integer
as
l_lf_heading_pos pls_integer;
l_crlf_heading_pos pls_integer;
begin
l_lf_heading_pos := dbms_lob.instr(p_source, chr(10) || '## ', p_search_start);
l_crlf_heading_pos := dbms_lob.instr(p_source, chr(13) || chr(10) || '## ', p_search_start);
if l_lf_heading_pos > 0 then
l_lf_heading_pos := l_lf_heading_pos + 1;
end if;
if l_crlf_heading_pos > 0 then
l_crlf_heading_pos := l_crlf_heading_pos + 2;
end if;
return least_positive(l_lf_heading_pos, l_crlf_heading_pos);
end next_top_level_heading_pos;
function slice_clob(
p_source in clob,
p_start in pls_integer,
p_length in pls_integer
) return clob
as
l_result clob;
begin
if p_source is null or p_length <= 0 then
return null;
end if;
dbms_lob.createtemporary(l_result, true);
dbms_lob.copy(
dest_lob => l_result,
src_lob => p_source,
amount => p_length,
dest_offset => 1,
src_offset => p_start );
return l_result;
end slice_clob;
function extract_structured_data_section(
p_system_prompt in clob
) return clob
as
c_heading_marker constant varchar2(20) := '## Structured Data';
l_heading_pos pls_integer;
l_next_heading_pos pls_integer;
l_section_end_pos pls_integer;
l_prompt_length pls_integer;
begin
if p_system_prompt is null then
return null;
end if;
l_heading_pos := dbms_lob.instr(p_system_prompt, c_heading_marker);
if l_heading_pos = 0 then
return null;
end if;
l_prompt_length := dbms_lob.getlength(p_system_prompt);
l_next_heading_pos := next_top_level_heading_pos(
p_source => p_system_prompt,
p_search_start => l_heading_pos + length(c_heading_marker) );
if l_next_heading_pos > l_heading_pos then
l_section_end_pos := l_next_heading_pos - 1;
else
l_section_end_pos := l_prompt_length;
end if;
return slice_clob(
p_source => p_system_prompt,
p_start => l_heading_pos,
p_length => l_section_end_pos - l_heading_pos + 1 );
end extract_structured_data_section;
function has_structured_data_message(
p_messages in apex_ai.t_chat_messages
) return boolean
as
l_idx pls_integer;
begin
l_idx := p_messages.first;
while l_idx is not null loop
if p_messages(l_idx).chat_role = apex_ai.c_role_user then
return false;
elsif p_messages(l_idx).chat_role = apex_ai.c_role_assistant
and p_messages(l_idx).message is not null
and dbms_lob.instr(p_messages(l_idx).message, '## Structured Data') > 0
then
return true;
end if;
l_idx := p_messages.next(l_idx);
end loop;
return false;
end has_structured_data_message;
function assistant_message(
p_message in clob
) return apex_ai.t_chat_message
as
l_message apex_ai.t_chat_message;
begin
l_message.chat_role := apex_ai.c_role_assistant;
l_message.message_type := apex_ai.c_chat_message_type_text;
l_message.message := p_message;
return l_message;
end assistant_message;
procedure insert_assistant_message_before_first_user(
p_messages in out nocopy apex_ai.t_chat_messages,
p_message in clob
)
as
l_existing_messages apex_ai.t_chat_messages := p_messages;
l_new_messages apex_ai.t_chat_messages;
l_source_idx pls_integer;
l_target_idx pls_integer := 0;
l_inserted boolean := false;
begin
if p_message is null then
return;
end if;
l_source_idx := l_existing_messages.first;
while l_source_idx is not null loop
if not l_inserted and l_existing_messages(l_source_idx).chat_role = apex_ai.c_role_user then
l_target_idx := l_target_idx + 1;
l_new_messages(l_target_idx) := assistant_message(p_message);
l_inserted := true;
end if;
l_target_idx := l_target_idx + 1;
l_new_messages(l_target_idx) := l_existing_messages(l_source_idx);
l_source_idx := l_existing_messages.next(l_source_idx);
end loop;
if not l_inserted then
l_target_idx := l_target_idx + 1;
l_new_messages(l_target_idx) := assistant_message(p_message);
end if;
p_messages := l_new_messages;
end insert_assistant_message_before_first_user;
procedure ai_request (
p_param in apex_ai.t_chat_request_handler_param,
p_result in out nocopy apex_ai.t_chat_request_handler_result )
as
l_structured_data_section clob;
begin
if p_param.component.type != 'APEX_APPLICATION_PAGE_REGIONS' then
return;
end if;
if not is_interactive_report(p_param.component.id) then
return;
end if;
l_structured_data_section := extract_structured_data_section(p_result.request.system_prompt);
if l_structured_data_section is not null
and not has_structured_data_message(p_result.request.messages)
then
insert_assistant_message_before_first_user(
p_messages => p_result.request.messages,
p_message => l_structured_data_section );
end if;
-- add logging here to capture the system prompt
-- example: insert into my_logs (clob_column) values (p_result.request.system_prompt)
p_result.request.system_prompt := null;
p_result.request.service_id := get_service_id(g_target_service_static_id);
end ai_request;
end ai_handler_cache;
/
What the package is doing
The package does not try to reinvent NL2IR. It just changes the request shape before it goes out.
The flow is:
Check that the request is really coming from an Interactive Report.
Look inside the APEX-generated system prompt for the
## Structured Datasection.Extract that section.
Insert it into the message history as an assistant message, before the first user message.
Clear the request system prompt.
Route the request to the dedicated Anthropic service.
That one change is the whole trick.
Instead of sending one big changing system prompt every time, we separate the stable part from the variable part.
The stable instructions now live in the service configuration, where Anthropic can cache them. The structured data is still sent to the model, but it is no longer glued into the same system prompt that we want to keep stable.
Why the structured data is moved into a message
This is the part that matters for caching.
The package extracts only the ## Structured Data section and injects it as an assistant message before the first user question.
That means the model still gets the structured data it needs, but the static system instructions can stay static.
And static prompt prefixes are exactly what Anthropic caching likes.
The package also checks whether that structured-data message is already present, so it does not keep inserting duplicates into the conversation.
Step 4: Configure the request handler in APEX
Once the package is deployed, go to your application:
Edit Application Definition
AI tab
Request Handler Procedure
Enter:
ai_handler_cache.ai_request
That is it.
From that point on, NL2IR requests for Interactive Reports will be rewritten and routed through the dedicated cached service.
Everything else is left alone, because the package exits unless the request is really coming from an Interactive Report.
What changes in practice
From the user point of view, nothing really changes. They still type natural language into the report and get the same feature.
From the request point of view, quite a lot changes.
Before:
Around 17k tokens per request
Big prompt repeated every time
No real benefit from prompt reuse
After:
First request still pays the full setup cost
Roughly 15k tokens can be cached
Those cached tokens are reused on each NL2IR call
That is where the savings come from.
A few important notes
This package is built around the current shape of the APEX NL2IR prompt, so there are a few assumptions worth calling out.
It only targets Interactive Reports
It expects the generated prompt to contain a
## Structured DataheadingIt routes requests to a dedicated service with static ID
haiku-for-irIt assumes your static system prompt is already stored in the dedicated service Additional Parameters
So if Oracle changes the internal NL2IR prompt structure in a future release, this package may need a small update.
What about other providers?
This post uses Anthropic for the concrete implementation.
If your provider of choice is OpenAI, caching is fully automatic, so it will probably cache something even without changes. If you want to make it work better, you would need to modify my package so it removes the Structured Data section from the system prompt, but does not null out the system_prompt attribute. Instead, you would populate it with the rest of the system prompt. Also, because the system prompt is part of the message array rather than a separate attribute in OpenAI, you would want to modify how the package injects the structured data into the message array.
With Google, caching is also automatic, and the same general approach as OpenAI could work. If it does not, Google also has a separate endpoint where you define the cache, populate it with the content you want cached, and then refresh the TTL every hour or so. That could become a separate job to schedule and maintain.
I did not test those approaches, but I would be very happy to consult on them if someone dares to do it.
A few closing thoughts on the APEX AI hooks
I actually like that Oracle gave us request and response handler injection points this early. Without them, this whole approach would not really be possible.
That said, I do think the current request-building model works against optimization.
Right now Oracle builds the request such that messages land in the middle of it, between the system prompt, tools, and anything that we add in service configuration. For cost optimization, caching, and even debugging, that is a bit backwards.
What I would really like is a lower-level injection point just before the request is sent to the LLM service, after Oracle has finished building the final payload. That would make it much easier to do provider-specific optimizations cleanly instead of reverse-engineering the generated prompt shape.
I would also like better visibility into the actual provider request and response payloads. Right now that is one of the frustrating parts of working with caching. Anthropic exposes useful usage information, including cache-related fields, but Oracle does not really expose that final response data back to us in a way that lets us inspect cache hits properly.
So yes, the hooks Oracle gave us are already useful. But if Oracle exposed the final outbound payload and the real provider usage details, building and validating optimizations like this would be much easier.
If this were a production project, I would probably suggest not relying too heavily on the standard Oracle AI services directly. In that kind of setup, it would probably be better to configure one standard Ollama or OpenAI-compatible service in APEX, put a small proxy in front of the real provider, and let that proxy intercept the calls, optimize them properly, and then pass them on to the actual LLM service or at least use a service like OpenRouter.
And that is it
The interesting part here is that the big saving does not come from only switching to a cheaper model.
It comes from changing the request shape so the expensive, stable part of the prompt can actually be cached.
Instead of paying for a massive system prompt on every single turn, you turn it into a flow where the heavy lifting is cached and much cheaper on follow-up calls.
That is a much better place to be.
If you want to build on top of this, the obvious next steps are moving the service static ID into configuration, adding more defensive error handling around service lookup, and hoping Oracle eventually exposes enough provider-level usage data to make cache-hit validation much easier.


