Can I stream outputs from AIP LLMs without using Agent Studio?

I am well aware of the sessions API for streaming responses from AIP Agent Studio agents, and I do not want to use it. Agents built in Agent Studio proxy the model responses, almost always summarize or alter them, and are designed around tool calling, which I do not need. Please do not tell me to use AIP Agent Studio. What I want is the ability to stream outputs from models in the model catalog, via a TypeScript function, since that is still the only way to use them (why, just why?). Is this possible? And when will we be able to use open-source SDKs against the inference endpoints in AIP the way we can when working directly with the labs?

LLM proxies support streaming outputs, but it is unclear which models are supported. Below is a working example. I have only been able to get this to work with GPT-4 series models.

import {
  SupportedFoundryClients,
  type OpenAIService,
} from '@codestrap/developer-foundations-types';
import OpenAI from 'openai';
import { foundryClientFactory } from '../factory/foundryClientFactory';
import type { ChatCompletionCreateParamsStreaming } from 'openai/resources/chat';
import type { RequestOptions } from 'openai/core';
import type { ResponseCreateParamsStreaming } from 'openai/resources/responses/responses';

// Add type definitions for the OpenAI response here, or in a separate file and import them, to ensure type safety when working with the API response data.
export function makeOpenAIService(): OpenAIService {
  const { getToken, url, ontologyRid } = foundryClientFactory(
    process.env.FOUNDRY_CLIENT_TYPE || SupportedFoundryClients.PRIVATE,
    undefined,
  );

  return {
    // TODO code out all methods using OSDK API calls
    completions: async (
      body: ChatCompletionCreateParamsStreaming,
      options?: RequestOptions,
    ) => {
      // NOTE: token is intentionally unused here; OSDK-issued tokens do not
      // currently work against the proxy (separate issue filed), so we fall
      // back to the FOUNDRY_TOKEN env var below.
      const token = await getToken();
      const client = new OpenAI({
        baseURL: `${url}/api/v2/llm/proxy/openai/v1`,
        apiKey: process.env.FOUNDRY_TOKEN,
      });

      const stream = await client.chat.completions.create(body, options);

      let text = '';
      for await (const chatCompletionChunk of stream) {
        text += chatCompletionChunk.choices[0]?.delta?.content || '';
      }
      return text;
    },
    responses: async (
      body: ResponseCreateParamsStreaming,
      options?: RequestOptions,
    ) => {
      const token = await getToken();
      const client = new OpenAI({
        baseURL: `${url}/api/v2/llm/proxy/openai/v1`,
        apiKey: process.env.FOUNDRY_TOKEN,
      });

      // Responses API streaming emits semantic events (delta, completed, error, etc.)
      const stream = await client.responses.create(
        { ...body, stream: true },
        options,
      );

      let text = '';

      for await (const event of stream) {
        if (event.type === 'error') {
          throw new Error(`OpenAI API error: ${event.code} - ${event.message}`);
        }

        if (event.type === 'response.output_text.delta') {
          text += event.delta ?? '';
        }
      }

      return text;
    },
  };
}
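For anyone adapting this: the accumulation loop can be pulled out into a small pure helper, which also makes it easy to unit test against mock chunks without hitting the proxy (the ChunkLike shape below is a simplified stand-in for the SDK's ChatCompletionChunk type, not the real one):

```typescript
// Minimal shape of the chunks we consume; the real OpenAI SDK type
// (ChatCompletionChunk) is richer, but only this subset is needed here.
interface ChunkLike {
  choices: Array<{ delta?: { content?: string | null } }>;
}

// Accumulate the text deltas from an async stream of chat completion chunks.
async function collectChatText(
  stream: AsyncIterable<ChunkLike>,
): Promise<string> {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content ?? '';
  }
  return text;
}

// Mock stream for demonstration / testing.
async function* mockChunks(): AsyncIterable<ChunkLike> {
  yield { choices: [{ delta: { content: 'Hello, ' } }] };
  yield { choices: [{ delta: { content: 'world' } }] };
  yield { choices: [{ delta: {} }] }; // e.g. a final chunk carrying finish_reason
}

collectChatText(mockChunks()).then((t) => console.log(t)); // prints "Hello, world"
```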
 

Foundry Functions do not support streaming. The docs mention this is being worked on (as of June 2025). We’ve had similar issues and had to resort to externally hosted logic to support our workflows. Very annoying, since the Language Model API does have methods for it, which are used by (as you mentioned) Agent Studio.
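For what it's worth, the externally hosted workaround boils down to relaying the proxy's chunks to the browser as server-sent events. A rough sketch (the route, helper, and client wiring here are illustrative, not our actual code):

```typescript
// Format a text delta as a server-sent-events frame.
function toSSE(delta: string): string {
  return `data: ${JSON.stringify({ delta })}\n\n`;
}

// Sketch of an HTTP handler that relays an OpenAI-SDK stream as SSE.
// `client`, `messages`, and the route are assumptions; the client setup
// mirrors the service code above.
/*
app.get('/stream', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  const stream = await client.chat.completions.create({
    model: 'gpt-4.1-mini',
    messages,
    stream: true,
  });
  for await (const chunk of stream) {
    res.write(toSSE(chunk.choices[0]?.delta?.content ?? ''));
  }
  res.end();
});
*/

console.log(toSSE('hi')); // prints the frame: data: {"delta":"hi"}
```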


@CodeStrap All models that are supported in other parts of the platform should be supported through the LLM proxies. Some of the newer OpenAI models (such as GPT-5.2-Codex) are only available through the Responses API instead of Chat Completions, but availability should match what you would see using OpenAI directly.

What type of failures were you hitting when using the newer models?

Here is a list of what I tried with the status code returned:

gpt-5-mini - 500
gpt-5-mini-2025-08-07 - 500
gpt-5-nano - 500
gpt-5.2 - for completions I get 500, responses works as you said
gpt-4.1 - 200
gpt-4.1-nano - 200
gpt-4.1-mini - 200

Here is the page I am pulling model IDs from: https://developers.openai.com/api/docs/models

Do you mind sharing an example request body that is failing?

I get successful results when calling:

curl --request POST \
  --url https://{BASE_URL}/api/v2/llm/proxy/openai/v1/responses \
  --header "Authorization: Bearer $TOKEN" \
  --header 'Content-Type: application/json' \
  --data '{
	"model": "gpt-5-mini",
	"input": [
		{
			"type": "message",
			"role": "user",
			"content": [
				{
					"type": "input_text",
					"text": "Tell me a 10 word story"
				}
			]
		}
	],
	"max_output_tokens": 500,
	"stream": false
}'

for example.
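If it helps to reproduce this from TypeScript rather than curl, here is the same request body built as an object (BASE_URL and TOKEN are placeholders, matching the curl example):

```typescript
// Build the Responses API request body used in the curl example above.
function buildResponsesBody(prompt: string) {
  return {
    model: 'gpt-5-mini',
    input: [
      {
        type: 'message',
        role: 'user',
        content: [{ type: 'input_text', text: prompt }],
      },
    ],
    max_output_tokens: 500,
    stream: false,
  };
}

// Sending it with fetch (BASE_URL and TOKEN are placeholders):
/*
const res = await fetch(
  `https://${BASE_URL}/api/v2/llm/proxy/openai/v1/responses`,
  {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(buildResponsesBody('Tell me a 10 word story')),
  },
);
*/

console.log(buildResponsesBody('Tell me a 10 word story').model); // prints "gpt-5-mini"
```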

gpt-5-mini-2025-08-07 is not a supported ID. We have a change rolling out now that will expose the usable IDs for a model within Model Catalog - completely understand that, without this, it's quite hard to know.

I am not doing a raw fetch. I am using the OpenAI SDK for Node:

import {
  SupportedFoundryClients,
  type OpenAIService,
} from '@codestrap/developer-foundations-types';
import OpenAI from 'openai';
import { foundryClientFactory } from '../factory/foundryClientFactory';
import type { ChatCompletionCreateParamsStreaming } from 'openai/resources/chat';
import type { RequestOptions } from 'openai/core';
import type { ResponseCreateParamsStreaming } from 'openai/resources/responses/responses';

// Palantir has not documented which models are supported by this proxy.
// I opened an issue: https://community.palantir.com/t/what-models-are-supported-with-llm-proxies/6065
// I have only tested with gpt-4.1-mini; gpt-4.1 may also be supported. GPT-5 series models are not.
export function makeOpenAIService(): OpenAIService {
  const { getToken, url, ontologyRid } = foundryClientFactory(
    process.env.FOUNDRY_CLIENT_TYPE || SupportedFoundryClients.PRIVATE,
    undefined,
  );

  return {
    // TODO code out all methods using OSDK API calls
    completions: async (
      body: ChatCompletionCreateParamsStreaming,
      options?: RequestOptions,
    ) => {
      const token = await getToken();
      const client = new OpenAI({
        baseURL: `${url}/api/v2/llm/proxy/openai/v1`,
        apiKey: process.env.FOUNDRY_TOKEN,
      });

      const stream = await client.chat.completions.create(body, options);

      return stream;
    },
    responses: async (
      body: ResponseCreateParamsStreaming,
      options?: RequestOptions,
    ) => {
      const token = await getToken();
      const client = new OpenAI({
        baseURL: `${url}/api/v2/llm/proxy/openai/v1`,
        apiKey: process.env.FOUNDRY_TOKEN,
      });

      // Responses API streaming emits semantic events (delta, completed, error, etc.)
      const stream = await client.responses.create(
        { ...body, stream: true },
        options,
      );

      return stream;
    },
  };
}
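Since this version hands back the raw stream, the caller has to iterate it. Here is a sketch of a consumer for the Responses API events, testable against mocks (EventLike below is a simplified stand-in for the SDK's real ResponseStreamEvent union):

```typescript
// Simplified shape of Responses API streaming events; the real SDK union
// carries more fields, but these are the ones this consumer touches.
interface EventLike {
  type: string;
  delta?: string;
  code?: string;
  message?: string;
}

// Collect output text from a Responses API event stream, surfacing errors.
async function collectResponseText(
  stream: AsyncIterable<EventLike>,
): Promise<string> {
  let text = '';
  for await (const event of stream) {
    if (event.type === 'error') {
      throw new Error(`OpenAI API error: ${event.code} - ${event.message}`);
    }
    if (event.type === 'response.output_text.delta') {
      text += event.delta ?? '';
    }
  }
  return text;
}

// Mock event stream for demonstration / testing.
async function* mockEvents(): AsyncIterable<EventLike> {
  yield { type: 'response.created' };
  yield { type: 'response.output_text.delta', delta: 'ten ' };
  yield { type: 'response.output_text.delta', delta: 'words' };
  yield { type: 'response.completed' };
}

collectResponseText(mockEvents()).then((t) => console.log(t)); // prints "ten words"
```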

Note: the foundryClientFactory is our own abstraction over your OSDK client. The token it returns is not used, though, because of another issue I filed related to OSDK tokens not working.