AIP Agent Streaming API Not Sending Progressive Chunks in React Native

The /api/v2/aipAgents/.../streamingContinue endpoint is sending complete responses as a single chunk instead of streaming progressively as documented. This occurs consistently across different response lengths when using XMLHttpRequest in React Native.

Environment

  • Platform: React Native (Expo SDK 54)
  • Client Library: Native XMLHttpRequest (no SDK wrapper)
  • API Version: v2 (preview=true)
  • Agent Version: 12.0
  • Endpoint: POST /api/v2/aipAgents/agents/{agentRid}/sessions/{sessionRid}/streamingContinue?preview=true

Expected Behavior

According to the API documentation:

"Returns a stream of the Agent response text (formatted using markdown) for clients to consume as the response is generated."

Expected: Multiple progressive chunks arriving as the AI generates the response.

Actual Behavior

The entire response arrives as a single chunk after the complete AI generation is finished, regardless of response length.

Test Results

Test 1: Short Response (216 characters)

Request sent: 2025-11-23T16:26:13.096Z
Chunk 1 received: 16144ms later (216 chars)
Total chunks: 1
Response: "✅ Done! Your 'Doc appointment' is scheduled..."

Test 2: Longer Response (981 characters)

Request sent: 2025-11-23T16:33:04.879Z
Chunk 1 received: 4993ms later (981 chars)
Total chunks: 1
Response: "The sky appears blue because of Rayleigh scattering..."

Implementation Details

Request Configuration

// Using native XMLHttpRequest for progressive chunk reception
const xhr = new XMLHttpRequest();

xhr.open('POST', url, true);
xhr.setRequestHeader('Content-Type', 'application/json');
xhr.setRequestHeader('Authorization', `Bearer ${AUTH_TOKEN}`);
xhr.setRequestHeader('Accept', 'text/plain, text/event-stream, */*');
xhr.timeout = 60000;

// Track progress to receive chunks as they arrive
let lastProcessedIndex = 0;

xhr.onprogress = (event) => {
  const responseText = xhr.responseText;
  
  // Process only new content since last progress event
  if (responseText.length > lastProcessedIndex) {
    const newContent = responseText.substring(lastProcessedIndex);
    lastProcessedIndex = responseText.length;
    
    console.log(`Chunk received: ${newContent.length} chars`);
    onChunk(newContent); // Callback to update UI
  }
};

xhr.send(JSON.stringify(requestBody));
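The delta-extraction logic inside `onprogress` can be factored into a pure helper, which makes the "client side is correct" claim easy to unit test in isolation. A sketch (the function name is ours, not part of any SDK):

```typescript
// Pure helper mirroring the onprogress delta logic: given the accumulated
// responseText and the index we last processed up to, return only the new
// content plus the updated index.
function extractNewContent(
  responseText: string,
  lastProcessedIndex: number
): { newContent: string; nextIndex: number } {
  if (responseText.length <= lastProcessedIndex) {
    // No new data since the previous progress event.
    return { newContent: "", nextIndex: lastProcessedIndex };
  }
  return {
    newContent: responseText.substring(lastProcessedIndex),
    nextIndex: responseText.length,
  };
}
```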

Request Body

{
  "messageId": "uuid-v4-generated",
  "userInput": {
    "text": "[User Context]\nUser ID: <anonymized>\nTimezone: America/Chicago\n\n[User Message]\n<user's message>"
  }
}

Response Headers (Anonymized)

{
  "content-type": "application/json",
  "server": "envoy",
  "server-timing": "server;dur=1412.117",
  "x-envoy-upstream-service-time": "1412"
}

Key Observations

  1. Single Chunk Delivery: Both short (216 chars) and long (981 chars) responses arrive as exactly 1 chunk
  2. Complete Buffering: The entire response is buffered server-side before transmission
  3. Timing Pattern: The single chunk arrives only after generation finishes, so time-to-first-byte equals the full generation time
  4. XMLHttpRequest Working: onprogress fires correctly when data arrives, confirming client-side implementation is correct

Code Extract

Full Streaming Function

export async function streamContinueSession(
  agentRid: string,
  sessionRid: string,
  userMessage: string,
  userId: string,
  timezone: string,
  onChunk: (text: string) => void,
  onComplete: () => void,
  onError: (error: string) => void
): Promise<void> {
  const url = `${API_BASE_URL}/api/v2/aipAgents/agents/${agentRid}/sessions/${sessionRid}/streamingContinue?preview=true`;
  
  const requestBody = {
    messageId: generateUUID(),
    userInput: {
      text: `[User Context]\nUser ID: ${userId}\nTimezone: ${timezone}\n\n[User Message]\n${userMessage}`
    }
  };
  
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    let lastProcessedIndex = 0;
    let chunkCount = 0;
    
    xhr.onprogress = (event) => {
      const responseText = xhr.responseText;
      
      if (responseText.length > lastProcessedIndex) {
        const newContent = responseText.substring(lastProcessedIndex);
        lastProcessedIndex = responseText.length;
        chunkCount++;
        
        console.log(`Chunk ${chunkCount}: ${newContent.length} chars`);
        
        if (newContent.trim()) {
          onChunk(newContent);
        }
      }
    };
    
    xhr.onload = () => {
      if (xhr.status >= 200 && xhr.status < 300) {
        console.log(`Total chunks received: ${chunkCount}`);
        onComplete();
        resolve();
      } else {
        onError(`Error: ${xhr.status}`);
        reject(new Error(`HTTP ${xhr.status}`));
      }
    };
    
    xhr.onerror = () => {
      onError('Network error');
      reject(new Error('Network error'));
    };
    
    xhr.open('POST', url, true);
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.setRequestHeader('Authorization', `Bearer ${AUTH_TOKEN}`);
    xhr.setRequestHeader('Accept', 'text/plain, text/event-stream, */*');
    xhr.timeout = 60000;
    
    xhr.send(JSON.stringify(requestBody));
  });
}
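The `generateUUID` helper referenced above is not shown; a minimal RFC 4122 version 4 sketch (a fallback for runtimes without `crypto.randomUUID`):

```typescript
// Minimal UUID v4 generator: random hex digits, with the version nibble
// fixed to 4 and the variant nibble constrained to 8, 9, a, or b.
function generateUUID(): string {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, (c) => {
    const r = (Math.random() * 16) | 0;
    const v = c === "x" ? r : (r & 0x3) | 0x8;
    return v.toString(16);
  });
}
```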

Questions

  1. Is progressive streaming supported? Does the API actually stream chunks as they're generated, or does it buffer the complete response?

  2. Response format: Should we expect text/plain, text/event-stream, or another content type for true streaming?

  3. Configuration needed? Are there specific request headers, parameters, or agent configurations required to enable progressive streaming?

  4. Alternative endpoints? Is there a different endpoint that provides true progressive streaming?

Reproduction Steps

  1. Create an AIP Agent in Agent Studio
  2. Create a session using POST /api/v2/aipAgents/agents/{agentRid}/sessions
  3. Send a message using the code above to streamingContinue endpoint
  4. Monitor xhr.onprogress events
  5. Observe that only 1 chunk is received containing the complete response

Impact

While the current behavior is functional, it prevents us from providing real-time feedback to users as the AI generates responses. Users experience a loading state for the full generation time (5-16 seconds) before seeing any text, rather than seeing progressive text generation.

Request

Could you please clarify:

  • Whether progressive streaming is supported in the current API version
  • If there are specific configurations or headers needed to enable it
  • If this is expected behavior or a potential issue

Thank you for your assistance!


Curious about this too. I'd prefer not to make our users sit and wait for 10-20 seconds while nothing happens on the screen, when practically every AI product has trained them to expect the response to start streaming in shortly after the initial request. AIP already streams within Foundry; it would be great to make that available in our React/OSDK apps.

I tested with Flutter, and here are the results.
We are experiencing significant latency (~15 seconds) when using the AIP Agent streaming API in a production Flutter mobile application. The delay occurs between sending the streaming request and receiving the first response chunk, impacting user experience in our real-time conversational interface.

---

## Environment

- **Platform:** Flutter mobile application (iOS & Android)

- **HTTP Client:** Dio 5.4.0

- **API Version:** v2

- **Endpoint:** `/api/v2/aipAgents/agents/{agentRid}/sessions/{sessionRid}/streamingContinue`

- **Response Type:** Streaming (chunked transfer encoding)

---

## Implementation Details

### API Call Pattern

We are using the streaming continue endpoint with the following configuration:

```
// HTTP POST request with streaming response

POST /api/v2/aipAgents/agents/{agentRid}/sessions/{sessionRid}/streamingContinue?preview=true

Headers:
- Content-Type: application/json
- Authorization: Bearer {token}
- Accept: text/plain, text/event-stream, */*

Request Body:
{
  "messageId": "{uuid}",
  "userInput": {
    "text": "[User Context]\nUser ID: {userId}\nTimezone: {timezone}\n\n[User Message]\n{userMessage}"
  }
}

Response Configuration:
- responseType: stream
- Transfer-Encoding: chunked
```

### Session Management

1. **Session Creation:** We create a session once at the start of the conversation using `/api/v2/aipAgents/agents/{agentRid}/sessions`

2. **Session Reuse:** The same session RID is reused for multiple streaming continue requests within the conversation

3. **Context:** Each request includes user context (ID, timezone) embedded in the message text

---

## Performance Measurements

### Observed Latency

We have instrumented our code to measure timing at various stages:

```
Timeline for a typical request:

T+0ms:      Client sends HTTP POST request
            ↓
T+14,872ms: HTTP 200 response headers received
            "Stream connection established"
            ↓
T+14,872ms: First chunk received (1 character)
T+14,901ms: Second chunk received (4 characters)
T+14,903ms: Third chunk received (4 characters)
            ... (rapid chunk delivery continues)
            ↓
T+15,853ms: Stream completes (38 total chunks)
```

### Key Metrics

- **Time to First Byte (TTFB):** ~14.8 seconds

- **Time to HTTP 200:** ~14.8 seconds

- **Streaming Duration:** ~1 second (38 chunks)

- **Total Request Time:** ~15.8 seconds
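The metrics above can be derived mechanically from the raw chunk arrival timestamps; a small TypeScript sketch, for illustration only (the function name is ours):

```typescript
// Derive TTFB, streaming duration, and total time from chunk arrival times,
// each expressed in ms since the request was sent, sorted ascending.
function streamMetrics(chunkTimesMs: number[]): {
  ttfbMs: number;
  streamingMs: number;
  totalMs: number;
  chunks: number;
} {
  if (chunkTimesMs.length === 0) throw new Error("no chunks received");
  const ttfbMs = chunkTimesMs[0];
  const totalMs = chunkTimesMs[chunkTimesMs.length - 1];
  return { ttfbMs, streamingMs: totalMs - ttfbMs, totalMs, chunks: chunkTimesMs.length };
}
```

Feeding in the timeline above (14,872 / 14,901 / 14,903 / ... / 15,853) reproduces the ~14.8s TTFB and ~1s streaming duration reported.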

### Latency Breakdown

Based on our analysis:

```
Component                     Time         Percentage
─────────────────────────────────────────────────────
Network (TCP + TLS + HTTP)    ~300ms       2%
Server Processing             ~14,500ms    98%
─────────────────────────────────────────────────────
Total to HTTP 200             14,800ms     100%
```

**The vast majority (98%) of the latency occurs server-side before the HTTP 200 response is sent.**

---

## Problem Description

### User Experience Impact

From a user perspective:

1. User sends a message

2. Application shows "Processing…" indicator

3. **15 second wait with no feedback**

4. Response text appears word-by-word (streaming works well)

The 15-second delay creates a poor user experience, making the application feel unresponsive.

### What We've Ruled Out

We have verified that the latency is **NOT** caused by:

1. ❌ **Client-side processing** - Our code yields chunks immediately with no buffering

2. ❌ **Network latency** - TCP/TLS handshake completes in ~300ms

3. ❌ **HTTP overhead** - Request/response headers are minimal

4. ❌ **Client configuration** - We've tested with various timeout settings

The delay occurs **before** the server sends the HTTP 200 response, indicating server-side processing time.

---

### Questions for AIP Team

1. **Is this expected behavior?**

- Is 15-second TTFB normal for streaming continue requests?

- Are there known performance characteristics we should be aware of?

2. **Session Optimization:**

- Does session reuse provide performance benefits?

- Are there session configuration options to improve response time?

- Should we be using a different API pattern?

3. **Best Practices:**

- Are there recommended patterns for low-latency streaming?

- Should we batch requests or use different endpoints?

- Are there agent configuration options that affect performance?

### Desired Outcome

We would like to achieve:

- **Target TTFB:** 2-5 seconds (acceptable for conversational AI)

- **Current TTFB:** 14-15 seconds (too slow for good UX)

- **Improvement needed:** 10-second reduction

---

## Reproduction Steps

To reproduce this issue:

1. Create an AIP agent session using the sessions endpoint

2. Send a streaming continue request with user input

3. Measure time from request send to HTTP 200 response

4. Observe 14-15 second delay before first chunk

### Sample Request Flow

```
1. Create Session:

POST /api/v2/aipAgents/agents/{agentRid}/sessions
Response: { "rid": "{sessionRid}" }

2. Streaming Continue (measure timing here):

POST /api/v2/aipAgents/agents/{agentRid}/sessions/{sessionRid}/streamingContinue

Start timer
    ↓
[14.8 second wait]
    ↓
HTTP 200 received
Stop timer

Result: 14,800ms elapsed
```

---

## Additional Context

### Application Architecture

- **Use Case:** Real-time conversational AI for calendar event creation

- **User Flow:** User sends message β†’ AI responds with event details

- **Frequency:** Multiple requests per conversation session

- **Expected Latency:** Sub-5-second response time for good UX

### Client Implementation

Our client implementation is optimized:

- ✅ Streaming response type configured

- ✅ Chunks processed immediately (no buffering)

- ✅ Persistent HTTP connections enabled

- ✅ Appropriate timeout settings (60 seconds)

The bottleneck is definitively server-side, occurring before the HTTP response begins.

---

## Request for Guidance

We would appreciate guidance on:

1. **Performance Expectations:**

- What is the expected TTFB for streaming continue requests?

- Are there SLA or performance benchmarks we should reference?

2. **Optimization Options:**

- Are there agent configuration settings that affect performance?

- Can we request model pre-warming or caching?

- Are there alternative API patterns with better latency?

3. **Debugging:**

- Are there server-side logs or metrics we can access?

- Can you provide timing breakdown of server-side processing?

- Are there known issues or ongoing improvements?

4. **Workarounds:**

- Are there temporary solutions while performance is improved?

- Should we consider different agent types or configurations?

- Are there rate limiting or throttling factors we should know about?

---

Hey friends,

Based on your second message, @maddyAWS, I am going to assume that you are no longer seeing issues with partial LLM message streaming.

One thing I can absolutely recommend @maddyAWS and @kevcam4891 to have a more responsive application while the Agent is processing the user input is to use the Get Session Trace endpoint. To use this endpoint, clients should short poll with the same sessionTraceId they supplied to the continue session endpoint.
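As a sketch of the short-polling pattern described here: the generic `pollUntil` helper below is the reusable part, while the commented usage against Get Session Trace is an assumption on our part (the exact path and the shape of the trace object should be checked against the AIP Agents API reference):

```typescript
// Generic short-poll helper: call fetchOnce, then keep retrying on an
// interval until isDone accepts the result or maxAttempts is exhausted.
async function pollUntil<T>(
  fetchOnce: () => Promise<T>,
  isDone: (value: T) => boolean,
  intervalMs = 500,
  maxAttempts = 60
): Promise<T> {
  let last = await fetchOnce();
  for (let attempt = 1; attempt < maxAttempts && !isDone(last); attempt++) {
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
    last = await fetchOnce();
  }
  return last;
}

// Hypothetical usage against Get Session Trace -- the URL path and the
// "status" field are assumptions, not taken from the API reference:
//
// const trace = await pollUntil(
//   () =>
//     fetch(
//       `${API_BASE_URL}/api/v2/aipAgents/agents/${agentRid}` +
//         `/sessions/${sessionRid}/sessionTraces/${sessionTraceId}?preview=true`,
//       { headers: { Authorization: `Bearer ${AUTH_TOKEN}` } }
//     ).then((r) => r.json()),
//   (t) => t.status === "COMPLETE" // assumed field
// );
```

Polling a trace like this while the streaming request is in flight lets the UI show which tool the agent is currently running instead of a bare spinner.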

Otherwise, there are a few other steps you might want to take to reduce latency in your agent. Each comes with tradeoffs.

  1. Switch conversation retention to 24 hours. This means conversations will expire after 24 hours of activity, but each message will be faster.

  2. Be judicious with how much retrieval context you give your agent. This is a powerful feature, but each configured retrieval context adds an overhead before each message. Might you instead expose the context to the agent through a tool, so that the agent can query the context only when it needs it?

To answer some more questions:

Are there rate limiting or throttling factors we should know about?

As language models are a limited resource, sometimes the model under the hood of the agent is throttled, and we have to retry the completion. To get a better idea of whether this happens often, I recommend testing your agent within the platform in AIP Agent Studio to see if your agent is having to retry.

Can we request model pre-warming or caching?

We don't allow this, especially since we do not know what the user input to the model will be. We don't currently make use of prompt caching on the model provider side.

Are there known issues or ongoing improvements?

Completion latency, which we call Time To First Token (TTFT) internally, is always on the team's mind. This is valuable signal to keep investing in optimizing our TTFT.


Some questions I have for you:

  • Are you seeing similar latency when you test your agent within AIP Agent Studio?
  • Do you have retrieval contexts configured?
  • Is the model making tool calls, leading to this latency? If so, the Get Session Trace endpoint might be very helpful
  • Which model are you using?

Thank you for all of this detail!

  1. The response feels a little faster in AIP Agent Studio; if I add up all the tool calls and back-and-forth, it comes to around 5 seconds there, but it takes longer in my app
  2. No, I don't have retrieval contexts configured
  3. There are tool calls and actions as part of the interaction, but they add less than 200ms of latency
  4. I'm using Claude 4.5 Sonnet

You are proposing to use Get Session Trace as a debugging tool, right? It's not going to return the response itself.

When I tested it, it didn't return a response.

The API shows the tool calls and outputs, but it isn't clear from that how long the model took to respond versus how long each tool call took; the response doesn't include any timestamps.

If each message sent to the model can have its own traceId, then when I poll just before completion, should the response show all the tools that were used, or only the tools used before the poll happened?