Some Questions on Pipelines, Code Repositories, and Semantic Search

Hello,

I have been attempting to use Foundry as the backend for a web app that ingests, indexes, and serves a semantic search API to an LLM chat frontend (a React app hosted outside the platform). To that end, I have two components, implemented as follows:

First, a document indexing pipeline. This takes documents uploaded to a mediaset, chunks them, and adds them to a dataset with a vector index.

Second, a set of functions (previously AIP Logic functions but now code-repository functions) which the frontend may call to search and return results from that indexed set.
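For context, the chunking step in the first component amounts to something like the sketch below (a simplified, self-contained illustration; the real pipeline reads documents from the mediaset and writes rows to the vector-indexed dataset):

```python
# Simplified sketch of the chunking step (illustrative only; the real
# transform reads from the mediaset and writes to a dataset with a
# vector index).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```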

I would like to minimize reliance on custom code and use as much built-in Foundry functionality as possible, so I have a few questions about where that simplification might be made:

First, in our document indexing pipeline, we read from a mediaset that we upload to via the mediaset put API endpoints. I would like to incrementally update the output dataset, but there doesn’t seem to be any way to do that without writing a custom transform function. Is that the case, or am I missing something?

Second, I want to support filtering which particular documents can be searched through my custom functions. I allow the caller to pass in a list of RIDs, but I have found no way to filter a nearest-neighbor search by such a parameter in AIP Logic, forcing me to use a Code Repository instead. Is there any way to avoid this?

Third, is there some way to allow SQL queries of a dataset via SDK or API call?

Fourth, is there any built-in support for lexical/keyword search over dataset columns in the SDK or through the APIs?

Thank you for any help you can offer.

For your first question: Pipeline Builder supports incremental transactions. See here for an example.

Second question: AIP Logic supports searching around to a linked object. You would have to filter for the “Document” objects using the passed-in list of RIDs, and then search around to your “Chunk” objects. If you can provide more insight here, I’m happy to help further.
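To illustrate the idea in plain Python (this is only a conceptual sketch, not the AIP Logic or Ontology SDK API; the `Document`/`Chunk` names and the `document_rid` link field are assumptions based on your description):

```python
# Conceptual illustration of the "filter parents, then search around to
# linked children" pattern. Plain Python only; not an AIP Logic or
# Ontology SDK API.
documents = [
    {"rid": "ri.doc.1", "title": "Spec"},
    {"rid": "ri.doc.2", "title": "Manual"},
]
chunks = [
    {"id": "c1", "document_rid": "ri.doc.1", "text": "alpha"},
    {"id": "c2", "document_rid": "ri.doc.2", "text": "beta"},
    {"id": "c3", "document_rid": "ri.doc.1", "text": "gamma"},
]

def chunks_for_documents(rid_filter: list[str]) -> list[dict]:
    """Step 1: filter Document objects by the passed-in RID list.
    Step 2: 'search around' to the linked Chunk objects."""
    allowed = {d["rid"] for d in documents if d["rid"] in set(rid_filter)}
    return [c for c in chunks if c["document_rid"] in allowed]
```

The nearest-neighbor search would then be restricted to the chunks this returns.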

Fourth question: Yes, Ontology SDKs do support keyword search. Here are the docs: https://www.palantir.com/docs/foundry/ontology-sdk/python-osdk#types-of-search-filters-searchquery. Note that Ontology SDKs are different from the Platform SDK. You would need to go into the Developer Console app and specify the objects you would like to include in the Ontology SDK you create.

Thanks for the response. In order:

The incremental transform documentation only refers to datasets, and when I try to mark an input mediaset as incremental no UI element appears. Are incremental builds supported for mediasets?

I’ll take a look at this search-around, thank you.

On keyword searching, are there any keyword search features that allow relevance scoring? I am interested in ranking documents rather than in filtering them.

To give a bit more context, I’m working on the backend aspects of the app talked about in this post.

The incremental transform documentation only refers to datasets, and when I try to mark an input mediaset as incremental no UI element appears. Are incremental builds supported for mediasets?

Incremental builds aren’t directly supported for mediasets. You would first have to convert your mediaset into a dataset, like so:

And then in a different pipeline, import that dataset and set it as incremental:

On keyword searching, are there any keyword search features that allow relevance scoring? I am interested in ranking documents rather than in filtering them.

Yes, we do have support for relevance scoring with the .orderByRelevance() method when searching for objects. You can find out more here: https://www.palantir.com/docs/foundry/ontology/ontology-augmented-generation#ranked-keyword-based-search
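As a local illustration of the ranking-versus-filtering distinction (this is a toy term-frequency score, not the OSDK’s actual `.orderByRelevance()` implementation):

```python
# Toy relevance ranking: score each document by how often the query
# terms appear in it, then sort by score. This only illustrates the
# concept of ranking rather than filtering; the OSDK's scoring is its
# own implementation.
def rank_by_relevance(query: str, docs: list[str]) -> list[tuple[str, int]]:
    terms = query.lower().split()
    scored = []
    for doc in docs:
        words = doc.lower().split()
        score = sum(words.count(t) for t in terms)
        scored.append((doc, score))
    # Highest score first, analogous to ordering results by relevance
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Note that low-scoring documents are still returned, just ranked lower, rather than being filtered out.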

Third, is there some way to allow SQL queries of a dataset via SDK or API call?
It is possible, but it is a bit of a workaround. Further, since this is not a documented API endpoint, it could break at any time.

On the dataset preview page, we have a SQL preview section. You can see what API is being hit by looking at the network tab when executing the SQL query.

It’s a POST request being made to https://<your foundry stack url>/foundry-sql-server/api/v2/queries/execute with the payload:

{
  "dialect": "SPARK",
  "serializationProtocol": "ARROW_V1",
  "queryCapabilities": [],
  "maxStreams": 1,
  "maxRows": 1000,
  "query": "SELECT *\nFROM `<your dataset rid>`",
  "fallbackBranchIds": [
    "master"
  ],
  "attribution": {
    "applicationId": "<your dataset rid>"
  }
}
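Putting that together, a call from Python might look like the sketch below. To be clear, this is the undocumented endpoint observed in the browser’s network tab, so the path and payload shape could change without notice; `FOUNDRY_URL`, `TOKEN`, and the dataset RID are placeholders you would fill in:

```python
import json
from urllib import request

FOUNDRY_URL = "https://your-stack.example.com"  # placeholder: your Foundry stack URL
TOKEN = "<your bearer token>"                   # placeholder: your auth token

def build_sql_payload(dataset_rid: str, limit: int = 1000) -> dict:
    """Mirror the payload observed in the browser's network tab."""
    return {
        "dialect": "SPARK",
        "serializationProtocol": "ARROW_V1",
        "queryCapabilities": [],
        "maxStreams": 1,
        "maxRows": limit,
        "query": f"SELECT *\nFROM `{dataset_rid}`",
        "fallbackBranchIds": ["master"],
        "attribution": {"applicationId": dataset_rid},
    }

def execute_sql(dataset_rid: str) -> bytes:
    """POST to the (undocumented) foundry-sql-server endpoint. The
    response body is Arrow-serialized and still needs decoding."""
    req = request.Request(
        f"{FOUNDRY_URL}/foundry-sql-server/api/v2/queries/execute",
        data=json.dumps(build_sql_payload(dataset_rid)).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with request.urlopen(req) as resp:
        return resp.read()
```

Since the response uses `ARROW_V1` serialization, you would still need an Arrow reader on the client side to turn the bytes into rows.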