Creating Ontology Objects With Embeddings

Hello,

I’m attempting to use an ontology action to create a new instance of an object type featuring a vector embedding property. The objects are chunks of an uploaded and text-extracted PDF, and the aim is to allow users of a document intelligence AI system to upload and use their own files without any developer intervention. I’ve computed an embedding vector using the appropriate model (text-embedding-3-small), but I’m running into some trouble when it comes to actually creating the new object. There seems to be a length limit of 1000 on array parameters passed to ontology actions. Unfortunately for me, the embeddings in question are 1536-dimensional. Is there any way around this limitation? Have I misidentified the problem? Any suggestions on alternative approaches?
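
For concreteness, here's roughly what our upload path does. This is a minimal sketch: the parameter names match our action, but createDocumentChunk and the client wiring are placeholders rather than real API names, and the OSDK call shape is approximate:

    import OpenAI from "openai";
    // Placeholder imports: "client" is our Foundry OSDK client instance and
    // "createDocumentChunk" stands in for the generated action type.
    import { client } from "./foundryClient";
    import { createDocumentChunk } from "./generatedSdk";

    const openai = new OpenAI();

    async function createChunkObject(chunkText: string): Promise<void> {
      // text-embedding-3-small returns 1536-dimensional vectors by default
      const response = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: chunkText,
      });
      const embedding: number[] = response.data[0].embedding;

      // This is the call that fails validation once the array
      // exceeds 1000 elements.
      await client(createDocumentChunk).applyAction({
        annotated_chunk_text: chunkText,
        chunk_embedding: embedding,
      });
    }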

The alternative that jumps out at me would be to incrementally run a pipeline off a mediaset, but that has its own share of problems:

- the uploading app has limited visibility into completion status;
- there is no support in Pipeline Builder for incremental mediaset builds;
- converting a mediaset to a dataset to build from incrementally does not seem to work (it throws errors about media references without an associated media set);
- while it might be possible to rewrite our entire existing batch ingestion pipeline as a series of code repository transforms, I would much rather avoid that, in large part because of the lack of embedding or LLM support in the Python SDK.

Any help would be greatly appreciated.

Hi,

As far as I can tell, the actual limit is 10,000 for arrays of doubles, which is the parameter type used when working with Vector property types. Can you double-check the dimension defined on the vector property in the Ontology Manager? Could you also share the definition of your action type as a screenshot: what do you use as the input parameter, and how is it mapped to the modified property on the object type?

Otherwise, if you have an error instanceId or a screenshot handy, that can help us root-cause what is going on here.

Thanks!

Adam

Thanks for the response! Here’s the dimension and type of the property:

The action’s mapping rule for the parameter:

[screenshot]

The parameter on the action definition’s form content page:

[screenshot]

The detailed parameter view:

The action execution request raises a ValidationError in the TypeScript ontology SDK we're using. Here's what the server returns:

{
	"validation": {
		"result": "INVALID",
		"submissionCriteria": [],
		"parameters": {
			...
			"annotated_chunk_text": {
				"result": "VALID",
				"evaluatedConstraints": [],
				"required": true
			},
			"chunk_embedding": {
				"result": "INVALID",
				"evaluatedConstraints": [
					{
						"type": "arraySize",
						"gte": 1
					}
				],
				"required": true
			},
			"media_reference": {
				"result": "VALID",
				"evaluatedConstraints": [],
				"required": false
			},
			...
		}
	}
}

Only the chunk_embedding parameter is flagged as invalid.

I arrived at the max size of 1000 by binary searching with arrays of different lengths.

As for the actual construction of the array, it’s simply a number[] derived from a call to OpenAI’s embedding API.

It’s then passed as the value for the chunk_embedding parameter. We check for the appropriate length before passing it through.

If I try in validate mode, it returns the same sort of validation error as I posted above.
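
Roughly, the probe looked like the sketch below (same placeholder names as in my first post; $validateOnly is my understanding of the OSDK's validate-only option, so treat that detail as an assumption):

    // Binary search for the largest array length that passes validation,
    // run in validate-only mode so no objects are actually created.
    // ($validateOnly assumes an OSDK 2.x client; the option name may
    // differ in other SDK versions.)
    async function probeMaxArraySize(): Promise<number> {
      let good = 1;   // length known to pass the arraySize constraint
      let bad = 1536; // length known to fail
      while (bad - good > 1) {
        const mid = Math.floor((good + bad) / 2);
        const validation = await client(createDocumentChunk).applyAction(
          {
            annotated_chunk_text: "probe",
            chunk_embedding: new Array(mid).fill(0),
          },
          { $validateOnly: true },
        );
        if (validation.result === "VALID") good = mid;
        else bad = mid;
      }
      return good; // converges to 1000 on our stack
    }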

I’m afraid I can’t provide a request or error ID, since the action execute call returns a 200 and doesn’t include any such information. x-b3-traceid is set in the response headers, though. Here are a couple of values:

2e5df159db203d64, 007133afd5a70a3d

Thanks again!

We have just fixed a bug in the handling of empty maximum selection sizes on parameters, which caused them to be arbitrarily capped at 1,000 elements. This change should become available within the next week.

In the meantime, you can set a maximum selection size of 2,000 on the Chunk Embedding parameter, which should unblock the issue you were seeing.

Thank you for the fix!

Thanks for reporting! Please let us know if this isn't resolved for your use case by the middle of next week, or if you run into any other issues.

We took an alternative approach and simply used shorter embeddings; we may revisit longer vectors in the future. Thanks again!
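
For anyone who finds this later: the text-embedding-3 models support a dimensions option that natively shortens the returned vectors, so on our side it was a small change along these lines (512 is just an example value):

    // text-embedding-3 models accept a "dimensions" option that
    // shortens the returned vectors natively.
    const response = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: chunkText,
      dimensions: 512, // example value; anything at or under 1000 avoids the cap
    });
    const embedding: number[] = response.data[0].embedding; // length 512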
