Impact of adding a column with LLMs?

I am building a pipeline where one of the steps I have added uses an LLM to create a column based on two other columns.

Prior to this step, my sample dataset has 38 rows. However, applying this change leaves only 20 rows, containing only the unique combinations of values from the other two columns.

Here is a simplified example of what I’m seeing:

Original Dataset:

| Column 1 | Column 2 |
| --- | --- |
| A | B |
| A | B |
| C | B |

New Dataset:

| Column 1 | Column 2 | Generated Column |
| --- | --- | --- |
| A | B | D |
| C | B | E |

For the LLM processing step, I am using a custom prompt that references the two columns (in the case above, ‘Column 1’ and ‘Column 2’). Is there a reason this is happening?

Hi! Thanks for the explanation. How are you sending the data to the LLM? Are you using a UDF?

I’m using an LLM node within Pipeline Builder with a custom-written prompt. To reference data, I selected the two relevant columns in an input box that says “Provide input data.”

I also tried adding the primary key for the dataset in this input box, but that didn’t fix the issue.

What is your prompt? I assume you used the Empty prompt. Under Advanced settings, you can also choose to skip recalculating the rows that are repeated.
You can also count how many times each row appears in a new column and, after the Use LLM block, explode by that count.
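To illustrate the idea, here is a rough PySpark sketch (not what Pipeline Builder does internally; the column names and the `call_llm` function are placeholders for whatever your LLM step actually uses):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with duplicate (col1, col2) combinations.
df = spark.createDataFrame(
    [("A", "B"), ("A", "B"), ("C", "B")], ["col1", "col2"]
)

# 1. Count how many times each unique combination appears.
counts = df.groupBy("col1", "col2").agg(F.count(F.lit(1)).alias("n"))

# 2. Run the LLM only once per unique combination.
#    `call_llm` is a placeholder for the actual LLM call / node.
call_llm = F.udf(lambda a, b: f"generated from {a}, {b}")
deduped = counts.withColumn("generated", call_llm("col1", "col2"))

# 3. Re-expand each row n times so the original row count is restored.
restored = (
    deduped
    .withColumn("i", F.explode(F.sequence(F.lit(1), F.col("n"))))
    .drop("n", "i")
)
restored.show()
```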

Here is the prompt I am using:

You are a [general description of LLM role]. You must determine [generated column description]. Give me a single number.

I tried the first solution previously and it didn’t work, and the second solution doesn’t work for me as there are other columns in my dataset.

I experimented with my dataset and realized that the issue is not with unique values in the two columns I am referencing. Instead, the “Use LLM” node appears to be dropping rows seemingly at random. Is there something happening behind the scenes that could be causing this?

It should not, unless you have a filter before/after the block. If you create a new column with the row number and pass it to the LLM, do you still get the same issue? I have tried to replicate the problem, but the rows are not dropped when previewing the Use LLM block.
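For reference, a minimal PySpark sketch of adding such a row-number column (the column names here are placeholders; in Pipeline Builder you would use the equivalent built-in transform):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "B"), ("A", "B"), ("C", "B")], ["col1", "col2"]
)

# Add a row-number column so every row is unique going into the LLM step,
# which also makes it easy to see exactly which rows come back out.
w = Window.orderBy("col1", "col2")  # any stable ordering works
df_with_id = df.withColumn("row_num", F.row_number().over(w))
df_with_id.show()
```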

I just checked it out and it looks like I’m getting the final output I want.

My original dataset had 38 rows (based on “Calculate Row Count”). After the LLM processing step, the dataset had 20 rows (again, based on “Calculate Row Count”).

However, the final output (which I set to create an object instance for each row in the final dataset) has 38 entries in its backing datasource. The data appear to be correct.

What I think happens is that when you use the LLM node, the output preview shows far fewer rows than the original row count. I am guessing the preview avoids spending tokens on LLM output for every row; only once you deploy the pipeline does it compute the output for every row.

That’s what I was thinking too, but even the “Calculate Row Count” feature was giving me an inaccurate count (20 vs. 38).