How to do row and column filtering of a Dataset

raybellwaves · December 2, 2024, 2:48pm

It seems the foundry.transforms.Dataset API can’t do for row and column filtering at the same time.

I would expect the order doesn’t matter like in SQL or other lazy parquet readers.

I’m working in a Jupyter notebook and doing

Dataset.get("my_table")\
  .where(Column.get("letter") =="a")\
  .select("column1")\
  .read_table(format="pandas")

and get

ConjureHTTPError: 400 Client Error: Bad Request for url: https://localhost:8080/api/datasets/puzzle_inputs/readTableV2. ErrorCode: 'INVALID_ARGUMENT'. ErrorName: 'Fletcher:MissingFilterColumnException'. ErrorInstanceId: 'XXX'. TraceId: 'XXX'. Parameters: {'missingColumnNames': '[letter]'}

Dataset.get("my_table")\
  .where(Column.get("letter") =="a")\
  .read_table(format="pandas")

and

Dataset.get("my_table")\
  .select("column1")\
  .read_table(format="pandas")

work fine

arukavina · December 2, 2024, 3:14pm

Hi @raybellwaves ,

I suspect this error occurs because the select operation removes columns not explicitly included before the where filter is applied. This results in the where clause failing because the column letter no longer exists in the dataset.

In the foundry.transforms.Dataset API, operations like select and where are applied lazily.

Why This Happens

select removes unused columns: When you call select("column1"), the API only retains the column1 column in the dataset.
where depends on letter: The where(Column.get("letter") == "a") operation requires the letter column to exist in the dataset.
Order of operations matters: By chaining select before where, you’re effectively trying to filter rows based on a column that has already been removed by the select operation.

Possible Solution

Retain Columns Required for Filtering

If you must select columns early for some reason, ensure that the column required for filtering (letter in this case) is included in the selection:

Dataset.get("my_table")\
  .select("column1", "letter")\
  .where(Column.get("letter") == "a")\
  .read_table(format="pandas")

Let me know if this worked.

raybellwaves · December 2, 2024, 5:36pm

Thanks.

It would be nice to have predicate push down in the Dataset API so it can handle any order of operations and do the filtering like in Spark or other tools here: https://sites.google.com/view/raybellwaves/blog/what-data-processing-tool-should-i-use

What i’m actually after is:

expected = (
    Dataset.get("puzzle_inputs")
    .where(Column.get("file_name") == file_name)
    .read_table(format="pandas")[["input"]]
)

raybellwaves · December 3, 2024, 8:20am

Thanks.

It would be nice to have predicate push down in the API so it can handle any order of operations.

What i’m actually after is

file_name = "sample_day_01_part_1.txt"
df = Dataset.get("puzzle_inputs").where(Column.get("file_name") == file_name).read_table(format="pandas")[["input"]]

system · February 1, 2025, 8:20am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.