How do you filter a dataset import by date in Code Workspaces?

I’m getting errors when i try to import a dataset in Code Workspaces when applying date filters. Haven’t been able to find examples of how to do this correctly. Below is an example of what I’m trying to do.

from foundry.transforms import Dataset, Column
import datetime

# limit date
date_cutoff = datetime.datetime(2024,1,1)

# Read the dataset using the date filter.
alias_dataset = (Dataset.get("alias_name")
    .where(Column("date_column") >= date_cutoff)
    .read_table("pandas"))

I’ve also tried adding a string date without datetime formatting and it’s not working.

Are dates supported?

Welcome to the Dev Community!

I think that you would replace alias_name with the name of whichever dataset you have imported to the Code Workspace. Remember, codeworkspaces is a Foundry wrapper over the Jupyter dev environment.

Bear in mind most people import the dataset into whichever format they are familiar with such as Pandas and from there you have the whole suite of Pandas Docs and Stack Overflow to help.

Thanks for the reply. However, my intent was to filter by date at import since my dataset is 8GB and don’t want to bring in entire data before filtering (takes up to 5 minutes). I think i figured out a work-around by converting date to a formatted ISO string date.

Example Dataset (with three types of formats)
Column 1: id, string
Column 2: date1, date
Column 3: date2, timestamp
Column 4: date3, string

Solution (Only string dates filtered properly, date and datetime properties return errors)

from foundry.transforms import Dataset, Column
from datetime import datetime, date

date_format = date(2024,1,5)
datetime_format = datetime(2024,1,5)
string_format = "2024-01-05"

## date and datime attempts return errors.
# Dataset.get("dataset").where(Column("date1") >= date_format).read_table(format="pandas")
# Dataset.get("dataset").where(Column("date2") >= datetime_format).read_table(format="pandas")
Dataset.get("dataset").where(Column("date3") >= string_format).read_table(format="pandas")

From what i can tell the current library does not support dates and those columns need to be casted as strings. Unless I’m not understanding the proper syntax for filtering the dates correctly.

Posting my work-around if others find this useful or have other ideas.

The example code snippet should work, there was a bug previously which has since been fixed. Note that you can also use a string in the where clause, so the following code snippet would also work:

from foundry.transforms import Dataset, Column

# Read the dataset using the date filter.
alias_dataset = (Dataset.get("alias_name")
    # alternatively, use a datetime.datetime object
    .where(Column("date_column") >= "2024-01-01")
    .read_table("pandas"))