Hello, is there any way to export 100k rows of an object, sorted, from Workshop, Pipeline Builder, or a custom code repository?
My client’s use case involves capturing snapshots of a large dataset (preferably as an Excel file), sorted by multiple columns, from our main Workshop app.
I tried building a custom CSV file in Code Repositories and sorting the dataset there, but for more than 100k objects the function times out.
Excel exports from the export feature on Workshop buttons support more than 100k objects, but there is no built-in sorting feature.
Any guidance would be appreciated, thanks!
One potential workaround is to sort in a PySpark transform, write the result to a dataset, and export from there:
from transforms.api import transform_df, Input, Output
from pyspark.sql import functions as F


@transform_df(
    Output("/path/to/sorted_snapshot"),
    source=Input("/path/to/source_dataset"),
)
def compute(source):
    # Sort by column1 ascending, then by column2 descending.
    return source.orderBy(F.col("column1").asc(), F.col("column2").desc())
You can then export directly from the dataset in the Foundry UI (no row limit), or point an Object Table at it and export from Workshop (up to 200k rows, and the sort is preserved).
If the blocker is the timeout on 100k+ rows, this shifts the work to the pipeline layer.
Docs: PySpark transforms
Thanks for the response! I tried sorting in Pipeline Builder, so the dataset is sorted, but when the ontology object is indexed the sort isn’t preserved. Is the sort supposed to be preserved on indexing?
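If the row order really is lost on indexing, as described above, one possible approach is to stop relying on physical row order and instead materialize the order as an explicit numeric property, so that any consumer can re-sort on a single column. The plain-Python sketch below only illustrates the idea on a list of dicts; the function name `add_sort_index` and the column name `sort_index` are hypothetical, not part of any Foundry API. In a PySpark transform the equivalent would likely be `F.row_number().over(Window.orderBy(...))`, which moves all rows into a single partition and so may be slow on very large datasets.

```python
from operator import itemgetter


def add_sort_index(rows, sort_spec, index_column="sort_index"):
    """Return new rows ordered by sort_spec, each tagged with its 1-based position.

    sort_spec is a list of (column, descending) pairs, most significant first.
    The name and signature are illustrative assumptions, not a Foundry API.
    """
    ordered = list(rows)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last yields a correct multi-column order.
    for column, descending in reversed(sort_spec):
        ordered.sort(key=itemgetter(column), reverse=descending)
    # Copy each row and attach its position in the final order.
    return [{**row, index_column: i} for i, row in enumerate(ordered, start=1)]


rows = [
    {"column1": "b", "column2": 1},
    {"column1": "a", "column2": 1},
    {"column1": "a", "column2": 2},
]
# Same ordering as the transform above: column1 ascending, column2 descending.
indexed = add_sort_index(rows, [("column1", False), ("column2", True)])
```

With a column like this on the object type, sorting by that single property in the Workshop table should reproduce the multi-column order, assuming the export respects the table's sort.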
Does this export need to come from Workshop for your use case?
Yes, ideally, since our main application lives there. The user flow would be: from a home page in Workshop, press a button to export a particular dataset → an overlay opens where the user can filter the dataset, then click export → the sorted, filtered dataset is exported.