Parsing XLSX docs with many sheets

fguze · July 2, 2024, 3:49am

Hi community! I am using Pipeline Builder to parse a raw dataset of XLSX docs with 30+ sheets each. All the sheets have the same schema, but the native PB function “EXTRACT ROWS FROM AN EXCEL FILE” only handles one sheet at a time, so we have to use 30 transform blocks and then Union the results, which performs poorly. Please let me know if anyone is aware of a better approach.

sandpiper · July 3, 2024, 10:41pm

If you just specify the empty string (which is the default value) for the “sheet” parameter, the parser should do exactly what you want (extract data from all sheets, applying the same schema).

The actual behavior of the “sheet” parameter is to consider all sheets whose names match a regular expression of the form .*{sheet_parameter_value}.*, which is why the empty string works: the pattern reduces to .*.*, which matches every sheet name. We should definitely update the parameter’s display name to make this clearer!
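
To illustrate the matching rule described above, here is a minimal Python sketch. It is not Pipeline Builder's actual implementation: the function name matching_sheets and the sample sheet names are hypothetical, and it assumes the parameter value is inserted into the pattern verbatim and matched against the whole sheet name.

```python
import re


def matching_sheets(sheet_names, sheet_param=""):
    """Return the sheet names selected by a given "sheet" parameter value.

    Assumption: the value is placed verbatim inside ".*{value}.*" and the
    resulting pattern must match the full sheet name.
    """
    pattern = re.compile(f".*{sheet_param}.*")
    return [name for name in sheet_names if pattern.fullmatch(name)]


# Hypothetical sheet names, for illustration only.
sheets = ["Jan 2024", "Feb 2024", "Summary"]

# Empty string (the default): the pattern is ".*.*", so every sheet matches.
print(matching_sheets(sheets))

# A non-empty value acts as a "name contains" filter.
print(matching_sheets(sheets, "2024"))
```

Under these assumptions, a non-empty value is interpreted as a regular expression fragment rather than a literal name, so a sheet name containing characters such as "(" or "." may need escaping to match as intended.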