Pipeline builder incremental mediasets pdf extraction

CodeStrap · October 9, 2024, 3:00pm

I’ve been setting up my own incremental logic to subtract rows based on a custom join. I use document id and page number, but you could optionally check the extracted text to see if it’s changed. Full credit to the Palantir team who showed me this approach and provided the boilerplate:

from pyspark.sql import functions as F
from pyspark.sql import types as T


def get_incremental_data(ctx, input_dataset, output_dataset, limit_rows=True, limit=2):
    # We read the full current view of the input, treating it as a snapshot (e.g. via snapshot_inputs on the incremental decorator)
    input_df_all_dataframe = input_dataset.dataframe(mode="current")

    if hasattr(ctx, '_is_incremental') and ctx._is_incremental:
        # We read the current output to see what we already processed in previous builds
        # Note: We have to specify the schema for the first run
        # page_id is a hash of the page number and media set item rid which should be unique for each page and stable between builds
        out_schema = T.StructType([
            T.StructField('page_id', T.StringType()),
            # T.StructField('page_number', T.IntegerType()),
        ])
        output_df_previous_dataframe = output_dataset.dataframe('current', out_schema)

        # ==== Example processing here ====
        # We diff the input with the current output, to find the "new rows".
        # We do this with a LEFT ANTI join : A - B <==> A LEFT ANTI B
        KEY = ["page_id"]
        new_rows_df = input_df_all_dataframe.join(output_df_previous_dataframe, how="left_anti", on=KEY)
    else:
        # On first run
        new_rows_df = input_df_all_dataframe

    # We add a timestamp for easier tracking/debugging/understanding of the example
    new_rows_df = new_rows_df.withColumn('incremental_ts', F.current_timestamp())

    # Optionally cap the number of new rows processed per build
    if limit_rows:
        new_rows_df = new_rows_df.limit(limit)

    return new_rows_df
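
The comments above describe page_id as a hash of the page number and the media set item RID. Here is a minimal sketch of how such a key could be built, plus the left anti join expressed in plain Python so the diff logic is easy to test outside Spark. The names make_page_id and left_anti are hypothetical, and SHA-256 is my assumption; the original boilerplate doesn't say which hash it uses.

```python
import hashlib


def make_page_id(media_item_rid: str, page_number: int) -> str:
    """Build a stable key for one page from the item RID and page number."""
    # The separator avoids collisions such as ("a1", 2) vs ("a", 12),
    # which would otherwise both concatenate to "a12".
    raw = f"{media_item_rid}|{page_number}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def left_anti(input_rows, processed_ids):
    """Keep only rows whose page_id has not been processed yet (A - B)."""
    return [row for row in input_rows if row["page_id"] not in processed_ids]


# Three pages of one (made-up) document, of which page 0 was already processed
rows = [{"page_id": make_page_id("doc", n), "page_number": n} for n in range(3)]
already_done = {make_page_id("doc", 0)}
new_rows = left_anti(rows, already_done)
```

In Spark you could probably get the same key with F.sha2(F.concat_ws("|", rid_col, page_col), 256), but treat that as a suggestion rather than what the Palantir boilerplate does.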

As you said, though, OCR isn’t deterministic. But if you are lucky enough to be extracting the text layer, you could know with a fair degree of certainty whether the text of a page has changed in any way. If you are processing images, audio, etc., you might be able to use checksums. Even with incremental support for media sets, it might not have the level of granularity you want.
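
For the checksum idea, here is a rough sketch of what the comparison could look like. The function names and the whitespace normalization are my own assumptions, not anything from the boilerplate; the point is just that you store a checksum per page_id and reprocess a page only when the checksum differs.

```python
import hashlib


def text_checksum(page_text: str) -> str:
    """Checksum of a page's extracted text layer."""
    # Assumption: collapse whitespace so a trivial re-flow does not count as a change
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def file_checksum(data: bytes) -> str:
    """Checksum of raw bytes, e.g. an image or audio file."""
    return hashlib.sha256(data).hexdigest()


def needs_reprocessing(page_id: str, new_checksum: str, previous: dict) -> bool:
    """True if the page is new or its checksum differs from the stored one."""
    return previous.get(page_id) != new_checksum


# 'previous' stands in for checksums read back from the prior output
previous = {"p1": text_checksum("Hello  world")}
unchanged = needs_reprocessing("p1", text_checksum("Hello world\n"), previous)
changed = needs_reprocessing("p1", text_checksum("Hello there"), previous)
```

This only helps when extraction is deterministic (text layer or raw bytes). For OCR output the checksum would likely change between runs even when the page hasn't.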