Retrying Data Connection Sync when 0 Rows Yielded

Hi,

Is there a way to automatically re-trigger an ingest when a certain condition is not met; for example, if the number of rows is 0 then re-run the build? Note that this is for cases where the job doesn’t fail but rather doesn’t meet a condition.

Thanks in advance.

You can use data expectations in this case to fail the build, and set the schedule to retry it. Note that if your schedule consistently fails it may be automatically paused; I can’t recall whether that behaviour can be overridden, but I’d guess it is possible.
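
As a rough sketch, a row-count expectation on the output of a Python transform might look like the following; the dataset paths are placeholders, and the exact expectations API (`E.count().gt(0)`, `Check` with `on_error="FAIL"`) is assumed from the transforms library:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


# Fail the build when the output would be empty, so the schedule's
# retry behaviour can kick in instead of silently writing zero rows.
@transform_df(
    Output(
        "/Project/datasets/cleaned_output",  # placeholder path
        checks=[Check(E.count().gt(0), "Output has at least one row", on_error="FAIL")],
    ),
    source_df=Input("/Project/datasets/raw_ingest"),  # placeholder path
)
def compute(source_df):
    return source_df
```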

A setup without expectations is a bit more challenging as it’ll be difficult to avoid a circular dependency. You could do something using the ontology:

  • Have the output of the build synced to an object (the object represents ‘zero rows written’ rather than the actual zero-row output)
  • Set an automate monitor to create another object
  • Create a materialization from that second object
  • Set the materialization as a trigger for the original build schedule.

I’d generally prefer the first approach if possible.


Hi Ben,

Thanks for the quick reply—this makes sense to me.

The problem is that this issue happens on the initial ingest coming out of Data Connection, and I’m not seeing a place to add data expectations there. Do you have a strategy for this?

Best,

Larissa

Oh, it looks like the title might have been misleading: it seemed like you wanted to do this in a builder pipeline. I’ve edited the title.

To my knowledge there isn’t a particularly easy way to accomplish this on a normal Data Connection sync: from Data Connection’s point of view, it has successfully synced the zero rows that the source provided. Could you work around this by simply increasing the frequency of syncs? If the source provides zero rows when there’s no data available, maybe just try to sync every 5 minutes or so.

The only other route would be a hacky script/external transform that uses the build2 transaction APIs to check whether an empty transaction was committed and, if so, retriggers a build. This isn’t something that will be hugely easy to maintain, but it can be done with our public APIs. To be clear, I would not broadly recommend this.

You’d want something like:

  • Get the master branch to find the latest txn with get-branch
  • IIRC, Data Connection will abort a transaction if no new data is yielded. If so, you can just fetch the txn with get-transaction
  • If that latest txn is aborted, kick off a new build of the sync using run-schedule (or create a manual build…? Probably better to run the schedule for easier debugging)

You’d want to run this in a source-based external transform with a Foundry-to-Foundry source (a REST API source with creds for a third-party account, to avoid using a user token), and then schedule it to run frequently. If the empty txns are not shown as aborted, you will need to get the files from the latest transaction and either fully parse the content or just approximate from the file size.
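
A minimal sketch of that polling logic, assuming the public get-branch, get-transaction, and run-schedule endpoints; the exact paths, response fields, and the hard-coded hostname, token, and RIDs below are placeholders/assumptions, and in practice the credentials would come from the external transform’s source:

```python
import requests

# Placeholder configuration; in an external transform these would come from
# the source's configured hostname and credentials, not hard-coded values.
HOSTNAME = "your-stack.palantirfoundry.com"
TOKEN = "<third-party-account-token>"
DATASET_RID = "ri.foundry.main.dataset.<sync-output>"
SCHEDULE_RID = "ri.scheduler.main.schedule.<sync-schedule>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}


def latest_transaction_is_aborted() -> bool:
    # get-branch: read the master branch to find the latest transaction RID
    branch = requests.get(
        f"https://{HOSTNAME}/api/v1/datasets/{DATASET_RID}/branches/master",
        headers=HEADERS,
    ).json()
    txn_rid = branch.get("transactionRid")
    if txn_rid is None:
        return False  # no transactions on the branch yet

    # get-transaction: inspect that transaction's status
    txn = requests.get(
        f"https://{HOSTNAME}/api/v1/datasets/{DATASET_RID}/transactions/{txn_rid}",
        headers=HEADERS,
    ).json()
    return txn.get("status") == "ABORTED"


def retrigger_sync() -> None:
    # run-schedule: kick off the sync's schedule again
    response = requests.post(
        f"https://{HOSTNAME}/api/v2/orchestration/schedules/{SCHEDULE_RID}/run",
        headers=HEADERS,
    )
    response.raise_for_status()


if __name__ == "__main__":
    if latest_transaction_is_aborted():
        retrigger_sync()
```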

Awesome, I’m going to take the first approach.

So now I have it set up so the ingest runs at intervals and the downstream datasets only run when there’s data. The new issue this introduces, though, is that in cases where there is always data, the downstream process will run more frequently than required.

What would you advise on the next step here?

Thanks in advance

Would it be possible to make the downstream transforms incremental/append-only? That way the process shouldn’t really be any more expensive, even if it runs more often.
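
For reference, a minimal sketch of an incremental, append-only Python transform (dataset paths are placeholders; the behaviour described in the comments assumes the default incremental read/write modes):

```python
from transforms.api import transform, incremental, Input, Output


# With @incremental, incremental runs read only the newly added input rows
# and append the result to the output instead of recomputing everything.
@incremental()
@transform(
    source=Input("/Project/datasets/raw_ingest"),      # placeholder path
    out=Output("/Project/datasets/appended_output"),   # placeholder path
)
def compute(source, out):
    new_rows = source.dataframe()  # only unprocessed rows on incremental runs
    out.write_dataframe(new_rows)  # appends rather than overwriting
```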

If you use expectations in Code Repositories you can put them on the inputs, and just fail the build if the inputs are empty instead of letting it run. You could also put the pipeline on a schedule that runs every x hours to throttle the runs.
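
That input-side check might look something like this (same caveats as the earlier sketch about the exact expectations API; paths are placeholders):

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


# Abort the build up front if the ingested input is empty, so the
# downstream logic only runs when there is actually data to process.
@transform_df(
    Output("/Project/datasets/downstream_output"),  # placeholder path
    source_df=Input(
        "/Project/datasets/raw_ingest",  # placeholder path
        checks=[Check(E.count().gt(0), "Input has at least one row", on_error="FAIL")],
    ),
)
def compute(source_df):
    return source_df
```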