File Directory Connector Fails Ingest If One Of The Files Is Still Being Written

We are using the file directory connector to ingest large files from a local file system. We’ve noticed that if at least one of the files is still being written, the connector fails with a Java IOException stating that another process is already accessing the file. Is there any way to handle this gracefully so that the other files aren’t aborted? There’s also no way to exclude these files from the sync.

Thank you in advance!

Hello.

There is no “keep going if one file fails” option without writing your own plugin. You can, however, reduce the impact of a failure by limiting the number of files processed with each sync and increasing the sync frequency to compensate.

If you have at least editor permission on both the source and the output dataset, select the “View advanced configuration” button in your sync’s configuration editor.

While “lastModifiedAfter” is configurable in the UI, you can switch this to “lastModifiedBefore” in the advanced configuration. This filter accepts an offset in ISO-8601 duration format. Use a negative offset to include only files that were last modified before a given duration in the past, e.g.:

  - type: lastModifiedBefore
    offset: '-PT1M'
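
Here, '-PT1M' is an ISO-8601 duration of one minute, so the sync skips any file modified within the last minute; a file must have sat untouched for at least a minute before it is imported.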

There is also a source-level configuration: under “Connection details” there is a “File change timeout” option. This specifies the number of milliseconds that a file must remain unmodified before it is considered for import, and it applies to all syncs created with this source.
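
For example, setting “File change timeout” to 60000 would require a file to remain unmodified for one minute, roughly matching the '-PT1M' filter above.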

The sync-level “lastModifiedBefore” option is more efficient than the source-level configuration, and should be used alongside setting “File change timeout” on the source to 0, provided there is negligible clock drift between the agent’s clock and the source system’s clock. (This does not hold where the directory is mounted from another system, such as a SAN, whose clock drifts by more than a couple of seconds.)

Either way, this approach only works if the process writing the source files behaves according to the following assumption: it writes data to a file continuously (with a gap no greater than our configured setting), and once it stops, it never accesses that file again.

There are conditions (intentional or otherwise) where this might not be the case. For example, if the source system experiences severe memory or CPU pressure, it might pause writes for long enough for us to attempt to read a file, only to resume writing during our read.

One remediation for this unlikely scenario is simply to increase the number of retries on the sync.

If the source system adheres to the rule that it only ever writes one file at a time, then you can sort the file list by last modified time and exclude the most recent file (“notInLastNFiles”) - but this means you will never ingest the most recent file until the next one begins to be written.
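
As a minimal sketch in the advanced configuration (the “notInLastNFiles” type is named above; the count field name here is an assumption, so verify the exact key against the connector documentation):

  - type: notInLastNFiles
    # Assumed field name for the number of most-recently-modified
    # files to exclude - check the connector documentation.
    n: 1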

Beyond this, if you need 100% certainty, the source system will need to be modified to provide more information. This could be achieved, for example, by renaming a file to include a suffix such as “.completed” once writes are finished, and by matching that suffix with a filename regular expression in the sync.
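
A sketch of such a filter follows; the filter type name here is illustrative rather than exact, so substitute whichever filename filter your connector version exposes:

  - type: fileNameFilter
    # Hypothetical filter type: only ingest files whose
    # names end in ".completed".
    regex: '.*\.completed$'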
