We recently created a media set solution with the primary format set to PDF, while also allowing files in all other formats.
It has now come to our attention that when Foundry converts any document to a media item, it converts the item to the primary format. We reviewed the documentation and found one sentence that states this:
#1) Is there a way around this behavior, or a workaround? When we import a JPEG file, the file name shouldn’t change to xyz.jpeg.pdf. We would like to keep the original xyz.jpeg format intact.
For interest's sake:
#2) What are the possible use cases where one would want files converted on upload into a media set? I could not think of a scenario where you would want to force a JPEG to become a PDF.
Media sets are designed to contain files all in the same format so that the files can be operated on as an entire media set. As an example, the way to extract text from a docx file is different from the way to extract text from a PDF, so it makes it simpler to extract text from an entire media set when all the files are the same format.
To provide flexibility and allow you to work with lots of different formats, we allow you to choose additional input formats, and those files get converted to the primary format on upload. Having said that, we do provide the option to download the original file or the converted file, so you can still get the original file.
There are other reasons why people may want to convert formats. For example, it’s easier to work with PDFs, which is why we allow pptx and docx files as additional input formats, but not as a primary format. Someone may have a JPEG image that is a scanned document, and they may want to convert it into a PDF because that makes more sense and lets it be treated in the same way as all their other documents.
Generally, if you have a lot of images that don’t represent documents, you are unlikely to want to process them in the same way as your documents, or even your audio files, so having the semantic distinction is useful.
However, we have been working on a new media set schema type called multimodal. It is designed to allow files of any format to be uploaded without being converted to another format. This means you can’t process the media set in the same way as the other schema types (e.g. you can’t extract text from everything in your media set if some of the items are audio files). We will be releasing this in the next month or two, and that might help with your use case.
I’d be curious to understand why you don’t want to convert files on upload?
Thank you for the detailed explanation. It makes more sense now why you might want to auto-convert documents at times.
In our current use case, we are simply moving files from one Foundry dataset (synced from an SFTP) to another Foundry dataset (which is exported to SharePoint). Before the copy operation, we were converting the files into a media set and then using the file paths as identifiers to copy the raw files across. With the auto-conversion, some of the file paths changed and the copy operation then failed.
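For anyone who hits the same path mismatch, here is a rough sketch of how a converted path could be mapped back to the raw file path before copying. It is plain Python with no Foundry API calls. It assumes the converted item simply has the primary extension appended (as in the xyz.jpeg.pdf example above), which may not hold in every case. The names `original_path`, `map_to_raw_files` and `ADDITIONAL_EXTENSIONS` are made up for illustration.

```python
from pathlib import PurePosixPath

# Assumption: PDF is the primary format of the media set.
PRIMARY_EXTENSION = ".pdf"
# Assumption: illustrative list of additional input formats, not an official one.
ADDITIONAL_EXTENSIONS = {".jpeg", ".jpg", ".png", ".docx", ".pptx"}


def original_path(media_item_path: str) -> str:
    """Undo an appended primary extension, e.g. 'xyz.jpeg.pdf' -> 'xyz.jpeg'.

    Paths that do not look like converted files are returned unchanged.
    """
    path = PurePosixPath(media_item_path)
    if path.suffix.lower() != PRIMARY_EXTENSION:
        return media_item_path
    without_primary = path.with_suffix("")
    # Only strip when what remains ends in a known additional input format,
    # so that a genuine PDF such as 'report.v2.pdf' is left alone.
    if without_primary.suffix.lower() in ADDITIONAL_EXTENSIONS:
        return str(without_primary)
    return media_item_path


def map_to_raw_files(media_item_paths, raw_file_paths):
    """Return {media item path: raw file path} for items whose raw file exists."""
    raw = set(raw_file_paths)
    mapping = {}
    for media_path in media_item_paths:
        candidate = original_path(media_path)
        if candidate in raw:
            mapping[media_path] = candidate
    return mapping
```

Any media item path missing from the returned mapping has no matching raw file and would need to be checked by hand.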
After your input, we realize that we could likely achieve our goal without converting the docs to a media set, as the media items are not really needed in the Ontology at the moment.
Thanks for all the info. Feel free to let us know if there’s anything else to consider that we might have missed.