What would be the best way to parse a BCP file from a dataset that contains just the file (no schema)? The file format is pretty much just a CSV with a different delimiter (##~## instead of commas).
If it’s just a CSV-like text file, you can probably apply a schema, let Foundry try to parse it, and then adjust the parameters for “com.palantir.foundry.spark.input.TextDataFrameReader” under Schema > Details > Edit to use your delimiter.
It seems like the TextDataFrameReader can only handle single-character delimiters, because I get java.lang.IllegalArgumentException: Value for fieldDelimiter must be a single character when I try to edit the delimiter.
If TextDataFrameReader can’t do it and using Code Repositories is an option, I would lean toward reading the file via a Python transform.
Thanks for the tip. We’d prefer to use a UDF over Code Repos, though, so if there’s no way to do this in Builder, I can look into how to read and parse a file in a UDF.
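For anyone following along, here is a minimal plain-Python sketch of what that parsing step inside a UDF could look like. It assumes the ##~## delimiter from my original question; the sample rows and column values are made up for illustration:

```python
# Hypothetical sketch of parsing a BCP-style file with a
# multi-character delimiter in plain Python, roughly what a
# parsing UDF would need to do. The "##~##" delimiter comes
# from the question; the sample data below is invented.

DELIMITER = "##~##"

def parse_bcp_line(line: str) -> list[str]:
    """Split one record into fields on the multi-character delimiter."""
    return line.rstrip("\r\n").split(DELIMITER)

raw = "123##~##Alice##~##2024-01-01\n456##~##Bob##~##2024-02-01\n"
rows = [parse_bcp_line(line) for line in raw.splitlines()]
print(rows)
# → [['123', 'Alice', '2024-01-01'], ['456', 'Bob', '2024-02-01']]
```

Note that a plain split like this does not handle quoted fields containing the delimiter; if your data quotes fields, the UDF would need real CSV-style quote handling on top of this.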
The fieldDelimiter option must be a single character, but the built-in Spark CSV parser’s sep option allows multi-character delimiters, and it takes priority over the fieldDelimiter option. Therefore, a schema like the following should work:
{
  "fieldSchemaList": [],
  "primaryKey": null,
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "options": {
      "addRowNumber": "false",
      "sep": "###"
    },
    "textParserParams": {
      "parser": "CSV_PARSER",
      "charsetName": "UTF-8",
      "recordDelimiter": "\n",
      "fieldDelimiter": "#",
      "quoteCharacter": "\"",
      "dateFormat": {},
      "skipLines": 1,
      "jaggedRowBehavior": "THROW_EXCEPTION",
      "parseErrorBehavior": "THROW_EXCEPTION",
      "addFilePath": false,
      "addFilePathInsteadOfUri": false,
      "addByteOffset": false,
      "addImportedAt": false,
      "initialReadTimeout": "1 hour"
    }
  }
}
You can even leave the fieldDelimiter out entirely in this case, which is what I’d actually recommend, since it’s confusing to keep it around.
Hmm, when I add that config and save the schema, it parses my file into a dataset with 0 rows and 0 columns. It seems like having "fieldSchemaList": [] in the config prevents it from actually parsing the file, even if the parser is configured correctly. If I take "fieldSchemaList": [] out of the config, it just autopopulates once I save – so it seems like there isn’t really any way to have Foundry autogenerate the correct schema.
Sorry, fieldSchemaList was empty in the example only because I don’t know what fields are in your dataset. You’ll have to actually specify the fields there – and yes, I don’t believe automatic schema inference works here at present, so you’ll have to list the fields manually.
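For example, entries in fieldSchemaList look roughly like this (the column names here are placeholders, since I don’t know your actual fields, and I’m only showing the minimal keys):

  "fieldSchemaList": [
    {"name": "id", "type": "STRING", "nullable": true, "customMetadata": {}},
    {"name": "name", "type": "STRING", "nullable": true, "customMetadata": {}},
    {"name": "created_date", "type": "STRING", "nullable": true, "customMetadata": {}}
  ]

You can start with everything as STRING and tighten the types afterwards once parsing works.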
If automatic inference is important, you can use the pattern at https://www.palantir.com/docs/foundry/building-pipelines/infer-schema/#csv, explicitly specifying ### for the sep option. You will have to use a code repo, but the code is trivial in this case, so the maintenance burden should be low.
Ah, got it. Thanks for the help!