What would be the best way to parse a BCP file from a dataset that contains just the file (no schema)? The file format is pretty much just a CSV with a different delimiter (##~## instead of commas).
If it’s just a CSV-like text file, you can probably apply a schema, let Foundry try to parse it, and then adjust the parameters for “com.palantir.foundry.spark.input.TextDataFrameReader” under Schema > Details > Edit to use your delimiter.
It seems like the TextDataFrameReader can only handle single-character delimiters, because I get java.lang.IllegalArgumentException: Value for fieldDelimiter must be a single character when I try to edit the delimiter.
If TextDataFrameReader can’t do it and using Code Repositories is an option, I would lean toward reading the file via a Python transform.
Thanks for the tip. We’d prefer to use a UDF over Code Repos, though, so if there’s no way to do this in Builder, I can look into how to read and parse a file in a UDF.
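For anyone following along, here is a minimal plain-Python sketch of what that parsing step inside a UDF could look like. It assumes the ##~## delimiter from my original question; the sample rows and column values are made up for illustration:

```python
# Hypothetical sketch of parsing a BCP-style file with a
# multi-character delimiter in plain Python, roughly what a
# parsing UDF would need to do. The "##~##" delimiter comes
# from the question; the sample data below is invented.

DELIMITER = "##~##"

def parse_bcp_line(line: str) -> list[str]:
    """Split one record into fields on the multi-character delimiter."""
    return line.rstrip("\r\n").split(DELIMITER)

raw = "123##~##Alice##~##2024-01-01\n456##~##Bob##~##2024-02-01\n"
rows = [parse_bcp_line(line) for line in raw.splitlines()]
print(rows)
# → [['123', 'Alice', '2024-01-01'], ['456', 'Bob', '2024-02-01']]
```

Note that a plain split like this does not handle quoted fields containing the delimiter; if your data quotes fields, the UDF would need real CSV-style quote handling on top of this.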
The fieldDelimiter option must be a single character, but the built-in Spark CSV parser’s sep option allows multi-character delimiters, and it takes priority over the fieldDelimiter option. Therefore, a schema like the following should work:
{
  "fieldSchemaList": [],
  "primaryKey": null,
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "options": {
      "addRowNumber": "false",
      "sep": "###"
    },
    "textParserParams": {
      "parser": "CSV_PARSER",
      "charsetName": "UTF-8",
      "recordDelimiter": "\n",
      "fieldDelimiter": "#",
      "quoteCharacter": "\"",
      "dateFormat": {},
      "skipLines": 1,
      "jaggedRowBehavior": "THROW_EXCEPTION",
      "parseErrorBehavior": "THROW_EXCEPTION",
      "addFilePath": false,
      "addFilePathInsteadOfUri": false,
      "addByteOffset": false,
      "addImportedAt": false,
      "initialReadTimeout": "1 hour"
    }
  }
}
You can even leave the fieldDelimiter out entirely in this case, which is what I’d actually recommend, since it’s confusing to keep it around.
Hmm, when I add that config and save the schema, it parses my file into a dataset with 0 rows and 0 columns. It seems like having "fieldSchemaList": [] in the config prevents it from actually parsing the file, even if the parser is configured correctly. If I take "fieldSchemaList": [] out of the config, it just autopopulates once I save – so it seems like there isn’t really any way to have Foundry autogenerate the correct schema.
Sorry, fieldSchemaList was empty in the example only because I don’t know what fields are in your dataset. You’ll have to actually specify the fields there – and yes, I don’t believe automatic schema inference works here at present, so you’ll have to list the fields manually.
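For example, entries in fieldSchemaList look roughly like this (the column names here are placeholders, since I don’t know your actual fields, and I’m only showing the minimal keys):

  "fieldSchemaList": [
    {"name": "id", "type": "STRING", "nullable": true, "customMetadata": {}},
    {"name": "name", "type": "STRING", "nullable": true, "customMetadata": {}},
    {"name": "created_date", "type": "STRING", "nullable": true, "customMetadata": {}}
  ]

You can start with everything as STRING and tighten the types afterwards once parsing works.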
If automatic inference is important, you can use the pattern at https://www.palantir.com/docs/foundry/building-pipelines/infer-schema/#csv, explicitly specifying ### for the sep option. You will have to use a code repo, but the code is trivial in this case, so the maintenance burden should be low.
Ah, got it. Thanks for the help!