Hey,
I need to ingest many TBs of data, and I have the option of either pulling from JDBC or pulling from some flat files. Let’s assume in this case that 1) the JDBC ingest can be pulled incrementally (so I can do smaller batches) and 2) the backing source is beefy enough to support the number of queries we will make.
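For concreteness, the incremental pull I have in mind looks roughly like the PySpark sketch below - just an illustration, and the connection URL, table, and watermark column are all made up:

```python
# Sketch of one incremental JDBC batch pull; every name here is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-incremental-pull").getOrCreate()

# Watermark persisted from the previous batch (tracking mechanism not shown).
last_watermark = "2024-01-01 00:00:00"

batch = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/source_db")
    # Push the incremental filter down to the source so each batch stays small.
    .option("query", f"SELECT * FROM events WHERE updated_at > '{last_watermark}'")
    .option("user", "reader")
    .option("password", "...")
    .load()
)
```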
At this scale, I would think the directory source would make a lot more sense; here’s my line of thought:
- For JDBC, we first have to write files to disk, and at the scale we’re operating at that becomes a non-trivial cost
- There are additional serialization/deserialization costs that are non-trivial
- As such, I assume the JDBC path would be much more memory-intensive
- With files, we’re not bottlenecked on the backing JDBC source for data extraction
Am I missing any key points? Are there any reasons in favor of JDBC?
I’m mostly talking about Agent (vs. Direct Connection), but I’m curious whether the above still holds true - i.e., with a Direct Connection, are there reasons to choose JDBC?
Is the file format something spark can directly read?
Are the files already present and ready or do they have to be exported from your source first?
When you say directory, do you mean a shared drive mounted on the agent, an S3 bucket, or something else?
Without more context I would tend to agree that pulling in files is faster, but your mileage may vary - I am not sure whether the parallelism can be configured on the file source.
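For reference, in plain Spark (setting aside how the platform’s source types wrap it), JDBC read parallelism is something you configure explicitly, while file reads are split automatically by input size. A sketch with made-up names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JDBC: without explicit partitioning options the read runs as a single task.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-host:5432/source_db")
    .option("dbtable", "events")
    .option("partitionColumn", "id")  # numeric/date/timestamp column to split on
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "64")    # 64 concurrent range queries against the source
    .load()
)

# Files: splittable formats (e.g. Parquet) are partitioned automatically;
# the knob is a target split size rather than an explicit partition count.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
file_df = spark.read.parquet("s3a://my-bucket/events/")
```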
In general we recommend against using the directory source type (pulling directly from the filesystem of the agent host), since that will never be able to scale horizontally and is limited to the memory/CPU of the host machine running the agent. If you do want to sync the data as files, you should be able to find a file-based option that is compatible with direct connection or agent proxy runtimes (e.g. a Samba share, putting the data into a cloud storage bucket, etc.).
To answer your question, though, I tend to agree with Nico that pulling in files is likely faster, but it depends on the specifics.
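For example, if the exported files land in a cloud storage bucket, the read side could look like the sketch below - the bucket path and credentials setup are assumptions, and the hadoop-aws connector needs to be on the classpath:

```python
# Sketch: reading exported files straight from cloud storage instead of the
# agent host's filesystem. The bucket name and path are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Assumes credentials come from the environment/instance profile via the
    # AWS SDK default provider chain.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    .getOrCreate()
)

df = spark.read.parquet("s3a://ingest-bucket/exports/")
```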