Memory overload when trying to sync from source tables

Hello all,

I’m working on a project where we extract around a million rows into Palantir (initial load). When I try to sync, I get a memory overload and I don’t know how to resolve it. I’ve been looking into applying Spark profiles, but I don’t really understand how they work.

Are there other ways to get the data into Palantir?

Thanks in advance!


Hi!
You can adjust the Spark profiles of the builds you’re running to get more memory:

  • for code repo: https://www.palantir.com/docs/foundry/code-repositories/spark-profiles
  • for pipeline builder: https://www.palantir.com/docs/foundry/pipeline-builder/management-build-settings

This will allow you to avoid out-of-memory errors (OOMs). That said, it’s often worth looking into optimizations as well, such as avoiding loading data into memory when you don’t need it, and eliminating expensive, unnecessary shuffles or poorly optimized joins. AIP Assist (and LLMs more generally) can be very useful for detecting bad practices, especially since a few million rows is not such a high number (this of course depends on the content of said rows).
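For the code-repo case, a Spark profile is applied with the `@configure` decorator on the transform. A minimal sketch, assuming the dataset paths are placeholders and that the `DRIVER_MEMORY_LARGE` profile has been imported/enabled in the repository’s settings (this only runs inside a Foundry code repository, not standalone):

```python
from transforms.api import configure, transform_df, Input, Output


# Request a larger driver for this transform; the profile name must be
# enabled for this repository in its settings before it can be used here.
@configure(profile=["DRIVER_MEMORY_LARGE"])
@transform_df(
    Output("/My-Project/datasets/output"),        # placeholder path
    source=Input("/My-Project/datasets/source"),  # placeholder path
)
def compute(source):
    # Keep the data distributed (no .collect()/.toPandas()) to avoid OOMs.
    return source
```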

Are you syncing data via an Agent or Direct Connection?

If it’s via an Agent, you may want to consider increasing the memory on the agent.

Hey there @cc36b9a87e50cd105460, there’s a way to configure the Spark profiles on the data sync itself. These steps were provided by Palantir support when we struggled with a sync that was hitting OOM issues (no transforms involved; the sync itself was failing).

As far as I know, a native “GUI” way of doing this via the data sync config is coming sometime in the future, but in the meantime you should be able to follow these steps to configure the sync’s Spark profile.

What you need: a Foundry data sync and PowerShell (or another command-line tool)

Step-by-step instructions

1 – Open the specific data sync you want to configure the Spark profile for.

2 – In the browser, open the developer tools (Ctrl+Shift+I in Microsoft Edge, or More Tools → Developer Tools in the settings menu).

3 – Navigate to the “Network” tab in the developer tools.

4 – Edit any setting on the data sync, press “Save” and refresh the page.

5 – Look for a network request whose name starts with something like: ri.maggritte…

This is the identifier for the configuration of the data sync.

6 – Copy the request as a PowerShell command (in Chromium-based dev tools: right-click the request → Copy → “Copy as PowerShell”). Paste the contents into PowerShell and look for the sparkProfiles parameter (it should be last).

7 – Edit that value to the Spark profile you want. The default is DRIVER_MEMORY_MEDIUM; try increasing it to DRIVER_MEMORY_LARGE.

8 – Press enter and wait…

9 – If the return status code is 204, the request succeeded: the Spark profile for that data sync is now set to the selected one (in this case, DRIVER_MEMORY_LARGE). You can validate the configuration by repeating steps 1–5 and checking that the copied request already shows the new Spark profile.

10 – Revert any temporary changes in the sync and run it with the updated spark profile.
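The edit in steps 6–7 boils down to changing one field in the request body before re-sending it. A rough illustration of that edit, assuming the body is JSON: the surrounding keys and the shape of the sparkProfiles value (a list here) are hypothetical, and only the field name and profile names come from the steps above.

```python
import json

# Hypothetical request body copied from the dev tools; only "sparkProfiles"
# and the profile names are taken from the steps above.
config = json.loads('{"name": "my-sync", "sparkProfiles": ["DRIVER_MEMORY_MEDIUM"]}')

# Step 7: swap the default driver-memory profile for a larger one.
config["sparkProfiles"] = ["DRIVER_MEMORY_LARGE"]

# This edited body is what gets sent back in step 8.
body = json.dumps(config)
print(body)
```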

Hope this helps!

Best,
