As a data engineer, I like several aspects of it:
- speed (a cluster stays up while you’re developing)
- pipeline visualization (often of a micro-pipeline)
- actual code writing
- price (batch computing)
- support for several languages
- support for graph visualization
For a data engineer in development, speed helps a lot when you need to try several versions of an algorithm, get results quickly, and move on. Pipeline visualization helps you instantly spot opportunities for code optimization and the need for caching or checkpointing.
For an organization, the price is also a significant factor.
Is there any other tool under development which would cover the above points?
Code Workspaces may cover speed, but you do not see a pipeline there, and the price is, what, tens of times higher?
Data Lineage may show pipelines, but every transformation node there has to be an actual transformation saved as a dataset. When we intentionally create datasets, we tend to put many steps inside them, and then you lose track of the micro-pipeline. This is where I see the value of Code Workbook’s Preview: we can create micro-pipelines, see them visually, and it is quick.
Pipeline Builder’s drawback for me as a data engineer is that it has no code. In a world of LLMs, it is crucial to be able to write code while developing. For some tasks we may try several options suggested by different LLMs. Palantir may offer AI assistance in Pipeline Builder, but it will never solve ALL tasks. We simply need to be able to ask different LLMs and use their answers, which come as code.
How do other data engineers perform their day-to-day tasks? Which tools do you use?
Thanks for your good questions! I’m glad you pointed out a few of Code Workbook’s (CWB) helpful features for data engineers. In my conversations with users, people often appreciate the graph view for its walk-up usability, its intuitive format for testing “forks” of code, and its handy structure for copying and pasting code chunks between notebooks. Users also mention CWB’s support for multiple languages, its flexible templating framework, and its ability to handle datasets of any scale (via Spark integration).
CWB predates many of Foundry’s modern tools, and was an answer to data engineers’ needs before we could host applications like JupyterLab and RStudio within the platform. In the spirit of adopting more open, contemporary, and widely used frameworks, we’ve decided to invest in Code Workspaces (CWS) and other tools instead of further developing CWB, which has always been a highly bespoke, Foundry-specific tool. The “Legacy” label is meant to signal that we will support CWB as it exists today (and continue to push critical updates as needed) but will not expand its current functionality.
In particular, I hope you try CWS as a more modern replacement for a few reasons:
@nicornk is correct that CWS costs 1/10th as much per compute-second as CWB. This is partly because CWS runs on one machine and does not require spinning up a live Spark cluster for every user session.
With a Jupyter workspace, you can write Python, R, and SQL similarly to how you can in CWB.
To execute R, you can install the r-irkernel package and get an R kernel in your Jupyter notebook.
To run SQL, you can use this pattern to execute a query with foundry-sql-server and save the results in memory. This functionality works for datasets as well, not only for restricted views (documentation about this is coming soon). This SQL approach supports certain queries today (e.g., no joins) and will broaden over time.
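I won’t reproduce the exact foundry-sql-server client calls here, so treat the following as a shape-of-the-pattern sketch only: `execute_sql` is a hypothetical stand-in for whatever helper the linked pattern provides, and the point is simply that the query result ends up as an ordinary in-memory DataFrame.

```python
import pandas as pd

def execute_sql(query: str) -> pd.DataFrame:
    """Hypothetical stand-in for the foundry-sql-server pattern linked above.

    The real helper runs the query against the Foundry SQL server; here we return
    a tiny hard-coded frame so the sketch is runnable on its own.
    """
    return pd.DataFrame({"customer_id": [1, 2], "amount": [10.0, 25.5]})

# The result lives in workspace memory as a regular DataFrame, so everything
# downstream is plain pandas.
df = execute_sql("SELECT customer_id, amount FROM my_dataset LIMIT 1000")
print(df.groupby("customer_id")["amount"].sum())
```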
With an RStudio workspace, you can write R, Python, and SQL similarly to how you can in CWB.
To execute Python, you can install the reticulate package and write .py files; to run SQL queries, you can use the containers-sql library described above.
We are adding improved ways to work with large datasets in CWS. At the end of the day, the data you work with must fit into the memory of your running workspace (a default maximum of 64 GB of RAM), but this is sufficient for the vast majority of interactive use cases. With the SQL pattern above, or by using lazy polars filters to load a subset of a Parquet-backed dataset, you can load a slice of TB-scale data into your container without a problem.
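To make the lazy polars pattern concrete, here is a minimal sketch; the Parquet path and column names are hypothetical placeholders, and the exact way you point at a dataset’s files in CWS will differ, but the structure (lazy scan, filter, then collect) is the point:

```python
import polars as pl

# Hypothetical path to the dataset's Parquet files; the real location/accessor
# inside a Code Workspace will differ.
PARQUET_GLOB = "my_dataset/*.parquet"

# scan_parquet builds a lazy query plan; nothing is loaded into memory yet.
subset = (
    pl.scan_parquet(PARQUET_GLOB)
    .filter(pl.col("event_date") >= pl.date(2024, 1, 1))  # filter is pushed down to the scan
    .select(["event_date", "customer_id", "amount"])       # read only the columns you need
    .collect()                                              # materialize just the filtered subset
)

print(subset.shape)
```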
It’s a common misconception that a Spark environment is always necessary just because it’s nice not to have to worry about scale limits. But it’s actually often more expensive to sit around and wait for large-scale data processing to complete when 1) your data might be small enough to fit on a single machine or 2) you could perform the same analysis on a smaller cut of data.
If you do need to test out PySpark code in CWS (as you might in CWB), you can set up a local Spark environment with these instructions.
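For illustration, a local Spark session in a Jupyter workspace can look something like the sketch below, assuming pyspark has been added to the workspace environment (the linked instructions are the authoritative setup steps):

```python
from pyspark.sql import SparkSession

# Local-mode Spark: the driver and executors all run inside the workspace container.
spark = (
    SparkSession.builder
    .master("local[*]")                   # use all cores available to the container
    .appName("cws-local-spark")
    .config("spark.driver.memory", "8g")  # stay within the workspace's RAM budget
    .getOrCreate()
)

# Quick smoke test that PySpark code runs as it would in CWB.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "label"])
df.groupBy("label").count().show()

spark.stop()
```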
While there is no easy or immediate replacement for CWB’s graph view, we are exploring how starter templates for CWS (along the lines of starter templates for Code Repositories) could help inform new users about CWS capabilities and give them a clear jumping-off point to write additional code.
We are also exploring AIP features for CWS that will dramatically transform how users write code in CWS. LLMs can write idiomatic and effective code in a Jupyter notebook (for example) because it is an open-source framework that they likely have been trained on. Keep an eye out for this!
Finally, there are a variety of other CWS capabilities that echo or exceed what you can do in CWB: