Best Practices for Scanning Datasets at Scale (10k+ tables, ~100 cols each) for Sensitive Data Detection

I’m currently working through a workflow where I need to scan thousands of datasets at scale for CUI/sensitive data detection as part of CMMC compliance. We’re looking at 10k+ tables across multiple data sources, averaging roughly 100 columns each. The 10k count spans many different source systems, so we can’t expect similar schemas - we’ll basically have to treat each table as its own thing.

Some questions for the community:

  • Any specific SDS configuration recommendations for high-volume scanning?

  • How do you handle schema variations across large numbers of datasets?

Would love to hear from other people who’ve tackled similar large-scale sensitive data scanning challenges. Thanks!

Hey,

Did this a few years ago when SDS was still called Inference; here are the practices we followed:

  1. List out all possible types of sensitive information you want to search for (e.g., phone numbers, addresses, social security numbers, credit card numbers).
  2. Map out all possible patterns for each type of sensitive data (e.g., a phone number can arrive as a StringType() or a LongType() column from the external source; it might be just the 10-digit number, or it might carry the country code as a prefix, with or without the ‘+’ sign before the digits).
  3. Formulate the regex for every pattern.
  4. Thoroughly test the regex patterns before deploying (extremely, extremely important – I once accidentally flipped the regex for an SDS condition, and the noise it created at that scale was regrettable).
  5. Test regex conditions on a small sample of rows to make sure it works as expected before deploying everywhere.
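The pattern-mapping and testing steps above can be sketched outside SDS itself. This is a minimal, hypothetical example (the pattern list and helper names are my own, not SDS configuration) showing how one sensitive-data type – a phone number – expands into several regexes, and why testing against both known positives and known negatives matters before deploying:

```python
import re

# Hypothetical phone-number patterns covering the variants described above:
# bare 10 digits, a country-code prefix, and '+' plus the country code.
PHONE_PATTERNS = [
    re.compile(r"\d{10}"),     # 5551234567
    re.compile(r"1\d{10}"),    # 15551234567 (country code, no '+')
    re.compile(r"\+1\d{10}"),  # +15551234567
]

def looks_like_phone(value) -> bool:
    """Normalize to str (LongType columns arrive as ints) and try each pattern."""
    text = str(value).strip()
    return any(p.fullmatch(text) for p in PHONE_PATTERNS)

# Steps 4-5: check known positives AND known negatives before rolling out,
# so a flipped or overly broad regex surfaces here instead of in production.
positives = ["5551234567", 15551234567, "+15551234567"]
negatives = ["555-123", "hello", 42, "+445551234567"]

assert all(looks_like_phone(v) for v in positives)
assert not any(looks_like_phone(v) for v in negatives)
```

Running the negatives matters as much as the positives: at 10k tables, a pattern that is one character too permissive generates noise faster than you can triage it.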

It’s up to you and your organization how you implement the downstream workflows from here, but a common practice is: if you create one Datasource Project per external system (marking these projects with a prefix like “[DATASOURCE]”), apply the SDS conditions to just those projects.

Optionally, leverage the Issues app to create an issue (or update an existing one that hasn’t been closed), or automatically apply a marking to tables flagged by SDS conditions to restrict downstream propagation. In doing so, you can build a workflow where any data used in downstream pipelines and analysis must be scrubbed as far upstream as possible before anyone can use it.

Hope this helps!


Great question! There are a couple of features in Sensitive Data Scanner that can specifically help with scanning at scale:

  1. Use the scan strategy feature in SDS to exempt scanning derived data: In most cases, sensitive data must come from some entry point to the platform and won’t spawn within a transformation. So, by scanning only the transaction types that represent entry points to the platform (uploads, ingest, fusion sheet syncs, writebacks, etc.) as a first pass, that’s the best way to know where the sensitive data is coming from and then you can apply markings to cover everything downstream. You can do this by selecting the “Scan only source datasets” scan strategy option.
  2. Exempt already-marked data: You can configure your scan to exempt data that has already been marked as sensitive. For example, if you have a CUI marking on datasets, you might not need to scan data that has already been marked CUI a second time.
  3. Use column-name matching: If you can provide a column-name regex match condition, SDS will first try to match on the column name before running a more expensive build. If SDS finds a match on the column name, it will forgo running the build (assuming you’re looking for a match on Column OR Content).
  4. Row selection strategy: If you think the distribution of sensitive data is fairly uniform within a column, use the row selection strategy to scan only a subset or sample of the data. This lets you trade off cost vs. completeness, which is helpful at scale if you don’t need to look at every value in a column.
  5. Scan scheduling: If you are doing recurring scanning – instead of one-time scanning – you should consider using some of the scheduled options (daily/weekly/monthly) for recurring scans. Continuous recurring scans are the fastest scanning option since SDS will scan almost right after a transaction commits on a dataset, but it’s the most cost-intensive as a result. So, a scheduled recurring scan is a good compromise between the guarantees of automated scanning and the potential compute cost.
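To illustrate why the column-name pass in point 3 is cheap relative to a content scan, here is a hypothetical sketch (the condition names and helper are mine, not SDS’s API): name matching is a handful of regex checks per table, and only columns with no name match would fall through to the expensive content build.

```python
import re

# Hypothetical column-name conditions; a real SDS condition would pair each
# of these with a content regex for the "Column OR Content" behavior.
NAME_CONDITIONS = {
    "phone": re.compile(r"phone|mobile|cell", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social.?security", re.IGNORECASE),
}

def classify_columns(column_names):
    """Return {column: condition_label} for names that match; unmatched
    columns are the only ones that need the costlier content scan."""
    flagged = {}
    for name in column_names:
        for label, pattern in NAME_CONDITIONS.items():
            if pattern.search(name):
                flagged[name] = label
                break
    return flagged

cols = ["customer_phone", "ssn_hash", "order_total", "MobileNumber"]
flagged = classify_columns(cols)
# customer_phone and MobileNumber match "phone"; ssn_hash matches "ssn";
# order_total has no name match, so only it would need a content scan.
```

Across 10k tables at ~100 columns each, this first pass can rule in a large share of columns by name alone, which is where most of the cost savings at scale comes from.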