Seeking Advice on Combining Data Sources

I’m currently learning Foundry and working on creating an agent that can answer questions related to FAA regulations. I have two sets of data sources: one extracted from eCFR.gov via API, and the other consisting of various PDF documents.

My initial challenge was figuring out how to combine these data sources into a single dataset. I processed each data source separately through chunking and built individual datasets. In a subsequent pipeline, I combined these datasets, applied an LLM, and generated chunk, entity, and join-table datasets.

Am I on the right track with this approach, or is there a more efficient way to consolidate data from different sources into one cohesive dataset? Image of the pipelines attached.


Nice question! And thanks for the clear idea of what you’ve built so far.

For me, at least, when figuring out a solution design in AIP, I’m always working backwards from the “functional requirements” [1]. In this case, from what you’ve shared, the requirement is for a user to interact with an agent and get information backed by many different sources.

Short-cutting a bit, we can consider two general data architectures:

  1. One object type per source
  2. A generic “document” object type

There are pros and cons to each, but I think if the primary intent is to prepare this content for an LLM and have users interact with it through an Agent, you’re better off working toward a generic “document” ontology. You might even find a “ready to use” example in the Documents Suite in the Marketplace Examples app.

That brings us back to your initial question, then, of how to get the various sources to match the same “schema” so that they can feed into a single object type. And here I’ll say that you do seem to be on the correct approach: I would expect this to look like (in at least one implementation option) a Pipeline Builder pipeline where each source has its own set of transforms that do things like extract the text, normalize and rename columns, handle source-specific values, etc.
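If it helps to picture it, here’s a rough sketch of what one of those per-source normalization steps might look like if you wrote it as a Python transform in Code Repositories rather than in Pipeline Builder. The dataset paths and column names below are placeholders, not anything from your actual pipeline:

```python
from pyspark.sql import functions as F
from transforms.api import transform_df, Input, Output


@transform_df(
    Output("/Project/datasets/ecfr_documents_normalized"),  # placeholder path
    source=Input("/Project/datasets/ecfr_title14_raw"),     # placeholder path
)
def normalize_ecfr(source):
    # Map the eCFR-specific columns onto the shared "document" schema.
    # These column names are assumptions; use whatever your raw dataset actually has.
    return source.select(
        F.col("section_id").alias("document_id"),
        F.col("section_heading").alias("title"),
        F.col("body_text").alias("text"),
        F.lit("eCFR").alias("source_system"),
    )
```

The point is simply that each source gets its own step whose only job is to land on the shared document schema.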

Then, you might union the various sources together and do any shared transformations. This might include chunking and embedding the text so the Agent can use a semantic search to find relevant context for the user; or generating a summary of each document with the Use LLM board.
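Again purely as an illustrative sketch (the dataframe names, chunk size, and overlap are placeholders, and in Pipeline Builder you’d likely use the built-in chunking and embedding boards rather than a UDF), the union-and-chunk step could look something like:

```python
from pyspark.sql import functions as F, types as T


def chunk_text(text, size=1500, overlap=200):
    # Naive fixed-size character chunking; tune size and overlap for your content.
    if not text:
        return []
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


chunk_udf = F.udf(chunk_text, T.ArrayType(T.StringType()))

# Both inputs are assumed to already share the normalized "document" schema.
combined = ecfr_normalized.unionByName(faa_docs_normalized)

chunked = (
    combined
    .withColumn("chunk", F.explode(chunk_udf(F.col("text"))))
    .withColumn("chunk_id", F.monotonically_increasing_id())
)
```

Once everything is unioned, any shared work (embedding, summarization with Use LLM, entity extraction) only has to be defined once.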

So I’d say you’re definitely on the right track! Let us know where you run into any challenges along the way!

[1] You can read a much more thorough description of my approach to “solution design” in the docs here


Thank you for your detailed reply.

Before consolidating the two data sources, I ran them through separate pipelines as an exercise to learn the process. In the “Use LLM” transform, I used GPT-4o, and the build times for all the datasets were as follows:

Data source 1. FAA Docs: 3 hrs
Data source 2. eCFR Title 14: 8 - 10 hrs

When I combined the data sources, the build took 10 - 12 hrs with GPT-4o. However, with Gemini 1.5 Pro, it has been building for more than 2 days now (refer to the screenshot).

Why Gemini 1.5 Pro? After some trials, I thought Gemini was doing a better job of extracting the array of entities for each chunk.

I’ve been playing around with the entity extraction because the agent’s replies are not as detailed or accurate as expected, and it is not able to link all the relevant details across the data sources.

Any recommendations on how I could improve the agent’s replies and reduce build time? I understand it is a pretty broad question. I’m currently checking for similar examples in the Marketplace. I have pasted the LLM instruction I used in the “Use LLM” transform below.

Thank you
Kannan

Instruction used in “Use LLM” Transform:

##LLM Instructions

Your task is to summarize a short, meaningful text snippet from the content and extract the relevant entities from the following list.

Entities to Extract:

  1. Title Number - The unique section identifier (e.g., 25.561).
  2. Regulation Number - The specific regulation identifier (e.g., 25.561).
  3. Section Title - The official title of the regulation.
  4. Regulation Type - The category of the rule (e.g., Airworthiness Standard, Operational Requirement, Certification Process).
  5. Regulation Level - Indicates the level of the hierarchy (e.g., Primary Rule, Sub-rule, Amendment).
  6. FAA Advisory Circular Reference - Related FAA advisory circulars providing interpretation or guidance.
  7. Amendment History - Identifies historical changes to a regulation.
  8. Requirement - Main compliance expectations (e.g., “Must withstand 9G impact.”).
  9. Condition - Applicable circumstances for the requirement (e.g., “Only applies to passenger aircraft.”).
  10. Action - Required or prohibited actions (e.g., “Conduct stress testing for all joints.”).
  11. Limitations - Defined operational limits (e.g., “Maximum altitude 41,000 ft.”).
  12. Exception Criteria - Conditions under which compliance is not required.
  13. Testing Methodology - Required compliance testing (e.g., Drop test, Wind tunnel test).
  14. Penalties - Consequences of non-compliance (e.g., Certification revocation, fines).
  15. Performance Standards - Specific numerical standards, limits, or log requirements.
  16. Documentation Requirement - Required reports, manuals, or logs for compliance.
  17. Certification Type - The type of approval required (e.g., Type Certificate, Supplemental Type Certificate).
  18. Reference Regulation - Related regulations mentioned in a section (e.g., “See 25.562 for crashworthiness standards.”).
  19. Industry Standard Reference - External ASTM, SAE, ISO, or MIL standards referenced.
  20. Aircraft Category - The regulation’s applicability (e.g., Transport, Non-transport, Rotorcraft).
  21. Component Category - Component type (e.g., Wing, Landing Gear, Electrical System).
  22. Aircraft System - Related aircraft subsystems (e.g., Flight Controls, Avionics).
  23. Impact on Other Regulations - Regulatory interdependencies (e.g., “Compliance with 25.561 affects 25.601.”).
  24. Department - The regulatory domain responsible (e.g., Flammability, Damage Tolerance, Structural Integrity).
  25. Regulatory Body - The agency enforcing the regulation (e.g., FAA, EASA).
  26. FAA Office - The specific FAA office responsible (e.g., Transport Airplane Directorate).
  27. Manufacturer Responsibility - Specific obligations for OEMs (e.g., Boeing, Gulfstream).
  28. Operator Responsibility - Specific obligations for airlines or aircraft owners.
  29. Effective Date - The date when the regulation became applicable.
  30. Revision Date - The latest publication date of the regulation.
  31. Regulatory Status - Whether the rule is Active, Amended, or Revoked.
  32. Jurisdiction - The geographical scope (e.g., US, EU, International).
  33. Document Type - The type of document referencing the regulation (e.g., MOC, Rulemaking Docket).
  34. Interpretation Notes - Additional clarifications or official legal interpretations.
  35. Public Comment Period - Timeline for stakeholder feedback.
  36. Legislative Origin - The underlying legislative statute that led to the regulation (e.g., Federal Aviation Act).

##Entity Extraction Guidelines:

  • Each extracted entity should be in singular form without any prefixes, adjectives, or explanations.
  • DO NOT use generic or meaningless single-letter entities like “J”.
  • DO NOT extract random, isolated numbers unless they represent meaningful identifiers (e.g., 25.561).
  • DO NOT use any meaningless sequences of numbers like 1, 2, 17 unless they refer to a regulation.
  • DO NOT extract a list of multiple unrelated regulations (e.g., “23, 12, 10”) as a single entity.
  • Extract data like 8100.8D, 8100-9, etc.

If a content snippet is linked to a Title/Chapter/Subchapter, the entity should be extracted in full format.
For example:
If a text is linked to “Title 14 / CHAPTER III - COMMERCIAL SPACE TRANSPORTATION, FEDERAL AVIATION ADMINISTRATION, DEPARTMENT OF TRANSPORTATION / PART 406 - INVESTIGATIONS”,
DO NOT extract just “Subchapter B - Procedure”. Instead, preserve the full structure.
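(For context, the rough shape of structured output I’m hoping to get for each chunk is something like the snippet below; the values are made-up placeholders drawn from the examples in the instruction above, not real extractions from my data.)

```python
# Illustrative only: the shape of output I expect per chunk, not a real extraction.
example_chunk_output = {
    "summary": "Seat and restraint systems must protect occupants during an emergency landing.",
    "entities": {
        "Regulation Number": "25.561",
        "Regulation Type": "Airworthiness Standard",
        "Requirement": "Must withstand 9G impact",
        "Reference Regulation": "25.562",
        "Aircraft Category": "Transport",
    },
}
```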
