Parsing a large PDF of resumes in pipeline or AIP

polemics · January 24, 2025, 2:58pm

Hello community! I’ve got a project and don’t see a similar issue covered here. I routinely get a single large PDF file (~250 pages) of resumes. The resumes vary in length and formatting, with no consistent separator or delimiter between them.

I was exploring the use of AIP Agents, AIP Logic, and/or Pipeline (Use LLM) to split these out into separate PDFs. Alternatively, in a pipeline, I could use an LLM to extract text from the PDF into a table format that Pipeline (Use LLM) can iterate across for a first-pass review.

Anyone have any similar projects or have thoughts on which combination of tools would best serve here?

nickk · January 24, 2025, 5:03pm

We have recently added, and are continuing to add, greater support for PDFs in both Logic and Pipeline Builder. Here is a breakdown of everything we are capable of, at least from the Pipeline Builder and Logic perspective. I will tag someone to cover the AIP Agents side, as I lack context there.

PDF Text Extraction:

Currently, we support text extraction from PDFs in both Logic and Pipeline Builder.
Can extract on any range of pages you would like!

Separating PDFs into a range of pages:

Both Pipeline Builder and Logic are tracking support for this (i.e. given a range of pages, only give me this sub-range of pages).
Unfortunately, separating a PDF into relevant PDF chunks (individual resumes in your case) with an LLM is largely unsupported. It is a very interesting use case, but a lot of scoping would be involved here.

Passing PDFs as images to a Vision model:

Logic has support for this (as long as you operate with Media References)
Pipeline Builder support is coming soon!
The only caveat is that you have to manually convert each PDF page to an image before sending it to the LLM, which is not great. We are tracking improvements for this now (i.e. a custom transformation to convert a page range of a PDF to a list of images).

You can build a very rudimentary version of what you are looking for with the tools above, but we are currently scoping and trying to unblock larger multi-page PDF processing and LLM workflows like yours. I think previously all of this had to be done through custom Functions in code repos, but first-class support for media across all of these services is something we are improving!

polemics · January 24, 2025, 6:48pm

Interesting, thanks for that!

Reflecting on your reply, I may try to have an LLM flow of some kind identify the start/stop page numbers of each resume, then split out the PDFs somehow, building on your point: “Can extract on any range of pages you would like!”
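For any future readers, the page-range idea can be sketched in plain Python. This is only an illustration under assumptions: it supposes an LLM pass has already returned the 1-based page on which each resume starts, and the function name `page_ranges_from_starts` is hypothetical, not part of Logic or Pipeline Builder. It turns those start pages into inclusive start/stop ranges that could then be passed to a page-range text extraction step.

```python
from typing import List, Tuple


def page_ranges_from_starts(start_pages: List[int], total_pages: int) -> List[Tuple[int, int]]:
    """Turn 1-based resume start pages into inclusive (start, end) page ranges.

    Each resume is assumed to run until the page before the next resume starts;
    the last resume runs to the end of the file.
    """
    # Deduplicate and sort, since LLM output may be unordered or repeated.
    starts = sorted(set(start_pages))
    if not starts:
        return []
    if starts[0] < 1 or starts[-1] > total_pages:
        raise ValueError("start pages must lie within 1..total_pages")

    ranges = []
    for i, start in enumerate(starts):
        # End just before the next start, or at the last page of the file.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        ranges.append((start, end))
    return ranges
```

For example, start pages 1, 4, and 6 in an 8-page file give the ranges (1, 3), (4, 5), and (6, 8). Note that pages before the first reported start are not assigned to any resume, so a cover page would simply be left out.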

-Polemics

narmbrust · January 24, 2025, 7:04pm

Hi @polemics,

From the AIP Agents perspective, if you’d like to quickly ask ad hoc questions about this 250-page PDF, you can upload it directly from your desktop into AIP Threads via the Document upload feature.

I suggest Anthropic Claude or Google Gemini for their large context windows, given that your PDFs are large.

If you have a pipeline set up via Pipeline Builder or Transforms, you could interact with the Ontology you’ve created via an AIP Agent by using Ontology Context (which needs an embedding property, most likely of your extracted PDF text) or Function-backed context.

Both of these will allow you to easily interact back and forth with your documents in the AIP Agent view mode, AIP Threads, or the AIP Interactive Widget.

nickk · January 24, 2025, 9:35pm

Yup agree! Once we get better at sending multi-page documents to the LLM in a very easy, low-lift way, this sort of workflow should definitely be possible!

polemics · January 26, 2025, 4:05pm

Hello all and any future readers. I did get some moderate success chaining together 3 tools. Here is the gist:

1. Feed the large PDF to AIP Threads (I had to break it into 2 PDFs since the full file exceeded the context window) → ask AIP Threads to list the resume name and PDF start/stop page for each resume.
2. Feed the large PDF of resumes to a pipeline → use Generate Transform and ask it to extract text based on the names/page numbers from Step 1.
3. Use a series of LLM Transforms to add the applicant name and start/stop pages as new columns.
4. (Not part of my original ask) Use an LLM to run a basic scan of each resume’s qualifications against my job description and rank it as ‘qualified’ / ‘not qualified’.
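Since the page numbers in this flow come from an LLM, a sanity check before the extraction step may be worthwhile. Below is a minimal sketch in plain Python, not Foundry code: `ResumeSpan`, `check_spans`, and `build_rows` are hypothetical names, and `page_text` is assumed to be a mapping from 1-based page number to the text already extracted for that page. It flags gaps, overlaps, and out-of-range pages, then assembles one table row per resume.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ResumeSpan:
    name: str          # applicant name as reported by the LLM
    start_page: int    # 1-based, inclusive
    end_page: int      # 1-based, inclusive


def check_spans(spans: List[ResumeSpan], total_pages: int) -> List[str]:
    """Return human-readable problems found in the LLM-reported page spans."""
    problems = []
    expected = 1  # next page we expect a resume to start on
    for s in sorted(spans, key=lambda s: s.start_page):
        if s.end_page < s.start_page:
            problems.append(f"{s.name}: end page before start page")
        if s.start_page < 1 or s.end_page > total_pages:
            problems.append(f"{s.name}: pages outside 1-{total_pages}")
        if s.start_page > expected:
            problems.append(f"pages {expected}-{s.start_page - 1} unassigned before {s.name}")
        elif s.start_page < expected:
            problems.append(f"{s.name}: overlaps previous resume")
        expected = max(expected, s.end_page + 1)
    if expected <= total_pages:
        problems.append(f"pages {expected}-{total_pages} unassigned at end")
    return problems


def build_rows(spans: List[ResumeSpan], page_text: Dict[int, str]) -> List[dict]:
    """Join per-page text into one row per resume, with name and page columns."""
    rows = []
    for s in spans:
        pages = range(s.start_page, s.end_page + 1)
        rows.append({
            "applicant_name": s.name,
            "start_page": s.start_page,
            "end_page": s.end_page,
            # Missing pages contribute an empty string rather than failing.
            "resume_text": "\n".join(page_text.get(p, "") for p in pages),
        })
    return rows
```

An empty list from `check_spans` means every page belongs to exactly one resume. Anything else is a cue to re-ask the LLM or correct the spans by hand before ranking candidates.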

So 90% mission complete here.

The next item I’m picking at is how to display the resumes, or at least hyperlink to the right resume page, or extract the qualified resumes. I have a MediaReference column, but I don’t see a clean way to use it in Workshop or another application. The documentation says to feed this into an ontology object and then use the ontology in Workshop, but my organization is pretty tight on the number of ontologies created.

Extremely happy with these tools. My slogan at work is ‘we are only limited by our imagination’.
