unstructured PDF data
I have experimented a lot with several Python libraries for reading and extracting text from unstructured PDF files.
My specific use case: as a preprocessing step, I need a library that understands complex tables as well as regular sections, such as a cover page with important data, all within the same PDF file.
Table cells can contain lists of data or even nested tables; in short, unstructured data made by humans.
In the next step I can create embeddings from the extracted content and store them in a vector database (to use with a model later), but that is not the problem of this post.
After many failed attempts with "normal" Python libraries, I found the library from https://unstructured.io. It helps me a lot to keep the content of tables semantically together, so that I can run searches on it later.
But I am quite unsure whether I am on the right track or whether there is some other, easier technique for working with such PDF files.
Also, I am sure that a lot of AI apps have similar requirements. What do you think?
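
To illustrate what "keeping the content of tables semantically together" could look like before the embedding step, here is a minimal sketch, not a definitive implementation. It assumes the `unstructured` package is installed with its PDF extras and that `partition_pdf` returns elements with `category` and `text` attributes. The `Element` class, `load_elements`, `chunk_elements`, and the `max_chars` parameter are hypothetical names made up for this example; the chunking rule (a heading starts a new chunk, a table is never split) is one simple choice among many.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Element:
    """Hypothetical stand-in for a parsed PDF element (not a library class)."""
    category: str  # e.g. "Title", "NarrativeText", "Table"
    text: str


def load_elements(pdf_path: str) -> List[Element]:
    """Parse a PDF with unstructured (assumed to be installed with PDF extras)."""
    # Imported lazily so the chunking logic below works without the library.
    from unstructured.partition.pdf import partition_pdf

    raw = partition_pdf(
        filename=pdf_path,
        strategy="hi_res",            # layout-aware parsing, slower
        infer_table_structure=True,   # ask for table structure where possible
    )
    return [Element(category=el.category, text=el.text) for el in raw]


def chunk_elements(elements: Iterable[Element], max_chars: int = 500) -> List[str]:
    """Group elements into chunks; a table is never split and stays with its heading."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n".join(current))
        current, size = [], 0

    for el in elements:
        if el.category == "Title":
            flush()  # a heading always starts a new chunk
        elif el.category != "Table" and size + len(el.text) > max_chars:
            flush()  # plain text may be split across chunks, tables may not
        current.append(el.text)
        size += len(el.text)
        if el.category == "Table":
            flush()  # close the chunk right after the table so it stays whole
    flush()
    return chunks


if __name__ == "__main__":
    demo = [
        Element("Title", "Prices"),
        Element("Table", "item | price\nA | 1\nB | 2"),
        Element("NarrativeText", "Closing note"),
    ]
    for chunk in chunk_elements(demo, max_chars=50):
        print(repr(chunk))
```

Each resulting chunk could then be embedded as one unit, so a search hit on a table row also brings back the heading that gives it context.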
Chris B