I have experimented a lot with several Python libraries for reading and extracting text from unstructured PDF files.
My specific use case: as a preprocessing step, I need a library that understands complex tables as well as ordinary sections, such as a cover page with important data, all within the same PDF file.
Table cells can contain lists of data or even nested tables; in other words, unstructured data made by humans.
In a next step I can create embeddings from the extracted content and store them in a vector database (to use with a model later), but that is not the problem this post is about.
After many failed attempts with "normal" Python libraries, I found the library from https://unstructured.io. It helps me a lot in keeping the content of tables semantically together so that I can run searches on it later. But I am quite unsure whether I am on the right track or whether there is some other, easier technique for working with such PDF files.
Also, I am sure that a lot of AI apps have similar requirements. What do you think?
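
To make "keeping the content of tables semantically together" more concrete, here is a minimal sketch of the chunking idea, not the actual unstructured API. It assumes the PDF has already been parsed into a flat list of elements, each with a category and a text. The `Element` class, the `chunk_elements` function and the `max_chars` parameter are hypothetical names I made up for illustration; the category strings ("Title", "NarrativeText", "Table") are only assumed to resemble what a parsing library might return. The point is that a table is never split across chunks and always carries its section title as context.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for one parsed PDF element (not a real library class)."""
    category: str  # assumed values: "Title", "NarrativeText", "Table"
    text: str


def chunk_elements(elements: List[Element], max_chars: int = 500) -> List[str]:
    """Group elements into chunks for embedding; every table stays in one chunk."""
    chunks: List[str] = []
    buffer: List[str] = []   # body text collected under the current title
    current_title = ""

    def flush() -> None:
        # Emit the buffered body text, prefixed with its section title.
        if buffer:
            prefix = [current_title] if current_title else []
            chunks.append("\n".join(prefix + buffer))
            buffer.clear()

    for el in elements:
        if el.category == "Title":
            flush()
            current_title = el.text
        elif el.category == "Table":
            # A table is never split, regardless of max_chars.
            flush()
            prefix = [current_title] if current_title else []
            chunks.append("\n".join(prefix + [el.text]))
        else:
            # Start a new chunk when the body text would exceed max_chars.
            if buffer and sum(len(t) for t in buffer) + len(el.text) > max_chars:
                flush()
            buffer.append(el.text)

    flush()
    return chunks


# Made-up example data: a cover section and a price table.
elements = [
    Element("Title", "Cover"),
    Element("NarrativeText", "Customer: ACME, Date: 2024-01-01"),
    Element("Title", "Prices"),
    Element("Table", "Item | Price\nBolt | 0.10\nNut | 0.05"),
    Element("NarrativeText", "All prices exclude tax."),
]
chunks = chunk_elements(elements)
for chunk in chunks:
    print(chunk)
    print("---")
```

Each resulting string would then be embedded and stored as one entry in the vector database. Note that in this sketch a title with no body text and no table under it is dropped, which may or may not be what you want.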