unstructured PDF data · Data Alchemy

unstructured PDF data

I have experimented a lot with several Python libraries for reading and extracting text from unstructured PDF files.

My specific use case: as a preprocessing step, I need a library that understands complex tables as well as regular sections, such as a cover page with important data, all within the same PDF file.

Table cells can contain a list of data or even further tables: unstructured data made by humans.

In a later step I can create embeddings from the extracted content and store them in a vector database (to use them with a model later), but that is not the topic of this post.

After many failed attempts with "normal" Python libraries, I found the library from https://unstructured.io. It helps me a lot to keep the content of each table semantically together, so that I can run searches on it later.
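
To make "keeping tables together" concrete, here is a minimal sketch of how this could look. It is not a definitive setup: it assumes the `unstructured` package with its PDF extras is installed, that `partition_pdf` is called with `strategy="hi_res"` and `infer_table_structure=True`, and that the returned elements expose `category`, `text` and `metadata.text_as_html`. The helper names `load_elements` and `build_chunks` and the chunk dictionary layout are made up for illustration.

```python
from types import SimpleNamespace


def load_elements(pdf_path):
    """Partition a PDF into elements (titles, narrative text, tables, ...).

    The import is deferred so the grouping logic below can be tried
    without the library installed.
    """
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=pdf_path,
        strategy="hi_res",           # layout-aware strategy, assumed to be needed for tables
        infer_table_structure=True,  # asks for an HTML rendering of each table
    )


def build_chunks(elements):
    """Group elements into chunks (hypothetical helper).

    Each table becomes exactly one chunk, so its rows are never split.
    Other text is collected under the most recent title.
    """
    chunks = []
    current = None
    last_title = None

    def flush():
        nonlocal current
        if current is not None and current["text"]:
            chunks.append(current)
        current = None

    for el in elements:
        category = getattr(el, "category", "")
        text = (getattr(el, "text", "") or "").strip()
        if category == "Table":
            flush()
            html = getattr(getattr(el, "metadata", None), "text_as_html", None)
            # Prefer the HTML form because it preserves rows and cells.
            chunks.append({"kind": "table", "title": last_title, "text": html or text})
        elif category == "Title":
            flush()
            last_title = text
            current = {"kind": "section", "title": text, "text": ""}
        elif text:
            if current is None:
                current = {"kind": "section", "title": last_title, "text": ""}
            current["text"] = (current["text"] + "\n" + text).strip()
    flush()
    return chunks


if __name__ == "__main__":
    # Stand-in elements so the sketch runs without a PDF.
    def fake(category, text, html=None):
        return SimpleNamespace(
            category=category, text=text,
            metadata=SimpleNamespace(text_as_html=html),
        )

    demo = [
        fake("Title", "Cover"),
        fake("NarrativeText", "Project X, 2024"),
        fake("Title", "Results"),
        fake("Table", "a b", "<table><tr><td>a</td><td>b</td></tr></table>"),
    ]
    for chunk in build_chunks(demo):
        print(chunk["kind"], "|", chunk["title"], "|", chunk["text"])
```

With a real file this would be called as `build_chunks(load_elements("report.pdf"))`, where `report.pdf` is a placeholder name. Each resulting chunk can then be embedded on its own, and a table chunk carries the title of the section it appeared in as extra context for search.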

But I am quite unsure whether I am on the right track, or whether there is some other, easier technique for working with such PDF files.

Also, I am sure that a lot of AI apps have similar requirements. What do you think?
