Going from PDF to Chunks the smart way
I got asked on yesterday's call how to convert a PDF into chunks for RAG in a more consistent way. The first challenge with converting any PDF is dealing with the unique way the underlying document may be formatted. Much of that formatting has no impact on the printed output, but it does have an impact if you are using Python with LangChain to extract the text: the output is often inconsistent, and sections are frequently aggregated incorrectly for the chunking process. A better approach that has worked consistently for me is to first convert the PDF into Markdown, then split the Markdown into chunks:

Step One: convert the PDF to Markdown.

```python
import pathlib

import pymupdf4llm

# Convert PDF to markdown
md_text = pymupdf4llm.to_markdown("input.pdf")

# Save markdown to file (optional) - you could also just keep it as a string
pathlib.Path("output.md").write_bytes(md_text.encode())
```

Step Two: split the Markdown on its headers.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)

# Split the markdown text
md_header_splits = markdown_splitter.split_text(md_text)
```
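
Header-based splits can still come out longer than what your embedding model handles well. As an optional follow-up, here is a minimal sketch of capping chunk size with LangChain's RecursiveCharacterTextSplitter; the chunk_size and chunk_overlap values are illustrative assumptions you would tune for your own model, not requirements of the steps above.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative values - tune chunk_size/chunk_overlap for your embedding model
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# split_documents keeps the header metadata (e.g. {"Header 1": ..., "Header 2": ...})
# that MarkdownHeaderTextSplitter attached to each split
chunks = text_splitter.split_documents(md_header_splits)

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:80])
```

Keeping that header metadata on each chunk is handy later in the RAG pipeline, for example for filtering retrieval by section or showing the user which part of the document an answer came from.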