Programming with LLMs (not just generating code with a chatbot)

This page is about using LLM APIs in a programming project, not about generating and editing code with a chatbot.

Under construction: this is currently a place for me to dump links and quick thoughts. It might turn into a real post one day.

Resources

Python packages for accessing LLMs

Python is currently the best way to access LLMs programmatically. There are several packages you can use: each LLM provider (Google, OpenAI, Anthropic, etc.) has its own package for accessing its set of models, and there are also packages that let you access models from any provider. You can also use the OpenAI package to access non-OpenAI models like Gemini, though it may not expose the full set of API features for those models. I am currently using the Gemini API because I am experimenting with the long input context of the Gemini models and the free, small Gemma model gemma-3n-e4b-it.
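
For example, here is a minimal sketch of pointing the OpenAI package at Gemini's OpenAI-compatible endpoint. The base URL, model name, and environment variable are assumptions from the provider docs at the time of writing, so check the current documentation before relying on them:

```python
# A minimal sketch, not a full program: use the OpenAI package against Gemini's
# OpenAI-compatible endpoint. The base URL and model name are assumptions;
# check the provider's current docs.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],  # assumes your key is in this env var
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[{"role": "user", "content": "In one sentence, what is an LLM API?"}],
)
print(response.choices[0].message.content)
```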

If you are interested in fine-tuning or training transformer models directly, see the Transformers package, which is built to work with PyTorch. If you are using transformers alongside other neural network models, check out PyTorch Lightning. If you are planning to train on high-performance computing clusters, I think HuggingFace Accelerate is the easiest to set up for HuggingFace models, but PyTorch Lightning Fabric is more general.
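
As a starting point, a minimal sketch of loading a model with the Transformers package and computing a training loss; gpt2 is just a small example model, swap in whatever you actually plan to fine-tune:

```python
# A minimal sketch using the HuggingFace Transformers package. gpt2 is only a
# small example model; replace it with the model you plan to fine-tune.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Tokenize some text and run a forward pass with labels to get a training loss.
# A real fine-tune would wrap this in Trainer, PyTorch Lightning, or Accelerate.
batch = tokenizer("LLMs can be fine-tuned on domain-specific text.", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss
print(float(loss))
```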

Below are a few typical programmatic workflows for using LLMs:

LLM Workflow 1: LLM only

Basic. Simple. Prompt.

  • Design your prompt
  • Submit your prompt string and get a response string
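A minimal sketch of this workflow, assuming the google-genai package and an API key stored in the GEMINI_API_KEY environment variable (the model name is the free Gemma model mentioned above):

```python
# LLM Workflow 1, a minimal sketch. Assumes the google-genai package
# (pip install google-genai) and an API key in the GEMINI_API_KEY env var.
import os

from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

# Step 1: design your prompt
prompt = "List three questions to ask before adding an LLM to a research pipeline."

# Step 2: submit the prompt string and get a response string
response = client.models.generate_content(model="gemma-3n-e4b-it", contents=prompt)
print(response.text)
```
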
LLM Workflow 2: Input Sources

When you already have some additional text you want to add to the prompt.

  • Find additional text that the LLM can use to improve its response: Google search results, keyword searches in a database, etc.
  • Design your prompt and add the additional text
  • Submit your prompt and get a response
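A minimal sketch of this workflow; the keyword_search function is a hypothetical stand-in for whatever source you actually use (Google results, a database keyword search, etc.):

```python
# LLM Workflow 2, a minimal sketch. keyword_search is a hypothetical placeholder
# for your real lookup (Google results, a database keyword search, etc.).
def keyword_search(query: str) -> list[str]:
    return ["Result A related to the query.", "Result B related to the query."]

question = "How did California electricity prices change after 2020?"
sources = keyword_search(question)

# Design the prompt and append the additional text, clearly delimited.
prompt = (
    "Answer the question using only the sources below.\n\n"
    f"Question: {question}\n\n"
    "Sources:\n" + "\n".join(f"- {s}" for s in sources)
)
# Submit `prompt` to the model exactly as in Workflow 1.
```
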
LLM Workflow 3: Retrieval

When you need to process a lot of data and dynamically select which additional information is useful for the current prompt. See the RAG section below for more info on searching/selecting text data for prompts.

  • Collect data sources. Text is easiest, but with multimodal models, you can also use images, audio, and video.
  • (optional) Tag the sources with metadata, either manually or with an LLM's help.
  • Create a database of your sources. There are several ways to store, search, and retrieve your sources; semantic vector embedding databases are the easiest to get started with. These run chunks (e.g. a paragraph) through an LLM embedding model, which outputs a vector describing the topics/sentiment of the chunk. This vector is then stored along with the metadata.
  • Retrieve select chunks from your dataset. The prompt is converted into a semantic vector and compared against all the stored vectors, and the Top K “closest” chunks are returned (closest = largest dot product). You can first filter on metadata to get better results, and you can use another LLM step to do additional selection and filter out any irrelevant chunks. See the sketch after this list.
  • Submit your prompt and additional context and get a response. The selected chunks are appended to the prompt, ideally with some formatting to make it clear they are separate from the prompt so the LLM does not accidentally internalize any implicit instructions in the chunks.
  • (optional) Check for comprehension and repeat. You can grade the response on how relevant it is to the prompt and how certain the model was. You can ask, in the original prompt, for a rating of the response's certainty, and even ask the model for a database query that would retrieve more helpful information. You can then cycle through the retrieval, submission, and comprehension steps to make sure you are extracting the most helpful information from your dataset.
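A minimal sketch of the embedding, retrieval, and prompt-assembly steps above, assuming the google-genai package and numpy. The embedding model name is an example (check the current docs), and a real project would use a vector database rather than an in-memory array:

```python
# LLM Workflow 3, a minimal sketch of embedding chunks, retrieving the Top K
# closest chunks by dot product, and assembling the final prompt. Assumes the
# google-genai package; the embedding model name may need updating.
import os

import numpy as np
from google import genai

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

def embed(texts: list[str]) -> np.ndarray:
    """Return one embedding vector per input text."""
    result = client.models.embed_content(model="text-embedding-004", contents=texts)
    return np.array([e.values for e in result.embeddings])

# The "database": one vector per chunk, stored alongside the chunk text.
chunks = [
    "Gasoline taxes in California help fund road maintenance.",
    "Cap-and-trade programs put a price on carbon emissions.",
    "Gemma models are small open-weight LLMs.",
]
chunk_vectors = embed(chunks)

# Retrieve the Top K closest chunks by dot product with the prompt's vector.
prompt = "How do carbon pricing policies work?"
scores = chunk_vectors @ embed([prompt])[0]
top_k = np.argsort(scores)[::-1][:2]

# Append the selected chunks to the prompt, clearly delimited from the question.
context = "\n".join(f"[Source {i}] {chunks[i]}" for i in top_k)
final_prompt = f"Use only the sources below to answer.\n\n{context}\n\nQuestion: {prompt}"
# Submit final_prompt to a generation model as in Workflow 1.
```
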
Workflow Improvements

Each step in this process can be broken down further into smaller steps, often using an LLM to make that step better. These are a few of the places where we can make improvements:

  • You can improve your prompts (see LLM Prompts).
  • You can improve the semantic search of your dataset by training an adjustment layer on top of the vector embedding model that maps the default vector output (which has a general topic space) into vectors that better match your task and specific queries (a more specific topic space).
  • You can improve the way chunks are stored and retrieved by selecting the chunk size and storing overlapping or concentric chunks (see the sketch after this list).
  • You can also link each chunk to the larger chunk it is part of, so that the larger chunk is retrieved (called small-to-big retrieval). This has been shown to help with context.
  • You can leverage different models for different parts of the workflow: you may need a more expensive model for an important prompt or prediction that requires a lot of context and reasoning, and a cheaper model for smaller, easier tasks like extracting a piece of data from a file.
  • You can fine-tune a model to a specific use case. Read Andrew Ng’s advice on when to use this technique. Some research shows that generalist models with detailed prompts can outperform fine-tuned models: “we perform a systematic exploration of prompt engineering. We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks”
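A minimal sketch of the overlapping-chunk idea, with each chunk keeping a link back to its parent document so the larger parent can be retrieved later (small-to-big). The chunk size and overlap values are illustrative; tune them for your data:

```python
# A minimal sketch of overlapping chunks that keep a link back to their parent
# document, so the larger parent can be retrieved later (small-to-big).
# The default size and overlap are illustrative, not recommendations.
def chunk_document(doc_id: str, text: str, size: int = 500, overlap: int = 100) -> list[dict]:
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "parent_id": doc_id,            # link back to the full document
            "start": start,
            "text": text[start:start + size],
        })
    return chunks

# Store embeddings of the small chunks for search, but when a chunk is retrieved,
# look up its parent_id and pass the larger parent text to the LLM instead.
```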
