I’ve been asked many times to provide some general information on LLMs for developers who are looking to get started with LLMs for the first time.
Core Concepts
What is an LLM?
Large Language Models (LLMs) are machine learning models trained on large amounts of human-written text with the purpose of mimicking human-like writing abilities. An oversimplified description would be a 'smart autocomplete', but LLMs have demonstrated the ability to perform many tasks that they were not originally trained for, including summarizing text, extracting information, performing Q&A, generating code, and even using tools and achieving larger objectives.
What can LLMs be used for?
Though LLMs are primarily trained to generate text, many studies have demonstrated that LLMs are also capable of advanced reasoning, instruction following, and even basic math (🤯).
Some tasks that LLMs can do really well:
- Semantic search: being able to search based on the semantic intent of the user's query rather than the precise search text they use (you can use this to locate code based on a description of the code).
- Code generation: Copilot is able to generate entire classes and functions based on a leading comment that describes the behavior of the code. GPT-4 is also able to do this very well.
- Summarization: Given a longer block of text & an instruction asking the LLM to create a summary, the model is able to write a shortened version of the text. It is even possible to specify the tone, audience, and writer's background.
- Extracting structured data from unstructured text: Since LLMs understand language rules really well, they perform exceptionally well at retrieving specific pieces of data from a blob of unstructured text (even just a straight-up Cmd+A copy/paste of an entire webpage, PDF, etc). This can be used to perform Q&A against unstructured content, or even to convert it into a JSON format for your program to use (a minimal sketch of this follows below).
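To make the extraction task concrete, here is a minimal sketch using the openai Python SDK (v1-style client); the model name, prompt wording, and pasted text are just examples, and older SDK versions expose this call differently:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Pretend this is a Cmd+A copy/paste of a webpage (example content):
pasted_text = "Acme Corp was founded in 1999 in Austin by Jane Doe."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Extract the company name, founding year, and founder from the text below. "
            "Respond with only a JSON object with the keys 'name', 'year', and 'founder'.\n\n"
            + pasted_text
        ),
    }],
    temperature=0,  # keep the output grounded in the provided text
)
print(response.choices[0].message.content)  # e.g. {"name": "Acme Corp", ...}
```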
How do you interact with LLMs?
Completion models (GPT-3)
You can think of interacting with an LLM like an intelligent auto-complete. Given some text, the LLM will attempt to generate additional text as a 'completion'. The initial text you provide to the LLM is called a 'prompt'. LLMs that have been trained to follow instructions will pay closer attention to the instructions provided in the prompt and try their best to produce a completion that makes sense.
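A minimal sketch of a raw completion call (the model name is just an example of an instruction-tuned completion model; availability changes over time):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A completion model simply continues the text it is given.
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Write a one-line docstring for a function that reverses a string.\n",
    max_tokens=50,
)
print(response.choices[0].text)
```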
Chat models (GPT-3.5-Turbo, GPT-4)
Some newer LLM models are designed as chat models. This means that interactions with them are specifically formatted to simulate a conversation between two people (an assistant and a user), and they are trained on conversational data. Examples of these models include GPT-3.5 and GPT-4. These models are still instruction-tuned, so you can still ask them to format their answers in specific ways so your program can parse the answer.
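A sketch of the same client used in a chat-style interaction; the key difference is that the input is a list of role-tagged messages rather than a single prompt (the conversation content here is invented):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "What is a vector store?"},
        {"role": "assistant", "content": "A database optimized for similarity search over embeddings."},
        {"role": "user", "content": "Name two examples."},  # this follow-up relies on the turns above
    ],
)
print(response.choices[0].message.content)
```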
What package(s) should I be using?
There are different packages that allow you to work with various models that might be relevant to you. Here’s a list of a few that I find myself reaching for:
JavaScript
- langchain is the TypeScript equivalent of the Python langchain library. It contains both higher-level clients for common services and a number of other tools and techniques.
- zod-to-json-schema is a package that, while not made directly for LLMs, is incredibly useful for providing an LLM with formatting instructions as well as for validating that the LLM actually did what you wanted.
Python
- sentence-transformers includes a number of very useful things including some of the best embedding models for semantic search.
- langchain contains both a higher-level client for OpenAI (as well as other services), but also some useful techniques, prompts, and other tools that can be useful for developing LLM applications.
- openai is the low-level SDK for interacting with OpenAI’s API.
How do I teach the LLM to do my task?
There are several different learning techniques that LLMs use to perform novel tasks. Here are a few of the common ones:
Zero-shot learning: this is when you can simply ask the LLM to perform the task you need, without needing to perform any training or even giving the LLM an example of how to do the task. This works well for tasks that LLMs are designed for or evaluated against (such as the examples mentioned in the 'What can LLMs be used for?' section).
Few-shot learning: this is one of the most effective learning strategies for LLMs. It involves giving the LLM a few examples of the task you want it to perform, and the sample output expected. By analyzing the examples, the LLM is able to understand formatting directives, task direction, and the intent behind your instructions.
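To make few-shot learning concrete, here is a sketch of a few-shot prompt (the classification task and reviews are invented for illustration); zero-shot is the same prompt minus the worked examples:

```python
# Two worked examples teach the model the task, the label set, and the
# output format; the final line is the real input we want classified.
few_shot_prompt = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: It stopped working after a week and support never replied.
Sentiment: negative

Review: Setup took five minutes and everything just worked.
Sentiment:"""
```

With a chat model, the worked examples can equivalently be supplied as alternating user/assistant turns.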
Fine-tuning: this technique involves creating training and validation datasets of example prompts and the completions you would expect the LLM to produce for them. There are a few things to be aware of when using fine-tuning (a data-format sketch follows the list):
- If you are fine-tuning with OpenAI's models: OpenAI only allows their base models from the GPT-3 family to be fine-tuned. These models have not been tuned for instruction following and have not been updated in a long time.
- Fine-tuning quality tends to scale with the size of the training data. OpenAI has recommended that you use thousands of samples at minimum for fine-tuning, and that these samples be human-verified or human-generated.
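For reference, a sketch of preparing data in OpenAI's legacy base-model fine-tuning format, a JSONL file of prompt/completion pairs (the example task is invented):

```python
import json

# Each training example pairs a prompt with the completion the model
# should learn to produce (legacy OpenAI base-model format).
examples = [
    {"prompt": "Translate to SQL: count all users ->",
     "completion": " SELECT COUNT(*) FROM users;"},
    {"prompt": "Translate to SQL: most recent order ->",
     "completion": " SELECT * FROM orders ORDER BY created_at DESC LIMIT 1;"},
]

with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```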
Generally, your task is probably doable with zero-shot or few-shot learning. Fine-tuning is valuable in very specific scenarios, such as if you are generating code for a language that the LLM has not seen many examples of before.
Is it safe to provide PII or other confidential information to LLMs?
It depends on how the LLM is hosted. Many language and embedding models are available to self-host, and you can therefore provide PII to them without the risk of leaking data. For cloud-hosted models, it is really up to you how much you trust the hosting and the company behind the product. For OpenAI specifically, you can find information about their compliance certifications and documents on their site.
Another important factor to keep in mind: when providing an LLM with concrete information such as PII, there is a possibility of LLM 'hallucinations'. If you provide sensitive information and then expect it to be reformatted or returned in any way in the response, it is possible that the model will change the information in its response. Generally, your LLM-powered applications should be as tightly bound as possible so you can detect errors and test smaller assumptions.
Are LLMs accurate? What about 'hallucinations'?
One of the known issues with LLMs is that they will typically, and confidently, try to pattern-match a completion for you, even if that completion doesn't really make sense or is wrong. An example of this is when LLMs are asked to do math. Though LLMs have some limited ability to perform basic arithmetic, any complex operation results in a number that looks right but frequently is not.
There are a number of strategies for overcoming this problem.
- OpenAI suggests designing prompts that give the LLM an escape hatch for when its reasoning determines that there isn't an acceptable answer. For example, you could say "Answer this question, or say 'null' if an answer cannot be determined."
- You can try to enrich the context window of the prompt to increase the likelihood of success. For example, instead of asking "Who is the President of the US?" (which would cause the LLM to rely on its old training data), you could first extract text content from a Wikipedia entry and then pose your question after the text. Since the answer exists within the context window, the LLM is no longer trying to answer a knowledge question but is instead extracting structured answers from unstructured data (a sketch of this follows below).
- Agent-based approaches are a novel technique for giving LLMs access to tools and forcing them to use systems-design thinking; the ReAct and MRKL agents mentioned in the parameters section below are examples of this.
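A sketch combining the first two strategies above, an escape hatch plus an enriched context window (the context text here is hard-coded for illustration; in practice it would come from a scrape or retrieval step):

```python
def build_grounded_prompt(question: str, context: str) -> str:
    # The escape hatch ('say null') plus in-context source text turns a
    # knowledge question into an extraction task over the provided text.
    return (
        "Use only the text below to answer. "
        "If the answer cannot be determined from the text, say 'null'.\n\n"
        f"Text:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Pretend this was just pulled from a current Wikipedia entry:
context = "Joe Biden is the 46th and current president of the United States."
prompt = build_grounded_prompt("Who is the President of the US?", context)
```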
GPT-4 is the only model in the GPT family that was trained to bias towards accuracy instead of confidence, and it is much more likely to refuse to do a task. If you are uncertain whether your task is truly doable, or you want to use an LLM to test the output of another inference, it is recommended that you use a model such as GPT-4 that has been evaluated against these tasks.
What is the difference between 'lexical' and 'semantic' representations?
When we are talking about the lexical representation of text, we are talking about the exact words and syntax used to represent the text. For example, when you search your code using your IDE, you are performing a lexical search of your code (looking for matches based on the words used in the search query).
The semantic value of a text is the meaning or intent behind it. For example, when you search Google, it tries its best to give you semantically relevant search results by matching synonyms, intent, themes, etc.
LLMs generally pay more attention to the semantic value of your text, but the lexical content you provide can make a difference in the output.
What is an "embedding"?
One of the key tasks that an LLM performs internally is mapping the text it is given to a vector in N-dimensional space. This vector is called an "embedding" - it is the way the model has decided to represent your content semantically. The embedding is what allows the LLM to understand the difference between "these computers are really hot" and "these peppers are really hot" (the two sentences are very similar lexically but very different semantically).
There are some really interesting things you can do with this information. Embedding vectors that are closer together in space should be more similar in meaning. As a result, you can group together similar content, build higher quality search experiences, and even perform zero-shot classification tasks.
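A sketch of computing and comparing embeddings with sentence-transformers (the model name is one of the small general-purpose sbert models; any of the pretrained models linked below would work):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "these computers are really hot",   # 0
    "these peppers are really hot",     # 1
    "this laptop keeps overheating",    # 2
]
embeddings = model.encode(sentences)

# Cosine similarity: closer to 1.0 means closer in meaning.
print(util.cos_sim(embeddings[0], embeddings[1]))  # lexically close, semantically far
print(util.cos_sim(embeddings[0], embeddings[2]))  # lexically far, semantically close
```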
What are vector stores?
Vector stores are database applications that are specially designed to efficiently search large sets of vectors. A simple semantic search using a vector store works something like this:
- Retrieve all searchable documents (this might be code, text, etc).
- Split up the documents to make search most effective (i.e. when searching, you probably don't want to match an entire file but rather a single paragraph).
- Retrieve the embedding for each split datum, and store all of these vectors into a vector store. The store should allow you to store an ID or reference so you can trace your datum back to the document that it came from.
- When the user performs a search, you must transform the user's search query into an embedding as well (using the same embedding model). You can then ask the vector store to retrieve the N most similar documents to your query vector.
Performing vector math in ~1,500-dimensional space (a common embedding size) is not something that a typical database can handle easily, which is why vector stores exist.
There are a number of common vector stores that you can choose from for your application, including Chroma, FAISS, pgvector, and more. A minimal sketch of the search flow described above follows.
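This sketch uses FAISS as the vector store and sentence-transformers for embeddings (the document chunks and query are invented; a real application would split real documents and keep IDs back to their sources):

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Steps 1-2: documents, already split into paragraph-sized chunks.
chunks = [
    "The payment service retries failed charges up to three times.",
    "User sessions expire after 30 minutes of inactivity.",
    "All API responses are cached for 60 seconds.",
]

# Step 3: embed each chunk and index the vectors (here the chunk's list
# position doubles as its ID back to the source).
embeddings = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)

# Step 4: embed the query with the same model, then fetch the top match.
query = model.encode(["how long do logins last?"], normalize_embeddings=True)
scores, ids = index.search(query, 1)
print(chunks[ids[0][0]])  # -> the session-expiry chunk
```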
What parameters should I be using (i.e. temperature, stopping sequence, etc)?
Different LLMs will have different parameters, and the values of your parameters depend greatly on your task. However, there are a few things to keep in mind (a combined example follows the list):
- Limit your `max_tokens`, especially if you are including untrusted text in your prompts. This helps avoid overspending money on malicious prompts.
- Use `stop` (stopping sequence) if you know the correct stopping point for the model. For example, ReAct and MRKL agents depend on using `Observation:` as the stopping sequence to avoid the LLM generating its own output for the tools.
  - Sometimes this can limit your ability to verify the model’s response. For instance, if you ask the model to output a single line but it hallucinates and outputs several, using a LF token as your stopping sequence will stop your program from detecting that the model has hallucinated and will simply utilize the first line as the ground truth.
- How `temperature` affects output:
  - A temperature of 0 will force short, concise responses and will bias the model towards utilizing the information (and exact words) in its context window.
  - A temperature between 0 and 1 will enable some creativity, which can be useful for:
    - tasks where the model needs to justify its behavior
    - tasks that require “general knowledge” (or background knowledge not present in the context window)
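Putting the three parameters together in one (illustrative) call with the v1 openai client:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Answer in one line: what is FAISS?"}],
    temperature=0,    # deterministic, context-grounded output
    max_tokens=100,   # caps spend, especially with untrusted prompt content
    stop=["\n"],      # stop at the first line break (example value; see the caveat above)
)
print(response.choices[0].message.content)
```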
Can LLMs improve in accuracy?
Though LLMs are static ML models (i.e. they are not continuously changing as you interact with them), there are recently published techniques that allow you to improve the accuracy of your interactions with an LLM. Two such techniques are "Recursive Criticism and Improvement" and "Reflexion" (both techniques were evaluated against GPT-3 models, but might work with GPT-3.5 and GPT-4). While these techniques do not modify the underlying model, they can be utilized to verify accuracy for more complex workflows.
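A heavily simplified sketch of the critique-and-revise loop these techniques share (the real procedures in the papers are more involved; the model name and prompts here are just examples):

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

question = "How many prime numbers are there between 1 and 20?"

# 1) Draft an answer, 2) ask the model to criticize it, 3) revise.
answer = ask(question)
critique = ask(f"Question: {question}\nAnswer: {answer}\nList any errors in this answer.")
answer = ask(f"Question: {question}\nAnswer: {answer}\nCritique: {critique}\nWrite an improved answer.")
print(answer)
```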
Helpful links
Getting Started
- OpenAI Reference: https://platform.openai.com/docs/api-reference (useful for learning the concepts)
- LangChain Reference: https://python.langchain.com/en/latest/index.html (langchain combines several recent techniques into a single package - it is very useful for rapidly prototyping systems using LLMs, and still valuable even if you don't work with OpenAI models).
- LLM Parameters: https://txt.cohere.com/llm-parameters-best-outputs-language-ai/
Models
- Available OpenAI models: https://platform.openai.com/docs/models/overview
- Available sbert models: https://www.sbert.net/docs/pretrained_models.html
- Flan-T5 model documentation: https://huggingface.co/docs/transformers/model_doc/flan-t5
Tutorials
- Building semantic search with LLMs: https://www.youtube.com/watch?v=Yhtjd7yGGGA
- LangChain basics: https://www.youtube.com/watch?v=2xxziIWmaSA
Examples
- OpenAI's examples: https://platform.openai.com/examples
Helpful packages
- Extracting structured data from unstructured data: https://github.com/eyurtsev/kor
Looking for more interesting readings than just getting started? Check out the LLM Readings page.