Generative AI — A developer's guide to LLM technologies

Dr. Gabriel Lopez
7 min read · Aug 27, 2023


A guide to the most relevant concepts surrounding LLMs, with practical links and recommendations to help you understand this new technology

Image generated with Stable Diffusion technology

What are LLMs?

The term LLM stands for Large Language Model. An LLM is an artificial intelligence model that has been trained on vast amounts of data to reproduce human language and reasoning. ChatGPT is an example of an application built on top of an LLM (GPT-4).

Uses of LLMs

LLMs can be used for many purposes, including: (1) Q&A applications based on reference documents, (2) text summarisation, (3) language translation, (4) information retrieval from text / images, (5) zero-shot / few-shot classification, and (6) programming code generation.
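
As an example, use case (5) can be sketched in a few lines with the HuggingFace pipeline API. This is a minimal sketch; the model name below is just one common choice for zero-shot classification, not a prescribed one:

```python
# Minimal zero-shot classification sketch with HuggingFace `transformers`.
# The model choice is an assumption; any NLI-based zero-shot model works.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")  # assumed model

result = classifier(
    "The invoice must be paid within 30 days of delivery.",
    candidate_labels=["legal", "sports", "cooking"],
)
print(result["labels"][0])  # highest-scoring label, e.g. "legal"
```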

LLMs can show close-to-human reasoning capabilities and are a formidable, and to some, scary, form of machine intelligence.

Top LLMs (as of August 2023)

Below is a list of the most influential models so far and some of their main properties:

*A Llama2 variant with a 32K context window can be found here.

GPT-4 is overall the best LLM so far, Llama2 is known for its code-generating capabilities, Claude2 is known for its empathetic responses, and Palm2 is known for its multilingual support (100+ languages). Llama2 is open-source, thus very attractive for developers.

Practical Concepts around LLMs

In general, the bigger the LLM, the better its output. Running big models requires big machines, which is why model optimisation techniques are needed.

  • Model size: The number of trainable parameters in the model. Bigger models are often more accurate; however, they consume more memory and require better hardware to run (GPUs, TPUs). Bigger models also have lower inference speeds. Currently a model can be seen as "small" if it has fewer than 10B parameters, medium-size models have about 10B — 100B, and anything bigger than 100B is considered large.
  • Context window: The number of tokens the model can take as input when generating responses. As prompts get larger, this number becomes an important limiting factor for practical applications. For example, in GPT-3 the context window size is 2K tokens, while GPT-4 offers up to 32K.
  • Throughput (inference speed): The number of tokens per second an LLM can produce. An excellent speed is 40 tok/s (typical response after ~2 s), 30 tok/s is a good speed (typical response after ~3 s), and less than 20 tok/s is seen as relatively slow (typical response after ~5 s). The higher the throughput, the smaller the latency of the LLM responses. Throughput depends on the model architecture, the running infrastructure, API response time, etc.
  • Quantisation: A technique to increase the inference speed of a model by downgrading its numerical precision. Many LLMs are available in quantised form, reducing their memory footprint and speeding up inference. Typically float32/float16 weights are mapped to 8-bit or 4-bit representations. Accuracy might be impacted (see the loading sketch after this list).
  • Optimisations: Transformer architectures can be optimised for speed. Technologies like Flash Attention and Memory Efficient Attention redefine some of the transformer building blocks to achieve better training and inference speeds. SOTA model implementations often use these optimisations. Check useful implementations here.
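
To make the quantisation point concrete, here is a minimal sketch of loading a model in 4-bit with HuggingFace transformers and bitsandbytes. The model id is an assumption, and a CUDA GPU, the bitsandbytes package, and access to the Llama2 weights are required:

```python
# Hedged sketch: load an LLM with 4-bit quantised weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # assumed model; gated, requires access approval

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantise weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run computations in half precision
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",                     # spread layers over available GPUs
)

inputs = tokenizer("Quantisation reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```

The same model loaded in full float16 precision would need roughly twice to four times the GPU memory, which is the main reason quantised distributions are so popular for local deployment.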

Limitations of LLMs

  • Cost: LLMs cost money. Having a quality model running in production requires either a paid API (like the OpenAI or Anthropic APIs) or an investment in cloud / on-premise computational resources (GPUs, TPUs) to serve the model.
  • Hallucinations: The tendency of LLMs to produce content that is nonsensical or untruthful, yet often articulated in a very convincing manner. Techniques like RAG help combat hallucinations.
  • Bias: LLMs are trained on text from the internet. As Western developed countries tend to publish more on the internet, LLMs often show a “Western” point of view, sometimes even expressing negative sentiment towards other cultures, ethnicities, gender expressions, etc. Additionally, LLMs can provide unsafe outputs on request (like instructions to build a bomb). A strong research effort is currently focused on reducing this.
  • Security: LLM-based systems are vulnerable to prompt injection attacks. These attacks enable malicious users to make the LLM ignore its original instructions in order to: output malicious content (like “how to build a bomb”), extract information from SQL databases connected to the LLM system, or simply take control of the base LLM application.
  • Privacy: LLMs are often too heavy to run locally, so users often rely on third-party APIs to access them (OpenAI, Anthropic, etc.). This implies, however, that possibly sensitive user data is sent to third-party data centers for processing. If this data flows to data centers outside the user's jurisdiction, it may violate data privacy regulations like the GDPR.

LLMs can be “tricked” into producing unsafe responses. This is known as “jailbreaking”.

Recommendations for Developers

  • For LLM-powered applications in production we recommend HuggingFace instead of LangChain
  • When external knowledge bases are needed we recommend prompt engineering (RAG) instead of fine-tuning
  • Fine-tuning works well even with small datasets of around 2K samples
  • When deploying open-source LLMs locally we recommend looking for LLM distributions with low peak memory, quantisation and speed improvements. This can be a good start. You can check here if you have enough memory to run the model.
  • You can test your ChatGPT-based application's pre-prompt against injection attacks here.
  • RAG applications can suffer if the documents retrieved for the user query are not relevant enough. The following tips can alleviate that: (1) increase the number of documents retrieved, (2) use a larger embedding model, (3) use re-rankers to re-order the retrieved list with a more refined similarity ranking tool (check here; see the re-ranking sketch after this list), (4) use fine-tuned embedding models (check here)
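
As an illustration of tip (3), a re-ranking step with a cross-encoder from sentence-transformers could look roughly like this. The model name, query and documents below are placeholders, not a prescribed setup:

```python
# Minimal re-ranking sketch: score (query, document) pairs with a cross-encoder
# and sort the retrieved documents from most to least relevant.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed re-ranker

query = "What is the notice period in the contract?"
retrieved_docs = [
    "Either party may terminate with 30 days written notice.",
    "The company was founded in 1998 in Delft.",
    "Payment is due within 14 days of invoicing.",
]

scores = reranker.predict([(query, doc) for doc in retrieved_docs])
reranked = [doc for _, doc in sorted(zip(scores, retrieved_docs), reverse=True)]
print(reranked[0])  # the most relevant document according to the cross-encoder
```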

Dictionary of concepts around LLMs

When you dive into the GenAI world there will be many concepts to grasp. Here is a list of some of the most relevant ones:

  • Agents: AI systems that combine an LLM with a Python executor, allowing the model to “act” on the digital world by running the Python code it generates. Agents are entities equipped with “tools” that the model can invoke at will. Guide here.
  • User-prompt: The prompt typed by the user. Normally this prompt is enriched with application-dependent information before it is sent to the LLM.
  • Chains: A chain refers to the union of user-prompt, pre-prompt, memory and contextual documents into a final prompt to be sent to the LLM (a minimal sketch of this assembly appears after this list).
  • DPR: Dense Passage Retrieval. Technique to search among documents using embeddings (a.k.a. dense representations), like the ones provided by GPT embedders, as opposed to sparse representations like the ones obtained with TF-IDF or BM25.
  • LoRA: Fine-tuning framework that allows re-training with far fewer trainable parameters, thus enabling: (1) faster, cheaper fine-tuning, (2) portability: a single LLM can swap between different “fine-tuning heads” (adapters) if desired, (3) lower hardware requirements. The technique behind it is known as Low-Rank Adaptation of LLMs.
  • Memory: External memory that stores the previous user-prompts + replies. These are used as contextual information to answer the follow-up question.
  • PEFT: Parameter-Efficient Fine-Tuning. A set of techniques built to make model fine-tuning more efficient. There is a nice HuggingFace package around it.
  • Pre-prompt: A set of instructions sent to the LLM so that it processes the user-prompt in the correct way. Ex: “Answer the following question as Neil deGrasse Tyson: <user-prompt>”
  • Quantisation: A technique to lower the precision of the calculations in a neural net with the goal of speeding up training and inference.
  • QLoRA: Same as LoRA but with quantisation included (+ other improvements). See more here.
  • RAG (Retrieval-Augmented Generation): AI framework to link LLMs with domain-specific knowledge bases. This enables the LLM to answer questions using domain-specific data. Normally used for Q&A purposes. This data is passed dynamically in the prompt and is usually retrieved by matching the question of the user with the most relevant document in the knowledge base.
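
To make the Chains, Memory and Pre-prompt entries concrete, here is an illustrative, library-free sketch of how these pieces could be assembled into the final prompt. The function name and wording are my own, not a specific framework's API:

```python
# Illustrative chain assembly: pre-prompt + memory + contextual documents + user-prompt.
def build_final_prompt(pre_prompt, memory, context_docs, user_prompt):
    history = "\n".join(memory)             # previous user-prompts + replies
    context = "\n---\n".join(context_docs)  # retrieved reference documents
    return (
        f"{pre_prompt}\n\n"
        f"Conversation so far:\n{history}\n\n"
        f"Reference documents:\n{context}\n\n"
        f"Question: {user_prompt}\nAnswer:"
    )

final_prompt = build_final_prompt(
    pre_prompt="You are a helpful assistant. Answer only from the documents.",
    memory=["User: Hi", "Assistant: Hello! How can I help?"],
    context_docs=["The warranty period is 24 months."],
    user_prompt="How long is the warranty?",
)
print(final_prompt)  # this string is what would actually be sent to the LLM
```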

Empower LLMs with your own data!

One of the ways to truly unleash the power of LLMs is to connect them with your own data. This can prove very useful for Q&A applications, for instance reading long contracts or extracting specific info from journal papers. There are two main ways to provide external data to an LLM:

via Prompting + External Sources (RAG): Simply add the relevant source documents to the input prompt to be sent to the LLM as contextual information together with the user prompt. This approach is fast (no re-training needed), easy (just add text to the pre-prompt) and cheap. However, it generally shows lower accuracy than fine-tuned models on domain-specific tasks.
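
A minimal sketch of this approach, using an assumed sentence-transformers embedding model and an in-memory document list (a real application would typically use a vector database), could look like this:

```python
# Minimal RAG sketch: embed the knowledge base, retrieve the closest document
# to the user question, and inject it into the prompt sent to the LLM.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

docs = [
    "The rental contract can be terminated with one month notice.",
    "The deposit equals two months of rent.",
    "Pets are allowed only with written permission from the landlord.",
]
doc_embeddings = embedder.encode(docs, convert_to_tensor=True)

question = "How much is the deposit?"
q_embedding = embedder.encode(question, convert_to_tensor=True)

# Cosine similarity between the question and every document
scores = util.cos_sim(q_embedding, doc_embeddings)[0]
best_doc = docs[int(scores.argmax())]

prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {best_doc}\n"
    f"Question: {question}\nAnswer:"
)
# `prompt` is then sent to the LLM of your choice
```

Because the retrieved snippet is injected into the prompt, the LLM answers from your data rather than from its training memory alone.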

via Fine-Tuning: Re-trains the last linear layers of the LLM so that it reproduces a target training dataset of domain-specific (prompt, desired-response) pairs. This approach is slower (re-training takes time), harder (you need to modify model weights) and more expensive (infrastructure requirements) than RAG, but it typically provides better accuracy on domain-specific tasks.
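
For orientation, a LoRA-style fine-tuning setup with the HuggingFace peft package could look roughly like this. The model id and hyperparameters are assumptions, and the actual training loop on your (prompt, response) pairs is omitted:

```python
# Hedged sketch: wrap a base model with LoRA adapters using `peft`.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # assumed small model

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights is trainable
# peft_model can now be passed to the usual HuggingFace Trainer for fine-tuning
```

Only the small adapter matrices are trained, which is what makes fine-tuning feasible on modest hardware.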

Leaderboards

New open-source models pop up every week. To find the best open-source model for a specific task, one can visit the leaderboards.

LLMs for Python Developers — Free Online Tutorials

These links will help you on your path to developing your own LLM-based applications.

Thanks for reading, and keep learning!


Dr. Gabriel Lopez

Senior Data Scientist and PhD (Delft University). Science and technology enthusiast. Passionate about food and fractals. Dog lover.