In its Explore small language models for specific AI scenarios report, published in August 2024, Gartner examines how the definitions of “small” and “large” in AI language models have evolved.
Gartner notes that GPT-4 (OpenAI – March 2023), Gemini 1.5 (Google – February 2024), Llama 3.1 405B (Meta – July 2024) and Claude 3 Opus (Anthropic – March 2024) are estimated to have between half a trillion and two trillion parameters. At the other end of the spectrum, models such as Mistral 7B (Mistral AI – September 2023), Phi-3-mini 3.8B and Phi-3-small 7B (Microsoft – April 2024), Llama 3.1 8B (Meta – July 2024) and Gemma 2 9B (Google – June 2024) are estimated to have 10 billion parameters or fewer.
As an example of the difference in computational resources, Gartner reports that Llama 3 8B (eight billion parameters) requires 27.8GB of graphics processing unit (GPU) memory, whereas Llama 3 70B (70 billion parameters) requires 160GB.
The more GPU memory needed, the greater the cost. For instance, at current GPU prices, a server capable of running the complete 671 billion-parameter DeepSeek-R1 model in-memory would cost over $100,000.
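As a rough illustration of where such figures come from – this is a back-of-the-envelope rule of thumb, not Gartner’s methodology, and the 20% overhead factor is an assumption – serving memory can be approximated from the parameter count and the bytes stored per parameter:

```python
def estimate_gpu_memory_gb(params_billions: float,
                           bytes_per_param: float = 2.0,  # 2 = FP16/BF16, 1 = INT8, 0.5 = 4-bit
                           overhead: float = 1.2) -> float:
    """Weight memory (parameters x bytes each) plus a crude ~20%
    allowance for KV cache, activations and runtime buffers."""
    return params_billions * bytes_per_param * overhead

# Illustrative FP16 estimates for the models discussed above
for name, size in [("Llama 3 8B", 8), ("Llama 3 70B", 70), ("DeepSeek-R1", 671)]:
    print(f"{name}: ~{estimate_gpu_memory_gb(size):.0f}GB")
```

Published figures such as Gartner’s also reflect the serving stack, context length and quantisation in use, so they can differ noticeably from this naive estimate.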
Knowledge distillation
The fact that a large language model is several times larger than a small language model – in terms of the parameters learned during training and then used for AI inference – implies that SLMs are trained on a subset of data. This suggests there are likely to be gaps in their knowledge, so they will sometimes be unable to provide the best answer to a particular query.
Knowledge distillation, in which a smaller model is trained to reproduce the behaviour of a larger one, offers a way to narrow those gaps. “This knowledge transfer represents one of the most promising approaches to democratising advanced language capabilities without the computational burden of billion-parameter models,” says Jarrod Vawdrey of Domino Data Lab. “Distilled SLMs improve response quality and reasoning while using a fraction of the compute of LLMs.”
Vawdrey says knowledge distillation from LLMs to SLMs begins with two key components: a pre-trained LLM that serves as the “teacher”, and a smaller architecture that will become the SLM “student”. The smaller architecture is typically initialised either randomly or with basic pre-training.
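The heart of that process is a training loss that pulls the student’s output distribution towards the teacher’s. Below is a minimal sketch of the widely used soft-target formulation, assuming PyTorch; the temperature and blending weight are illustrative defaults rather than values from Vawdrey:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0,       # temperature: assumed, tuned in practice
                      alpha: float = 0.5    # soft/hard blend: assumed, tuned in practice
                      ) -> torch.Tensor:
    """Blend of a soft-target loss (match the teacher's softened output
    distribution) and the usual hard-label cross-entropy."""
    # KL divergence between temperature-softened distributions, scaled by
    # T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

During training, the teacher runs in inference mode over the same inputs to produce its logits, while only the student’s weights are updated.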
Augmenting SLMs
Neither an LLM nor an SLM alone may deliver everything an organisation needs. Enterprise users will typically want to combine the data held in their corporate IT systems with an AI model.
According to Dominik Tomicevic, CEO of graph database provider Memgraph, context lies at the core of the entire model debate. “For very general, homework-level problems, an LLM works fine, but the moment you need a language-based AI to be truly useful, you have to go with an SLM,” he says.
For instance, the way a company mixes paint, builds internet of things (IoT) networks or schedules deliveries is unique. “The AI doesn’t need to recall who won the World Cup in 1930,” he adds. “You need it to help you optimise for a particular problem in your corporate domain.”
As Tomicevic notes, an SLM can be trained to detect queries about orders in an e-commerce system and, within the supply chain, gain deep knowledge of that specific area – making it far better at answering relevant questions. Another benefit is that for mid-sized and smaller operations, training an SLM is significantly cheaper – considering the cost of GPUs and power – than training an LLM.
However, according to Tomicevic, getting supply chain data into a focused small language model is technically a major hurdle. “Until the basic architecture that both LLMs and SLMs share – the transformer – evolves, updating a language model remains difficult,” he says. “These models prefer to be trained in one big batch, absorbing all the data at once and then reasoning only within what they think they know.”
This means updating or keeping an SLM fresh, no matter how well-focused it is on the use cases for the business, remains a challenge. “The context window still needs to be fed with relevant information,” he adds.
For Tomicevic, this is where an additional element comes in – organisations repeatedly find that a knowledge graph is the best data model to sit alongside a domain-trained SLM, acting as its constant tutor and interpreter.
Retrieval augmented generation (RAG) powered by graph technology can bridge structured and unstructured data. Tomicevic says this allows AI systems to retrieve the most relevant insights with lower costs and higher accuracy. “It also enhances reasoning by dynamically fetching data from an up-to-date database, eliminating static storage and ensuring responses are always informed by the latest information,” he says.
“This transforms how organisations deploy AI, bringing powerful capabilities to environments previously considered impractical for advanced computing and democratising access across geographical and infrastructure barriers,” he adds.
According to Mahl, RAG provides a pipeline that cuts through the noise to deliver precise, relevant context to small language models.
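As a concrete sketch of the kind of pipeline described above – with `run_cypher` and `slm_generate` as hypothetical stand-ins for a graph database driver and a locally hosted small model, and an invented e-commerce schema – retrieval, augmentation and generation might look like this:

```python
def run_cypher(query: str, params: dict) -> list[dict]:
    """Placeholder for a graph-database call (e.g. via a Bolt driver)."""
    return [{"order_id": "A-1042", "status": "delayed", "eta": "2025-07-02"}]

def slm_generate(prompt: str) -> str:
    """Placeholder for a call to a locally hosted small language model."""
    return "Order A-1042 is delayed; new ETA 2 July 2025."

def answer(question: str, customer_id: str) -> str:
    # 1. Retrieve: fetch only the facts relevant to this customer from the
    #    live graph, so the model reasons over current data, not stale weights.
    rows = run_cypher(
        "MATCH (c:Customer {id: $cid})-[:PLACED]->(o:Order) "
        "RETURN o.id AS order_id, o.status AS status, o.eta AS eta",
        {"cid": customer_id},
    )
    # 2. Augment: serialise the retrieved facts into the prompt context.
    context = "\n".join(str(r) for r in rows)
    # 3. Generate: the SLM answers strictly from the supplied context.
    prompt = (f"Answer using only this data:\n{context}\n\n"
              f"Question: {question}")
    return slm_generate(prompt)

print(answer("Where is my order?", "C-77"))
```

Because the facts are fetched from the live graph at question time, answers reflect current data rather than whatever was frozen into the model’s weights during training.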
Reducing errors and hallucinations
While LLMs are regarded as incredibly powerful, they suffer from errors known as hallucinations, whereby they effectively make things up.
Rami Luisto, healthcare AI lead data scientist at Digital Workforce, a provider of business automation and technology solutions, says SLMs offer greater transparency into their inner workings and outputs. “When explainability and trust are crucial, auditing an SLM can be much simpler compared to trying to extract reasons for an LLM’s behaviour,” he says.
While there is a lot of industry hype around agentic AI, a major barrier to using AI agents to automate complex workflows is that these systems are prone to errors, leading to incorrect decisions being automated. Accuracy will improve over time, but there is little evidence that enterprise applications are being developed with tolerance for the errors agentic AI systems may introduce.
In a recent Computer Weekly podcast, Anushree Verma, a director analyst at Gartner, noted that there is a shift towards domain-specific language models and lighter models that can be fine-tuned. Over time, it is likely these smaller AI models will work like experts to complement more general agentic AI systems, which may help to improve accuracy.
It is rather like someone who is not a specialist in a particular field asking an expert for advice – akin to the “phone a friend” lifeline in the TV game show Who Wants to Be a Millionaire?
DeepMind CEO Demis Hassabis envisages a world where multiple AI agents coordinate their activities to deliver a goal. So, while an SLM may have received its knowledge from an LLM through knowledge distillation, techniques such as RAG and its suitability for domain-specific optimisation mean the SLM may eventually be called on as an expert to help a more general LLM answer a domain-specific question.