LLM (Large Language Model) – a general name for models such as the ones behind ChatGPT or GigaChat.

They are developing quickly and can already answer a lot, but there are two problems:

  • updates are relatively rare (less often than once a year?), so the built-in knowledge goes stale
  • they contain far from all of the world’s knowledge, which matters especially when you need to use non-public data

So we arrive at the need to pass extra knowledge to a model. There are several approaches:

  • simply add it “manually” to the query (for example, upload a text document and ask the model to do something with it) – see the first sketch after this list
  • RAG (Retrieval-Augmented Generation) – a layer that improves the query by automatically adding relevant data from some database; also covered in the first sketch below. More details in the next note.
  • Fine-tuning – we collect data and retrain the model on it. Mostly practical for local models; cloud models generally can’t be retrained by the user. A rough sketch follows after this list.
  • CAG (Cache-Augmented Generation) – an analog of fine-tuning for cloud models: they have a large context window, so all the prepared data is loaded into it and kept (“cached”). It works like a very long query, except you don’t have to resend it by hand every time and it has already been processed. A sketch follows below as well.
  • MCP (Model Context Protocol) – we give the model the ability to call external functions, for example a calculator or a “get current exchange rates” service, and that way “feed” it the data it needs. An example is https://context7.com/ : “Use context7” is added to the query, and the model itself fetches up-to-date information about the framework in question (API, examples). A toy server sketch closes out the examples below.
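
To make the first two approaches concrete, here is a minimal sketch assuming an OpenAI-style chat client and a hypothetical `search_db()` retrieval function (both names are illustrative, not any specific product):

```python
# Sketch only: `client` stands for any OpenAI-style chat client,
# `search_db` for any retrieval backend (vector DB, full-text search, ...).

def ask_manually(client, question: str, document: str) -> str:
    """Approach 1: paste the document into the query by hand."""
    prompt = f"Document:\n{document}\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_with_rag(client, search_db, question: str) -> str:
    """Approach 2 (RAG): a layer retrieves relevant snippets and adds them automatically."""
    snippets = search_db(question, top_k=3)  # hypothetical retriever
    prompt = "Use only this context:\n" + "\n\n".join(snippets) + f"\n\nQuestion: {question}"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The only difference between the two is who assembles the context: the user by hand, or the retrieval layer automatically.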
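
For fine-tuning a local model, a common lightweight route today is LoRA adapters via the `peft` library. A rough sketch under the assumption of a small instruct model and a toy one-example dataset (a real run needs thousands of examples and tuned hyperparameters):

```python
# Sketch of LoRA fine-tuning a local model; assumes `pip install transformers peft datasets`.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder: any small local model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Attach small trainable LoRA adapters instead of retraining all the weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Toy in-house "knowledge"; in reality this would be thousands of examples.
texts = ["Q: What does error code E-417 mean in our firmware?\nA: Sensor calibration drift."]
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("out/lora-adapter")  # only the adapter (a few MB) is saved
```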
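
CAG then boils down to “bake the whole knowledge base into one long, identical prefix and reuse it on every call”, letting the provider cache the repeated part. A minimal sketch, again with an OpenAI-style client and made-up file names (the exact caching mechanics differ between vendors):

```python
# Sketch: the whole prepared knowledge base becomes one long, constant system message.
# Because the prefix is identical on every call, providers that support prompt caching
# can reuse the already-processed part instead of paying for it again.
from pathlib import Path

FILES = ["handbook.md", "faq.md", "pricing.md"]  # illustrative file names
knowledge_base = "\n\n".join(Path(p).read_text(encoding="utf-8") for p in FILES)

CACHED_PREFIX = [{
    "role": "system",
    "content": "Answer using only the reference material below.\n\n" + knowledge_base,
}]

def ask(client, question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=CACHED_PREFIX + [{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```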
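
And a toy MCP server exposing a single tool (current exchange rates, as in the example above), using the FastMCP helper from the official Python SDK; the rates are hard-coded only to keep the sketch self-contained:

```python
# Toy MCP server exposing one tool; requires the official SDK: pip install "mcp[cli]"
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("rates")

@mcp.tool()
def exchange_rate(currency: str) -> float:
    """Return the current exchange rate for the given currency code."""
    rates = {"USD": 1.0, "EUR": 0.92, "RUB": 95.0}  # hard-coded demo data
    return rates.get(currency.upper(), 0.0)

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so an MCP-capable client can attach it
```

An MCP-capable client (an IDE assistant, a chat app, and so on) connects to such a server and lets the model decide when to call the tool.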

What is used and when:

  • “manually”: when it’s a one-off task or all the data fits into the context
  • CAG: when all the data fits into the context and the task is frequent, so caching pays off in cost
  • RAG: some systems are already built on it, but, as I see it, MCP looks better
  • Fine-tuning: used most often for images – to teach a model to recognize specific new object classes (for example, identify the exact plant species by its leaves) or to generate images with particular objects (avatars, again). It can also be used for LLMs, for example to teach a certain communication style, but that is not very popular.
  • MCP: the current replacement for RAG, when non-public data is collected in some database, it doesn’t fit into the context, and therefore you need the ability to pull in only part of the data rather than all of it so the model can use it. With ordinary local LLMs (around 14–24B parameters) it works poorly, so it’s promising but not yet universal.

There is also a class of wrapper systems around LLMs:

  • anonymizers: the task is to strip personal and other sensitive data before sending a query to a cloud LLM, and then de-anonymize the answer – substitute the specific name back in place of NAME1. A sketch follows after this list.
  • NSFW (not safe for work – usually meaning porn) filtering: fighting unwanted content. The system decides whether to block the query entirely or reformulate it to comply with “community” rules and legislation, and checks the generated answer the same way.
  • fighting hallucinations: double-checking whether the model’s answer is adequate (usually with the help of another model); a sketch of this also follows below.
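
To illustrate the anonymizer idea, a minimal sketch: mask sensitive values before sending, keep the mapping, and substitute the originals back into the answer. A production system would use NER rather than a single regex; this one only handles e-mail addresses:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def anonymize(text: str):
    """Replace e-mails (in a real system also names, phones, ...) with placeholders."""
    mapping = {}
    def repl(match):
        key = f"EMAIL{len(mapping) + 1}"
        mapping[key] = match.group(0)
        return key
    return EMAIL_RE.sub(repl, text), mapping

def deanonymize(text: str, mapping: dict) -> str:
    """Substitute the original values back into the model's answer."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = anonymize("Write a reply to ivan.petrov@example.com about the invoice.")
# masked == "Write a reply to EMAIL1 about the invoice."  -- only this goes to the cloud LLM
answer = "Dear EMAIL1, the invoice has been resent."       # pretend model output
print(deanonymize(answer, mapping))                        # the real address is restored
```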
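
The hallucination check (and, in much the same way, an NSFW check) is usually just one more model call: ask a second model whether the first answer is supported by the source material. A hedged sketch with the same assumed OpenAI-style client:

```python
def looks_grounded(client, question: str, answer: str, context: str) -> bool:
    """Ask a (possibly different) model whether the answer follows from the context."""
    check_prompt = (
        "Context:\n" + context +
        "\n\nQuestion: " + question +
        "\nAnswer: " + answer +
        "\n\nDoes the answer follow from the context? Reply with YES or NO only."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; often a different model than the one that answered
        messages=[{"role": "user", "content": check_prompt}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```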

It’s clear that wrappers add delays and cost.