    NLP & LLMs Cognitum course

    Master NLP & LLMs with Cognitum course

    Dive into the essentials of Natural Language Processing and Large Language Models. This course offers a hands-on approach to NLP workflows, tokenization, and the Hugging Face ecosystem, building your expertise step-by-step. Perfect for software engineers eager to master NLP.

    Book NLP & LLM Workshop for your team!

    Across all parts of the course, the main goal is to familiarize participants with NLP workflows and models, including LLMs, from a software engineering perspective, introducing mathematical theory only when strictly necessary.

    Because the course was designed to be presented as a whole, each day builds on the knowledge gained in the previous ones, further expanding the participants’ knowledge of NLP. However, sufficiently prepared or experienced participants should be able to skip certain days.

    Don’t miss this opportunity to enhance your NLP skills!

    Natural Language Processing and Large Language Models workshops

    Session 1 – Introduction to NLP & LLMs

    Day 1 – Tokenization & introduction to NLP models

    Agenda

    • Introduction to how computers represent text, specifically in Python – Unicode code points and encodings
    • Tokenization
      – Defining and exploring different types of tokenization, including word, character and, most importantly, subword tokenization
      – Introduction to Hugging Face’s transformers library and how it performs tokenization
      – Definition and treatment of special tokens and showcase of different transformers’ tokenizers
      – Training a custom tokenizer with transformers and SentencePiece
    • Language Modeling
      – What neural networks consist of and how they work with text
      – Word embeddings
      – Transformer architecture and self-attention
      – Masked and Causal language modeling
    • Hugging Face Ecosystem
      – How to work with the full range of functionality that Hugging Face provides

    Description

    This day is a general introduction to NLP and the context in which its tasks are performed. The main goals are to discover the steps that should be performed before feeding data to a neural network, to discuss how AI models work with textual data, and to introduce participants to the Hugging Face ecosystem and its libraries, which are a staple of NLP workflows.
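As a small illustration of the first agenda item, the sketch below looks at one string as characters, Unicode code points and UTF-8 bytes, and applies two naive tokenizations. The sample string and variable names are made up for this example and are not taken from the course materials.

```python
# Toy example: one string seen as characters, code points and UTF-8 bytes.
# The accented letters are written as escapes so the source is unambiguous.
text = "na\u00efve caf\u00e9"  # "naive cafe" with two accented letters

code_points = [ord(ch) for ch in text]  # one integer per Unicode character
utf8_bytes = text.encode("utf-8")       # variable-length byte encoding

char_tokens = list(text)    # character-level tokenization
word_tokens = text.split()  # naive whitespace word tokenization

# 10 characters but 12 bytes: each accented letter takes two bytes in UTF-8.
print(len(text), len(utf8_bytes))
print(code_points)
print(word_tokens)
```

The gap between character count and byte count is one reason tokenizers have to be explicit about whether they operate on bytes, characters or larger units.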

    Level

    This day’s course can be divided into two parts: the introduction to the various tools and the Hugging Face ecosystem, which is more of a high-level demonstration than a challenge, and the more advanced sections such as Tokenization, Training Custom Tokenizers, Word Embeddings, and Transformer Architecture, some of which involve more theory than others. All newly introduced terms and techniques are explained from scratch, leaving no one in the dark.

    Because of the introductory character of the first part, its difficulty is low, rising to medium in the more advanced sections.

    Target Participants

    This day is best suited to a beginner audience with little to no NLP experience, since participants get to know a little bit of everything, from tokenization, through the transformers library and its use cases, to the whole Hugging Face ecosystem.

    However, even advanced users already acquainted with the presented tools may find some of the more theoretical sections interesting and valuable, since understanding them is vital for successful participation in the following days of the course.

    Day 2 – Text Classification

    Agenda

    • Working closer with the datasets
      – Learn how to split the data into training, validation and test splits to then use them in training and evaluation.
    • Getting to know the most important metrics for binary classification
      – We’ll evaluate our models with metrics such as Accuracy, Precision and Recall, while learning what they bring to the table when gauging a model’s performance and dealing with imbalanced datasets.
    • Learning about different ways of text classification – we’ll go through several methods used for text classification from the most basic and classical ones to the newest and best solutions including:
      – Masked language modeling
      – Zero-shot classification via Natural Language Inference
      – Logistic Regression
      – Multi-layer perceptron – simple neural network architecture
      – Fine-tuning a pre-trained transformer model
      – Explainability of text classification

    Throughout the various stages of the workshop we will introduce ways to deal with imbalanced datasets, both during training and during evaluation.

    Description

    On this day we take another step into the world of NLP, focusing on one of the most versatile tasks in the field – text classification. We learn about the metrics with which you can measure a model’s performance, discover ways to work with imbalanced datasets and, most importantly, explore different ways to classify text.
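To make the point about metrics and imbalanced data concrete, here is a minimal sketch in plain Python. The labels and the two sets of predictions are invented for illustration; the helper `binary_metrics` is a hypothetical name, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Convention for this sketch: 0.0 when nothing was predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Imbalanced toy data: only 2 of 10 examples are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
always_negative = [0] * 10                        # trivial majority-class baseline
some_model = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]       # finds one of the two positives

baseline_scores = binary_metrics(y_true, always_negative)
model_scores = binary_metrics(y_true, some_model)
print(baseline_scores)
print(model_scores)
```

Both predictors reach the same accuracy here, while precision and recall separate them, which is why accuracy alone can be misleading on imbalanced datasets.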

    Level

    The difficulty of this part of the course varies between medium (for sections such as the introduction of metrics, Masked Language Modeling, or working with datasets) and hard (managing class imbalance, using SHAP, and especially working with the PyTorch framework to implement the MLP).

    Target Participants

    In this part of the course we expect the more advanced participants to thrive, as we introduce more complex ways of working with neural networks, including designing their architecture ourselves with PyTorch.

    Less experienced users should also find plenty of interest, such as new metrics, fine-tuning models with the transformers API, or working with datasets.

    Day 3 – Token Classification

    Agenda

    • Discussion of various real-world applications of Token Classification and demo of keyphrase extraction
    • Part of Speech Tagging
      – Quick revision of PoS tagging task,
      – Data Preprocessing: preparing data for a basic Token Classification problem,
      – Model Training,
      – Evaluation.
    • Named Entity Recognition (NER)
      – IOB Input Data Format,
      – Data Preprocessing: tackling a more complex Token Classification problem,
      – Model Training,
      – Monitoring the training using TensorBoard,
      – Evaluation using seqeval,
      – NER Visualization using transformers pipelines and displaCy.
    • Optional task
      – Advanced data preprocessing: Transforming an unstructured dataset to IOB format.

    Description

    In this workshop, we delve into various Token Classification problems, highlighting their main challenges, potential pitfalls, and strategies to overcome them. We’ll cover preprocessing techniques for data, and methods to evaluate results tailored to specific tasks, including leveraging several new libraries.
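The IOB format mentioned in the agenda can be sketched in a few lines. The example below assumes entities are given as token-index spans `(start, end, label)` with an exclusive end; the function name `spans_to_iob` and the sample sentence are hypothetical and only serve the illustration.

```python
def spans_to_iob(tokens, spans):
    """Convert token-level entity spans into IOB tags.

    Each span is (start, end, label) with `end` exclusive, indexing into `tokens`.
    """
    tags = ["O"] * len(tokens)          # "O" marks tokens outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # continuation tokens of the entity
    return tags


tokens = ["Maria", "Sklodowska", "Curie", "was", "born", "in", "Warsaw"]
spans = [(0, 3, "PER"), (6, 7, "LOC")]
tags = spans_to_iob(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Real datasets usually add one more step that this sketch leaves out: aligning word-level tags with the subword tokens produced by the model's tokenizer.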

    Level

    This workshop’s difficulty can safely be rated as medium, as it mostly leverages what participants should already be familiar with from the transformers library and pure Python, while providing more detail on the theoretical side of things.

    Target Participants

    We expect this part of the course to appeal to both beginner and advanced participants, as we delve deeper into possible use cases of NLP and discover another of its tasks, while mostly using the transformers API that was described in detail in the previous parts of the course.

    For those more interested in the technicalities of model training and evaluation, we introduce monitoring tools and a new evaluation library.

    Day 4 – Sequence-to-sequence models

    Agenda

    • Examples of s2s problems
    • Full encoder-decoder architecture
    • Metrics used for s2s problems
      – BLEU,
      – ROUGE,
      – BERTScore,
      – SemScore.
    • Sequence-to-sequence (s2s) models as universal NLP models
    • Text generation algorithms
      – greedy decoding
      – beam search
      – output sampling
      – computational complexity
    • Fine-tuning s2s models for
      – question answering
      – dialog summarization
    • Considerations regarding input/output lengths.

    Description

    We continue our journey through the NLP field using transformer models. This time we focus on the variants that leverage the full transformer architecture, also known as seq2seq models. We will explore two of the tasks at which those models excel, question answering and text summarization, learn how to measure the quality of text generated by a model, and try to create a multi-task model able to perform both of these tasks.
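The difference between greedy decoding and beam search from the agenda can be shown on a toy example. The sketch below replaces a real seq2seq model with a hand-written table of next-token probabilities that depend only on the previous token; the table, the token names and both function names are assumptions made for this illustration.

```python
import math

# Hypothetical next-token probabilities, conditioned only on the previous token.
PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.55, "dog": 0.45},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def greedy_decode(probs, start="<s>", end="</s>", max_len=10):
    """Always pick the single most probable next token."""
    seq, logp = [start], 0.0
    while seq[-1] != end and len(seq) < max_len:
        token, p = max(probs[seq[-1]].items(), key=lambda kv: kv[1])
        seq.append(token)
        logp += math.log(p)
    return seq, logp


def beam_search(probs, beam_width=2, start="<s>", end="</s>", max_len=10):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq[-1] == end:              # finished sequences are carried over
                candidates.append((seq, logp))
                continue
            for token, p in probs[seq[-1]].items():
                candidates.append((seq + [token], logp + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end for seq, _ in beams):
            break
    return beams[0]


greedy_seq, greedy_logp = greedy_decode(PROBS)
beam_seq, beam_logp = beam_search(PROBS, beam_width=2)
print(greedy_seq, round(math.exp(greedy_logp), 2))
print(beam_seq, round(math.exp(beam_logp), 2))
```

In this table the locally best first token leads to a less probable full sequence, so beam search finds a better output than greedy decoding at the cost of scoring more candidates per step.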

    Level

    Given its similarity to the previous workshop, this one is also of medium difficulty. Like the former, it contains significant theoretical content, while also touching upon important technical aspects of the systems solving those tasks, once again relying heavily on the transformers library.

    Target Participants

    As mentioned in the ‘Level’ section, due to its similarities to the Day 3 workshop, we expect this part of the course to be enjoyable and valuable for participants at all levels of experience.

    While keeping the formula similar to the previous workshop, we provide new information by introducing a whole new set of NLP problems, classical and innovative metrics for evaluating the models that solve them, and methods of text generation, which are important for yet another architecture variant of Transformers.

    Day 5 – Introduction to LLMs

    Agenda

    • Introduction to using LLMs with transformers
    • Definition and showcase of zero-shot and few-shot learning techniques
    • Comparison of performance of the LLM and dedicated smaller models on tasks from previous days
      – Sequence Classification
      – Named Entity Recognition
      – Summarization
    • Few-shot learning in detail
      – How it improves model performance
      – Its effect on the technical side of the system: inference time, sequence length and memory footprint
    • Text generation strategies in LLMs

    Description

    Moving on from the two previous days, we introduce the participants to the concept of LLMs, prompt engineering, and the zero-shot and few-shot learning techniques, shifting from fine-tuning a smaller, dedicated model for each task to using one LLM with fitting prompts for all of them.

    We compare results achieved by those two approaches on various tasks from previous days and explore newly introduced aspects regarding LLMs.
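A rough sketch of how zero-shot and few-shot prompts differ, and why few-shot prompting lengthens the input sequence, is given below. The prompt layout, the `build_prompt` helper and the review texts are all hypothetical; real prompt formats depend on the model being used.

```python
def build_prompt(instruction, examples, query):
    """Assemble a classification prompt; `examples` is a list of (text, label)."""
    lines = [instruction, ""]
    for text, label in examples:            # labelled demonstrations (few-shot)
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    lines += [f"Review: {query}", "Sentiment:"]   # the model completes this line
    return "\n".join(lines)


instruction = "Classify the sentiment of each review as positive or negative."
examples = [
    ("Great battery life.", "positive"),
    ("Stopped working after a week.", "negative"),
]
query = "The screen is too dim."

zero_shot = build_prompt(instruction, [], query)
few_shot = build_prompt(instruction, examples, query)

# Whitespace word counts as a crude stand-in for token counts.
print(len(zero_shot.split()), len(few_shot.split()))
print(few_shot)
```

Every demonstration added to the prompt is processed again on each request, which is the source of the inference time and memory costs listed in the agenda.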

    Level

    This workshop acts as an introduction to the wide field of LLMs, so its difficulty level is low. We aim to make the transition from smaller models to LLMs as gentle and easy as possible, slowly but surely building the foundation for the more advanced techniques and aspects of leveraging the biggest neural networks.

    Target Participants

    While we hope that everyone can gain some useful knowledge from this part of the course, it is mostly aimed at those who have barely worked with Large Language Models, as this workshop takes participants one level down, from using the online UIs of LLMs to being able to tweak a model’s hyperparameters and try some new, albeit simple, prompting techniques.

    Session 2 – Large Language Models

    Day 1 – Large Language Models principles

    Agenda

    • Introduction to Large Language Models, revision and expansion
      – Pre-trained LLMs
      – Open-source LLMs
      – LLM settings
      – LLM pipeline
    • Basics of Prompt Engineering in detail
      – Zero-shot prompting
      – Designing prompts
      – Common tasks
    • In-Context Learning
      – Few-shot learning
      – Zero-shot vs. Few-shot
      – Efficient few-shot learning
      – Self-Generated In-Context Learning (SG-ICL)

    Description

    In this workshop we want to refresh and expand participants’ knowledge of the fundamental concepts of LLMs, covering in detail the techniques introduced on the fifth day of the first session and adding some new ideas and methods, all with the aim of communicating with LLMs more efficiently and thus increasing our control over their work.

    Level

    As this workshop continues and expands on the fifth day of the first session, its difficulty increases over the course of the notebook, from easy revision of the earlier material to medium when new concepts are introduced or more detail is given on ones already covered.

    Target Participants

    Once again using the last workshop from the first session as a reference point, we consider this part of the course suitable for those with limited knowledge in the field of LLMs. It should prove especially useful for those who skipped the previous workshop altogether.

    That being said, we did our best to develop this course in such a way that even more advanced participants can learn something new from every day and every session.

    Day 2 – Prompt Engineering mastery

    Agenda

    In this workshop, we explore various techniques and strategies for effective prompt crafting, ensuring that our communications harness the full potential of these powerful tools. We can split those approaches into two main categories based on their intent.

    • Improve Reasoning and Logic
      – Chain-of-Thought (CoT) Prompting
      – Contrastive Chain-of-Thought (CCoT) Prompting
      – Self-Consistency
      – Tree of Thoughts (ToT) Prompting
      – Self-Ask (SA) Prompting
      – System 2 Attention (S2A) Prompting
      – Plan-and-Solve (PS) Prompting
      – Thread-of-Thought (ThoT) Prompting
      – Tabular Chain-of-Thought (Tab-CoT) Prompting
      – Program-of-Thoughts (PoT) Prompting
    • Reduce Hallucination
      – Temperature
      – Re-reading (RE2) Prompting
      – Self-Evaluation (SE) Prompting
      – Chain-of-Verification (CoVe) Prompting
      – Self-Refine (SR) Prompting
      – Rephrase and Respond (RaR) Prompting
      – Retrieval-Augmented Generation (RAG)
      – Reason and Act (ReAct) Prompting

    Description

    As we continue our journey into the intricacies of artificial intelligence, Day 2 shifts focus to the art and science of Prompt Engineering. This critical skill set involves crafting specific inputs that guide Large Language Models (LLMs) to generate desired outputs with higher precision and relevance. Prompt Engineering is not merely about asking questions; it’s about formulating them in a way that aligns closely with the model’s training and capabilities. Understanding this can significantly enhance the quality of interactions with LLMs, enabling more accurate and contextually appropriate responses.
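Temperature, listed in the agenda above, is easy to illustrate numerically: the model's logits are divided by the temperature before the softmax. The sketch below uses made-up logits for three candidate tokens; the function name is an assumption for this example.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                            # subtract max for stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                          # invented scores for 3 tokens
cold = softmax_with_temperature(logits, 0.5)      # sharper distribution
neutral = softmax_with_temperature(logits, 1.0)   # plain softmax
hot = softmax_with_temperature(logits, 2.0)       # flatter distribution

for name, dist in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in dist])
```

A lower temperature concentrates probability on the top token, so sampled outputs become more predictable; a higher one spreads it out and makes sampling more varied.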

    Level

    As we move on to more and more complex prompt engineering techniques, the level also increases from fairly easy (e.g. CoT prompting), through intermediate (like Tab-CoT), to really hard (see RAG and ReAct).

    Target Participants

    Since the difficulty and sophistication of the described techniques increase through the workshop, we expect every participant to find their own niche and learn something new from it. While remaining within the scope of the transformers library on the technical side, this part of the course is more focused on theory, since we introduce a lot of prompt engineering techniques.

    Day 3 – Agents, Benchmarks and Fine-tuning

    Agenda

    • LLMs as Agents
      – Naive ReAct Agent
      – Introduction to LangChain
      – LangChain Agent
    • Benchmarking LLMs
      – lm-evaluation-harness
    • Instruction fine-tuning (1B model) with LoRA and bitsandbytes

    Description

    Diving deeper into the field of LLMs, in this workshop we explore several new aspects of it: LLMs acting as Agents, and how to evaluate and fine-tune LLMs, since their size and complexity set them far apart from small task-specific models, creating both theoretical and technical challenges.
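The control flow of a naive ReAct-style agent can be sketched without any real model. In the example below the LLM is replaced by a scripted function, and the text protocol (`Action: tool[argument]`, `Observation:`, `Final Answer:`), the tool and all names are assumptions made for illustration rather than the format used in the course or in LangChain.

```python
import re

ACTION_RE = re.compile(r"Action:\s*(\w+)\[(.*?)\]")
FINAL_RE = re.compile(r"Final Answer:\s*(.*)")


def run_react(llm, tools, question, max_steps=5):
    """Alternate model replies and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)                 # model sees the whole transcript
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), transcript
        action = ACTION_RE.search(reply)
        if action is None:                      # nothing to execute: give up
            break
        name, arg = action.groups()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        transcript += f"Observation: {observation}\n"
    return None, transcript


def scripted_llm(transcript):
    """Stand-in for a real LLM: first asks for a lookup, then answers."""
    if "Observation:" not in transcript:
        return "Thought: I need the capital.\nAction: lookup[Poland]"
    return "Thought: I have what I need.\nFinal Answer: Warsaw"


CAPITALS = {"Poland": "Warsaw"}
tools = {"lookup": lambda country: CAPITALS.get(country, "unknown")}

answer, transcript = run_react(scripted_llm, tools, "What is the capital of Poland?")
print(transcript)
print("Answer:", answer)
```

Swapping `scripted_llm` for a call to an actual model would keep the loop unchanged, although parsing free-form model output is usually less reliable than this scripted case suggests.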

    Level

    Due to the introduction of new frameworks and tools, and because we take on new, more complex tasks, the difficulty of this workshop is estimated as hard.

    Target Participants

    Since we delve into new, more sophisticated and technical territory, we recommend this part of the course to more advanced participants, especially those with experience of ML frameworks such as LangChain.

    Day 4 – Fine-tuning and Alignment

    Agenda

    • Efficient fine-tuning of LLMs
      – PEFT library
      – LoRA
      – QLoRA
      – GaLore
      – P-tuning
      – Prompt tuning
    • Alignment
      – PPO-based approach
        ▸ LLM fine-tuning
        ▸ Reward model
        ▸ Policy optimization
      – DPO
        ▸ LLM (frozen)
        ▸ Direct optimization

    Description

    In this workshop we dig even deeper into the fine-tuning of LLMs, focusing on the efficiency of the process and on another important factor, alignment, with the aim of improving this very important part of LLM deployment.
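Why LoRA counts as efficient fine-tuning comes down to simple arithmetic: the original weight matrix stays frozen and only two low-rank factors are trained. The sketch below counts trainable parameters for a single square projection matrix; the 4096 x 4096 shape and rank 8 are illustrative values, not figures from the course.

```python
def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two LoRA factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8                 # illustrative sizes only
full_params = d_out * d_in                        # updating the whole matrix
lora_params = lora_trainable_params(d_out, d_in, rank)

print(f"full fine-tuning: {full_params:,} parameters")
print(f"LoRA (rank {rank}): {lora_params:,} parameters")
print(f"fraction trained: {lora_params / full_params:.4%}")
```

The saving grows with the size of the matrix, because the LoRA count scales with the sum of the two dimensions rather than their product.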

    Level

    With the introduction of yet another set of tools and approaches that build on the knowledge gained in the prior days (especially the third day of the second session), the level remains hard, especially for those with limited coding experience and those who skipped the previous day.

    Target Participants

    As the difficulty matches that of the previous workshop, we must emphasize that this part of the course is prepared with advanced participants in mind, those who aim to customize their LLM-based solutions.

    Day 5 – Quantization, inference libraries, multi-GPU

    Agenda

    • Quantization methods
    • 1.58-bit models
    • Training vs. inference
    • Hardware support for low-precision models
    • Other speed-up methods
      – DeepSpeed
      – CPU off-loading
      – NVMe off-loading
    • Parallel computation (optional)
      – DDP
      – FSDP
      – 3D parallel computation
      – DeepSpeed configuration
    • Wrap up

    Description

    On the final day of the second session we move even lower in the technology stack of LLM-based systems. This time we talk about quantization methods and the differences between training and inference, and explore methods that aim to speed up both.
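One of the simplest quantization schemes, symmetric absmax quantization to 8-bit integers, can be sketched in plain Python. This is only one of many methods and is not necessarily the one covered in the workshop; the weights and function names are invented. The last lines show where the figure 1.58 bits comes from: a weight restricted to three values carries log2(3) bits.

```python
import math


def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one scale for the whole list."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest else 1.0    # avoid dividing by zero
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.03, 1.0]                 # made-up weights
quantized, scale = quantize_absmax(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(quantized, scale)
print(f"largest round-trip error: {max_error:.6f}")

# A ternary weight takes one of three values, so it carries log2(3) bits.
bits_per_ternary_weight = math.log2(3)
print(f"{bits_per_ternary_weight:.2f} bits per ternary weight")
```

The round-trip error of each weight is bounded by half the scale, which is why a single outlier with a large absolute value degrades the precision of every other weight sharing that scale.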

    Level

    As we get closer and closer to the hardware beneath the ML, slowly venturing into the MLOps field, the level of difficulty increases again, making this workshop really hard, especially on the technical side.

    Target Participants

    We recommend this workshop to those deeply interested in building and deploying systems that leverage LLMs. Previous experience with the aforementioned tools is crucial.

    Session 3 – Retrieval Augmented Generation

    Day 1 – RAG Demo

    Agenda

    • RAG overview
      • What is RAG?
      • Why would you use it?
    • LangChain introduction
      • What it is and the advantages it brings
    • Retrieval fundamentals
      • Processing documents
      • Vector stores and similarity search
    • Context-based generation
      • Prompt engineering for RAG
      • Formatting in RAG chain
      • Components in the chain
    • RAG Evaluation
      • Retrieval evaluation
      • Generation evaluation
      • Evaluation pipeline

    Description

    Throughout this session we will venture into a specific, highly sophisticated area of LLM usage called Retrieval Augmented Generation (RAG).

    We will learn in detail what this fancy-sounding term really means, how we can leverage such a solution in the real world, what the difference is between using a plain LLM and a RAG system, what components such a tool consists of, and how to build it.
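The two halves of a RAG system, retrieval and context-based generation, can be outlined with a toy example. The sketch below substitutes bag-of-words counts for learned embeddings and a Python list for a vector store, and it stops at building the prompt instead of calling an LLM; the documents, function names and prompt wording are all assumptions.

```python
import math
from collections import Counter

DOCS = [
    "LoRA adds low-rank adapters to frozen weights",
    "Beam search keeps several candidate sequences",
    "Tokenizers split text into subword units",
]


def embed(text):
    """Toy embedding: word counts (a real system would use a neural encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]


def build_rag_prompt(query, context_docs):
    """Place retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "how does beam search work"
top_docs = retrieve(query, DOCS, k=1)
prompt = build_rag_prompt(query, top_docs)
print(prompt)
```

The prompt is what a plain LLM call lacks: the model is asked to answer from retrieved passages rather than from its parameters alone.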

    Day 2 – RAG components & evaluation

    Agenda

    • Synthetic dataset generation
      • Data processing
      • Questions and answers generation
      • Dataset validation
      • Dataset filtering
      • Verification
    • Vector database
      • Setup
      • Static and dynamic filtering
      • Verification
    • Retrieval evaluation metrics

    Description

    After getting to know the basics of Retrieval Augmented Generation yesterday, we’re ready to dive deeper into its architecture. We’ve already built a demo from pre-prepared parts, but what if those elements don’t satisfy our needs? In this session we will explore the most popular technologies and frameworks that allow us to create our own components.
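Two widely used retrieval evaluation metrics, recall@k and mean reciprocal rank (MRR), are sketched below. The example assumes exactly one relevant document per question, in which case recall@k is the same as the hit rate; the document ids and function names are invented.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant document appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(results, relevant, k):
    """Average both metrics over all questions."""
    n = len(results)
    return {
        "recall@k": sum(recall_at_k(r, rel, k) for r, rel in zip(results, relevant)) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in zip(results, relevant)) / n,
    }


# Retrieved ids per question (best first) and the single relevant id for each.
results = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = ["d1", "d2", "d0"]

at_1 = evaluate(results, relevant, k=1)
at_3 = evaluate(results, relevant, k=3)
print(at_1)
print(at_3)
```

Recall@k only asks whether the right document was found, while MRR also rewards placing it near the top, so the two can move independently when a retriever is changed.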

    Day 3 – Improving retrieval

    Agenda

    • Baseline implementation
      • Test set
      • Corpus
      • Document processing
      • Indexing
      • Evaluation
      • Error analysis
      • Question cleaning
    • Chunking
    • Hybrid search
      • Lexical and hybrid search
      • Custom tokenizer
      • Weaviate hybrid search
    • Reranking
      • Dataset preparation
      • Fine-tuning cross-encoder
      • Evaluation
    • Bi-encoder
      • Fine-tuning
      • Evaluation
      • Hard negatives

    Description

    We will start with building a simple embedding-based retrieval which will serve as a baseline. We will then evaluate it on the test set and analyze what kind of errors it makes. This will help us understand the limitations of the simple retrieval model and the data we are working with. We will also explore how the choice of chunking strategy can affect the retrieval performance. Next, we will go back to the basics and learn about lexical search and when it can be used to improve retrieval. Then, we will add another component to our retrieval pipeline: the cross-encoder. We will learn how to use it and how it can improve the retrieval performance. Finally, we will come back to the embedding-based retrieval and see how to fine-tune it to further improve the performance.
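Hybrid search needs some way to merge a lexical ranking with an embedding-based one. Reciprocal rank fusion (RRF) is one common option and is shown here as an assumption, not as the method used in the workshop; the document ids are invented and `k=60` is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


lexical_ranking = ["doc_a", "doc_b", "doc_c"]     # e.g. from keyword search
dense_ranking = ["doc_b", "doc_c", "doc_a"]       # e.g. from embedding search

fused = reciprocal_rank_fusion([lexical_ranking, dense_ranking])
print(fused)
```

Because RRF uses only ranks, it avoids having to put lexical scores and embedding similarities on a common scale before combining them.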

    Day 4 – Improving generation

    Agenda

    • Evaluation metrics
      • Reference-free metrics
      • Ground-truth metrics
      • Custom generative metrics
    • Improving context
      • Generation
      • Evaluation
      • Number of chunks
      • Lost in the middle
      • Filtering
      • Extraction
    • Fine-tuning the generator
      • Question selection
      • Few-shot generator
      • Efficient fine-tuning
      • Evaluation
    • Complex generation techniques
      • Beyond naive RAG
      • Verification
      • Routing
      • Tools
    • Security
      • Prompt injection
      • Semantic guardrails

    Description

    We will start with an overview of the metrics used to evaluate the generated responses. In particular, we will focus on how to use Large Language Models (LLMs) to analyze the generated responses and compare them to the ground truth. Then, we will do a short recap and build a simple RAG system which will serve as a baseline. We will evaluate it on the test set and analyze what kind of errors it makes. In the next stage, we will explore the context created from documents returned by the retrieval model. We will analyze how the quality of the context affects the generation performance. Next, we will fine-tune the generation model to align it to expected answers and improve the generation performance. Finally, we will explore different extensions to the RAG model.
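As one example of a ground-truth metric, the sketch below computes a token-level F1 score between a generated answer and a reference answer, in the style commonly used for extractive question answering. It is a simple baseline shown under that assumption, not the LLM-based evaluation described above; the function name and sample answers are invented.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Warsaw"
verbose_score = token_f1("The capital is Warsaw", reference)
exact_score = token_f1("Warsaw", reference)
wrong_score = token_f1("Paris", reference)
print(verbose_score, exact_score, wrong_score)
```

The verbose but correct answer is penalized for its extra words, which shows the limits of surface-overlap metrics and part of the motivation for using LLMs as judges.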

    Session 4 – Annotation, data and MLops (details to be announced)

    Launch your LLM project with a proven deep-tech partner

    With Cognitum, you can confidently launch your project. Ensure scalability, security, performance and design with our product experts on your side.

    Get in touch

    FAQ

    What are Large Language Models (LLMs)?

    Large Language Models (LLMs) are a type of machine learning model for natural language processing. They’re trained on a large amount of text data and can generate human-like text based on the input they’re given.

    What is a private LLM?

    A private LLM is a large language model used exclusively by a specific organization. This guarantees data security and privacy, as the model and its associated data are not shared with other entities.

    Are LLMs secure?

    Yes, especially when you use private LLMs. These models are not shared with other entities, ensuring your data remains secure and complies with your stringent data policies.

    Can LLMs be integrated with existing systems?

    Yes, LLMs can be seamlessly integrated with clients’ environments such as databases, websites, mobile apps, messaging apps, customer support platforms, and more.

    How do I get started with implementing LLMs?

    To start implementing LLMs, reach out to us at Cognitum. We’ll discuss your specific needs and how our solutions can help you achieve your goals.

    What is a Generative AI Application?

    A Generative AI application is a type of artificial intelligence that creates new content. It learns the patterns and structure of its training data and then generates new data with similar characteristics.

    Your certified partner!

    Empower your projects with Cognitum, backed by the assurance of our ISO 27001 and ISO 9001 certifications, symbolizing elite data security and quality standards.