Master's thesis presentation. Aristotelis is advised by Mathias Sundholm (PreciTaste), Alexander Dolokov (PreciTaste), and Prof. Dr. Felix Dietrich.
SCCS Kolloquium
The SCCS Colloquium is a forum giving students, guests, and members of the chair the opportunity to present their research insights, results, and challenges. Do you need ideas for your thesis topic? Do you want to meet your potential supervisor? Do you want to discuss your research with a diverse group of researchers, rehearse your conference talk, or simply cheer for your colleagues? Then this is the right place for you (and you are also welcome to bring your friends along).
Upcoming talks
Aristotelis Tsoutsanis: Object recognition with LLMs
SCCS Colloquium
This thesis delves into the development of Vision-Language Models (VLMs) that utilize pre-trained backbones, aiming to make these models more efficient and accessible by reducing the computational resources needed for training. With the rise of Large Language Models (LLMs) in recent years, natural language processing has made remarkable progress, with models achieving near-human performance on a wide range of tasks. Meanwhile, visual recognition has remained a critical challenge in computer vision, playing a pivotal role in fields like robotics and autonomous driving. Vision-Language Models combine the strengths of the visual and textual modalities, enabling them to tackle complex tasks like image captioning and visual question answering with high accuracy.
In this research, we use a two-stage training approach: pre-training and fine-tuning. During pre-training, we transform image embeddings into the text embedding space using adapters, minimizing the Earth Mover's Distance between the image embedding distribution produced by the image encoder and the text embedding distribution of the LLM so that the two modalities align well. Because the LLM itself is not part of this stage, computational costs are significantly lower. In the fine-tuning stage, the LLM is brought back into the pipeline: we use a quantized version of the LLM and apply Low-Rank Adaptation (LoRA), meaning that instead of updating the full weight matrices, we train small low-rank matrices that approximate the necessary adjustments. We explore three types of adapters: a simple Multi-Layer Perceptron (MLP) adapter that provides a strong baseline, and two more sophisticated transformer-based adapters that use attention mechanisms to improve performance and cross-modal alignment. The first applies blocks of self-attention and feed-forward layers directly to the image tokens, while the second employs learnable queries that selectively extract the most relevant image information through self-attention and cross-attention.
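To make the pre-training stage concrete, here is a minimal PyTorch sketch of the baseline setup: an MLP adapter mapping image embeddings into the text embedding space, trained against an alignment loss. All names and dimensions (MLPAdapter, img_dim=1024, txt_dim=4096) are illustrative assumptions rather than the thesis's actual implementation, and since the exact Earth Mover's Distance formulation is not given here, the loss below uses entropy-regularized Sinkhorn iterations, a common differentiable approximation of EMD.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Maps frozen image-encoder embeddings into the LLM's text embedding space."""
    def __init__(self, img_dim=1024, txt_dim=4096, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, txt_dim),
        )

    def forward(self, img_emb):      # (..., img_dim) -> (..., txt_dim)
        return self.net(img_emb)

def sinkhorn_emd(x, y, eps=0.05, iters=50):
    """Entropy-regularized Sinkhorn approximation of the Earth Mover's
    Distance between two embedding sets x (n, d) and y (m, d)."""
    cost = torch.cdist(x, y, p=2)                   # (n, m) pairwise distances
    K = torch.exp(-cost / (eps * cost.mean()))      # Gibbs kernel, scaled for stability
    mu = torch.full((x.size(0),), 1.0 / x.size(0))  # uniform source marginal
    nu = torch.full((y.size(0),), 1.0 / y.size(0))  # uniform target marginal
    u, v = torch.ones_like(mu), torch.ones_like(nu)
    for _ in range(iters):                          # Sinkhorn fixed-point updates
        u = mu / (K @ v + 1e-8)
        v = nu / (K.T @ u + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)      # transport plan diag(u) K diag(v)
    return (plan * cost).sum()                      # approximate EMD

# Only the adapter receives gradients; the encoder outputs and the LLM's
# token embeddings (both random placeholders here) stay frozen.
adapter = MLPAdapter()
img_emb = torch.randn(32, 1024)   # stand-in for frozen image-encoder outputs
txt_emb = torch.randn(32, 4096)   # stand-in for frozen LLM text embeddings
loss = sinkhorn_emd(adapter(img_emb), txt_emb)
loss.backward()
```

The low-rank updates used in the fine-tuning stage can be sketched in the same spirit. The wrapper below illustrates the generic LoRA idea, a frozen base layer plus a trainable low-rank correction; the rank r and scaling alpha are arbitrary example values, not the thesis's settings.

```python
class LoRALinear(nn.Module):
    """Frozen linear layer plus trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A (r x d_in) and B (d_out x r)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)           # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```

Because B is initialized to zero, the wrapped layer initially behaves exactly like the frozen base layer, so fine-tuning departs smoothly from the pre-trained model.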
Our experiments, conducted on the MSCOCO dataset, show that these pre-trained adapters are effective for handling vision-language tasks. However, the fine-tuning phase is essential for refining the model's accuracy and its ability to generate well-structured responses. By omitting the LLM during pre-training, our approach makes it feasible for individuals and smaller organizations to work with multi-modal models, broadening access to this advanced technology. The pre-training alignment facilitates a smoother and more effective fine-tuning process, leading to faster convergence and better overall performance. Moreover, we fine-tuned our pipeline on the Food101 dataset for classification tasks in order to quantify the performance of our architecture.
In summary, this thesis addresses the challenges of scalability and accessibility in vision-language models. We demonstrate that our architecture, TerraAlign, can be trained efficiently for image captioning on the MSCOCO dataset and for classification on the Food101 dataset, with promising results.
Don't want to miss a talk? Subscribe to our mailing list and our Colloquium calendar.
Contribute a talk
To register and schedule a talk, fill in the Colloquium Registration form at least four weeks before your earliest preferred date. Keep in mind that we only have limited slots, so please plan your presentation early. In special cases, contact colloquium@mailsccs.in.tum.de.
Colloquium sessions are now on campus. We have booked room MI 00.13.054 for WS24/25. You can either bring your own laptop or send us your slides as a PDF ahead of time. The projector only has an HDMI connection, so please bring your own adapter if necessary.
Do you want to attend but cannot make it in person? We now have a hybrid option. Simply join us through this BBB room: https://bbb.in.tum.de/shu-phv-eyq-rad
We invite students doing their Bachelor's or Master's thesis, as well as an IDP, Guided Research, or similar project at SCCS, to give one 20min presentation discussing their results and potential future work. This typically takes place after you have submitted your final text. Also check with your study program regarding any requirements for a final presentation of your project work.
New: We now have slots for presenting early-stage projects (talk time 2-10min). This is an optional opportunity to get additional feedback early, and there is no strict timeline.
Apart from students, we also welcome doctoral candidates and guests to present their projects.
During the colloquium, things usually go as follows:
- 10min before the colloquium starts, the speakers set up their equipment with the help of the moderator. The moderator is currently Ana Cukarska. Make sure to use an easily identifiable name in the online session's waiting room.
- The colloquium starts with an introduction to the agenda and the moderator asks the speaker's advisor/host to put the talk into context.
- Your talk starts. The scheduled time for your talk is normally 20min, with an additional 5-10min for discussion.
- During the discussion session, the audience can ask questions, which are meant for clarification or for putting the talk into context. The audience can also ask questions in the chat.
- Congratulations! Your talk is over and it's now time to celebrate! Have you already tried the parabolic slides that bring you from the third floor to the Magistrale?
Do you remember a talk that made you feel very happy for attending? Do you also remember a talk that confused you? What made these two experiences different?
Here are a few things to check if you want to improve your presentation:
- What is the main idea that you want people to remember after your presentation? Do you make it crystal clear? How quickly do you arrive at it?
- Which aspects of your work can you cover in the given time frame, with a reasonable pace and good depth?
- What can you leave out (but maybe have as back-up slides) to not confuse or overwhelm the audience?
- How are you investing the crucial first two minutes of your presentation?
- How much content do you have on your slides? Is all of it important? Will the audience know which part of a slide to look at? Will somebody from the last row be able to read the content? Will somebody with limited experience in your field have time to understand what is going on?
- Are the figures clear? Are you explaining the axes or any other features clearly?
In any case, make sure to start preparing your talk early enough so that you can potentially discuss it, rehearse it, and improve it.
Here are a few good videos to find out more:
- Simon Peyton Jones: How to Give a Great Research Talk (see also How to Write a Great Research Paper)
- Susan McConnell: Designing effective scientific presentations
- Jens Weller: Presenting Code
Did you know that the TUM English Writing Center can also help you with writing good slides?
Work with us!
Do your thesis/student project in Informatics / Mathematics / Physics: Student Projects at the SCCS.