Introduction
The realm of large language models (LLMs) is constantly expanding, and a new frontier has emerged: multimodal LLMs (MLLMs). These advanced models can process and understand both text and images, opening doors to revolutionary applications in natural language processing and computer vision. This paper review dives into “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training,” a recent study that investigates the key factors influencing MLLM performance.
The authors meticulously dissect the impact of design choices and data selection through a series of experiments. We’ll explore the paper’s methodology, including its use of ablations to isolate variables and analyze their influence; its architectural explorations of image encoders, vision-language connectors, and the critical role of pre-training data; and how the researchers scale their findings up to models with tens of billions of parameters. We’ll also analyze the comprehensive set of experiments used to validate the approach, encompassing both pre-training metrics and supervised fine-tuning tasks. Finally, we’ll assess the paper’s strengths, such as its valuable guidance for the research community and the scalability of its findings, along with potential areas for improvement, offering a well-rounded perspective on this significant contribution to the field of MLLMs.
Summary
Overview
The paper investigates which methodological and data choices lead to performant multimodal large language models (MLLMs). Specifically, the authors study the impact of image encoders, vision-language connectors, and pre-training data choices. Through the use of ablations, they identify the extent to which individual design choices increase or decrease model performance. By scaling their findings up to dense and Mixture-of-Experts (MoE) models with parameter counts in the tens of billions, the authors achieve state-of-the-art (SOTA) performance on pre-training metrics and performance comparable to SOTA methods after supervised fine-tuning (SFT) on a range of tasks. The authors call their family of models MM1.
Methods
The authors perform numerous carefully chosen ablations on the architectures, data, and training pipelines of the MLLMs they experiment with. In particular, they carry out these ablations at a smaller, less compute-intensive scale, draw conclusions at that scale, and then transfer those findings to large multi-billion-parameter architectures.
Architecture-wise, the authors explore ViT (Dosovitskiy et al., 2021) visual encoders of different sizes, trained with different objectives (contrastive versus reconstructive losses) and at different input image resolutions. Furthermore, they investigate which vision-language connector performs best among average pooling, attention pooling, and C-Abstractors (Cha et al., 2024). They also examine the optimal number of output tokens the visual encoder passes to the language model per image.
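To make the connector idea concrete, here is a minimal attention-pooling connector sketch in PyTorch. The dimensions and names are assumptions for illustration (ViT width 1024, LLM width 2048, 144 output tokens), not the authors’ implementation; average pooling or a C-Abstractor would replace the cross-attention step.

```python
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    """Cross-attends a fixed set of learned queries over ViT patch tokens,
    producing a fixed number of visual tokens for the language model.
    Illustrative sketch only; not the paper's exact architecture."""

    def __init__(self, vit_dim=1024, llm_dim=2048, num_queries=144, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vit_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, patch_tokens):  # patch_tokens: (B, N_patches, vit_dim)
        queries = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(queries, patch_tokens, patch_tokens)
        return self.proj(pooled)      # (B, num_queries, llm_dim)
```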
The pre-training data ablation study covers four types of data: captioned images, synthetic captioned images, interleaved image-text documents, and text-only data. The study examines which mixtures of these data types give the MM1 models optimal performance on zero-shot, 4-shot, and 8-shot tasks.
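As a rough illustration of how such a mixture could be realized in a training loop, the sketch below samples each example from one data source with probability proportional to a mixture weight. The source names and weights are placeholders, not the ratios reported in the paper.

```python
import random

# Placeholder mixture weights; not the ratios reported in the paper.
MIXTURE = {
    "captioned_images": 0.45,
    "synthetic_captions": 0.05,
    "interleaved_docs": 0.40,
    "text_only": 0.10,
}

def sample_batch(loaders, mixture, batch_size):
    """Draw each example from one data source, chosen with probability
    proportional to its mixture weight. `loaders` maps source name -> iterator."""
    sources, weights = zip(*mixture.items())
    return [next(loaders[random.choices(sources, weights=weights, k=1)[0]])
            for _ in range(batch_size)]
```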
To scale up their findings, the authors extrapolate from data points gathered in the smaller-scale experiments to the large-scale setting. In particular, they estimate the optimal peak learning rate for their large-scale runs by fitting a linear regression in log space as a function of the number of parameters. The authors also train an MoE variant of their language decoder to compare its performance with that of standard dense decoders.
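A sketch of this kind of extrapolation: fit a linear regression between the log of the parameter count and the log of the best peak learning rate found at small scale, then evaluate the fit at the target model size. The data points below are invented placeholders, not the paper’s measurements.

```python
import numpy as np

# Placeholder (parameter count, best peak LR) pairs from hypothetical small-scale sweeps.
params = np.array([9e6, 85e6, 302e6, 1.2e9])
peak_lr = np.array([2.0e-4, 1.3e-4, 9.0e-5, 6.0e-5])

# Fit log(lr) = a * log(N) + b on the small-scale points.
a, b = np.polyfit(np.log(params), np.log(peak_lr), deg=1)

def predict_peak_lr(n_params):
    """Extrapolate the fitted log-linear trend to a larger model size."""
    return float(np.exp(a * np.log(n_params) + b))

print(predict_peak_lr(30e9))  # estimated peak LR for a ~30B-parameter model
```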
Finally, the authors measure the performance on both pre-training metrics and supervised fine-tuning tasks.
Experiments
The paper presents a comprehensive set of experiments to validate its contributions. We can divide the types of experiments into two categories: model ablations and data ablations.
As model ablations, the authors experiment with the image encoder’s pre-training objective (reconstructive versus contrastive loss) and, in parallel, with different input image resolutions (224 versus 336 and 378). They find that image resolution matters most, with higher resolution yielding roughly a 3% performance improvement, while increasing the encoder from ViT-L to ViT-H adds about a 1% boost. The authors’ experiments on the type of vision-language connector (among average pooling, attention pooling, and C-Abstractor (Cha et al., 2024)) show that the connector type matters little; instead, the number of output tokens per input image, together with image resolution, is the dominant factor, with 144 tokens per image outperforming 64.
The authors’ pre-training data experiments ablate mixtures of: (1) interleaved and captioned data, (2) with and without text-only data, (3) image data (captioned and interleaved) and text-only data, and (4) VeCap (Lai et al., 2024), a synthetic caption dataset. They find that: (1) “interleaved data is instrumental for few-shot and text-only performance, while captioning data lifts zero-shot performance”, (2) “text-only data helps with few-shot and text-only performance”, (3) “careful mixture of image and text data can yield optimal multimodal performance and retain strong text performance”, and (4) “synthetic data helps with few-shot learning”. These are important findings for the MLLM field as a whole, since they can serve as guidelines for pre-training MLLMs for optimal performance. The authors further corroborate these findings with large-scale experiments on 3B, 7B, and 30B versions of MM1, measuring captioning and visual question answering performance against larger SOTA models. In few-shot evaluation, the MM1 family sets a new state of the art for pre-trained MLLMs, while in zero-shot evaluation it performs comparably to the SOTA.
The authors experiment further with supervised fine-tuning on top of the pre-trained models. The SFT mixture, roughly 1.45M examples, includes data collected following LLaVA’s approach (Liu et al., 2023). To support higher image resolutions, the authors use two techniques: positional embedding interpolation and sub-image decomposition. The 3B and 7B variants set a new SOTA, with the MoE versions performing better than their dense counterparts. The 30B variant beats SOTA methods such as Emu2-37B (Sun et al., 2023) and CogVLM-30B (Wang et al., 2024), and is on par with LLaVA-NeXT. The authors go a step further and assess the impact of pre-training on SFT performance, as well as few-shot chain-of-thought reasoning after SFT.
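Of the two resolution-scaling techniques, positional embedding interpolation can be sketched roughly as below: the ViT’s learned position grid is resized with bicubic interpolation so that a larger patch grid (i.e., a higher input resolution) can be encoded. This is a generic sketch that assumes no class token; it is not the authors’ exact code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_grid):
    """Resize learned ViT positional embeddings from an (old_grid x old_grid)
    patch grid to a (new_grid x new_grid) grid via bicubic interpolation.
    pos_embed: tensor of shape (1, old_grid * old_grid, dim); assumes no CLS token."""
    _, n_tokens, dim = pos_embed.shape
    old_grid = int(n_tokens ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid),
                         mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
```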
Ultimately, the experiments show that the MM1 family of models, with the architectural and data choices made, are performant models that achieve SOTA results on several fronts and remain on par with current methods where they do not. This extensive experimentation makes the study a valuable source of guidance for further research on MLLMs.
Strengths
- Thanks to its carefully set up ablations and extensive experimentation, the paper offers the research community valuable lessons and guidance for training MLLMs.
- The authors show that their optimal design and data choices hold at small scale and then scale them up to achieve SOTA performance, which further supports that these lessons transfer.
- The paper provides extensive qualitative evaluation examples, implementation details and further ablation details in the Appendix.
Weaknesses
- Although the paper gives a good justification for not carrying out a hyperparameter search on the large models and instead using a regression fit to estimate the optimal learning rate, an actual search over hyperparameters would solidify their impact on performance and make the results more credible.
- The paper discusses performance improvements in pre-training metrics and “SFT evaluation metrics”. However, there is no description of what these metrics are or why they are relevant to the research. Adding such a section would aid understanding of the paper and provide better clarity on the results.
- A more substantial conclusion/discussion section covering the limitations and implications of this study would help contextualize its results within the broader research landscape.
Rating and Justification
Weak accept.
The paper presents a thorough investigation into the design choices and data ablations in multimodal large language models (MLLMs), providing valuable insights for the research community. The careful setup of ablations and extensive experimentation demonstrates the robustness of the proposed approach, which is a significant strength. Furthermore, the scalability of the findings from smaller-scale experiments to achieve state-of-the-art performance on larger models adds credibility to the study.
However, there are weaknesses that need to be addressed. Firstly, while the paper justifies the use of a regression fit to estimate optimal learning rates for large models, conducting a hyperparameter search for large-scale models would enhance the credibility of the findings. Additionally, the lack of description and justification for the pretraining and supervised fine-tuning evaluation metrics is a gap that needs to be addressed for better clarity and understanding of the results. Finally, a more substantial conclusion and discussion section discussing limitations and implications would help contextualize the study within the broader research landscape.
In their rebuttal, the authors should address these weaknesses in order to increase the impact and credibility of their paper and get a better rating.