Multimodal LLMs on a Mac M1: A Quick Test

Prashant Dandriyal
8 min read · Jul 20, 2024


Pic Credit: Pawel Czerwinski

With large language models taking center stage in the A.I. drama, I am compelled to get hands-on, with the condition that I eventually run these experiments on my Mac. This forces me to get familiar with the optimisation methods these models need. I took up MiniCPM (a multimodal LLM) for the experiments. Note that this is more of a self-journal than a comprehensive “LLMs 101…” blog gated behind “Towards Data Sci…”.

Overview

Unlike the GPT-popularised chatbot-based LLMs, Multimodal LLMs (MLLMs) are designed for vision-language understanding. These models take an image and text as inputs and produce high-quality text outputs. The foundation for such multimodal LLMs was laid by LLaVA. To quote its description:

LLaVA is a multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirits of the multimodal GPT-4.

It offers 3 variants of weights: 7B, 13B, and 34B. To get a sense of the memory requirements for different model sizes, here is the model table used by Ollama:

| Model              | Parameters | Size  | Download                       |
| ------------------ | ---------- | ----- | ------------------------------ |
| Llama 3 | 8B | 4.7GB | `ollama run llama3` |
| Llama 3 | 70B | 40GB | `ollama run llama3:70b` |
| Phi 3 Mini | 3.8B | 2.3GB | `ollama run phi3` |
| Phi 3 Medium | 14B | 7.9GB | `ollama run phi3:medium` |
| Gemma | 2B | 1.4GB | `ollama run gemma:2b` |
| Gemma | 7B | 4.8GB | `ollama run gemma:7b` |
| Mistral | 7B | 4.1GB | `ollama run mistral` |
| Moondream 2 | 1.4B | 829MB | `ollama run moondream` |
| Neural Chat | 7B | 4.1GB | `ollama run neural-chat` |
| Starling | 7B | 4.1GB | `ollama run starling-lm` |
| Code Llama | 7B | 3.8GB | `ollama run codellama` |
| Llama 2 Uncensored | 7B | 3.8GB | `ollama run llama2-uncensored` |
| LLaVA | 7B | 4.5GB | `ollama run llava` |
| Solar | 10.7B | 6.1GB | `ollama run solar` |

Note: You should have at least 8 GB of RAM available to run the 7B models, 16 GB to run the 13B models, and 32 GB to run the 33B models.

Unfortunately, my Mac has 8GB max, so we look for a compatible solution. My friend Shashwat shared openbmb/MiniCPM-Llama3-V-2_5.

MiniCPM-Llama3-V 2.5

MiniCPM-Llama3-V 2.5, the latest in the MiniCPM-V series, has 8B parameters and is built on SigLip-400M and Llama3-8B-Instruct. It excels in performance, scoring 65.1 on OpenCompass and surpassing models like GPT-4V-1106 and Claude 3. It features strong OCR capabilities, handling images of up to 1.8 million pixels. The model shows trustworthy behaviour with a low hallucination rate of 10.3%. It supports over 30 languages, offers efficient deployment with significant speed improvements, and is user-friendly with multiple usage options, including llama.cpp and Ollama support, GGUF-format models, and easy WebUI setup. You can test it yourself using this Hugging Face Space. Let’s understand the tweakable parameters that let us tune the model’s generation behaviour (a short sketch of how they map onto actual generation calls follows the list):

MiniCPM-Llama3-V 2.5 demo

Decode Type:

  • Beam Search: This is a search algorithm used in sequence-to-sequence models like language models to find the most likely sequence of tokens that form a coherent sentence. It works by exploring different paths through the search space and selecting the one with the highest cumulative probability.
  • Sampling: This is a method where the model generates text by sampling from the distribution over possible next tokens, often used to produce more diverse outputs.

Beam Search:

  • Num Beams: The number of beams refers to the number of hypotheses that are being explored simultaneously during beam search. A higher number can lead to more diverse and potentially better outputs but increases computational cost.
  • Repetition Penalty: This parameter controls the likelihood of the model repeating sequences. A higher penalty discourages repetition, which can improve coherence and fluency in generated text.

Sampling:

  • Top P: This parameter controls the diversity of samples via nucleus sampling. With a value of 0.8, the model samples only from the smallest set of tokens whose cumulative probability reaches 80%, which tends to keep the generated text fluent and less repetitive.
  • Temperature: This is a hyperparameter that controls the randomness of the model’s output. Lower temperatures (like 0.7) favor more certain, less diverse outputs, while higher temperatures (like 1.05) can lead to more diverse outputs but may also introduce more errors.
  • Repetition Penalty: This is similar to the Repetition Penalty mentioned earlier, controlling the likelihood of repetition in generated text.
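
To make these knobs concrete, here is a minimal, hedged sketch of both decode types using the Hugging Face generate() API. The model name below (gpt2) is just a small text-only placeholder to keep the example self-contained; MiniCPM’s own chat() wrapper exposes similar arguments, which is an assumption on my part rather than something taken from its source.

# Hedged sketch: the demo's decoding parameters expressed as standard
# Hugging Face generate() arguments, on a small placeholder model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Describe the scene:", return_tensors="pt")

# Decode type 1: beam search keeps num_beams hypotheses and returns the
# highest-scoring one; repetition_penalty > 1.0 discourages repeats.
beam_ids = model.generate(
    **inputs,
    do_sample=False,
    num_beams=3,
    repetition_penalty=1.05,
    max_new_tokens=50,
)

# Decode type 2: sampling draws tokens from the truncated distribution;
# top_p=0.8 is nucleus sampling, temperature < 1.0 sharpens the distribution.
sample_ids = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.8,
    temperature=0.7,
    repetition_penalty=1.05,
    max_new_tokens=50,
)

print(tokenizer.decode(sample_ids[0], skip_special_tokens=True))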

Now that we are satisfied with the demo, let’s try to run it locally. I ran two variants of it: the Python version (on GPU + Linux) and the C++ version (on macOS).

Linux + dGPU

I start with a compute-rich device and later shift to the Mac M1 to compare performance. The first setup is Ubuntu 20 + an NVIDIA GeForce RTX 3090. The setup is simple (refer to the HuggingFace page) and the demo script was as follows:

import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True, torch_dtype=torch.float16)
model = model.to(device='cuda')

tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-Llama3-V-2_5', trust_remote_code=True)
model.eval()

image = Image.open('xx.jpg').convert('RGB')
question = 'What is in the image?'
msgs = [{'role': 'user', 'content': question}]

res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,  # if sampling=False, beam_search will be used by default
    temperature=0.7,
    # system_prompt='' # pass system_prompt if needed
)
print(res)

## if you want to use streaming, please make sure sampling=True and stream=True
## the model.chat will return a generator
res = model.chat(
    image=image,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    temperature=0.7,
    stream=True
)

generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

Running the default openbmb/MiniCPM-Llama3-V-2_5 weights downloads around 25 GB worth of shards. I was not able to run it on GPU cards with 16 GB of memory, so I switched to a 24 GB card + 128 GB RAM. For 10 iterations, it ran with the latencies below.
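
The numbers were collected with a simple timing loop around model.chat; roughly the following sketch (hypothetical, reusing model, image, msgs, and tokenizer from the script above rather than my exact script):

import time

# Hypothetical timing loop: reuses model, image, msgs, tokenizer from above.
for _ in range(10):
    start = time.time()
    res = model.chat(
        image=image,
        msgs=msgs,
        tokenizer=tokenizer,
        sampling=True,
        temperature=0.7,
    )
    print(f"elapsed {time.time() - start}s")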

$ python3 llama.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [03:31<00:00, 30.17s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
elapsed 16.026109218597412s
elapsed 1.6472110748291016s
elapsed 1.647573709487915s
elapsed 1.3239448070526123s
elapsed 0.790363073348999s
elapsed 1.5080006122589111s
elapsed 1.8222713470458984s
elapsed 0.8295838832855225s
elapsed 0.8264422416687012s
elapsed 1.380028486251831s

So: a long time to load the weights and for the first inference, after which the calls settle to roughly 0.8–1.8 s.

Now that we want to run it on the Mac (no dGPU and only 8 GB of memory), there is no chance this Python script + full-precision weights will fit. Instead, we resort to the C++ route, for which we have two options: Ollama vs. llama.cpp.

Background on both projects

Llama.cpp

Llama.cpp is an open-source project that provides a C++ implementation for running large language models (LLMs) efficiently. Key features include:

  • Performance: Optimized for fast inference, allowing models to run on various hardware, including less powerful devices like laptops and smartphones.
  • Compatibility: Supports a wide range of open transformer-based models (the LLaMA family, GPT-2-style architectures, and many more) via its GGUF format.
  • Flexibility: Enables customization and fine-tuning, making it suitable for research and development purposes.
  • Portability: Cross-platform support, ensuring it can run on different operating systems and architectures.

Ollama

Ollama is a different project, typically focused on providing specific models or a platform for deploying LLMs. Key features may include:

  • Model Deployment: Streamlines the process of deploying language models in production environments, including cloud-based or on-premise solutions.
  • User Interface: Often includes user-friendly interfaces for managing models, making it accessible to users with varying technical expertise.
  • Integration: May offer integrations with other tools and platforms, facilitating seamless workflows for data processing and analysis.
  • Support and Maintenance: Usually backed by a team that provides regular updates, support, and maintenance, ensuring reliability and performance in production settings.

Key Differences

  • Purpose: Llama.cpp focuses on providing an efficient, open-source implementation for running LLMs, whereas Ollama might be geared towards model deployment and management in production environments.
  • Flexibility vs. Usability: Llama.cpp offers more flexibility for researchers and developers to experiment with models, while Ollama emphasizes ease of use and streamlined deployment processes.
  • Support and Maintenance: Ollama might come with professional support and regular updates, while Llama.cpp relies on community contributions and open-source development practices.

As Ollama is more suited to deployment, let’s go with llama.cpp, which gives us more flexibility.

Inference via Llama.cpp

Setup can be understood from the official README. To summarize:

git clone -b minicpm-v2.5 https://github.com/OpenBMB/llama.cpp.git
cd llama.cpp
git checkout minicpm-v2.5
make
make minicpmv-cli

When I try to use the 4-bit quantized weights of openbmb/MiniCPM-Llama3-V-2_5 (ggml-model-Q4_K_M.gguf + mmproj-model-f16.gguf), the memory proves to be insufficient, so we load the 2-bit quantized weights instead (ggml-model-Q2_K.gguf + mmproj-model-f16.gguf), which take between 1 and 1.5 GB of memory, with no discrete GPU! 🤯

By the way, the weights are downloaded from here:

Model card to download the 2-bit quantized weights
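
If you prefer scripting the download, here is a hedged sketch using huggingface_hub; the repo id openbmb/MiniCPM-Llama3-V-2_5-gguf and the exact file names are assumptions on my part, so double-check them against the model card:

from huggingface_hub import hf_hub_download

# Assumed repo id and file names; verify them on the model card before use.
repo_id = "openbmb/MiniCPM-Llama3-V-2_5-gguf"
w1 = hf_hub_download(repo_id, "ggml-model-Q2_K.gguf")   # 2-bit quantized LLM weights
w2 = hf_hub_download(repo_id, "mmproj-model-f16.gguf")  # f16 vision projector
print(w1, w2)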

Runtime

On M1

w1=/path/to/ggml-model-Q2_K.gguf
w2=/path/to/mmproj-model-f16.gguf
i=/path/to/test_image.png
./minicpmv-cli -m $w1 --mmproj $w2 -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image $i -p "What is in the image?"
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1
ggml_metal_init: picking default device: Apple M1
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/prashantdandriyal/Documents/miniCPM/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M1
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3 (5001)
ggml_metal_init: simdgroup reduction support = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 5726.63 MB
llama_kv_cache_init: Metal KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.49 MiB
llama_new_context_with_model: Metal compute buffer size = 296.00 MiB
llama_new_context_with_model: CPU compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2

minicpmv_init: llava init in 48.41 ms.
process_image: image token past: 0
llama_tokenize_internal: Added a BOS token to the prompt as specified by the model but the prompt also starts with a BOS token. So now the final prompt starts with 2 BOS tokens. Are you sure this is what you want?
process_image: image token past: 892

minicpmv_init: llama process image in 15464.84 ms.
<user>What is in the image?
<assistant>
The image features a baby raccoon sitting in the grass.

llama_print_timings: load time = 16255.34 ms
llama_print_timings: sample time = 25.07 ms / 13 runs ( 1.93 ms per token, 518.51 tokens per second)
llama_print_timings: prompt eval time = 16200.64 ms / 903 tokens ( 17.94 ms per token, 55.74 tokens per second)
llama_print_timings: eval time = 1121.31 ms / 12 runs ( 93.44 ms per token, 10.70 tokens per second)
llama_print_timings: total time = 17515.17 ms / 915 tokens
ggml_metal_free: deallocating

So, about 16 s for initialization and roughly 1.26 s for the inference itself (total time minus load time); not bad.

Conclusion

We were able to run our first multimodal LLM on the Mac with impressive performance (low latency). Next, I need to dive a little deeper into Gen-AI theory, Apache TVM, and llama.cpp for optimising weights. Stay tuned.
