Apple Launches MM1: Its First Series of Multimodal Language Models

Apple’s MM1 Model: Advancing AI with Multimodal Language Capabilities

Apple engineers have shared insights through a research paper on Multimodal Large Language Models (MLLMs). The document details the creation of a family of MLLMs, named MM1, with up to 30 billion parameters. The models stand out for their ability to generate image captions, answer visual questions, and understand natural language.

Though Apple has not yet launched an AI model of its own, the paper offers a glimpse into the company's progress toward models with advanced multimodal capabilities.

Titled “MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training”, the paper describes a family of advanced AI models created by Apple, called MM1. The models excel at creating captions for images, answering questions about pictures, and understanding natural language. The authors attribute much of this to carefully curated pairs of images and captions used in pre-training, which yields strong results even when only a few examples are available to learn from.

MM1 also distinguishes itself by following instructions that span multiple images and by interpreting complex scenes.

With 30 billion parameters, MM1’s capacity is threefold that of the vision component in OpenAI’s GPT-4, enabling sophisticated multimodal interactions.

The model was refined through extensive training on a collection of 500 million mixed image-text documents, encompassing 1 billion images and 500 billion text tokens. This extensive and varied pre-training allows MM1 to make notable in-context predictions and adhere to specific formatting after being shown just a few examples.
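To make the idea of few-shot, in-context prediction concrete, here is a minimal Python sketch of how an interleaved image-text prompt could be assembled. The function and field names are illustrative assumptions, not part of Apple's published code or the MM1 paper.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class FewShotExample:
    image: Any    # a decoded image, e.g. a PIL.Image or a tensor of patches
    caption: str  # the target text the model should imitate

def build_interleaved_prompt(examples: List[FewShotExample], query_image: Any) -> List[Any]:
    """Interleave (image, caption) demonstrations with a new query image,
    mirroring the interleaved image-text documents used in pre-training.
    The model is expected to continue the pattern and produce a caption
    in the same format for the final image."""
    prompt: List[Any] = []
    for ex in examples:
        prompt.append(ex.image)
        prompt.append(f"Caption: {ex.caption}\n")
    prompt.append(query_image)
    prompt.append("Caption:")  # the model completes this for the query image
    return prompt
```

Because the pre-training data interleaves images and text in a similar way, a handful of demonstrations like this is often enough for a model to pick up both the task and the expected output format.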

MM1’s Capabilities

1- Counting objects

2- Performing Optical Character Recognition (OCR) in designated image areas

3- Applying logical reasoning to objects

4- Executing basic arithmetic operations

Developing AI models that combine visual and reasoning abilities involves a vision-language connector, a component that maps visual features into a representation the language model can process alongside text. The study found that the design of this connector played a relatively minor role in MM1's performance, whereas image resolution and the number of image tokens were more influential.
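As an illustration of what a vision-language connector does, here is a minimal PyTorch sketch of one common pattern: a small MLP that projects patch features from a vision encoder into the language model's embedding space. The layer sizes and the two-layer MLP design are assumptions for illustration; the MM1 paper evaluates several connector variants and reports that this choice matters less than image resolution and the number of image tokens.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative sketch of a vision-language connector: it takes patch
    features from a vision encoder and projects them into the LLM's
    token-embedding space so they can be interleaved with text tokens.
    Dimensions here are placeholders, not MM1's actual configuration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_image_tokens, vision_dim)
        # returns:        (batch, num_image_tokens, llm_dim)
        return self.proj(patch_features)

# Example: 256 patch features from a ViT-style encoder become 256 "soft
# tokens" that the language model treats like ordinary text embeddings.
connector = VisionLanguageConnector()
image_tokens = connector(torch.randn(1, 256, 1024))
```

The 256 image tokens in the example correspond to the kind of knob the paper identifies as influential: more tokens give the language model a finer-grained view of the image, at the cost of a longer input sequence.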

Apple’s openness in sharing its findings with the AI community marks a significant step. The researchers aim to document the process of building MLLMs and offer design insights that can aid others in the field.

The disclosed findings are expected to guide the approach of other MLLM developers in terms of architecture and pre-training data choices.

It remains to be seen how Apple plans to use the MM1 models in its products. However, the capabilities they demonstrate could greatly improve how Siri works, possibly adding the ability to understand and process images in the future.

Large Language Models (LLMs) give machines the ability to understand and generate human language, but they still face limitations and open challenges. For a deeper dive into these challenges, take a look at this article.
