
Scalable Pre-Training of Large Autoregressive Image Models



Autoregressive Image Models (AIMs) are characterized by the probability distribution of an image, which is decomposed into a product of conditional distributions: each pixel or group of pixels is conditioned on the previously generated ones. This sequential modeling captures the dependencies between pixels.
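Written out, this is the standard chain-rule factorization of the image distribution, where x_k denotes the k-th pixel or patch in a fixed ordering (the notation here is generic rather than the paper's exact formulation):

```latex
% Autoregressive factorization of the image distribution:
% each pixel/patch x_k is conditioned on everything generated before it.
p(x) = \prod_{k=1}^{K} p\left(x_k \mid x_1, \dots, x_{k-1}\right)
```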

AIMs have become a key talking point in the world of machine learning due to a cutting-edge paper published by Apple Machine Learning. In this paper, researchers present a collection of models of different sizes, starting from a few hundred million parameters and going up to a few billion. The study examines how pre-training performance scales as the models grow in size.

In this article, we will explore the different experiments, datasets used, and derived conclusions. We will follow up with the latest research and methods utilized in the field.

About us: Viso Suite is a flexible and scalable infrastructure developed for enterprises to integrate computer vision into their tech ecosystems seamlessly. Viso Suite allows enterprise ML teams to train, deploy, manage, and secure computer vision applications in a single interface.

First, let’s walk through the basics of autoregressive modeling and its use in image modeling.

Autoregressive Models

Autoregressive models are a family of models that use historical data to predict future data points. They learn the underlying patterns in the data and the temporal dependencies between points to forecast what comes next. Popular examples include the Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) models. These are widely used for time-series forecasting, for example of sales and revenue.
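As a minimal illustration of the autoregressive idea on a time series, the sketch below fits an AR(p) model by ordinary least squares with NumPy and rolls it forward to forecast. The lag order p and the synthetic sine-wave data are arbitrary choices for the example, not anything from the paper:

```python
import numpy as np

def fit_ar(series: np.ndarray, p: int) -> np.ndarray:
    """Fit an AR(p) model y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p} by least squares."""
    # Build the lagged design matrix: each row holds the p previous values plus a bias term.
    X = np.column_stack([series[p - i - 1 : len(series) - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    y = series[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs  # [c, a_1, ..., a_p]

def forecast(series: np.ndarray, coeffs: np.ndarray, steps: int) -> list:
    """Roll the model forward, feeding each prediction back in as history."""
    history = list(series)
    p = len(coeffs) - 1
    preds = []
    for _ in range(steps):
        lags = history[-1 : -p - 1 : -1]           # most recent p values, newest first
        y_next = coeffs[0] + np.dot(coeffs[1:], lags)
        preds.append(y_next)
        history.append(y_next)
    return preds

# Toy usage on a synthetic noisy sine wave.
t = np.arange(200)
data = np.sin(0.1 * t) + 0.1 * np.random.randn(200)
coeffs = fit_ar(data, p=3)
print(forecast(data, coeffs, steps=5))
```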

Time-Series Forecasting Using ARIMA – Source
Autoregressive Image Models

Autoregressive Image Modeling (AIM) applies the same approach to images, with pixels or image patches as the data points. The approach divides the image into segments and treats them as a sequence. The model learns to predict the next image segment given the previous segments.

Popular models like PixelCNN and PixelRNN (Recurrent Neural Networks) use autoregressive modeling to predict visual data by examining existing pixel information. These models are used in applications such as image enhancement and upscaling, as well as in generative pipelines that create new images from scratch.

While these models directly operate on pixel values, integrating the concept of latent space or latent representations can enhance their performance and efficiency.
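To make the "image as a sequence" idea concrete, the sketch below splits an image into non-overlapping patches in raster order and pairs each prefix of patches with the next patch as a prediction target. The patch size, the ordering, and the toy image are illustrative assumptions, not the exact setup of any specific model:

```python
import numpy as np

def image_to_patch_sequence(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches
    in raster-scan order (left-to-right, top-to-bottom)."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    seq = []
    for r in range(rows):
        for c in range(cols):
            block = image[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch, :]
            seq.append(block.reshape(-1))           # flatten each patch into a 1-D "token"
    return np.stack(seq)                            # shape: (num_patches, patch*patch*C)

# Each training pair: all patches seen so far -> the next patch to predict.
img = np.random.rand(64, 64, 3)
seq = image_to_patch_sequence(img, patch=16)        # 16 patches, each of length 768
pairs = [(seq[:k], seq[k]) for k in range(1, len(seq))]
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
```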

Pre-training Large-Scale AIMs

Pre-training an AI model involves training a large-scale foundation model on an extensive and generic dataset. The training procedure can revolve around images or text depending on the tasks the model is intended to solve.

Autoregressive image models deal with image datasets and are pre-trained on popular datasets like MS COCO and ImageNet. The researchers at Apple used the DFN dataset introduced by Fang et al. Let’s explore the dataset in detail.

Dataset

The dataset comprises 12.8 billion image-text pairs filtered from Common Crawl. It is further filtered to remove not-safe-for-work content, blur faces, and remove duplicated images. Finally, alignment scores are calculated between the images and their captions, and only the top 15% of data elements are retained. The final subset contains 2 billion cleaned and filtered images, which the authors label DFN-2B.
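The final filtering step (keeping only the best-aligned image-caption pairs) can be sketched as a simple quantile threshold on alignment scores. The scores array below is a hypothetical stand-in for whatever scoring model (e.g. a CLIP-style encoder) the real data pipeline uses:

```python
import numpy as np

def keep_top_fraction(scores: np.ndarray, fraction: float = 0.15) -> np.ndarray:
    """Return a boolean mask selecting the top `fraction` of samples by alignment score."""
    cutoff = np.quantile(scores, 1.0 - fraction)   # e.g. the 85th-percentile score for fraction=0.15
    return scores >= cutoff

# Hypothetical usage: scores[i] is the image-text alignment score of pair i.
scores = np.random.rand(1_000_000)
mask = keep_top_fraction(scores, fraction=0.15)
print(mask.sum(), "pairs retained out of", len(scores))
```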

Architecture

The training approach is the same as that of standard autoregressive models. The input image is divided into K equal, non-overlapping patches, which are arranged in a fixed order to form a sequence. Each image patch acts as a token, and unlike in language modeling, the architecture deals with a fixed number of tokens per image.

Autoregressive Image Model pre-training architecture – Source

The image patches pass through a transformer architecture, which uses self-attention to model the relationships between them. All future tokens are masked during self-attention to ensure the model does not ‘cheat’ during training.
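This masking can be illustrated with a standard causal attention mask. The sketch below, in PyTorch, is a generic illustration rather than the paper's exact code: it builds a boolean mask that can be passed to a transformer encoder so that each patch only attends to itself and earlier patches (True marks disallowed positions, following PyTorch's convention):

```python
import torch

def causal_mask(num_patches: int) -> torch.Tensor:
    """Boolean mask where True marks positions that may NOT be attended to:
    query position k can only look at patches 0..k."""
    return torch.triu(torch.ones(num_patches, num_patches), diagonal=1).bool()

print(causal_mask(4))
# tensor([[False,  True,  True,  True],
#         [False, False,  True,  True],
#         [False, False, False,  True],
#         [False, False, False, False]])
```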

A simple multi-layer perceptron (MLP) serves as the prediction head on top of the transformer. This 12-block MLP projects the patch features to pixel space for the final predictions. The head is only used during pre-training and is replaced for downstream tasks according to task requirements.
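Below is a minimal sketch of such a pixel-space prediction head together with a pixel regression loss. The block structure, hidden sizes, toy shapes, and use of mean-squared error are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PixelHead(nn.Module):
    """Small MLP head that projects per-patch transformer features back to pixel space."""
    def __init__(self, dim: int, patch_pixels: int, blocks: int = 12):
        super().__init__()
        layers = []
        for _ in range(blocks - 1):
            layers += [nn.Linear(dim, dim), nn.GELU()]
        layers.append(nn.Linear(dim, patch_pixels))      # final projection to raw pixel values
        self.mlp = nn.Sequential(*layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.mlp(features)

# Toy shapes: batch of 8 images, 16 patches each, 768-dim features, 16x16x3 pixels per patch.
features = torch.randn(8, 16, 768)                       # output of the causally masked transformer
next_patch_pixels = torch.rand(8, 16, 16 * 16 * 3)       # ground-truth pixels of the patch to predict

head = PixelHead(dim=768, patch_pixels=16 * 16 * 3)
pred = head(features)
loss = nn.functional.mse_loss(pred, next_patch_pixels)   # regression loss in pixel space
loss.backward()
print(float(loss))
```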

Experimentation

Multiple variants of the Autoregressive Image Models were created by varying width and depth: the variants differ in the number of layers and the number of hidden units per layer. The combinations are summarized in the table below:

Model variations of AIM

Training was also conducted on datasets of different sizes, including the DFN-2B dataset discussed above and a combination of DFN-2B and ImageNet-1k (IN-1k) called DFN-2B+.

Training Results

The different model configurations were trained and their performance tracked across training iterations. The results are as follows:

  • Changing Model Size: The experiment shows that increasing the number of model parameters improves training performance. The loss decreases more quickly, and the models perform better as the parameter count grows.
Validation loss against different model sizes
  • Training Data Size: The AIM-0.6B model is trained on three dataset sizes to observe validation loss. The smallest dataset, IN-1k, starts with a lower validation loss that continues to decrease but bounces back after 375k iterations, suggesting that the model has begun to overfit.
    The larger DFN-2B dataset starts with a higher validation loss and decreases at a similar rate, with no sign of overfitting. The combined dataset (DFN-2B+) performs best, eventually surpassing IN-1k in validation loss without overfitting.
Validation loss against different dataset sizes
Conclusions

The experiments showed that the proposed models scale well in terms of performance. Training on a larger dataset (more images processed) performed better as iterations increased, and researchers observed the same effect with increasing model capacity (a higher number of parameters).

Training combinations against validation loss

Overall, the models displayed traits similar to those seen in Large Language Models, where larger models achieve better loss after numerous iterations. Interestingly, lower-capacity models trained on a longer schedule achieve validation loss comparable to that of higher-capacity models trained on a shorter schedule, while using a similar amount of FLOPs.

Performance Comparison on Downstream Tasks

Researchers compared the AIMs against a wide range of other generative and joint embedding models on several downstream tasks. The results are summarized in the table below:

Performance Comparison

AIM outperforms generative pre-training methods such as BEiT and MAE at the same or even larger capacity. It achieves performance comparable to joint embedding models like DINO and iBOT, and falls just behind the much more complex DINOv2.

Overall, the AIM family offers a strong combination of performance and scalability.

Real-World AIM Applications

Image Generation

AIMs can assist in the process of generating images. In a method called Visual Autoregressive Modeling (VAR), researchers from Peking University and ByteDance improved autoregressive image generation, allowing autoregressive transformers to learn visual distributions quickly and generalize well.

Visual samples generated with Visual Autoregressive Modeling (VAR) on the ImageNet dataset – Source
Image Inpainting

Image inpainting is the process of using AI to fill in missing pixels within images. The model predicts pixels based on an understanding of the surrounding visual context. Researchers from Nanyang Technological University and DAMO Academy (Alibaba Group) developed an approach introducing a novel Bidirectional Autoregressive Transformer (BAT) for image inpainting.

Drawing on Bidirectional Encoder Representations from Transformers (BERT), BAT improves the generation of missing pixels for better image completion.

Proposed image inpainting method from researchers at Nanyang Technological University and DAMO Academy, Alibaba Group – Source
Super-Resolution

Super-resolution is the process of using AI to increase the resolution of low-quality images. Researchers from the Cooperative Medianet Innovation Center at Shanghai Jiao Tong University and the Shanghai AI Laboratory proposed a method, LAR-SR, that generates realistic super-resolution images with a novel local autoregressive (LAR) module. The approach is efficient and optimizes the output with a consistent loss based on contextual pixel information.

Visualization of the data flow of LAR-SR for super-resolution – Source

AIM Review and Implementation

The Autoregressive Image Models (AIMs), released by Apple research, display state-of-the-art scaling capabilities. The models span a range of parameter counts, and each offers stable pre-training throughout.

These AIM models use a transformer architecture combined with an MLP head for pre-training. They learn from a cleaned-up dataset produced with Data Filtering Networks (DFN). The experiments tested different model sizes against different subsets of the training data, and in each scenario, pre-training performance scaled consistently with increasing model and data size.

The AIM models show exceptional scaling capabilities, as observed from their validation losses. They also display competitive performance against similar image generation and joint embedding models and strike a strong balance between speed and accuracy.

Further Autoregressive Image Model (AIM) Research

To dive deeper and stay up-to-date with the latest research on AIMs, we suggest checking out the following publications:

  • Paper: Scalable Pre-training of Large Autoregressive Image Models (2024). Researchers at Apple published this paper highlighting key findings on AIMs: first, how visual feature performance scales with model capacity and data quantity; second, how the value of the pre-training objective correlates with model performance on downstream tasks.
  • Paper: Autoregressive Image Generation without Vector Quantization (2024). This paper from researchers at MIT CSAIL, Google DeepMind, and Tsinghua University proposes modeling per-token probability distribution with a diffusion procedure. This procedure allows for the application of autoregressive models in a continuous-valued space.
  • Paper: Semantic-Aware Autoregressive Image Modeling for Visual Representation Learning (2023). Researchers from Megvii Technology introduced semantic-aware autoregressive image modeling to address the gap between autoregressive modeling and other self-supervised pre-training approaches in computer vision. The key idea is to autoregressively model images from more semantic patches toward less semantic patches.
  • Paper: Few-shot Autoregressive Density Estimation: Towards Learning to Learn Distributions (2018). Researchers from Google propose modifications to PixelCNN that demonstrate state-of-the-art performance, combining neural attention, meta-learning techniques, and autoregressive models for effective few-shot density estimation.

Applied Computer Vision

Viso Suite makes it possible for businesses to harness the power of computer vision for everyday processes. By offering services beyond model training and deployment, it lets organizations manage the entire machine learning pipeline, omitting the need for point solutions.

Viso Suite, the end-to-end computer vision solution

Learn more by booking a demo with our team of experts.