Meta has released the first version of I-JEPA, a machine learning model that learns abstract representations of the world through self-supervised learning on images. The model requires little to no human-labeled data, performs strongly on computer vision tasks, and needs far fewer computing resources for training than other state-of-the-art models.

Self-supervised learning

The idea of self-supervised learning is inspired by the way humans and animals learn. We obtain much of our knowledge simply by observing the world, and the hope is that AI systems can likewise learn from raw observations, without humans labeling their training data. Self-supervised learning has already made great inroads in some fields of AI, including generative models and large language models (LLMs).

Joint embedding predictive architecture (JEPA)

In 2022, Meta’s chief AI scientist Yann LeCun proposed the “joint embedding predictive architecture” (JEPA), a self-supervised architecture intended to learn world models and background knowledge such as common sense. JEPA differs from other self-supervised approaches in important ways.

Generative models such as DALL-E and GPT are designed to make granular predictions. But trying to fill in every bit of missing information is problematic: the world is unpredictable at that level of detail, so the model wastes capacity choosing among many equally plausible completions. JEPA instead tries to learn and predict high-level abstractions, such as what the scene must contain and how objects relate to each other. This makes the model less error-prone and much less costly, because it learns in the latent space of the environment rather than in pixel space.
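To make that contrast concrete, here is a minimal PyTorch sketch of the two objectives. The networks are stand-ins (simple linear layers, not the architectures any real model uses), but the difference in what the loss is computed over is the point:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in networks, not a real model's architecture.
encoder = torch.nn.Linear(784, 128)    # maps an image region to a latent vector
predictor = torch.nn.Linear(128, 128)  # predicts the latent of the missing region
decoder = torch.nn.Linear(128, 784)    # only needed for pixel-level generation

x_context = torch.randn(32, 784)  # visible part of the image (flattened)
x_target = torch.randn(32, 784)   # masked part the model must account for

# Generative objective: reconstruct every pixel of the missing region.
pixel_loss = F.mse_loss(decoder(predictor(encoder(x_context))), x_target)

# JEPA-style objective: predict only the abstract representation of the
# missing region, so unpredictable pixel-level detail never enters the loss.
with torch.no_grad():
    z_target = encoder(x_target)
latent_loss = F.mse_loss(predictor(encoder(x_context)), z_target)
```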

The I-JEPA model

I-JEPA is an image-based implementation of LeCun’s proposed architecture. It predicts missing information by using “abstract prediction targets for which unnecessary pixel-level details are potentially eliminated, thereby leading the model to learn more semantic features.”

I-JEPA encodes existing information using a vision transformer (ViT), a variant of the transformer architecture used in LLMs but modified for image processing. It then passes on this information as context to a predictor ViT that generates semantic representations for the missing parts.
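A rough sketch of that flow is below. It uses simplified stand-ins throughout: tiny transformer encoders instead of full ViTs, random per-patch masking instead of I-JEPA's block-wise masking, and a fixed mask token where the real model uses learned mask tokens with positional information.

```python
import copy
import torch
import torch.nn.functional as F

def tiny_transformer(dim=64, depth=2):
    # Stand-in for a ViT: a plain transformer encoder over patch embeddings.
    layer = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
    return torch.nn.TransformerEncoder(layer, num_layers=depth)

context_encoder = tiny_transformer()             # encodes the visible patches
predictor = tiny_transformer(depth=1)            # predicts missing representations
target_encoder = copy.deepcopy(context_encoder)  # in I-JEPA, an EMA copy (no gradients)

patches = torch.randn(8, 16, 64)   # a batch of 8 images, 16 patch embeddings each
mask = torch.rand(8, 16) < 0.5     # which patches are hidden (random here)

# Targets: representations of the full image from the target encoder.
with torch.no_grad():
    targets = target_encoder(patches)

# Context: visible patches only; hidden positions get a mask token.
mask_token = torch.zeros(64)
visible = torch.where(mask.unsqueeze(-1), mask_token, patches)
context = context_encoder(visible)

# The predictor fills in representations at the masked positions, and the
# loss compares them with the target encoder's outputs at those positions.
predicted = predictor(context)
loss = F.mse_loss(predicted[mask], targets[mask])
loss.backward()
```

In the published method, the target encoder's weights are an exponential moving average of the context encoder's, which helps prevent the trivial solution of collapsing every patch to the same representation.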

The researchers at Meta trained a generative model that creates sketches from the semantic data that I-JEPA predicts. The results show that I-JEPA’s abstractions match the reality of the scene. While I-JEPA will not generate photorealistic images, it can have numerous applications in fields such as robotics and self-driving cars, where an AI agent must be able to understand its environment and handle a few highly plausible outcomes.

Efficiency and applications

One obvious benefit of I-JEPA is its memory and compute efficiency. The pre-training stage does not require the compute-intensive data augmentation techniques used in other types of self-supervised learning methods. The researchers were able to train a 632 million-parameter model using 16 A100 GPUs in under 72 hours, about a tenth of what other techniques require.

Their experiments show that I-JEPA also requires much less fine-tuning to outperform other state-of-the-art models on computer vision tasks such as classification, object counting, and depth prediction. They fine-tuned the model on the ImageNet-1K image classification dataset with only 1% of the training data, roughly 12 to 13 labeled images per class.
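For illustration, a common way to adapt a pre-trained encoder with so few labels is to freeze it and train only a lightweight classification head on top. A minimal sketch under that assumption follows; the encoder and data here are placeholders, not Meta's released model or evaluation code:

```python
import torch
import torch.nn.functional as F

# Placeholder for a frozen, pre-trained I-JEPA encoder.
encoder = torch.nn.Linear(768, 768)
for p in encoder.parameters():
    p.requires_grad = False

# Linear probe: the only trainable part.
probe = torch.nn.Linear(768, 1000)  # 1000 ImageNet-1K classes
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Tiny labeled subset, standing in for roughly 12-13 images per class.
features = torch.randn(128, 768)
labels = torch.randint(0, 1000, (128,))

for _ in range(10):  # a few passes over the small labeled set
    logits = probe(encoder(features))
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```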

Given the high availability of unlabeled data on the internet, models such as I-JEPA can prove to be very valuable for applications that previously required large amounts of manually labeled data. The training code and pre-trained models are available on GitHub, though the model is released under a non-commercial license.

I-JEPA is a significant advance in self-supervised learning. Its efficiency and its ability to learn abstract representations without human-labeled data make it a valuable tool for tasks that previously required large amounts of labeled data, and a promising foundation for applications such as robotics and self-driving cars.
