At a press conference in Paris in June, Meta's chief AI scientist Yann LeCun announced a new kind of AI that learns by comparing abstract representations of images rather than comparing the pixels themselves. This model, called I-JEPA, performs strongly on multiple computer vision tasks, and the representations it learns can be reused for many different applications. To build it, the researchers trained a 632-million-parameter vision transformer on 16 A100 GPUs in under 72 hours. With just twelve labeled examples per class, I-JEPA classified images as well as models trained on millions of labeled examples from the ImageNet reference database. The scientific paper was presented at the CVPR 2023 international conference, held in Vancouver at the end of June, and is available online.
Creating machines that learn faster
This artificial intelligence prefigures Yann LeCun's vision of new learning architectures that aim to create machines capable of learning internal models of how the world works. The goal? To learn faster, plan complex tasks, and adapt easily to unfamiliar situations. In short, to get a little closer to human intelligence…
In practice, the model aims to predict the representation of one part of the input (such as an image or a piece of text) from the representations of other parts of the same input. By predicting representations at a high level of abstraction rather than predicting pixel values, the hope is to learn useful representations while avoiding the limitations of generative methods, which underpin the large language models that have caused so much excitement lately (such as ChatGPT).
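To make the principle concrete, here is a minimal PyTorch sketch of prediction in representation space, with invented module sizes and a crude patch split (none of it is Meta's actual code): a context encoder processes the visible part of an image, a target encoder produces the representations to be predicted, and a small predictor is trained to match them, without ever reconstructing pixels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 64
patch_dim = 16 * 16 * 3  # flattened 16x16 RGB patch

# Hypothetical, simplified modules: both encoders map image patches
# to abstract representations.
context_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                nn.Linear(embed_dim, embed_dim))
# In the actual I-JEPA, the target encoder is an exponential moving average
# of the context encoder; here it is simply a second network for brevity.
target_encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                               nn.Linear(embed_dim, embed_dim))
predictor = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))

# A batch of 8 images, each split into 196 flattened patches.
patches = torch.randn(8, 196, patch_dim)
visible = patches[:, :98]   # context: the part the model is allowed to see
hidden = patches[:, 98:]    # the part whose representation must be predicted

with torch.no_grad():
    targets = target_encoder(hidden)   # targets are representations, not pixels

context = context_encoder(visible)
# Predict the representation of the hidden part from the visible part.
pred = predictor(context.mean(dim=1, keepdim=True)).expand_as(targets)

# The loss is computed entirely in representation space: no pixel is rebuilt.
loss = F.mse_loss(pred, targets)
loss.backward()
```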
Effective, but (already) outdated
What is the difference between the two? Generative architectures learn by removing or distorting parts of the input, for example by erasing part of an image or hiding certain words in a passage of text. They then try to predict the damaged or missing pixels or words. While very effective, these methods have a major drawback: the model tries to fill in every missing piece of information, even when the world is inherently unpredictable. Generative methods can therefore be prone to errors that a person would never make, because they focus too much on irrelevant details rather than capturing predictable high-level concepts. The tendency to "hallucinate", that is, to state false things, characteristic of large language models, stems from this limitation. It is also notoriously difficult for generative models to accurately render human hands (in images created this way, characters often have extra or deformed fingers, a useful clue for spotting a synthetic image).
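For contrast, a generative objective of the kind described above can be sketched as follows (again a toy PyTorch example with made-up dimensions): the input is corrupted by erasing patches, and the loss is computed directly on the missing pixels, so the model is pushed to fill in every detail, predictable or not.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy generative setup: an encoder-decoder that must rebuild erased pixels.
patch_dim, embed_dim = 16 * 16 * 3, 64
encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU())
decoder = nn.Sequential(nn.Linear(embed_dim, patch_dim))

patches = torch.randn(8, 196, patch_dim)   # a batch of patchified images
mask = torch.rand(8, 196) < 0.75           # erase 75% of the patches
corrupted = patches.clone()
corrupted[mask] = 0.0                       # the "damaged" input

reconstruction = decoder(encoder(corrupted))
# The objective is defined at the pixel level, on the missing patches only:
# every low-level detail counts, even the genuinely unpredictable ones.
loss = F.mse_loss(reconstruction[mask], patches[mask])
loss.backward()
```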
Eliminating what seems unnecessary
I-JEPA is based on predicting missing information in an abstract representation, which is closer to the general understanding humans have of an image. Compared with generative methods that predict in pixel space, I-JEPA uses abstract prediction targets from which unnecessary pixel-level detail is likely to be eliminated. This is what allows the model to learn more semantic features (as a human would). The newly developed algorithm learns high-level representations of object parts without discarding their localized position in the image.
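One way to picture the point about position information: in the I-JEPA paper, the predictor is told where in the image the missing region lies, via positional embeddings, so the representation it produces is both semantic and tied to a location. A rough, hypothetical sketch of that conditioning (invented shapes and names):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the predictor receives the position of the block to
# predict, so its output stays tied to a specific location in the image.
embed_dim, num_patches = 64, 196
pos_embed = nn.Parameter(torch.randn(num_patches, embed_dim) * 0.02)
predictor = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.GELU(),
                          nn.Linear(embed_dim, embed_dim))

context_repr = torch.randn(8, embed_dim)        # summary of the visible context
target_index = 120                              # location of the missing block
query = pos_embed[target_index].expand(8, -1)   # "where" to predict

# Prediction of an abstract representation for that precise location.
pred = predictor(torch.cat([context_repr, query], dim=-1))
print(pred.shape)  # torch.Size([8, 64])
```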
On a more technical level, “the idea is to build a representation of the model’s output rather than simply decoding the hidden representation learned from the input data,” explains Aurélien Rodriguez, a researcher at Meta’s lab in Paris. “This breakthrough also owes a lot to all the innovations made in our research laboratories around the world in recent years,” adds Yann LeCun.
Beyond images, this type of approach can be extended to paired image-text data and to video, in order to tackle automatic video understanding, whose automation is still very rudimentary. Will these new types of models take precedence over generative models? Either way, Meta is covering both fronts, as the company has just announced the launch of a new language model called Llama 2. Available in open access, unlike GPT-4, it can be found on Microsoft's Azure and Amazon's AWS platforms, as well as via Hugging Face.
Philip Bagot
Opening image: example visualization of the I-JEPA predictor's representations. The green boxes mark the reconstructions, whose quality varies from sample to sample (Credit: Meta).