Apple has quietly released Ferret, a neural network that works with text and images
Back in October, Apple, together with researchers from Cornell University, publicly released Ferret, its own multimodal large language model that can accept image regions as part of a query.
Ferret's release on GitHub in October came without any major announcement from Apple, but the project has since drawn the attention of industry observers. Ferret works by examining a user-specified region of an image, identifying the objects within that region, and outlining them with bounding boxes. The recognized objects are treated as part of the query, and the answer is returned as text.
For example, a user can select the part of an image containing an animal and ask Ferret to identify it. The model will say what species the animal is, and the user can then ask follow-up questions in the same context about other objects or actions in the scene.
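To make the interaction pattern concrete, here is a minimal Python sketch of a region-referring query of the kind described above. It is purely illustrative: the `RegionQuery` class, the `build_prompt` helper, and the `<region>` placeholder token are assumptions for demonstration and are not the actual Ferret API, which lives in Apple's GitHub repository and has its own loading and inference pipeline.

```python
# Illustrative sketch only: the names below (RegionQuery, build_prompt,
# the <region> placeholder) are hypothetical and not part of Ferret's real API.
# The point is the interaction pattern: a question is paired with an image and
# a user-selected region, and the model answers in plain text.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class RegionQuery:
    """A free-form question tied to a rectangular region of an image."""
    image_path: str
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates
    question: str


def build_prompt(query: RegionQuery) -> str:
    """Embed the referred region in the text prompt, the way region-aware
    multimodal models typically mark it with a special placeholder token."""
    x1, y1, x2, y2 = query.box
    return f"{query.question} <region>[{x1}, {y1}, {x2}, {y2}]</region>"


if __name__ == "__main__":
    # First turn: point at a region and ask what is in it.
    first = RegionQuery(
        image_path="backyard.jpg",
        box=(120, 80, 420, 360),
        question="What animal is in this region?",
    )
    print(build_prompt(first))
    # A grounded model would answer in text (e.g. "A red fox") and could also
    # return boxes for other objects it mentions; follow-up questions such as
    # "What is it doing?" would reuse the same conversation context.
```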
The open Ferret model is a system that can "refer and ground anything, anywhere, at any granularity," explained Zhe Gan, a researcher in Apple's AI division. Industry observers stress the significance of releasing the project in this form, as it demonstrates a degree of openness from a traditionally closed company.
According to one theory, Apple took this step because it wants to compete with Microsoft and Google but lacks comparable computing resources. As a result, it could not count on releasing its own ChatGPT competitor and had to choose between partnering with a cloud hyperscaler and releasing the project as open source, as Meta had previously done.