Two years ago, Microsoft revealed Florence, a new kind of AI system intended as a rethinking of modern computer vision architectures. Unlike most vision models of the time, Florence was both "unified" and "multimodal," meaning it (1) understood language as well as images and (2) could handle a wide range of tasks rather than being limited to specific applications, such as generating captions.
As part of Microsoft's push to commercialize its AI research, a modified version of Florence is launching as part of Azure's Vision APIs. The updated Microsoft Vision Services become available today in preview for existing Azure customers, with capabilities including automatic captioning, background removal, video summarization and image retrieval.
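For a sense of how a developer might reach these capabilities, the sketch below assembles a request for Azure's Image Analysis preview REST API. It is a minimal illustration, not the article's source material: the resource name and key are placeholders, and the endpoint path, API version and feature names are assumptions based on the public preview documentation, so verify them against Microsoft's reference before use.

```python
# Hypothetical sketch of preparing a call to Azure's Image Analysis
# preview REST API (endpoint shape, api-version and feature names are
# assumptions; check the official Azure reference before relying on them).
import json
import urllib.parse


def build_analyze_url(resource: str, features: list[str],
                      api_version: str = "2023-02-01-preview") -> str:
    """Build the image-analysis request URL for a given Azure resource."""
    base = (f"https://{resource}.cognitiveservices.azure.com"
            "/computervision/imageanalysis:analyze")
    query = urllib.parse.urlencode({
        "api-version": api_version,
        # Comma-separated list, e.g. caption, tags, denseCaptions
        "features": ",".join(features),
    })
    return f"{base}?{query}"


def build_request(resource: str, key: str, image_url: str,
                  features: list[str]):
    """Return (url, headers, body) for a POST; send with any HTTP client."""
    headers = {
        # Standard Cognitive Services subscription-key auth header
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": image_url})
    return build_analyze_url(resource, features), headers, body


url, headers, body = build_request(
    "my-vision-resource",              # hypothetical resource name
    "YOUR_KEY",                        # placeholder credential
    "https://example.com/photo.jpg",   # image to caption and tag
    ["caption", "tags"],
)
print(url)
```

The request-building step is kept separate from the network call so the URL and headers can be inspected (or unit-tested) without an Azure subscription.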
John Montgomery, CVP of Azure AI, told TechCrunch in an email interview that Florence owes its adaptability to being trained on an enormous volume of images and text. It can be asked to find a specific frame in a video or to tell a Cosmic Crisp apple apart from a Honeycrisp, and it handles both requests.
Much of the AI research community, Microsoft included, has converged on the belief that multimodal models are the most promising path to more capable AI systems. These models can carry out tasks that single-modality models cannot, such as watching a video and extracting the key information from it while also producing a transcript of what is being said.
One could try to achieve the same results by chaining together several single-modality models, such as one that understands only images and another that understands only language. But multimodal models generally outperform that approach, thanks to the contextual information they draw from each source. For example, a system that can interpret images, pricing information and past purchase activity is more likely to offer well-tailored product recommendations than one that understands pricing data alone.
Beyond more accurate predictions, multimodal models can bring significant gains in processing speed and reductions in cost, which Microsoft, as a profit-driven business, stands to benefit from.
Florence itself works across images, video and language, allowing it to identify relationships, such as the connection between an image and a piece of text, and to segment objects in a picture and composite them into a different setting.
I asked Montgomery what data Microsoft used to train Florence, a timely question given the lawsuits that may decide whether AI systems trained on copyrighted data, including images, violate the rights of intellectual property holders. He declined to name specific sources, saying only that Florence uses "legitimately obtained" data inputs "including data from partners." He added that potentially problematic material was removed from Florence's training data, a common issue with publicly available data sets.
With large foundation models, Montgomery emphasized, getting the training data right is essential. To ensure the adapted models perform each vision task accurately, they have been tested for fairness and against difficult cases. The same content moderation policies used for the Azure OpenAI Service and DALL-E will also apply.
We'll just have to take the company's word for it. Some customers apparently are. According to Montgomery, Reddit will use the Florence-powered APIs to generate captions for images on its platform, producing "alternative text" that helps users with vision challenges follow along with conversations.
Montgomery said Florence's ability to produce up to 10,000 tags per image will give Reddit far more control over how many objects it can recognize in an image, leading to better captions. The captions will also help Reddit users improve the search ranking of posts, making them easier to find.
Microsoft itself is incorporating Florence into many of its own platforms, products and services.
Like Reddit, LinkedIn will use Florence for caption editing and alt-text image descriptions. Florence powers video segmentation in Microsoft Teams, drives automatic alt-text generation in PowerPoint, Outlook and Word, and enables image tagging, image search and background generation in Designer and OneDrive.
Montgomery expects Florence to be put to a range of uses down the line, such as detecting defects in manufacturing and enabling self-service checkout in retail stores. Strictly speaking, none of those use cases requires a multimodal model. But Montgomery believes multimodality will add something valuable.
Florence represents a complete rethinking of vision models, Montgomery asserted. With the ability to move seamlessly between images and text, a wealth of possibilities opens up: customers can improve image search, build image and vision models alongside language and speech models to create new applications, and fine-tune their own customized versions.