Is This the Future of Vision? Florence-2 by Microsoft Redefines AI Perception
June 23, 2024

Florence-2 is a vision foundation model developed by Microsoft and available through Hugging Face's Transformers library. It handles a variety of vision and vision-language tasks through a prompt-based approach: a short text prompt tells the model which task to perform on an image, and the model generates the result as text.

What Makes Florence-2 Special?

Florence-2 stands out because a single model handles image captioning, object detection, and segmentation with high accuracy. It is trained on FLD-5B, a dataset of 5.4 billion annotations across 126 million images. The scale and variety of these annotations support multi-task learning and make Florence-2 a strong competitor among vision foundation models.

Key Features and Capabilities

Florence-2 uses a sequence-to-sequence architecture and performs well in both zero-shot and fine-tuned settings. In the zero-shot setting, the model produces accurate results on a task without any task-specific training; fine-tuning on task-specific data improves the results further.
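To make the prompt-based approach concrete, here is a minimal sketch of how a prompt string could be assembled. The task tokens other than "<OD>" do not appear in this article; they are assumed from the Florence-2 model card, and build_prompt is a hypothetical helper, not part of the library.

```python
from typing import Optional

# Task tokens assumed from the Florence-2 model card; only "<OD>" appears in this article.
TASKS_WITHOUT_TEXT = {"<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>"}
TASKS_WITH_TEXT = {"<CAPTION_TO_PHRASE_GROUNDING>"}


def build_prompt(task: str, text_input: Optional[str] = None) -> str:
    """Return the prompt string for a task (hypothetical helper)."""
    if task in TASKS_WITHOUT_TEXT:
        if text_input:
            raise ValueError(f"{task} does not take extra text")
        return task
    if task in TASKS_WITH_TEXT:
        if not text_input:
            raise ValueError(f"{task} needs extra text")
        # The extra text is appended directly after the task token.
        return task + text_input
    raise ValueError(f"unknown task: {task}")


print(build_prompt("<OD>"))
print(build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a green car"))
```

The same model and weights serve every task; only the prompt string changes.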

The model is released in four variants:

  • Florence-2-base: 0.23 billion parameters, pretrained with FLD-5B.
  • Florence-2-large: 0.77 billion parameters, pretrained with FLD-5B.
  • Florence-2-base-ft: 0.23 billion parameters, fine-tuned on a collection of downstream tasks.
  • Florence-2-large-ft: 0.77 billion parameters, fine-tuned on a collection of downstream tasks.
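As a small illustration, the variants can be mapped to Hugging Face Hub identifiers. Only "microsoft/Florence-2-large" appears in this article; the sketch assumes the other three follow the same naming pattern, and checkpoint_id is a hypothetical helper.

```python
# Parameter counts (in billions) as listed above.
PARAMS_BILLIONS = {"base": 0.23, "large": 0.77}


def checkpoint_id(size: str, fine_tuned: bool = False) -> str:
    """Build a Hub model ID, assuming the 'microsoft/Florence-2-<size>[-ft]' pattern."""
    if size not in PARAMS_BILLIONS:
        raise ValueError(f"size must be one of {sorted(PARAMS_BILLIONS)}")
    suffix = "-ft" if fine_tuned else ""
    return f"microsoft/Florence-2-{size}{suffix}"


print(checkpoint_id("large"))
print(checkpoint_id("base", fine_tuned=True))
```

The resulting string is what would be passed to from_pretrained when loading the model and processor.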

Getting Started with Florence-2

To start using Florence-2, follow the example below. It loads the model and processor, runs object detection on an image with the "<OD>" prompt, and converts the generated text into a structured result:

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# trust_remote_code=True lets transformers run the model code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
prompt = "<OD>"  # task prompt for object detection
url = ""  # image URL (left empty in the source; supply your own)
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)
# Special tokens are kept because the post-processing step parses them.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed_answer = processor.post_process_generation(generated_text, task="<OD>", image_size=(image.width, image.height))
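The post-processed result can then be turned into one record per detected object. The sketch below assumes the "<OD>" result has the shape shown on the Florence-2 model card, a dictionary keyed by the task token that holds parallel "bboxes" ([x1, y1, x2, y2] in pixels) and "labels" lists. The sample values are made up for illustration, and detections_from_od is a hypothetical helper.

```python
from typing import Any, Dict, List


def detections_from_od(parsed_answer: Dict[str, Any], task: str = "<OD>") -> List[Dict[str, Any]]:
    """Pair each bounding box with its label (assumed result layout, see lead-in)."""
    result = parsed_answer[task]
    return [
        {"label": label, "box": tuple(box)}
        for box, label in zip(result["bboxes"], result["labels"])
    ]


# Made-up sample in the assumed layout; real values come from post_process_generation.
sample_answer = {
    "<OD>": {
        "bboxes": [[34.0, 160.0, 597.0, 371.0], [454.0, 276.0, 553.0, 370.0]],
        "labels": ["car", "wheel"],
    }
}

for detection in detections_from_od(sample_answer):
    print(detection["label"], detection["box"])
```

A flat list of records like this is easier to filter by label or to pass to a drawing routine than the two parallel lists.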

Performance Benchmarks

Florence-2 performs strongly on standard benchmarks, particularly in zero-shot settings, where it posts notable scores on image captioning and object detection. Here are some of the highlights:

  • COCO Captioning: Florence-2-large achieves a CIDEr score of 135.6.
  • NoCaps Val: Florence-2-large achieves a CIDEr score of 120.8.
  • COCO Detection: Florence-2-large achieves a mean Average Precision (mAP) of 37.5.
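For readers unfamiliar with the detection metric: mAP is built on the intersection over union (IoU) between a predicted box and a ground-truth box, and, assuming the usual COCO convention, precision is averaged over IoU thresholds from 0.50 to 0.95. The sketch below shows only the IoU step for boxes in [x1, y1, x2, y2] form; it is an illustration, not the benchmark's evaluation code.

```python
from typing import Sequence


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero when boxes do not overlap.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


print(iou([0, 0, 10, 10], [0, 0, 10, 10]))  # identical boxes: 1.0
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 50 / 150, about 0.33
```

A prediction counts as correct at a given threshold only if its IoU with a ground-truth box of the same class meets that threshold.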

Fine-Tuned Performance

When fine-tuned, Florence-2 models continue to perform well across multiple tasks, including visual question answering (VQA). The fine-tuned versions (Florence-2-base-ft and Florence-2-large-ft) are trained on a collection of downstream tasks, which makes them suitable for a broad range of applications.


Conclusion

Florence-2 is a powerful vision foundation model for vision and vision-language tasks. It handles diverse tasks with high accuracy and performs well in both zero-shot and fine-tuned settings, which makes it a valuable tool for developers and researchers alike. Whether you're working on image captioning, object detection, or another vision-related task, Florence-2 offers a versatile solution.


FAQs

  • What is Florence-2?
    Florence-2 is a vision foundation model developed by Microsoft and available through Hugging Face's Transformers library. It is designed to handle various vision and vision-language tasks through a prompt-based approach.
  • What makes Florence-2 special?
    Florence-2 stands out due to its ability to perform tasks like image captioning, object detection, and segmentation with high accuracy. It uses the extensive FLD-5B dataset, containing 5.4 billion annotations across 126 million images.
  • What are the key features and capabilities of Florence-2?
    Florence-2 employs a sequence-to-sequence architecture, excelling in both zero-shot and fine-tuned settings. It can generate accurate results without specific task training and improve further with fine-tuning.
  • What models are available for Florence-2?
    There are four models: Florence-2-base (0.23 billion parameters), Florence-2-large (0.77 billion parameters), Florence-2-base-ft (0.23 billion parameters, fine-tuned), and Florence-2-large-ft (0.77 billion parameters, fine-tuned).
  • How can I get started with Florence-2?
    You can start using Florence-2 by loading the model and processor and using a simple prompt for tasks like object detection. The example code in the Getting Started section of this post shows how.
  • How does Florence-2 perform in benchmarks?
    Florence-2 performs exceptionally well in various benchmarks, especially in zero-shot settings. For instance, it achieves a CIDEr score of 135.6 in COCO Captioning and a mAP of 37.5 in COCO Detection.
  • What is the fine-tuned performance of Florence-2?
    When fine-tuned, Florence-2 models show superior performance across multiple tasks, including visual question answering (VQA). The fine-tuned models are suitable for a wide range of applications.



Last updated on June 23, 2024