Florence-2 is a vision foundation model developed by Microsoft and available through Hugging Face's transformers library. It is designed to handle a wide variety of vision and vision-language tasks through a single prompt-based interface: you give the model an image and a short text prompt, and it generates its answer as text.
What Makes Florence-2 Special?
Florence-2 stands out for its ability to handle tasks like image captioning, object detection, and segmentation with a single model. It achieves this by training on the FLD-5B dataset, which contains 5.4 billion annotations across 126 million images. This scale of multi-task supervision is what lets one compact model cover so many tasks.
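Concretely, each task is selected by a special task token at the start of the prompt, and some tasks accept extra text after the token. Here is a minimal sketch based on task tokens listed on the model card (the exact set supported varies by checkpoint):

# A few of the task tokens documented on the Florence-2 model card.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<MORE_DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "dense_region_caption": "<DENSE_REGION_CAPTION>",
    "ocr": "<OCR>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

def build_prompt(task: str, extra_text: str = "") -> str:
    # Tasks like phrase grounding take a caption after the token;
    # purely visual tasks like object detection use the token alone.
    return TASK_PROMPTS[task] + extra_text

print(build_prompt("object_detection"))                  # "<OD>"
print(build_prompt("phrase_grounding", "A green car."))  # grounding prompt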
Key Features and Capabilities
Florence-2 employs a sequence-to-sequence architecture, allowing it to perform well in both zero-shot and fine-tuned settings: the model can produce accurate results without task-specific training (zero-shot) and can improve further when fine-tuned on task-specific data. Four checkpoints are available:
- Florence-2-base: 0.23 billion parameters, pretrained on FLD-5B.
- Florence-2-large: 0.77 billion parameters, pretrained on FLD-5B.
- Florence-2-base-ft: 0.23 billion parameters, fine-tuned on a collection of downstream tasks.
- Florence-2-large-ft: 0.77 billion parameters, fine-tuned on a collection of downstream tasks.
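On the Hugging Face Hub these correspond to the following repository IDs, all of which load the same way (see the next section):

# Hub repository IDs for the four Florence-2 checkpoints.
FLORENCE2_CHECKPOINTS = {
    "base":     "microsoft/Florence-2-base",      # 0.23B, pretrained
    "large":    "microsoft/Florence-2-large",     # 0.77B, pretrained
    "base-ft":  "microsoft/Florence-2-base-ft",   # 0.23B, fine-tuned
    "large-ft": "microsoft/Florence-2-large-ft",  # 0.77B, fine-tuned
}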
Getting Started with Florence-2
To get started with Florence-2, load the model and processor and run a task prompt. The example below uses the "<OD>" prompt for object detection; note that trust_remote_code=True is required because Florence-2's modeling code is distributed with its Hub repository:
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# The modeling code lives in the Hub repo, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)

# "<OD>" is the object-detection task token.
prompt = "<OD>"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# The processor tokenizes the prompt and preprocesses the image together.
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
    num_beams=3,
    do_sample=False,
)

# Keep special tokens: the task-specific parser below needs them.
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]

# Convert the raw text into structured output (boxes and labels for "<OD>").
parsed_answer = processor.post_process_generation(
    generated_text, task="<OD>", image_size=(image.width, image.height)
)
print(parsed_answer)
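For the "<OD>" task, parsed_answer is a dictionary keyed by the task token, containing pixel-coordinate bounding boxes and matching labels. As a minimal sketch, assuming that output format (it matches the model card's documentation), you can visualize the detections with PIL:

from PIL import ImageDraw

# parsed_answer has the form:
# {"<OD>": {"bboxes": [[x1, y1, x2, y2], ...], "labels": ["car", ...]}}
detections = parsed_answer["<OD>"]

draw = ImageDraw.Draw(image)
for (x1, y1, x2, y2), label in zip(detections["bboxes"], detections["labels"]):
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
    draw.text((x1, max(0, y1 - 10)), label, fill="red")
image.save("detections.jpg")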
Performance Benchmarks
Florence-2 performs strongly on standard benchmarks, particularly in zero-shot settings, across both image captioning and object detection. Here are some of the highlights, with a short captioning example after the list:
- COCO Captioning: Florence-2-large achieves a CIDEr score of 135.6.
- NoCaps Val: Florence-2-large achieves a CIDEr score of 120.8.
- COCO Detection: Florence-2-large achieves a mean Average Precision (mAP) of 37.5.
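The captioning results above use the same prompt interface as detection: switching tasks only means swapping the task token. A minimal sketch, reusing the model, processor, and image from the getting-started example:

# Swap the task token to run captioning with the same loaded model.
prompt = "<CAPTION>"
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=256,
    num_beams=3,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
caption = processor.post_process_generation(
    generated_text, task="<CAPTION>", image_size=(image.width, image.height)
)
print(caption)  # {"<CAPTION>": "..."}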
Fine-Tuned Performance
When fine-tuned, Florence-2 models continue to perform strongly across multiple tasks, including visual question answering (VQA). The fine-tuned checkpoints (Florence-2-base-ft and Florence-2-large-ft) are therefore good starting points for a broad range of applications.
Conclusion
Florence-2 is a powerful vision foundation model that pushes the boundaries of what's possible in vision and vision-language tasks. Its ability to handle diverse tasks with high accuracy, coupled with its robust performance in both zero-shot and fine-tuned settings, makes it a valuable tool for developers and researchers alike. Whether you're working on image captioning, object detection, or any other vision-related task, Florence-2 offers a versatile and reliable solution.
FAQs
- What is Florence-2?
  Florence-2 is a vision foundation model developed by Microsoft, available through Hugging Face's transformers library. It is designed to handle various vision and vision-language tasks through a prompt-based approach.
- What makes Florence-2 special?
  Florence-2 stands out for its ability to perform tasks like image captioning, object detection, and segmentation with high accuracy. It is trained on the extensive FLD-5B dataset, containing 5.4 billion annotations across 126 million images.
- What are the key features and capabilities of Florence-2?
  Florence-2 employs a sequence-to-sequence architecture, excelling in both zero-shot and fine-tuned settings. It can generate accurate results without task-specific training and improve further with fine-tuning.
- What models are available for Florence-2?
  There are four checkpoints: Florence-2-base (0.23 billion parameters), Florence-2-large (0.77 billion parameters), Florence-2-base-ft (0.23 billion parameters, fine-tuned), and Florence-2-large-ft (0.77 billion parameters, fine-tuned).
- How can I get started with Florence-2?
  Load the model and processor from the Hugging Face Hub and run a task prompt such as "<OD>" for object detection. Example code is provided above.
- How does Florence-2 perform in benchmarks?
  Florence-2 performs well across benchmarks, especially in zero-shot settings. For instance, it achieves a CIDEr score of 135.6 on COCO Captioning and an mAP of 37.5 on COCO Detection.
- What is the fine-tuned performance of Florence-2?
  When fine-tuned, Florence-2 models show strong performance across multiple tasks, including visual question answering (VQA). The fine-tuned models are suitable for a wide range of applications.
References
- Florence-2 on HuggingFace
- Florence-2 technical report
- Jupyter Notebook for inference and visualization of Florence-2-large
- Xiao, B., Wu, H., Xu, W., Dai, X., Hu, H., Lu, Y., Zeng, M., Liu, C., & Yuan, L. (2023). Florence-2: Advancing a unified representation for a variety of vision tasks. arXiv preprint arXiv:2311.06242.