Key Points
- Multimodal Upgrade: Microsoft enhances its Phi Silica on-device small language model (SLM) with vision-based multimodal capabilities, enabling image understanding on Windows.
- Efficient Integration: The new feature reuses existing on-device components and adds only a small 80-million-parameter projector model, keeping resource usage low.
- Accessibility Boost: The update improves screen reader descriptions for blind or low-vision users, generating more detailed and accurate Alt Text on Copilot+ PCs.
Microsoft Enhances Windows with Vision-Based Multimodal Capabilities for Phi Silica
Microsoft announced a significant update to its Phi Silica on-device small language model (SLM), incorporating vision-based multimodal capabilities. This integration unlocks new possibilities for local SLMs on Windows, particularly for accessibility and productivity experiences. The key innovation is the addition of image understanding, now available on Copilot+ PCs with Snapdragon, Intel, and upcoming AMD processors, utilizing the dedicated NPU (Neural Processing Unit).
Efficient Architecture
To minimize resource usage, Microsoft’s approach reuses the existing Phi Silica model and the Florence image encoder, already deployed in Windows Recall and improved search features. A small multimodal projector module (80 million parameters) was trained to translate vision embeddings into a format compatible with Phi Silica. This design ensures compatibility with the frozen, quantized vision encoder and maintains the model’s efficiency, avoiding the need for separate, resource-intensive vision language models.
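To make the idea concrete, here is a minimal PyTorch sketch of what such a projector could look like, assuming a simple two-layer MLP that maps frozen vision-encoder embeddings into the language model's embedding space; the dimensions, layer count, and names are illustrative assumptions rather than Microsoft's published architecture.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative projector: maps frozen vision-encoder embeddings into the
    language model's embedding space so image tokens can be prepended to text."""
    def __init__(self, vision_dim: int = 1024, lm_dim: int = 2048, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, vision_embeddings: torch.Tensor) -> torch.Tensor:
        # vision_embeddings: (batch, num_image_tokens, vision_dim)
        return self.net(vision_embeddings)

# Training-loop idea: freeze the vision encoder and the language model,
# and update only the projector so its outputs align with the LM's embedding space.
projector = MultimodalProjector()
image_tokens = torch.randn(1, 64, 1024)    # stand-in for vision-encoder output
lm_ready_tokens = projector(image_tokens)  # (1, 64, 2048), ready for the SLM
```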
Accessibility Enhancements
The multimodal functionality significantly improves the screen reading experience for people who are blind or have low vision. By generating variable-length descriptions, from brief Alt Text to detailed explanations, Phi Silica offers more accurate and useful image descriptions than current cloud-based methods. Examples demonstrate the enhanced detail, providing richer context for users who rely on screen readers such as Microsoft Narrator.
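As a rough illustration of how variable-length output might be requested, the sketch below simply varies the prompt; the function name and prompt wording are hypothetical placeholders, not part of any Windows API.

```python
# Hypothetical sketch: request descriptions at two detail levels by varying the prompt.
def build_description_prompt(detail: str = "short") -> str:
    prompts = {
        "short": "Write one concise sentence of Alt Text for this image.",
        "detailed": ("Write a detailed description of this image, covering objects, "
                     "visible text, layout, and anything useful to a screen reader user."),
    }
    return prompts[detail]

print(build_description_prompt("short"))
print(build_description_prompt("detailed"))
```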
Technical Implementation
- Vision Encoder: Reuses the Florence model to extract image tokens.
- Projector Model: Trained to align vision embeddings with Phi Silica’s embedding space, enabling seamless integration.
- Quantization: Enables efficient inference on the NPU, using 4-bit weight precision via QuaRot, with calibration tuned to preserve the visual embedding representation.
- OCR Integration: Supports precise text extraction from images (e.g., charts and graphs) by fusing visual and textual information; a minimal sketch of how these components fit together follows this list.
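The bullets above can be read as stages of a single pipeline. The sketch below wires them together with stub functions; every name, shape, and OCR string is an invented stand-in, and the real on-device components (the Florence encoder, the projector, the quantized Phi Silica model) appear only as comments.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ImageDescription:
    alt_text: str
    source_tokens: int

def extract_image_tokens(image_path: str) -> List[List[float]]:
    """Stand-in for the frozen Florence encoder producing image tokens on the NPU."""
    return [[0.0] * 1024 for _ in range(64)]

def project_to_lm_space(image_tokens: List[List[float]]) -> List[List[float]]:
    """Stand-in for the 80M-parameter projector aligning vision and text embeddings."""
    return image_tokens  # shapes only; the real module is a learned mapping

def run_ocr(image_path: str) -> str:
    """Stand-in for OCR extracting text from charts, graphs, or screenshots."""
    return "Quarterly revenue, Q1-Q4"

def generate_description(image_path: str) -> ImageDescription:
    image_tokens = extract_image_tokens(image_path)
    lm_tokens = project_to_lm_space(image_tokens)
    ocr_text = run_ocr(image_path)
    # A 4-bit quantized Phi Silica would consume the projected tokens plus the OCR
    # text here; this sketch only shows how the stages hand data to each other.
    alt_text = f"[description fusing {len(lm_tokens)} image tokens with OCR: {ocr_text!r}]"
    return ImageDescription(alt_text=alt_text, source_tokens=len(lm_tokens))

print(generate_description("chart.png").alt_text)
```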
Performance and Evaluation
The system generates descriptions in approximately 4 seconds for short outputs (around 135 characters) and 7 seconds for detailed ones (400-450 characters), and is currently optimized for English. Evaluation via the LLM-as-a-judge technique, using GPT-4o, showed improved accuracy and completeness across various image categories compared to the existing Florence-driven descriptions.
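For readers unfamiliar with the LLM-as-a-judge approach, the snippet below shows the general pattern using the OpenAI Python client; the rubric, prompt wording, and 1-5 scale are illustrative assumptions and not Microsoft's actual evaluation setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_description(reference_notes: str, candidate_description: str) -> str:
    """Ask GPT-4o to rate a generated image description for accuracy and completeness."""
    prompt = (
        "You are evaluating an image description intended for a screen reader user.\n"
        f"Ground-truth notes about the image: {reference_notes}\n"
        f"Candidate description: {candidate_description}\n"
        "Rate accuracy and completeness from 1 to 5 and briefly justify each score."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(judge_description(
    reference_notes="Bar chart of quarterly revenue; Q4 is the highest bar.",
    candidate_description="A bar chart showing revenue rising each quarter, peaking in Q4.",
))
```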
Implications and Future Steps
This update underscores Microsoft’s commitment to accessibility and efficient AI integration on Windows. By enhancing Phi Silica with vision capabilities, the company paves the way for more sophisticated, resource-conscious multimodal experiences. Future updates will expand language support, further enriching the feature set for global users. As Microsoft continues to innovate in the client-AI space, users can expect more seamless, accessible, and productive interactions with their Windows devices.