UniLM And The Big Convergence: Microsoft's Foundation Models Across Language, Vision, Speech, And Multimodal
Some GitHub repositories feel like a lab in motion. microsoft/unilm is one of them - a sprawling portfolio of research code, models, and toolkits documenting Microsoft Research's multi-year journey toward general-purpose foundation models across tasks, languages, and modalities.
It is not a single library but a living map: language (UniLM, MiniLM, InfoXLM), vision (BEiT), speech (WavLM, VALL-E), and multimodal systems (LayoutLM, Kosmos) sit side by side with new architectures (RetNet, LongNet) and utilities to fine-tune and decode at scale, all linked from the root README.md.
The repository's own tagline - "Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities" - states the ambition plainly. The central need it addresses is fragmentation. Real-world AI work spans text, images, audio, and layouts; it requires training stability, efficiency, and transfer across domains. UniLM gathers these strands under one roof, making research artifacts reproducible and comparable while pointing to production-ready counterparts like TorchScale for core architectures.
Key Features
- Multimodal Document AI with LayoutLM: pretraining over text, layout, and images; see layoutlm/README.md and successors layoutlmv2, layoutlmv3, and multilingual layoutxlm (Xu et al., 2020); a quick-start sketch follows this list.
- Vision pretraining via BEiT: BERT-style masked image modeling for Transformers; see beit/README.md plus beit2 and beit3 (Bao et al., 2022).
- Architectures at scale: RetNet and LongNet, with pointers to retnet/README.md (Sun et al., 2023) and longnet/README.md.
- Text embeddings with E5: strong retrieval baselines and scripts; see e5/README.md (Wang et al., 2022/2024).
- Speech stack: WavLM pretraining and VALL-E neural codec TTS under wavlm and valle.
- Multimodal LLMs: the Kosmos series for perception, grounding, and reading text-intensive images under kosmos-1, kosmos-2, and kosmos-2.5.
- Utilities and toolkits: s2s-ft for fine-tuning seq2seq models and decoding for lossless speedups (Aggressive Decoding).
- Clear governance: CONTRIBUTING.md, CODE_OF_CONDUCT.md, and an MIT LICENSE.
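As a quick start for the LayoutLM entry above, here is a minimal, hedged sketch that runs LayoutLMv3 from its Hugging Face mirror. It assumes the transformers and Pillow packages are installed and the microsoft/layoutlmv3-base checkpoint is reachable; the invoice words, boxes, and blank image are made-up placeholders that only illustrate the tri-modal input format, not the repo's official fine-tuning recipe.
# Hedged sketch: a LayoutLMv3 forward pass with placeholder inputs
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
words = ["Invoice", "Total:", "42.00"]  # stand-ins for real OCR words
boxes = [[50, 50, 200, 80], [50, 100, 150, 130], [160, 100, 260, 130]]  # boxes normalized to 0-1000
image = Image.new("RGB", (224, 224), "white")  # blank page placeholder
encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoding).last_hidden_state  # (1, seq_len, hidden_size) over text + patch tokens
print(hidden.shape)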
The Problem And The Solution
Modern AI teams face a paradox: models are bigger and more capable, yet projects still splinter across stacks and modalities. Document AI wants layout and pixels. OCR needs language and vision. Speech pipelines need self-supervised pretraining that generalizes.
The solution proposed here is a coherent set of pretraining strategies and architectures that work across tasks (predictive and generative), languages (100+), and modalities (text, image, audio, text+image layout).
The repo's index shows the arc clearly - from unified language modeling (UniLM) to multimodal readers (LayoutLM), vision Transformers pretraining (BEiT), efficient embeddings (E5), and architectural leaps like RetNet and LongNet that target stability, efficiency, and length extrapolation.
Why I Like It
UniLM reads like a research diary with great indexing. Each subproject has a focused README.md with links to papers, checkpoints, and simple commands; many models are mirrored on Hugging Face for quick trials.
The root page tracks milestones over years, so you can see ideas such as document foundation models (LayoutLM), self-supervised vision pretraining (BEiT), and retention mechanisms (RetNet) evolve and cross-pollinate.
It's also pragmatic: fine-tuning code lives in s2s-ft, decoding tricks in decoding, and the architecture work is broken out to TorchScale for reuse.
# Minimal RetNet example using TorchScale (pip install torchscale; per retnet/README.md)
import torch
import torch.nn as nn
from torchscale.architecture.config import RetNetConfig
from torchscale.architecture.retnet import RetNetDecoder
# Vocabulary size for the demo; a real model would match its tokenizer
config = RetNetConfig(vocab_size=32000)
# The decoder needs an embedding module to map token ids to vectors
embed_tokens = nn.Embedding(config.vocab_size, config.decoder_embed_dim)
model = RetNetDecoder(config, embed_tokens=embed_tokens)
# Toy batch of token ids (batch=2, seq=8)
input_ids = torch.randint(0, config.vocab_size, (2, 8))
logits, _ = model(input_ids)  # logits: (batch, seq, vocab_size)
print(logits.shape)
Under The Hood
The repository is a constellation rather than a monolith. Many research directories are self-contained with scripts and configs, while foundational architecture code is consolidated in TorchScale (RetNet, DeepNet, X-MoE, LongNet).
The root README summarizes themes: stability (DeepNet), generality (Foundation Transformers), capability (length extrapolation), and efficiency (X-MoE).
Subprojects link to arXiv preprints, checkpoints, and sometimes demos. For example, LayoutLMv3 shows how word-patch alignment and unified text-image masking lift Document AI benchmarks, while BEiT adapts BERT-style masked modeling to images by predicting discrete visual tokens.
Architectural work is especially notable. RetNet replaces self-attention with a retention mechanism that supports parallel, recurrent, and chunkwise forms for efficient long-sequence modeling (Sun et al., 2023). LongNet explores dilated attention to scale token windows to extreme lengths. These are shipped for practical use via TorchScale's pip package, keeping research and engineering close enough to adopt.
On the multimodal front, LayoutLM stands out as an early document foundation model that fuses text, layout, and pixels (Xu et al., 2020), with multilingual extensions in LayoutXLM. In vision, BEiT pretraining delivers strong transfer to classification and segmentation (Bao et al., 2022). For retrieval tasks, E5 supplies text embeddings and evaluation scripts on BEIR and MTEB (Wang et al., 2022/2024).
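To make the vision claim tangible, here is a hedged sketch of image classification with a fine-tuned BEiT checkpoint from its Hugging Face mirror; it assumes transformers and Pillow are installed, uses the microsoft/beit-base-patch16-224 model id, and substitutes a blank image for real input.
# Hedged sketch: BEiT image classification via the Hugging Face mirror (placeholder image)
import torch
from PIL import Image
from transformers import BeitImageProcessor, BeitForImageClassification
processor = BeitImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")
image = Image.new("RGB", (224, 224), "white")  # swap in a real photo to get a meaningful label
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_classes)
print(model.config.id2label[int(logits.argmax(-1))])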
Use Cases
Document AI in enterprises: invoices, forms, and scanned PDFs benefit from LayoutLM's tri-modal pretraining and tailored fine-tuning flows; multilingual workflows map to LayoutXLM.
Vision tasks from classification to semantic segmentation can leverage BEiT pretraining and fine-tuning recipes.
Retrieval and RAG stacks can plug in E5 embeddings as a solid baseline for semantic search, while RetNet and LongNet hint at next-gen backbones for efficient, very-long-context LLMs.
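A minimal, hedged sketch of that embedding step follows; it assumes transformers is installed and the intfloat/e5-base-v2 mirror is reachable, and it uses E5's convention of prefixing inputs with "query:" and "passage:". The mean pooling and cosine similarity shown are the usual recipe from the model card, not a repo-specific API.
# Hedged sketch: E5-style text embeddings for semantic search
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")
texts = [
    "query: how do I reset my router",
    "passage: Unplug the router, wait ten seconds, then plug it back in.",
    "passage: The invoice total is due within 30 days of receipt.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (3, seq_len, hidden_size)
mask = batch["attention_mask"].unsqueeze(-1)  # zero out padding positions
emb = F.normalize((hidden * mask).sum(1) / mask.sum(1), dim=-1)  # mean-pooled, unit-norm vectors
print(emb[:1] @ emb[1:].T)  # query-vs-passage similarities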
On the speech side, WavLM and VALL-E showcase unified pretraining and synthesis for ASR and TTS. The Kosmos line points to grounded multimodal interfaces that read charts or text-heavy images with instruction following.
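For speech, a similarly hedged sketch extracts WavLM features through the Hugging Face mirror: it assumes transformers is installed, uses the microsoft/wavlm-base-plus checkpoint, and feeds random noise in place of a real 16 kHz waveform.
# Hedged sketch: self-supervised speech features from WavLM (random noise as stand-in audio)
import torch
from transformers import WavLMModel
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
waveform = torch.randn(1, 16000)  # one second of fake 16 kHz audio, shape (batch, samples)
with torch.no_grad():
    features = model(waveform).last_hidden_state  # (batch, num_frames, hidden_size)
print(features.shape)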
Community & Contribution
The project follows Microsoft's open source governance. Before contributing, review CONTRIBUTING.md and sign the CLA when prompted by the bot. Behavior is guided by the Code of Conduct. Issues and discussions happen per subproject; many READMEs include email contacts for research correspondence. The cadence in the root README's News section shows steady releases and paper acceptances over multiple years, which is a good signal of project vitality.
Usage & License Terms
Licensing is permissive MIT; see LICENSE. In short, you may use, copy, modify, merge, publish, distribute, sublicense, and sell copies of the software, provided the copyright notice and permission notice are included in substantial portions. The software is provided "AS IS" without warranty of any kind. Individual subprojects may include additional notices (see NOTICE.md), and many reference or build upon transformers.
Impact & Future Potential
UniLM's significance is twofold. First, it helped popularize the idea of unified pretraining across tasks and modalities, long before today's multimodal LLMs became mainstream. Second, it treats architecture as an open question, shipping concrete alternatives to attention (RetNet) and recipes for scaling depth and sequence length (DeepNet, LongNet) that others can try in practice.
As the industry standardizes on large multimodal systems, repositories like this one - with clear READMEs, recipes, and checkpoints - offer a faster path from paper to product. Expect continued cross-pollination with TorchScale and downstream libraries, and more examples that demonstrate transfer across domains at lower cost.
About Microsoft Research
UniLM is maintained by researchers affiliated with Microsoft Research and collaborators. The work aligns with Microsoft's broader push toward general AI; see the portal at aka.ms/GeneralAI and the Document AI project page referenced in LayoutLM. Open source participation follows the Microsoft Open Source Code of Conduct and standard CLA process.
Conclusion
If you want a guided tour through the evolution of foundation models, you could do worse than scrolling the UniLM directory tree. Start at the root README, pick a subproject like LayoutLM, BEiT, RetNet, or E5, and run a small experiment. Keep TorchScale handy for production-grade architecture code, and revisit the News section to watch where the research is heading.
References: Xu et al., 2020 (LayoutLM); Bao et al., 2022 (BEiT); Sun et al., 2023 (RetNet); Ding et al., 2023 (LongNet); Wang et al., 2022/2024 (E5).