IT Brief India - Technology news for CIOs & IT decision-makers
Linux Foundation launches DocLang group for AI documents

Wed, 17th Jun 2026
Sofiah Nichole Salivio, News Editor

The LF AI & Data Foundation has launched the DocLang Specification Working Group, bringing together IBM, NVIDIA, Red Hat, ABBYY and HumanSignal.

The group will develop DocLang, an open document format for artificial intelligence systems. Operating under the Joint Development Foundation's vendor-neutral governance model, it aims to create a standard for how document data is prepared, exchanged and managed for AI use.

Supporters say the effort addresses a longstanding problem for businesses that rely on large collections of PDFs, Word documents, images and HTML pages. Those formats were designed largely for people to read, not for software models to interpret consistently.

That mismatch has become more visible as companies adopt generative AI tools and agent-based software to search, summarise and extract information from internal documents. Key business information often sits in files whose layout, structure and meaning are difficult to preserve when converted for machine use.

DocLang is intended to provide a common representation that preserves both document structure and page layout in a single format. The specification is also expected to include controls that let downstream systems enforce policies on privacy, extraction scope and permissions for model training.
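As a rough illustration of what such policy controls could look like to a downstream system, the following Python sketch attaches a small policy record to a document and checks it before extraction or training. Every name here (DocumentPolicy, may_extract, may_train_on) and the rule that personal data blocks training are assumptions made for illustration; none of them comes from the DocLang specification, which has not yet been published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentPolicy:
    """Hypothetical policy record; field names are illustrative only."""
    contains_personal_data: bool = False
    extractable_sections: frozenset = frozenset()
    allow_model_training: bool = False


def may_extract(policy: DocumentPolicy, section: str) -> bool:
    # Extraction scope: only sections explicitly listed may be extracted.
    return section in policy.extractable_sections


def may_train_on(policy: DocumentPolicy) -> bool:
    # Assumed rule for this sketch: training needs explicit permission
    # and is refused whenever the document is flagged as personal data.
    return policy.allow_model_training and not policy.contains_personal_data


policy = DocumentPolicy(
    contains_personal_data=True,
    extractable_sections=frozenset({"summary", "pricing"}),
    allow_model_training=True,
)

print(may_extract(policy, "summary"))   # section is in scope
print(may_extract(policy, "appendix"))  # section is out of scope
print(may_train_on(policy))             # blocked by the personal-data flag
```

The point of the sketch is only that the policy travels with the document, so each consuming system can apply the same checks rather than re-deriving them.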

Open standard

Mark Collier, General Manager of AI & Infrastructure at the Linux Foundation and Executive Director of LF AI & Data, outlined the case for the new group.

"Documents remain one of the most important sources of enterprise knowledge, but most were never designed for AI-driven workflows," Collier said. "With the launch of the DocLang Working Group, we are bringing the open source community together to develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale. Combined with projects like Docling, this effort can help create a more open foundation for document understanding across the AI ecosystem."

The working group builds on Docling, an open source document-processing toolkit hosted by LF AI & Data. IBM Research Zurich's AI for Knowledge team originally developed the software before releasing it as open source in 2024.

Docling ingests files including PDF, DOCX, PPTX, XLSX, HTML and images, then converts them into structured outputs. Its internal document model records text, tables, figures, reading order and layout. DocLang is intended to define a standard way to express and exchange that structured output between systems.
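To make the idea of a structured document model concrete, here is a minimal Python sketch of a record that keeps text, element type, page position and reading order together. It is not Docling's actual data model or the DocLang format; the class names, fields and bounding-box convention are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Element:
    """Hypothetical document element; not a Docling or DocLang type."""
    kind: str                                # e.g. "heading", "paragraph", "table"
    text: str
    page: int
    bbox: Tuple[float, float, float, float]  # assumed (left, top, right, bottom)
    reading_order: int


@dataclass
class Document:
    elements: List[Element] = field(default_factory=list)

    def in_reading_order(self) -> List[Element]:
        # Return elements sorted by reading order, not by storage order.
        return sorted(self.elements, key=lambda e: e.reading_order)

    def on_page(self, page: int) -> List[Element]:
        # Return the elements located on a given page, in reading order.
        return [e for e in self.in_reading_order() if e.page == page]


# Elements are deliberately stored out of order to show the sort.
doc = Document([
    Element("table", "Q1 | Q2", 1, (50.0, 300.0, 500.0, 420.0), 2),
    Element("heading", "Revenue", 1, (50.0, 40.0, 300.0, 70.0), 0),
    Element("paragraph", "Revenue grew.", 1, (50.0, 80.0, 500.0, 280.0), 1),
])

for element in doc.in_reading_order():
    print(element.kind, element.bbox)
```

Keeping the bounding box and the reading order on the same record is what lets a consumer recover both what an element says and where it sat on the page.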

Together, the two projects form a broader open source stack for document AI, covering ingestion, parsing, standardised representation, and use by language models and agentic AI systems.

Industry backing

The launch has support from some of the technology groups most active in enterprise AI infrastructure. IBM, NVIDIA and Red Hat are founding members of the working group, while ABBYY and HumanSignal are contributors.

NVIDIA views the specification as a way to expand use of a common document format across sectors.

"NVIDIA looks forward to working with the Linux Foundation and the broader DocLang ecosystem to accelerate the adoption of this AI-native document format across industries," said Kari Briski, Vice President of Generative AI at NVIDIA.

ABBYY, which specialises in document processing and optical character recognition software, argued that the issue goes beyond simple file conversion. It said the structure of most business documents creates uncertainty for AI systems, increasing computing demands and leading to inconsistent results.

"DocLang is designed to solve one of the foundational problems in enterprise AI: documents were built for humans, not machines," said Maxime Vermeir, Vice President of AI Strategy at ABBYY. "By introducing a minimal, standardized, and AI-native representation of document structure, layout, meaning and governance, DocLang creates a far more deterministic foundation for modern AI systems. The result is an AI-native context layer at scale."

The specification is intended to do three things: preserve semantic meaning and page geometry in a single format; represent elements such as headings, paragraphs and tables alongside their positions on a page; and align document representation more closely with modern tokenisation and modelling methods. That reflects a wider push across the AI industry to make unstructured enterprise data more usable by language models without losing context or control.

For the Linux Foundation's AI and data arm, the new group also extends its role beyond hosting software projects into standards work around how AI systems handle document-based information, an area that remains fragmented despite the central role documents play in business operations.