5. Integrating Document Extraction with Open WebUI

Overview: what document extraction is

Document Extraction in Open WebUI

Open WebUI provides powerful document extraction capabilities that allow you to process and analyze various types of documents within your RAG (Retrieval Augmented Generation) workflows. Document extraction is essential for transforming unstructured document content into structured data that can be effectively used by language models.

What is Document Extraction?

Document extraction refers to the process of automatically identifying and extracting text and data from various file formats, including:

PDFs (both text-based and scanned)

Images containing text

Handwritten documents

And more

With proper document extraction, Open WebUI can help you:

Convert image-based documents to searchable text

Preserve document structure and layout information

Extract data in structured formats for further processing

Support multilingual content recognition

Available Extraction Methods

Open WebUI supports multiple document extraction engines to accommodate different needs and document types. Each extraction method has its own strengths and is suitable for different scenarios.

Explore the documentation for each available extraction method to learn how to set it up and use it effectively with your Open WebUI instance.

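💡 This guide uses Microsoft's Azure AI Document Intelligence for document extraction, converting unstructured content (such as PDF files) into structured data that a language model can work with.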

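Open WebUI's default text extraction is limited. Before vectorization and retrieval, Open WebUI typically extracts text with tools such as pdfminer or PyPDF2. An image-only (scanned) PDF has no embedded text layer, so these libraries return nothing but empty text. For example: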

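Summarizing a text-based PDF: OK

Summarizing an image-based PDF: FAIL. The upload appears to succeed, but no text is extracted internally, so retrieval comes back empty and the model seems unable to read the file.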
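Because the upload itself succeeds, this failure is easy to miss. One way to catch it is to check whether the extracted text is suspiciously sparse. The sketch below assumes you already have the per-page text from whichever extractor is in use. The function name `needs_ocr` and the 20-character threshold are hypothetical, chosen for illustration and not part of Open WebUI.

```python
def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when the extracted text is too sparse to be a real text layer.

    page_texts: one string per PDF page, as returned by a text extractor.
    min_chars_per_page: hypothetical threshold; tune it for your documents.
    """
    if not page_texts:
        return True
    # Count only non-whitespace characters so blank pages do not look like text.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


text_pdf = [
    "Quarterly report 2024. Revenue grew in all regions.",
    "Appendix A lists the sources.",
]
scanned_pdf = ["", "  \n", ""]

print(needs_ocr(text_pdf))     # False
print(needs_ocr(scanned_pdf))  # True
```

A file flagged this way is a candidate for an OCR-capable engine such as Document Intelligence.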

The Open WebUI backend already bundles the Document Intelligence dependency. To enable it, add the Azure DI service key and endpoint to the environment variables, then POST an update that sets CONTENT_EXTRACTION_ENGINE. See: Feat: Adding Support for Azure AI Document Intelligence for Content Extraction · open-webui/open-webui · Discussion #9583

Steps to enable (recommended order)

Add the following to .env:


DOCUMENT_INTELLIGENCE_ENDPOINT=...
DOCUMENT_INTELLIGENCE_KEY=...
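Before restarting, it can help to confirm that both variables are actually set. The following is a minimal sketch that parses simple KEY=VALUE lines with the standard library only. The helper names `parse_env` and `missing_di_settings` are hypothetical, and the parser ignores quoting and escaping rules that a real .env loader may support.

```python
REQUIRED_KEYS = ("DOCUMENT_INTELLIGENCE_ENDPOINT", "DOCUMENT_INTELLIGENCE_KEY")


def parse_env(text):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_di_settings(env):
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


sample = """
# Azure AI Document Intelligence
DOCUMENT_INTELLIGENCE_ENDPOINT=https://xxx.cognitiveservices.azure.com/
DOCUMENT_INTELLIGENCE_KEY=
"""
print(missing_di_settings(parse_env(sample)))  # ['DOCUMENT_INTELLIGENCE_KEY']
```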

Start or restart the container.

Send the update request, either through the API or from the RAG settings page in the front end:


POST /api/v1/retrieval/config/update
{
  "CONTENT_EXTRACTION_ENGINE": "document_intelligence",
  "DOCUMENT_INTELLIGENCE_ENDPOINT": "https://xxx.cognitiveservices.azure.com/",
  "DOCUMENT_INTELLIGENCE_KEY": "<masked>"
}
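The same request can be assembled from a script. The sketch below builds the POST request with the standard library but does not send it. The base URL and token are placeholders, and the Bearer authorization header is assumed to match the admin token scheme your instance uses.

```python
import json
import urllib.request


def build_update_request(base_url, token, endpoint, key):
    """Build (but do not send) the POST request that switches the extraction engine."""
    payload = {
        "CONTENT_EXTRACTION_ENGINE": "document_intelligence",
        "DOCUMENT_INTELLIGENCE_ENDPOINT": endpoint,
        "DOCUMENT_INTELLIGENCE_KEY": key,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/retrieval/config/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_update_request(
    "http://localhost:3000",                     # assumed local instance
    "<admin_token>",                             # placeholder admin token
    "https://xxx.cognitiveservices.azure.com/",  # placeholder endpoint
    "<KEY>",                                     # placeholder key
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Keeping construction separate from sending makes it easy to inspect the payload before it touches a live instance.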

Verify:


GET /api/v1/retrieval/config
-> "CONTENT_EXTRACTION_ENGINE": "document_intelligence"
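This check can be automated by parsing the response body. The sketch below assumes the engine name appears as a top-level CONTENT_EXTRACTION_ENGINE field, as shown above. The sample body is a trimmed, hypothetical response, and `extraction_engine_is` is an illustrative helper name.

```python
import json


def extraction_engine_is(config_json, expected="document_intelligence"):
    """Return True when the config response reports the expected engine."""
    config = json.loads(config_json)
    return config.get("CONTENT_EXTRACTION_ENGINE") == expected


# Trimmed, hypothetical response body; a real response has many more fields.
sample_response = '{"status": true, "CONTENT_EXTRACTION_ENGINE": "document_intelligence"}'
print(extraction_engine_is(sample_response))  # True
```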

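Upload an image-only PDF, then run a retrieval query to confirm that text was extracted.

The steps above can also be applied in one go with a PowerShell script: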


# Admin API token and base URL of the retrieval API (adjust to your deployment)
$TOKEN   = "<admin_token>"
$BASE    = "http://localhost:3000/api/v1/retrieval"
$headers = @{ Authorization = "Bearer $TOKEN" }

# Settings to apply: extraction engine, Azure DI credentials, chunking parameters
$payload = @{
  CONTENT_EXTRACTION_ENGINE       = "document_intelligence"
  DOCUMENT_INTELLIGENCE_ENDPOINT  = "https://xxx.cognitiveservices.azure.com/"
  DOCUMENT_INTELLIGENCE_KEY       = "<KEY>"
  CHUNK_SIZE                      = 1000
  CHUNK_OVERLAP                   = 100
}

# Apply the configuration (-ContentType also works on Windows PowerShell 5.1)
Invoke-RestMethod -Method POST -Uri "$BASE/config/update" -Headers $headers `
  -ContentType "application/json" -Body ($payload | ConvertTo-Json -Depth 4)

# Read the configuration back; split the JSON into lines so that
# Select-String shows only the matching settings, not the whole document
(Invoke-RestMethod -Method GET -Uri "$BASE/config" -Headers $headers |
  ConvertTo-Json -Depth 4) -split '\r?\n' |
  Select-String -Pattern "CONTENT_EXTRACTION_ENGINE","DOCUMENT_INTELLIGENCE_ENDPOINT"

Verify: