
5. Integrating Open WebUI with Document Extraction

Overview: What Document Extraction Is
Document Extraction in Open WebUI
Open WebUI provides powerful document extraction capabilities that allow you to process and analyze various types of documents within your RAG (Retrieval Augmented Generation) workflows. Document extraction is essential for transforming unstructured document content into structured data that can be effectively used by language models.
What is Document Extraction?
Document extraction refers to the process of automatically identifying and extracting text and data from various file formats, including:
  • PDFs (both text-based and scanned)
  • Images containing text
  • Handwritten documents
  • And more
With proper document extraction, Open WebUI can help you:
  • Convert image-based documents to searchable text
  • Preserve document structure and layout information
  • Extract data in structured formats for further processing
  • Support multilingual content recognition
Available Extraction Methods
Open WebUI supports multiple document extraction engines to accommodate different needs and document types. Each extraction method has its own strengths and is suitable for different scenarios.
Explore the documentation for each available extraction method to learn how to set it up and use it effectively with your Open WebUI instance.
💡
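Here we use Microsoft's Document Intelligence for document extraction, converting unstructured content (such as PDF files) into structured data that a language model can work with.
Open WebUI's default text extraction is limited. Before vectorization and retrieval, Open WebUI typically extracts text with tools such as pdfminer or PyPDF2. An image-only (scanned) PDF has no embedded text layer, so these libraries return only "empty text". For example: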
Summarizing a text-based PDF: OK
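Summarizing an image-only PDF: FAIL. The upload appears to succeed, but no text is extracted internally, so retrieval returns nothing and the model appears unable to read the file.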
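The failure case above comes down to the extractor returning little or no text for every page. A minimal sketch of a check for that condition, assuming the per-page strings have already been produced by whichever extractor is in use. The function name `looks_image_only` and the `min_chars` threshold are hypothetical and not part of Open WebUI:

```python
def looks_image_only(page_texts, min_chars=20):
    """Return True when the extracted text is too sparse to index.

    page_texts: list of strings, one per page, as returned by a PDF
    text extractor. min_chars is an illustrative threshold (assumption).
    """
    # Count only non-whitespace characters so blank pages contribute zero.
    total = sum(len("".join(text.split())) for text in page_texts)
    return total < min_chars
```

A document flagged this way is a candidate for an OCR-capable engine such as Document Intelligence rather than the default extractor.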
 
The Open WebUI backend already bundles the Document Intelligence dependency. To configure it, add the Azure DI service Key and Endpoint to the environment variables, then update CONTENT_EXTRACTION_ENGINE with a POST request. See: Feat: Adding Support for Azure AI Document Intelligence for Content Extraction · open-webui/open-webui · Discussion #9583

Steps to enable (recommended order)

  1. Add the following to .env:
       DOCUMENT_INTELLIGENCE_ENDPOINT=...
       DOCUMENT_INTELLIGENCE_KEY=...
  2. Start or restart the container.
  3. Send the update request through the API or the RAG settings page in the front end:
       POST /api/v1/retrieval/config/update
       {
         "CONTENT_EXTRACTION_ENGINE": "document_intelligence",
         "DOCUMENT_INTELLIGENCE_ENDPOINT": "https://xxx.cognitiveservices.azure.com/",
         "DOCUMENT_INTELLIGENCE_KEY": "<masked>"
       }
  4. Verify:
       GET /api/v1/retrieval/config -> "CONTENT_EXTRACTION_ENGINE": "document_intelligence"
  5. Upload an image-only PDF, then run a retrieval query to confirm that text comes back.
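Steps 1 and 4 above are easy to get subtly wrong (a blank key, a config that was never updated). The following is a minimal sketch of two offline checks, assuming a plain `NAME=value` .env layout and a config response that has already been parsed into a dict. The helper names `missing_env_vars` and `engine_is_enabled` are hypothetical and not part of Open WebUI:

```python
REQUIRED_VARS = ("DOCUMENT_INTELLIGENCE_ENDPOINT", "DOCUMENT_INTELLIGENCE_KEY")


def missing_env_vars(env_text, required=REQUIRED_VARS):
    """Return the required variable names that are absent or empty in .env text."""
    values = {}
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Later assignments override earlier ones; surrounding quotes are dropped.
        values[name.strip()] = value.strip().strip('"').strip("'")
    return [name for name in required if not values.get(name)]


def engine_is_enabled(config):
    """Check a parsed GET /api/v1/retrieval/config response (step 4)."""
    return config.get("CONTENT_EXTRACTION_ENGINE") == "document_intelligence"
```

Running `missing_env_vars(open(".env").read())` before restarting the container gives an empty list when both variables are set.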
The steps above can also be applied in one go with a PowerShell script:
$TOKEN = "<admin_token>"
$BASE = "http://localhost:3000/api/v1/retrieval"
$headers = @{ Authorization = "Bearer $TOKEN" }

$payload = @{
    CONTENT_EXTRACTION_ENGINE = "document_intelligence"
    DOCUMENT_INTELLIGENCE_ENDPOINT = "https://xxx.cognitiveservices.azure.com/"
    DOCUMENT_INTELLIGENCE_KEY = "<KEY>"
    CHUNK_SIZE = 1000
    CHUNK_OVERLAP = 100
}

# Apply the new retrieval configuration.
Invoke-RestMethod -Method POST -Uri "$BASE/config/update" -Headers $headers -ContentType "application/json" -Body ($payload | ConvertTo-Json -Depth 4)

# Read the configuration back and show only the two fields of interest.
# Select-Object picks the properties directly; Select-String would treat the whole JSON string as one match.
Invoke-RestMethod -Method GET -Uri "$BASE/config" -Headers $headers | Select-Object CONTENT_EXTRACTION_ENGINE, DOCUMENT_INTELLIGENCE_ENDPOINT
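Verification: the output should show CONTENT_EXTRACTION_ENGINE set to document_intelligence, together with the configured DOCUMENT_INTELLIGENCE_ENDPOINT.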
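For environments without PowerShell, the same two calls from the script above could be sketched in Python using only the standard library. This is an illustration under assumptions: the function names are hypothetical, the base URL and field names are copied from the script, and the chunk settings are left out for brevity:

```python
import json
import urllib.request

BASE = "http://localhost:3000/api/v1/retrieval"  # same base URL as the script


def build_update_request(token, endpoint, key, base=BASE):
    """Build (without sending) the POST that switches the extraction engine."""
    payload = {
        "CONTENT_EXTRACTION_ENGINE": "document_intelligence",
        "DOCUMENT_INTELLIGENCE_ENDPOINT": endpoint,
        "DOCUMENT_INTELLIGENCE_KEY": key,
    }
    return urllib.request.Request(
        f"{base}/config/update",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_config(token, base=BASE):
    """GET the current retrieval config and return it as a dict."""
    req = urllib.request.Request(
        f"{base}/config", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def enable_document_intelligence(token, endpoint, key, base=BASE):
    """Send the update, then read the config back and return the active engine."""
    with urllib.request.urlopen(build_update_request(token, endpoint, key, base)):
        pass  # the response body is not needed; an HTTP error raises here
    return fetch_config(token, base).get("CONTENT_EXTRACTION_ENGINE")
```

Keeping the request construction separate from the network call makes it possible to inspect the URL, headers, and body before anything is sent to the server.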