kreuzberg
Verified Safe
by Goldziher
Overview
Extracts text, tables, images, and metadata from a wide range of document formats (PDF, Office, images, HTML, etc.), with support for multiple OCR backends and an extensible plugin system. Can be run as a Model Context Protocol (MCP) server.
Installation
kreuzberg mcp
Environment Variables
- KREUZBERG_CONFIG_PATH
- KREUZBERG_OCR_BACKEND
- KREUZBERG_OCR_LANGUAGE
- KREUZBERG_CHUNK_MAX_CHARS
- KREUZBERG_CHUNK_MAX_OVERLAP
- KREUZBERG_USE_CACHE
- KREUZBERG_TOKEN_REDUCTION_MODE
- KREUZBERG_ENCODING_CACHE_MAX_ENTRIES
- KREUZBERG_ENCODING_CACHE_MAX_BYTES
- KREUZBERG_API_HOST
- KREUZBERG_API_PORT
- KREUZBERG_API_CORS_ORIGINS
- KREUZBERG_API_MAX_REQUEST_BODY_BYTES
- KREUZBERG_API_MAX_MULTIPART_FIELD_BYTES
- KREUZBERG_API_MAX_UPLOAD_MB
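As a rough illustration of how these variables might be supplied, the sketch below builds an environment for the `kreuzberg mcp` command shown above. The variable names come from the list; the values ("tesseract", "eng", "2000") and the helper names `build_env` and `launch_mcp` are assumptions made for this example, since the listing does not document accepted values or formats.

```python
import os
import shutil
import subprocess

# Placeholder overrides: the names are taken from the list above, but the
# values are assumptions and may not match what kreuzberg actually accepts.
ASSUMED_OVERRIDES = {
    "KREUZBERG_OCR_BACKEND": "tesseract",  # assumed value
    "KREUZBERG_OCR_LANGUAGE": "eng",       # assumed value
    "KREUZBERG_CHUNK_MAX_CHARS": "2000",   # assumed value
}


def build_env(overrides, base=None):
    """Return a copy of the base environment with the overrides applied."""
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        # Guard against typos by accepting only the KREUZBERG_ prefix.
        if not name.startswith("KREUZBERG_"):
            raise ValueError(f"unexpected variable name: {name}")
        env[name] = str(value)
    return env


def launch_mcp(overrides):
    """Start `kreuzberg mcp` with the overrides, if the CLI is on PATH."""
    if shutil.which("kreuzberg") is None:
        raise FileNotFoundError("kreuzberg executable not found on PATH")
    return subprocess.Popen(["kreuzberg", "mcp"], env=build_env(overrides))
```

Calling `launch_mcp(ASSUMED_OVERRIDES)` would start the server as a child process with the chosen settings, leaving the parent shell's environment untouched.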
Security Notes
The server processes untrusted external input (documents) and relies on FFI bindings to a Rust core and on external tools such as LibreOffice and Tesseract. The codebase shows strong awareness of security concerns: it includes explicit validators for common vulnerabilities such as zip bombs, XML entity expansion, and unbounded string growth, and it validates input before crossing FFI boundaries. However, as with any system that handles arbitrary external data and exposes APIs (HTTP/MCP), overall security depends on proper deployment, network hardening, and possibly additional access control layers added by the user. Some test files contain debug logging; these are not production code.
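To make the zip bomb concern concrete, here is a minimal sketch of the general kind of pre-extraction check such a validator could perform. It is not kreuzberg's actual implementation: the function name `check_zip_limits` and both limits are hypothetical, and the check trusts the sizes declared in the archive headers, so a hardened validator would also cap the bytes actually read during decompression.

```python
import io
import zipfile

MAX_TOTAL_BYTES = 100 * 1024 * 1024  # hypothetical cap on total uncompressed size
MAX_RATIO = 100                      # hypothetical cap on per-entry compression ratio


def check_zip_limits(data, max_total=MAX_TOTAL_BYTES, max_ratio=MAX_RATIO):
    """Inspect declared entry sizes and reject suspicious archives.

    Returns the total declared uncompressed size when the archive passes.
    """
    total = 0
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            total += info.file_size
            if total > max_total:
                raise ValueError("archive exceeds total uncompressed size limit")
            # Skip the ratio check for empty entries to avoid dividing by zero.
            if info.compress_size > 0:
                ratio = info.file_size / info.compress_size
                if ratio > max_ratio:
                    raise ValueError(f"entry {info.filename!r} is over-compressed")
    return total
```

Running the check before any entry is extracted means an oversized or heavily compressed archive is rejected without its contents ever being decompressed.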
Similar Servers
kreuzberg
Extracts text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Supports multiple OCR backends, extensible plugins, and is designed for data preprocessing in AI/ML workflows.
pdf-reader-mcp
Provides a robust server for AI agents to extract text, images, and metadata from PDF documents, preserving content order for better comprehension.
mineru-tianshu
Enterprise-grade AI data preprocessing platform for converting diverse unstructured multi-modal data (documents, images, audio, video, bioinformatics formats) into structured Markdown and JSON formats, leveraging GPU acceleration and a robust task management system with user authentication and MCP protocol integration.
flexible-graphrag
The Flexible GraphRAG MCP Server integrates document processing, knowledge graph building, hybrid search, and AI query capabilities via the Model Context Protocol (MCP) for clients like Claude Desktop and MCP Inspector.