kreuzberg
Verified Safe
by kreuzberg-dev
Overview
Extracts text, tables, images, and metadata from 56 file formats including PDF, Office documents, and images. Supports multiple OCR backends, extensible plugins, and is designed for data preprocessing in AI/ML workflows.
Installation
python -c "import asyncio; from kreuzberg import extract_file, ExtractionConfig; config = ExtractionConfig(use_cache=True, enable_quality_processing=True); result = asyncio.run(extract_file('document.pdf', config=config)); print(result.content)"
Environment Variables
- KREUZBERG_LOG_LEVEL
- KREUZBERG_CONFIG_PATH
- TESSDATA_PREFIX
- KREUZBERG_ENCODING_CACHE_MAX_ENTRIES
- KREUZBERG_ENCODING_CACHE_MAX_BYTES
- CORS_ALLOW_ORIGINS
- LISTEN_ADDR
- LISTEN_PORT
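The listing gives only the variable names, not their meanings or accepted values. As a minimal sketch under that limitation, the following Python helper reports which of the listed variables are set in the current environment before a server or script is launched. The helper name `configured_variables` is hypothetical and not part of kreuzberg.

```python
import os
from typing import Mapping

# Names copied from the list above. Their meanings and accepted values are
# not documented on this page, so this sketch only reports which are set.
KNOWN_VARIABLES = (
    "KREUZBERG_LOG_LEVEL",
    "KREUZBERG_CONFIG_PATH",
    "TESSDATA_PREFIX",
    "KREUZBERG_ENCODING_CACHE_MAX_ENTRIES",
    "KREUZBERG_ENCODING_CACHE_MAX_BYTES",
    "CORS_ALLOW_ORIGINS",
    "LISTEN_ADDR",
    "LISTEN_PORT",
)


def configured_variables(environ: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the listed variables that are present in `environ` (hypothetical helper)."""
    return {name: environ[name] for name in KNOWN_VARIABLES if name in environ}


if __name__ == "__main__":
    current = configured_variables()
    for name in KNOWN_VARIABLES:
        # Print every known name so unset variables are visible too.
        print(f"{name}={current.get(name, '(unset)')}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to check in isolation.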
Security Notes
The core Rust library implements explicit input validation (for example, protection against zip bombs and XML entity expansion) and graceful panic handling at FFI boundaries. Running external processes (for example, LibreOffice for older Office formats) carries inherent risk, though the project appears to account for this with safeguards. No direct 'eval' calls or obvious hardcoded credentials were found in the provided code snippets. Overall, the project shows a strong focus on safely processing potentially untrusted inputs, but the risks of native code execution and external dependencies should still be considered.
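To illustrate what a zip bomb check can look like, here is a minimal Python sketch that flags archives with an extreme compression ratio or a very large declared size. It is not kreuzberg's implementation, which lives in the Rust core and is not shown on this page. The function names and both thresholds are assumptions chosen for illustration.

```python
import io
import zipfile


def looks_like_zip_bomb(
    data: bytes,
    max_ratio: float = 100.0,              # assumed threshold, for illustration
    max_total_bytes: int = 1_000_000_000,  # assumed threshold, for illustration
) -> bool:
    """Heuristic check based on the sizes declared in the archive headers."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        total_uncompressed = 0
        for info in archive.infolist():
            total_uncompressed += info.file_size
            if total_uncompressed > max_total_bytes:
                return True
            # Skip the ratio test for empty entries to avoid dividing by zero.
            if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
                return True
    return False


def make_zip(payload: bytes) -> bytes:
    """Build a one-entry deflate-compressed archive in memory (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("payload.bin", payload)
    return buffer.getvalue()


if __name__ == "__main__":
    print(looks_like_zip_bomb(make_zip(b"\0" * 5_000_000)))  # highly compressible payload
    print(looks_like_zip_bomb(make_zip(bytes(range(256)))))  # barely compressible payload
```

Declared sizes can be forged, so a production check would also cap the number of bytes actually written during decompression.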
Similar Servers
kreuzberg
Extracts text, tables, images, and metadata from a wide range of document formats (PDF, Office, images, HTML, etc.), with support for multiple OCR backends and an extensible plugin system. Can be run as a Model Context Protocol (MCP) server.
pdf-reader-mcp
Provides a robust server for AI agents to extract text, images, and metadata from PDF documents, preserving content order for better comprehension.
pdflens-mcp
This MCP server provides tools for reading and extracting information from PDF files, including text and images, designed for AI clients.
lyra-tool-discovery
This MCP server is designed to fetch, parse, and organize documentation from websites implementing the llms.txt standard. It transforms raw documentation into structured, agent-ready formats, exposing tools for AI agents, LLMs, and automation workflows to consume documentation programmatically.