data-extractor
Verified Safe
by ThreeFish-AI
Overview
A commercial-grade MCP server that extracts content from web pages and PDFs and converts it to Markdown, built for long-term enterprise deployment.
Installation
uv run data-extractor
Environment Variables
- DATA_EXTRACTOR_SERVER_NAME
- DATA_EXTRACTOR_CONCURRENT_REQUESTS
- DATA_EXTRACTOR_RATE_LIMIT_REQUESTS_PER_MINUTE
- DATA_EXTRACTOR_REQUEST_TIMEOUT
- DATA_EXTRACTOR_MAX_RETRIES
- DATA_EXTRACTOR_ENABLE_JAVASCRIPT
- DATA_EXTRACTOR_BROWSER_HEADLESS
- DATA_EXTRACTOR_BROWSER_TIMEOUT
- DATA_EXTRACTOR_USE_RANDOM_USER_AGENT
- DATA_EXTRACTOR_USE_PROXY
- DATA_EXTRACTOR_PROXY_URL
- DATA_EXTRACTOR_ENABLE_CACHING
- DATA_EXTRACTOR_CACHE_TTL_HOURS
- DATA_EXTRACTOR_LOG_LEVEL
- DATA_EXTRACTOR_DOWNLOAD_DELAY
- DATA_EXTRACTOR_RANDOMIZE_DOWNLOAD_DELAY
- DATA_EXTRACTOR_BROWSER_WINDOW_SIZE
- DATA_EXTRACTOR_CACHE_MAX_SIZE
- DATA_EXTRACTOR_LOG_REQUESTS
- DATA_EXTRACTOR_LOG_RESPONSES
- DATA_EXTRACTOR_TRANSPORT_MODE
- DATA_EXTRACTOR_HTTP_HOST
- DATA_EXTRACTOR_HTTP_PORT
- DATA_EXTRACTOR_HTTP_PATH
- DATA_EXTRACTOR_HTTP_CORS_ORIGINS
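The listing above gives only the variable names, so the following is a minimal sketch of how a few of them could be read and type-converted in Python. It is not the project's own code: the helper names (`load_settings`, `_as_bool`), the chosen types, and every default value are assumptions made for illustration. Check the project's documentation for the real defaults.

```python
import os

PREFIX = "DATA_EXTRACTOR_"  # common prefix shared by every variable in the list


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False (assumed convention)."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(environ=None):
    """Read a subset of the listed variables from a mapping (os.environ by default).

    Defaults below are placeholders, NOT the server's actual defaults.
    """
    env = os.environ if environ is None else environ

    def get(name, default, cast=str):
        raw = env.get(PREFIX + name)
        return default if raw is None else cast(raw)

    return {
        "concurrent_requests": get("CONCURRENT_REQUESTS", 4, int),
        "request_timeout": get("REQUEST_TIMEOUT", 30.0, float),
        "max_retries": get("MAX_RETRIES", 3, int),
        "enable_javascript": get("ENABLE_JAVASCRIPT", False, _as_bool),
        "use_proxy": get("USE_PROXY", False, _as_bool),
        "proxy_url": get("PROXY_URL", None),
        "log_level": get("LOG_LEVEL", "INFO"),
    }


if __name__ == "__main__":
    # Pass an explicit mapping so the example does not depend on the real environment.
    print(load_settings({"DATA_EXTRACTOR_MAX_RETRIES": "5"}))
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test without touching the process environment.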
Security Notes
The server's core functionality is making network requests to arbitrary external URLs for web scraping and PDF downloads, which carries the inherent risk of handling malicious or untrusted content. The codebase shows no signs of malicious intent, hardcoded secrets, or dangerous dynamic code execution patterns (such as `eval` on untrusted input). It follows good practices: it sanitizes HTML by removing scripts and styles, and it respects `robots.txt` rules. It relies on well-maintained libraries (Scrapy, Selenium, Playwright, PyMuPDF, and PyPDF), which helps overall security as long as those libraries are kept updated and properly configured. Sensitive settings such as proxy URLs are supplied through environment variables, which keeps them out of the source code.
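The two practices named above, dropping script and style content and honoring `robots.txt`, can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the server's implementation: the function and class names are hypothetical, and the sketch goes further than removing scripts and styles because it discards all markup and keeps only visible text.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._depth = 0   # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth == 0:
            self.parts.append(data)


def strip_scripts_and_styles(html: str) -> str:
    """Return visible text with script/style bodies removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt text (no network access)."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "<p>Hello</p><script>alert(1)</script><style>p{}</style><p>World</p>"
    print(strip_scripts_and_styles(page))
    robots = "User-agent: *\nDisallow: /private/"
    print(allowed_by_robots(robots, "example-bot", "https://example.com/private/x"))
```

Taking the `robots.txt` body as a string, rather than fetching it, keeps the check separate from the network layer, so the same function works regardless of which HTTP client performs the download.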
Similar Servers
scrapegraph-mcp
Provides AI-powered web scraping, structured data extraction, multi-page crawling, and agentic automation capabilities for language models.
html-to-markdown-mcp
Converts HTML content from web pages or raw strings into Markdown format, with options for including metadata, truncating content, and saving to files.
scrapi-mcp
This MCP server enables AI agents to scrape web pages and retrieve their content as HTML or Markdown, with advanced browser interaction capabilities.
mcp-server-requests
An MCP server that provides HTTP request capabilities, enabling LLMs to fetch and process web content, including saving to files.