data-extractor

Name: data-extractor
Author: ThreeFish-AI

Verified Safe

Overview

A commercial-grade MCP Server designed for robust web page and PDF content extraction and conversion to Markdown, purpose-built for long-term enterprise deployment.

Installation

Run Command

uv run data-extractor

Environment Variables

DATA_EXTRACTOR_SERVER_NAME
DATA_EXTRACTOR_CONCURRENT_REQUESTS
DATA_EXTRACTOR_RATE_LIMIT_REQUESTS_PER_MINUTE
DATA_EXTRACTOR_REQUEST_TIMEOUT
DATA_EXTRACTOR_MAX_RETRIES
DATA_EXTRACTOR_ENABLE_JAVASCRIPT
DATA_EXTRACTOR_BROWSER_HEADLESS
DATA_EXTRACTOR_BROWSER_TIMEOUT
DATA_EXTRACTOR_USE_RANDOM_USER_AGENT
DATA_EXTRACTOR_USE_PROXY
DATA_EXTRACTOR_PROXY_URL
DATA_EXTRACTOR_ENABLE_CACHING
DATA_EXTRACTOR_CACHE_TTL_HOURS
DATA_EXTRACTOR_LOG_LEVEL
DATA_EXTRACTOR_DOWNLOAD_DELAY
DATA_EXTRACTOR_RANDOMIZE_DOWNLOAD_DELAY
DATA_EXTRACTOR_BROWSER_WINDOW_SIZE
DATA_EXTRACTOR_CACHE_MAX_SIZE
DATA_EXTRACTOR_LOG_REQUESTS
DATA_EXTRACTOR_LOG_RESPONSES
DATA_EXTRACTOR_TRANSPORT_MODE
DATA_EXTRACTOR_HTTP_HOST
DATA_EXTRACTOR_HTTP_PORT
DATA_EXTRACTOR_HTTP_PATH
DATA_EXTRACTOR_HTTP_CORS_ORIGINS
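
As a rough illustration of how these variables might be supplied at launch, the sketch below builds an environment and passes it to the run command shown above. The listing names the variables but not their accepted values or defaults, so every value here is a placeholder assumption, and the helper names `build_env`, `launch_command`, and `run_server` are hypothetical.

```python
import os
import subprocess

PREFIX = "DATA_EXTRACTOR_"

# Placeholder values only: the listing does not document accepted values
# or defaults, so treat these as assumptions to be checked against the source.
EXAMPLE_SETTINGS = {
    "DATA_EXTRACTOR_LOG_LEVEL": "INFO",
    "DATA_EXTRACTOR_REQUEST_TIMEOUT": "30",
    "DATA_EXTRACTOR_MAX_RETRIES": "3",
}


def build_env(settings, base=None):
    """Return a copy of the base environment with the settings applied."""
    env = dict(os.environ if base is None else base)
    for name, value in settings.items():
        # Catch misspelled names early: every listed variable shares the prefix.
        if not name.startswith(PREFIX):
            raise ValueError(f"unexpected variable name: {name}")
        env[name] = str(value)
    return env


def launch_command():
    """The run command from the Installation section, as an argument list."""
    return ["uv", "run", "data-extractor"]


def run_server(settings):
    """Start the server with the given settings (requires uv to be installed)."""
    return subprocess.run(launch_command(), env=build_env(settings), check=True)
```

Calling `run_server(EXAMPLE_SETTINGS)` would start the server with those three variables set on top of the current environment. Keeping values such as the proxy URL in the environment, rather than in a file under version control, matches the approach described in the Security Notes.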

Security Notes

The server's core functionality is making network requests to arbitrary external URLs for web scraping and PDF downloads, which carries the inherent risk of processing malicious or untrusted content. The codebase shows no signs of malicious intent, hardcoded secrets, or dangerous dynamic code execution patterns (such as `eval` on untrusted input). It follows good practices such as sanitizing HTML (removing scripts and styles) and respecting `robots.txt` rules. Its reliance on well-maintained libraries such as Scrapy, Selenium, Playwright, PyMuPDF, and PyPDF also helps, provided those libraries are kept updated and properly configured. Sensitive settings such as proxy URLs are supplied through environment variables rather than hardcoded in the source.
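
The two practices mentioned above, stripping scripts and styles from HTML and checking `robots.txt` before fetching, can be sketched with the Python standard library alone. This is not the server's implementation, which is not shown in this listing; the names `TextWithoutScripts`, `strip_scripts_and_styles`, and `is_allowed` are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class TextWithoutScripts(HTMLParser):
    """Collect visible text while skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside any skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_scripts_and_styles(html):
    """Return the text of an HTML document without script or style bodies."""
    parser = TextWithoutScripts()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt content."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A caller would typically run `is_allowed` before requesting a page and `strip_scripts_and_styles` on the response body before converting it to Markdown. Passing the `robots.txt` text in as a string keeps the check free of network access, which makes it easy to test.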

Similar Servers

scrapegraph-mcp

Provides AI-powered web scraping, structured data extraction, multi-page crawling, and agentic automation capabilities for language models.

Other

$High

html-to-markdown-mcp

Converts HTML content from web pages or raw strings into Markdown format, with options for including metadata, truncating content, and saving to files.

Other

$Medium