data-extractor
Verified Safe
by ThreeFish-AI
Overview
A commercial-grade MCP Server for robust web page and PDF content extraction, localization into Markdown, and long-term deployment in enterprise environments.
Installation
uv run data-extractor
Environment Variables
- DATA_EXTRACTOR_USE_PROXY
- DATA_EXTRACTOR_PROXY_URL
- DATA_EXTRACTOR_ENABLE_JAVASCRIPT
- DATA_EXTRACTOR_BROWSER_HEADLESS
- DATA_EXTRACTOR_TRANSPORT_MODE
- DATA_EXTRACTOR_HTTP_HOST
- DATA_EXTRACTOR_HTTP_PORT
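The listing gives only the variable names, not their types or defaults. As a rough illustration of how such settings might be read, here is a minimal Python sketch. The `ExtractorSettings` class, the `load_settings` helper, the boolean parsing rules, and every default value (including `"stdio"`, `127.0.0.1`, and port `8000`) are assumptions made for this example, not the project's documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Strings treated as "true" in this sketch (an assumption, not the project's rule).
_TRUTHY = {"1", "true", "yes", "on"}


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Read a boolean-like variable, falling back to a placeholder default."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class ExtractorSettings:
    """Hypothetical container for the variables listed above."""
    use_proxy: bool
    proxy_url: Optional[str]
    enable_javascript: bool
    browser_headless: bool
    transport_mode: str
    http_host: str
    http_port: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> ExtractorSettings:
    """Build settings from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    return ExtractorSettings(
        use_proxy=_env_bool(env, "DATA_EXTRACTOR_USE_PROXY", False),
        proxy_url=env.get("DATA_EXTRACTOR_PROXY_URL"),
        enable_javascript=_env_bool(env, "DATA_EXTRACTOR_ENABLE_JAVASCRIPT", False),
        browser_headless=_env_bool(env, "DATA_EXTRACTOR_BROWSER_HEADLESS", True),
        # Placeholder defaults below; check the project docs for real values.
        transport_mode=env.get("DATA_EXTRACTOR_TRANSPORT_MODE", "stdio"),
        http_host=env.get("DATA_EXTRACTOR_HTTP_HOST", "127.0.0.1"),
        http_port=int(env.get("DATA_EXTRACTOR_HTTP_PORT", "8000")),
    )
```

Passing a plain dictionary to `load_settings` instead of relying on `os.environ` makes it easy to try different combinations without touching the real environment.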
Security Notes
The project issues network requests for web scraping and PDF downloading, which carries inherent risk if used irresponsibly. The documentation explicitly advises users to comply with `robots.txt` and website terms of use. Features such as stealth scraping and form submission are powerful but need careful use to avoid ethical or legal issues. There are no obvious hardcoded secrets or `eval` usage, and sensitive settings are supplied through environment variables. Correct proxy configuration and responsible usage remain the main security considerations.
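One way to act on the `robots.txt` advice is to check a URL against a site's rules before requesting it. The sketch below uses Python's standard `urllib.robotparser` module and is independent of data-extractor itself. The `is_allowed` helper, the `my-bot` user agent, and the sample rules are hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only: everything under /private/ is off limits.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for target in (
        "https://example.com/docs/index.html",
        "https://example.com/private/report.pdf",
    ):
        print(target, is_allowed(EXAMPLE_ROBOTS, "my-bot", target))
```

In real use the rules text would come from the target site's own `/robots.txt`. A check like this covers only crawler rules and does not replace reading the site's terms of use.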
Similar Servers
scrapegraph-mcp
Provides a Model Context Protocol (MCP) server that integrates with ScrapeGraph AI, enabling language models to perform advanced AI-powered web scraping and structured data extraction across single pages, multiple pages, and search results.
scrapi-mcp
Serves as a Model Context Protocol (MCP) server that utilizes the ScrAPI service to scrape web pages and return their content in either HTML or Markdown format.
html-to-markdown-mcp
Converts HTML content (from a URL or raw string) into clean, formatted Markdown and can save it to a file.
mcp-server-requests
An MCP server that provides HTTP request capabilities, enabling LLMs to fetch and process web content, including saving to files.