scraper-mcp
by jessaminesimple608
Overview
An MCP server for efficient web scraping, offering tools to extract raw HTML, convert to markdown, extract plain text, and discover links from webpages.
Installation
docker-compose up -d
Environment Variables
- TRANSPORT
- HOST
- PORT
- CACHE_DIR
- HTTP_PROXY
- HTTPS_PROXY
- NO_PROXY
- SCRAPEOPS_API_KEY
- SCRAPEOPS_RENDER_JS
- SCRAPEOPS_RESIDENTIAL
- SCRAPEOPS_COUNTRY
- SCRAPEOPS_KEEP_HEADERS
- SCRAPEOPS_DEVICE
- ENABLE_CACHE_TOOLS
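Before starting the container, it can help to see which of these variables are actually set. The sketch below only checks for presence and never prints values, so `SCRAPEOPS_API_KEY` is not leaked. The variable names are copied from the list above; the helper `report_env` is hypothetical and not part of scraper-mcp, and no defaults or meanings are assumed for any variable.

```python
import os

# Variable names copied from the list above.
SCRAPER_ENV_VARS = [
    "TRANSPORT",
    "HOST",
    "PORT",
    "CACHE_DIR",
    "HTTP_PROXY",
    "HTTPS_PROXY",
    "NO_PROXY",
    "SCRAPEOPS_API_KEY",
    "SCRAPEOPS_RENDER_JS",
    "SCRAPEOPS_RESIDENTIAL",
    "SCRAPEOPS_COUNTRY",
    "SCRAPEOPS_KEEP_HEADERS",
    "SCRAPEOPS_DEVICE",
    "ENABLE_CACHE_TOOLS",
]


def report_env(environ=None):
    """Map each listed variable to True if it is set and non-empty.

    Hypothetical helper for illustration; it reads os.environ unless a
    mapping is passed in.
    """
    env = os.environ if environ is None else environ
    return {name: bool(env.get(name)) for name in SCRAPER_ENV_VARS}


if __name__ == "__main__":
    # Print presence only, never the values themselves.
    for name, is_set in report_env().items():
        print(f"{name}: {'set' if is_set else 'unset'}")
```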
Security Notes
Critical security risks identified:
1. Default SSL verification is disabled: The `RequestsProvider` uses `verify_ssl=False` by default, making all HTTPS requests vulnerable to Man-in-the-Middle (MITM) attacks.
2. Unauthenticated Admin API and Dashboard: The `/healthz`, `/api/stats`, `/api/cache/clear`, `/api/config` endpoints and the root dashboard (`/`) are exposed without any visible authentication or authorization. This allows anyone with network access to query server statistics, clear the cache, and modify runtime configuration (e.g., proxy settings, concurrency).
3. Exposure of Cache Management Tools: If `ENABLE_CACHE_TOOLS` is set, cache management tools are also exposed via MCP without authentication.
These issues make the server unsafe for deployment in public or untrusted networks without additional security measures (e.g., a reverse proxy with authentication).
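To make the first risk concrete, the sketch below contrasts a verifying TLS context with a non-verifying one, using only Python's standard `ssl` module. It does not reproduce scraper-mcp's `RequestsProvider`, whose internals are not shown here; the helper `make_ssl_context` is a hypothetical name used for illustration.

```python
import ssl


def make_ssl_context(verify=True):
    """Build a client-side TLS context (hypothetical helper).

    With verify=True the context keeps Python's defaults: the server
    certificate chain is validated and the hostname is checked.
    With verify=False both checks are switched off, which is the
    MITM-prone behaviour described above.
    """
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


if __name__ == "__main__":
    safe = make_ssl_context()
    unsafe = make_ssl_context(verify=False)
    print("verified:  ", safe.verify_mode, safe.check_hostname)
    print("unverified:", unsafe.verify_mode, unsafe.check_hostname)
```

A context built this way can be passed to `urllib.request.urlopen(url, context=ctx)`. With the unverified variant, any certificate is accepted, so a network attacker can impersonate the target site.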
Similar Servers
DevDocs
DevDocs is a web crawling and content extraction platform designed to accelerate software development by converting documentation into LLM-ready formats for intelligent data querying and fine-tuning.
mcp-omnisearch
Provides a unified interface for various search, AI response, content processing, and enhancement tools via Model Context Protocol (MCP).
scrapegraph-mcp
Provides AI-powered web scraping, structured data extraction, multi-page crawling, and agentic automation capabilities for language models.
webscraping-ai-mcp-server
Integrates with WebScraping.AI to provide LLM-powered web data extraction, including question answering, structured data extraction, and HTML/text retrieval, with advanced features like JavaScript rendering and proxy management.