data-extractor

Verified Safe

by ThreeFish-AI

Overview

A commercial-grade MCP server that extracts web page and PDF content and converts it to Markdown, built for long-term enterprise deployment.

Installation

Run Command
uv run data-extractor

Environment Variables

  • DATA_EXTRACTOR_SERVER_NAME
  • DATA_EXTRACTOR_CONCURRENT_REQUESTS
  • DATA_EXTRACTOR_RATE_LIMIT_REQUESTS_PER_MINUTE
  • DATA_EXTRACTOR_REQUEST_TIMEOUT
  • DATA_EXTRACTOR_MAX_RETRIES
  • DATA_EXTRACTOR_ENABLE_JAVASCRIPT
  • DATA_EXTRACTOR_BROWSER_HEADLESS
  • DATA_EXTRACTOR_BROWSER_TIMEOUT
  • DATA_EXTRACTOR_USE_RANDOM_USER_AGENT
  • DATA_EXTRACTOR_USE_PROXY
  • DATA_EXTRACTOR_PROXY_URL
  • DATA_EXTRACTOR_ENABLE_CACHING
  • DATA_EXTRACTOR_CACHE_TTL_HOURS
  • DATA_EXTRACTOR_LOG_LEVEL
  • DATA_EXTRACTOR_DOWNLOAD_DELAY
  • DATA_EXTRACTOR_RANDOMIZE_DOWNLOAD_DELAY
  • DATA_EXTRACTOR_BROWSER_WINDOW_SIZE
  • DATA_EXTRACTOR_CACHE_MAX_SIZE
  • DATA_EXTRACTOR_LOG_REQUESTS
  • DATA_EXTRACTOR_LOG_RESPONSES
  • DATA_EXTRACTOR_TRANSPORT_MODE
  • DATA_EXTRACTOR_HTTP_HOST
  • DATA_EXTRACTOR_HTTP_PORT
  • DATA_EXTRACTOR_HTTP_PATH
  • DATA_EXTRACTOR_HTTP_CORS_ORIGINS
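
The listing gives only the variable names, so the following is a minimal sketch of how a few of them might be read and type-converted in Python. The `Settings` class, the `load_settings` helper, the value types, and every default shown are assumptions for illustration, not the server's actual parsing logic.

```python
import os
from dataclasses import dataclass

PREFIX = "DATA_EXTRACTOR_"


def _as_bool(value: str) -> bool:
    # Treat common truthy spellings as True; anything else is False.
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass
class Settings:
    # Hypothetical container covering four of the listed variables.
    concurrent_requests: int
    request_timeout: float
    enable_javascript: bool
    log_level: str


def load_settings(env=None) -> Settings:
    # Read from the given mapping, or from the process environment.
    env = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return env.get(PREFIX + name, default)

    # All defaults below are assumed values, not documented ones.
    return Settings(
        concurrent_requests=int(get("CONCURRENT_REQUESTS", "16")),
        request_timeout=float(get("REQUEST_TIMEOUT", "30")),
        enable_javascript=_as_bool(get("ENABLE_JAVASCRIPT", "false")),
        log_level=get("LOG_LEVEL", "INFO").upper(),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without touching the real environment.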

Security Notes

The server's core function is to make network requests to arbitrary external URLs for web scraping and PDF downloads, which carries inherent risk from potentially malicious or untrusted content. The codebase shows no signs of malicious intent, hardcoded secrets, or dangerous dynamic code execution (such as `eval` on untrusted input). It sanitizes HTML by removing scripts and styles, and it respects `robots.txt` rules.

It relies on well-maintained libraries (Scrapy, Selenium, Playwright, PyMuPDF, and PyPDF), which supports overall security provided they are kept updated and properly configured. Sensitive settings such as proxy URLs are supplied through environment variables rather than hardcoded.
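
The two safeguards named in the notes, stripping scripts and styles from HTML and honoring `robots.txt`, can be illustrated with the Python standard library alone. This is a sketch under stated assumptions: the helper names `strip_scripts_and_styles` and `allowed_by_robots` are hypothetical, and the server itself may implement these checks differently through the libraries it uses.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _TextOnly(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_scripts_and_styles(html: str) -> str:
    # Hypothetical helper: returns text with script/style bodies removed.
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    # Hypothetical helper: evaluates an already-downloaded robots.txt body.
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "<p>Hello</p><script>alert(1)</script><p>World</p>"
    print(strip_scripts_and_styles(page))
    robots = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(robots, "example-bot", "https://example.com/private/a"))
```

Checking `robots.txt` before fetching and sanitizing after fetching keeps the two concerns separate, so either check can be tested on its own.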

Stats

Interest Score: 32
Security Score: 8
Cost Class: High
Avg Tokens: 8000
Stars: 2
Forks: 2
Last Update: 2026-01-19

Tags

web scraping, PDF processing, Markdown conversion, data extraction, enterprise