datalake-mcp-server
by BERDataLakehouse
Overview
A FastAPI service enabling AI assistants to interact with Delta Lake tables stored in MinIO through Spark, using the Model Context Protocol (MCP) for natural language data operations.
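As a sketch of what an MCP interaction looks like on the wire, a client issues a JSON-RPC 2.0 `tools/call` request per the MCP specification. The tool name `query_table` and its arguments below are illustrative assumptions, not taken from this codebase:

```python
import json

# Hypothetical MCP tool call: JSON-RPC 2.0 envelope as defined by the
# Model Context Protocol spec. The tool name "query_table" and its
# arguments are illustrative only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_table",
        "arguments": {
            "table": "my_namespace.my_delta_table",
            "columns": ["sample_id", "reading"],
            "limit": 100,
        },
    },
}

payload = json.dumps(request)
print(payload)
```

The server would execute the corresponding Spark query against the Delta Lake table and return the result as the tool's response.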
Installation
exec python -m src.main
Environment Variables
- KBASE_AUTH_URL
- KBASE_ADMIN_ROLES
- KBASE_REQUIRED_ROLES
- POSTGRES_URL
- POSTGRES_DB
- POSTGRES_USER
- POSTGRES_PASSWORD
- BERDL_REDIS_HOST
- BERDL_REDIS_PORT
- MINIO_ENDPOINT_URL
- MINIO_SECURE
- SPARK_HOME
- SPARK_MASTER_URL
- SPARK_CONNECT_URL
- BERDL_HIVE_METASTORE_URI
- SPARK_WORKER_COUNT
- SPARK_WORKER_CORES
- SPARK_WORKER_MEMORY
- SPARK_MASTER_CORES
- SPARK_MASTER_MEMORY
- GOVERNANCE_API_URL
- BERDL_POD_IP
- SERVICE_ROOT_PATH
- LOG_LEVEL
- REQUEST_TIMEOUT_SECONDS
- SPARK_CONNECT_URL_TEMPLATE
- K8S_ENVIRONMENT
- SHARED_SPARK_MASTER_URL
- SPARK_QUERY_TIMEOUT
- SPARK_COLLECT_TIMEOUT
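A minimal sketch of reading a few of these settings at startup. Only the variable names come from the list above; the dataclass shape and the defaults are assumptions, not the service's actual configuration code:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Illustrative subset of the service's environment-driven config."""
    minio_endpoint_url: str
    minio_secure: bool
    spark_connect_url: str
    log_level: str
    request_timeout_seconds: int


def load_settings() -> Settings:
    # Defaults here are illustrative, not the service's actual defaults.
    return Settings(
        minio_endpoint_url=os.environ["MINIO_ENDPOINT_URL"],
        minio_secure=os.environ.get("MINIO_SECURE", "false").lower() == "true",
        spark_connect_url=os.environ["SPARK_CONNECT_URL"],
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
        request_timeout_seconds=int(os.environ.get("REQUEST_TIMEOUT_SECONDS", "30")),
    )
```

Required variables without defaults (such as `MINIO_ENDPOINT_URL`) fail fast with a `KeyError` if unset, which surfaces misconfiguration at startup rather than at query time.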
Security Notes
The service explicitly warns that it allows arbitrary read-oriented queries to be executed against Delta Lake tables and that query results will be sent to the model host server (unless locally hosted). It also strictly warns against deploying to production without KBase leadership approval and advises against querying sensitive data.

Positive security aspects include:
- KBase authentication with role-based access control (`KBASE_REQUIRED_ROLES`).
- Strong input validation for structured SELECT queries to prevent SQL injection (`_validate_identifier`, `_escape_value`).
- A basic (but not comprehensive) `_check_query_is_valid` function that disallows common destructive SQL keywords (`drop`, `delete`, `insert`, `update`, `create`, `alter`, `merge`, `truncate`, `vacuum`), metacharacters (`;`, `/*`, `--`), and PostgreSQL system schemas (`pg_`, `information_schema`).
- Dynamic loading of MinIO credentials from user home directories (`/home/{username}/.berdl_minio_credentials`), avoiding hardcoded secrets for data access.
- Explicit validation in `src/main.py` requiring `POSTGRES_USER` to be `readonly_user`, indicating a read-only database connection for the Hive Metastore backend.
- Use of an LRUCache for KBase Auth tokens and Redis for query results, with user-scoped cache keys (`_generate_cache_key`) to prevent cross-user data leakage from the cache.

Areas for caution/improvement:
- The `_check_query_is_valid` function is not comprehensive and is explicitly noted as such; relying on it alone for SQL injection prevention is not advised. The primary defenses for arbitrary read-oriented queries are the read-only database user and S3 policies.
- The core risk remains the arbitrary read-oriented queries: if misused by an AI or a malicious actor, they could expose data the user should not have access to, even read-only, wherever S3 policies are not perfectly aligned with the KBase Auth roles/groups logic.
- The warning about sending query results to the model host server highlights a significant data privacy concern for users.
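The blocklist approach described above can be sketched as follows. This is a simplified illustration of the technique, not the service's actual `_check_query_is_valid` implementation:

```python
import re

# Keywords, metacharacters, and schema prefixes named in the notes above.
_FORBIDDEN_KEYWORDS = {
    "drop", "delete", "insert", "update", "create",
    "alter", "merge", "truncate", "vacuum",
}
_FORBIDDEN_SEQUENCES = (";", "/*", "--")
_FORBIDDEN_PREFIXES = ("pg_", "information_schema")


def check_query_is_valid(query: str) -> bool:
    """Reject queries containing destructive keywords, SQL
    metacharacters, or PostgreSQL system-schema references.

    Intentionally simplistic: a blocklist like this is defense in
    depth, not a substitute for a read-only database user and
    correctly scoped S3 policies.
    """
    lowered = query.lower()
    if any(seq in lowered for seq in _FORBIDDEN_SEQUENCES):
        return False
    if any(prefix in lowered for prefix in _FORBIDDEN_PREFIXES):
        return False
    # Tokenize so keywords match whole words, not substrings of identifiers.
    tokens = set(re.findall(r"[a-z_]+", lowered))
    return not (tokens & _FORBIDDEN_KEYWORDS)
```

Note that tokenizing before the keyword check avoids false positives on identifiers that merely contain a forbidden word, while the substring checks for metacharacters remain deliberately strict.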
Similar Servers
fastapi_mcp
Automatically converts FastAPI endpoints into Model Context Protocol (MCP) tools for seamless integration with LLM agents.
mcpstore
Orchestrates Model Context Protocol (MCP) services and adapts them for integration with various AI frameworks like LangChain, managing tools and their lifecycle.
mcp-trino
Enables AI assistants to interact with Trino's distributed SQL query engine for data analytics through a standardized Model Context Protocol (MCP) server.
openapi-mcp-server
Converts OpenAPI specifications into Model Context Protocol (MCP) tools, enabling AI assistants to interact with APIs.