Back to Home
BERDataLakehouse icon

datalake-mcp-server

by BERDataLakehouse

Overview

A FastAPI service enabling AI assistants to interact with Delta Lake tables stored in MinIO through Spark, using the Model Context Protocol (MCP) for natural language data operations.

Installation

Run Command
exec python -m src.main

Environment Variables

  • KBASE_AUTH_URL
  • KBASE_ADMIN_ROLES
  • KBASE_REQUIRED_ROLES
  • POSTGRES_URL
  • POSTGRES_DB
  • POSTGRES_USER
  • POSTGRES_PASSWORD
  • BERDL_REDIS_HOST
  • BERDL_REDIS_PORT
  • MINIO_ENDPOINT_URL
  • MINIO_SECURE
  • SPARK_HOME
  • SPARK_MASTER_URL
  • SPARK_CONNECT_URL
  • BERDL_HIVE_METASTORE_URI
  • SPARK_WORKER_COUNT
  • SPARK_WORKER_CORES
  • SPARK_WORKER_MEMORY
  • SPARK_MASTER_CORES
  • SPARK_MASTER_MEMORY
  • GOVERNANCE_API_URL
  • BERDL_POD_IP
  • SERVICE_ROOT_PATH
  • LOG_LEVEL
  • REQUEST_TIMEOUT_SECONDS
  • SPARK_CONNECT_URL_TEMPLATE
  • K8S_ENVIRONMENT
  • SHARED_SPARK_MASTER_URL
  • SPARK_QUERY_TIMEOUT
  • SPARK_COLLECT_TIMEOUT

Security Notes

The service explicitly warns that it allows arbitrary 'read-oriented' queries to be executed against Delta Lake tables and that query results will be sent to the model host server (unless locally hosted). It also strictly warns against deploying to production without KBase leadership approval and advises against querying sensitive data. Positive security aspects include: - KBase authentication with role-based access control (`KBASE_REQUIRED_ROLES`). - Strong input validation for structured SELECT queries to prevent SQL injection (`_validate_identifier`, `_escape_value`). - A basic (but not comprehensive) `_check_query_is_valid` function to disallow common destructive SQL keywords (`drop`, `delete`, `insert`, `update`, `create`, `alter`, `merge`, `truncate`, `vacuum`), metacharacters (`;`, `/*`, `--`), and PostgreSQL system schemas (`pg_`, `information_schema`). - Dynamic loading of MinIO credentials from user home directories (`/home/{username}/.berdl_minio_credentials`), avoiding hardcoded secrets for data access. - Explicit validation in `src/main.py` requiring `POSTGRES_USER` to be `readonly_user`, indicating a read-only database connection for the Hive Metastore backend. - Use of LRUCache for KBase Auth tokens and Redis for query results, with user-scoped cache keys (`_generate_cache_key`) to prevent cross-user data leakage from the cache. Areas for caution/improvement: - The `_check_query_is_valid` function is not comprehensive and explicitly noted as such. Reliance on it for complete SQL injection prevention is not advised. The primary defense for arbitrary `read-oriented` queries is the read-only database user and S3 policies. - The core risk remains the 'arbitrary read-oriented queries' which, if misused by an AI or malicious actor, could expose data that the user *shouldn't* have access to, even if read-only, if S3 policies aren't perfectly aligned with the KBase Auth roles/groups logic. - The warning about sending query results to the model host server highlights a significant data privacy concern for users.

Similar Servers

Stats

Interest Score0
Security Score7
Cost ClassMedium
Avg Tokens1000
Stars0
Forks0
Last Update2026-01-17

Tags

FastAPIDelta LakeSparkMinIOAI AssistantsModel Context Protocol