Research: Refactor CLI Scripts to Web Application
1. Search Tool Architecture
Problem: The `search_script.py` script fetches metadata for all datasets and performs regex matching in memory. This can be resource-intensive and slow for large Superset instances.
Options:
- Synchronous API Endpoint: The frontend calls an API, waits, and displays results.
  - Pros: Simple, immediate feedback.
  - Cons: Risk of HTTP timeouts (e.g., Nginx/browser limits) if the dataset fetch takes too long.
- Asynchronous Task (TaskManager): The frontend triggers a task, polls for status, and displays results when done.
  - Pros: Robust, no timeouts, consistent with the "Mapping" and "Migration" tools.
  - Cons: Slower user experience for quick searches.
Decision: Synchronous API with Optimization.
- Rationale: Search is typically an interactive "read-only" operation, and users expect immediate results. The `superset_tool` client's `get_datasets` is reasonably efficient.
- Mitigation: We will implement the API to return a standard JSON response. If performance becomes an issue in testing, we can easily wrap the service logic in a TaskManager plugin. A sketch of the synchronous endpoint follows.
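A minimal sketch of the synchronous approach, assuming a FastAPI backend; the route path, the `get_superset_client` dependency, and the response shape are illustrative, not the actual module layout:

```python
import re

from fastapi import APIRouter, Depends, HTTPException


def get_superset_client():
    """Stub dependency; the real app would inject the superset_tool client here."""
    raise NotImplementedError


router = APIRouter(prefix="/api/tools/search")


@router.get("/datasets")
def search_datasets(pattern: str, client=Depends(get_superset_client)):
    """One bulk metadata fetch plus in-memory regex matching, as in search_script.py."""
    try:
        regex = re.compile(pattern, re.IGNORECASE)
    except re.error as exc:
        # A bad user-supplied regex is a client error, not a server failure.
        raise HTTPException(status_code=400, detail=f"Invalid regex: {exc}")

    datasets = client.get_datasets()  # single fetch of all dataset metadata
    matches = [ds for ds in datasets if regex.search(ds.get("table_name", ""))]
    return {"count": len(matches), "results": matches}
```

If testing shows this is too slow, the body of `search_datasets` is exactly the piece that would move into a TaskManager plugin.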
2. Dataset Mapper & Connection Management
Problem: `run_mapper.py` relies on command-line arguments and `keyring` for database credentials. The Web UI needs a way to store and reuse these credentials securely.
Options:
- Input Every Time: The user enters DB credentials for every mapping operation.
  - Pros: Secure (no storage).
  - Cons: Poor UX, tedious.
- Saved Connections: Store connection details (host, port, database, user, password) in the application database.
  - Pros: Good UX.
  - Cons: Security risk if not encrypted.
Decision: Saved Connections (Encrypted).
- Rationale: The spec explicitly requires: "Connection configurations must be saved for reuse".
- Implementation:
  - Create a new SQLAlchemy model `ConnectionConfig` in `backend/src/models/connection.py`.
  - Store passwords encrypted (or at least obfuscated if the full encryption infrastructure isn't ready, but ideally encrypted). Given the scope, we will store them in the existing SQLite database.
  - Refactor the Mapper logic into a `MapperPlugin` (or update the existing one) that accepts a `connection_id` or an explicit config. A sketch of the model appears below.
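A minimal sketch of the model, assuming SQLAlchemy plus the `cryptography` package's Fernet for symmetric encryption; the key handling and column names are assumptions, not settled design:

```python
import os

from cryptography.fernet import Fernet
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

# Illustrative key handling only: in practice the key must come from app
# config or a secrets store, never be generated ad hoc (that would make
# previously stored passwords undecryptable after a restart).
_fernet = Fernet(os.environ.get("CONNECTION_KEY") or Fernet.generate_key())


class ConnectionConfig(Base):
    __tablename__ = "connection_configs"

    id = Column(Integer, primary_key=True)
    name = Column(String, unique=True, nullable=False)
    host = Column(String, nullable=False)
    port = Column(Integer, nullable=False)
    database = Column(String, nullable=False)
    username = Column(String, nullable=False)
    password_encrypted = Column(String, nullable=False)

    def set_password(self, plaintext: str) -> None:
        """Encrypt before storing, so the SQLite file never holds plaintext."""
        self.password_encrypted = _fernet.encrypt(plaintext.encode()).decode()

    def get_password(self) -> str:
        """Decrypt on demand, e.g. when the MapperPlugin opens a DB connection."""
        return _fernet.decrypt(self.password_encrypted.encode()).decode()
```

Storing only `password_encrypted` keeps plaintext out of the SQLite file; the open question is where the Fernet key lives (app config, environment, or a secrets store).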
3. Debug Tools Integration
Problem: `debug_db_api.py` and `get_dataset_structure.py` are standalone scripts that print to stdout or write files.
Decision: Direct API Services.
- Debug API: Create an endpoint `POST /api/tools/debug/test-db-connection` that runs the logic from `debug_db_api.py` and returns the log/result as JSON.
- Dataset Structure: Create an endpoint `GET /api/tools/debug/dataset/{id}/structure` that runs the logic from `get_dataset_structure.py` and returns the JSON directly. A sketch of both routes follows.
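A rough sketch of both routes, again assuming FastAPI; `run_connection_test` and `fetch_dataset_structure` are hypothetical service functions standing in for the refactored script logic:

```python
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter(prefix="/api/tools/debug")


class DBConnectionRequest(BaseModel):
    host: str
    port: int
    database: str
    username: str
    password: str


def run_connection_test(req: DBConnectionRequest) -> dict:
    """Hypothetical service function: the logic lifted out of debug_db_api.py,
    returning collected log lines instead of printing to stdout."""
    raise NotImplementedError


def fetch_dataset_structure(dataset_id: int) -> dict:
    """Hypothetical service function: the logic from get_dataset_structure.py,
    returning a dict instead of writing a file."""
    raise NotImplementedError


@router.post("/test-db-connection")
def test_db_connection(req: DBConnectionRequest):
    # The captured log/result comes back as the JSON response body.
    return run_connection_test(req)


@router.get("/dataset/{id}/structure")
def dataset_structure(id: int):
    # The structure JSON is returned directly to the caller.
    return fetch_dataset_structure(id)
```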
4. Legacy Code Cleanup
Plan:
- Implement the new Web tools.
- Verify feature parity.
- Delete:
  - `search_script.py`
  - `run_mapper.py`
  - `debug_db_api.py`
  - `get_dataset_structure.py`
  - `backup_script.py` (the spec confirms it is superseded by `009-backup-scheduler`)
5. Security & Access
Decision: All authenticated users can access these tools.
- Rationale: Spec says "All authenticated users".
- Implementation: Use the existing `Depends(get_current_user)` dependency for all new routes, as in the sketch below.
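A short sketch of that wiring, assuming FastAPI's router-level dependencies; `get_current_user` is the existing dependency, stubbed here only so the snippet is self-contained:

```python
from fastapi import APIRouter, Depends


def get_current_user():
    """Stub; the real dependency already exists in the backend's auth module."""
    raise NotImplementedError


# Attaching the dependency at the router level guards every tool route at once,
# so individual endpoints can't accidentally ship without authentication.
router = APIRouter(
    prefix="/api/tools",
    dependencies=[Depends(get_current_user)],
)
```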