mle · experimental
This skill is experimental. Recipes cover the ML engineering lifecycle but assume familiarity with PyTorch and Python packaging.
Context skill for the full ML engineering lifecycle: research, data pipelines, distributed training, model evaluation, observability, and model publishing.
Requirements
- Python 3.11+
uv— Python package and project manager (curl -LsSf https://astral.sh/uv/install.sh | sh)- CUDA toolkit — optional; required for GPU training recipes
Philosophy
ML engineering is a systems problem, not just a modeling problem. A model that trains but can't be reproduced, monitored, or deployed is an experiment, not an asset. These recipes treat the entire lifecycle as an engineering system: versioning, observability, regression prevention, and fault tolerance built in from the start.
Recipes
- Researching a New Problem Domain — literature review, dataset discovery, baseline establishment, problem framing
- Data Pipelines with Ray Data — distributed loading, transformation, feature engineering, streaming to training
- Storage Formats for ML — Parquet, Arrow, HDF5, LMDB; when to use each; dataset versioning
- Distributed Training with Ray Train — DDP with Ray, Lightning integration, fault tolerance
- Model Training and Evaluation — training loop patterns, validation strategy, metrics, early stopping
- Observability with TensorBoard and Lightning — logging metrics, gradients, images; comparing runs; profiling
- Experiment Tracking with MLflow — runs, model registry, regression prevention, CI integration
- Model Publishing — ONNX, TorchScript, MLflow registry, Ray Serve, FastAPI