Getting Started
This guide will help you get started with Collie, a lightweight MLOps framework.
Installation
Prerequisites
Python 3.10 or higher
MLflow server (for tracking and model registry)
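
To confirm the interpreter meets the Python prerequisite before installing, a small check like the following can help. This is a minimal sketch: the helper name `python_is_supported` is made up for illustration and is not part of Collie.

```python
import sys

# Minimum version taken from the prerequisites list above.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "supported" if python_is_supported() else "too old for Collie"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```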
Install Collie
Basic Installation (Core Framework Only)
pip install collie-mlops
This installs the core MLOps orchestration framework (~100MB).
Install with ML Frameworks
Choose the installation that fits your needs:
For Traditional ML (Tabular Data)
# Individual frameworks (quotes keep shells such as zsh from expanding the brackets)
pip install "collie-mlops[sklearn]" # scikit-learn
pip install "collie-mlops[xgboost]" # XGBoost
pip install "collie-mlops[lightgbm]" # LightGBM
# Or install all tabular ML frameworks (~250MB)
pip install "collie-mlops[tabular]"
For Deep Learning
# PyTorch ecosystem (includes Transformers) (~3GB)
pip install "collie-mlops[pytorch]"
# Or use the alias
pip install "collie-mlops[deep-learning]"
For Complete Installation
# All frameworks (~3.5GB)
pip install "collie-mlops[all]"
For Development
# All frameworks + dev tools
pip install "collie-mlops[dev,all]"
Note
To keep your environment lightweight, install only what you need. The core framework works independently, and you can add specific ML frameworks as needed.
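
To see which optional frameworks are already present in an environment, you can probe for their import names without importing them. The sketch below assumes each extra provides the usual top-level module (`sklearn`, `xgboost`, `lightgbm`, and `torch` for the `pytorch` extra). That mapping is an assumption based on the libraries' common import names, not something Collie defines, and `installed_extras` is a hypothetical helper.

```python
from importlib.util import find_spec

# Extra name -> top-level module it is assumed to provide.
# The extra names come from the install commands above; the module names
# are the libraries' usual import names (an assumption, not a Collie API).
EXTRA_MODULES = {
    "sklearn": "sklearn",
    "xgboost": "xgboost",
    "lightgbm": "lightgbm",
    "pytorch": "torch",
}


def installed_extras(modules=None):
    """Map each extra name to True if its module can be found, else False."""
    modules = EXTRA_MODULES if modules is None else modules
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, present in installed_extras().items():
        print(f"{extra:10s} {'installed' if present else 'missing'}")
```

`find_spec` only locates the module, so the check stays fast even for large packages such as PyTorch.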
Setting Up MLflow
Collie requires a running MLflow server. Start one locally:
# Start MLflow server (0.0.0.0 listens on all network interfaces; use 127.0.0.1 for local-only access)
mlflow server --host 0.0.0.0 --port 5000
Or use an existing MLflow tracking server.
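
Before running a pipeline, it can be useful to confirm that the tracking server is reachable. The sketch below reads the server address from MLflow's `MLFLOW_TRACKING_URI` environment variable, falls back to the local address used in this guide, and probes the `/health` endpoint that the MLflow tracking server exposes. The helper names are hypothetical, and whether Collie itself reads that environment variable is not covered here, so pass the resolved value explicitly as `tracking_uri`.

```python
import os
import urllib.request

# Local address used throughout this guide.
DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env=None):
    """Return MLFLOW_TRACKING_URI if set, else the local default (no trailing slash)."""
    env = os.environ if env is None else env
    return env.get("MLFLOW_TRACKING_URI", DEFAULT_TRACKING_URI).rstrip("/")


def server_is_healthy(tracking_uri, timeout=3.0):
    """Return True if GET <tracking_uri>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{tracking_uri}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers malformed URLs.
        return False
```

For example, calling `server_is_healthy(resolve_tracking_uri())` before `orchestrator.run()` gives an early, readable failure when the server is down instead of an error from the middle of the pipeline.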
Your First Pipeline
Let’s create a simple ML pipeline using the Iris dataset.
Step 1: Create a Transformer
The Transformer handles data preprocessing:
from collie import Event, Transformer, TransformerPayload
from sklearn.datasets import load_iris
import pandas as pd


class IrisTransformer(Transformer):
    def handle(self, event):
        # Load data
        data = load_iris()
        df = pd.DataFrame(data.data, columns=data.feature_names)
        df['target'] = data.target

        # Log data statistics using MLflow
        self.mlflow.log_params({
            "n_samples": len(df),
            "n_features": len(data.feature_names)
        })

        # Log the input dataset
        self.log_pd_data(
            data=df,
            context="training",
            source="sklearn.datasets.load_iris"
        )

        # Return Event with TransformerPayload
        payload = TransformerPayload(
            train_data=df,
            extra_data={
                "feature_names": list(data.feature_names),
                "target_col": "target"
            }
        )
        return Event(payload=payload)
Step 2: Create a Trainer
The Trainer handles model training:
from collie import Event, Trainer, TrainerPayload
from sklearn.ensemble import RandomForestClassifier


class IrisTrainer(Trainer):
    def handle(self, event):
        # Get data from transformer
        df = event.payload.train_data
        feature_names = event.payload.extra_data.get("feature_names", [])
        target_col = event.payload.extra_data.get("target_col", "target")

        # Prepare training data
        X = df[feature_names]
        y = df[target_col]

        # Define hyperparameters
        params = {
            "n_estimators": 100,
            "max_depth": 10,
            "random_state": 42
        }

        # Log hyperparameters using MLflow
        self.mlflow.log_params(params)

        # Train the model
        model = RandomForestClassifier(**params)
        model.fit(X, y)

        # Calculate and log training accuracy
        train_accuracy = model.score(X, y)
        self.mlflow.log_metric("train_accuracy", train_accuracy)

        # Log feature importance
        importance = dict(zip(feature_names, model.feature_importances_))
        self.mlflow.log_dict(importance, "feature_importance.json")

        # Return Event with TrainerPayload
        payload = TrainerPayload(model=model)
        return Event(payload=payload)
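
The trainer falls back to an empty `feature_names` list when the transformer does not supply one, which would only fail later inside `model.fit`. A small guard can surface the problem earlier with a clearer message. The sketch below is a hypothetical helper, not part of Collie; it works on plain column names, so it can be called as `check_columns(df.columns, feature_names, target_col)` at the top of `IrisTrainer.handle`.

```python
def check_columns(columns, feature_names, target_col):
    """Raise ValueError if the payload metadata does not match the data columns."""
    if not feature_names:
        raise ValueError("feature_names is empty; did the transformer set extra_data?")

    if target_col in feature_names:
        raise ValueError(f"target column {target_col!r} must not be used as a feature")

    # Every feature and the target must exist in the training data.
    available = set(columns)
    missing = [name for name in [*feature_names, target_col] if name not in available]
    if missing:
        raise ValueError(f"columns missing from train_data: {missing}")
```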
Step 3: Create an Orchestrator
The Orchestrator coordinates all components:
from collie import Orchestrator
# Create orchestrator with your components
orchestrator = Orchestrator(
    components=[
        IrisTransformer(),
        IrisTrainer()
    ],
    tracking_uri="http://localhost:5000",
    registered_model_name="iris_classifier",
    experiment_name="iris_experiment"
)
# Run the pipeline
orchestrator.run()
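
Conceptually, each component receives the Event returned by the previous one, which is how the trainer above reads the transformer's `train_data`. The following self-contained sketch illustrates that hand-off with stand-in classes. It is not Collie's implementation: `SketchEvent`, `run_chain`, and the two example components are hypothetical names, and MLflow tracking and model registration are left out entirely.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchEvent:
    """Stand-in for an event that carries a payload between components."""
    payload: Any = None


class LoadNumbers:
    """Plays the role of a transformer: produces data for the next step."""
    def handle(self, event):
        return SketchEvent(payload={"values": [1.0, 2.0, 3.0]})


class ComputeMean:
    """Plays the role of a trainer: consumes the previous payload."""
    def handle(self, event):
        values = event.payload["values"]
        return SketchEvent(payload={"mean": sum(values) / len(values)})


def run_chain(components, event=None):
    """Call each component in order, feeding it the previous component's event."""
    event = SketchEvent() if event is None else event
    for component in components:
        event = component.handle(event)
    return event


if __name__ == "__main__":
    result = run_chain([LoadNumbers(), ComputeMean()])
    print(result.payload)  # {'mean': 2.0}
```

Because every `handle` method has the same shape (event in, event out), components can be added or reordered without changing the ones around them, as long as each finds the payload fields it expects.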
Step 4: View Results
Open your MLflow UI to see the results:
# MLflow UI should be available at
http://localhost:5000
You’ll see:
Logged parameters (n_samples, n_features, hyperparameters)
Logged metrics (train_accuracy)
Logged artifacts (feature_importance.json)
Registered model (iris_classifier), if your pipeline includes a Pusher component
Next Steps
Now that you have a basic pipeline running, you can:
Add Evaluation - Create an Evaluator to assess model performance
Add Tuning - Create a Tuner for hyperparameter optimization
Add Deployment - Create a Pusher to deploy models
See the Core Concepts guide for more details.