Machine Learning Tutorial

This tutorial provides a comprehensive guide to using the machine learning baselines in the neurological LRD analysis library.

Prerequisites

Before starting this tutorial, ensure you have the library installed:

pip install neurological-lrd-analysis

You’ll also need some additional dependencies for ML functionality:

pip install optuna joblib scikit-learn

Tutorial Overview

This tutorial covers:

Feature Extraction: Extracting 74+ features from time series data
ML Model Training: Training Random Forest, SVR, and Gradient Boosting models
Hyperparameter Optimization: Using Optuna for automated tuning
Pretrained Models: Creating and using pretrained models
Fast Inference: Real-time prediction capabilities
Benchmarking: Comparing classical and ML methods

Step 1: Feature Extraction

The first step in using ML methods is to extract features from your time series data.

import numpy as np
from neurological_lrd_analysis import TimeSeriesFeatureExtractor, fbm_davies_harte

# Generate sample time series data
data = fbm_davies_harte(1000, 0.7, seed=42)

# Create feature extractor
extractor = TimeSeriesFeatureExtractor()

# Extract features
features = extractor.extract_features(data)
print(f"Extracted {len(features)} features")

# Display feature names and values
for name, value in features.items():
    print(f"{name}: {value:.4f}")
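Step 2 builds each training row with list(features.values()), which only works if every call returns the keys in the same order. A minimal sketch of a more defensive conversion follows, assuming extract_features returns a plain dict of feature name to float (as the loop above suggests). The helper features_to_vector is hypothetical and not part of the library.

```python
import math

def features_to_vector(features, feature_names=None):
    """Return (names, values) with a deterministic column order."""
    # Sort the keys unless the caller supplies the training-time order.
    names = sorted(features) if feature_names is None else list(feature_names)
    values = []
    for name in names:
        value = float(features.get(name, math.nan))
        # Replace missing, NaN, or infinite entries so models get finite inputs.
        values.append(value if math.isfinite(value) else 0.0)
    return names, values

# Hypothetical feature dict, for illustration only
example = {"variance": 2.5, "mean": 0.1, "skewness": float("nan")}
names, vector = features_to_vector(example)
print(names)   # ['mean', 'skewness', 'variance']
print(vector)  # [0.1, 0.0, 2.5]
```

Passing the same feature_names list at training and prediction time keeps the columns aligned even if the extractor adds features in a later release. Replacing non-finite values with 0.0 is one simple choice; imputing a training-set mean is another.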

Feature Categories

The feature extractor provides features in several categories:

Statistical Features
- Basic statistics: mean, variance, skewness, kurtosis
- Distribution features: percentiles, quartiles, range
- Autocorrelation features: at various lags

Spectral Features
- Power spectral density features
- Spectral centroid, bandwidth, rolloff
- Frequency band power ratios (delta, theta, alpha, beta, gamma)

Wavelet Features
- Wavelet energy at multiple scales
- Wavelet entropy and complexity
- Multiresolution analysis

Fractal Features
- Detrended Fluctuation Analysis (DFA)
- Higuchi fractal dimension
- Generalized Hurst exponent

Biomedical Features
- EEG-specific features
- ECG-specific features
- Respiratory features
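To make the statistical category concrete, here is a small standard-library sketch of how a few of those quantities can be computed. It is an illustration of the definitions, not the extractor's actual implementation; the function names are hypothetical.

```python
import math
import random

def basic_statistics(x):
    """Mean, population variance, skewness, and excess kurtosis."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    variance = sum(c ** 2 for c in centered) / n
    std = math.sqrt(variance)
    skewness = sum(c ** 3 for c in centered) / (n * std ** 3)
    kurtosis = sum(c ** 4 for c in centered) / (n * std ** 4) - 3.0
    return {"mean": mean, "variance": variance,
            "skewness": skewness, "kurtosis": kurtosis}

def autocorrelation(x, lag):
    """Sample autocorrelation at a non-negative lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return num / denom

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Integrating white noise gives a random walk, which is strongly persistent.
walk, total = [], 0.0
for step in noise:
    total += step
    walk.append(total)

print(basic_statistics(noise))
for lag in (1, 10, 50):
    print(lag, round(autocorrelation(noise, lag), 3),
          round(autocorrelation(walk, lag), 3))
```

White noise shows autocorrelation near zero at every positive lag, while the random walk stays close to one. Slowly decaying autocorrelation is the kind of signature that long-range dependence features try to capture.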

Step 2: Training ML Models

Now let’s train ML models using the extracted features.

from neurological_lrd_analysis import (
    RandomForestEstimator, SVREstimator, GradientBoostingEstimator
)

# Generate training data
X_train = []
y_train = []
rng = np.random.default_rng(0)  # seeded so the training set is reproducible

for hurst in [0.3, 0.5, 0.7, 0.9]:
    for _ in range(10):  # 10 samples per Hurst value; use more for real work
        data = fbm_davies_harte(1000, hurst, seed=int(rng.integers(0, 10000)))
        features = extractor.extract_features(data)
        X_train.append(list(features.values()))
        y_train.append(hurst)

X_train = np.array(X_train)
y_train = np.array(y_train)

# Train Random Forest
rf_estimator = RandomForestEstimator()
rf_result = rf_estimator.train(X_train, y_train, validation_split=0.2)
print(f"Random Forest - Training score: {rf_result.training_score:.4f}")
print(f"Random Forest - Validation score: {rf_result.validation_score:.4f}")

# Train SVR
svr_estimator = SVREstimator()
svr_result = svr_estimator.train(X_train, y_train, validation_split=0.2)
print(f"SVR - Training score: {svr_result.training_score:.4f}")
print(f"SVR - Validation score: {svr_result.validation_score:.4f}")

# Train Gradient Boosting
gb_estimator = GradientBoostingEstimator()
gb_result = gb_estimator.train(X_train, y_train, validation_split=0.2)
print(f"Gradient Boosting - Training score: {gb_result.training_score:.4f}")
print(f"Gradient Boosting - Validation score: {gb_result.validation_score:.4f}")
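The tutorial does not say how validation_split selects samples or which metric training_score and validation_score report. As a rough sketch under the assumption of a shuffled holdout scored with R², the idea looks like this; holdout_split and r2_score are hypothetical helpers written for illustration.

```python
import random

def holdout_split(n_samples, validation_split=0.2, seed=0):
    """Shuffle sample indices and reserve a fraction for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_split))
    return indices[n_val:], indices[:n_val]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Same target layout as the training loop above: 4 Hurst values x 10 samples.
y = [h for h in (0.3, 0.5, 0.7, 0.9) for _ in range(10)]
train_idx, val_idx = holdout_split(len(y), validation_split=0.2, seed=0)
print(len(train_idx), len(val_idx))  # 32 8

# Reference points for reading a score:
y_true = [0.3, 0.5, 0.7, 0.9]
print(r2_score(y_true, y_true))       # perfect predictions
print(r2_score(y_true, [0.6] * 4))    # always predicting the mean
```

With 40 samples, a 20% split leaves only 8 validation points, so the validation score will be noisy. A training score far above the validation score is the usual sign of overfitting.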

Step 3: Hyperparameter Optimization

Use Optuna to automatically find the best hyperparameters for your models.

from neurological_lrd_analysis import (
    OptunaOptimizer, create_optuna_study, optimize_hyperparameters
)

# Optimize Random Forest hyperparameters
print("Optimizing Random Forest hyperparameters...")
rf_study = create_optuna_study(
    model_type="random_forest",
    X_train=X_train,
    y_train=y_train,
    n_trials=50
)

print(f"Best Random Forest parameters: {rf_study.best_params}")
print(f"Best Random Forest score: {rf_study.best_value:.4f}")

# Optimize SVR hyperparameters
print("Optimizing SVR hyperparameters...")
svr_study = create_optuna_study(
    model_type="svr",
    X_train=X_train,
    y_train=y_train,
    n_trials=50
)

print(f"Best SVR parameters: {svr_study.best_params}")
print(f"Best SVR score: {svr_study.best_value:.4f}")

Step 4: Creating Pretrained Models

Create a comprehensive suite of pretrained models for fast inference.

from neurological_lrd_analysis import (
    create_pretrained_suite, PretrainedModelManager, TrainingConfig, MLBaselineType
)

# Create pretrained model suite
print("Creating pretrained model suite...")
manager = create_pretrained_suite(
    models_dir="pretrained_models",
    force_retrain=True
)

# List created models
models = manager.list_models()
print(f"Created {len(models)} pretrained models:")
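for model in models:
    print(f"  - {model.model_id}: {model.model_type.value}")
    # Format the score only when it exists; 'N/A' cannot be formatted with :.4f
    score = model.performance_metrics.get('validation_score')
    score_text = f"{score:.4f}" if score is not None else "N/A"
    print(f"    Validation score: {score_text}")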

Step 5: Fast Inference

Use the pretrained models for fast inference on new data.

from neurological_lrd_analysis import (
    quick_predict, quick_ensemble_predict, PretrainedInference
)

# Generate test data
test_data = fbm_davies_harte(1000, 0.6, seed=123)

# Single model predictions
print("Single model predictions:")
hurst_rf = quick_predict(test_data, "pretrained_models", "random_forest")
print(f"Random Forest prediction: {hurst_rf:.4f}")

hurst_svr = quick_predict(test_data, "pretrained_models", "svr")
print(f"SVR prediction: {hurst_svr:.4f}")

# Ensemble prediction (typically the most accurate option)
print("Ensemble prediction:")
hurst_ensemble, uncertainty = quick_ensemble_predict(test_data, "pretrained_models")
print(f"Ensemble prediction: {hurst_ensemble:.4f} ± {uncertainty:.4f}")

# Batch prediction
print("Batch prediction:")
inference = PretrainedInference("pretrained_models")
test_data_list = [
    fbm_davies_harte(1000, h, seed=123 + i)
    for h in [0.4, 0.6, 0.8]
    for i in range(3)
]
predictions = inference.predict_batch(test_data_list)
print(f"Batch predictions: {predictions}")
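The tutorial does not state how quick_ensemble_predict derives its uncertainty value. One common convention, assumed here purely for illustration, is to report the mean of the individual model estimates together with their standard deviation. The function and the numbers below are hypothetical.

```python
import statistics

def ensemble_estimate(predictions):
    """Combine per-model estimates into a mean and a spread."""
    values = list(predictions.values())
    mean = statistics.fmean(values)
    # Sample standard deviation across models; needs at least two models.
    spread = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, spread

# Hypothetical per-model Hurst estimates for one series
predictions = {"random_forest": 0.61, "svr": 0.58, "gradient_boosting": 0.64}
hurst, spread = ensemble_estimate(predictions)
print(f"Ensemble estimate: {hurst:.4f} ± {spread:.4f}")  # 0.6100 ± 0.0300
```

A spread computed this way measures disagreement between models. It is not a calibrated confidence interval: models that share the same bias can agree closely and still be wrong.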

Step 6: Comprehensive Benchmarking

Compare classical and ML methods using the comprehensive benchmark system.

from neurological_lrd_analysis import (
    ClassicalMLBenchmark, run_comprehensive_benchmark,
    BiomedicalHurstEstimatorFactory, EstimatorType
)

# Create test scenarios
test_scenarios = []
for hurst in [0.3, 0.5, 0.7, 0.9]:
    for length in [500, 1000, 2000]:
        data = fbm_davies_harte(length, hurst, seed=42)
        test_scenarios.append({
            'data': data,
            'true_hurst': hurst,
            'length': length,
            'scenario': f'fBm_H{hurst}_L{length}'
        })

# Create benchmark system
benchmark = ClassicalMLBenchmark(
    pretrained_models_dir="pretrained_models",
    classical_estimators=[EstimatorType.DFA, EstimatorType.RS_ANALYSIS, EstimatorType.HIGUCHI],
    ml_estimators=['random_forest', 'svr', 'ensemble']
)

# Run comprehensive benchmark
print("Running comprehensive benchmark...")
results = benchmark.run_comprehensive_benchmark(
    test_scenarios=test_scenarios,
    save_results=True
)

# Display results
print("\nBenchmark Results:")
print("=" * 60)
print(f"{'Method':<15} {'Type':<10} {'MAE':<8} {'RMSE':<8} {'Corr':<8} {'Time(ms)':<10}")
print("-" * 60)

for method_name, summary in results['summaries'].items():
    print(f"{method_name:<15} {summary.method_type:<10} "
          f"{summary.mean_absolute_error:<8.4f} {summary.root_mean_squared_error:<8.4f} "
          f"{summary.correlation:<8.4f} {summary.mean_computation_time*1000:<10.1f}")
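The MAE, RMSE, and correlation columns follow their standard definitions. The sketch below computes them from true and estimated Hurst exponents using only the standard library; error_summary is a hypothetical helper and the estimates are made-up numbers, not benchmark output.

```python
import math

def error_summary(true_values, estimates):
    """Mean absolute error, root mean squared error, and Pearson correlation."""
    n = len(true_values)
    errors = [e - t for t, e in zip(true_values, estimates)]
    mae = sum(abs(err) for err in errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)
    mean_t = sum(true_values) / n
    mean_e = sum(estimates) / n
    cov = sum((t - mean_t) * (e - mean_e)
              for t, e in zip(true_values, estimates))
    var_t = sum((t - mean_t) ** 2 for t in true_values)
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    return {"mae": mae, "rmse": rmse,
            "correlation": cov / math.sqrt(var_t * var_e)}

true_hurst = [0.3, 0.5, 0.7, 0.9]
estimated = [0.35, 0.45, 0.75, 0.85]      # hypothetical estimates
summary = error_summary(true_hurst, estimated)
print({k: round(v, 4) for k, v in summary.items()})

# A constant bias leaves the correlation at 1 but still shows up in MAE.
biased = [h + 0.1 for h in true_hurst]
print({k: round(v, 4) for k, v in error_summary(true_hurst, biased).items()})
```

Reading the columns together matters: correlation alone can hide a systematic offset, and MAE alone can hide whether estimates track the true value at all.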

Step 7: Advanced Usage

Explore advanced features of the ML baselines system.

Feature Importance Analysis

# Get feature importance from trained models
rf_importance = np.asarray(rf_estimator.get_feature_importance())
print("Random Forest Feature Importance (top 10):")
# Sort indices by importance, highest first, so the output really is the top 10
for i in np.argsort(rf_importance)[::-1][:10]:
    print(f"  Feature {i}: {rf_importance[i]:.4f}")
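Feature indices are hard to interpret on their own. A small sketch of pairing importances with feature names and ranking them follows, assuming the importance values are in the same column order as the feature names used for training (for example, the keys of the dict from Step 1). The helper and the sample values are hypothetical.

```python
def rank_features(feature_names, importances, top_k=10):
    """Pair names with importances and return the top_k, highest first."""
    if len(feature_names) != len(importances):
        raise ValueError("feature_names and importances must have the same length")
    ranked = sorted(zip(feature_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# Hypothetical names and importance values, for illustration only
names = ["mean", "variance", "dfa_alpha", "spectral_centroid"]
importances = [0.05, 0.10, 0.60, 0.25]
for name, value in rank_features(names, importances, top_k=3):
    print(f"  {name}: {value:.4f}")
```

The length check catches the most common mistake, which is ranking importances against a feature list from a different extractor configuration.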

Cross-Validation Analysis

# Perform cross-validation analysis
cv_scores = rf_estimator.get_cv_scores()
print(f"Random Forest CV scores: {cv_scores}")
print(f"Mean CV score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
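How get_cv_scores partitions the data is not documented in this tutorial. As an illustration of the general technique, the sketch below builds shuffled k-fold index sets and summarizes a list of per-fold scores the same way the snippet above does (population standard deviation, matching the default of np.std). The fold scores are hypothetical.

```python
import random
import statistics

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, validation_indices) for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        validation = indices[fold::n_folds]  # every n_folds-th shuffled index
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation

def summarize(scores):
    """Mean and population standard deviation of per-fold scores."""
    return statistics.fmean(scores), statistics.pstdev(scores)

folds = list(k_fold_indices(40, n_folds=5))
for train, validation in folds:
    print(len(train), len(validation))  # 32 8 for each fold

# Hypothetical per-fold scores
mean, std = summarize([0.91, 0.88, 0.93, 0.90, 0.89])
print(f"Mean CV score: {mean:.4f} ± {std:.4f}")
```

Every sample lands in exactly one validation fold, so each observation is used for validation once and for training in the remaining folds.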

Model Persistence

# Save trained models
rf_estimator.save_model("rf_model.pkl")
svr_estimator.save_model("svr_model.pkl")

# Load saved models
from neurological_lrd_analysis import RandomForestEstimator, SVREstimator

loaded_rf = RandomForestEstimator()
loaded_rf.load_model("rf_model.pkl")

loaded_svr = SVREstimator()
loaded_svr.load_model("svr_model.pkl")

# Use loaded models for prediction
test_features = extractor.extract_features(test_data)
test_features_array = np.array([list(test_features.values())])

rf_pred = loaded_rf.predict(test_features_array)
svr_pred = loaded_svr.predict(test_features_array)

print(f"Loaded RF prediction: {rf_pred[0]:.4f}")
print(f"Loaded SVR prediction: {svr_pred[0]:.4f}")
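A saved model is only useful if new inputs arrive in the column order it was trained on. The tutorial does not describe the on-disk format used by save_model, so the sketch below is a generic pattern built on the standard pickle module: store the feature names next to the model and check them on load. The stand-in model and both helper functions are hypothetical.

```python
import os
import pickle
import tempfile

def save_bundle(path, model, feature_names):
    """Store the model together with the column order it was trained on."""
    with open(path, "wb") as handle:
        pickle.dump({"model": model, "feature_names": list(feature_names)}, handle)

def load_bundle(path, expected_feature_names):
    """Load a bundle and refuse to continue if the feature layout changed."""
    with open(path, "rb") as handle:
        bundle = pickle.load(handle)  # only unpickle files you trust
    if bundle["feature_names"] != list(expected_feature_names):
        raise ValueError("feature layout differs from the one used in training")
    return bundle["model"]

# Stand-in "model": weights of a linear predictor (hypothetical)
model = {"weights": [0.2, 0.8], "bias": 0.1}
names = ["dfa_alpha", "spectral_slope"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_bundle.pkl")
    save_bundle(path, model, names)
    restored = load_bundle(path, names)

print(restored == model)  # True
```

Pickle files can execute arbitrary code when loaded, so load model files only from sources you control. Pickled scikit-learn models are also not guaranteed to load across library versions, which makes recording the versions used for training worthwhile.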

Performance Analysis

import matplotlib.pyplot as plt

# Create performance comparison plot
methods = list(results['summaries'].keys())
mae_values = [results['summaries'][m].mean_absolute_error for m in methods]
time_values = [results['summaries'][m].mean_computation_time * 1000 for m in methods]

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.barh(methods, mae_values)
plt.xlabel('Mean Absolute Error')
plt.title('Performance Comparison (MAE)')

plt.subplot(1, 2, 2)
plt.barh(methods, time_values)
plt.xlabel('Computation Time (ms)')
plt.title('Speed Comparison')

plt.tight_layout()
plt.savefig('ml_benchmark_results.png', dpi=300, bbox_inches='tight')
plt.show()

Best Practices

Feature Engineering: Start from the comprehensive feature extractor, which covers the statistical, spectral, wavelet, fractal, and biomedical categories
Hyperparameter Optimization: Use Optuna for automated tuning
Model Selection: Ensemble methods typically provide the best accuracy
Validation: Hold out a validation split (the examples use validation_split=0.2) and judge models by the validation score, not the training score
Persistence: Save trained models for reuse
Benchmarking: Compare ML methods with classical methods

Troubleshooting

Common Issues and Solutions

Import Errors
- Ensure all dependencies are installed: pip install optuna joblib scikit-learn
- Check that the library is properly installed: pip install neurological-lrd-analysis

Memory Issues
- Reduce the number of features or samples
- Use smaller hyperparameter search spaces
- Consider using fewer models in the ensemble

Performance Issues
- Use pretrained models for fast inference
- Consider using fewer features for real-time applications
- Optimize hyperparameters for your specific use case

Model Training Issues
- Ensure sufficient training data
- Check for data quality issues
- Use proper validation splits
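For the data quality checks mentioned under Model Training Issues, a small pre-flight function can catch the most common problems before feature extraction. This is a sketch using only the standard library; check_series is a hypothetical helper and the min_length default of 100 is an arbitrary assumption, not a library requirement.

```python
import math

def check_series(series, min_length=100):
    """Return a list of problems; an empty list means the series looks usable."""
    problems = []
    if len(series) < min_length:
        problems.append(f"too short: {len(series)} < {min_length} samples")
    n_bad = sum(1 for v in series if not math.isfinite(v))
    if n_bad:
        problems.append(f"{n_bad} non-finite values (NaN or inf)")
    finite = [v for v in series if math.isfinite(v)]
    if finite and max(finite) == min(finite):
        problems.append("constant series: no variance to analyse")
    return problems

print(check_series([0.0] * 200))
# ['constant series: no variance to analyse']
print(check_series([1.0, float("nan"), 2.0]))
# ['too short: 3 < 100 samples', '1 non-finite values (NaN or inf)']
```

Returning a list of messages instead of raising on the first problem makes it easy to log every issue for a batch of recordings and skip only the affected ones.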

Next Steps

Explore the API Reference for detailed documentation
Check out the Benchmarking Guide for performance analysis
Try the Jupyter Notebooks for interactive examples
Contribute to the project on GitHub

For more information, see the complete documentation at https://neurological-lrd-analysis.readthedocs.io/.