Machine Learning Tutorial
This tutorial provides a comprehensive guide to using the machine learning baselines in the neurological LRD analysis library.
Prerequisites
Before starting this tutorial, ensure you have the library installed:
pip install neurological-lrd-analysis
You’ll also need some additional dependencies for ML functionality:
pip install optuna joblib scikit-learn
Tutorial Overview
This tutorial covers:
Feature Extraction: Extracting 74+ features from time series data
ML Model Training: Training Random Forest, SVR, and Gradient Boosting models
Hyperparameter Optimization: Using Optuna for automated tuning
Pretrained Models: Creating and using pre-trained models
Fast Inference: Real-time prediction capabilities
Benchmarking: Comparing classical and ML methods
Step 1: Feature Extraction
The first step in using ML methods is to extract features from your time series data.
import numpy as np
from neurological_lrd_analysis import TimeSeriesFeatureExtractor, fbm_davies_harte
# Generate sample time series data
data = fbm_davies_harte(1000, 0.7, seed=42)
# Create feature extractor
extractor = TimeSeriesFeatureExtractor()
# Extract features
features = extractor.extract_features(data)
print(f"Extracted {len(features)} features")
# Display feature names and values
for name, value in features.items():
    print(f"{name}: {value:.4f}")
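To give a feel for what a feature extractor computes, the following NumPy-only sketch derives a handful of statistical and spectral features by hand. It is not the library's implementation: the helper name `sketch_features`, the feature names, and the choice of six features are assumptions made purely for illustration.

```python
import numpy as np

def sketch_features(x, fs=1.0):
    """Compute a few illustrative features (hypothetical helper, not library code)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    std = x.std()  # assumes a non-constant series, so std > 0
    # One-sided power spectrum of the mean-removed signal
    power = np.abs(np.fft.rfft(centered)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return {
        "mean": float(x.mean()),
        "variance": float(x.var()),
        "skewness": float(np.mean(centered ** 3) / std ** 3),
        "excess_kurtosis": float(np.mean(centered ** 4) / std ** 4 - 3.0),
        "autocorr_lag1": float(
            np.sum(centered[:-1] * centered[1:]) / np.sum(centered ** 2)
        ),
        "spectral_centroid": float(np.sum(freqs * power) / np.sum(power)),
    }

# White noise as a simple reference signal
rng = np.random.default_rng(0)
white_noise = rng.standard_normal(4096)
feats = sketch_features(white_noise)
for name, value in feats.items():
    print(f"{name}: {value:.4f}")
```

For white noise the lag-1 autocorrelation should be close to zero and the spectral centroid close to a quarter of the sampling rate, which makes this a convenient sanity check before moving on to long-range dependent signals.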
Feature Categories
The feature extractor provides features in several categories:
Statistical Features
- Basic statistics: mean, variance, skewness, kurtosis
- Distribution features: percentiles, quartiles, range
- Autocorrelation features: at various lags
Spectral Features
- Power spectral density features
- Spectral centroid, bandwidth, rolloff
- Frequency band power ratios (delta, theta, alpha, beta, gamma)
Wavelet Features
- Wavelet energy at multiple scales
- Wavelet entropy and complexity
- Multiresolution analysis
Fractal Features
- Detrended Fluctuation Analysis (DFA)
- Higuchi fractal dimension
- Generalized Hurst exponent
Biomedical Features
- EEG-specific features
- ECG-specific features
- Respiratory features
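To make the fractal category more concrete, here is a minimal sketch of Detrended Fluctuation Analysis with linear detrending, written with NumPy only. The function name `dfa_exponent` and the default scales are assumptions for illustration; the library's own DFA feature may use different scales, detrending orders, or window handling.

```python
import numpy as np

def dfa_exponent(x, scales=(16, 32, 64, 128, 256)):
    """Estimate the DFA scaling exponent with first-order (linear) detrending."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-removed series
    fluctuations = []
    for s in scales:
        n_windows = len(profile) // s
        # Non-overlapping windows of length s, one window per column
        windows = profile[: n_windows * s].reshape(n_windows, s).T
        t = np.arange(s)
        # Fit a straight line to every window at once
        coeffs = np.polyfit(t, windows, deg=1)       # shape (2, n_windows)
        trends = np.outer(t, coeffs[0]) + coeffs[1]  # shape (s, n_windows)
        residuals = windows - trends
        # Root-mean-square fluctuation at this scale
        fluctuations.append(np.sqrt(np.mean(residuals ** 2)))
    # The slope of log F(s) against log s is the scaling exponent
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), deg=1)
    return float(slope)

rng = np.random.default_rng(0)
noise = rng.standard_normal(8192)
alpha_noise = dfa_exponent(noise)
alpha_walk = dfa_exponent(np.cumsum(noise))
print(f"DFA exponent, white noise: {alpha_noise:.3f}")
print(f"DFA exponent, random walk: {alpha_walk:.3f}")
```

Uncorrelated noise should give an exponent of roughly 0.5 and its cumulative sum roughly 1.5. More generally, the exponent approximates H for a stationary increment series and H + 1 for the integrated path, so it matters which of the two is passed in.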
Step 2: Training ML Models
Now let’s train ML models using the extracted features.
from neurological_lrd_analysis import (
    RandomForestEstimator, SVREstimator, GradientBoostingEstimator
)
# Generate training data
X_train = []
y_train = []
for hurst in [0.3, 0.5, 0.7, 0.9]:
    for _ in range(10):  # 10 samples per Hurst value
        data = fbm_davies_harte(1000, hurst, seed=np.random.randint(0, 10000))
        features = extractor.extract_features(data)
        X_train.append(list(features.values()))
        y_train.append(hurst)
X_train = np.array(X_train)
y_train = np.array(y_train)
# Train Random Forest
rf_estimator = RandomForestEstimator()
rf_result = rf_estimator.train(X_train, y_train, validation_split=0.2)
print(f"Random Forest - Training score: {rf_result.training_score:.4f}")
print(f"Random Forest - Validation score: {rf_result.validation_score:.4f}")
# Train SVR
svr_estimator = SVREstimator()
svr_result = svr_estimator.train(X_train, y_train, validation_split=0.2)
print(f"SVR - Training score: {svr_result.training_score:.4f}")
print(f"SVR - Validation score: {svr_result.validation_score:.4f}")
# Train Gradient Boosting
gb_estimator = GradientBoostingEstimator()
gb_result = gb_estimator.train(X_train, y_train, validation_split=0.2)
print(f"Gradient Boosting - Training score: {gb_result.training_score:.4f}")
print(f"Gradient Boosting - Validation score: {gb_result.validation_score:.4f}")
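The sketch below shows what a 20% validation split and a training/validation score pair look like when written directly against scikit-learn. It assumes, without confirmation from the library, that the estimator wrappers behave like scikit-learn regressors and that the reported scores are R² values; the synthetic feature matrix is a stand-in, not real extractor output.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 200 samples, 5 features,
# with the target depending on the first feature only
X = rng.normal(size=(200, 5))
y = 0.6 + 0.1 * X[:, 0] + rng.normal(scale=0.01, size=200)

# Hold out 20% of the samples for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

# For scikit-learn regressors, .score() returns the coefficient of determination
train_score = model.score(X_fit, y_fit)
val_score = model.score(X_val, y_val)
print(f"Training R^2: {train_score:.4f}")
print(f"Validation R^2: {val_score:.4f}")
```

A training score well above the validation score is a sign of overfitting. Note that with the 40 samples generated above, a 0.2 split leaves only 8 samples for validation, so the validation score will be noisy; generating more samples per Hurst value gives a more reliable estimate.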
Step 3: Hyperparameter Optimization
Use Optuna to automatically find the best hyperparameters for your models.
from neurological_lrd_analysis import (
    OptunaOptimizer, create_optuna_study, optimize_hyperparameters
)
# Optimize Random Forest hyperparameters
print("Optimizing Random Forest hyperparameters...")
rf_study = create_optuna_study(
    model_type="random_forest",
    X_train=X_train,
    y_train=y_train,
    n_trials=50
)
print(f"Best Random Forest parameters: {rf_study.best_params}")
print(f"Best Random Forest score: {rf_study.best_value:.4f}")
# Optimize SVR hyperparameters
print("Optimizing SVR hyperparameters...")
svr_study = create_optuna_study(
    model_type="svr",
    X_train=X_train,
    y_train=y_train,
    n_trials=50
)
print(f"Best SVR parameters: {svr_study.best_params}")
print(f"Best SVR score: {svr_study.best_value:.4f}")
Step 4: Creating Pretrained Models
Create a comprehensive suite of pretrained models for fast inference.
from neurological_lrd_analysis import (
    create_pretrained_suite, PretrainedModelManager, TrainingConfig, MLBaselineType
)
# Create pretrained model suite
print("Creating pretrained model suite...")
manager = create_pretrained_suite(
    models_dir="pretrained_models",
    force_retrain=True
)
# List created models
models = manager.list_models()
print(f"Created {len(models)} pretrained models:")
for model in models:
    print(f" - {model.model_id}: {model.model_type.value}")
    # Format the score only when it is present; applying ':.4f' to the
    # 'N/A' fallback string would raise a ValueError
    score = model.performance_metrics.get('validation_score')
    score_text = f"{score:.4f}" if score is not None else "N/A"
    print(f" Validation score: {score_text}")
Step 5: Fast Inference
Use the pretrained models for fast inference on new data.
from neurological_lrd_analysis import (
    quick_predict, quick_ensemble_predict, PretrainedInference
)
# Generate test data
test_data = fbm_davies_harte(1000, 0.6, seed=123)
# Single model predictions
print("Single model predictions:")
hurst_rf = quick_predict(test_data, "pretrained_models", "random_forest")
print(f"Random Forest prediction: {hurst_rf:.4f}")
hurst_svr = quick_predict(test_data, "pretrained_models", "svr")
print(f"SVR prediction: {hurst_svr:.4f}")
# Ensemble prediction (typically the most accurate option)
print("Ensemble prediction:")
hurst_ensemble, uncertainty = quick_ensemble_predict(test_data, "pretrained_models")
print(f"Ensemble prediction: {hurst_ensemble:.4f} ± {uncertainty:.4f}")
# Batch prediction
print("Batch prediction:")
inference = PretrainedInference("pretrained_models")
test_data_list = [fbm_davies_harte(1000, h, seed=123+i) for h in [0.4, 0.6, 0.8] for i in range(3)]
predictions = inference.predict_batch(test_data_list)
print(f"Batch predictions: {predictions}")
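One simple way to turn several single-model predictions into an ensemble value with an uncertainty is to report the mean and the spread across models, as sketched below. This is an assumption about what such an ensemble could do, not a description of `quick_ensemble_predict`; the helper name `ensemble_estimate` and the per-model values are hypothetical.

```python
import numpy as np

def ensemble_estimate(predictions):
    """Combine per-model Hurst estimates into a mean and a spread."""
    values = np.asarray(list(predictions.values()), dtype=float)
    mean = float(values.mean())
    # Sample standard deviation across models, used here as the uncertainty
    spread = float(values.std(ddof=1)) if len(values) > 1 else 0.0
    return mean, spread

# Hypothetical per-model outputs for one time series
per_model = {"random_forest": 0.58, "svr": 0.62, "gradient_boosting": 0.60}
hurst, uncertainty = ensemble_estimate(per_model)
print(f"Ensemble prediction: {hurst:.4f} ± {uncertainty:.4f}")
```

A spread computed this way measures disagreement between models. It is not a calibrated confidence interval, since all models can be wrong in the same direction.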
Step 6: Comprehensive Benchmarking
Compare classical and ML methods using the comprehensive benchmark system.
from neurological_lrd_analysis import (
    ClassicalMLBenchmark, run_comprehensive_benchmark,
    BiomedicalHurstEstimatorFactory, EstimatorType
)
# Create test scenarios
test_scenarios = []
for hurst in [0.3, 0.5, 0.7, 0.9]:
    for length in [500, 1000, 2000]:
        data = fbm_davies_harte(length, hurst, seed=42)
        test_scenarios.append({
            'data': data,
            'true_hurst': hurst,
            'length': length,
            'scenario': f'fBm_H{hurst}_L{length}'
        })
# Create benchmark system
benchmark = ClassicalMLBenchmark(
    pretrained_models_dir="pretrained_models",
    classical_estimators=[EstimatorType.DFA, EstimatorType.RS_ANALYSIS, EstimatorType.HIGUCHI],
    ml_estimators=['random_forest', 'svr', 'ensemble']
)
# Run comprehensive benchmark
print("Running comprehensive benchmark...")
results = benchmark.run_comprehensive_benchmark(
    test_scenarios=test_scenarios,
    save_results=True
)
# Display results
print("\nBenchmark Results:")
print("=" * 60)
print(f"{'Method':<15} {'Type':<10} {'MAE':<8} {'RMSE':<8} {'Corr':<8} {'Time(ms)':<10}")
print("-" * 60)
for method_name, summary in results['summaries'].items():
    print(f"{method_name:<15} {summary.method_type:<10} "
          f"{summary.mean_absolute_error:<8.4f} {summary.root_mean_squared_error:<8.4f} "
          f"{summary.correlation:<8.4f} {summary.mean_computation_time*1000:<10.1f}")
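The columns in this table follow standard definitions, which the sketch below spells out with NumPy for a list of true and estimated Hurst values. The helper name `summarize_errors` and the example estimates are hypothetical, and the added bias term is an extra that the benchmark output above does not necessarily report.

```python
import numpy as np

def summarize_errors(true_hurst, estimated_hurst):
    """Compute MAE, RMSE, bias, and Pearson correlation for Hurst estimates."""
    true_hurst = np.asarray(true_hurst, dtype=float)
    estimated_hurst = np.asarray(estimated_hurst, dtype=float)
    errors = estimated_hurst - true_hurst
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "bias": float(np.mean(errors)),  # signed error: over- or underestimation
        "correlation": float(np.corrcoef(true_hurst, estimated_hurst)[0, 1]),
    }

# Hypothetical estimates for the four Hurst values used in the scenarios
summary = summarize_errors([0.3, 0.5, 0.7, 0.9], [0.35, 0.5, 0.65, 0.9])
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

RMSE penalizes large individual errors more heavily than MAE, so a method with a few poor estimates can rank differently on the two metrics.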
Step 7: Advanced Usage
Explore advanced features of the ML baselines system.
Feature Importance Analysis
# Get feature importance from trained models
rf_importance = rf_estimator.get_feature_importance()
print("Random Forest Feature Importance (top 10):")
for i, importance in enumerate(rf_importance[:10]):
    print(f" Feature {i}: {importance:.4f}")
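Slicing the first ten entries only yields the ten most important features if the returned values are already sorted. The sketch below ranks importances explicitly and pairs them with feature names. It assumes one importance value per feature column, in the same order as the extracted feature names; the source does not confirm this, and the names and values used here are made up.

```python
import numpy as np

def top_features(names, importances, k=10):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1][:k]  # indices, largest value first
    return [(names[i], float(importances[i])) for i in order]

# Hypothetical feature names and importances
names = ["mean", "variance", "dfa_alpha", "spectral_centroid"]
importances = [0.05, 0.10, 0.60, 0.25]
ranked = top_features(names, importances, k=2)
for name, value in ranked:
    print(f" {name}: {value:.4f}")
```

With real extractor output, the names could be taken from `list(features.keys())`, provided the training matrix was built from the same dictionary order.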
Cross-Validation Analysis
# Perform cross-validation analysis
cv_scores = rf_estimator.get_cv_scores()
print(f"Random Forest CV scores: {cv_scores}")
print(f"Mean CV score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
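As a point of comparison, k-fold cross-validation can be run directly with scikit-learn as sketched here. The five shuffled folds, the R² scoring, and the synthetic data are assumptions for illustration; the source does not say how `get_cv_scores` splits the data or which metric it reports.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = 0.6 + 0.1 * X[:, 0] + rng.normal(scale=0.01, size=150)

# Shuffle before splitting so that folds are not ordered by target value
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0)
# The default scoring for scikit-learn regressors is R^2
cv_scores = cross_val_score(model, X, y, cv=cv)

print(f"CV scores: {np.round(cv_scores, 4)}")
print(f"Mean CV score: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

Shuffling matters for training sets like the one built in Step 2, where samples are appended in order of Hurst value: unshuffled folds would hold out whole Hurst levels at a time and understate performance.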
Model Persistence
# Save trained models
rf_estimator.save_model("rf_model.pkl")
svr_estimator.save_model("svr_model.pkl")
# Load saved models
from neurological_lrd_analysis import RandomForestEstimator, SVREstimator
loaded_rf = RandomForestEstimator()
loaded_rf.load_model("rf_model.pkl")
loaded_svr = SVREstimator()
loaded_svr.load_model("svr_model.pkl")
# Use loaded models for prediction
test_features = extractor.extract_features(test_data)
test_features_array = np.array([list(test_features.values())])
rf_pred = loaded_rf.predict(test_features_array)
svr_pred = loaded_svr.predict(test_features_array)
print(f"Loaded RF prediction: {rf_pred[0]:.4f}")
print(f"Loaded SVR prediction: {svr_pred[0]:.4f}")
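The round trip above can be reproduced outside the library with joblib, which is one of the listed dependencies. Whether `save_model` uses joblib or another serializer internally is not stated in the source, so treat this as a sketch of the general pattern rather than of the library's file format.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 0.6 + 0.1 * X[:, 0]

model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rf_model.pkl")
    joblib.dump(model, path)      # serialize the fitted model
    restored = joblib.load(path)  # only load files from sources you trust

# A restored model should reproduce the original predictions exactly
same = np.array_equal(model.predict(X[:5]), restored.predict(X[:5]))
print(f"Predictions identical after reload: {same}")
```

Pickle-based files can execute arbitrary code when loaded and are tied to the library versions that wrote them, so it is worth recording the scikit-learn version alongside any saved model.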
Performance Analysis
import matplotlib.pyplot as plt
# Create performance comparison plot
methods = list(results['summaries'].keys())
mae_values = [results['summaries'][m].mean_absolute_error for m in methods]
time_values = [results['summaries'][m].mean_computation_time * 1000 for m in methods]
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.barh(methods, mae_values)
plt.xlabel('Mean Absolute Error')
plt.title('Performance Comparison (MAE)')
plt.subplot(1, 2, 2)
plt.barh(methods, time_values)
plt.xlabel('Computation Time (ms)')
plt.title('Speed Comparison')
plt.tight_layout()
plt.savefig('ml_benchmark_results.png', dpi=300, bbox_inches='tight')
plt.show()
Best Practices
Feature Engineering: Always use the comprehensive feature extractor for best results
Hyperparameter Optimization: Use Optuna for automated tuning
Model Selection: Ensemble methods typically provide the best accuracy
Validation: Hold out part of the data (for example validation_split=0.2, as in Step 2) and compare training and validation scores to spot overfitting
Persistence: Save trained models for reuse
Benchmarking: Compare ML methods with classical methods
Troubleshooting
Common Issues and Solutions
Import Errors
- Ensure all dependencies are installed: pip install optuna joblib scikit-learn
- Check that the library is properly installed: pip install neurological-lrd-analysis
Memory Issues
- Reduce the number of features or samples
- Use smaller hyperparameter search spaces
- Consider using fewer models in the ensemble
Performance Issues
- Use pretrained models for fast inference
- Consider using fewer features for real-time applications
- Optimize hyperparameters for your specific use case
Model Training Issues
- Ensure sufficient training data
- Check for data quality issues
- Use proper validation splits
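A quick pre-flight check can catch the most common data quality problems before feature extraction. The sketch below is a hypothetical helper, not part of the library, and the minimum length of 100 samples is an arbitrary illustrative threshold rather than a documented requirement.

```python
import numpy as np

def check_series(x, min_length=100):
    """Return a list of problems found; an empty list means none were found."""
    x = np.asarray(x, dtype=float)
    problems = []
    if x.ndim != 1:
        problems.append(f"expected a 1-D series, got shape {x.shape}")
    if x.size < min_length:
        problems.append(f"only {x.size} samples (threshold {min_length})")
    if not np.all(np.isfinite(x)):
        problems.append("contains NaN or infinite values")
    elif x.size > 0 and np.ptp(x) == 0:
        problems.append("series is constant")
    return problems

rng = np.random.default_rng(0)
print(check_series(rng.standard_normal(500)))  # a clean series
print(check_series(np.ones(200)))              # constant signal
print(check_series([1.0, np.nan]))             # too short and contains NaN
```

Running such a check on every series in a training set makes it easier to tell a modelling problem from a data problem when validation scores are poor.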
Next Steps
Explore the API Reference for detailed documentation
Check out the Benchmarking Guide for performance analysis
Try the Jupyter Notebooks for interactive examples
Contribute to the project on GitHub
For more information, see the complete documentation at https://neurological-lrd-analysis.readthedocs.io/.