Enhanced Benchmarking Guide
Overview
The Neurological LRD Analysis library includes comprehensive benchmarking capabilities with detailed statistical reporting. This guide explains how to use the enhanced benchmarking system to evaluate estimator performance with bias, error, confidence intervals, and p-value analysis.
Enhanced Statistical Metrics
The benchmarking system reports comprehensive statistical metrics for each estimator:
Core Metrics
Bias: Systematic error (estimate - true_value)
Absolute Error: |estimate - true_value|
Relative Error: Absolute error / true_value
Root Mean Square Error (RMSE): √(mean(bias²))
Standard Error: Standard deviation of estimates
Confidence Interval: Statistical confidence bounds
P-value: Statistical significance of regression fit
Performance Metrics
Success Rate: Percentage of successful estimations
Computation Time: Average time per estimation
Convergence Flag: Whether the method converged
Significance Rate: Percentage of statistically significant estimates
Usage Examples
Basic Enhanced Benchmarking
Using the Command Line Script
# Run benchmark with default settings
python scripts/run_benchmark.py
# Customize benchmark parameters
python scripts/run_benchmark.py --hurst-values 0.3,0.5,0.7,0.8 --lengths 512,1024,2048 --output-dir ./my_results
# Quick test with fewer samples
python scripts/run_benchmark.py --hurst-values 0.5,0.7 --lengths 512 --bootstrap 50
Using the Python API
from neurological_lrd_analysis.benchmark_core.generation import generate_grid
from neurological_lrd_analysis.benchmark_core.runner import BenchmarkConfig, run_benchmark_on_dataset, analyze_benchmark_results, create_leaderboard
# Generate test datasets
datasets = generate_grid(
hurst_values=[0.3, 0.5, 0.7, 0.8],
lengths=[512, 1024, 2048],
contaminations=['none', 'noise'],
contamination_level=0.1,
seed=42
)
# Configure benchmark
config = BenchmarkConfig(
output_dir='./benchmark_results',
n_bootstrap=100,
confidence_level=0.95,
save_results=True,
verbose=False
)
# Run benchmark
results = run_benchmark_on_dataset(datasets, config)
# Analyze results
analysis = analyze_benchmark_results(results)
leaderboard = create_leaderboard(results)
print("Enhanced Benchmarking Results:")
print(leaderboard.to_string(index=False))
Comprehensive Statistical Analysis
# Detailed analysis
for estimator, metrics in analysis.items():
print(f"\n{estimator}:")
print(f" Success Rate: {metrics['success_rate']:.1%}")
print(f" Mean Bias: {metrics['mean_bias']:.4f}")
print(f" Std Bias: {metrics['std_bias']:.4f}")
print(f" Mean Absolute Error: {metrics['mean_absolute_error']:.4f}")
print(f" RMSE: {metrics['rmse']:.4f}")
print(f" Mean Relative Error: {metrics['mean_relative_error']:.1%}")
print(f" Significance Rate: {metrics['significance_rate']:.1%}")
print(f" Mean Computation Time: {metrics['mean_computation_time']:.4f}s")
Bias Analysis by Hurst Value
# Analyze bias patterns by true Hurst value
hurst_values = sorted(set(r.true_hurst for r in results if r.true_hurst is not None))
for hurst in hurst_values:
print(f"\nHurst = {hurst}:")
hurst_results = [r for r in results if r.true_hurst == hurst and r.convergence_flag and r.bias is not None]
if hurst_results:
biases = [r.bias for r in hurst_results]
print(f" Total estimates: {len(hurst_results)}")
print(f" Mean bias: {np.mean(biases):.4f}")
print(f" Std bias: {np.std(biases):.4f}")
print(f" Min bias: {np.min(biases):.4f}")
print(f" Max bias: {np.max(biases):.4f}")
## Application-Specific Scoring
The benchmarking system uses a weighted scoring function to rank estimators based on their suitability for specific use cases.
### Scoring Weights Configuration
```python
from neurological_lrd_analysis.benchmark_core.runner import ScoringWeights
# Default weights (balanced)
default_weights = ScoringWeights(
success_rate=0.25,
accuracy=0.25,
speed=0.25,
robustness=0.25
)
# Real-time BCI weights (prioritize speed and success)
bci_weights = ScoringWeights(
success_rate=0.4,
accuracy=0.2,
speed=0.3,
robustness=0.1
)
# Precision research weights (prioritize accuracy and robustness)
research_weights = ScoringWeights(
success_rate=0.2,
accuracy=0.4,
speed=0.1,
robustness=0.3
)
Custom Weight Usage
config = BenchmarkConfig(
output_dir='./results',
scoring_weights=bci_weights
)
results = run_benchmark_on_dataset(datasets, config)
leaderboard = create_leaderboard(results) # Leaderboard uses the weights
## Benchmark Configuration Options
### BenchmarkConfig Parameters
```python
config = BenchmarkConfig(
output_dir='./results', # Output directory
true_hurst=None, # Specific Hurst value (None for all)
n_bootstrap=100, # Bootstrap samples for CI
confidence_level=0.95, # Confidence level
random_state=42, # Random seed
estimators=None, # Specific estimators (None for all)
save_results=True, # Save results to files
verbose=False # Verbose output
)
Data Generation Options
datasets = generate_grid(
hurst_values=[0.3, 0.5, 0.7], # Hurst exponents to test
lengths=[512, 1024, 2048], # Data lengths
contaminations=['none', 'noise'], # Contamination types
contamination_level=0.1, # Contamination level (0-1)
seed=42 # Random seed
)
Statistical Interpretation
Bias Analysis
Positive Bias: Estimator tends to overestimate
Negative Bias: Estimator tends to underestimate
Zero Bias: Unbiased estimator (ideal)
Error Analysis
Mean Absolute Error (MAE): Average absolute deviation
Root Mean Square Error (RMSE): Penalizes large errors more
Relative Error: Error relative to true value
Significance Analysis
P-value < 0.05: Statistically significant regression
Significance Rate: Percentage of significant estimates
High significance rate: Consistent statistical reliability
Output Files
The enhanced benchmarking system generates several output files:
CSV Files
benchmark_results.csv: Raw results with all metricsleaderboard.csv: Ranked estimator performancedetailed_analysis.csv: Comprehensive statistical analysisbias_analysis.csv: Bias patterns by Hurst value
Text Files
benchmark_summary.txt: Human-readable summary
Example Output Structure
benchmark_results/
├── benchmark_results.csv # Raw results
├── benchmark_summary.txt # Summary report
├── leaderboard.csv # Performance ranking
├── detailed_analysis.csv # Statistical analysis
└── bias_analysis.csv # Bias patterns
Advanced Usage
Custom Estimator Selection
# Benchmark specific estimators only
config = BenchmarkConfig(
output_dir='./custom_results',
estimators=['DFA', 'R/S', 'GPH'], # Only these estimators
n_bootstrap=200,
confidence_level=0.99
)
Performance Comparison
# Compare estimators across different conditions
conditions = {
'clean_data': generate_grid([0.5, 0.7], [1024], ['none']),
'noisy_data': generate_grid([0.5, 0.7], [1024], ['noise']),
'short_data': generate_grid([0.5, 0.7], [512], ['none']),
'long_data': generate_grid([0.5, 0.7], [2048], ['none'])
}
for condition_name, datasets in conditions.items():
config = BenchmarkConfig(output_dir=f'./{condition_name}_results')
results = run_benchmark_on_dataset(datasets, config)
analysis = analyze_benchmark_results(results)
print(f"\n{condition_name}:")
for estimator, metrics in analysis.items():
print(f" {estimator}: MAE={metrics['mean_absolute_error']:.3f}")
Statistical Testing
# Compare estimator performance statistically
from scipy import stats
def compare_estimators(results, estimator1, estimator2):
"""Compare two estimators using paired t-test."""
est1_results = [r for r in results if r.estimator == estimator1 and r.convergence_flag]
est2_results = [r for r in results if r.estimator == estimator2 and r.convergence_flag]
# Match by true Hurst value
est1_errors = []
est2_errors = []
for r1 in est1_results:
for r2 in est2_results:
if r1.true_hurst == r2.true_hurst:
est1_errors.append(r1.absolute_error)
est2_errors.append(r2.absolute_error)
break
if len(est1_errors) > 1:
t_stat, p_value = stats.ttest_rel(est1_errors, est2_errors)
print(f"Comparison: {estimator1} vs {estimator2}")
print(f" T-statistic: {t_stat:.4f}")
print(f" P-value: {p_value:.4f}")
print(f" Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
# Example usage
compare_estimators(results, 'DFA', 'R/S')
Best Practices
Benchmark Design
Use multiple Hurst values: Test across the range [0.1, 0.9]
Vary data lengths: Test with short (512), medium (1024), and long (2048) series
Include contamination: Test robustness with noise, missing data, outliers
Adequate sample size: Use at least 50 bootstrap samples for CI
Statistical Interpretation
Consider bias patterns: Look for systematic over/under-estimation
Evaluate consistency: Check standard deviation of bias
Assess significance: High significance rate indicates reliability
Compare across conditions: Some estimators may be better for specific scenarios
Performance Optimization
Parallel processing: Use multiple cores for large benchmarks
Memory management: Process datasets in batches for very large studies
Result caching: Save intermediate results for long-running benchmarks
Troubleshooting
Common Issues
Low Success Rates
# Check for data quality issues
if metrics['success_rate'] < 0.8:
print(f"Low success rate for {estimator}: {metrics['success_rate']:.1%}")
print("Consider:")
print("- Increasing data length")
print("- Checking data quality")
print("- Adjusting estimator parameters")
High Bias
# Investigate bias patterns
if abs(metrics['mean_bias']) > 0.2:
print(f"High bias for {estimator}: {metrics['mean_bias']:.3f}")
print("Consider:")
print("- Method may be inappropriate for data type")
print("- Check preprocessing requirements")
print("- Verify parameter settings")
Low Significance Rate
# Check statistical reliability
if metrics['significance_rate'] < 0.5:
print(f"Low significance rate for {estimator}: {metrics['significance_rate']:.1%}")
print("Consider:")
print("- Increasing bootstrap samples")
print("- Checking data stationarity")
print("- Verifying scaling range selection")
## Machine Learning Benchmarks
The `ClassicalMLBenchmark` class provides a specialized interface for comparing classical Hurst estimators against pre-trained machine learning models.
### Comparison Workflow
```python
from neurological_lrd_analysis.ml_baselines import ClassicalMLBenchmark
# 1. Initialize benchmark with pre-trained models
benchmark = ClassicalMLBenchmark(
model_dir="pretrained_models",
output_dir="ml_benchmark_results"
)
# 2. Run comprehensive comparison across multiple scenarios
results = benchmark.run_comprehensive_benchmark(
n_samples=50, # Samples per scenario
hurst_range=(0.3, 0.9)
)
# 3. Access summary statistics
print(f"Classical vs ML Leaderboard:")
print(results['leaderboard'])
Key ML Benchmark Metrics
In addition to standard metrics, the ML benchmark analyzes:
Generalization Error: Performance on synthetic vs. real biomedical scenarios.
Inference Speed: Comparison of feature extraction + ML prediction vs. classical integration.
Uncertainty Calibration: How well ML confidence intervals (from ensembles) correlate with actual error.
## Example Results Interpretation
### Sample Output
Estimator Success Rate Mean Bias Std Bias Mean Absolute Error RMSE Significance Rate DFA 100.0% -0.0662 0.1921 0.1751 0.2032 85.0% R/S 100.0% -0.0027 0.1939 0.1754 0.1940 92.0% GPH 50.0% 0.0987 0.0993 0.1015 0.1400 75.0%
### Interpretation
- **DFA**: Consistent performance, slight negative bias, good significance
- **R/S**: Best overall performance, minimal bias, highest significance
- **GPH**: Lower success rate but good performance when successful
This enhanced benchmarking system provides comprehensive statistical analysis to help users select the most appropriate Hurst estimator for their specific biomedical time series data.