System Architecture

Advanced ML Pipeline for Exoplanet Detection

A sophisticated stacking ensemble approach combining multiple machine learning algorithms with hyperparameter optimization to achieve state-of-the-art exoplanet classification accuracy.

NASA Space Apps Challenge Solution

Our hackathon submission leverages real NASA datasets and cutting-edge machine learning to democratize exoplanet discovery. Built in 48 hours with a production-ready architecture and a comprehensive data-processing pipeline.

01

Multi-Source Data Acquisition

Automated download of the KOI and TOI catalogs from the NASA Exoplanet Archive, plus NASA K2 mission data, via API calls. Real-time access to 21,000+ exoplanet candidates with comprehensive astronomical parameters.
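As a minimal sketch of what such an API call might look like: the NASA Exoplanet Archive exposes a TAP service whose synchronous endpoint returns CSV for an ADQL query. The endpoint URL and the `cumulative` (KOI) / `toi` table names are assumptions here; check the archive's TAP documentation for current table names, and pass the resulting URL to `pandas.read_csv` to download.

```python
from urllib.parse import urlencode

# Public TAP sync endpoint of the NASA Exoplanet Archive (assumed; verify
# against the archive's TAP documentation).
TAP_SYNC = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"

def archive_csv_url(table: str, columns: str = "*") -> str:
    """Build a synchronous ADQL query URL returning one catalog table as CSV."""
    query = f"select {columns} from {table}"
    return f"{TAP_SYNC}?{urlencode({'query': query, 'format': 'csv'})}"

koi_url = archive_csv_url("cumulative")  # Kepler Objects of Interest
toi_url = archive_csv_url("toi")         # TESS Objects of Interest
```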

02

Data Cleaning & Integration

Column-wise missing value analysis (40% threshold), automatic disposition mapping (CONFIRMED→1, CANDIDATE→2, FALSE POSITIVE→3), and intelligent column selection from 25+ astronomical features.

Data Quality Strategy

  • 40% Missing Value Threshold: Research shows features with >40% missingness introduce more noise than signal. Empirical analysis of NASA archives confirms this cutoff preserves information while removing sparse columns.
  • Multi-Mission Integration: Combines KOI (Kepler), TOI (TESS), and K2 datasets using standardized nomenclature, maximizing training diversity across different stellar populations and detection methods.
  • Class Balance Preservation: Maintains natural distribution of confirmed planets (rare), candidates (moderate), and false positives (common) to reflect real discovery rates.
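The threshold-and-map logic above can be sketched in a few lines of pandas; the column names here (`koi_disposition`, `koi_period`, `sparse_col`) are illustrative, not the pipeline's actual schema.

```python
import pandas as pd

# Disposition encoding from the pipeline description
DISPOSITION_MAP = {"CONFIRMED": 1, "CANDIDATE": 2, "FALSE POSITIVE": 3}

def clean_catalog(df: pd.DataFrame, label_col: str = "koi_disposition",
                  max_missing: float = 0.40) -> pd.DataFrame:
    """Drop columns above the missing-value threshold and encode dispositions."""
    keep = df.columns[df.isna().mean() <= max_missing]
    out = df[keep].copy()
    out["label"] = df[label_col].map(DISPOSITION_MAP)
    return out

# toy catalog: one column is 75% missing and gets dropped
raw = pd.DataFrame({
    "koi_disposition": ["CONFIRMED", "FALSE POSITIVE", "CANDIDATE", "CONFIRMED"],
    "koi_period": [3.52, 10.1, 1.2, 5.7],   # days
    "sparse_col": [None, None, None, 4.0],  # 75% missing -> exceeds threshold
})
clean = clean_catalog(raw)
```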

03

Feature Engineering & Scaling

Log10 transformation for astronomical measurements, unit standardization (radius conversion to meters), robust scaling to handle outliers, and class-based KNN imputation (k=5) for missing values.

Scientific Data Preparation

  • Log10 Transformation: Astronomical quantities span orders of magnitude (planetary radii: Earth-size to Jupiter-size). Log scaling ensures linear algorithms can capture these power-law relationships effectively.
  • Robust Scaling: Uses median/IQR instead of mean/std to handle extreme outliers common in sky surveys (bright giants vs faint dwarfs), preventing single objects from skewing distributions.
  • Class-based KNN Imputation (k=5): Missing values filled using similar objects of the same class, preserving astrophysical correlations (stellar mass-radius relations, planet-star correlations).
  • Unit Standardization: Converts mixed units (solar radii, Earth radii, AU) to SI units, ensuring consistent numerical ranges for optimization algorithms.
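The preparation steps above can be sketched with scikit-learn; the column names below are hypothetical, and this is an illustrative outline rather than the pipeline's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

def prepare_features(df: pd.DataFrame, log_cols, label_col: str = "label") -> pd.DataFrame:
    out = df.copy()
    # Log10-compress quantities spanning orders of magnitude
    for col in log_cols:
        out[col] = np.log10(out[col])
    feats = [c for c in out.columns if c != label_col]
    # Class-based KNN imputation: fill gaps from the 5 nearest same-class rows
    for _, idx in out.groupby(label_col).groups.items():
        out.loc[idx, feats] = KNNImputer(n_neighbors=5).fit_transform(out.loc[idx, feats])
    # Median/IQR scaling blunts the extreme outliers common in sky surveys
    out[feats] = RobustScaler().fit_transform(out[feats])
    return out

# toy frame: two classes, one missing radius value (hypothetical columns)
df = pd.DataFrame({
    "period_days": [1, 3, 10, 30, 100, 300, 2, 5, 20, 50, 200, 400],
    "radius_m":    [7e6, 8e6, None, 9e6, 1e7, 7e7, 6e6, 7e6, 8e6, 6e7, 7e7, 8e7],
    "label":       [1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3],
})
ready = prepare_features(df, log_cols=["period_days", "radius_m"])
```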

04

Advanced ML Ensemble

Stacked ensemble with 4 specialized algorithms, Optuna hyperparameter optimization (30 trials each), 10-fold stratified cross-validation, and meta-learning for optimal weight combination.

Architectural Decision: Why Stacked Ensemble?

  • Bias-Variance Decomposition: Single models suffer from either high bias (underfitting complex exoplanet signals) or high variance (overfitting to noise). Stacking reduces both by combining diverse learners.
  • Model Diversity Principle: Each algorithm captures different aspects of the data: tree-based models excel at feature interactions while TabNet provides attention-based feature selection, ensuring complementary error patterns.
  • Meta-Learning Advantage: Rather than simple averaging, a trained meta-model learns optimal weights based on input characteristics, adapting to different regions of the feature space.
  • Astronomical Data Complexity: Exoplanet signals exhibit multi-scale patterns (orbital periods, transit depths, stellar noise) that no single algorithm can capture optimally.
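The stacking layout above can be sketched with scikit-learn's `StackingClassifier`. Random-forest, gradient-boosting, and extra-trees models stand in for CatBoost, LightGBM, XGBoost, and TabNet (which may not be installed), and the synthetic data is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold, train_test_split

# synthetic stand-in for the 3-class exoplanet catalog
X, y = make_classification(n_samples=400, n_features=12, n_informative=8,
                           n_classes=3, random_state=67)

# stand-ins for the four specialized base learners
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=67)),
    ("gb", GradientBoostingClassifier(n_estimators=30, random_state=67)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=67)),
]

stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegressionCV(max_iter=2000),  # trained meta-model
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=67),
    stack_method="predict_proba",  # meta-features = base class probabilities
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=67)
stack.fit(X_tr, y_tr)
```

The `stack_method="predict_proba"` choice mirrors the meta-learning design: the meta-model sees each base learner's class probabilities rather than hard votes.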

Model Components

Each base model brings unique strengths to exoplanet signal detection, creating a robust ensemble architecture.

CB

CatBoost

Gradient boosting with automatic categorical feature handling and reduced overfitting through ordered boosting.

  • No feature scaling required
  • Robust to outliers
  • Built-in regularization
  • GPU acceleration

Why CatBoost for Exoplanets?

  • Ordered Boosting: Prevents target leakage common in astronomical datasets with temporal correlations.
  • Categorical Handling: Native support for stellar classifications (spectral types, discovery methods) without encoding artifacts.
  • Noise Robustness: Built-in gradient noise handling crucial for noisy photometric measurements.

LGB

LightGBM

Fast gradient boosting framework optimized for efficiency and accuracy with leaf-wise tree growth.

  • Memory efficient
  • Fast training speed
  • High accuracy
  • Network parallelization

Why LightGBM for Large-Scale Data?

  • Leaf-wise Growth: Optimal for high-dimensional astronomical features (25+ parameters per candidate).
  • Memory Efficiency: Essential for processing 21,000+ exoplanet candidates in real time.
  • Speed-Accuracy Balance: Enables rapid hyperparameter exploration while maintaining precision on subtle transit signals.

XGB

XGBoost

Optimized distributed gradient boosting library with advanced regularization and scalability features.

  • Advanced regularization
  • Parallel processing
  • Feature importance
  • Cross-platform support

Why XGBoost for Feature Selection?

  • Regularization: L1/L2 penalties prevent overfitting to correlated astronomical measurements (stellar radius vs mass).
  • Feature Importance: SHAP-compatible explanations crucial for scientific validation of discoveries.
  • Sparsity Handling: Optimized for datasets with many missing values common in multi-mission archives.

TN

TabNet

Deep learning architecture specifically designed for tabular data with interpretable feature selection.

  • Attention mechanism
  • Sequential decision making
  • Feature importance
  • End-to-end learning

Why TabNet for Deep Pattern Recognition?

  • Attention Mechanism: Automatically focuses on relevant features per candidate (orbital period for hot Jupiters, transit depth for rocky planets).
  • Sequential Processing: Models temporal dependencies in multi-epoch observations.
  • Interpretability: Provides feature masks showing which measurements drove each classification decision.

Technical Specifications

Detailed implementation parameters and optimization strategies for reproducible results.

Hyperparameter Optimization

Framework: Optuna TPE Sampler
Trials per Model: 30 iterations
Optimization Metric: Cross-validation Accuracy
Search Strategy: Bayesian Optimization

Optimization Strategy Rationale

  • TPE Sampler Choice: The Tree-structured Parzen Estimator outperforms random/grid search by modeling promising vs. unpromising regions of the search space.
  • 30-Trial Budget: An empirically validated sweet spot: sufficient exploration without overfitting to validation noise.
  • Bayesian Approach: Builds a surrogate model of the hyperparameter space, concentrating trials where improvement is most likely.

Cross-Validation Strategy

Method: Stratified K-Fold
Base Training Folds: 10-fold CV
Optuna CV Folds: 5-fold CV
Random Seed: 67 (reproducible)

Cross-Validation Design Rationale

  • Stratified Splitting: Maintains class proportions in each fold, critical for imbalanced exoplanet data (confirmed planets are rare).
  • 10-Fold for Training: Optimal bias-variance tradeoff for 21k samples, providing robust meta-learning features.
  • 5-Fold for Optuna: Faster hyperparameter exploration while maintaining statistical validity.
  • Fixed Seed 67: Ensures reproducible splits for scientific verification and result comparison.
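Stratification's effect can be checked directly. With 15% rare "confirmed" labels in a toy sample, every stratified fold keeps exactly that share:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# imbalanced toy labels: confirmed (rare), candidate, false positive (common)
y = np.concatenate([np.full(30, 1), np.full(70, 2), np.full(100, 3)])

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=67)
confirmed_share = [
    float(np.mean(y[test_idx] == 1))
    for _, test_idx in skf.split(np.zeros((len(y), 1)), y)
]
# every 20-sample fold holds exactly 3 confirmed labels (15%)
```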

Data Split Configuration

Training Set: 80% stratified
Test Set: 20% holdout
Class Weighting: Automatic balancing
Validation: Out-of-fold predictions

Meta-Learning Ensemble

Meta-Model: Logistic Regression CV
Input Features: Base model probabilities
Regularization: Cross-validated L1/L2
Early Stopping: 20 rounds patience

Meta-Learning Architecture

  • Logistic Regression Choice: A linear meta-learner prevents overfitting to base-model quirks while remaining interpretable for scientific analysis.
  • Probability Features: Uses class probabilities rather than hard predictions, preserving uncertainty information crucial for discovery confidence.
  • CV Regularization: Automatically selects the optimal L1/L2 balance, enabling feature selection among base models.
  • Early Stopping: 20-round patience prevents meta-overfitting while allowing convergence.
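The meta-layer can be sketched as out-of-fold base-model probabilities fed to `LogisticRegressionCV`. Scikit-learn forests stand in for the real base models here, and the cross-validated search shown tunes the regularization strength C (tuning the L1/L2 mix as well would require an elastic-net penalty).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, random_state=67)

# Out-of-fold class probabilities from each base model become meta-features,
# preserving uncertainty instead of hard votes
bases = [RandomForestClassifier(n_estimators=50, random_state=67),
         GradientBoostingClassifier(n_estimators=30, random_state=67)]
meta_X = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                    for m in bases])

# Cross-validated selection of the regularization strength
meta_model = LogisticRegressionCV(Cs=5, cv=5, max_iter=2000).fit(meta_X, y)
```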

Performance & Artifacts

Comprehensive model evaluation and output artifacts for deployment and analysis.

Model Artifacts

  • Models: Serialized base models (joblib)
  • Parameters: Optimized hyperparameters (JSON)
  • Meta-model: Ensemble weights and coefficients
  • Predictions: Test set classifications (CSV)
  • Features: Meta-learning feature matrices
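Artifact persistence might look like the following, with joblib for serialized models and JSON for tuned parameters; the file names and the tiny model are illustrative only.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# tiny stand-in model and parameter set
X, y = make_classification(n_samples=100, n_features=5, random_state=67)
model = LogisticRegression(max_iter=1000).fit(X, y)
best_params = {"C": 1.0, "max_iter": 1000}

out_dir = Path(tempfile.mkdtemp())
joblib.dump(model, out_dir / "meta_model.joblib")               # serialized model
(out_dir / "params.json").write_text(json.dumps(best_params))   # tuned params

# round-trip check: a reloaded model reproduces the original predictions
restored = joblib.load(out_dir / "meta_model.joblib")
```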

Evaluation Metrics

  • Accuracy: Primary optimization target
  • Precision: Per-class detection precision
  • Recall: Sensitivity for each category
  • F1-Score: Harmonic mean performance
  • ROC-AUC: Multi-class discrimination
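These metrics can all be computed with scikit-learn. The toy labels and probabilities below are fabricated for illustration (classes encoded 1/2/3 as in the disposition mapping).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             roc_auc_score)

y_true = np.array([1, 2, 3, 1, 3, 3, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],   # columns ordered as classes 1, 2, 3
                  [0.1, 0.8, 0.1],
                  [0.2, 0.1, 0.7],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.2, 0.5, 0.3],
                  [0.3, 0.6, 0.1],
                  [0.1, 0.3, 0.6]])
y_pred = proba.argmax(axis=1) + 1    # map column index back to class label

acc = accuracy_score(y_true, y_pred)
# per-class precision/recall/F1 (average=None keeps one value per class)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=[1, 2, 3], zero_division=0)
# multi-class ROC-AUC needs probabilities, one-vs-rest
auc = roc_auc_score(y_true, proba, multi_class="ovr", labels=[1, 2, 3])
```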