Multi-Source Data Acquisition
Automated download from NASA Exoplanet Archive (KOI, TOI datasets) and NASA K2 mission data via API calls. Real-time access to 21,000+ exoplanet candidates with comprehensive astronomical parameters.
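A minimal sketch of the download step, assuming the archive's public TAP service and its standard table names (cumulative for KOI, toi for TESS, k2pandc for K2); the pipeline's actual queries and column filters may differ:

```python
import pandas as pd

# NASA Exoplanet Archive TAP endpoint (synchronous queries, CSV output).
TAP_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"

def fetch_table(table: str) -> pd.DataFrame:
    """Download a full archive table as a DataFrame."""
    query = f"select * from {table}".replace(" ", "+")
    return pd.read_csv(f"{TAP_URL}?query={query}&format=csv")

koi = fetch_table("cumulative")  # Kepler Objects of Interest
toi = fetch_table("toi")         # TESS Objects of Interest
k2 = fetch_table("k2pandc")      # K2 planets and candidates
print(f"{len(koi) + len(toi) + len(k2):,} candidate rows downloaded")
```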
System Architecture
A stacking ensemble that combines multiple machine learning algorithms with hyperparameter optimization to achieve state-of-the-art exoplanet classification accuracy.
Our hackathon submission leverages real NASA datasets and cutting-edge machine learning to democratize exoplanet discovery. Built in 48 hours with a production-ready architecture and a comprehensive data-processing pipeline.
Column-wise missing-value analysis (columns above a 40% missing threshold are dropped), automatic disposition mapping (CONFIRMED→1, CANDIDATE→2, FALSE POSITIVE→3), and column selection from 25+ astronomical features.
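A sketch of this step, assuming pandas DataFrames from the download step and a generic disposition column name (the archive's actual column, e.g. koi_disposition, varies by table):

```python
import pandas as pd

def select_and_label(df: pd.DataFrame, disposition_col: str) -> pd.DataFrame:
    """Drop sparse columns and map archive dispositions to integer labels."""
    # Column-wise missing-value analysis: keep columns with <= 40% missing.
    keep = df.columns[df.isna().mean() <= 0.40]
    df = df[keep.union([disposition_col])].copy()

    # Disposition mapping: CONFIRMED -> 1, CANDIDATE -> 2, FALSE POSITIVE -> 3.
    mapping = {"CONFIRMED": 1, "CANDIDATE": 2, "FALSE POSITIVE": 3}
    df["label"] = df[disposition_col].str.upper().map(mapping)
    return df.dropna(subset=["label"])
```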
Log10 transformation for astronomical measurements, unit standardization (radius conversion to meters), robust scaling to handle outliers, and class-based KNN imputation (k=5) for missing values.
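These transformations in miniature, assuming all retained feature columns are numeric, that radius arrives in Earth radii (the archive's usual unit), and that no column is entirely missing within a class; in the real pipeline the imputer and scaler would be fit on training folds only:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import RobustScaler

R_EARTH_M = 6.371e6  # Earth radius in meters

def transform_features(df, log_cols, radius_col, label_col="label"):
    df = df.copy()
    # Unit standardization: planet radius from Earth radii to meters.
    df[radius_col] = df[radius_col] * R_EARTH_M
    # Log10 compression for heavy-tailed astronomical measurements.
    for col in log_cols:
        df[col] = np.log10(df[col].clip(lower=1e-12))
    # Class-based KNN imputation: fill each gap from the 5 nearest
    # neighbours within the same disposition class.
    feats = [c for c in df.columns if c != label_col]
    for _, idx in df.groupby(label_col).groups.items():
        df.loc[idx, feats] = KNNImputer(n_neighbors=5).fit_transform(df.loc[idx, feats])
    # Robust scaling (median/IQR) to damp the influence of outliers.
    df[feats] = RobustScaler().fit_transform(df[feats])
    return df
```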
Stacked ensemble with 4 specialized algorithms, Optuna hyperparameter optimization (30 trials each), 10-fold stratified cross-validation, and meta-learning for optimal weight combination.
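scikit-learn's StackingClassifier captures the same design in a few lines (TabNet is omitted here because it is not a drop-in scikit-learn estimator, and the hyperparameters shown are untuned placeholders):

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

stack = StackingClassifier(
    estimators=[
        ("catboost", CatBoostClassifier(verbose=0)),
        ("lightgbm", LGBMClassifier()),
        ("xgboost", XGBClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # meta-learner sees class probabilities
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=67),
)
# X_train, y_train: preprocessed features and labels from the steps above.
stack.fit(X_train, y_train)
```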
Each base model brings unique strengths to exoplanet signal detection, creating a robust ensemble architecture.
CatBoost
Gradient boosting with automatic categorical feature handling and reduced overfitting through ordered boosting.
- Ordered Boosting: Prevents target leakage common in astronomical datasets with temporal correlations.
- Categorical Handling: Native support for stellar classifications (spectral types, discovery methods) without encoding artifacts.
- Noise Robustness: Built-in gradient noise handling crucial for noisy photometric measurements.
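An illustrative configuration (the categorical column names are hypothetical and the hyperparameters are placeholders for the Optuna-tuned values):

```python
from catboost import CatBoostClassifier

cat_model = CatBoostClassifier(
    boosting_type="Ordered",  # ordered boosting: permutation-based, leakage-resistant
    cat_features=["discovery_method", "spectral_type"],  # hypothetical columns
    iterations=500,
    learning_rate=0.05,
    depth=6,
    verbose=0,
)
cat_model.fit(X_train, y_train)
```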
LightGBM
Fast gradient boosting framework optimized for efficiency and accuracy with leaf-wise tree growth.
- Leaf-wise Growth: Optimal for high-dimensional astronomical features (25+ parameters per candidate).
- Memory Efficiency: Essential for processing 21,000+ exoplanet candidates in real time.
- Speed-Accuracy Balance: Enables rapid hyperparameter exploration while maintaining precision on subtle transit signals.
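A matching sketch (values are illustrative defaults, not the tuned ones):

```python
from lightgbm import LGBMClassifier

lgbm_model = LGBMClassifier(
    num_leaves=63,    # leaf-wise growth: complexity is budgeted per leaf
    max_depth=-1,     # no depth cap; trees grow best-leaf-first
    n_estimators=500,
    learning_rate=0.05,
)
lgbm_model.fit(X_train, y_train)
```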
XGBoost
Optimized distributed gradient boosting library with advanced regularization and scalability features.
- Regularization: L1/L2 penalties prevent overfitting to correlated astronomical measurements (stellar radius vs. mass).
- Feature Importance: SHAP-compatible explanations crucial for scientific validation of discoveries.
- Sparsity Handling: Optimized for datasets with many missing values common in multi-mission archives.
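An illustrative configuration (regularization strengths are placeholders; XGBoost routes missing values through learned default split directions, so NaNs need no special treatment):

```python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    reg_alpha=1.0,       # L1 penalty on leaf weights
    reg_lambda=1.0,      # L2 penalty on leaf weights
    tree_method="hist",  # histogram-based splits for scalability
    n_estimators=500,
    learning_rate=0.05,
)
xgb_model.fit(X_train, y_train)  # NaNs follow learned default branches
```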
TabNet
Deep learning architecture specifically designed for tabular data with interpretable feature selection.
- Attention Mechanism: Automatically focuses on relevant features per candidate (orbital period for hot Jupiters, transit depth for rocky planets).
- Sequential Processing: Models temporal dependencies in multi-epoch observations.
- Interpretability: Provides feature masks showing which measurements drove each classification decision.
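A sketch using the pytorch-tabnet package, which expects numpy arrays; the architecture sizes are illustrative, and X_val/y_val stand for an assumed held-out validation split:

```python
from pytorch_tabnet.tab_model import TabNetClassifier

tabnet_model = TabNetClassifier(
    n_d=16, n_a=16,  # widths of the decision and attention layers
    n_steps=4,       # sequential attention steps
    gamma=1.5,       # penalty on reusing a feature across steps
)
tabnet_model.fit(
    X_train.to_numpy(), y_train.to_numpy(),
    eval_set=[(X_val.to_numpy(), y_val.to_numpy())],
    max_epochs=200, patience=20,  # stop when the eval score stalls
)
# Per-sample masks: which measurements drove each classification.
explain_matrix, masks = tabnet_model.explain(X_val.to_numpy())
```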
Detailed implementation parameters and optimization strategies for reproducible results.
- TPE Sampler Choice: Tree-structured Parzen Estimator outperforms random/grid search by modeling promising vs. unpromising regions.
- 30-Trial Budget: Empirically validated sweet spot, offering sufficient exploration without overfitting to validation noise.
- Bayesian Approach: Builds a surrogate model of the hyperparameter space, concentrating trials where improvement is most likely.
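One per-model study looks like this (LightGBM shown; the search space is illustrative, not the project's exact one):

```python
import optuna
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def objective(trial):
    params = {
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=67)
    # X, y: the preprocessed feature matrix and labels from earlier steps.
    return cross_val_score(LGBMClassifier(**params), X, y, cv=cv).mean()

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(seed=67),  # Bayesian TPE surrogate
)
study.optimize(objective, n_trials=30)  # 30-trial budget per base model
```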
- Stratified Splitting: Maintains class proportions in each fold, critical for imbalanced exoplanet data (confirmed planets are rare).
- 10-fold for Training: Optimal bias-variance tradeoff for 21k samples, providing robust meta-learning features.
- 5-fold for Optuna: Faster hyperparameter exploration while maintaining statistical validity.
- Fixed Seed 67: Ensures reproducible splits for scientific verification and result comparison.
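A sketch of how the out-of-fold meta-features are built with the fixed seed (TabNet is handled the same way in the full pipeline but omitted here for brevity):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=67)

# Out-of-fold class probabilities become the meta-learner's inputs,
# so each base model only predicts samples it never trained on.
meta_features = np.hstack([
    cross_val_predict(model, X, y, cv=outer_cv, method="predict_proba")
    for model in (cat_model, lgbm_model, xgb_model)
])
```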
- Logistic Regression Choice: Linear meta-learner prevents overfitting to base model quirks while remaining interpretable for scientific analysis.
- Probability Features: Uses class probabilities rather than hard predictions, preserving uncertainty information crucial for discovery confidence.
- CV Regularization: Automatically selects optimal L1/L2 balance, enabling feature selection among base models.
- Early Stopping: 20-round patience prevents meta-overfitting while allowing convergence.
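A minimal version of the meta-learner fit, assuming the stacked probability matrix from the previous step; elastic-net via the saga solver is one way to realize the L1/L2 selection described (the 20-round early stopping applies to the gradient-boosted base learners and is not shown here):

```python
from sklearn.linear_model import LogisticRegressionCV

meta_learner = LogisticRegressionCV(
    penalty="elasticnet",
    solver="saga",              # the solver that supports elastic-net
    l1_ratios=[0.0, 0.5, 1.0],  # CV picks the L1/L2 balance
    cv=5,
    max_iter=5000,
)
meta_learner.fit(meta_features, y)
```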
Comprehensive model evaluation and output artifacts for deployment and analysis.