Real-Bogus Classification¶
STDPipe provides two complementary approaches for classifying detected objects as real astronomical sources (stars, galaxies) or artifacts (cosmic rays, hot pixels, satellite trails, etc.):
Feature-based classification (stdpipe.realbogus_features) - uses explicit morphological features with sklearn classifiers
CNN-based classification (stdpipe.realbogus) - uses deep learning for maximum accuracy
This page documents both approaches and helps you choose the right one for your use case.
Quick Start¶
Feature-based (no TensorFlow needed):
from stdpipe import photometry, realbogus_features as rbf
# Detect objects
obj = photometry.get_objects_sep(image, thresh=3.0)
# Classify with scoring (no training needed)
obj = rbf.classify(obj, image, classifier='scoring',
                   threshold=0.5, add_score=True, flag_bogus=True)
# Filter to real objects
real = obj[obj['rb_score'] >= 0.5]
CNN-based (requires TensorFlow):
from stdpipe import photometry, realbogus
# Detect objects
obj = photometry.get_objects_sep(image, thresh=3.0)
# Classify with CNN
obj = realbogus.classify_realbogus(obj, image, threshold=0.5,
                                   add_score=True, flag_bogus=True)
# Filter to real objects
real = obj[obj['rb_score'] >= 0.5]
Comparison of Approaches¶
| Aspect | Feature-based | CNN-based |
|---|---|---|
| Dependencies | sklearn only (already required) | TensorFlow required |
| Training data | 100s of examples | 1000s of examples |
| Training time | Seconds | Minutes to hours |
| Inference speed | ~1 ms per object | ~10 ms per object |
| GPU required | No | Optional but helpful |
| Interpretability | High (explicit features) | Low (black box) |
| Accuracy | Good for typical artifacts | Better for complex cases |
| Customization | Easy (adjust weights/thresholds) | Requires retraining |
Recommendation:
Use feature-based for quick filtering and when interpretability matters
Use CNN-based when maximum accuracy is needed and TensorFlow is available
Use both together when throughput matters: feature-based for quick pre-filtering, CNN for detailed classification of the survivors
Feature-Based Classification¶
The stdpipe.realbogus_features module provides real-bogus classification using explicit morphological features extracted from catalogs and image cutouts.
Key advantages:
No TensorFlow dependency - works on any system with sklearn
Interpretable - can explain why an object is classified as bogus
Fast - no GPU required, milliseconds per object
Small training data - can train on hundreds of examples
Customizable - features and thresholds can be tuned per instrument
Multiple modes - catalog-only, cutout-based, or hybrid
Feature Extraction¶
Features are extracted from two sources:
Catalog Features (from detection catalog, no image needed):
fwhm, fwhm_ratio - PSF size and consistency
ellipticity, elongation - Shape parameters (high for trails)
peakiness - FLUX_MAX / FLUX_AUTO (high for cosmic rays)
snr - Signal-to-noise ratio
Cutout Features (from image cutouts):
sharpness - Central concentration (high = cosmic ray/hot pixel)
concentration - Flux distribution ratio
symmetry - Rotational symmetry (high = asymmetric artifact)
roundness - Shape roundness (low = trail/elongated)
psf_match - PSF fit quality χ²
peak_offset - Centroid-to-peak distance
edge_gradient - Edge sharpness (high = sharp cosmic ray edge)
bg_consistency - Background uniformity in annulus
You can use catalog-only features when you don’t have the image, cutout-only for maximum accuracy, or hybrid mode combining both.
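To give a feel for what a few of these features measure, the sketch below computes simplified versions of peakiness, ellipticity and sharpness with NumPy. The formulas and function names are assumptions chosen for illustration; the actual definitions inside stdpipe.realbogus_features may differ.

```python
import numpy as np

def peakiness(flux_max, flux_auto):
    """Peak pixel value over total flux; approaches 1 for single-pixel spikes."""
    return flux_max / flux_auto

def ellipticity(a, b):
    """1 - b/a from semi-major (a) and semi-minor (b) axes; 0 for a round source."""
    return 1.0 - b / a

def sharpness(cutout):
    """Central pixel divided by the mean of its 8 neighbours (illustrative definition)."""
    c = cutout.shape[0] // 2
    core = cutout[c - 1:c + 2, c - 1:c + 2]
    neighbours = (core.sum() - core[1, 1]) / 8.0
    return core[1, 1] / neighbours

# A Gaussian "star" (sigma = 1.5 px) and a single-pixel "hot pixel" on a flat floor
yy, xx = np.mgrid[-7:8, -7:8]
star = np.exp(-(xx**2 + yy**2) / (2 * 1.5**2)) + 0.01
hot = np.full((15, 15), 0.01)
hot[7, 7] = 1.0

star_sharp = sharpness(star)
hot_sharp = sharpness(hot)
print(f"sharpness: star={star_sharp:.2f}, hot pixel={hot_sharp:.1f}")
```

With this definition the hot pixel is far sharper than the star, which is the kind of separation a "high = cosmic ray/hot pixel" rule relies on.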
Classifiers¶
Three classifier types are available:
1. Scoring Classifier (No Training)
Rule-based scoring using predefined weights and thresholds. Good for quick filtering without any training data.
# Use scoring classifier (no training needed)
obj = rbf.classify(obj, image, classifier='scoring',
                   threshold=0.5, add_score=True)
# Examine scores
print(f"Mean score: {obj['rb_score'].mean():.2f}")
print(f"Real sources: {sum(obj['rb_score'] > 0.5)}")
2. Isolation Forest (Unsupervised)
Anomaly detection that learns “normal” from the data. Good when you have mostly real sources and want to find outliers.
# Classify with IsolationForest (learns from data)
obj = rbf.classify(obj, image, classifier='isolation',
                   remove_trend=True, add_score=True)
# Objects with low scores are outliers (likely artifacts)
artifacts = obj[obj['rb_score'] < 0.3]
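For orientation, the unsupervised idea can be sketched directly with scikit-learn's IsolationForest on a synthetic feature matrix. The min-max rescaling of decision_function to [0, 1] is an assumption made for this example and is not necessarily how stdpipe derives rb_score.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic 2-feature table: 500 "normal" sources plus 10 far-off outliers
normal = rng.normal(loc=[3.0, 0.1], scale=[0.2, 0.03], size=(500, 2))
outliers = rng.normal(loc=[1.0, 0.8], scale=[0.3, 0.1], size=(10, 2))
X = np.vstack([normal, outliers])

forest = IsolationForest(n_estimators=200, contamination=0.05, random_state=0)
forest.fit(X)

# decision_function is higher for inliers; rescale to [0, 1] for a score-like value
raw = forest.decision_function(X)
score = (raw - raw.min()) / (raw.max() - raw.min())

print(f"median score, normal sources: {np.median(score[:500]):.2f}")
print(f"median score, outliers:       {np.median(score[500:]):.2f}")
```

Because the forest only learns what is common, it needs no labels, but anything rare (including genuinely unusual real sources) ends up with a low score.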
3. Random Forest (Supervised)
Traditional supervised classification. Requires labeled training data but provides the best accuracy of the three classifiers.
# Train on labeled data
features, names = rbf.extract_features(train_obj, train_image, method='hybrid')
clf, metrics = rbf.train_classifier(features, train_labels,
                                    classifier='randomforest')
print(f"Accuracy: {metrics['test_accuracy']:.1%}")
# Apply to new data
obj = rbf.classify(obj, image, classifier='randomforest',
                   model=clf, add_score=True)
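The supervised step can be pictured in plain scikit-learn as follows. This is a minimal sketch on synthetic features, assuming the label convention 1 = real, 0 = bogus; it does not reproduce what rbf.train_classifier does internally.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic features: columns are (sharpness, roundness); labels 1 = real, 0 = bogus
real = rng.normal(loc=[1.3, 0.9], scale=[0.2, 0.05], size=(300, 2))
bogus = rng.normal(loc=[5.0, 0.4], scale=[1.0, 0.2], size=(300, 2))
X = np.vstack([real, bogus])
y = np.concatenate([np.ones(300, dtype=int), np.zeros(300, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
# Probability of class 1 (real) plays the role of a real-bogus score
rb_score = forest.predict_proba(X_test)[:, 1]
print(f"Hold-out accuracy: {test_accuracy:.1%}")
```

Holding out a test split, as done here, is what makes the reported accuracy meaningful; accuracy measured on the training objects themselves would be optimistic.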
Trend Removal¶
Many features vary systematically across the image (due to PSF variation, vignetting) or with magnitude. Trend removal normalizes features to make classification more robust:
# Enable spatial trend removal
obj = rbf.classify(obj, image,
                   remove_trend=True,
                   trend_cols=['x', 'y'],  # Spatial trends
                   add_score=True)
# Can also include magnitude trends
obj = rbf.classify(obj, image,
                   remove_trend=True,
                   trend_cols=['x', 'y', 'MAG_AUTO'],
                   add_score=True)
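One way to picture trend removal is a least-squares fit of a smooth surface to each feature, followed by normalization of the residuals. The sketch below fits a plane in (x, y) and scales by a robust sigma; the first-order model and the MAD scaling are assumptions for illustration, not necessarily what remove_trend=True does.

```python
import numpy as np

def remove_spatial_trend(values, x, y):
    """Subtract a first-order (planar) trend in x, y and scale by a robust sigma."""
    A = np.column_stack([np.ones_like(x), x, y])
    coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)
    residual = values - A @ coeffs
    # 1.4826 * MAD estimates the standard deviation for Gaussian noise
    sigma = 1.4826 * np.median(np.abs(residual - np.median(residual)))
    return residual / sigma

rng = np.random.default_rng(2)
x = rng.uniform(0, 2048, 1000)
y = rng.uniform(0, 2048, 1000)
# FWHM that degrades linearly towards one detector edge, plus measurement noise
fwhm = 3.0 + 0.001 * x + rng.normal(0, 0.1, 1000)

detrended = remove_spatial_trend(fwhm, x, y)
corr_before = np.corrcoef(x, fwhm)[0, 1]
corr_after = np.corrcoef(x, detrended)[0, 1]
print(f"correlation with x: before={corr_before:.2f}, after={corr_after:.2f}")
```

After detrending, a star at the degraded edge and a star at the center have comparable feature values, so a single threshold can serve the whole frame.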
Training and Evaluation Tool¶
The examples/train_realbogus_features.py script provides a complete command-line tool for training, testing, and evaluating classifiers:
Train a classifier:
# Train RandomForest on simulated data
python train_realbogus_features.py train --n-images 100 --output model.pkl
# Train with hybrid features and trend removal
python train_realbogus_features.py train \
    --method hybrid --trend-removal --n-images 200 --output model.pkl
Test on single image:
# Test scoring classifier (no training needed)
python train_realbogus_features.py test --classifier scoring
# Test trained model
python train_realbogus_features.py test --model model.pkl
# Test on real image
python train_realbogus_features.py test --model model.pkl \
    --image myimage.fits --fwhm 3.5
Evaluate on multiple images:
# Comprehensive evaluation
python train_realbogus_features.py evaluate \
    --model model.pkl --n-images 50 --output results/
Compare feature methods:
# Compare catalog/cutout/hybrid methods
python train_realbogus_features.py compare --output comparison.png
The tool automatically generates comprehensive visualizations and performance metrics.
Customization¶
The scoring classifier can be customized for your instrument:
# Define custom rules
custom_rules = {
    'sharpness': {'weight': 0.2, 'ideal': 1.3, 'bad': 'high', 'threshold': 4.0},
    'roundness': {'weight': 0.3, 'ideal': 1.0, 'bad': 'low', 'threshold': 0.2},
    'fwhm_ratio': {'weight': 0.2, 'ideal': 1.0, 'bad': 'both', 'threshold': 0.3},
}
clf = rbf.ScoringClassifier(rules=custom_rules)
features, _ = rbf.extract_features(obj, image, method='hybrid')
scores = clf.predict_proba(features)
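How such rules could turn into a score is sketched below. The interpretation is an assumption made for this example: each rule yields a penalty that ramps linearly from 0 at the ideal value to 1 at the threshold (for 'both', the threshold is read as a maximum deviation from ideal), and the score is one minus the weighted mean penalty. The real ScoringClassifier may combine rules differently, and rule_penalty and score are hypothetical names.

```python
import numpy as np

def rule_penalty(value, ideal, bad, threshold):
    """Penalty in [0, 1]: 0 at the ideal value, 1 at or beyond the threshold."""
    if bad == 'high':      # threshold is an upper limit
        p = (value - ideal) / (threshold - ideal)
    elif bad == 'low':     # threshold is a lower limit
        p = (ideal - value) / (ideal - threshold)
    else:                  # 'both': threshold is a maximum deviation from ideal
        p = np.abs(value - ideal) / threshold
    return np.clip(p, 0.0, 1.0)

def score(features, rules):
    """One minus the weighted mean penalty; 1 = real-like, 0 = bogus-like."""
    total = sum(r['weight'] for r in rules.values())
    penalty = sum(
        r['weight'] * rule_penalty(np.asarray(features[name], dtype=float),
                                   r['ideal'], r['bad'], r['threshold'])
        for name, r in rules.items())
    return 1.0 - penalty / total

rules = {
    'sharpness': {'weight': 0.2, 'ideal': 1.3, 'bad': 'high', 'threshold': 4.0},
    'roundness': {'weight': 0.3, 'ideal': 1.0, 'bad': 'low', 'threshold': 0.2},
    'fwhm_ratio': {'weight': 0.2, 'ideal': 1.0, 'bad': 'both', 'threshold': 0.3},
}
# First object is star-like, second resembles a sharp, elongated artifact
features = {
    'sharpness': [1.3, 6.0],
    'roundness': [1.0, 0.1],
    'fwhm_ratio': [1.0, 0.5],
}
scores = score(features, rules)
print(scores)
```

Under this reading, raising a rule's weight makes that feature dominate the score, while moving its threshold changes how quickly an object is penalized.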
Integration with artefacts.py¶
The legacy stdpipe.artefacts module now uses stdpipe.realbogus_features internally:
from stdpipe import artefacts
# New unified function
good = artefacts.filter_detections(
    obj, image,
    method='hybrid',
    classifier='isolation',
    remove_trend=True
)
API Reference¶
CNN-Based Classification¶
The stdpipe.realbogus module provides deep learning-based classification using convolutional neural networks (CNNs). This approach learns features automatically from training data and can achieve higher accuracy than feature-based methods, especially for complex artifact types.
Requirements¶
CNN-based classification requires TensorFlow:
pip install tensorflow
# Or with GPU support
pip install tensorflow[and-cuda]
Basic Usage¶
from stdpipe import realbogus
# Classify with pre-trained model (if available)
obj = realbogus.classify_realbogus(
    obj, image,
    threshold=0.5,
    add_score=True,
    flag_bogus=True
)
# Filter to real objects
real = obj[obj['rb_score'] >= 0.5]
print(f"Kept {len(real)}/{len(obj)} objects")
Training Custom Models¶
For best results on your specific instrument, train a custom CNN model:
from stdpipe import realbogus
# Prepare training data (cutouts and labels)
# labels: 1 = real, 0 = bogus
train_cutouts = realbogus.extract_cutouts(train_obj, train_image)
# Train CNN model
model = realbogus.train_model(
    train_cutouts, train_labels,
    epochs=50,
    batch_size=32,
    validation_split=0.2
)
# Save model
model.save('my_realbogus_model.h5')
# Apply to new data
obj = realbogus.classify_realbogus(
    obj, image,
    model_file='my_realbogus_model.h5',  # path to the saved model; model= expects a loaded keras.Model
    threshold=0.5
)
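The cutouts fed to the network are small fixed-size stamps around each detection. As a rough NumPy sketch of that step, using the 31x31 size (radius 15) mentioned in the API reference as the default: cut_stamps is a hypothetical helper written for this example, not the library's extract_cutouts, and the zero padding at the image border is an assumption.

```python
import numpy as np

def cut_stamps(image, x, y, radius=15, fill=0.0):
    """Return an (N, 2*radius+1, 2*radius+1) stack of stamps centred on (x, y).

    Positions are rounded to the nearest pixel; regions outside the image
    are filled with `fill`.
    """
    size = 2 * radius + 1
    padded = np.pad(image, radius, mode='constant', constant_values=fill)
    stamps = np.empty((len(x), size, size), dtype=image.dtype)
    for i, (xi, yi) in enumerate(zip(x, y)):
        col = int(round(xi))
        row = int(round(yi))
        # In padded coordinates the stamp starts at (row, col)
        stamps[i] = padded[row:row + size, col:col + size]
    return stamps

image = np.arange(100 * 120, dtype=np.float32).reshape(100, 120)
x = np.array([60.2, 0.0])    # column coordinates
y = np.array([50.7, 99.0])   # row coordinates
stamps = cut_stamps(image, x, y)
print(stamps.shape)
```

Padding first keeps every stamp the same shape, so objects near the image edge can still be batched with the rest.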
API Reference¶
- stdpipe.realbogus.classify_realbogus(obj, image, model=None, model_file=None, bg=None, err=None, mask=None, fwhm=None, asinh_softening=None, threshold=0.5, add_score=True, flag_bogus=True, batch_size=128, verbose=False)
Classify detected objects as real or bogus using CNN.
This is the main entry point for real-bogus classification.
- Parameters:
- obj : astropy.table.Table
Object catalog with ‘x’ and ‘y’ columns (from photometry.get_objects_*)
- image : ndarray
Science image
- model : keras.Model, optional
Pre-loaded model. If None, loads from model_file.
- model_file : str, optional
Path to model file. If None, uses default model.
- bg : ndarray or float, optional
Background map or scalar value
- err : ndarray or float, optional
Error/noise map or scalar value
- mask : ndarray, optional
Boolean mask (True = masked pixels)
- cutout size : derived
Cutout size is inferred from the model input shape. If the model has dynamic spatial dimensions, defaults to 31x31 (radius 15).
- fwhm : float, optional
Image FWHM. If None, estimated from catalog.
- asinh_softening : float, optional
Asinh softening in units of background sigma. If None, uses DEFAULT_ASINH_SOFTENING_SIGMA.
- threshold : float, optional
Classification threshold (0-1). Objects with score > threshold are real. Default: 0.5
- add_score : bool, optional
Add ‘rb_score’ column to output catalog. Default: True
- flag_bogus : bool, optional
Set flags=0x1000 for bogus objects and filter them out. Default: True
- batch_size : int, optional
Batch size for inference. Default: 128
- verbose : bool or callable, optional
Print progress. Can be callable for custom logging. Default: False
- Returns:
- obj_filtered : astropy.table.Table
Filtered catalog with real sources only (if flag_bogus=True) or full catalog with ‘rb_score’ column (if flag_bogus=False)
Examples
>>> from stdpipe import photometry, realbogus
>>> obj = photometry.get_objects_sep(image, thresh=3.0)
>>> obj_clean = realbogus.classify_realbogus(obj, image)
>>> print(f"Kept {len(obj_clean)}/{len(obj)} objects")
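The asinh_softening parameter above refers to an arcsinh stretch, a common way to compress the dynamic range of cutouts before they reach a network. A minimal sketch of such a stretch, assuming the softening scale is simply softening × background sigma (the library's exact normalization may differ, and asinh_stretch is a hypothetical name):

```python
import numpy as np

def asinh_stretch(cutout, bg, sigma, softening=1.0):
    """Background-subtract and compress with arcsinh.

    Roughly linear for |signal| << softening * sigma, logarithmic above it.
    """
    return np.arcsinh((cutout - bg) / (softening * sigma))

bg, sigma = 100.0, 5.0
cutout = np.array([100.0, 105.0, 150.0, 50100.0])  # sky, 1 sigma, 10 sigma, bright core
stretched = asinh_stretch(cutout, bg, sigma, softening=1.0)
print(np.round(stretched, 2))
```

The bright core, 10000 sigma above the sky, maps to a value below 10, so faint structure near the noise level is not swamped by the brightest pixels.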
Combining Both Approaches¶
For optimal performance, you can use both approaches in a pipeline:
from stdpipe import photometry, realbogus_features as rbf
# Detect objects
obj = photometry.get_objects_sep(image, thresh=3.0)
# Quick pre-filter with feature-based classifier
obj = rbf.classify(obj, image, classifier='scoring', threshold=0.3)
candidates = obj[obj['rb_score'] >= 0.3]
print(f"Pre-filtered to {len(candidates)} candidates")
# Detailed classification with CNN (if available)
try:
    from stdpipe import realbogus
    candidates = realbogus.classify_realbogus(
        candidates, image,
        threshold=0.5,
        add_score=True
    )
    # Override rb_score with CNN score
except ImportError:
    print("TensorFlow not available, using feature-based scores")
# Final filtering
real = candidates[candidates['rb_score'] >= 0.5]
This two-stage approach:
Quickly filters obvious artifacts with feature-based classifier
Applies more computationally expensive CNN only to candidates
Reduces overall computation time while maintaining high accuracy
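A back-of-the-envelope estimate shows when the two-stage pipeline pays off. It uses the per-object figures from the comparison table (~1 ms feature-based, ~10 ms CNN); the 40% pass fraction is a made-up value for illustration.

```python
def two_stage_time(n_objects, pass_fraction, t_feature=1e-3, t_cnn=10e-3):
    """Estimated seconds for the feature pre-filter on all objects plus CNN on survivors."""
    return n_objects * t_feature + n_objects * pass_fraction * t_cnn

n = 10_000
cnn_only = n * 10e-3
combined = two_stage_time(n, pass_fraction=0.4)  # assume 40% survive the pre-filter
print(f"CNN only: {cnn_only:.0f} s, two-stage: {combined:.0f} s")
```

With these figures the pre-filter is worthwhile as long as it rejects more than about 10% of the detections; below that, its own cost outweighs the CNN time it saves.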
Output Flags¶
Both classification methods use the same flag convention:
# Flag 0x800 marks objects classified as bogus
bogus_mask = (obj['flags'] & 0x800) != 0
real_mask = (obj['flags'] & 0x800) == 0
# Clear bogus flags if needed
obj['flags'] &= ~0x800
Both methods also add a rb_score column with values in [0, 1], where:
Values near 1.0 indicate high confidence the object is real
Values near 0.0 indicate high confidence the object is bogus
Typical threshold is 0.5, but can be adjusted based on your requirements
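If a labeled sample is available, the threshold can be chosen by looking at how purity and completeness trade off. The sketch below uses a tiny hand-made set of scores and labels; purity_completeness is a hypothetical helper, not part of stdpipe.

```python
import numpy as np

def purity_completeness(scores, is_real, threshold):
    """Purity and completeness of the 'real' sample selected with score >= threshold."""
    selected = scores >= threshold
    true_pos = np.sum(selected & is_real)
    purity = true_pos / max(np.sum(selected), 1)
    completeness = true_pos / max(np.sum(is_real), 1)
    return purity, completeness

# Tiny hand-made example: 4 real sources and 4 artifacts with known labels
scores = np.array([0.95, 0.80, 0.60, 0.35, 0.55, 0.30, 0.10, 0.05])
is_real = np.array([True, True, True, True, False, False, False, False])

for threshold in (0.3, 0.5, 0.7):
    p, c = purity_completeness(scores, is_real, threshold)
    print(f"threshold={threshold}: purity={p:.2f}, completeness={c:.2f}")
```

Raising the threshold here trades completeness for purity: a transient search that cannot afford to lose faint sources would lean towards a lower value, while a clean photometric sample would favor a higher one.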
Performance Considerations¶
Feature-based:
Speed: 1000-5000 objects/second (hybrid mode with scoring)
Memory: ~1 KB per object for cutouts, ~100 bytes for features
Optimization: Use catalog-only mode when image features aren’t needed
CNN-based:
Speed: 100-500 objects/second (CPU), 1000-5000 objects/second (GPU)
Memory: Larger cutouts (~4 KB per object), model weights (~10-100 MB)
Optimization: Batch process objects, use GPU if available
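The ~4 KB per-object figure for CNN cutouts is consistent with a 31x31 stamp (the default size given in the API reference) stored as 32-bit floats; the float32 storage type is an assumption here. The arithmetic, including the footprint of one default inference batch of 128 cutouts:

```python
import numpy as np

def cutout_bytes(radius, dtype=np.float32):
    """Bytes needed for one square cutout of side 2*radius + 1."""
    side = 2 * radius + 1
    return side * side * np.dtype(dtype).itemsize

per_object = cutout_bytes(15)   # 31x31 float32
batch = 128 * per_object        # one inference batch of 128 cutouts
print(per_object, batch)
```

A batch of cutouts therefore occupies under half a megabyte, so memory use is usually dominated by the model weights rather than by the cutouts.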
Troubleshooting¶
All objects classified as bogus:
Check that FWHM is correctly estimated
Try lowering the threshold (e.g., 0.3 instead of 0.5)
Verify image units are linear (not log scale)
For feature-based: examine feature distributions to tune rules
Cosmic rays not detected:
For feature-based: increase sharpness weight in scoring rules
Ensure cutout radius is large enough to capture surrounding background
Check that background subtraction is working correctly
Bright stars classified as bogus:
Saturated stars have unusual profiles and may be misclassified
Add saturation mask before classification
Use flags column to exclude saturated objects before classification
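When no saturation flag is available in the catalog, one option is to reject detections that sit on or next to saturated pixels. The following is a minimal NumPy sketch; near_saturation is a hypothetical helper and the saturation level of 60000 is an example value that depends on the detector.

```python
import numpy as np

def near_saturation(image, x, y, saturation, radius=3):
    """True for objects with any pixel >= saturation within `radius` of (x, y)."""
    saturated = image >= saturation
    ny, nx = image.shape
    result = np.zeros(len(x), dtype=bool)
    for i, (xi, yi) in enumerate(zip(x, y)):
        col, row = int(round(xi)), int(round(yi))
        box = saturated[max(row - radius, 0):min(row + radius + 1, ny),
                        max(col - radius, 0):min(col + radius + 1, nx)]
        result[i] = box.any()
    return result

image = np.full((50, 50), 100.0)
image[20, 30] = 65535.0               # one saturated pixel at row 20, column 30
x = np.array([30.0, 10.0])
y = np.array([21.0, 40.0])
keep = ~near_saturation(image, x, y, saturation=60000.0)
print(keep)
```

The resulting boolean array can be used to index the catalog (for example obj[keep]) before classification.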
IsolationForest marks rare objects as outliers:
This is expected (outlier = unusual)
Use supervised RandomForest if you have labeled data
Or use scoring classifier which doesn’t learn from data
References¶
Feature-based classification:
Bloom, J. S., et al. (2012). “Automating Discovery and Classification of Transients and Variable Stars in the Synoptic Survey Era.” PASP 124:1175
Wright, D. E., et al. (2015). “Machine learning for transient discovery in Pan-STARRS1 difference imaging.” MNRAS 449:451
Liu, F. T., et al. (2008). “Isolation Forest.” ICDM 2008
CNN-based classification:
Duev, D. A., et al. (2019). “Real-bogus classification for the Zwicky Transient Facility using deep learning.” MNRAS 489:3582
Cabrera-Vives, G., et al. (2017). “Deep-HiTS: Rotation Invariant Convolutional Neural Network for Transient Detection.” ApJ 836:97
Example Workflows¶
Workflow 1: Quick Filtering for Survey Data
from stdpipe import photometry, realbogus_features as rbf
# Detect objects
obj = photometry.get_objects_sep(image, thresh=3.0, aper=5.0)
print(f"Detected {len(obj)} objects")
# Quick filtering with scoring classifier
obj = rbf.classify(obj, image, classifier='scoring',
                   method='hybrid', threshold=0.5,
                   add_score=True, flag_bogus=True)
# Filter to real sources
real = obj[(obj['rb_score'] >= 0.5) & ((obj['flags'] & 0x800) == 0)]
print(f"Kept {len(real)} real sources")
Workflow 2: High-Precision Classification
from stdpipe import photometry, realbogus_features as rbf
# Detect objects
obj = photometry.get_objects_sep(image, thresh=2.5)
# Train IsolationForest on this specific image
features, names = rbf.extract_features(obj, image, method='hybrid')
clf = rbf.IsolationForestClassifier(contamination=0.1)
clf.fit(features)
# Classify with trend removal
obj = rbf.classify(obj, image, classifier='isolation',
                   model=clf, remove_trend=True,
                   trend_cols=['x', 'y'], add_score=True)
# Conservative threshold
real = obj[obj['rb_score'] >= 0.7]
Workflow 3: Building a Labeled Training Set
from stdpipe import photometry, realbogus_features as rbf
import numpy as np
# Use scoring classifier to get initial classifications
obj = rbf.classify(obj, image, classifier='scoring',
                   method='hybrid', add_score=True)
# High-confidence real sources
certain_real = obj[obj['rb_score'] > 0.9]
# High-confidence bogus sources
certain_bogus = obj[obj['rb_score'] < 0.1]
# Uncertain objects for manual review
uncertain = obj[(obj['rb_score'] >= 0.1) & (obj['rb_score'] <= 0.9)]
print(f"Need manual review for {len(uncertain)} objects")
# Build training labels (order must match the stacked catalog below)
labels = np.concatenate([
    np.ones(len(certain_real)),
    np.zeros(len(certain_bogus)),
    manual_labels  # User-supplied array from visual inspection, one label per row of `uncertain`
])
# Train RandomForest
from astropy.table import vstack
all_obj = vstack([certain_real, certain_bogus, uncertain])
features, _ = rbf.extract_features(all_obj, image, method='hybrid')
clf, metrics = rbf.train_classifier(features, labels,
                                    classifier='randomforest')
print(f"Trained model accuracy: {metrics['test_accuracy']:.1%}")
Workflow 4: Cross-Validation on Multiple Images
from stdpipe import photometry, realbogus_features as rbf
from sklearn.model_selection import cross_val_score
import numpy as np
# Collect features from multiple images
all_features = []
all_labels = []
for image, truth_catalog in zip(images, truth_catalogs):
    obj = photometry.get_objects_sep(image, thresh=3.0)
    features, _ = rbf.extract_features(obj, image, method='hybrid')
    # Match to truth catalog for labels (match_to_truth is a user-supplied helper)
    labels = match_to_truth(obj, truth_catalog)
    all_features.append(features)
    all_labels.extend(labels)
# Combine features
combined_features = {
    key: np.concatenate([f[key] for f in all_features])
    for key in all_features[0].keys()
}
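# Cross-validation (scikit-learn expects a 2-D array of shape (n_objects, n_features))
X = np.column_stack([combined_features[key] for key in combined_features])
clf = rbf.RandomForestClassifier(n_estimators=100)
scores = cross_val_score(clf, X, np.asarray(all_labels), cv=5)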
print(f"Cross-validation accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
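When objects from several images are pooled, a plain k-fold split can place objects from the same image in both the training and the test fold, which tends to flatter the result. A variant that holds out whole images can be sketched with scikit-learn's GroupKFold; the data below are synthetic stand-ins and the image index per object is the only extra bookkeeping required.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(3)

# Synthetic stand-in: 5 images x 200 objects, 3 features each; labels 1 = real, 0 = bogus
n_images, n_per_image = 5, 200
y = rng.integers(0, 2, size=n_images * n_per_image)
X = rng.normal(size=(y.size, 3)) + 2.0 * y[:, None]   # real sources offset in all features
groups = np.repeat(np.arange(n_images), n_per_image)  # image index of each object

forest = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(forest, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print(f"Leave-one-image-out accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```

With five images and five splits, each fold tests on one image the model has never seen, which is closer to how the classifier will be used on new data.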