The analysis module
Functions for the analysis of the extracted feature matrix.
-
hcga.analysis._compute_fold(X, y, model, indices, analysis_type, compute_shap)
Compute a single fold for parallel computation.
-
hcga.analysis._evaluate_kfold(X, y, model, folds, analysis_type, compute_shap)
Evaluate the kfolds.
-
hcga.analysis._filter_graphs(features, graph_removal=0.05)
Remove samples with more than X% bad values.
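To illustrate the kind of filtering described here, the following is a minimal sketch and not hcga's implementation. It assumes that "bad values" means NaN or infinite entries and that the threshold is given as a fraction of the feature columns; the helper name `filter_samples` is hypothetical.

```python
import numpy as np
import pandas as pd


def filter_samples(features: pd.DataFrame, max_bad_fraction: float = 0.05) -> pd.DataFrame:
    """Drop rows whose fraction of NaN or infinite entries exceeds the threshold."""
    # Assumption: infinite values count as bad, just like missing ones.
    cleaned = features.replace([np.inf, -np.inf], np.nan)
    bad_fraction = cleaned.isna().mean(axis=1)
    return features.loc[bad_fraction <= max_bad_fraction]


features = pd.DataFrame(
    {
        "f1": [1.0, np.nan, 3.0],
        "f2": [0.5, np.nan, np.inf],
        "f3": [2.0, 1.0, 4.0],
        "f4": [1.0, 2.0, 3.0],
    },
    index=["g0", "g1", "g2"],
)

# g1 has 2 bad values out of 4 (0.5) and is dropped; g2 has 1 out of 4 (0.25) and is kept.
kept = filter_samples(features, max_bad_fraction=0.3)
```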
-
hcga.analysis._filter_interpretable(features, features_info, interpretability)
Get only features with certain interpretability.
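A rough sketch of interpretability filtering follows. The layout of `features_info` is an assumption made for illustration only: here it is a dataframe indexed by feature name with an integer "interpretability" column, which may differ from what hcga actually stores. The helper name is hypothetical.

```python
import pandas as pd


def filter_by_interpretability(features, features_info, min_score):
    """Keep only the feature columns whose score is at least min_score."""
    # Hypothetical layout: one row per feature, with an "interpretability" column.
    scores = features_info["interpretability"]
    keep = set(scores.index[scores >= min_score])
    return features[[col for col in features.columns if col in keep]]


features = pd.DataFrame(
    {"degree_mean": [2.0, 3.5], "spectral_gap": [0.1, 0.4], "n_nodes": [10, 25]}
)
features_info = pd.DataFrame(
    {"interpretability": [4, 2, 5]},
    index=["degree_mean", "spectral_gap", "n_nodes"],
)

filtered = filter_by_interpretability(features, features_info, min_score=3)
```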
-
hcga.analysis._get_reduced_feature_set(X, shap_top_features, n_top_features=100, alpha=0.9)
Reduce the feature set by taking uncorrelated features.
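One common way to build such a reduced set is a greedy pass over the features ranked by importance, keeping a feature only if its absolute correlation with every feature already kept stays below a threshold. The sketch below shows that idea; it is not taken from the hcga source. The `importance` series and the helper name are hypothetical, and `alpha` is assumed to be the maximum allowed absolute Pearson correlation.

```python
import pandas as pd


def reduce_feature_set(X, importance, n_top_features=100, alpha=0.9):
    """Greedily pick the most important features that are mutually uncorrelated."""
    corr = X.corr().abs()
    ranked = importance.sort_values(ascending=False).index
    selected = []
    for name in ranked:
        # Keep the feature only if it is weakly correlated with all kept ones.
        if all(corr.loc[name, kept] < alpha for kept in selected):
            selected.append(name)
        if len(selected) == n_top_features:
            break
    return selected


X = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],  # perfectly correlated with "a"
        "c": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],
    }
)
importance = pd.Series({"a": 0.5, "b": 0.4, "c": 0.1})

selected = reduce_feature_set(X, importance, n_top_features=2, alpha=0.9)
```

In this toy example "b" is skipped despite its high importance because it duplicates the information in "a".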
-
hcga.analysis._get_shap_feature_importance(shap_values)
From a list of shap values per folds, compute the global shap feature importance.
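A usual definition of global SHAP importance is the mean absolute SHAP value of each feature over all samples. The sketch below applies that definition to a list of per-fold arrays. It assumes each fold provides a 2D array of shape (n_samples, n_features); how hcga aggregates folds, or handles multi-class output, may differ.

```python
import numpy as np


def global_shap_importance(shap_values_per_fold):
    """Mean absolute SHAP value per feature, pooled over all folds."""
    # Assumption: every fold yields an array of shape (n_samples, n_features).
    stacked = np.concatenate(shap_values_per_fold, axis=0)
    return np.abs(stacked).mean(axis=0)


fold_1 = np.array([[0.2, -0.1], [-0.4, 0.1]])
fold_2 = np.array([[0.6, 0.0], [-0.4, -0.2]])

importance = global_shap_importance([fold_1, fold_2])
```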
-
hcga.analysis._normalise_feature_data(features, scaler=None, fit_scaler=True)
Normalise the feature matrix to remove the mean and scale to unit variance.
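The sketch below shows what removing the mean and scaling to unit variance looks like, including reuse of previously fitted statistics on new data, which is presumably the purpose of the `scaler` and `fit_scaler` arguments. It is a plain pandas illustration with a hypothetical helper, not the hcga code, and it assumes the population standard deviation (ddof=0) is used.

```python
import numpy as np
import pandas as pd


def normalise(features, stats=None):
    """Standardise columns; reuse (mean, std) from an earlier call if given."""
    if stats is None:
        mean = features.mean(axis=0)
        # Constant columns get std 1 so that they map to 0 instead of NaN.
        std = features.std(axis=0, ddof=0).replace(0.0, 1.0)
        stats = (mean, std)
    mean, std = stats
    return (features - mean) / std, stats


train = pd.DataFrame({"f1": [1.0, 2.0, 3.0], "f2": [10.0, 20.0, 30.0]})
train_scaled, stats = normalise(train)

# New data is scaled with the statistics fitted on the training data.
new = pd.DataFrame({"f1": [2.0], "f2": [40.0]})
new_scaled, _ = normalise(new, stats)
```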
-
hcga.analysis._preprocess_features(features, features_info, graph_removal, interpretability, trained_model=None)
Collect all feature filters.
-
hcga.analysis._print_accuracy(acc_scores, analysis_type, reduced=False)
Print the classification or regression accuracies.
-
hcga.analysis._save_predictions_to_csv(features, predictions, folder='results')
Save the prediction results for unlabelled data.
-
hcga.analysis._save_to_csv(features_info_df, analysis_results, folder='results')
Save csv file with analysis data.
-
hcga.analysis.analysis(features, features_info, graphs=None, analysis_type='classification', folder='.', graph_removal=0.3, interpretability=1, model='XG', compute_shap=True, kfold=True, reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, plot=True, max_feats_plot=20, max_feats_plot_dendrogram=100, n_repeats=1, n_splits=None, random_state=42, test_size=0.2, trained_model=None, save_model=False)
Main function to classify graphs and plot results.
- Parameters
features (dataframe) – extracted features
features_info (dataframe) – features information
graphs (GraphCollection) – input graphs
analysis_type (str) – ‘classification’ or ‘regression’
folder (str) – folder to save analysis
graph_removal (float) – remove samples with more than graph_removal % bad values
interpretability (int) – filter out features below this interpretability
model (str) – model used to perform the analysis
compute_shap (bool) – compute SHAP values or not
kfold (bool) – run with kfold
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from Shapley analysis)
reduced_set_size (int) – number of features to keep for the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
plot (bool) – save plots
max_feats_plot (int) – maximum number of features for which to plot the feature analysis
n_repeats (int) – number of k-fold repeats
n_splits (int) – number of splits for k-fold, None=automatic estimation
random_state (int) – rng seed
test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)
trained_model (str) – provide path to pretrained model to apply to new data
save_model (bool) – save the obtained model for later reuse
- Returns
dictionary with results
- Return type
(dict)
-
hcga.analysis.classify_pairwise(features, features_info, model='XG', graph_removal=0.3, interpretability=1, n_top_features=5, reduce_set=False, reduced_set_size=100, reduced_set_max_correlation=0.5, n_repeats=1, n_splits=None, analysis_type='classification')
Classify all possible pairs of classes with kfold and return the top features.
The top features for each pair with high enough accuracies are collected in a list, for later analysis.
- Parameters
features (dataframe) – extracted features
features_info (dataframe) – features information
model (str) – model used to perform the analysis
graph_removal (float) – remove samples with more than graph_removal % bad values
n_top_features (int) – number of top features to save
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from Shapley analysis)
reduced_set_size (int) – number of features to keep for the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
n_repeats (int) – number of k-fold repeats
n_splits (int) – number of splits for k-fold, None=automatic estimation
analysis_type (str) – ‘classification’ or ‘regression’
- Returns
accuracies dataframe, list of top features, number of top pairs
- Return type
(dataframe, list, int)
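The pairwise scheme can be pictured as enumerating every unordered pair of class labels and restricting the feature table to the samples of those two classes before classifying. The sketch below shows only that enumeration step. It assumes the labels live in a column named "label", which is an illustrative choice and not confirmed by this page.

```python
from itertools import combinations

import pandas as pd

features = pd.DataFrame(
    {
        "f1": [0.1, 0.9, 0.5, 0.2, 0.8, 0.4],
        "label": ["a", "b", "c", "a", "b", "c"],  # assumed label column
    }
)

# All unordered pairs of distinct classes.
class_pairs = list(combinations(sorted(features["label"].unique()), 2))

# One sub-table per pair, containing only the samples of the two classes.
subsets = {
    pair: features[features["label"].isin(list(pair))] for pair in class_pairs
}
```

With k classes this yields k(k-1)/2 binary problems, so the cost grows quadratically with the number of classes.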
-
hcga.analysis.features_to_Xy(features)
Decompose features dataframe to numpy arrays X and y.
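As a minimal sketch of such a decomposition, the helper below splits a dataframe into a feature matrix and a target vector. The column name "label" and the helper name `to_Xy` are assumptions for illustration; the actual column layout used by hcga is not described on this page.

```python
import pandas as pd


def to_Xy(features, label_column="label"):
    """Split a feature dataframe into a 2D array X and a 1D array y."""
    # Assumption: the target is stored in a single column of the dataframe.
    X = features.drop(columns=[label_column]).to_numpy()
    y = features[label_column].to_numpy()
    return X, y


features = pd.DataFrame(
    {"f1": [0.1, 0.9, 0.5], "f2": [1.0, 2.0, 3.0], "label": [0, 1, 0]}
)
X, y = to_Xy(features)
```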
-
hcga.analysis.fit_model(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, test_size=0.2, random_state=42, compute_shap=True)
Train a single model.
- Parameters
features (dataframe) – extracted features
model (str) – model used to perform the analysis
analysis_type (str) – ‘classification’ or ‘regression’
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from Shapley analysis)
reduced_set_size (int) – number of features to keep for the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
random_state (int) – rng seed
test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)
compute_shap (bool) – compute SHAP values or not
- Returns
dictionary with results
- Return type
(dict)
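The `test_size` and `random_state` parameters refer to sklearn.model_selection.ShuffleSplit. The sketch below shows how a single train/test split with the default values could be drawn using that class. The toy arrays are made up, and whether hcga calls ShuffleSplit in exactly this way is an assumption.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# One random split holding out 20% of the samples for testing.
splitter = ShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```

Fixing `random_state` makes the split reproducible across runs.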
-
hcga.analysis.fit_model_kfold(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, n_repeats=1, random_state=42, n_splits=None, compute_shap=True)
Classify graphs from extracted features with kfold.
- Parameters
features (dataframe) – extracted features
model (str) – model used to perform the analysis
analysis_type (str) – ‘classification’ or ‘regression’
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from Shapley analysis)
reduced_set_size (int) – number of features to keep for the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
n_repeats (int) – number of k-fold repeats
random_state (int) – rng seed
n_splits (int) – number of splits for k-fold, None=automatic estimation
compute_shap (bool) – compute SHAP values or not
- Returns
dictionary with results
- Return type
(dict)
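To make the roles of `n_splits` and `n_repeats` concrete, the sketch below generates repeated stratified k-fold splits with scikit-learn. The choice of RepeatedStratifiedKFold is an assumption for illustration and not necessarily what hcga uses. `n_splits` is set explicitly here because the automatic estimation used for `None` is not described on this page.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data: 20 samples, 3 features, two balanced classes.
X = np.random.default_rng(0).normal(size=(20, 3))
y = np.array([0] * 10 + [1] * 10)

n_splits, n_repeats = 5, 2
cv = RepeatedStratifiedKFold(
    n_splits=n_splits, n_repeats=n_repeats, random_state=42
)
folds = list(cv.split(X, y))

# Count how often each sample lands in a test fold.
test_counts = np.zeros(len(y), dtype=int)
for train_idx, test_idx in folds:
    test_counts[test_idx] += 1
```

Each repeat reshuffles the data, so every sample is tested once per repeat and the spread of scores across repeats gives a sense of their variability.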