The analysis module

Functions for analysing the extracted feature matrix.

hcga.analysis._compute_fold(X, y, model, indices, analysis_type, compute_shap)

Compute a single fold for parallel computation.

hcga.analysis._evaluate_kfold(X, y, model, folds, analysis_type, compute_shap)

Evaluate the kfolds.

hcga.analysis._filter_features(features)

Filter features and create feature matrix.

hcga.analysis._filter_graphs(features, graph_removal=0.05)

Remove samples whose fraction of bad values exceeds graph_removal.
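
As a rough illustration of this kind of filter, the sketch below drops rows whose fraction of missing or non-finite entries exceeds a threshold. It is not hcga's implementation: the helper name `drop_bad_rows`, the plain list-of-lists layout, and the reading of `graph_removal` as a fraction between 0 and 1 are assumptions made for the example.

```python
import math


def drop_bad_rows(rows, graph_removal=0.05):
    """Keep rows whose fraction of bad (None, NaN or infinite) values is at most graph_removal.

    Hypothetical helper: treats graph_removal as a fraction in [0, 1].
    """

    def is_bad(value):
        return value is None or not math.isfinite(value)

    kept = []
    for row in rows:
        bad_fraction = sum(is_bad(v) for v in row) / len(row)
        if bad_fraction <= graph_removal:
            kept.append(row)
    return kept


rows = [
    [1.0, 2.0, 3.0, 4.0],             # no bad values
    [1.0, float("nan"), 3.0, 4.0],    # 25% bad
    [None, float("inf"), 3.0, 4.0],   # 50% bad
]
filtered = drop_bad_rows(rows, graph_removal=0.3)
```

With a threshold of 0.3, only the last row (half of its values are bad) is dropped.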

hcga.analysis._filter_interpretable(features, features_info, interpretability)

Get only features with certain interpretability.

hcga.analysis._get_model(model, analysis_type)

Get a model.

hcga.analysis._get_reduced_feature_set(X, shap_top_features, n_top_features=100, alpha=0.9)

Reduce the feature set by taking uncorrelated features.
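
One common way to obtain such a set is a greedy pass over features ranked by importance, keeping a feature only if its absolute Pearson correlation with every feature already kept stays below a threshold. The sketch below illustrates that idea under stated assumptions: the helper names `pearson` and `reduce_features`, the dict-of-columns layout, and the interpretation of `alpha` as a maximum absolute correlation are hypothetical and not taken from hcga.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def reduce_features(columns, ranked_names, n_top_features=100, alpha=0.9):
    """Greedily keep top-ranked features that are not too correlated with those already kept."""
    selected = []
    for name in ranked_names:
        if len(selected) >= n_top_features:
            break
        if all(abs(pearson(columns[name], columns[s])) < alpha for s in selected):
            selected.append(name)
    return selected


columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with "a"
}
selected = reduce_features(columns, ranked_names=["a", "b", "c"], alpha=0.9)
```

Here "b" is skipped because it duplicates the information in "a", while "c" is kept.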

hcga.analysis._get_shap_feature_importance(shap_values)

From a list of SHAP values per fold, compute the global SHAP feature importance.
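
A widely used convention for global SHAP importance is the mean absolute SHAP value of each feature. The sketch below applies that convention across folds by pooling all samples. Whether hcga pools samples this way or averages per fold first is not stated here, so the helper `global_importance` and its nested-list input layout should be read as assumptions.

```python
def global_importance(shap_values_per_fold):
    """Mean absolute SHAP value per feature, pooled over every sample of every fold.

    shap_values_per_fold: list of folds, each a list of samples,
    each sample a list with one SHAP value per feature (assumed layout).
    """
    samples = [sample for fold in shap_values_per_fold for sample in fold]
    n_features = len(samples[0])
    return [
        sum(abs(sample[j]) for sample in samples) / len(samples)
        for j in range(n_features)
    ]


folds = [
    [[0.5, -1.0], [-0.5, 1.0]],  # fold 1: two samples, two features
    [[2.0, 0.0]],                # fold 2: one sample
]
importance = global_importance(folds)
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out.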

hcga.analysis._normalise_feature_data(features, scaler=None, fit_scaler=True)

Normalise the feature matrix to remove the mean and scale to unit variance.
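
The transformation described is the standard z-score, z = (x - mean) / std, computed per feature column. The sketch below shows it together with the fit-once, reuse-later pattern that the `scaler` and `fit_scaler` arguments suggest. The `ColumnScaler` class is a hypothetical stand-in written for this example; the actual function may rely on a library scaler instead.

```python
import math


class ColumnScaler:
    """Hypothetical minimal scaler: stores per-column mean and standard deviation."""

    def fit(self, rows):
        n = len(rows)
        columns = list(zip(*rows))
        self.means = [sum(col) / n for col in columns]
        # Population standard deviation; constant columns fall back to 1.0
        # to avoid division by zero.
        self.stds = [
            math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0
            for col, m in zip(columns, self.means)
        ]
        return self

    def transform(self, rows):
        return [
            [(v - m) / s for v, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


train = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
scaler = ColumnScaler().fit(train)            # analogous to fit_scaler=True
train_scaled = scaler.transform(train)
new_scaled = scaler.transform([[3.0, 12.0]])  # reuse the fitted scaler on new data
```

Reusing the statistics learned on the training data, rather than refitting on new data, keeps both sets on the same scale.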

hcga.analysis._number_folds(y)

Get number of folds.

hcga.analysis._preprocess_features(features, features_info, graph_removal, interpretability, trained_model=None)

Collect all feature filters.

hcga.analysis._print_accuracy(acc_scores, analysis_type, reduced=False)

Print the classification or regression accuracies.

hcga.analysis._save_predictions_to_csv(features, predictions, folder='results')

Save the prediction results for unlabelled data.

hcga.analysis._save_to_csv(features_info_df, analysis_results, folder='results')

Save csv file with analysis data.

hcga.analysis.analysis(features, features_info, graphs=None, analysis_type='classification', folder='.', graph_removal=0.3, interpretability=1, model='XG', compute_shap=True, kfold=True, reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, plot=True, max_feats_plot=20, max_feats_plot_dendrogram=100, n_repeats=1, n_splits=None, random_state=42, test_size=0.2, trained_model=None, save_model=False)

Main function to classify graphs and plot results.

Parameters
  • features (dataframe) – extracted features

  • features_info (dataframe) – features information

  • graphs (GraphCollection) – input graphs

  • analysis_type (str) – ‘classification’ or ‘regression’

  • folder (str) – folder to save analysis

  • graph_removal (float) – remove samples in which more than this fraction of values are bad (e.g. 0.3 means 30%)

  • interpretability (int) – filter out features below this interpretability

  • model (str) – model used to perform the analysis

  • compute_shap (bool) – compute SHAP values or not

  • kfold (bool) – run with kfold

  • reduce_set (bool) – if True, the classification is rerun on a reduced set of top features (from the SHAP analysis)

  • reduced_set_size (int) – number of features to keep in the reduced set

  • reduced_set_max_correlation (float) – correlation threshold used to discard highly correlated top features from the reduced set

  • plot (bool) – save plots

  • max_feats_plot (int) – max number of feature analysis to plot

  • n_repeats (int) – number of k-fold repeats

  • n_splits (int) – number of splits for k-fold; None = automatic estimation

  • random_state (int) – rng seed

  • test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)

  • trained_model (str) – provide path to pretrained model to apply to new data

  • save_model (bool) – save the obtained model to reuse later

Returns

dictionary with results

Return type

(dict)

hcga.analysis.classify_pairwise(features, features_info, model='XG', graph_removal=0.3, interpretability=1, n_top_features=5, reduce_set=False, reduced_set_size=100, reduced_set_max_correlation=0.5, n_repeats=1, n_splits=None, analysis_type='classification')

Classify all possible pairs of classes with k-fold and return the top features.

The top features for each pair with high enough accuracies are collected in a list, for later analysis.
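
Enumerating the class pairs can be sketched with `itertools.combinations`. The snippet below only shows how samples would be split into pairwise subsets; the helper `pairwise_subsets` and the (sample, label) tuple layout are illustrative assumptions, and the per-pair model fitting is left out.

```python
from itertools import combinations


def pairwise_subsets(samples):
    """Yield ((class_a, class_b), subset) for every unordered pair of distinct class labels."""
    labels = sorted({label for _, label in samples})
    for class_a, class_b in combinations(labels, 2):
        subset = [(x, label) for x, label in samples if label in (class_a, class_b)]
        yield (class_a, class_b), subset


samples = [("g0", "A"), ("g1", "B"), ("g2", "C"), ("g3", "A")]
pairs = dict(pairwise_subsets(samples))
```

With k classes this produces k(k-1)/2 binary problems, so three classes give three pairs.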

Parameters
  • features (dataframe) – extracted features

  • features_info (dataframe) – features information

  • model (str) – model used to perform the analysis

  • graph_removal (float) – remove samples in which more than this fraction of values are bad (e.g. 0.3 means 30%)

  • n_top_features (int) – number of top features to save

  • reduce_set (bool) – if True, the classification is rerun on a reduced set of top features (from the SHAP analysis)

  • reduced_set_size (int) – number of features to keep in the reduced set

  • reduced_set_max_correlation (float) – correlation threshold used to discard highly correlated top features from the reduced set

  • n_repeats (int) – number of k-fold repeats

  • n_splits (int) – number of splits for k-fold; None = automatic estimation

  • analysis_type (str) – ‘classification’ or ‘regression’

Returns

accuracies dataframe, list of top features, number of top pairs

Return type

(dataframe, list, int)

hcga.analysis.features_to_Xy(features)

Decompose features dataframe to numpy arrays X and y.

hcga.analysis.fit_model(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, test_size=0.2, random_state=42, compute_shap=True)

Train a single model.

Parameters
  • features (dataframe) – extracted features

  • model (str) – model used to perform the analysis

  • analysis_type (str) – ‘classification’ or ‘regression’

  • reduce_set (bool) – if True, the classification is rerun on a reduced set of top features (from the SHAP analysis)

  • reduced_set_size (int) – number of features to keep in the reduced set

  • reduced_set_max_correlation (float) – correlation threshold used to discard highly correlated top features from the reduced set

  • random_state (int) – rng seed

  • test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)

  • compute_shap (bool) – compute SHAP values or not

Returns

dictionary with results

Return type

(dict)
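
To make the roles of `test_size` and `random_state` concrete, the sketch below performs one seeded shuffle split in the spirit of sklearn.model_selection.ShuffleSplit. It is a standard-library approximation written for illustration; the helper `shuffle_split` is hypothetical and does not reproduce scikit-learn's exact index order.

```python
import math
import random


def shuffle_split(n_samples, test_size=0.2, random_state=42):
    """Return (train_indices, test_indices) from one seeded random shuffle."""
    rng = random.Random(random_state)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Round the test set size up, so a non-zero test_size always yields test samples.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(10, test_size=0.2, random_state=42)
```

Fixing the seed makes the split reproducible, which is why repeated calls with the same `random_state` return identical indices.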

hcga.analysis.fit_model_kfold(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, n_repeats=1, random_state=42, n_splits=None, compute_shap=True)

Classify graphs from extracted features with kfold.

Parameters
  • features (dataframe) – extracted features

  • model (str) – model used to perform the analysis

  • analysis_type (str) – ‘classification’ or ‘regression’

  • reduce_set (bool) – if True, the classification is rerun on a reduced set of top features (from the SHAP analysis)

  • reduced_set_size (int) – number of features to keep in the reduced set

  • reduced_set_max_correlation (float) – correlation threshold used to discard highly correlated top features from the reduced set

  • n_repeats (int) – number of k-fold repeats

  • random_state (int) – rng seed

  • n_splits (int) – number of splits for k-fold; None = automatic estimation

  • compute_shap (bool) – compute SHAP values or not

Returns

dictionary with results

Return type

(dict)
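
The interaction of `n_splits`, `n_repeats` and `random_state` can be sketched as a repeated k-fold index generator. This is a plain, non-stratified version using only the standard library; the helper `repeated_kfold_indices` is hypothetical, and hcga may well use a stratified splitter from scikit-learn instead.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=1, random_state=42):
    """Yield (train_indices, test_indices) for each fold of each repeat."""
    rng = random.Random(random_state)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        # Distribute indices round-robin so fold sizes differ by at most one.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield train, test


splits = list(repeated_kfold_indices(n_samples=10, n_splits=5, n_repeats=2))
```

Each repeat yields `n_splits` train/test pairs, so the model is fitted `n_splits * n_repeats` times in total and every sample is tested once per repeat.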

hcga.analysis.predict_unlabelled(model, features)

Predict unlabelled data.

hcga.analysis.train_all(features, model)

Train on all available data.