The analysis module

Functions for the analysis of the extracted feature matrix.

hcga.analysis._compute_fold(X, y, model, indices, analysis_type, compute_shap): Compute a single fold for parallel computation.

hcga.analysis._evaluate_kfold(X, y, model, folds, analysis_type, compute_shap): Evaluate the kfolds.

hcga.analysis._filter_features(features): Filter features and create feature matrix.

hcga.analysis._filter_graphs(features, graph_removal=0.05): Remove samples whose fraction of bad values exceeds graph_removal.
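
The filtering step can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name filter_samples is hypothetical, and it assumes that a "bad" value means NaN or infinite and that the threshold is a fraction between 0 and 1.

```python
import numpy as np


def filter_samples(matrix, max_bad_fraction=0.05):
    """Keep rows whose fraction of NaN/inf entries is at most max_bad_fraction.

    Hypothetical helper: the meaning of "bad" (NaN or infinite) is an assumption.
    """
    matrix = np.asarray(matrix, dtype=float)
    bad = ~np.isfinite(matrix)        # True where a value is NaN or +/-inf
    bad_fraction = bad.mean(axis=1)   # per-row fraction of bad values
    keep = bad_fraction <= max_bad_fraction
    return matrix[keep], keep


features = np.array([
    [1.0, 2.0, 3.0, 4.0],          # no bad values
    [np.nan, np.nan, 3.0, 4.0],    # half of the values are bad
    [1.0, np.inf, 3.0, 4.0],       # a quarter of the values are bad
])
filtered, keep = filter_samples(features, max_bad_fraction=0.3)
```

With a threshold of 0.3, only the second row is dropped.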

hcga.analysis._filter_interpretable(features, features_info, interpretability): Get only features with certain interpretability.

hcga.analysis._get_model(model, analysis_type): Get a model.

hcga.analysis._get_reduced_feature_set(X, shap_top_features, n_top_features=100, alpha=0.9): Reduce the feature set by taking uncorrelated features.
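
One way to take uncorrelated features is a greedy pass over the features in order of importance, sketched below. This is an illustration under assumptions, not the library's code: the helper reduce_feature_set is hypothetical, features are column indices of a NumPy array, and alpha is read as the maximum absolute Pearson correlation allowed between any two kept features.

```python
import numpy as np


def reduce_feature_set(X, ranked_features, n_top_features=100, alpha=0.9):
    """Greedily keep top-ranked columns that are not too correlated with kept ones.

    Hypothetical helper: ranked_features lists column indices, most important first.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))  # absolute feature-feature correlations
    selected = []
    for idx in ranked_features:
        # Keep the feature only if it is weakly correlated with every kept feature.
        if all(corr[idx, kept] <= alpha for kept in selected):
            selected.append(idx)
        if len(selected) == n_top_features:
            break
    return selected


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
# Column 1 is almost an exact linear copy of column 0; column 2 is independent.
X = np.column_stack([a, 2.0 * a + 1e-3 * rng.normal(size=200), b])
selected = reduce_feature_set(X, ranked_features=[0, 1, 2], n_top_features=2)
```

Here the near-duplicate column 1 is skipped, so the reduced set contains columns 0 and 2.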

hcga.analysis._get_shap_feature_importance(shap_values): From a list of SHAP values per fold, compute the global SHAP feature importance.
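
A common way to turn per-fold SHAP values into one importance score per feature is to average over folds and then take the mean absolute value over samples. The sketch below shows that aggregation only. It assumes each fold provides an array of shape (n_samples, n_features) for the same samples, and the helper name global_shap_importance is hypothetical; the library's exact aggregation may differ.

```python
import numpy as np


def global_shap_importance(shap_values_per_fold):
    """Average SHAP values over folds, then take the mean |value| per feature."""
    stacked = np.stack(shap_values_per_fold)   # (n_folds, n_samples, n_features)
    mean_over_folds = stacked.mean(axis=0)     # (n_samples, n_features)
    return np.abs(mean_over_folds).mean(axis=0)


# Two folds, two samples, two features (made-up numbers for illustration).
fold_1 = np.array([[0.2, -0.1], [-0.4, 0.1]])
fold_2 = np.array([[0.4, -0.1], [-0.2, 0.1]])
importance = global_shap_importance([fold_1, fold_2])
```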

hcga.analysis._normalise_feature_data(features, scaler=None, fit_scaler=True): Normalise the feature matrix to remove the mean and scale to unit variance.
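
Removing the mean and scaling to unit variance is a per-feature z-score. The NumPy sketch below shows the computation and how statistics fitted on one dataset can be reused on new data, which is presumably the role of the scaler and fit_scaler arguments. The helper standardise and its mean/std arguments are hypothetical stand-ins, not the library's interface.

```python
import numpy as np


def standardise(X, mean=None, std=None):
    """Z-score each column; fit the statistics when they are not supplied."""
    X = np.asarray(X, dtype=float)
    if mean is None or std is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std, mean, std


train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled, mean, std = standardise(train)  # fit on the training data

# Reuse the fitted statistics on unseen data instead of refitting.
new_scaled, _, _ = standardise(np.array([[2.0, 20.0]]), mean, std)
```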

hcga.analysis._number_folds(y): Get number of folds.

hcga.analysis._preprocess_features(features, features_info, graph_removal, interpretability, trained_model=None): Collect all feature filters.

hcga.analysis._print_accuracy(acc_scores, analysis_type, reduced=False): Print the classification or regression accuracies.

hcga.analysis._save_predictions_to_csv(features, predictions, folder='results'): Save the prediction results for unlabelled data.

hcga.analysis._save_to_csv(features_info_df, analysis_results, folder='results'): Save csv file with analysis data.

hcga.analysis.analysis(features, features_info, graphs=None, analysis_type='classification', folder='.', graph_removal=0.3, interpretability=1, model='XG', compute_shap=True, kfold=True, reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, plot=True, max_feats_plot=20, max_feats_plot_dendrogram=100, n_repeats=1, n_splits=None, random_state=42, test_size=0.2, trained_model=None, save_model=False)

Main function to classify graphs and plot results.

Parameters

features (dataframe) – extracted features
features_info (dataframe) – features information
graphs (GraphCollection) – input graphs
analysis_type (str) – ‘classification’ or ‘regression’
folder (str) – folder to save analysis
graph_removal (float) – remove samples in which more than this fraction of values are bad
interpretability (int) – filter out features below this interpretability
model (str) – model used to perform the analysis
compute_shap (bool) – compute SHAP values or not
kfold (bool) – run with kfold
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from SHAP analysis)
reduced_set_size (int) – number of features to keep in the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
plot (bool) – save plots
max_feats_plot (int) – max number of feature analysis to plot
n_repeats (int) – number of k-fold repeats
n_splits (int) – number of splits for k-fold, None=automatic estimation
random_state (int) – rng seed
test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)
trained_model (str) – provide path to pretrained model to apply to new data
save_model (bool) – save the obtained model to reuse later

Returns

dictionary with results

Return type

(dict)

hcga.analysis.classify_pairwise(features, features_info, model='XG', graph_removal=0.3, interpretability=1, n_top_features=5, reduce_set=False, reduced_set_size=100, reduced_set_max_correlation=0.5, n_repeats=1, n_splits=None, analysis_type='classification')

Classify all possible pairs of classes with kfold and return the top features.

The top features for each pair with high enough accuracies are collected in a list, for later analysis.

Parameters

features (dataframe) – extracted features
features_info (dataframe) – features information
model (str) – model used to perform the analysis
graph_removal (float) – remove samples in which more than this fraction of values are bad
n_top_features (int) – number of top features to save
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from SHAP analysis)
reduced_set_size (int) – number of features to keep in the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
n_repeats (int) – number of k-fold repeats
n_splits (int) – number of splits for k-fold, None=automatic estimation
analysis_type (str) – ‘classification’ or ‘regression’

Returns

accuracies dataframe, list of top features, number of top pairs

Return type

(dataframe, list, int)
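
Enumerating all possible pairs of classes can be sketched with itertools.combinations. The snippet only shows how the pairs and the per-pair subsets might be formed; the helpers class_pairs and select_pair and the toy data are hypothetical, and the classification step itself is left out.

```python
from itertools import combinations


def class_pairs(labels):
    """Return every unordered pair of distinct class labels."""
    return list(combinations(sorted(set(labels)), 2))


def select_pair(samples, labels, pair):
    """Keep only the samples whose label belongs to the given pair."""
    return [(s, l) for s, l in zip(samples, labels) if l in pair]


labels = ["a", "b", "c", "a", "c"]
samples = [10, 11, 12, 13, 14]

pairs = class_pairs(labels)                       # three pairs for three classes
subset = select_pair(samples, labels, pairs[0])   # samples labelled "a" or "b"
```

For k classes this yields k(k-1)/2 binary problems, each of which would then be evaluated separately.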

hcga.analysis.features_to_Xy(features): Decompose features dataframe to numpy arrays X and y.

hcga.analysis.fit_model(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, test_size=0.2, random_state=42, compute_shap=True)

Train a single model.

Parameters

features (dataframe) – extracted features
model (str) – model used to perform the analysis
analysis_type (str) – ‘classification’ or ‘regression’
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from SHAP analysis)
reduced_set_size (int) – number of features to keep in the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
random_state (int) – rng seed
test_size (float) – size of test dataset (see sklearn.model_selection.ShuffleSplit)
compute_shap (bool) – compute SHAP values or not

Returns

dictionary with results

Return type

(dict)

hcga.analysis.fit_model_kfold(features, model, analysis_type='classification', reduce_set=True, reduced_set_size=100, reduced_set_max_correlation=0.9, n_repeats=1, random_state=42, n_splits=None, compute_shap=True)

Classify graphs from extracted features with kfold.

Parameters

features (dataframe) – extracted features
model (str) – model used to perform the analysis
analysis_type (str) – ‘classification’ or ‘regression’
reduce_set (bool) – if True, the classification will be rerun on a reduced set of top features (from SHAP analysis)
reduced_set_size (int) – number of features to keep in the reduced set
reduced_set_max_correlation (float) – maximum correlation allowed between top features; more highly correlated features are discarded from the reduced set
n_repeats (int) – number of k-fold repeats
random_state (int) – rng seed
n_splits (int) – number of splits for k-fold, None=automatic estimation
compute_shap (bool) – compute SHAP values or not

Returns

dictionary with results

Return type

(dict)
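
How n_splits, n_repeats and random_state interact can be seen in a plain repeated k-fold index generator. This is a simplified, unstratified sketch in NumPy; the generator repeated_kfold_indices is hypothetical, and the library may rely on a stratified splitter instead.

```python
import numpy as np


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=1, random_state=42):
    """Yield (train_idx, test_idx) for each fold of each repeat."""
    rng = np.random.default_rng(random_state)
    for _ in range(n_repeats):
        permutation = rng.permutation(n_samples)       # new shuffle per repeat
        folds = np.array_split(permutation, n_splits)  # near-equal partitions
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate(
                [fold for j, fold in enumerate(folds) if j != i]
            )
            yield train_idx, test_idx


splits = list(repeated_kfold_indices(n_samples=10, n_splits=5, n_repeats=2))
```

Each repeat reshuffles the samples, so the total number of fitted models is n_splits times n_repeats.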

hcga.analysis.predict_unlabelled(model, features): Predict unlabelled data.

hcga.analysis.train_all(features, model): Train on all available data.
