The cli app

hcga app module.

Users can interact with hcga directly from the command line using the purpose-built command-line interface (CLI) app.

Users who wish to interact with hcga from Python directly (e.g. in a notebook) should use the hcga class instead.

The following commands fetch the ENZYMES dataset, extract its features, and analyse them directly from the command line:

hcga get_data ENZYMES

hcga extract_features datasets/ENZYMES.pkl -m fast -n -1 --timeout 10

hcga feature_analysis ENZYMES

Alternatively, these commands can be bundled into a single bash file; see ‘run_example.sh’ in the examples folder.
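The same three steps can also be driven from a Python script. The following is a minimal sketch, not part of hcga: the helper names `build_pipeline` and `run_pipeline` are hypothetical, and it assumes that the `datasets/<name>.pkl` path used for ENZYMES in the example above carries over to other dataset names.

```python
import shutil
import subprocess


def build_pipeline(dataset="ENZYMES", folder="datasets"):
    """Return the three hcga commands from the example above as argument lists."""
    return [
        ["hcga", "get_data", dataset],
        # Assumption: the dataset file is saved as <folder>/<dataset>.pkl
        ["hcga", "extract_features", f"{folder}/{dataset}.pkl",
         "-m", "fast", "-n", "-1", "--timeout", "10"],
        ["hcga", "feature_analysis", dataset],
    ]


def run_pipeline(dataset="ENZYMES"):
    """Run each step in order, stopping at the first failure."""
    if shutil.which("hcga") is None:
        raise RuntimeError("the hcga command was not found on PATH")
    for command in build_pipeline(dataset):
        # check=True raises CalledProcessError if a step exits with a non-zero status
        subprocess.run(command, check=True)
```

Calling `run_pipeline("ENZYMES")` would then execute the three commands one after the other; `build_pipeline` on its own only constructs the argument lists and runs nothing.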

cli

Cli.

cli [OPTIONS] COMMAND [ARGS]...

Options

-v, --verbose

extract_features

Extract features from dataset of graphs and save the feature matrix, info and labels.

cli extract_features [OPTIONS] DATASET

Options

-rf, --results-folder <results_folder>

Location of results

Default

results

-n, --n-workers <n_workers>

Number of workers for multiprocessing

Default

1

-m, --mode <mode>

Mode of features to extract (fast, medium, slow)

Default

fast

--norm, --no-norm

Normalise features by the number of edges/nodes (by default not)

Default

True

--node-feat, --no-node-feat

Use node features if any.

Default

True

-sl, --stats-level <stats_level>

Level of statistical features (basic, medium, advanced)

Default

advanced

--timeout <timeout>

Timeout for feature evaluations.

Default

10.0

-of, --output-file <output_file>

Location of results, by default same as initial dataset

--runtimes, --no-runtimes

Output runtimes

Default

False

--connected, --no-connected

Remove disconnected components

Default

False

Arguments

DATASET

Required argument

feature_analysis

Analysis of the features extracted in feature_file.

cli feature_analysis [OPTIONS] DATASET

Options

-rf, --results-folder <results_folder>

Location of results

Default

./results

-ff, --feature-file <feature_file>

Location of features

Default

all_features.pkl

--analysis-type <analysis_type>

Type of analysis: classification, regression or unsupervised.

Default

classification

--graph-removal <graph_removal>

Fraction of failed features above which a graph is removed from the dataset.

Default

0.3
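To illustrate what such a threshold means, here is a minimal sketch, assuming that a failed feature is stored as NaN and that a graph is kept when its failed fraction does not exceed the threshold. It is not hcga's implementation; the function names and the toy data are hypothetical.

```python
import math


def failed_fraction(feature_row):
    """Fraction of entries in one graph's feature row that are NaN."""
    failed = sum(1 for value in feature_row if math.isnan(value))
    return failed / len(feature_row)


def remove_failed_graphs(features, graph_removal=0.3):
    """Keep only graphs whose fraction of failed features is at most graph_removal."""
    return {
        graph_id: row
        for graph_id, row in features.items()
        if failed_fraction(row) <= graph_removal
    }


nan = math.nan
features = {
    "g0": [1.0, 2.0, 3.0, 4.0],  # 0 of 4 features failed
    "g1": [1.0, nan, 3.0, 4.0],  # 1 of 4 features failed (0.25)
    "g2": [nan, nan, 3.0, 4.0],  # 2 of 4 features failed (0.5)
}
kept = remove_failed_graphs(features)
print(sorted(kept))  # ['g0', 'g1']
```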

-i, --interpretability <interpretability>

Interpretability of features to consider

Default

1

-m, --model <model>

Model for feature analysis (RF, LGBM, XG)

Default

XG

--kfold, --no-kfold

Use k-fold cross-validation

Default

True

--reduce-set, --no-reduce-set

Whether to recompute accuracies with a reduced set of top features.

Default

True

--reduced-set-size <reduced_set_size>

Number of uncorrelated top features to consider in top reduced feature classification.

Default

100

--reduced-set-max-correlation <reduced_set_max_correlation>

Maximum correlation to allow for selection of top features for reduced feature classification.

Default

0.9
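One way to picture how a set size and a maximum correlation can interact is a greedy pass over features ranked by importance. The sketch below is an assumption about the general technique, not hcga's actual selection code; `pearson`, `select_reduced_set` and the toy columns are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant columns are treated as uncorrelated (a choice made for this sketch)
    return cov / math.sqrt(var_x * var_y)


def select_reduced_set(columns, ranked_names, size=100, max_correlation=0.9):
    """Greedily keep top-ranked features that are not too correlated with those already kept."""
    selected = []
    for name in ranked_names:
        if len(selected) >= size:
            break
        if all(
            abs(pearson(columns[name], columns[other])) <= max_correlation
            for other in selected
        ):
            selected.append(name)
    return selected


columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],    # almost a multiple of "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with "a"
}
print(select_reduced_set(columns, ["a", "b", "c"]))  # ['a', 'c']
```

Here "b" is skipped because its correlation with "a" exceeds 0.9, while "c" is kept.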

-p, --plot, -np, --no-plot

Optionally plot analysis results

Default

True

--max-feats-plot <max_feats_plot>

Number of top features to plot with violins.

Default

20

--n-splits <n_splits>

Number of splits for k-fold; None uses an automatic estimate.

--n-repeats <n_repeats>

Number of repeats of k-folds for better averaged accuracies.

Default

1
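The relationship between the number of splits and the number of repeats can be sketched in plain Python. This is an illustration of repeated k-fold in general, assuming a simple unstratified shuffle; it is not the splitter hcga uses, and `repeated_kfold` is a hypothetical name.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=1, seed=0):
    """Yield (train_indices, test_indices) for n_repeats shuffled k-fold passes."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repeat
        for fold in range(n_splits):
            test = indices[fold::n_splits]  # every n_splits-th sample goes to this fold
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield train, test


splits = list(repeated_kfold(n_samples=10, n_splits=5, n_repeats=2))
print(len(splits))  # 10
```

Each repeat contributes `n_splits` train/test pairs, so accuracies are averaged over `n_splits * n_repeats` evaluations in total.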

Arguments

DATASET

Required argument

get_data

Generate the benchmark or test data.

DATASET_NAME can be either:
  • TESTDATA: to generate a synthetic dataset for testing

  • DD, ENZYMES, REDDIT-MULTI-12K, PROTEINS, MUTAG, or any other dataset hosted on https://ls11-www.cs.tu-dortmund.de/people/morris/graphkerneldatasets

cli get_data [OPTIONS] DATASET_NAME

Options

-f, --folder <folder>

Location to save dataset

Default

./datasets

Arguments

DATASET_NAME

Required argument