The cli app
Hcga app module.
Users can interact with hcga directly from the command line using the purpose-built command line interface app.
Users who prefer to work with hcga from Python (e.g. in a notebook) should use the hcga class instead.
Below is a short example of the commands needed to run the ENZYMES dataset from the command line:
hcga get_data ENZYMES
hcga extract_features datasets/ENZYMES.pkl -m fast -n -1 --timeout 10
hcga feature_analysis ENZYMES
Alternatively, these commands can be bundled into a single bash script; see ‘run_example.sh’ in the examples folder.
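For users who would rather drive the same three commands from Python, the following is a minimal sketch that assembles them as argument lists and, optionally, runs them with `subprocess`. The helper names `pipeline_commands` and `run_pipeline` are hypothetical and not part of hcga. The sketch also assumes that the dataset file for any dataset follows the `datasets/<NAME>.pkl` pattern seen in the ENZYMES example, which the text above only shows for ENZYMES.

```python
import shlex
import subprocess


def pipeline_commands(dataset="ENZYMES", folder="datasets"):
    """Return the three example commands as argv lists.

    Assumption: the dataset is stored as <folder>/<dataset>.pkl, as in the
    ENZYMES example; this is not confirmed for other datasets.
    """
    return [
        ["hcga", "get_data", dataset],
        ["hcga", "extract_features", f"{folder}/{dataset}.pkl",
         "-m", "fast", "-n", "-1", "--timeout", "10"],
        ["hcga", "feature_analysis", dataset],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing command.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run by default, so this works even where hcga is not installed.
    run_pipeline(pipeline_commands())
```

Calling `run_pipeline(pipeline_commands(), dry_run=False)` would execute the commands in order, provided the `hcga` executable is on the `PATH`.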
cli
Cli.
cli [OPTIONS] COMMAND [ARGS]...
Options
-v, --verbose
extract_features
Extract features from a dataset of graphs and save the feature matrix, info and labels.
cli extract_features [OPTIONS] DATASET
Options
-rf, --results-folder <results_folder>
    Location of results
    Default: results
-n, --n-workers <n_workers>
    Number of workers for multiprocessing
    Default: 1
-m, --mode <mode>
    Mode of features to extract (fast, medium, slow)
    Default: fast
--norm, --no-norm
    Normalise features by number of edges/nodes (by default not)
    Default: True
--node-feat, --no-node-feat
    Use node features if any.
    Default: True
-sl, --stats-level <stats_level>
    Level of statistical features (basic, medium, advanced)
    Default: advanced
--timeout <timeout>
    Timeout for feature evaluations.
    Default: 10.0
-of, --output-file <output_file>
    Location of results, by default same as initial dataset
--runtimes, --no-runtimes
    Output runtimes
    Default: False
--connected, --no-connected
    Remove disconnected components
    Default: False
Arguments
DATASET
    Required argument
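The options above can be combined freely on one command line. As a rough illustration, the sketch below builds an `extract_features` invocation from keyword arguments and rejects values outside the choices listed for `--mode` and `--stats-level`. The function `extract_features_argv` is a hypothetical helper, not part of hcga; it only uses flags documented above, and its defaults mirror the listed defaults.

```python
MODES = ("fast", "medium", "slow")
STATS_LEVELS = ("basic", "medium", "advanced")


def extract_features_argv(dataset, mode="fast", n_workers=1, timeout=10.0,
                          stats_level="advanced", runtimes=False,
                          connected=False):
    """Build the argv list for `hcga extract_features` (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if stats_level not in STATS_LEVELS:
        raise ValueError(
            f"stats_level must be one of {STATS_LEVELS}, got {stats_level!r}"
        )
    return [
        "hcga", "extract_features", dataset,
        "--mode", mode,
        "--n-workers", str(n_workers),
        "--timeout", str(timeout),
        "--stats-level", stats_level,
        # Boolean options are on/off flag pairs, so exactly one is emitted.
        "--runtimes" if runtimes else "--no-runtimes",
        "--connected" if connected else "--no-connected",
    ]


if __name__ == "__main__":
    print(" ".join(extract_features_argv("datasets/ENZYMES.pkl",
                                         mode="medium", n_workers=4)))
```

Spelling out every flag explicitly makes the resulting command independent of the defaults, which can be useful when recording how a feature matrix was produced.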
feature_analysis
Analysis of the features extracted in feature_file.
cli feature_analysis [OPTIONS] DATASET
Options
-rf, --results-folder <results_folder>
    Location of results
    Default: ./results
-ff, --feature-file <feature_file>
    Location of features
    Default: all_features.pkl
--analysis-type <analysis_type>
    classification/regression/unsupervised.
    Default: classification
--graph-removal <graph_removal>
    Fraction of failed features required to remove a graph from the dataset.
    Default: 0.3
-i, --interpretability <interpretability>
    Interpretability of features to consider
    Default: 1
-m, --model <model>
    Model for feature analysis (RF, LGBM, XG)
    Default: XG
--kfold, --no-kfold
    Use K-fold
    Default: True
--reduce-set, --no-reduce-set
    Whether to recompute accuracies with a reduced set of top features.
    Default: True
--reduced-set-size <reduced_set_size>
    Number of uncorrelated top features to consider in top reduced feature classification.
    Default: 100
--reduced-set-max-correlation <reduced_set_max_correlation>
    Maximum correlation to allow for selection of top features for reduced feature classification.
    Default: 0.9
-p, --plot, -np, --no-plot
    Optionally plot analysis results
    Default: True
--max-feats-plot <max_feats_plot>
    Number of top features to plot with violins.
    Default: 20
--n-splits <n_splits>
    Number of splits for k-fold, None will use an automatic estimation.
--n-repeats <n_repeats>
    Number of repeats of k-folds for better averaged accuracies.
    Default: 1
Arguments
DATASET
    Required argument
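The `--reduced-set-size` and `--reduced-set-max-correlation` options describe selecting a limited number of top features that are not too correlated with one another. One plausible reading is a greedy filter, sketched below in plain Python. This is an illustration under assumptions, not hcga's actual implementation: the function names, the use of Pearson correlation, and the input layout (a dict of per-graph feature values plus a dict of importance scores) are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_reduced_set(features, importance, size=100, max_correlation=0.9):
    """Greedily keep the most important features that are mutually uncorrelated.

    features:   dict mapping feature name -> list of values, one per graph
    importance: dict mapping feature name -> importance score
    """
    ranked = sorted(importance, key=importance.get, reverse=True)
    selected = []
    for name in ranked:
        if len(selected) == size:
            break
        # Keep a feature only if it is not too correlated with any kept one.
        if all(abs(pearson(features[name], features[kept])) <= max_correlation
               for kept in selected):
            selected.append(name)
    return selected


if __name__ == "__main__":
    features = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 1, 3, 2]}
    importance = {"a": 0.5, "b": 0.4, "c": 0.1}
    # "b" duplicates "a" up to scale, so it is skipped in favour of "c".
    print(select_reduced_set(features, importance, size=2))
```

In this toy input, `b` is a scaled copy of `a`, so a threshold of 0.9 excludes it even though it ranks second by importance.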
get_data
Generate the benchmark or test data.
DATASET_NAME can be either:
- TESTDATA: to generate a synthetic dataset for testing
- DD, ENZYMES, REDDIT-MULTI-12K, PROTEINS, MUTAG, or any other dataset hosted on https://ls11-www.cs.tu-dortmund.de/people/morris/graphkerneldatasets
cli get_data [OPTIONS] DATASET_NAME
Options
-f, --folder <folder>
    Location to save dataset
    Default: ./datasets
Arguments
DATASET_NAME
    Required argument
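To fetch several benchmark datasets into the same location, the `get_data` command can be generated once per dataset name. The sketch below only prints the commands. `get_data_argv` is a hypothetical helper, and `./benchmarks` is an arbitrary example folder rather than a path that hcga prescribes.

```python
def get_data_argv(dataset_name, folder="./datasets"):
    """Build the argv list for `hcga get_data` (hypothetical helper)."""
    return ["hcga", "get_data", dataset_name, "--folder", folder]


if __name__ == "__main__":
    # Dataset names taken from the list above; the folder is an example only.
    for name in ("ENZYMES", "MUTAG", "PROTEINS"):
        print(" ".join(get_data_argv(name, folder="./benchmarks")))
```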