Configuration#
Configuration for TopoStats is done using a YAML configuration file that is specified on the
command line when invoking. The current configuration file is provided in the TopoStats repository at
topostats/default_config.yaml
but
please be aware this may not work with your installed version, particularly if you installed from PyPI.
Generating a configuration#
You can always generate a configuration file appropriate for the version you have installed (bar v2.0.0 as this option
was added afterwards). This writes the default configuration to the specified filename (i.e. it does not have to be
called config.yaml
it could be called spm-2023-02-20.yaml
)
run_topostats --create-config-file config.yaml
If no configuration file is provided this default configuration is loaded automatically and used.
Using a custom configuration#
If you have generated a configuration file you can modify and edit a configuration it to change the parameters (see fields below). Once these changes have been saved, you can run TopoStats with this configuration file as shown below.
run_topostats --config my_config.yaml
On completion a copy of the configuration that was used is written to the output directory so you have a record of the parameters used to generate the results you have. This file can be used in subsequent runs of TopoStats.
YAML Structure#
YAML files have key and value pairs, the first word, e.g. base_dir
is the key this is followed by a colon to separate
it from the value that it takes, by default base_dir
takes the value ./
(which means the current directory) and so
the entry in the file is a single line with base_dir: ./
. Other data structures are available in YAML files including
nested values and lists.
A list in YAML consists of a key (e.g. above:
) followed by the values in square brackets separated by commas such as
above: [ 500, 800 ]
. This means the above
key is a list of the values 500
and 800
. Long lists can be split over
separate lines as shown below
above:
- 100
- 200
- 300
- 400
Fields#
Aside from the comments in YAML file itself the fields are described below.
Section | Sub-Section | Data Type | Default | Description |
---|---|---|---|---|
base_dir |
string | ./ |
Directory to recursively search for files within.[^1] | |
output_dir |
string | ./output |
Directory that output should be saved to.[^1] | |
log_level |
string | info |
Verbosity of logging, options are (in increasing order) warning , error , info , debug . |
|
cores |
integer | 2 |
Number of cores to run parallel processes on. | |
file_ext |
string | .spm |
File extensions to search for. | |
loading |
channel |
string | Height |
The channel of data to be processed, what this is will depend on the file-format you are processing and the channel you wish to process. |
filter |
run |
boolean | true |
Whether to run the filtering stage, without this other stages won't run so leave as true . |
threshold_method |
str | std_dev |
Threshold method for filtering, options are ostu , std_dev or absolute . |
|
otsu_threshold_multiplier |
float | 1.0 |
Factor by which the derived Otsu Threshold should be scaled. | |
threshold_std_dev |
dictionary | 10.0, 1.0 |
A pair of values that scale the standard deviation, after scaling the standard deviation below is subtracted from the image mean to give the below/lower threshold and the above is added to the image mean to give the above/upper threshold. These values should always be positive. |
|
threshold_absolute |
dictionary | -1.0, 1.0 |
Below (first) and above (second) absolute threshold for separating data from the image background. | |
gaussian_size |
float | 0.5 |
The number of standard deviations to build the Gaussian kernel and thus affects the degree of blurring. See skimage.filters.gaussian and sigma for more information. |
|
gaussian_mode |
string | nearest |
||
grains |
run |
boolean | true |
Whether to run grain finding. Options true , false |
row_alignment_quantile |
float | 0.5 |
Quantile (0.0 to 1.0) to be used to determine the average background for the image. below values may improve flattening of large features. | |
smallest_grain_size_nm2 |
int | 100 |
The smallest size of grains to be included (in nm^2), anything smaller than this is considered noise and removed. NB must be > 0.0 . |
|
threshold_method |
float | std_dev |
Threshold method for grain finding. Options : otsu , std_dev , absolute |
|
otsu_threshold_multiplier |
1.0 |
Factor by which the derived Otsu Threshold should be scaled. | ||
threshold_std_dev |
dictionary | 10.0, 1.0 |
A pair of values that scale the standard deviation, after scaling the standard deviation below is subtracted from the image mean to give the below/lower threshold and the above is added to the image mean to give the above/upper threshold. These values should always be positive. |
|
threshold_absolute |
dictionary | -1.0, 1.0 |
Below (first), above (second) absolute threshold for separating grains from the image background. | |
direction |
above |
Defines whether to look for grains above or below thresholds or both. Options: above , below , both |
||
smallest_grain_size |
int | 50 |
Catch-all value for the minimum size of grains. Measured in nanometres squared. All grains with area below than this value are removed. | |
absolute_area_threshold |
dictionary | [300, 3000], [null, null] |
Area thresholds for above the image backround (first) and below the image background (second), which grain sizes are permitted, measured in nanometres squared. All grains outside this area range are removed. | |
remove_edge_intersecting_grains |
boolean | true |
Whether to remove grains that intersect the image border. Do not change this unless you know what you are doing. This will ruin any statistics relating to grain size, shape and DNA traces. | |
grainstats |
run |
boolean | true |
Whether to calculate grain statistics. Options : true , false |
cropped_size |
float | 40.0 |
Force cropping of grains to this length (in nm) of square cropped images (can take -1 for grain-sized box) |
|
edge_detection_method |
str | binary_erosion |
Type of edge detection method to use when determining the edges of grain masks before calculating statistics on them. Options : binary_erosion , canny . |
|
dnatracing |
run |
boolean | true |
Whether to run DNA Tracing. Options : true, false |
min_skeleton_size |
int | 10 |
The minimum number of pixels a skeleton should be for statistics to be calculated on it. Anything smaller than this is dropped but grain statistics are retained. | |
skeletonisation_method |
str | topostats |
Skeletonisation method to use, possible options are zhang , lee , thin (from Scikit-image Morphology module) or the original bespoke TopoStas method topostats . |
|
pad_width |
int | 10 | Padding for individual grains when tracing. This is sometimes required if the bounding box around grains is too tight and they touch the edge of the image. | |
cores |
int | 1 | Number of cores to use for tracing. NB Currently this is NOT used and should be left commented in the YAML file. | |
plotting |
run |
boolean | true |
Whether to run plotting. Options : true , false |
save_format |
string | png |
Format to save images in, see matplotlib.pyplot.savefig | |
pixel_interpolation |
string | null | Interpolation method for image plots. Recommended default 'null' prevents banding that occurs in some images. If interpolation is needed, we recommend gaussian . See matplotlib imshow interpolations documentation for details. |
|
image_set |
string | all |
Which images to plot. Options : all , core |
|
zrange |
list | [0, 3] |
Low (first number) and high (second number) height range for core images (can take [null, null]). NB low <= high otherwise you will see a ValueError: minvalue must be less than or equal to maxvalue error. |
|
colorbar |
boolean | true |
Whether to include the colorbar scale in plots. Options true , false |
|
axes |
boolean | true |
Wether to include the axes in the produced plots. | |
cmap |
string | nanoscope |
Colormap to use in plotting. Options : nanoscope , afmhot |
|
histogram_log_axis |
boolean | false |
Whether to plot hisograms using a logarithmic scale or not. Options: true , false . |
|
histogram_bins |
int | 200 | Number of bins to use for histograms | |
dpi |
float | 100.0 | Dots Per Inch to plot scans with, higher values give greater resolution but will increase processing time and file size. | |
summary_stats |
run |
boolean | true |
Whether to generate summary statistical plots of the distribution of different metrics grouped by the image that has been processed. |
config |
str | null |
Path to a summary config YAML file that configures/controls how plotting is done. If one is not specified either the command line argument --summary_config value will be used or if that option is not invoked the default topostats/summary_config.yaml will be used. |
Summary Configuration#
Plots summarising the distribution of metrics are generated by default. The behaviour is controlled by a configuration
file. The default example can be found in topostats/summary_config.yaml
. The fields of this file are described below.
Section | Sub-Section | Data Type | Default | Description |
---|---|---|---|---|
output_dir |
str |
./output/ |
Where output plots should be saved to. | |
csv_file |
str |
null |
Where the results file should be loaded when running toposum |
|
file_ext |
str |
png |
File type to save images as. | |
pickle_plots |
bool |
True | Whether to save images to a Python pickle. | |
var_to_label |
str |
null |
Optional YAML file that maps variable names to labels, uses topostats/var_to_label.yaml if null. |
|
molecule_id |
str |
molecule_number |
Variable containing the molecule number. | |
image_id |
str |
image |
Variable containing the image identifier. | |
hist |
bool |
True |
Whether to plot a histogram of statistics. | |
bins |
int |
20 |
Number of bins to plot in histogram. | |
stat |
str |
count |
What metric to plot on histogram valid values are count (default), frequency , probability , percent , density |
|
kde |
bool |
True |
Whether to include a Kernel Density Estimate on histograms. NB if both hist and kde are true they are overlaid. |
|
violin |
bool |
True |
Whether to generate Violin Plots. | |
figsize |
list |
[16, 9] |
||
alpha |
float |
0.5 |
||
palette |
str |
bright |
Seaborn color palette. Options colorblind , deep , muted , pastel , bright , dark , Spectral , Set2 |
|
stats_to_sum |
list |
str |
A list of strings of variables to plot, comment (placing a # at the start of the line) and uncomment as required. Possible values are area , area_cartesian_bbox , aspect_ratio , banding_angle , contour_length , end_to_end_distance , height_max , height_mean , height_median , height_min , radius_max , radius_mean , radius_median , radius_min , smallest_bounding_area , smallest_bounding_length , smallest_bounding_width , volume |
Validation#
Configuration files are validated against a schema to check that the values in the configuration file are within the expected ranges or valid parameters. This helps capture problems early and should provide informative messages as to what needs correcting if there are errors.
[^1] When writing file paths you can use absolute or relative paths. On Windows systems absolute paths start with the
drive letter (e.g. c:/
) on Linux and OSX systems they start with /
. Relative paths are started either with a ./
which denotes the current directory or one or more ../
which means the higher level directory from the current
directory. You can always find the current directory you are in using the pwd
(p
rint w
orking d
irectory). If
your work is in /home/user/path/to/my/data
and pwd
prints /home/user
then the relative path to your data is
./path/to/my/data
. The cd
command is used to c
hange d
irectory.
pwd
/home/user/
# Two ways of changing directory using a relative path
cd ./path/to/my/data
pwd
/home/user/path/to/my/data
# Using an absolute path
cd /home/user/path/to/my/data
pwd
/home/user/path/to/my/data