bob.med.tb.utils.measure

Functions

base_measures(tp, fp, tn, fn)

Calculates measures from true/false positive and negative counts

bayesian_measures(tp, fp, tn, fn, lambda_, ...)

Calculates mean and mode from true/false positive and negative counts with credible regions

beta_credible_region(k, l, lambda_, coverage)

Returns the mean, mode, and the lower and upper bounds of the equal-tailed credible region of a probability estimate following Bernoulli trials.

get_centered_maxf1(f1_scores, thresholds)

Return the centered max F1 score threshold when multiple thresholds give the same max F1 score

tricky_division(n, d)

Divides n by d.

Classes

SmoothedValue([window_size])

Track a series of values and provide access to smoothed values over a window or the global series average.

class bob.med.tb.utils.measure.SmoothedValue(window_size=20)[source]

Bases: object

Track a series of values and provide access to smoothed values over a window or the global series average.

update(value)[source]
property median
property avg
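
A minimal usage sketch based on the documented interface, assuming update() appends a value to the series and that median and avg summarise the smoothing window (the loss values below are illustrative):

    from bob.med.tb.utils.measure import SmoothedValue

    # Track a running training loss over the default window of 20 values
    loss_tracker = SmoothedValue(window_size=20)
    for step in range(100):
        loss = 1.0 / (step + 1)  # stand-in for a real loss value
        loss_tracker.update(loss)

    print(loss_tracker.median)  # median over the smoothing window
    print(loss_tracker.avg)     # smoothed average, per the class description
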
bob.med.tb.utils.measure.tricky_division(n, d)[source]

Divides n by d. Returns 0.0 in case of a division by zero.
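
Behaviourally, this is equivalent to the following sketch (not necessarily the package's actual implementation):

    def tricky_division(n, d):
        # Plain division, except that a zero denominator yields 0.0
        # instead of raising ZeroDivisionError
        return n / d if d != 0 else 0.0

    assert tricky_division(1, 2) == 0.5
    assert tricky_division(1, 0) == 0.0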

bob.med.tb.utils.measure.base_measures(tp, fp, tn, fn)[source]

Calculates measures from true/false positive and negative counts

This function can return standard machine learning measures computed from the counts of true and false positives and negatives. For a thorough look into these and alternate names for the returned values, please check Wikipedia’s entry on Precision and Recall.

Parameters
  • tp (int) – True positive count, AKA “hit”

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”

  • tn (int) – True negative count, AKA “correct rejection”

  • fn (int) – False negative count, AKA “miss”, or “Type II error”

Returns

  • precision (float) – P, AKA positive predictive value (PPV). It corresponds arithmetically to tp/(tp+fp). In the case tp+fp == 0, this function returns zero for precision.

  • recall (float) – R, AKA sensitivity, hit rate, or true positive rate (TPR). It corresponds arithmetically to tp/(tp+fn). In the special case where tp+fn == 0, this function returns zero for recall.

  • specificity (float) – S, AKA selectivity or true negative rate (TNR). It corresponds arithmetically to tn/(tn+fp). In the special case where tn+fp == 0, this function returns zero for specificity.

  • accuracy (float) – A, see Accuracy. It is the proportion of correct predictions (both true positives and true negatives) among the total number of pixels examined. It corresponds arithmetically to (tp+tn)/(tp+tn+fp+fn). This measure includes both true negatives and true positives in the numerator, which makes it sensitive to data or regions without annotations.

  • jaccard (float) – J, see Jaccard Index or Similarity. It corresponds arithmetically to tp/(tp+fp+fn). In the special case where tp+fp+fn == 0, this function returns zero for the Jaccard index. The Jaccard index depends on a TP-only numerator, similarly to the F1 score. For regions where there are no annotations, the Jaccard index will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.

  • f1_score (float) – F1, see F1-score. It corresponds arithmetically to 2*P*R/(P+R) or 2*tp/(2*tp+fp+fn). In the special case where P+R == (2*tp+fp+fn) == 0, this function returns zero for the F1 score. The F1 or Dice score depends on a TP-only numerator, similarly to the Jaccard index. For regions where there are no annotations, the F1-score will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.
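
A usage sketch with hypothetical counts; the return order is assumed to follow the Returns list above:

    from bob.med.tb.utils.measure import base_measures

    tp, fp, tn, fn = 90, 10, 80, 20  # hypothetical confusion-matrix counts

    precision, recall, specificity, accuracy, jaccard, f1 = base_measures(
        tp, fp, tn, fn
    )

    print(precision)    # tp/(tp+fp)            = 0.90
    print(recall)       # tp/(tp+fn)            ~ 0.818
    print(specificity)  # tn/(tn+fp)            ~ 0.889
    print(accuracy)     # (tp+tn)/(tp+tn+fp+fn) = 0.85
    print(jaccard)      # tp/(tp+fp+fn)         = 0.75
    print(f1)           # 2*tp/(2*tp+fp+fn)     ~ 0.857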

bob.med.tb.utils.measure.beta_credible_region(k, l, lambda_, coverage)[source]

Returns the mean, mode, and the lower and upper bounds of the equal-tailed credible region of a probability estimate following Bernoulli trials.

This implementation is based on [GOUTTE-2005]. It assumes \(k\) successes and \(l\) failures (\(n = k+l\) total trials) are issued from a series of Bernoulli trials (the likelihood is binomial). The posterior is derived using Bayes’ theorem with a beta prior. As there is no reason to favour high vs. low precision, we use a symmetric Beta prior (\(\alpha=\beta\)):

\[\begin{split}P(p|k,n) &= \frac{P(k,n|p)P(p)}{P(k,n)} \\ P(p|k,n) &= \frac{\frac{n!}{k!(n-k)!}p^{k}(1-p)^{n-k}P(p)}{P(k)} \\ P(p|k,n) &= \frac{1}{B(k+\alpha, n-k+\beta)}p^{k+\alpha-1}(1-p)^{n-k+\beta-1} \\ P(p|k,n) &= \frac{1}{B(k+\alpha, n-k+\alpha)}p^{k+\alpha-1}(1-p)^{n-k+\alpha-1}\end{split}\]

The mode for this posterior (also the maximum a posteriori) is:

\[\text{mode}(p) = \frac{k+\lambda-1}{n+2\lambda-2}\]

Concretely, the prior may be flat (all rates are equally likely, \(\lambda=1\)) or we may use Jeffrey’s prior (\(\lambda=0.5\)), which is invariant under re-parameterisation. Jeffrey’s prior indicates that rates close to zero or one are more likely.

The mode above works if \(k+\alpha, n-k+\alpha > 1\), which is usually the case for a reasonably well-tuned system, with more than a few samples for analysis. In the limit of the system performance, \(k\) may be 0, which will make the mode become zero.

For our purposes, it may be more suitable to represent \(n = k + l\), with \(k\) the number of successes and \(l\) the number of failures in the binomial experiment, which yields the equivalent representation:

\[\begin{split}P(p|k,l) &= \frac{1}{B(k+\alpha, l+\alpha)}p^{k+\alpha-1}(1-p)^{l+\alpha-1} \\ \text{mode}(p) &= \frac{k+\lambda-1}{k+l+2\lambda-2}\end{split}\]
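
As a quick numeric illustration of the mode formula (the counts are made up): with \(k = 90\) successes and \(l = 10\) failures, a flat prior gives back the maximum-likelihood rate, while Jeffrey’s prior shifts it slightly:

    k, l = 90, 10

    for lambda_ in (1.0, 0.5):  # flat prior, then Jeffrey's prior
        mode = (k + lambda_ - 1) / (k + l + 2 * lambda_ - 2)
        print(lambda_, mode)
    # 1.0 -> 0.9 (= k/(k+l), the maximum-likelihood estimate)
    # 0.5 -> 89.5/99 ~ 0.904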

This can be mapped to most rates calculated in the context of binary classification this way:

  • Precision or Positive-Predictive Value (PPV): p = TP/(TP+FP), so k=TP, l=FP

  • Recall, Sensitivity, or True Positive Rate: r = TP/(TP+FN), so k=TP, l=FN

  • Specificity or True Negative Rate: s = TN/(TN+FP), so k=TN, l=FP

  • F1-score: f1 = 2TP/(2TP+FP+FN), so k=2TP, l=FP+FN

  • Accuracy: acc = (TP+TN)/(TP+TN+FP+FN), so k=TP+TN, l=FP+FN

  • Jaccard: j = TP/(TP+FP+FN), so k=TP, l=FP+FN
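
Concretely, each of those rates is obtained by passing the corresponding \((k, l)\) pair to this function; a sketch with hypothetical counts (Jeffrey’s prior, 95% coverage):

    from bob.med.tb.utils.measure import beta_credible_region

    tp, fp, tn, fn = 90, 10, 80, 20  # hypothetical counts

    precision = beta_credible_region(tp, fp, 0.5, 0.95)    # k=TP, l=FP
    recall = beta_credible_region(tp, fn, 0.5, 0.95)       # k=TP, l=FN
    f1 = beta_credible_region(2 * tp, fp + fn, 0.5, 0.95)  # k=2TP, l=FP+FN
    # each call returns (mean, mode, lower, upper) -- see "Returns" below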

Contrary to frequentist approaches, in which one can only say that, were the test repeated an infinite number of times with a confidence interval constructed each time, X% of those intervals would contain the true rate, here we can say that, given the observed data, there is an X% probability that the true value of \(k/n\) falls within the provided interval.

Note

To disambiguate this from a confidence interval, read https://en.wikipedia.org/wiki/Credible_interval.

Parameters
  • k (int) – Number of successes observed on the experiment

  • l (int) – Number of failures observed on the experiment

  • lambda_ (float, Optional) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior (the default).

  • coverage (float, Optional) – A floating-point number between 0 and 1.0 indicating the coverage you’re expecting. A value of 0.95 will ensure 95% of the area under the probability density of the posterior is covered by the returned equal-tailed interval.

Returns

  • mean (float) – The mean of the posterior distribution

  • mode (float) – The mode of the posterior distribution

  • lower, upper (float) – The lower and upper bounds of the credible region
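
Per the derivation above, the posterior is \(\text{Beta}(k+\lambda, l+\lambda)\), so the returned values can be reproduced with scipy.stats.beta. A sketch of the computation under that reading; the package’s actual implementation may differ in details:

    from scipy.stats import beta

    def beta_credible_region_sketch(k, l, lambda_, coverage):
        posterior = beta(k + lambda_, l + lambda_)  # Beta(k+lambda, l+lambda)
        mean = posterior.mean()
        # mode formula from above; valid when k+lambda_ and l+lambda_ exceed 1
        mode = (k + lambda_ - 1) / (k + l + 2 * lambda_ - 2)
        tail = (1.0 - coverage) / 2.0  # equal probability mass in each tail
        lower, upper = posterior.ppf(tail), posterior.ppf(1.0 - tail)
        return mean, mode, lower, upper

    print(beta_credible_region_sketch(90, 10, 0.5, 0.95))
    # mean = 90.5/101 ~ 0.896, mode = 89.5/99 ~ 0.904, plus the 95% bounds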

bob.med.tb.utils.measure.bayesian_measures(tp, fp, tn, fn, lambda_, coverage)[source]

Calculates mean and mode from true/false positive and negative counts with credible regions

This function can return Bayesian estimates of standard machine learning measures computed from the counts of true and false positives and negatives. For a thorough look into these and alternate names for the returned values, please check Wikipedia’s entry on Precision and Recall. See beta_credible_region() for details on the calculation of returned values.

Parameters
  • tp (int) – True positive count, AKA “hit”

  • fp (int) – False positive count, AKA “false alarm”, or “Type I error”

  • tn (int) – True negative count, AKA “correct rejection”

  • fn (int) – False Negative count, AKA “miss”, or “Type II error”

  • lambda_ (float) – The parameterisation of the Beta prior to consider. Use \(\lambda=1\) for a flat prior. Use \(\lambda=0.5\) for Jeffrey’s prior.

  • coverage (float) – A floating-point number between 0 and 1.0 indicating the coverage you’re expecting. A value of 0.95 will ensure 95% of the area under the probability density of the posterior is covered by the returned equal-tailed interval.

Returns

  • precision ((float, float, float, float)) – P, AKA positive predictive value (PPV): mean, mode and credible region bounds (at the requested coverage). It corresponds arithmetically to tp/(tp+fp).

  • recall ((float, float, float, float)) – R, AKA sensitivity, hit rate, or true positive rate (TPR): mean, mode and credible region bounds (at the requested coverage). It corresponds arithmetically to tp/(tp+fn).

  • specificity ((float, float, float, float)) – S, AKA selectivity or true negative rate (TNR): mean, mode and credible region bounds (at the requested coverage). It corresponds arithmetically to tn/(tn+fp).

  • accuracy ((float, float, float, float)) – A: mean, mode and credible region bounds (at the requested coverage). See Accuracy. It is the proportion of correct predictions (both true positives and true negatives) among the total number of pixels examined. It corresponds arithmetically to (tp+tn)/(tp+tn+fp+fn). This measure includes both true negatives and true positives in the numerator, which makes it sensitive to data or regions without annotations.

  • jaccard ((float, float, float, float)) – J: mean, mode and credible region bounds (at the requested coverage). See Jaccard Index or Similarity. It corresponds arithmetically to tp/(tp+fp+fn). The Jaccard index depends on a TP-only numerator, similarly to the F1 score. For regions where there are no annotations, the Jaccard index will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.

  • f1_score ((float, float, float, float)) – F1: mean, mode and credible region bounds (at the requested coverage). See F1-score. It corresponds arithmetically to 2*P*R/(P+R) or 2*tp/(2*tp+fp+fn). The F1 or Dice score depends on a TP-only numerator, similarly to the Jaccard index. For regions where there are no annotations, the F1-score will always be zero, irrespective of the model output. Accuracy may be a better proxy if one needs to consider the true absence of annotations in a region as part of the measure.
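
A usage sketch with hypothetical counts; the return order is assumed to mirror base_measures(), with each measure expanded into a (mean, mode, lower, upper) tuple:

    from bob.med.tb.utils.measure import bayesian_measures

    tp, fp, tn, fn = 90, 10, 80, 20  # hypothetical counts

    (precision, recall, specificity, accuracy, jaccard, f1) = bayesian_measures(
        tp, fp, tn, fn, 0.5, 0.95  # Jeffrey's prior, 95% coverage
    )

    mean, mode, lower, upper = precision
    print(f"precision: mode={mode:.3f}, region=[{lower:.3f}, {upper:.3f}]")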

bob.med.tb.utils.measure.get_centered_maxf1(f1_scores, thresholds)[source]

Return the centered max F1 score threshold when multiple thresholds give the same max F1 score

Parameters
  • f1_scores – the F1 scores evaluated at each candidate threshold

  • thresholds – the candidate thresholds corresponding to each F1 score

Returns

  • max F1 score (float)

  • threshold (float)
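
One plausible reading of “centered” is the middle of the plateau of thresholds that share the maximum F1 score; a sketch under that assumption, not necessarily the package’s implementation:

    import numpy

    def get_centered_maxf1_sketch(f1_scores, thresholds):
        f1_scores = numpy.asarray(f1_scores)
        thresholds = numpy.asarray(thresholds)
        maxf1 = f1_scores.max()
        (candidates,) = numpy.where(f1_scores == maxf1)
        # pick the threshold at the centre of the tied stretch
        centered = thresholds[candidates[len(candidates) // 2]]
        return maxf1, centered

    # thresholds 0.4, 0.5 and 0.6 all reach F1 = 0.9; the centre, 0.5, wins
    print(get_centered_maxf1_sketch([0.7, 0.9, 0.9, 0.9, 0.8],
                                    [0.3, 0.4, 0.5, 0.6, 0.7]))
    # (0.9, 0.5)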