The purpose of this notebook is to validate the reimplementation of the DC1 metrics, previously available in the GitHub repository PZDC1paper and now refactored into the RAIL Evaluation module. The metrics were implemented in object-oriented Python 3, following a superclass/subclass structure and inheriting features from qp.
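As an illustration of that structure, the sketch below shows the kind of superclass/subclass pattern being described. It is simplified and hypothetical (class names and signatures are for illustration only, not the actual RAIL Evaluation code); the real metric classes used later in this notebook expose the same evaluate()/metric interface.

import numpy as np

class MetricBase:
    # Hypothetical superclass: holds the inputs shared by all metric subclasses.
    def __init__(self, pdfs, zgrid, ztrue):
        self._pdfs, self._zgrid, self._ztrue = pdfs, zgrid, ztrue
        self._metric = None

    def evaluate(self, **kwargs):
        raise NotImplementedError  # each subclass computes its own statistic

    @property
    def metric(self):
        return self._metric

class OutlierFraction(MetricBase):
    # Hypothetical subclass: fraction of PIT values outside a central interval.
    def evaluate(self, pits, pit_min, pit_max):
        self._metric = np.mean((pits < pit_min) | (pits > pit_max))
        return self._metric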
from metrics import *
from utils import *
import os
%matplotlib inline
%reload_ext autoreload
%autoreload 2
The DC1 results are stored in the class DC1 (defined in the ancillary file utils.py), which exists only to provide the reference values.
dc1 = DC1()
dc1.table
Code | PIT out rate | KS | CvM | AD | CDE loss |
---|---|---|---|---|---|
ANNz2 | 0.0265 | 0.0200 | 52.2500 | 759.2000 | -6.8800 |
BPZ | 0.0192 | 0.0388 | 280.7900 | 1557.5000 | -7.8200 |
CMNN | 0.0034 | 0.0795 | 1011.1100 | 6307.5000 | -10.4300 |
Delight | 0.0006 | 0.0876 | 1075.1700 | 6167.5000 | -8.3300 |
EAZY | 0.0154 | 0.0723 | 1105.5800 | 4418.6000 | -7.0700 |
FlexZBoost | 0.0202 | 0.0240 | 68.8300 | 478.8000 | -10.6000 |
GPz | 0.0058 | 0.0241 | 66.8000 | 670.9000 | -9.9300 |
LePhare | 0.0486 | 0.0663 | 473.0500 | 383.8000 | -1.6600 |
METAPhoR | 0.0229 | 0.0438 | 298.5600 | 715.5000 | -6.2800 |
SkyNet | 0.0001 | 0.0747 | 763.0000 | 4216.4000 | -7.8900 |
TPZ | 0.0130 | 0.1138 | 1.1600 | 10565.7000 | -9.5500 |
To access an individual metric, one can index the dictionary dc1.results using the code and metric names as keys.
dc1.results['PIT out rate']['FlexZBoost']
0.0202
The lists of available codes and metrics can be accessed through the properties dc1.codes and dc1.metrics.
print(dc1.codes)
('ANNz2', 'BPZ', 'CMNN', 'Delight', 'EAZY', 'FlexZBoost', 'GPz', 'LePhare', 'METAPhoR', 'SkyNet', 'TPZ')
print(dc1.metrics)
('PIT out rate', 'CDE loss', 'KS', 'CvM', 'AD')
In this notebook we use the same input dataset as the DC1 PZ paper (Schmidt et al. 2020), copied from Cori (/global/cfs/cdirs/lsst/groups/PZ/PhotoZDC1/photoz_results/TESTDC1FLEXZ).
my_path = "/Users/julia/TESTDC1FLEXZ"
pdfs_file = os.path.join(my_path, "Mar5Flexzgold_pz.out")        # gridded photo-z PDFs
ztrue_file = os.path.join(my_path, "Mar5Flexzgold_idszmag.out")  # object ids, true redshifts, magnitudes
oldpitfile = os.path.join(my_path, "TESTPITVALS.out")            # PIT values computed for the DC1 paper
pdfs, zgrid, ztrue, photoz_mode = read_pz_output(pdfs_file, ztrue_file)
Validation file from the DC1 paper (ASCII format).
The metrics are calculated from the PITs computed via the qp.Ensemble CDF method. The PIT values can be passed as an optional input to speed up the metrics calculation; if no PIT array is provided, they are computed on the fly.
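Before running the real computation below, here is a minimal sketch of what the PIT amounts to: the CDF of each object's photo-z PDF evaluated at the true redshift. It is purely illustrative (it assumes normalized PDFs sampled on zgrid and uses simple trapezoidal integration), not the qp.Ensemble CDF method used by the PIT class.

import numpy as np

def pit_from_grid(pdfs, zgrid, ztrue):
    # Illustrative only: PIT_i = CDF_i(ztrue_i), via trapezoidal integration of gridded PDFs.
    pits = np.empty(len(ztrue))
    for i, (pdf, zt) in enumerate(zip(pdfs, ztrue)):
        mask = zgrid <= zt
        pits[i] = np.trapz(pdf[mask], zgrid[mask]) if mask.sum() > 1 else 0.0
    return pits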
%%time
pit = PIT(pdfs, zgrid, ztrue)
pit.evaluate()
pits = pit.metric
CPU times: user 10min 2s, sys: 7.33 s, total: 10min 9s Wall time: 24min 35s
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pits=pits, show_dc1="FlexZBoost")
/Users/julia/github/RAIL/rail/evaluation/metrics.py:188: UserWarning: p-value floored: true value smaller than 0.001 ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
Metric | Value | DC1 reference value |
---|---|---|
PIT out rate | 0.0192 | 0.0202 |
KS | 0.0234 | 0.0240 |
CvM | 71.1439 | 68.8300 |
AD | 616.1850 | 478.8000 |
KLD | 0.0348 | N/A |
CDE loss | -10.62 | -10.60 |
pit_out_rate = PitOutRate(pdfs, zgrid, ztrue).evaluate(pits=pits)
pit.plot_pit_qq(title="DC1 paper data", code="FlexZBoost",
pit_out_rate=pit_out_rate, savefig=False)
Following Sam's suggestion, I also computed the metrics by reading the PIT values from the partial results of the DC1 paper, instead of calculating them from scratch.
Reading DC1 PIT values (PITs computed in the past for the paper):
pits_dc1 = np.loadtxt(oldpitfile, skiprows=1, usecols=(1,))
plt.figure(figsize=[10, 3])
plt.plot(pits_dc1, pits - pits_dc1, 'k,')
plt.plot([0, 1], [0, 0], 'r--', lw=3)
plt.xlim(0, 1)
plt.ylim(-0.02, 0.02)
plt.xlabel("PITs from DC1 paper")
plt.ylabel(r"$\Delta$ PIT")
plt.tight_layout()
The two sets of PIT values are slightly different.
Recalculating the metrics:
summary = Summary(pdfs, zgrid, ztrue)
summary.markdown_metrics_table(pits=pits_dc1, show_dc1="FlexZBoost")
5 PITs removed from the sample.
/Users/julia/github/RAIL/rail/evaluation/metrics.py:188: UserWarning: p-value floored: true value smaller than 0.001 ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
Metric | Value | DC1 reference value |
---|---|---|
PIT out rate | 0.0202 | 0.0202 |
KS | 0.0240 | 0.0240 |
CvM | 68.8262 | 68.8300 |
AD | 663.4982 | 478.8000 |
KLD | 0.0360 | N/A |
CDE loss | -10.62 | -10.60 |
pit_out_rate = PitOutRate(pdfs, zgrid, ztrue).evaluate(pits=pits_dc1)
pit.plot_pit_qq(title="DC1 data (original PITs)", code="FlexZBoost",
pit_out_rate=pit_out_rate, savefig=False)
Using the original PIT values from the paper, all metrics match reasonably, except for the Anderson-Darling statistic.
The class AD uses the scipy.stats.anderson_ksamp method to compute the Anderson-Darling statistic for the PIT values, comparing them with a sample drawn from a uniform distribution between 0 and 1. As of the current scipy version (1.6.2), scipy.stats.anderson (the 1-sample test) does not support the uniform distribution as a reference distribution.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.anderson_ksamp.html
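A minimal, self-contained sketch of that comparison is shown below (illustrative only; the variable names mirror the warning message above, and the real PIT array is replaced by random stand-in data):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pits_clean = rng.uniform(0., 1., 100_000)     # stand-in for the trimmed PIT array
uniform_yvals = rng.uniform(0., 1., 100_000)  # reference sample drawn from U(0, 1)
ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
print(ad_results.statistic)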
ad_dc1 = dc1.results['AD']['FlexZBoost']
ad_dc1
478.8
ad = AD(pdfs, zgrid, ztrue)
ad.evaluate(pits=pits_dc1)
ad.metric
5 PITs removed from the sample.
/Users/julia/github/RAIL/rail/evaluation/metrics.py:188: UserWarning: p-value floored: true value smaller than 0.001 ad_results = stats.anderson_ksamp([pits_clean, uniform_yvals])
663.49815902956
By default, the AD statistic is computed within the interval $0.0 \leq \mathrm{PIT} \leq 1.0$.
Five objects have PIT values outside this interval (which is unexpected).
print(pits_dc1[(pits_dc1<0)|(pits_dc1>1)])
[1.00008 1.00002 1.00041 1.00006 1.00002]
It is also possible to remove extreme PIT values, as was done in the paper.
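A sketch of what that trimming amounts to (illustrative; the ad_pit_min/ad_pit_max keywords below are the actual arguments of AD.evaluate, but this helper function is hypothetical):

import numpy as np

def trim_pits(pits, pit_min=0.0, pit_max=1.0):
    # Keep only the PIT values inside [pit_min, pit_max] before running the AD test.
    mask = (pits >= pit_min) & (pits <= pit_max)
    print(f"{np.sum(~mask)} PITs removed from the sample.")
    return pits[mask]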
ad.evaluate(pits=pits, ad_pit_min=0.0001, ad_pit_max=0.9999)
ad.metric
7683 PITs removed from the sample.
333.5984688987156
ad.evaluate(pits=pits, ad_pit_min=0.001, ad_pit_max=0.999)
ad.metric
10588 PITs removed from the sample.
322.737851720298
ad.evaluate(pits=pits, ad_pit_min=0.01, ad_pit_max=0.99)
ad.metric
21764 PITs removed from the sample.
386.4284099611739
p_err = (abs(ad_dc1-ad.metric)/ad_dc1)*100. # percent error
print(f"Percent error: {p_err:.1f} %")
Percent error: 19.3 %
These point-estimate metrics (scatter, bias, and outlier rate) are deprecated and might not be used in future analyses. They are included in this notebook only for the sake of reproducing the paper's results in their totality.
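For orientation only, a hedged sketch of conventional point-estimate definitions is given below, assuming ez = (zphot - ztrue)/(1 + ztrue), scatter as the standard deviation of ez, bias as its mean, and outliers as |ez| > 3 * scatter. The exact DC1 definitions (e.g. an IQR-based scatter or a fixed outlier cut) may differ; utils.old_metrics_table is the reference implementation.

import numpy as np

def old_point_metrics(photoz_mode, ztrue):
    # Illustrative definitions only, not the DC1/utils implementation.
    ez = (photoz_mode - ztrue) / (1.0 + ztrue)
    scatter = np.std(ez)
    bias = np.mean(ez)
    outlier_rate = np.mean(np.abs(ez) > 3.0 * scatter)
    return scatter, bias, outlier_rate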
utils.old_metrics_table(photoz_mode, ztrue, name="this test", show_dc1=True)
Metric | this test | DC1 paper |
---|---|---|
scatter | 0.0155 | 0.0154 |
bias | -0.00027 | -0.00027 |
outlier rate | 0.020 | 0.020 |
utils.plot_old_valid(photoz_mode, ztrue, code="FlexZBoost")
The metrics calculated with the new implementation are reasonably close to the expected values. The minor differences are due to differences in the calculation of the PIT values: in both cases, here and in the paper, the PITs were calculated using qp functions, so the small discrepancies are attributed to changes in qp between the version used for the paper and the current one.
When using the original PIT values, i.e., those calculated for the paper with the qp version available at the time, all metrics were reproduced except for the AD test. This particular metric is quite sensitive to the range of PITs considered in the calculation. Even when using the same PIT interval as in the paper (0.01, 0.99), the result obtained with the new implementation differs from the paper result by 19.3%.