Reality as the referee
It is easy for a system — or a team — to convince itself it is making progress. We design our evaluation so the world itself does the grading: forecasts are time-gated and pre-registered, test domains are held out by construction, and a gain only counts when it beats the strongest available baseline out-of-sample.
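
As a rough illustration, the three rules above could be sketched in code as follows. This is a sketch under stated assumptions, not a description of the actual evaluation pipeline: the names Forecast, Resolution, brier_score, check_held_out, and counts_as_gain are hypothetical, and the Brier score is an assumed metric, since the text does not name one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from statistics import mean


@dataclass(frozen=True)
class Forecast:
    """A pre-registered probability forecast for a yes/no question (hypothetical type)."""
    question_id: str
    probability: float        # predicted probability that the outcome is True
    registered_at: datetime   # moment the forecast was locked in


@dataclass(frozen=True)
class Resolution:
    """The real-world outcome of a question (hypothetical type)."""
    question_id: str
    outcome: bool
    resolved_at: datetime


def brier_score(forecasts, resolutions):
    """Mean squared error of time-gated forecasts; lower is better.

    The Brier score is an assumed metric chosen for illustration.
    """
    by_id = {r.question_id: r for r in resolutions}
    errors = []
    for f in forecasts:
        r = by_id[f.question_id]
        # Time gate: a forecast registered at or after resolution does not count.
        if f.registered_at >= r.resolved_at:
            raise ValueError(f"forecast for {f.question_id} was not registered before resolution")
        errors.append((f.probability - float(r.outcome)) ** 2)
    return mean(errors)


def check_held_out(train_domains, test_domains):
    """Raise if any test domain was also seen during development."""
    overlap = set(train_domains) & set(test_domains)
    if overlap:
        raise ValueError(f"test domains are not held out: {sorted(overlap)}")


def counts_as_gain(candidate_score, baseline_scores, margin=0.0):
    """True only if the candidate beats the strongest (lowest-scoring) baseline."""
    strongest = min(baseline_scores.values())
    return candidate_score < strongest - margin


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    registered = datetime(2024, 1, 1, tzinfo=timezone.utc)
    resolved = datetime(2024, 6, 1, tzinfo=timezone.utc)
    forecasts = [Forecast("q1", 0.8, registered), Forecast("q2", 0.3, registered)]
    resolutions = [Resolution("q1", True, resolved), Resolution("q2", False, resolved)]

    check_held_out(train_domains={"weather"}, test_domains={"elections"})
    score = brier_score(forecasts, resolutions)
    baselines = {"always_half": 0.25, "prior_system": 0.07}
    print(score, counts_as_gain(score, baselines))
```

The design choice worth noting is that the comparison uses the minimum over all baseline scores, so a candidate cannot register a gain by beating only a weak baseline, and the time gate fails loudly rather than silently dropping late forecasts.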
