Behind every robust result is a robust method: Perspectives from a case study and publication process in hydrological modelling
2021; Wiley; Volume: 35; Issue: 8; Language: English
DOI: 10.1002/hyp.14266
ISSN: 1099-1085
Authors: Fabrizio Fenicia, Dmitri Kavetski
Topic(s): Flood Risk Assessment and Management
Abstract

Models are commonplace in hydrological and environmental sciences. In scientific contexts, models are used to encapsulate our current understanding of environmental processes and to test new scientific propositions. In applied contexts, models are used to predict future environmental conditions and hence are a cornerstone of the decision-making chain in civil protection (e.g., real-time flood forecasting) and environmental management and planning (e.g., seasonal streamflow forecasts to inform water allocation, ecological health assessment, and so forth). The importance of models—and public expectations of them—are only likely to increase, as science's attention turns to the formidable environmental challenges of the 21st century.

The increased reliance on models places a corresponding expectation on model reliability and robustness. Environmental model failures can be costly—from the loss of time of a PhD researcher who adopts a published model that turns out to be non-robust, to potentially catastrophic consequences during natural disasters such as floods and droughts if models incorrectly anticipate such events. These considerations might lead an impartial observer to expect to see an increasing scientific emphasis on model design and verification.

Paradoxically, there is evidence of an opposite trend taking place. For example, modern publication practices in natural sciences seem to place an ever-increasing emphasis on the presentation of "results," rather than on the "methods" used to achieve them. Some journals minimise their methods sections by presenting them in reduced font size, or even relegate them to supplementary materials that do not form part of their formal printed versions. The motivation for these changes is perhaps understandable: results and their interpretation are usually seen as the more exciting aspects of scientific endeavours. But such a degree of prioritisation of results and interpretations over their methodological foundation brings real risks to longer-term scientific rigour and credibility. The risks are particularly high as environmental sciences and applications are increasingly dominated by heavily abstracted mathematical models, which are used to simulate variables far beyond those we can measure (Cunge, 2003). Seemingly routine methodological choices can significantly affect the conclusions of case studies and investigations—making it a problematic practice to skimp on methodological details in an enthusiastic rush towards the next great discovery that models are able to "demonstrate."

Our commentary emphasises the value of three methodological instruments that arguably remain under-utilised in hydrological modelling practice and broader environmental applications: (1) benchmarking model results against a null-hypothesis model, (2) testing model predictions in space–time validation, and (3) carrying out controlled model comparisons. A schematic representation of these instruments is shown in Figure 1, in the representative context of developing a catchment-scale rainfall-runoff model. These instruments play a key role in the classic scientific method, and are certainly not new in hydrological modelling. For example, the importance of establishing meaningful benchmark models has been emphasised by many hydrologists (Pappenberger et al., 2015; Seibert et al., 2018). The need for model validation tests was notably advocated by Klemeš (1986), followed by numerous discussions on the topic (e.g., Andréassian et al., 2009; Refsgaard, 1997); see also the earlier discussion by Burges (1984). The value of systematic model comparisons and hypothesis testing was reviewed and elaborated by Clark et al. (2011) (see also Baker, 2017; Pfister & Kirchner, 2017).
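To fix ideas on instrument (1), the sketch below shows, in simplified Python, what benchmarking against a null-hypothesis model can amount to in practice: an absolute metric (here the Nash-Sutcliffe efficiency) is complemented by a skill score measured relative to a benchmark simulation. The function names and the mean-flow benchmark are illustrative assumptions only; a real study would use a stronger, problem-specific benchmark, such as the spatially uniform model introduced later in this commentary.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the model MSE normalised by the observed variance."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def benchmark_skill(obs, sim_model, sim_benchmark):
    """Skill of a candidate model relative to a benchmark (null-hypothesis) simulation.

    1 = perfect model; 0 = no better than the benchmark; < 0 = worse than the benchmark.
    With the mean observed flow as benchmark, this reduces to the NS efficiency itself.
    """
    obs = np.asarray(obs, float)
    mse_model = np.mean((np.asarray(sim_model, float) - obs) ** 2)
    mse_bench = np.mean((np.asarray(sim_benchmark, float) - obs) ** 2)
    return 1.0 - mse_model / mse_bench

# Illustrative numbers only (not data from the study)
obs = np.array([1.2, 3.4, 2.8, 0.9, 5.1])
sim_candidate = np.array([1.0, 3.1, 2.5, 1.2, 4.6])
sim_null = np.full_like(obs, obs.mean())   # weakest credible benchmark: the mean flow

print(round(nse(obs, sim_candidate), 3))                       # absolute metric
print(round(benchmark_skill(obs, sim_candidate, sim_null), 3))  # benchmark-relative metric
```

The point of the relative metric is not the formula itself, but the framing: a model is only "good" with respect to a stated alternative.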
Yet, these instruments are not necessarily included in what is generally perceived as "good modelling practice" (Cunge, 2003). For example, the common modelling protocol, adopted in many, if not most, hydrological modelling studies, relies on calibration-validation (Blöschl & Sivapalan, 1995; Cunge, 2003; Refsgaard, 1997), but does not explicitly consider comparisons with alternative models. Validation in space, although considered important for spatially distributed models, is not implemented routinely (Refsgaard, 1997). Research in broader environmental sciences, including climate modelling, mirrors these experiences, with the need for carefully designed model comparisons being stressed in several recent publications (e.g., Eyring et al., 2019; Rood, 2019).

So, multiple opinion papers and discussions notwithstanding, why are these instruments not routinely adopted in environmental modelling? The reason might be twofold. First, there is still insufficient appreciation of, and guidance on, what should constitute "good modelling practice." An appropriate modelling protocol is not necessarily unique, but may depend on the type of model and its application. For example, in some cases model calibration may not be advisable (Cunge, 2003). Second, there may be an insufficient appreciation of how much these instruments matter in practice. With some exceptions (e.g., Refsgaard & Hansen, 2010), previous discussions have revolved mainly around theoretical aspects.

We therefore refrain from restating the "theoretical" importance of these instruments and instead offer a practical example. Our story is inspired by a modelling study in a Luxembourgish catchment with a surprisingly wide range of streamflow dynamics (Fenicia et al., 2016). This commentary revisits that earlier study and presents it as a Hypothetical Paper Submission and Review Process. We start from a common choice of modelling methods and illustrate how progressively more stringent techniques change the modelling conclusions. Our characters are the humble Authors, Reviewers 1–3, and the Editor. Our disclaimer is that, in this story, the model results are real, but any resemblance to actual persons or actual events is purely coincidental.

The story begins with the Authors submitting a manuscript on the hydrology of the 300 km² Attert catchment in Luxembourg. In that catchment, the diversity of hydrograph dynamics observed across multiple locations raised the question of which climatic or landscape properties acted as dominant controls on streamflow generation. This research question was approached from a modelling perspective, seeking to find a distributed model structure and parameter values that reproduced—or, as commonly put, "explained"—the observed streamflow variability. The reader will appreciate that developing and setting up a distributed hydrological model can be a challenging endeavour. At the very least, one has to develop a conceptual model, translate it into computer code, gather and format the data, interface the model with a calibration algorithm, and spend a non-negligible amount of time debugging and running the entire machinery.
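To give a feel for what "translating a conceptual model into computer code" involves at its very simplest, the sketch below implements a toy single-reservoir element of the kind often assigned to an HRU. It is an illustrative example only, not the structure used in the study; the implicit (unconditionally stable) storage update also foreshadows the "robust time stepping" choice mentioned below.

```python
import numpy as np

def linear_reservoir(rain, k, dt=1.0, s0=0.0):
    """Toy conceptual model: a single linear reservoir.

    Storage dynamics dS/dt = P - k*S are integrated with an implicit Euler step,
    S[n+1] = (S[n] + dt*P[n]) / (1 + dt*k),
    which is unconditionally stable (a "robust" time stepping choice).
    Returns the simulated outflow Q = k*S.
    """
    s = s0
    q = np.empty(len(rain))
    for n, p in enumerate(rain):
        s = (s + dt * p) / (1.0 + dt * k)   # implicit update of storage
        q[n] = k * s                        # outflow of the linear reservoir
    return q

# Illustrative run with synthetic rainfall
rain = np.array([0.0, 10.0, 5.0, 0.0, 0.0, 2.0])
print(linear_reservoir(rain, k=0.3).round(3))
```

A distributed model repeats elements of this kind per HRU, routes their outflows to the gauges, and wraps the whole machinery in a calibration loop—hence the non-negligible development effort noted above.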
The Authors did all this, and when methodological choices were necessary, they did their best to "do things the right way," following the recommendations and techniques from the hydrological modelling literature. After a quick description of the methods, the paper focused its attention on the results. As examples of methodological choices, the Authors implemented the model using a robust time stepping scheme to avoid numerical instabilities (Kavetski & Clark, 2011), applied the square-root streamflow transformation to account for error heteroscedasticity and skew (McInerney et al., 2017; Sorooshian & Dracup, 1980), and employed global optimisation strategies (Duan et al., 1992; Qin et al., 2018). The model was validated on a separate time period not used in calibration (Blöschl & Sivapalan, 1995; Klemeš, 1986; Refsgaard, 1997). To ascertain the model's ability to simulate streamflow, the Authors followed common "best practices" and evaluated the model not only in the time domain, but also in the signature domain (Hrachowitz et al., 2014; Kavetski et al., 2018). In particular, in addition to the Nash-Sutcliffe (NS) efficiency of streamflow time series simulations at all subcatchments—an imperfect choice for a model evaluation metric (Schaefli & Gupta, 2007), but also a familiar and still widely used one—the Authors considered three streamflow signatures: the runoff coefficient, the baseflow index, and the flashiness index (e.g., Addor et al., 2018; Baker et al., 2004). This selection of data and performance metrics was argued to provide a suitably comprehensive model evaluation.

The proposed model will be referred to as M1 (and the numbering already foreshadows that it will not be the final one…). Like many distributed models, M1 was based on the concept of Hydrological Response Units (HRUs). In this case, 3 HRUs were defined based on catchment topography. Following standard practice, the same "generic" model structure was assigned to each HRU. The main selling point of this model was its ability to capture the spatial variability of streamflow in the study area.

Figure 2a presents the performance of model M1 in terms of streamflow time series predictions and their NS efficiency in each of the subcatchments. In discussing this figure, the Authors noted that model performance in calibration was "satisfactory" or "acceptable" in all the subcatchments but one, with NS efficiencies between 0.60 and 0.93. The only rogue subcatchment was the Huewelerbach headwater catchment, where the model had an NS efficiency of 0.34. The Authors contended this was a "minor" problem, representative of the "idiosyncrasies" of headwater catchments (McDonnell, 2003). In terms of streamflow signatures, the Authors noted that their absolute values were difficult to match, but the dots generally followed the same pattern, with linear correlation values (Pearson correlation, R) in the range of 0.77–0.94 (Figure 2b–d). Moreover, time validation resulted in only a moderate decrease in model performance. Overall, the model was seen to provide a "credible" indication that streamflow diversity in the Attert catchment was controlled by topography.
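For readers who wish to make these evaluation metrics concrete, the sketch below shows one way the transformed NS efficiency and the three signatures could be computed. It is a simplified illustration; in particular, the one-parameter baseflow filter and its parameter value are assumptions made for the example, not necessarily the specific choices of the study.

```python
import numpy as np

def nse(obs, sim, transform=np.sqrt):
    """NS efficiency, optionally on transformed flows (e.g. square root, to reduce error heteroscedasticity)."""
    o, s = transform(np.asarray(obs, float)), transform(np.asarray(sim, float))
    return 1.0 - np.sum((s - o) ** 2) / np.sum((o - o.mean()) ** 2)

def runoff_coefficient(q, p):
    """Fraction of rainfall leaving the catchment as streamflow."""
    return np.sum(q) / np.sum(p)

def flashiness_index(q):
    """Richards-Baker flashiness: sum of day-to-day flow changes divided by total flow."""
    q = np.asarray(q, float)
    return np.sum(np.abs(np.diff(q))) / np.sum(q)

def baseflow_index(q, alpha=0.925):
    """Baseflow fraction using a simple one-parameter digital filter
    (one of several possible separation methods; alpha is an illustrative value)."""
    q = np.asarray(q, float)
    quick = np.zeros_like(q)
    for t in range(1, len(q)):
        quick[t] = alpha * quick[t - 1] + 0.5 * (1 + alpha) * (q[t] - q[t - 1])
        quick[t] = min(max(quick[t], 0.0), q[t])   # keep quickflow within [0, Q]
    return np.sum(q - quick) / np.sum(q)

# Agreement between observed and simulated signatures across subcatchments can then be
# summarised by a Pearson correlation, e.g. np.corrcoef(obs_signatures, sim_signatures)[0, 1].
```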
Satisfied with these findings, the Authors submitted their work for publication. They were certainly not anticipating the odyssey ahead.

Then came the reviews. Reviewers 1 and 2 were generally positive. They only had some general remarks on improving referencing and presentation. Reviewer 3, instead, was surprisingly critical. In particular, Reviewer 3 did not appreciate statements such as "satisfactory" and "acceptable"—these were not only subjective, but lacked a frame of reference—"with respect to what is the model satisfactory?" The reviewer contended that a meaningful evaluation of model performance requires a credible benchmark, that is, a "null-hypothesis" model (Figure 1a). As Reviewer 3 argued, "Getting an NS efficiency above zero ain't exactly newsworthy."

Setting up a meaningful benchmark model is not trivial, and the literature does not offer many examples. After some reflection, the Authors came up with a solution. The most relevant hypothesis in their model setup was the selection of HRUs, as this decision controls how the model associates distinct dominant processes with distinct landscape elements. The null-hypothesis model requested by Reviewer 3 should serve to test this hypothesis, and therefore could be represented by a model with distributed rainfall inputs but uniformly distributed parameters (Figure 1a). In this setup, the only difference between the null-hypothesis model, here referred to as M0, and the proposed model M1 was the selection of HRUs: a single HRU for M0 versus 3 HRUs for M1. Models M0 and M1 shared all other modelling decisions.

The key results of the revised manuscript are shown in Figure 3. In the time domain, model M1 outperformed model M0 in 9 of 10 subcatchments, which for the Authors was a reassuring result. Still, M0 failed similarly to M1 in the Huewelerbach, which was once again attributed to headwater catchment "idiosyncrasies." In the signature domain, M1 clearly outperformed M0 for the baseflow and flashiness indices, for example with R values of 0.94 versus 0.09, respectively, for the flashiness index. Notably, the inclusion of the benchmark model also made it possible to replace potentially subjective absolute statements ("good" or "bad") with relative ones ("better than" or "worse than"). That said, model M0 diminished some of the perceived merits of model M1. For example, M0 could match the runoff coefficient signature even better than M1, just by virtue of distributing the forcings, without also distributing the model parameters. Still, the Authors argued, this exception did not overturn the conclusion that the topography-based model M1 offered a "realistic" representation of hydrological processes in the study catchment. The Authors therefore re-submitted their paper, convinced that their thorough revision would satisfy the most critical reviewer…

The Authors' optimism lasted as long as the re-review process. Admittedly, the first two reviewers were now rather enthusiastic about the paper, in which they recognised many more improvements than they had themselves asked for. Reviewer 3, instead, came up with an additional set of criticisms. A particular concern was that "temporal split sample" validation was not sufficient for a spatially distributed model. Reviewer 3 asserted that "a distributed model should be able to make predictions not only in time, but also in space." The paper was now held back by a single tenacious reviewer. However, the request for more stringent validation was eminently sensible and hard to dismiss. The Editor therefore asked the Authors to perform this last check, which "should not take much additional work." The Authors, confident in their model, decided to meet rather than evade the challenge, and implemented the more stringent, so-called "proxy-basin, split-sample" space–time validation test (Klemeš, 1986). The model was calibrated on one group of subcatchments over a given time period, and validated on another group of subcatchments over another time period (Figure 1b).
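For concreteness, here is a minimal sketch of how such a proxy-basin, split-sample loop could be organised. The subcatchment names, time periods, and the `calibrate`/`evaluate` helpers are hypothetical placeholders standing in for the actual study workflow.

```python
# Hypothetical groupings of subcatchments and record periods
group_a = ["subcatchment_1", "subcatchment_2", "subcatchment_3"]
group_b = ["subcatchment_4", "subcatchment_5", "subcatchment_6"]
period_1 = ("2000-01-01", "2005-12-31")
period_2 = ("2006-01-01", "2011-12-31")

def space_time_validation(model, calibrate, evaluate):
    """Calibrate on one group of subcatchments and one period,
    validate on the other group over the other period (and vice versa)."""
    results = []
    for cal_sites, cal_period, val_sites, val_period in [
        (group_a, period_1, group_b, period_2),
        (group_b, period_2, group_a, period_1),
    ]:
        params = calibrate(model, sites=cal_sites, period=cal_period)      # placeholder
        score = evaluate(model, params, sites=val_sites, period=val_period)  # placeholder
        results.append({"calibrated_on": (cal_sites, cal_period),
                        "validated_on": (val_sites, val_period),
                        "score": score})
    return results
```

The essential point is that the validation data differ from the calibration data in both space and time, so the test probes the model's ability to extrapolate rather than merely interpolate.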
Horror and damnation! Model M1 performed considerably worse in space–time validation, as seen in Figure 4. In the time domain (panel a), the model not only continued to fail in the now infamous Huewelerbach, but also had a worse NS efficiency than the benchmark model M0 in another three subcatchments. More regrettably, the pride and joy of model M1—its high correlation between simulated and observed signatures—was also lost, with R values dropping from 0.77–0.94 in time validation (Figure 3) to 0.46–0.53 in space–time validation.

The Authors contemplated defending their model, but after this experiment they themselves began to question it. Instead of arguing that their hitherto favoured model M1 still provided an improvement over the benchmark (e.g., at simulating streamflow in 6 out of 10 subcatchments), the Authors embarked on the arduous journey of revisiting the model assumptions. After several failed attempts, and as many requests to extend the re-submission deadline… Eureka! A new model, M2, was found that reproduced the observed data much better than models M0 and M1!

The performance of model M2 is shown in Figure 4. In terms of NS efficiency of streamflow time series (panel a), model M2 outperformed the benchmark model M0 in all subcatchments. Notably, M2 did well in the Huewelerbach—no need to blame headwater catchment idiosyncrasies any more! In terms of signatures (panels b–d), model M2 matched the baseflow and flashiness indices exceptionally well (R of 0.89 and 0.97, respectively, whereas model M0 had R values close to zero).

So, what was the difference between models M2 and M1? Whereas model M1 assumed that streamflow variability was driven by topography and used 3 topography-related HRUs, model M2 considered geology as the dominant control and used 4 geology-related HRUs. These contrasting assumptions are schematised in Figure 5a,b, respectively. Despite the larger number of HRUs, model M2 had fewer parameters than model M1 (11 versus 21, respectively). This reduction was obtained by tailoring the model structures to the specific HRUs, focusing on their perceived dominant processes, and by linking some parameter values across all HRUs. As the new model M2 was very different from model M1 advocated in the original submission, the entire set of conclusions about streamflow diversity in the Attert catchment had to change. This outcome was somewhat embarrassing for the Authors, who nonetheless rewrote a big chunk of their story and re-submitted their work, hoping that Reviewer 3 would now be satisfied.

When the Authors received the reviews, they understood it was not a good sign, as they had been expecting a short acceptance email. The Editor did not hide some discomfort in sending the paper back to the Authors once again. This time, understandably, the Editor did not consult the first two reviewers. So there was no need to scroll down to see the comments of Reviewer 3. After a moralising preamble on how the message of the paper had progressively changed through the review process (as a result of adopting more stringent model evaluation approaches), Reviewer 3 launched their final attack. It read:
"Results show that model M2 is better than model M1. But why is that? M1 and M2 differ in multiple respects—not only in the spatial discretisation approach, but also in the model structures used to represent each HRU, and in the presence of parameter links across HRUs. Are all of these differences pertinent? And what is their relative impact? In other words, what is la raison d'être for the improved model performance?"

The Authors, stoic in adversity, faced this new concern. Two additional models, with structures "intermediate" between models M1 and M2, were designed. The first new model, M1A, would differ from model M1 solely in the spatial discretisation approach, namely using geology rather than topography to define the HRUs. The second new model, M1B, would differ from model M2 solely in the parameter regularisation, namely by not linking parameter values across the HRUs. The resulting model comparison sequence is illustrated in Figure 5c. Ultimately, controlled model comparisons enabled the study to attribute changes in predictive performance to specific model decisions (Figure 1c).

In summary, model M1A showed the largest improvement of all the model variations, indicating that in this study the geology-based landscape discretisation was the key modelling decision. Model M1B also showed some improvement over M1A, lending credence to the idea of tailoring the model structure to each HRU. M2 achieved a very similar performance to M1B, indicating that parameter regularisation shaved off unnecessary model complexity. The overall study conclusion was that streamflow diversity in the Attert catchment was controlled by spatial variability in rainfall (runoff coefficient) and geology (baseflow and flashiness indices).

With their last remaining strength, the Authors re-submitted Revision 3 of their paper. Their perseverance was finally rewarded with the manuscript being accepted.

The interested reader can find the complete study in Fenicia et al. (2016), where models M0, M1 and M2 in this paper correspond to models M-Uni, M-Top and M-Geo-3, and the intermediate models M1A and M1B correspond to models M-Geo-1 and M-Geo-2. Note that the analysis methods differ slightly from those used in the original study by Fenicia et al. (2016): for example, here the model calibration employed a square-root streamflow transformation (following a recent recommendation of McInerney et al., 2017). Luckily for the Authors, this particular change in analysis methods did not affect the study conclusions!
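To make the bookkeeping behind such a controlled comparison explicit, the sketch below enumerates model variants that differ in one decision at a time and attributes the change in a chosen performance metric to the decision that changed. The decision labels and the `evaluate_model` helper are hypothetical placeholders; only the M1, M1A, M1B, M2 ladder itself mirrors the sequence described above.

```python
# Hypothetical decision labels describing each model variant in the comparison ladder
variants = {
    "M1":  {"hru_map": "topography", "structure": "generic",  "parameters": "free"},
    "M1A": {"hru_map": "geology",    "structure": "generic",  "parameters": "free"},
    "M1B": {"hru_map": "geology",    "structure": "tailored", "parameters": "free"},
    "M2":  {"hru_map": "geology",    "structure": "tailored", "parameters": "linked"},
}

def attribute_changes(variants, evaluate_model):
    """Evaluate each variant and attribute the change in performance between
    consecutive variants to the single decision in which they differ."""
    names = list(variants)                                   # insertion order = ladder order
    scores = {name: evaluate_model(variants[name]) for name in names}
    report = []
    for prev, curr in zip(names, names[1:]):
        changed = [k for k in variants[curr] if variants[curr][k] != variants[prev][k]]
        report.append((f"{prev} -> {curr}", changed, scores[curr] - scores[prev]))
    return report

# Usage (with a real evaluation function): each report entry then reads, e.g.,
# ("M1 -> M1A", ["hru_map"], +0.2), i.e. the performance change is tied to one decision.
```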
These instruments were not straightforward to apply, yet they had a sobering impact on the modelling conclusions. While the Authors were, no doubt, at times frustrated by the exacting demands of Reviewer 3, ultimately they appreciated the demonstrably higher quality and robustness of the resulting analyses.

As environmental research and operations become dominated by model-based approaches, and as these models become more complex and abstract, the methods used for model design and evaluation need to become correspondingly more comprehensive and stringent. The ongoing trend of prioritising results and interpretation over methods goes against this need, and can foster a laissez-faire scientific attitude, where models are misused as inexhaustible sources of (non-robust and potentially misleading) findings and recommendations. Detailed and careful reviews offer some protection, much as accomplished by Reviewer 3 in our example. However, the review process on its own, as harsh as it often seems, cannot be the sole guardian against methodological weaknesses. Indeed, this protection is increasingly weak, as the growing publication rate is (perhaps inevitably) mirrored by a decline in review quality, which is increasingly limited to "generic" assessment rather than detailed scrutiny (Politi et al., 2021). Relegating methods to a smaller font size, and giving them less prominence in scientific papers and presentations, can only exacerbate these concerns. A stronger awareness of good modelling practices and appropriate modelling protocols—and keeping methods at the forefront of scientific reporting—is needed to foster a healthier scientific attitude and reduce the risk of non-robust findings.

This commentary illustrates the impact of modelling methods on study findings, and shows how three methodological instruments that are known yet under-utilised in hydrological and environmental modelling protocols can be used to achieve more robust and insightful conclusions. We argue that studies where models are used as explanatory and/or predictive tools, as well as studies where "better" models are proposed, should make use of the three instruments investigated here, adapted to the specific applications. Studies should articulate a meaningful benchmark, set up a stringent validation targeted to the research question, and construct informative model comparisons to elucidate the reasons for the results. These choices are not straightforward, and therein ultimately lies the science and art of hydrological modelling. We hope that our commentary will stimulate and help hydrologists to implement these methodological instruments in their ongoing and future work.

We thank Editor Jim Buttle and Michael Leonard for constructive comments that helped us improve the manuscript. Open access funding provided by Lib4RI, Library for the Research Institutes within the ETH Domain (Eawag, Empa, PSI and WSL).