A copula based representation for tailings dam failures

In this article, we model the dependence between dam factor and Dmax, where dam factor is an indicator of risk of a tailings dam failure, which involves the height H of the tailings dam, the volume of material housed by the tailings dam VT and the volume dispensed by the tailings dam, VF, when the dam breaks. And, Dmax is the maximum distance traveled by the material released by the tailings dam, after the collapse. With the dependence found via copula models and Bayesian estimation, given a range of dam factor, we estimate the probability of the released material to exceed a certain threshold. Since the dam factor involves the released volume VF (unknown before the dam break), we present a naive way to estimate it using VT and H. In this way, it is possible to estimate the dam factor of a tailings dam and with such a value to identify the probability of the tailings dam to show a Dmax that exceeds a certain threshold.


Introduction
Tailings dams are scattered throughout various regions of the world, many of them already inactive, others still operating. Some countries have turned their attention to the conditions of such reservoirs, as entire regions have suffered from the impact of tailings dams breaks. The impacts of such ruptures have been incalculable, taking lives, leaving useless areas, and polluting rivers, projecting damages miles away from the reservoir site. Despite the demand imposed by several entities, the problem of regularizing these dams is complex due to several reasons. Many reservoirs are not used and their maintenance requires specialized equipment/technicians. In other cases, it is not easy to identify those responsible for the reservoirs because they work (or worked) for companies which are no longer in operation.
Several aspects must be taken into account in establishing the reliability of a tailings dam, the construction method, its size, the region where it is located, etc. For example, when talking about the region where it is located, factors such as the proximity of watercourses or cities raises the risk of the tailings dam in case of failure. Moreover, it is very common to find entire cities standing in the vicinity of reservoirs of this type, since the mining activity that surrounds the reservoir generates jobs and as a consequence facilitates the rising of communities that are sustained mainly by this activity. As expected, the databases that can help understand the dynamics of these tailings dams break are small (few cases), or at least those from which comparable information has been collected.
A valuable database of 35 tailings dam breaks reported around the world is presented by [1]. Information such as the volume of the material in a tailings dam (V T ) could be used to determine the volume of the material to be released in a dam breaking (V F ). The height H, attained by the material contained in the dam, and the volume of the released material could be used to estimate the distance reached by the released material (D max ). Among the amounts to consider, the following two quantities are usually incorporated into the studies: (i) H Â V F (dam factor), see [2], (ii) H f = H Â V F V T Â V F (dam factor related to the fractional volume V F V T ), see [1]. In the literature there are several proposals to determine V F and D max . The most recent according to our knowledge is also given in [1], where the authors propose a stochastic linear relationship between log(V F ) and log(V T ), and also a stochastic linear relationship between D max (or log(D max )) and the dam factor version (i) or (ii).
Several aspects arise in the statistical perspective used in [1]. The first one is about the estimation of the parameters of the linear relations. A frequentist view as it is used could be compromised by the limited number of cases available in the database, since the law of large numbers (large sample size) is not feasible, which results in a lack of good properties in parameter estimators. To overcome this limitation, a Bayesian approach is then recommended. The second aspect presented in [1] is the imposition of linearity in the relationships to be estimated. This assumption is quite ambitious and, although the authors take the precaution of analyzing the goodness of the fit, the validity of the diagnoses also ends up being compromised by the size of the data set. Using the same data as [1], here we address both issues, making more flexible the relationship between V T and V F and also between dam factor and D max using copula models and from a Bayesian perspective.
This article is organized as follows. Section Theoretical Background exposes the theoretical foundations of the work, with a focus on dependence models (see [3]). Inspecting the dependence relationships of the data provided in [1], the model selection procedure and the referred Bayesian parameter estimation are presented in Section Model Representation. Section Chances for Large Values of D max shows a dam factor-based prediction study in order to determine the distance that the released material can travel after a breakdown. In this same section, we offer a strategy to identify the reservoir's dam factor. Finally, Section Conclusion closes the paper with some general remarks and conclusions.

Theoretical background
What really interests us is the relationship between X 1 and X 2 , where X 1 is the variable V T (or dam factor) and X 2 is the variable V F (or D max ) corresponding to tailings dams. Figures 1a, 2a and 3a show the scatter plots between the pairs (see [1]), with the most extreme cases highlighted.
We approach this problem using the concept of copula, where if H is the cumulative distribution function of (X 1 , X 2 ), being those continuous random variables, there is a function C such that for all (x, y) 2 Image (X 1 , X 2 ), If the function C is the 2-copula of (X 1 , , then, C is the joint distribution of the variables U :¼ F 1 (X 1 ) and V :¼ F 2 (X 2 ). The function C is the one that we want to identify. The result given by (1)see [4] allows the decomposition of H into three functions: the marginal distribution F 1 , the marginal distribution F 2 , and the copula function C. F 1 and F 2 are marginally determined, that is, it is enough to inspect the stochastic behavior of X 1 to determine F 1 and similarly, X 2 is enough to determine F 2 . Already to determine C, we need both variables duly transformed by their marginal distributions. Copula functions cover all dependence types, since "(u, v) 2 [0, 1] 2 and for all copula C(u, v), (X 1 , X 2 ) has 2-copula M (W) if and only if X 2 is a monotone nondecreasing (nonincreasing) funcion of X 1 almost surely (see [5]). This property demonstrates that for X 1 and X 2 verify a linear relationship the 2-copula of (X 1 , X 2 ) can only be M (Spearman correlation coefficient = 1) or W (Spearman correlation coefficient = À1). In the next example, we show a family of copulas indexed by a parameter h 2 [1,1). This allows covering the cases in the spectrum considered by the result (2).

Example 2.1
Consider (u, v) 2 [0, 1] 2 , h 2 [1, 1), the Gumbel-Hougaard copula is given by,  The Gumbel-Hougaard family is a family inside the Archimedean class (see [3]). That being, then there is a function /: [0, 1] ? [0, 1] associated with this family which generates it. In this case, / h (t) :¼ (Àln(t)) h and the generation of C occurs since the following analytical expression is allowed: where / À1 h ðzÞ ¼ expðÀz 1 h Þ: Note that C(u, v|h = 1) = uv (independence between the random variables) and C(u, v|h) ? M(u, v) when h ? 1 (positive and strict dependence between the variables). This means that the Gumbel-Hougaard family allows to model positive dependence types with a spectrum that goes from independence to perfect positive dependence, not being possible for this family the cases from independence to the copula W. Also, the Kendall's tau coefficient is given by s h ¼ hÀ1 h 2 ½0; 1 because h 2 [1, 1). So, the spectrum of s h summarizes the positive dependence attained by the family. {C(Á,Á|h)} h2[1,1) is a positively ordered family since, given h, h 0 such that Features such as those mentioned for the Gumbel-Hougaard copula serve as a framework to interpret the dependence of random variables. This subject will be central in the next sections when the dependence between the variables of interest is determined.
In the application, we compare several copula models, some of them to be formulated considering the equation (3) (Archimedean copulas) with appropriate generators / h , for which the pseudo-inverse of e Àh À1 , h 2 (À1, 1)\{0}, the model generated by equation (3) (3) is called Joe family. Clayton and Frank reach the limits W and M, see equation (2). Joe's family follows the same pattern as the Gumbel-Hougaard, only allowing cases from the independence to M. We also consider two well-known copulas belonging to the family of elliptical copulas with the shape, Cðu; vjqÞ :¼ wðw À1 ðuÞ; w À1 ðvÞjqÞ; ðu; vÞ 2 ½0; for an appropriate function w and parameter q 2 [À1, 1]. (i) The Gaussian copula given by w(t) :¼ U(t) which is the usual cumulative standard Gaussian distribution, N(0, 1) and w(s, t|q) :¼ U(s, t|q) which is the bivariate standard which is the cumulative of the univariate t-Student distribution with g degrees and w(s, t|q) :¼ T g (s, t|q) that is the bivariate cumulative t-Student distribution with g degrees of freedom and q correlation. As can be verified by the examples of copulas registered here, there is a huge diversity concerning the possibilities of definition for the relationships between X 1 and X 2 . Many of these are available in statistical libraries (see for example copula package in R project [R-project], 1 that in our case is used for the modeling).
With the variety of models introduced previously, we wish to cover a considerable range of dependence types that allow us to determine the best representation of the dependence between X 1 and X 2 . For comparison between the models, we will adopt a model selection criterion (see [6]).

Model representation
In order to proceed with the estimation of the model, the original observations fðx 1i ; x 2i Þg n i¼1 are replaced by their re-scaled marginal ranks to [0, 1], u i :¼ and v i :¼ jfj:1 j n; x 2j x 2i gj nþ1 ; i = 1, . . ., n, where |A| denotes the cardinal of the set A. This step has two purposes, namely to neutralize the effect arising from the (possibly) difference in measurement scale of the original observations and to mitigate the problems of evaluating the copula likelihood function in the boundaries of the region [0, 1] Â [0, 1]. This latter task is done by dividing the ranks by n + 1 so that the pseudoobservations, fu 1i g n i¼1 (or fv 2i g n i¼1 Þ, never attain 0 or 1. Pseudo observations are nonparametric estimates of U = F 1 (X 1 ) (or V = F 2 (X 2 )), furthermore, the function C is the distribution of (U, V), which leads us to infer that the dependence exposed by equation (1) is revealed when exploring the dispersion between the pseudo-observations. See the scatterplot for the three cases: Figures 1b, 2b and 3b. In order to guarantee some conditions required by the models, we test for stochastic independence and exchangeability. Originally proposed by [7], the independence test applied here, is based on the distance between the empirical copula and the product copula P(u, v) = uv and is implemented using simulations executed by the indepTest() function from the copula R-package. We test H 0 : U and V are independent and, H 0 is rejected with p-value < 0.001 in all the cases. Such evidence supports the search for a model that represents the dependence. A rather desirable property of the dependence is the exchangeability, a condition requested by many families of copulas including the Archimedean and the elliptical ones. A copula is exchangeable if the property C(u, v|h) = C(v, u|h) is verified for all (u, v) 2 [0, 1] 2 and the test proposed by [8], based on the empirical copula, is implemented using simulations by the function exchTest() of the copula R-package. Then, we test if H 0 : U and V are exchangeable, and H 0 is not rejected in all the cases, with p-values much greater than the generally used significance level of 5%. Such evidence supports to use the models given by Archimedean and elliptical copulas. In order to define the appropriate copula in the three situations we use the function fitCopula(), with arguments (a) copula and (b) method being (a) "claytonCopula(dim = 2)", "frankCopula(dim = 2)", "gumbelCopula(dim = 2)", "joeCopula(dim = 2)", "normalCopula(dim = 2)", "tCopula(dim = 2)" and being (b) method = "mpl" (maximum pseudo likelihood) which is the maximum log-likelihood (MLL) method evaluated on the pseudo observations.
Given a copula C its density c is computed and the log-likelihood is given by lnð Q n i¼1 cðu i ; v i ÞÞ which is maximized in the underlying parameters to obtain MLL ðC; fðu i ; v i Þg n i¼1 Þ, related to the model C and the set fðu i ; v i Þg n i¼1 . Note that five of these models have one parameter while the t-Student copula model has two parameters, so that a penalty is applied to this model in order to promote a fair selection. We consider the Bayesian Information Criterion (BIC) for this purpose, see [6], where N is the total number of parameters of C and n = 30 is the number of observations in the dataset. The BIC takes into account the trade-off between the maximum log-likelihood, MLL ðC ; fðu i ; v i Þg n i¼1 Þ, and the penalty imposed by the second term, 1 2 N lnðnÞ, which is done to enforce parsimony, i.e., simpler models having fewer parameters. According to the BIC, the higher the value taken by the equation (5), the better the model.
In the next subsection, we show the results produced by the selection procedure (selecting the best copulas) and also the estimation of the parameters. For the latter, we show the classical estimator as well as two Bayesian versions.

Estimation
We note that in the three cases, the two best models (copulas) are the Gaussian and the Gumbel-Hougaard (see Tabs. 1 and 2). It is clear that due to the size of the database (n = 30) the reliability of the classical parameter estimates may not be the best possible. This aspect can be addressed in a timely manner, but for now we can get some conclusions about the type of relationship in each case.
The Gaussian copula confirms that assuming linearity between X 1 and X 2 could not be indicated. If the relationship between these variables were in fact linear, the Gumbel-Hougaard copula could have detected this fact by attributing a very large estimate for h (in the direction of the M(u, v) copula), see Section Theoretical Background. For all the cases, (1) V T vs. V F , (2) H Â V F vs. D max , and (3) H f vs. D max , the estimation of h is moderateĥ 2 (2,3] showing evidence of the non-applicability of the M copula. In order to introduce a level of confiability to the parametric estimation process, we conduct a Bayesian estimation. We apply Hamiltonian Monte Carlo (HMC) simulations through the rstan R-package in two Bayesian strategies, described below and in general terms, Non-Informative (NI) and Informative (I). In Table 3, we show the results for each of the three cases and for each of the two best models pointed out by the BIC, see Tables 1 and 2. For the Non-Informative cases of the Gaussian copula, we use an uniform prior distribution for q, properly accommodated in the interval [À1, 1], which is the range of possibilities of the q parameter. The Informative cases of Gaussian copula are implemented assuming a Beta priori distribution for q, accommodated in [À1, 1], where its mode is the one derived from the relation between Kendall's tau coefficient and the copula association parameter. For instance, by means of the empirical estimation of Kendall's tau coefficient we can obtain an estimation of the parameter (see iTau() function from copula R package). The NI cases of Table 1. Parameters estimated by Maximum Log-Likelihood (MLL) and BIC values for the copula between the pseudo-observations ranks(V T ) and ranks(V F ); degrees of t-Studentĝ = 3.529 also estimated by MLL using the tCopula() function of the copula R-package.

Copula
Parameter Gumbel-Hougaard copula are implemented assuming as a prior distribution on h an improper prior distribution (proportional to 1), while the Informative cases are implemented by assuming a Gamma distribution on h accommodated in [1, 1) with mode indicated by the parameter estimation derived from the iTau() function. While the informative scenario (I) aims to follow the trend of the data, the non-informative (NI) seeks objectivity, without risking in declaring trends. Table 3 shows convergence indicators of the simulations and a descriptive summary of the values taken by the posterior distributions of q (or h) obtained by the simulations. We note that Bayesian estimates (by quadratic and multilinear loss) return values close to the classic estimates reported in Tables 1 and 2, which gives coherence to the adjustments.
In this section, we have determined the best representation of the dependence between X 1 and X 2 in each of the three cases, which happens to be given by the Gaussian copula in the first instance and by the Gumbel-Hougaard copula in the second instance. We also implemented the Bayesian estimation of the parameters involved, determining among other aspects an estimation of the dependence structure, given by Cðu; vjq B Þ in the case of Gaussian copula (Eq. (4)) and Cðu; vjĥ B Þ for the Gumbel-Hougaard copula (Example 2.1), whereq B andĥ B are the Bayesian estimators given by a certain loss function.

Chances for large values of D max
In this section, we approach the specific problem of calculating the probability that the material released by the dam will exceed a certain distance, given a specific risk represented by dam factor values. To address this issue, we use the copula determined by the BIC and the Bayesian estimate presented in Table 3. Given that the perspective via copulas translates the problem to the ranks of the observations, we first address the questions for the ranks of the observations, in this paper denoted by U and V.
Under the Gaussian copula, we estimate the probability (6) using the Bayesian estimator by quadratic loss function of q, say b q B : In the same way, under the Gumbel-Hougaard copula, we estimate the probability (6) using the Bayesian estimator by quadratic loss function of h, sayĥ B . Then, the estimators of (6) are, and, respectively. Each one of the estimates (Eqs. (7) and (8)) presents two versions, one under the Non-Informative setting and another under the Informative setting, see Table 3. Tables 4 and 5 record the estimated values according to equations (7) and (8). Table 4 shows the estimates when the dam factor is H Â V F and Table 5 shows the estimates when the dam factor is H f . We note that the highest probabilities occur with the estimate implemented by the Gumbel-Hougaard copula (Eq. (8)) for the intervals of the extremes (0, 0.25] and (0.75, 1], while for the central intervals, the Gaussian copula (Eq. (7)) offers the highest values. This behavior is displayed for both types of dam factor (Tabs. 4 and 5). The Gumbel-Hougaard copula is well known as having more realistic properties in extreme values (see [9]); in the copula's framework, this means near to the values 0 or 1. Table 4. Case H Â V F vs. D max (see Tab. 3). From left to right: interval (a, b], v lower threshold of V, P i = Prob i (V > v|U 2 (a, b]) following equation (8), i = 1, 2 and equation (7), i = 3, 4, Non Informative settings i = 1, 3 and Informative settings i = 2, 4. In bold, the highest probability per line.

Interval (a, b]
Threshold v Gumbel H P 1 (NI) Gumbel H P 2 (I) Gaussian P 3 (NI) Gaussian P 4 (I) In Tables 6 and 7, we  To compute probabilities in real cases, it is enough to identify on the original scale the dam factor interval associated with U 2 (a, b] and also the threshold in V, v associated with D max . For instance, consider the case 11 of [1], first line in Table 8. Both versions of dam factor are in the interval (66.5, 2880], which is associated with (a, b] = (0.75, 1], see Table 6. This means that, according to the estimations given by Tables 4 and 5, the probability P(V > 0.75|U 2 (0.75, 1]) is !0.6. While the threshold v = 0.75 is related to D max = 25 (see Tab. 7). This already indicates that there is a high probability of the released material traveling 25 km or more. Besides, the computation by the Gumbel-Hougaard copula offers estimates that exceed 0.7. In fact, this case has an observed D max = 120. This simple exercise of using Tables 4 and 5 shows the ability of the copulas to identify risks in real situations, for example, if one wants to calculate the probability of the material released reaching a certain point as a city, a river, etc. The practical impact of these estimates also exposes the need of defining an estimator of the dam factor. That, according to the notions used here H Â V F and The next subsection is intended to estimate V F , since an adequate estimate of V F guarantees a reliable estimate of the dam factor. For such, we will take the V T value as an available information.
Expected Value for V F given V T As proved in [10], the expected value EðV jU 2 ðc; dÞ can be computed by integrating the copula function, Under the Gaussian copula we estimate the equation (9) by means of the Bayesian estimator by quadratic loss function of q, say b q B (see Tab. 3). The estimator of equation (9) is defined as, In the same way, and under the Gumbel-Hougaard copula we estimate the equation (9) using the Bayesian estimator by quadratic loss function of h, sayĥ B (see Tab. 3). Here is the estimator: In Table 9, we clearly visualize the effect of the copula on the magnitude of the mean values, the Gumbel-Hougaard copula offers the highest values in the extreme intervals and the Gaussian copula does it in the central intervals. As expected, with increasing the U interval (V T scaled to [0, 1]) the mean of V increases (V F scaled to [0, 1]). In Table 10, we record in the first column the ranges of V F values corresponding to the intervals to the left of Table 9, these last ranges made in the percentages of V F . Table 10 offers naive estimates for V F , based on V T intervals. For example, E 2 ðV F jV T 2 ðc; dÞ is the estimator of V F when V T 2 (c, d] and computed from the Gumbel-Hougaard copula with an Informative setting. Next, we compare the intervals in which the naive estimates fall, with those reported in Table 8 and, according to Table 6. Regarding line 1 of Table 8: since V T = 12.34, this value is associated with the interval (7.040, 74.00] of Table 10, which brings us to an average of V F estimated asÊ i (V F │V T 2 (7.04, 74]) = 32. When calculating the dam factor-(i), we obtain H ÂÊ i (V F |V T 2 (7.04, 74]) = 15 Â 32 = 480 2 (66.5, 2880], note that the observed value 135 (Tab. 8) is also in the interval (66. 5,2880]. Computing the dam factor-(ii), we obtain H ÂÊ iðV F jV T 2ð7:04;74Þ V T ÂÊ i (V F |V T 2 (7.04, 74]) = 15 Â 32/12.34 Â 32 = 1244.733 2 (13.85, 1675.64], which is in fact the interval in which was found the observed data (98.46). In other words, the estimates and observed values are accommodated in the same risk interval, reported in Table 6. If we consider the Samarco case (second case in Tab. 8), we have that the estimation iŝ E i (V F |V T 2 (7.04, 74]) = 32, since the observed V T = 55 2 (7.04, 74], and in fact the observed value is V F = 32. Then, the dam factor estimates are the same as those recorded in Table 8. Table 9. E i , i = 1, 2 computed using Gumbel-Hougaard copula (Eq. (11)), E i , i = 3, 4 computed using Gaussian copula (Eq. (10)), i = 1, 3 from Non Informative settings, i = 2, 4 from Informative settings. In brackets, the standard deviation of the expected values. In bold, the highest mean expected values per line.

Conclusion
In this article, we model the dependence between each version of the dam factor and D max with dam factor given by the risk factor of a reservoir that depends on its dam's height (total volume) H (V T ) and the volume of material released V F after a collapse. D max is given by the maximum distance reached by the dispensed material. In this way, it is possible to accurately describe the performance of the dependence relationship between each dam factor and D max . Furthermore, using the BIC criterion we can point out the best representation, given in our case by the Gaussian copula (see Tab. 2). We were also able to delineate the dependence between the volume available from a reservoir V T and the material's volume V F capable of being expelled after a collapse (see Tab. 1).
Through a Bayesian study, we offer scenarios that aim to give flexibility to the representation of the dependence. Under an assumption of lack of prior information, the NI scenario would be the one indicated (see Tab. 3). With the alliance between Bayesian tools and copula models we can determine tail probabilities of D max values scaled to [0, 1] and conditioned at dam factor intervals (see Tabs. 4 and 5). With them, we determine risks for reservoirs, depending on the dam factor.
We also present in this article a Bayesian and non-parametric method to infer the dam factor of a reservoir (see Sect. Expected Value for V F Given V T ). We show its efficiency for some available cases in the literature. Note that such a method could be improved by univariate modeling techniques in future research studies. We see that the Gaussian copula has a competitor, which is the Gumbel-Hougaard copula (Tabs. 1 and 2), which offers slightly different (and higher) probabilities for extreme events (see Tabs. 4 and 5). On the other hand, when using the Gumbel-Hougaard copula in calculating the dam factor (see Tabs. 9 and 10), it produces results comparable to those of the Gaussian copula, which shows that the notion of mean hides certain stochastic characteristics. For a complement of other aspects addressed in this problem, see [11]. We consider this work as the starting point of other approaches enriched with marginal models, since here we treat the marginal effect only from a non-parametric perspective (ranks), which may show restrictions, in the presence of few data.