A copula-based quantifying of the relationship between race inequality among neighbourhoods in São Paulo and age at death

In this paper, we combine two statistical tools with the objective of creating models that represent the dependence between (i) the proportion of the black/brown population in relation to the total population of a neighborhood (pct) and (ii) the average age at which people died in the neighborhood (age). We explore the dependence between pct and age in São Paulo city, Brazil, during 2018. The statistical tools are copula models and informative and non-informative settings under the Bayesian perspective. The different scenarios and models allow us to delineate the dependence between pct and age and, through the Bayesian Information Criterion, to indicate which of these models best represents the data. The approach implemented here allows us to estimate variations in life expectancy conditioned on intervals of pct. From these estimates we conclude that, on average, all the scenarios point to a decrease in life expectancy as pct increases. When conditioning pct on the 4 intervals (0, 0.25], (0.25, 0.5], (0.5, 0.75], (0.75, 1], we note that the expectation is reduced, on average, at a constant rate from each interval to the next, moving from left to right in [0, 1].


Introduction
Social inequality is broadly present in Latin America, despite the profound cultural differences, the economic realities and the migratory movements that shaped its societies as they are today. After nearly 300 years of enslavement of black people brought from Africa, Brazil was one of the last countries to officially abolish slavery but, unlike other countries, it made no organized effort to include the formerly enslaved in formal society. As a result, access to quality basic education and to other civil rights is not uniform across ethnic groups, and the inequality of opportunity arising from racism gains more attention in Brazilian society each year. In this paper, we investigate and model the relationship between two indicators recorded for the neighborhoods of São Paulo city in 2018: (i) pct, the proportion of the black/brown population in relation to the total population of the neighborhood, and (ii) age, the average age at which people died in the neighborhood. Our goal is to describe the dependence between these variables. The data set treated here is available at https://www.nossasaopaulo.org.br/. We focus this study on the city of São Paulo, Brazil, because we found quality records depicting the reality that we wish to describe. At the same time, São Paulo shows a great diversity, which is quite representative of the entire country.
In this paper we determine and model the dependence between (i) and (ii) through copula models [1]. After estimating the underlying parameters of each copula model with frequentist methods based on the pseudo-observations, we select the best copula through the Bayesian Information Criterion (see [2]). Then, we implement a Bayesian estimation process for the parameters of the copula, giving greater confidence and flexibility to our estimates. Finally, we describe the behaviour of life expectancy conditioned on certain percentile ranges of pct, with the purpose of indicating how this expectation changes as those ranges change. This paper is organized as follows: Section Theoretical Background introduces the models that will be investigated to determine the dependence between pct and age. The data are also inspected in that section. Section Estimation introduces the model selection procedure and the estimation process for the underlying parameters. The results are also presented in that section. Section Expected Value for Age at Death shows a study on the life expectancy in the neighborhoods, conditioned on percentiles of the variable pct. The conclusions are given in Section Conclusions.

Theoretical background
In this section, we briefly introduce the notion of copula models. We also present the specific models that we applied to the real problem, which are compatible with the type of dependence that the data show. Given a pair of continuous random variables $X_1$ and $X_2$ with marginal cumulative distribution functions $F_1$ and $F_2$, if $H$ is the bivariate cumulative distribution function of $(X_1, X_2)$, there is a function $C$ such that, for all $(x, y) \in \mathrm{Image}(X_1, X_2)$,

$$H(x, y) = C(F_1(x), F_2(y)). \tag{1}$$

Here $C$ is the 2-copula of $(X_1, X_2)$, with $C : [0, 1]^2 \to [0, 1]$. Then $C$ is the joint distribution of the variables $U = F_1(X_1)$ and $V = F_2(X_2)$, see [3]. The function $C$ is the one we want to identify based on a paired data set related to $(X_1, X_2)$. Copula models cover all dependence types, including the linear one. We consider two well-known copulas belonging to the family of elliptical copulas, with the shape

$$C(u, v \mid \rho) = \psi(\psi^{-1}(u), \psi^{-1}(v) \mid \rho), \quad (u, v) \in [0, 1]^2, \tag{2}$$

for an appropriate function $\psi$ and parameter $\rho \in [-1, 1]$. The cases of the form (2) considered here are (i) the Gaussian copula, given by $\psi(t) = \Phi(t)$, the cumulative standard Gaussian distribution $N(0, 1)$, and $\psi(s, t \mid \rho) = \Phi(s, t \mid \rho)$, the cumulative bivariate standard Gaussian distribution, zero centered, $N_2(0, \Sigma)$ with $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$; (ii) the t-Student copula, given by $\psi(t) = T_\eta(t)$, the cumulative univariate t-Student distribution with $\eta$ degrees of freedom, and $\psi(s, t \mid \rho) = T_\eta(s, t \mid \rho)$, the cumulative bivariate t-Student distribution with $\eta$ degrees of freedom and correlation $\rho$. As we see, in the elliptical copula models the parameters modulate the degree of dependence.
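As a rough numerical illustration of the Gaussian case of equation (2), the sketch below evaluates $C(u, v \mid \rho)$ using only the Python standard library. It relies on the fact that, for a standard bivariate Gaussian pair with correlation $\rho$, the second coordinate given the first is Gaussian with mean $\rho z$ and variance $1 - \rho^2$. The function name, the `steps` argument and the midpoint quadrature are assumptions made for this example; the paper itself works with the copula R-package.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian N(0, 1)


def gaussian_copula_cdf(u, v, rho, steps=4000):
    """Approximate C(u, v | rho) for the Gaussian copula, |rho| < 1.

    Uses P(U <= u, V <= v) = integral over s in (0, u) of P(V <= v | U = s),
    with P(V <= v | U = s) = Phi((z_v - rho * z_s) / sqrt(1 - rho^2)).
    """
    if u <= 0.0 or v <= 0.0:
        return 0.0
    u = min(u, 1.0)
    if v >= 1.0:
        return u                      # uniform margin: C(u, 1) = u
    z_v = _STD.inv_cdf(v)
    scale = math.sqrt(1.0 - rho * rho)
    h = u / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * h             # midpoint, strictly inside (0, 1)
        z_s = _STD.inv_cdf(s)
        total += _STD.cdf((z_v - rho * z_s) / scale)
    return total * h
```

With $\rho = 0$ the function returns the product $uv$ (independence), and at $u = v = 0.5$ it can be compared with the known closed form $1/4 + \arcsin(\rho)/(2\pi)$.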
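There are other formulations of very useful copulas, for example the Archimedean copulas, which follow the form

$$C(u, v) = \phi^{[-1]}(\phi(u) + \phi(v)), \quad (u, v) \in [0, 1]^2, \tag{3}$$

where $\phi$ is a continuous, strictly decreasing function from $[0, 1]$ to $[0, \infty]$ with $\phi(1) = 0$, and $\phi^{[-1]}$ denotes its pseudo-inverse. Equation (3) is a copula if and only if $\phi$ is convex. An example is the Frank family [5], indexed by a parameter $\theta \in (-\infty, \infty) \setminus \{0\}$,

$$C(u, v \mid \theta) = -\frac{1}{\theta} \ln\left(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1}\right), \quad (u, v) \in [0, 1]^2.$$

It covers a wide range of dependence types.
With the variety of models introduced above we wish to cover a considerable range of dependence types, which allows us to determine the best representation of the dependence between $X_1$ and $X_2$. To compare the models we adopt a model selection criterion, see [2].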
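The Frank family, in its standard parameterisation $C(u, v \mid \theta) = -\frac{1}{\theta}\ln\left(1 + \frac{(e^{-\theta u}-1)(e^{-\theta v}-1)}{e^{-\theta}-1}\right)$ with $\theta \neq 0$, can be evaluated in a few lines. The following is a minimal sketch, assuming that parameterisation; the function name is chosen for this example only.

```python
import math


def frank_copula_cdf(u, v, theta):
    """Frank copula C(u, v | theta) for theta != 0 (standard form)."""
    if theta == 0.0:
        raise ValueError("theta must be non-zero")
    # expm1(x) = exp(x) - 1, accurate also when x is close to zero
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    den = math.expm1(-theta)
    return -math.log1p(num / den) / theta
```

Negative values of $\theta$ give negative dependence, and values of $\theta$ close to zero approach the independence copula $uv$.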

Race and life expectancy
The data set analysed here can be obtained from https://www.nossasaopaulo.org.br/. It corresponds to the paired $(X_1, X_2)$ information of 96 neighborhoods of São Paulo city, Brazil. For each neighborhood we consider (i) the proportion of the black/brown population in relation to the total population of the neighborhood, pct ($X_1$), and (ii) the average age at which people died in the neighborhood, age ($X_2$). The data refer to the year 2018. The variables show a strong negative dependence, with Spearman's correlation coefficient $\rho_s = -0.9705$. We found a huge variability between neighborhoods: for example, Alto de Pinheiros records $X_2 = 79.09$ years and $X_1 = 8.06$ per cent, while Cidade Tiradentes records $X_2 = 57.31$ years and $X_1 = 56.07$ per cent. That is a difference of more than 20 years in the variable $X_2$, in favor of Alto de Pinheiros, while the variable $X_1$ is about 7 times larger in Cidade Tiradentes. The scatter plot of the paired observations can be seen in Figure 1a. Figure 1a shows the dependence between the observations in a general way, and Figures 1b and 1c show the relationship in specific cases. Figure 1b shows pct vs. age for the 25 neighborhoods with the highest percentage of white population. Figure 1c shows pct vs. age for the 25 neighborhoods with the highest percentage of black/brown population. In Figures 1b and 1c one can note that the average age at death is less dispersed in the neighbourhoods where the white population is the majority. We also see how the linearity of the dependence seen in Figure 1a begins to be lost where the black/brown population predominates (Fig. 1c).
The study presented here deals with the dependence between pct and age, that is, we will describe the problem in terms of the copula that results from the selection of models.
In the next section we present the model selection process and the estimation of the underlying parameters.

Estimation
The original observations $\{(x_{1i}, x_{2i})\}_{i=1}^{n}$ are replaced by their marginal ranks re-scaled to [0, 1],

$$u_i := \frac{|\{j : x_{1j} \le x_{1i}\}|}{n + 1}, \qquad v_i := \frac{|\{j : x_{2j} \le x_{2i}\}|}{n + 1}, \qquad i = 1, \dots, n,$$

where $|A|$ denotes the cardinal of the set $A$. In fact, the function $C$ is the distribution of the paired ranks of the observations, which leads us to infer that the dependence described by equation (1) is exposed when exploring the dispersion between the paired ranks of the observations (pseudo-observations). See the scatterplot in Figure 2.
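The rank transformation can be sketched as follows. This is only an illustration: it assumes that the ranks are divided by $n + 1$ (the convention of pobs() in the copula R-package), and the toy values below are hypothetical, not taken from the data set.

```python
def pseudo_observations(x):
    """Return rank(x_i) / (n + 1), with rank(x_i) = |{j : x_j <= x_i}|."""
    n = len(x)
    return [sum(1 for xj in x if xj <= xi) / (n + 1) for xi in x]


# Hypothetical toy values, NOT taken from the data set of the paper.
pct = [8.1, 56.1, 30.2, 21.7, 44.9]
age = [79.1, 57.3, 66.0, 71.4, 60.8]

u = pseudo_observations(pct)   # ranks of pct, re-scaled to (0, 1)
v = pseudo_observations(age)   # ranks of age, re-scaled to (0, 1)
```

In this toy sample the largest pct is paired with the smallest age and so on, so every pair satisfies $u_i + v_i = 1$, the extreme case of negative dependence.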
The three commands indepTest(), exchTest() and radSymTest() come from the copula R-package; each of them allows us to verify the compatibility of the models with the data. To check some preliminary conditions, we test H0: U and V are independent by means of indepTest(), and H0 is rejected with p-value < 0.001. A rather desirable property of dependence is exchangeability, a condition required by many families of copulas, including the Archimedean and the elliptical ones. So, we test H0: U and V are exchangeable (C(u, v) = C(v, u)) using exchTest(), see [6], and H0 is not rejected, with p-value = 0.2562. Radial symmetry (an important characteristic of the Frank family) was tested with the command radSymTest() (see [7]), with H0: there is radial symmetry. The test returns a p-value = 0.1593, indicating that this property may hold for the data.
In order to define the appropriate copula we use the function fitCopula() of the copula R-package, with arguments (a) copula and (b) method, where (a) is one of frankCopula(dim = 2), normalCopula(dim = 2) (the Gaussian copula) and tCopula(dim = 2), and (b) is method = "mpl" (maximum pseudo likelihood), that is, the maximum log-likelihood (MLL) method evaluated on the pseudo-observations. Given a copula $C$, its density $c$ is computed and the log-likelihood $\ln\left(\prod_{i=1}^{n} c(u_i, v_i)\right)$ is maximized over the underlying parameters to obtain $\mathrm{MLL}(C, \{(u_i, v_i)\}_{i=1}^{n})$, related to the model $C$ and the set $\{(u_i, v_i)\}_{i=1}^{n}$. Note that 2 of these models have 1 parameter while the t-Student copula model has 2 parameters, so a penalty is applied to the models in order to promote a fairer selection. We consider the Bayesian Information Criterion (BIC) for this purpose, see [2].
In equation (4), N is the total number of parameters of C and n = 96 is the number of observations in the dataset; the criterion subtracts from the maximized log-likelihood MLL a penalty that grows with N and with ln(n). According to the BIC, the higher the value taken by equation (4), the better the model. In the following subsection we show the results of the model selection procedure and the classical and Bayesian estimation of its parameters.
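The selection step can be illustrated with a small sketch for the Frank copula. It assumes the Schwarz form of the criterion, $\mathrm{MLL} - \frac{N}{2}\ln(n)$, in which higher values are better (doubling the expression gives the same ranking of models). The pseudo-observations below are hypothetical, and a crude grid search stands in for the optimiser that fitCopula() would use; all function names are chosen for this example.

```python
import math


def frank_log_density(u, v, theta):
    """Log-density of the Frank copula at (u, v), theta != 0."""
    a = -math.expm1(-theta)                                  # 1 - exp(-theta)
    b = math.expm1(-theta * u) * math.expm1(-theta * v)      # (1 - e^{-tu})(1 - e^{-tv})
    num = theta * a * math.exp(-theta * (u + v))             # always positive
    return math.log(num) - 2.0 * math.log(abs(a - b))


def pseudo_loglik(theta, u, v):
    """Log-likelihood of the Frank copula on the pseudo-observations."""
    return sum(frank_log_density(ui, vi, theta) for ui, vi in zip(u, v))


def bic(mll, n_params, n_obs):
    """Assumed Schwarz form, higher is better: MLL - (N / 2) * ln(n)."""
    return mll - 0.5 * n_params * math.log(n_obs)


# Hypothetical pseudo-observations with negative dependence (n = 8).
n = 8
u = [i / (n + 1) for i in range(1, n + 1)]
v = [r / (n + 1) for r in (8, 6, 7, 5, 3, 4, 1, 2)]

# Crude grid search over theta in [-60, -0.5], for illustration only.
grid = [-0.5 * k for k in range(1, 121)]
theta_hat = max(grid, key=lambda t: pseudo_loglik(t, u, v))
mll = pseudo_loglik(theta_hat, u, v)
bic_frank = bic(mll, n_params=1, n_obs=n)
```

The same `bic` function applied to a two-parameter model with an equal MLL returns a smaller value, which is the penalty at work.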

Results
We note that the two best models (copulas) are Frank and Gaussian, see Table 1. In this selection we have considered a classical estimation perspective, but we also show the Bayesian versions, which give our results greater flexibility.
In Table 2 we show the results of the Bayesian analysis. We apply Hamiltonian Monte Carlo (HMC) simulations through the rstan R-package in two settings, (i) a Non-informative (NI) setting and (ii) an Informative (I) setting, using Table 1. Regarding the Frank model, for (i) we use an improper prior distribution on $\theta$ (proportional to a constant), and for (ii) we use a Gaussian distribution on $\theta$, with mode equal to −25.832 and standard deviation equal to 5. Regarding the Gaussian model, for (i) we use a non-informative prior distribution on $\rho$ (proportional to a constant), and for (ii) we use a Transformed Beta distribution $-1 + 2B$, where $B \sim \mathrm{Beta}(1.8, 58)$, on $\rho$, with mode equal to −0.974. For the settings (ii), the mode of the prior distribution was built through the function iTau() of the copula R-package (moment method). That is, from the empirical estimate of Kendall's tau coefficient we obtain an estimate of the parameter, which is used in those settings as the mode of the prior distribution.
As expected, under the NI settings the Bayesian estimators under the quadratic/multi linear loss function (Tab. 2, in bold) take values very close to the classical estimates, see Table 1, column 2. This evidence strengthens our confidence in the fits found. The I settings show how the posterior distribution is affected by a prior distribution built with excessive influence of the observations. In Figures 3b and 4b one can see the influence of these prior distributions on the posterior distributions of $\theta$ and $\rho$, the grey lines representing the non-informative prior and the black lines the informative prior distribution, based on the Gaussian distribution (Fig. 3) and on the Transformed Beta distribution (Fig. 4). The traces plotted in Figures 3a and 4a indicate that the chains converged: neither graph shows signs of non-stationarity, runs of consecutive simulations moving in a single direction, or runs of equal simulations. This pattern, similar to white noise, is the one expected in case of convergence.
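The moment method mentioned above inverts the relation between Kendall's tau and the copula parameter. A minimal sketch of that inversion is given below, assuming the standard relations $\rho = \sin(\pi\tau/2)$ for the Gaussian copula and $\tau = 1 - \frac{4}{\theta}\left(1 - D_1(\theta)\right)$ for the Frank copula, where $D_1$ is the first Debye function. The function names, the midpoint quadrature and the bisection bounds are choices made for this example; they are not the implementation of iTau().

```python
import math


def kendall_tau(x, y):
    """Empirical Kendall's tau (tau-a; assumes no ties)."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)       # +1 concordant, -1 discordant
    return 2.0 * s / (n * (n - 1))


def gaussian_rho_from_tau(tau):
    """Gaussian copula: rho = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2.0)


def frank_tau(theta, steps=4000):
    """Kendall's tau of the Frank copula, theta != 0."""
    t = abs(theta)
    h = t / steps
    # D1(t) = (1/t) * integral_0^t s / (exp(s) - 1) ds, by the midpoint rule
    d1 = sum((k + 0.5) * h / math.expm1((k + 0.5) * h) for k in range(steps)) * h / t
    tau = 1.0 - 4.0 / t * (1.0 - d1)
    return math.copysign(tau, theta)           # tau is an odd function of theta


def frank_theta_from_tau(tau, lo=1e-6, hi=200.0, iters=50):
    """Invert frank_tau by bisection on |theta|, then restore the sign."""
    target = abs(tau)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if frank_tau(mid) < target:
            lo = mid
        else:
            hi = mid
    return math.copysign(0.5 * (lo + hi), tau)
```

Under these relations, the two prior modes quoted in the text, θ = −25.832 and ρ = −0.974, both correspond to a Kendall's tau close to −0.855.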

Expected value for age at death
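Once the best copula model for the data is chosen, one can estimate quantities of interest that quantify the inequality between race and life expectancy. We analyse first, based on the pseudo-observations, $E(V \mid U \in (a, b])$, the mean life expectancy given that the share of black and brown people in the whole neighbourhood population belongs to a specific interval (a, b],

$$E(V \mid U \in (a, b]) = \int_0^1 v \, c_{V \mid U \in (a, b]}(v) \, dv, \tag{5}$$

where $c_{V \mid U \in (a, b]}(\cdot)$ denotes the conditional density of the random variable $V \mid U \in (a, b]$, which by definition is

$$c_{V \mid U \in (a, b]}(v) = \frac{\partial}{\partial v} \mathrm{Prob}(V \le v \mid U \in (a, b]). \tag{6}$$

Note that both equations (5) and (6) depend on the underlying copula parameter ($\theta$ for Frank and $\rho$ for Gaussian). We omit the parameter in order to simplify the notation.
The conditional expectation $E(V \mid U \in (a, b])$ allows us to restrict the problem to percentage bands: if $U \in (a, b]$, we are considering the pseudo-observations of pct between a and b, and under this assumption a natural question is, what is the life expectancy? To answer this, we must first compute and estimate equation (5), which we do in the following way,

$$\mathrm{Prob}(V \le v \mid U \in (a, b]) = \frac{C(b, v) - C(a, v)}{b - a}, \tag{7}$$

and from equations (6) and (7) we have

$$c_{V \mid U \in (a, b]}(v) = \frac{1}{b - a} \frac{\partial}{\partial v}\left[C(b, v) - C(a, v)\right]. \tag{8}$$

Integrating by parts, and using $C(b, 1) - C(a, 1) = b - a$, it is verified that

$$\int_0^1 v \, \frac{\partial}{\partial v}\left[C(b, v) - C(a, v)\right] dv = (b - a) - \int_0^1 \left[C(b, v) - C(a, v)\right] dv. \tag{9}$$

We can finally compute $E(V \mid U \in (a, b])$,

$$E(V \mid U \in (a, b]) = 1 - \frac{1}{b - a} \int_0^1 \left[C(b, v) - C(a, v)\right] dv. \tag{10}$$

As mentioned above, $C$ depends on the parameter, so strictly speaking equation (10) is

$$E(V \mid U \in (a, b]; \theta) = 1 - \frac{1}{b - a} \int_0^1 \left[C(b, v \mid \theta) - C(a, v \mid \theta)\right] dv \tag{11}$$

for the Frank copula, with $C(\cdot, \cdot \mid \theta)$ the Frank copula, and

$$E(V \mid U \in (a, b]; \rho) = 1 - \frac{1}{b - a} \int_0^1 \left[C(b, v \mid \rho) - C(a, v \mid \rho)\right] dv \tag{12}$$

for the Gaussian copula, with $C(\cdot, \cdot \mid \rho)$ the Gaussian copula of the form (2).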
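The conditional expectation can be approximated numerically from the identity $E(V \mid U \in (a, b]) = 1 - \frac{1}{b - a}\int_0^1 [C(b, v) - C(a, v)]\,dv$. The sketch below does this for the Frank copula in its standard parameterisation, with a midpoint rule on [0, 1]. The value θ = −25.832 is the prior mode quoted in the text and is used here only as an illustrative parameter, not as a fitted estimate; the function names and the number of quadrature steps are assumptions of this example.

```python
import math


def frank_copula_cdf(u, v, theta):
    """Frank copula C(u, v | theta), theta != 0 (standard form)."""
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def cond_expectation(a, b, theta, steps=2000):
    """E(V | U in (a, b]) = 1 - (1/(b-a)) * int_0^1 [C(b,v) - C(a,v)] dv."""
    h = 1.0 / steps
    integral = 0.0
    for k in range(steps):
        v = (k + 0.5) * h                       # midpoint rule on [0, 1]
        integral += frank_copula_cdf(b, v, theta) - frank_copula_cdf(a, v, theta)
    return 1.0 - integral * h / (b - a)


theta = -25.832   # illustrative value only (prior mode quoted in the text)
bands = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
means = [cond_expectation(a, b, theta) for a, b in bands]
```

Two properties serve as sanity checks for such a computation: the four band means average to $E(V) = 0.5$, since the bands have equal probability, and for the radially symmetric Frank copula the means of the first and last bands add up to 1.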
As an illustration, we show in Figure 5 $E(V \mid U \in (a, b])$ (Eqs. (11) and (12)) coming from m = 4000 simulations of $\theta$ (or $\rho$) using the posterior distributions built from the Non-informative (NI) and Informative (I) settings, as described previously. Figure 5 shows the results for both models, the Frank and Gaussian copulas. Based on the results it is possible to see the small effect of the prior distribution on the reduction of the uncertainty regarding $E(V \mid U \in (a, b])$.
For the Frank copula we estimate equation (11) by means of the Bayesian estimator of $\theta$ under the quadratic loss function, say $\hat{\theta}_B$,

$$\hat{E}(V \mid U \in (a, b]) = 1 - \frac{1}{b - a} \int_0^1 \left[C(b, v \mid \hat{\theta}_B) - C(a, v \mid \hat{\theta}_B)\right] dv. \tag{13}$$

In the same way, for the Gaussian copula we estimate (12) by means of the Bayesian estimator of $\rho$ under the quadratic loss function, say $\hat{\rho}_B$,

$$\hat{E}(V \mid U \in (a, b]) = 1 - \frac{1}{b - a} \int_0^1 \left[C(b, v \mid \hat{\rho}_B) - C(a, v \mid \hat{\rho}_B)\right] dv. \tag{14}$$

The estimators given by equations (13) and (14) are evaluated on $\hat{\theta}_B$ (and $\hat{\rho}_B$) obtained from simulations (of size m = 4000) from the posterior distribution, for each combination of copula model and prior distribution. The results are in Table 3.
For each interval (a, b], the estimates are very close regardless of the copula (and prior distribution) used to compute the conditional expectation (see each line of Tab. 3). This shows that the conditional expectation is capable of neutralizing the effect visualized in Figure 5. As the data already suggested, though without the precise magnitude that we now have, life expectancy decreases as pct increases. The table also tells us by how much it decreases. The conditional means decrease at a nearly constant rate from one stratum to the next: for example, the difference between $\hat{E}_1(V \mid U \in (0, 0.25])$ and $\hat{E}_1(V \mid U \in (0.25, 0.5])$ is 0.24, between $\hat{E}_1(V \mid U \in (0.25, 0.5])$ and $\hat{E}_1(V \mid U \in (0.5, 0.75])$ it is 0.25, and between $\hat{E}_1(V \mid U \in (0.5, 0.75])$ and $\hat{E}_1(V \mid U \in (0.75, 1])$ it is 0.24.
The evidence in Table 3 could lead us to conclude that the curves (from Eqs. (11) to (12)) behave identically in each of the bands (a, b], except for a displacement at a rate of approximately 0.25, but this is not true. For instance, for the Frank copula (the best model according to Tab. 1) under the non-informative setting, the curves show a greater dispersion in the extreme intervals (0, 0.25] and (0.75, 1] than in the central intervals, as can be seen in Table 4, which presents the interquartile ranges of the conditional expectations. Furthermore, the 4 curves are quite different in terms of symmetry/asymmetry. For further details and information, see [8]. This finding calls for a deeper investigation within each of these bands (a, b], since other factors could explain the behaviour of these curves, such as purchasing power, access to health care, educational level, criminality level, access to clean water and correct disposal of sewage, etc.

Conclusions
Copula models are useful to describe the dependence between variables (see [1]) and, as has been done in this paper, they are tools for analyzing implications and impacts on social realities. In this study we have combined two powerful tools, copula models and Bayesian estimation. With such tools we have been able to inspect the relationship between (i) pct, the proportion of the black/brown population in relation to the total population of the neighborhood, and (ii) age, the average age at which people died in the neighborhood. By assuming different perspectives, through the copula models pointed out by the BIC, see [2], and the informative/non-informative settings, we can describe the relationship under different theoretical assumptions (see Tabs. 1 and 2). This diversity of scenarios leads to consistent estimates of life expectancy conditioned on pct percentile intervals (see Tab. 3), and it also offers ways to compare the results. The non-informative scenario and Frank's copula [5] are then established as the starting point for future inspections, against which future results could be compared, including results obtained after social events that may alter the relationship between pct and age.
We see that, for the specific database discussed here (as of 2018), the city of São Paulo shows life expectancies that fall with increasing pct. On average, life expectancy decreases at a constant rate as the interval (a, b] of pct moves upward, with $a, b \in [0, 1]$. In other words, since we have set 4 reference intervals for the proportions of pct, from (0, 0.25] to (0.25, 0.5] we have a fall in life expectancy of around 0.25 (on a scale from 0 to 1), which is repeated in the fall in life expectancy from (0.25, 0.5] to (0.5, 0.75] and also from (0.5, 0.75] to (0.75, 1]. Furthermore, when observing the life expectancy in each of these intervals (Figure 5), we verify that the curve behaves markedly differently depending on the interval, which leads us to other questions, such as which factors determine each specific behaviour. Those questions are outside our focus but are certainly very relevant for future studies.