# Bayesian estimation of the threshold of a generalised Pareto distribution for heavy-tailed observations


DOI: 10.1007/s11749-016-0501-7

- Cite this article as:
- Villa, C. TEST (2016). doi:10.1007/s11749-016-0501-7


## Abstract

In this paper, we discuss a method to define prior distributions for the threshold of a generalised Pareto distribution, in particular when its applications are directed to heavy-tailed data. We propose to assign prior probabilities to the order statistics of a given set of observations. In other words, we assume that the threshold coincides with one of the data points. We show two ways of defining a prior: by assigning equal mass to each order statistic, that is a uniform prior, and by considering the worth that every order statistic has in representing the true threshold. Both proposed priors represent a scenario of minimal information, and we study their adequacy through simulation exercises and by analysing two applications from insurance and finance.

### Keywords

Extreme values · Generalised Pareto distribution · Heavy tails · Kullback–Leibler divergence · Self-information loss

### Mathematics Subject Classification

Primary 62F15 · Secondary 62P05

## 1 Introduction

The purpose of this paper is to outline a novel Bayesian approach to estimate the threshold of a generalised Pareto distribution (GPD) by means of data dependent priors on the order statistics. The statistical model for the overall sample is a mixture model with two main components: a model for the non-extreme data below a certain threshold, also labelled as the bulk data, and the GPD to model the extreme values above the threshold. The component for the bulk data does not represent our main concern; therefore, we will be using a finite mixture of densities where the components will somehow reflect the nature of the phenomenon of interest (Do Nascimento et al. 2011), in particular a mixture of gamma densities if we are interested in positive data (e.g. insurance losses, river floods, rainfall), and a mixture of normal densities for data that can take both positive and negative values (e.g. financial returns). The second component of the overall model is a GPD where the threshold parameter \(\theta \), conceptually separating non-extreme from extreme observations, has an assigned uncertainty represented by a prior probability distribution. The details of the overall model will be discussed in Sect. 2.

The idea behind extreme value theory is that the main interest is in the tail (or tails) of a distribution. In areas such as finance, insurance, environmental sciences and engineering, the focus is often on observations that present a clear difference in value from the bulk data. Due to this extremal nature of some observations, a distribution that models the whole data would not be appropriate as the majority of observations used to estimate the parameters are non-extreme. It is then necessary to use an appropriate procedure that, whilst still allowing for a reasonable inference of the bulk data, permits a precise estimate of the main characteristics of the tail observations. Depending on the area of application, justifications of the adoption of extreme value distributions can be found, for example, in Fabozzi et al. (2010) for finance, Donnelly and Embrechts (2010) for insurance and actuarial science; or, to cover a wider range of applications, including environmental sciences and engineering, refer to Coles (2001), De Zea and Turkman (2003) and Smith (1984).

Let \(X\) be a random variable with distribution function \(F(x)\). Under some specific conditions (Pickands 1975), the distribution of the values of \(x\) above a certain value \(\theta \) can be approximated by a GPD with distribution function

$$G(x|\xi ,\sigma ,\theta ) = 1-\left[ 1+\frac{\xi (x-\theta )}{\sigma }\right] ^{-1/\xi },\qquad x>\theta ,$$

where \(\xi \) is the shape parameter and \(\sigma >0\) is the scale parameter; for \(\xi =0\), the distribution function takes the limiting form \(G(x|0,\sigma ,\theta )=1-\exp \{-(x-\theta )/\sigma \}\).
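As a quick numerical sketch (assuming SciPy is available; `genpareto`'s `c`, `loc` and `scale` arguments play the roles of \(\xi \), \(\theta \) and \(\sigma \)), the GPD distribution function can be coded directly and checked against the library implementation:

```python
import numpy as np
from scipy.stats import genpareto

def gpd_cdf(x, xi, sigma, theta):
    """GPD distribution function for x > theta (case xi != 0)."""
    z = (x - theta) / sigma
    return 1.0 - np.power(1.0 + xi * z, -1.0 / xi)

# illustrative parameter values (same as in the simulation study of Sect. 3)
xi, sigma, theta = 0.4, 2.0, 9.0
x = np.linspace(9.1, 30.0, 50)

ours = gpd_cdf(x, xi, sigma, theta)
scipys = genpareto.cdf(x, c=xi, loc=theta, scale=sigma)
# the two parameterisations agree up to floating-point error
```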

The idea of using order statistics to identify the threshold of a GPD is not new. In the context of a Bayesian predictive approach, De Zea et al. (2001) assign a discrete prior to the number of upper order statistics, that is, to the number of observations that could be classified as exceedances. What we propose here has a different flavour, as it assumes that the threshold corresponds to one of the observed data points. The detailed motivations for a discrete prior for the threshold of a GPD will be given in Sect. 2, when the overall statistical model is introduced. In short, if the data are modelled by a mixture of two components, one for the data below the threshold and one for the data above the threshold, then the assumption that the threshold coincides with one of the order statistics is sensible for two reasons: there is no evidence about the threshold value between any two consecutive order statistics, and the contribution of each exceedance to the GPD likelihood is maximised when the threshold lies on an order statistic. We propose two different criteria to define a prior distribution on the order statistics. The first assigns equal mass to each order statistic, and the second assigns a mass which depends on the worth that each order statistic, as the potential threshold, has in being part of the model (Villa and Walker 2015). Although the proposed priors tend to yield posterior distributions with similar frequentist properties in most scenarios, we discuss some situations where either one or the other is to be preferred. In addition, although for different reasons, both proposed prior distributions can be categorised as objective, as defined in Berger (2006), and are suitable to be employed in scenarios of minimal prior information.

The outline of the paper is as follows. In Sect. 2, we discuss the details of the mixture model for the whole data set and the prior distributions for the parameters. For the priors, we place the main focus on the prior distributions we propose for the threshold of the GPD. We then conduct simulation studies in Sect. 3 by first illustrating how the proposed priors for \(\theta \) apply to a single independent and identically distributed sample, and then by analysing the frequentist performances of the respective induced posterior distributions. Section 4 is dedicated to applying the defined model and the proposed prior distributions for the threshold to real data examples; in particular, we analyse the well-known data set of the Danish fire loss, and the daily increments of the NASDAQ-100 index over a period of more than seventeen years. Finally, the concluding discussion and remarks are presented in Sect. 5.

## 2 The model and the priors

### 2.1 The mixture model

An example of the first type considers a gamma distribution for the bulk data (Behrens et al. 2004). Of course, other parametric distributions can be considered, such as the normal, the lognormal or the Weibull, so as to reflect the nature of the data. The main drawback of parametric bulk models is their lack of flexibility, resulting in a difficult identification of the threshold, except when the processes generating the bulk data and the extreme data are well discernible (Scarrott and MacDonald 2012; Behrens et al. 2004). To overcome this difficulty, semiparametric bulk models have been proposed. Cabras and Castellanos (2011) propose a spliced model for the bulk data, while Do Nascimento et al. (2011) discuss a finite mixture of gamma densities. Examples of a nonparametric approach for the bulk data can be found in Tancredi et al. (2006), MacDonald et al. (2011) and Fúquene Patiño (2015). All the above references concern Bayesian approaches to the GPD. Recent publications discussing different approaches that are worth mentioning include Northrop and Coleman (2014) and Wadsworth and Tawn (2012), among others.

The focus of this paper is on the determination of the threshold \(\theta \), and we represent the bulk data with a finite mixture of distributions (Do Nascimento et al. 2011). As discussed by the authors, the approach allows for appropriate adaptation and, therefore, flexibility of the overall mixture model. For a more detailed discussion of this specific type of models for the bulk data and, in general, about semiparametric models, we refer to Scarrott and MacDonald (2012) and Do Nascimento et al. (2011).

It is important to highlight here that the GPD suffers from identifiability issues as the scale parameter \(\sigma \) and the threshold \(\theta \) are related. In fact, if we consider \(Y=X-\theta _0\sim G(\cdot |\xi ,\sigma _0,\theta _0)\), then \(Y-\theta |Y>\theta \sim G(\cdot |\xi ,\sigma ,\theta )\), with \(\sigma =\sigma _0+\xi (\theta -\theta _0)\). Part of the issue is mitigated by the choice of the model in (2) because, through *H*, the threshold \(\theta \) represents a cutting point separating the model for the bulk data from the model for the extreme data. See, for example, Cabras and Castellanos (2011) and Do Nascimento et al. (2011). For relatively large data sets, the identifiability issue is further reduced by the amount of information about the parameters included in the data, as one would expect.
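The relation \(\sigma =\sigma _0+\xi (\theta -\theta _0)\) can be verified numerically: conditioning a GPD on exceeding a higher threshold yields another GPD with the same shape and a rescaled \(\sigma \). A minimal sketch, assuming SciPy's `genpareto` (parameter values are illustrative):

```python
import numpy as np
from scipy.stats import genpareto

xi, sigma0, theta0 = 0.4, 2.0, 0.0
theta = 3.0                              # a higher threshold
sigma = sigma0 + xi * (theta - theta0)   # adjusted scale from the identity

base = genpareto(c=xi, loc=theta0, scale=sigma0)
x = np.linspace(theta + 0.1, theta + 20.0, 100)

# conditional CDF of X given X > theta, under the original GPD
cond = (base.cdf(x) - base.cdf(theta)) / (1.0 - base.cdf(theta))
# GPD with the same shape, threshold theta and the adjusted scale sigma
shifted = genpareto.cdf(x, c=xi, loc=theta, scale=sigma)
# the two curves coincide, confirming the threshold-stability property
```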

### 2.2 The prior for the threshold

The main contribution of this work is the prior for the threshold \(\theta \) of a GPD. We propose to assign a prior probability to the observed order statistics by assuming that \(\theta =x^{(k)}\), where in general *k* can take any value in \(\{1,\ldots ,n\}\). The proposed priors are therefore discrete and data dependent. Before outlining how a prior on the order statistics can be defined, we need to fully motivate the choice of a distribution whose support is limited to the order statistics.

Here, \(m\) is the number of parameters of the model for the bulk component (i.e. the dimension of \(\gamma \)). In this case, the overall prior will be

It is important to remark that the component \(h(\cdot |\gamma )\) of the model is not considered in the construction of the prior for the threshold. The main reason is that, with the type of problems considered in this paper, the focus is on the tail of the model, i.e. on the extreme values. In addition, the mixture model approach discussed here is designed in such a way that the mixture distribution for the bulk data is included for convenience only, and little consideration is given to its actual fit to the data. As such, in order to use the prior information for \(\theta \) in a relatively sharp way, it seems more appropriate to use the information from the order statistics above the threshold only, leaving the information coming from the bulk data to contribute through the likelihood function.

The worth is measured by applying a result in Berk (1966), which states that, if a model is misspecified, i.e. if \(x^{(k)}\) is the true threshold and it is removed, then the posterior distribution asymptotically accumulates at the order statistic \(x^{(k^\prime )}\) for which the Kullback–Leibler divergence (Kullback and Leibler 1951) \(D_{KL}(g(\cdot |\xi ,\sigma ,x^{(k)})\Vert g(\cdot |\xi ,\sigma ,x^{(k^\prime )}))\) is minimised. That is, if the true model is removed, the estimation process will asymptotically indicate as the correct model the nearest one in terms of the Kullback–Leibler divergence; viz., the model most similar to the true one (Bernardo and Smith 1994). To link the worth of each order statistic to the prior probability, we use the self-information loss function. This type of loss function assigns a loss to a probability statement: if we have defined a prior \(\pi (x^{(k)}|\xi ,\sigma )\), its form is \(-\log \pi (x^{(k)}|\xi ,\sigma )\). More information about the self-information loss function can be found, for example, in Merhav and Feder (1998). To formally derive the prior for the threshold, we proceed in terms of utilities, instead of losses; this approach allows for a clearer exposition and does not impact the logic behind the prior derivation. Let us then write the utility \(u_1(x^{(k)})=\log \pi (x^{(k)})\) where, to simplify the notation, we have dropped the parameters \(\xi \) and \(\sigma \). We then let the minimum divergence from \(x^{(k)}\) be represented by the utility \(u_2(x^{(k)})\). We want \(u_1(x^{(k)})\) and \(u_2(x^{(k)})\) to be matching utility functions, as they measure the same utility in \(x^{(k)}\); though, as it stands, \(-\infty <u_1\le 0\) and \(0\le u_2<\infty \), and we want \(u_1=-\infty \) when \(u_2=0\). The scales are matched by taking the exponential transformation, so that \(\exp (u_1)\) and \(\exp (u_2)-1\) are on the same scale. Hence, we have

*c*, the nearest GPD to \(g(x|\xi ,\sigma ,x^{(k)})\) is either \(g(x|\xi ,\sigma ,x^{(k-1)})\) or \(g(x|\xi ,\sigma ,x^{(k+1)})\). However, given that \(g(x|\xi ,\sigma ,x^{(k+1)})\) is zero for \(x\in (x^{(k)},x^{(k+1)})\), resulting in an infinite divergence, the prior is
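A hedged sketch of how the loss-based prior could be computed in practice is given below: for each order statistic, the Kullback–Leibler divergence to the GPD indexed by the preceding order statistic is evaluated by numerical integration, and the prior mass is set proportional to \(\exp (u_2)-1\). The sample, the parameter values and all function names are illustrative assumptions, not the paper's code:

```python
import numpy as np
from scipy.stats import genpareto
from scipy.integrate import quad

def kl_gpd_thresholds(xi, sigma, t_k, t_km1):
    """KL divergence between two GPDs differing only in threshold (t_k > t_km1)."""
    g_k = genpareto(c=xi, loc=t_k, scale=sigma)
    g_km1 = genpareto(c=xi, loc=t_km1, scale=sigma)
    integrand = lambda x: g_k.pdf(x) * (g_k.logpdf(x) - g_km1.logpdf(x))
    val, _ = quad(integrand, t_k, np.inf)
    return val

rng = np.random.default_rng(1)
x_sorted = np.sort(rng.gamma(4.0, 0.5, size=50))   # toy "observations"
xi, sigma = 0.4, 2.0

# worth of each candidate threshold x^(k), k = 2..n (k = 1 has no left neighbour)
div = np.array([kl_gpd_thresholds(xi, sigma, x_sorted[k], x_sorted[k - 1])
                for k in range(1, len(x_sorted))])
weights = np.expm1(div)            # exp(u2) - 1, on the matched utility scale
prior = weights / weights.sum()    # normalised discrete prior over x^(2..n)
```

Note that, consistently with the discussion above, an order statistic far from its left neighbour (a more "extreme" gap) receives a larger divergence and hence more prior mass.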

In considering the qualitative behaviour of the prior distribution based on losses, we also need to take into account the case where two or more observations share the same value. Although it is possible, and perhaps advisable, to assume that the observations are all distinct, since ties may lead to conceptual issues in the definition of the posterior distribution (Fernandez and Steel 1998), it is easy to see how the proposed prior would behave in this scenario. If we have two (or more) order statistics with the same value, say \(x^{(j)}=x^{(j+1)}\), then, by the way the prior is constructed, the mass on \(x^{(j+1)}\) would be zero, but the mass on \(x^{(j)}\) would be strictly positive, provided \(x^{(j)}\ne x^{(j-1)}\). As such, the prior based on losses maintains the idea of assigning mass on the basis of how “extreme” a value is, even when there are repeated observations.

Given that the prior distributions proposed are data dependent, it is appropriate to briefly discuss the implications of such a choice. A definition of data-dependent prior can be found in Wasserman (2000), who identifies it as a measurable mapping from the data space to the set of priors; in other words, a distribution that depends on the data through overt use of the observations. This can be accomplished in different ways (and at different levels of depth), but probably the most common type of data-dependent prior is the data-analytic prior, where the data are used to determine the hyperparameter(s) of the prior distribution. Examples can be found in Morris (1983), Berger (1985), Carlin and Gelfand (1990) and Czado et al. (2005). Data-analytic priors can also be used to choose the base measure and the precision of a Dirichlet process in Bayesian nonparametrics (MacEachern 1998; MacAuliffe et al. 2006). Finally, Wasserman (2000) and Raftery (1996) discuss data-analytic priors for finite mixtures of normal densities.

Although data-dependent priors are used in practical situations, criticisms have been raised. Possibly the most important concerns are that the data are used twice, for the prior and for the likelihood, and that Bayes' theorem can only be approximated. An interesting discussion of the first objection can be found in Gelman et al. (2014), for example, while the second objection is discussed, for example, in Deely and Lindley (1981). We do not present here a detailed discussion of how the above objections can be rebutted or overcome; such a discussion can be found, for example, in the work of Darnieder (2011) and the references therein. Obviously, using the order statistics to determine the parameter space of the threshold categorises our priors under Wasserman’s definition of a data-dependent prior. In the case of the uniform prior, the information drawn from the data is limited to the possible location of the threshold and, as discussed above, the choice is sensible as it yields optimal contribution of the excesses to the likelihood. For the prior based on the Kullback–Leibler divergence, the information drawn from the data goes beyond the possible location of the threshold, as it considers the similarity (or diversity) between consecutive models.

To conclude, we deem it appropriate to point out that information from the data (besides that in the likelihood function) has always been considered in the inferential process for the threshold. This is obvious when we consider graphical approaches (Coles 2001), where data are plotted to determine a possible location of the threshold. When it comes to Bayesian analysis, the priors proposed in the literature which claim to carry minimal information draw some of this information from the data. For example, the continuous uniform prior proposed in Cabras and Castellanos (2011) has a parameter space bounded by order statistics. The normal prior proposed by Behrens et al. (2004), and claimed to be set up in a noninformative fashion by Do Nascimento et al. (2011), has to be centred on the \(90~\%\) data quantile to avoid identifiability issues when the sample size is not sufficiently large.

### 2.3 The priors for \((\xi ,\sigma )\)

### 2.4 The prior for \(\gamma \)

As discussed above, the bulk data are represented by a mixture of \(r\) gamma densities. With the parametrisation in (11), we assign an inverse gamma prior to each mean \(\alpha \) and a gamma prior to each shape parameter \(\beta \). Although the above priors are not selected through an objective method, they represent minimal prior information in the form of a large variance. Finally, for the weights \(\omega _1,\ldots ,\omega _r\), we choose a symmetric Dirichlet prior distribution with all the parameters equal to one: \((\omega _1,\ldots ,\omega _r)\sim \mathrm{Dir}(1,\ldots ,1)\). This choice, too, represents minimal prior information, and gives \(\pi (\omega _1,\ldots ,\omega _r)\propto 1\).
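As an illustration, one joint draw from the priors just described might look as follows; the inverse-gamma and gamma hyperparameter values are our own assumptions, chosen only to give large prior variance:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3                                   # assumed number of gamma components

# inverse-gamma prior on each mean alpha_j: reciprocal of a gamma draw
# (illustrative shape 2, rate 0.5, i.e. scale 2 for the underlying gamma)
alpha = 1.0 / rng.gamma(shape=2.0, scale=2.0, size=r)

# gamma prior on each shape parameter beta_j (illustrative, large variance)
beta = rng.gamma(shape=0.1, scale=10.0, size=r)

# symmetric Dirichlet(1, ..., 1) prior on the weights: uniform on the simplex
omega = rng.dirichlet(np.ones(r))
```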

## 3 Analysis of the posterior distribution for the threshold

To analyse and compare the proposed discrete priors for the threshold of the GPD, we perform two types of simulation. In the first, we detail the inferential procedure for all the parameters of the mixture on the basis of a random sample from a known model. The second part consists of a simulation study that aims to assess the frequentist performance of the posterior distributions induced by the proposed priors. This is done by repeatedly sampling from mixture models that differ in the GPD component only (i.e. threshold, shape and scale parameters) and observing the coverage and the mean squared errors of the posterior distributions for the threshold. Given the minimally informative nature of the proposed priors, the analysis of the frequentist properties is a suitable way to compare them and assess their effectiveness.

### 3.1 Simulation from a single i.i.d. sample

To illustrate in detail the entire inferential procedure, we have sampled \(n=1000\) observations from a mixture model as in (3). The bulk data component is a mixture of two gamma densities with shape parameters \(a_1=4\) and \(a_2=8\), and rate parameters \(b_1=2\) and \(b_2=8\). The weights of the gamma densities are, respectively, \(\omega _1=2/3\) and \(\omega _2=1/3\). The extreme data component is a GPD with shape parameter \(\xi =0.4\) and scale parameter \(\sigma =2\), and the threshold has been put at the \(90~\%\) data quantile, with \(\theta =9\).
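A minimal sketch of this data-generating mechanism (with the tail fraction and the rejection step for the bulk draws as our own assumptions) is:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(42)
n, p_tail = 1000, 0.10      # sample size; threshold at roughly the 90% quantile
theta = 9.0

n_tail = rng.binomial(n, p_tail)
n_bulk = n - n_tail

# bulk: mixture of Gamma(4, rate 2) and Gamma(8, rate 8), weights 2/3 and 1/3,
# kept below the threshold by rejection (excess draws are discarded)
comp = rng.choice([0, 1], size=5 * n_bulk, p=[2 / 3, 1 / 3])
draws = np.where(comp == 0,
                 rng.gamma(4.0, 1 / 2.0, size=comp.size),
                 rng.gamma(8.0, 1 / 8.0, size=comp.size))
bulk = draws[draws < theta][:n_bulk]

# tail: GPD(xi = 0.4, sigma = 2) exceedances above theta
tail = genpareto.rvs(c=0.4, loc=theta, scale=2.0, size=n_tail, random_state=rng)

sample = np.concatenate([bulk, tail])
```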

To estimate the number of components of the mixture for the bulk data (*r*), one could proceed as suggested in Do Nascimento et al. (2011), where models with different values of *r* are estimated and suitable indexes, such as the deviance information criterion (DIC) and the Bayesian information criterion (BIC), are computed to choose the “best” model on the basis of the observed sample. Alternatively, one could consider a hierarchical structure and assign a prior to *r* to represent the uncertainty on its true value; for this approach see, for example, Mengersen et al. (2011). We have already mentioned that the focus of this work is on the prior for \(x^{(k)}\); therefore, we will not further investigate this matter, and we simply show that the posterior distributions for the weights are different from zero only for \(r=2\).

Statistics of the posterior distributions under the prior based on losses (KL prior) and under the uniform prior

| Parameter | True value | KL mean | KL median | KL \(95~\%\) CI | Uniform mean | Uniform median | Uniform \(95~\%\) CI |
|---|---|---|---|---|---|---|---|
| \(a_1\) | 4 | 4.00 | 3.96 | (3.41, 4.82) | 3.82 | 3.82 | (3.32, 4.32) |
| \(a_2\) | 8 | 8.80 | 8.78 | (7.14, 10.53) | 8.85 | 8.84 | (7.02, 10.37) |
| \(b_1\) | 2 | 1.99 | 1.99 | (1.82, 2.15) | 2.05 | 2.06 | (1.92, 2.18) |
| \(b_2\) | 8 | 7.95 | 8.04 | (6.44, 8.67) | 8.09 | 8.09 | (7.60, 8.54) |
| \(\omega _1\) | 2/3 | 0.67 | 0.68 | (0.52, 0.73) | 0.70 | 0.70 | (0.66, 0.74) |
| \(\omega _2\) | 1/3 | 0.33 | 0.32 | (0.27, 0.48) | 0.30 | 0.30 | (0.26, 0.34) |
| \(\theta \) | 9 | 9.02 | 9.02 | (8.95, 9.08) | 8.63 | 8.67 | (8.82, 9.02) |
| \(\xi \) | 0.4 | 0.46 | 0.45 | (0.21, 0.78) | 0.37 | 0.35 | (0.11, 0.67) |
| \(\sigma \) | 2 | 1.98 | 1.97 | (1.71, 2.26) | 1.98 | 1.98 | (1.71, 2.26) |

The Monte Carlo procedure consists of a Metropolis-within-Gibbs sampler of 20,000 iterations, with the first 10,000 iterations discarded as burn-in. Convergence of the posterior has been assessed by several means, including monitoring the chains and the running means, and computing the Gelman and Rubin convergence diagnostic (Gelman and Rubin 1992). The MCMC algorithm consists of Metropolis–Hastings proposals for each parameter of the model, as the full conditionals cannot be sampled directly. The three parameters of the GPD have been sampled separately, in the order \(\xi \), \(\sigma \) and \(\theta \). For the mixture, we have performed the sampling in two groups: first the parameters of the components and then the weights of the components.
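The update of \(\theta \) over the order statistics can be sketched as a discrete random-walk Metropolis step; here `log_post` is a toy stand-in for the full conditional of the threshold, so the whole example is illustrative rather than the paper's sampler:

```python
import numpy as np

def update_theta_index(k, x_sorted, log_post, rng):
    """One Metropolis step for the index k of the order statistic acting as threshold."""
    k_prop = k + rng.choice([-1, 1])          # symmetric random-walk proposal
    if k_prop < 0 or k_prop >= len(x_sorted):
        return k                              # proposals off the support are rejected
    log_ratio = log_post(x_sorted[k_prop]) - log_post(x_sorted[k])
    if np.log(rng.uniform()) < log_ratio:
        return k_prop
    return k

# toy check with a log-posterior sharply peaked at x_sorted[90]
rng = np.random.default_rng(3)
x_sorted = np.sort(rng.gamma(4.0, 0.5, size=100))
target = x_sorted[90]
log_post = lambda t: -200.0 * (t - target) ** 2   # stand-in, not the real model
k, visits = 10, []
for _ in range(5000):
    k = update_theta_index(k, x_sorted, log_post, rng)
    visits.append(k)
# after burn-in, the chain concentrates around index 90
```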

### 3.2 Frequentist performances of the yielded posterior distributions

To analyse the MSE from the mean, let us first consider the case \(\theta =7\). As one would expect, the MSE is smaller for larger sample sizes. It appears that there is larger variability in its estimate for different values of \(\xi \) when \(n=1000\) than when \(n=5000\). A similar behaviour can be seen when the threshold is set equal to 9. In addition, when we compare the MSE for \(\theta =7\) and \(\theta =9\), we note that its value is higher in the second case. Given that the rest of the mixture model is kept unchanged, a higher threshold implies that less data are included in the GPD part of the likelihood and, therefore, there is less information to estimate the parameters. When comparing the two discrete priors for the threshold, the overall frequentist properties appear reasonably similar, especially for larger sample sizes. For the smaller sample size \(n=1000\), we note different performances in the lower end of the parameter space of \(\xi \) when the threshold is set to 7. With a threshold of the GPD equal to 9, it appears that the uniform prior outperforms the prior based on losses for low values of \(\xi \), but is outperformed as the shape parameter grows. In any case, the differences observed appear to be modest.
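The frequentist summary used in this study can be sketched as follows; the posterior draws are a synthetic stand-in (a normal centred at a noisy estimate of the true threshold), not the output of the real MCMC:

```python
import numpy as np

rng = np.random.default_rng(7)
theta_true, n_rep = 7.0, 200
covered, sq_err = 0, 0.0

for _ in range(n_rep):
    # stand-in posterior: centred at a noisy estimate of the true threshold
    centre = rng.normal(theta_true, 0.2)
    post_draws = rng.normal(centre, 0.2, size=2000)
    # 95% credible interval from the posterior draws
    lo, hi = np.quantile(post_draws, [0.025, 0.975])
    covered += int(lo <= theta_true <= hi)
    # squared error of the posterior mean
    sq_err += (post_draws.mean() - theta_true) ** 2

coverage = covered / n_rep   # should be close to the nominal 0.95
mse = sq_err / n_rep
```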

## 4 Real data modelling

In this section, we show the application of the proposed discrete prior distributions for the threshold of the GPD. The first example is an application from insurance and we analyse the popular data set of losses due to fires in Denmark over a decade. In the second example, we analyse financial data (NASDAQ-100 returns) and we show that the proposed priors, and the overall model, allow for the information about the threshold that is contained in the bulk data to be taken into account.

### 4.1 An application from insurance

In the first application of the proposed discrete priors for the GPD threshold, we analyse the popular Danish fire loss data. This data set has been extensively analysed in the literature, including McNeil (1997), Frigessi et al. (2002) and Cabras and Castellanos (2011), and it reports 2167 insurance losses deriving from as many industrial fires that occurred in Denmark over the period 1980–1990. The losses are valued in millions of Danish kroner (DKK), adjusted to 1985 values for comparison purposes. Figure 6 shows the data in chronological order, say \(y\), where it is possible to see that the majority of observations are grouped below the value of DKK 25 million, with increasing sparsity as the loss amount becomes more extreme. As such, it appears appropriate to model the quantity by a mixture model with a bulk data component and a GPD to represent the heavy-tailed behaviour.

Summary statistics of the posterior distributions for the mixture model for the Danish fire loss dataset

| Parameter | KL mean | KL median | KL \(95~\%\) CI | Uniform mean | Uniform median | Uniform \(95~\%\) CI |
|---|---|---|---|---|---|---|
| \(\alpha _1\) | 33.83 | 34.06 | (31.78, 36.04) | 32.83 | 32.62 | (29.04, 36.20) |
| \(\alpha _2\) | 14.62 | 14.42 | (12.43, 17.94) | 15.89 | 15.83 | (13.06, 19.00) |
| \(\alpha _3\) | 4.92 | 4.85 | (3.82, 6.56) | 6.81 | 6.56 | (4.70, 7.73) |
| \(\beta _1\) | 1.31 | 1.31 | (1.28, 1.34) | 1.31 | 1.31 | (1.27, 1.34) |
| \(\beta _2\) | 2.03 | 2.02 | (1.92, 2.15) | 2.00 | 1.97 | (1.84, 2.11) |
| \(\beta _3\) | 5.00 | 5.00 | (4.62, 5.40) | 4.54 | 4.53 | (4.16, 5.46) |
| \(\omega _1\) | 0.38 | 0.38 | (0.33, 0.43) | 0.38 | 0.38 | (0.31, 0.43) |
| \(\omega _2\) | 0.34 | 0.34 | (0.29, 0.40) | 0.33 | 0.33 | (0.28, 0.39) |
| \(\omega _3\) | 0.28 | 0.28 | (0.24, 0.32) | 0.29 | 0.29 | (0.25, 0.34) |
| \(\theta \) | 5.79 | 5.79 | (4.93, 7.54) | 5.16 | 4.45 | (4.08, 7.99) |
| \(\xi \) | 0.53 | 0.52 | (0.32, 0.78) | 0.64 | 0.65 | (0.37, 0.91) |
| \(\sigma \) | 5.20 | 5.18 | (4.04, 6.60) | 4.02 | 3.23 | (2.32, 6.54) |

The inferential results are detailed in Table 2. We compare the estimates of the GPD parameters with the values obtained by Cabras and Castellanos (2011), which are 5.30 for the threshold, 0.58 for the shape parameter and 5.92 for the scale parameter. The results obtained using both discrete priors on the order statistics appear to be in accordance with the values estimated by Cabras and Castellanos (2011). There is an exception in the estimate of \(\sigma \) when the uniform prior on the order statistics is employed; however, the estimated value is well within the credible interval. If we compare the estimates obtained using the two proposed priors, we note strong agreement in the results concerning the bulk part of the model. Besides a slightly larger credible interval for the estimate of the scale parameter under the uniform prior, there are no notably different estimates to be highlighted.

### 4.2 An application from finance

The histogram of the daily increments (Fig. 8) shows a heavy-tailed behaviour of the data, suggesting the use of a GPD to model the extreme observations. To get a feel for the prior distribution based on losses defined on the order statistics, we proceed as in the previous example. We set the parameters to the values estimated by Behrens et al. (2004): \(\xi =0.157\) and \(\sigma =0.974\). Figure 9 shows the prior on all the order statistics (left graph) and the prior on the upper order statistics only (right graph).

Statistics of the posterior distributions (the \(95~\%\) credible intervals are in brackets) for the parameters of the GPD for the NASDAQ-100 data analysis

| Parameter | Gamma mixture, KL prior | Gamma mixture, uniform prior | Single gamma, KL prior | Single gamma, uniform prior | Behrens et al., uniform prior |
|---|---|---|---|---|---|
| \(x^{(k)}\) | 1.13 (0.93, 1.18) | 1.08 (0.88, 1.11) | 0.93 (0.91, 0.94) | 0.93 (0.89, 0.96) | 0.96 (0.79, 1.13) |
| \(\xi \) | 0.12 (0.01, 0.22) | 0.13 (0.01, 0.26) | 0.15 (0.08, 0.21) | 0.15 (0.09, 0.21) | 0.16 (0.09, 0.23) |
| \(\sigma \) | 1.07 (0.91, 3.12) | 1.01 (0.84, 3.05) | 0.98 (0.90, 1.06) | 0.98 (0.90, 1.06) | 0.97 (0.86, 1.08) |

Table 3 details the estimates of the GPD component. The model for the whole set of observations is always a mixture model, and we have estimated that the number of gamma densities of the mixture for the bulk data is \(r=2\); however, we have considered both the case where the bulk data are modelled by a mixture of gamma densities and the case where a single gamma distribution models the bulk data. In both cases, we have estimated the GPD parameters under both proposed discrete prior distributions. All the results are compared with the estimates in Behrens et al. (2004) (last column to the right). From Table 3 we see that, when we consider the model with a single gamma distribution for the bulk data, the results obtained by applying the uniform prior and the prior based on losses for the threshold are consistent with the estimates in Behrens et al. (2004). When we compare the estimates of the threshold under the gamma mixture for the bulk data with those under the single gamma model, we note some differences. They appear to be reasonable, especially considering the size of the credible intervals. The fact that the differences are not large is most likely due to the large size of the sample. Nonetheless, these discrepancies show that different models for the bulk data impact the threshold value, as expected.

## 5 Discussion

There are many processes which present heavy-tailed behaviour and, in these cases, it is not always advisable to represent the whole data by means of a parametric distribution, such as the Student-*t*, or by a mixture model where the components are densities of the same family, such as a mixture of gamma densities (Venturini et al. 2008; Del Castillo et al. 2012) or normal densities.

One way to address the above problem is to consider the asymptotic result of Pickands' theorem, by which the tail of a distribution, above a certain threshold, can be represented by a GPD. However, this method raises another problem, namely the determination of an appropriate threshold. In a Bayesian setup, the idea is to represent the uncertainty about the threshold by a prior distribution.

In this paper, we present a way of defining prior distributions for the threshold of a GPD whose support is the set of observed order statistics. We propose two different methods to determine the prior. The first is intuitive, in the sense that every order statistic has, a priori, the same probability of being the true threshold value. The second method takes into consideration the loss that we would incur if a given order statistic were the true value of the threshold and it was removed from the set of possible values. Through simulation and real data analysis, we have shown that the two priors have very similar performances in most cases, when compared on the basis of the frequentist properties of the posteriors they yield and the estimates they generate. Given that the idea behind these priors is to represent a condition of minimal prior information, the fact that both priors converge to similar results is comforting. An exception occurs when the sample size is small, i.e. \(n=100\). Although this scenario lies outside the usual range of applications for a GPD, it allows us to show that the prior based on losses gives better performance than the uniform prior for small values of \(\xi \). Besides the above exception, it is possible to find other reasons to prefer one over the other. The uniform prior has the undoubted advantage of being easier to code and, although minimally, it allows for faster simulations. However, one has to be careful in assuming that uniformity represents no prior information (Bernardo and Smith 1994). Thus, although the results obtained by applying the two discrete priors for the threshold are similar for plausible sample sizes, we believe that the prior based on losses is to be preferred on the basis of the following considerations. First, it has a “meaning”: the mass assigned to each order statistic derives from a sound consideration of the worth that each one of them has in representing the potential threshold for the GPD.
On the other hand, and this connects to the second reason, by assigning a uniform prior to \(x^{(k)}\) one assumes that each order statistic has the same chance of being the threshold. Apart from the fact that this could be interpreted as an informative assumption, it conflicts with the idea behind the GPD, for which the threshold has to be sufficiently large to avoid model bias, as discussed in Sect. 1; this is not compatible with a uniform prior, where the lower order statistics have the same a priori probability of being the threshold as the upper order statistics. In conclusion, when one aims for objectivity in applied statistics problems, a noninformative prior has to be based on solid motivations, not only on performance. One exception to the above argument is the case \(\xi <0\) (which, however, is not within the scope of this paper). In fact, given that the parameter space for the threshold depends on the values of \(\xi \) and \(\sigma \), the prior based on losses would not be defined, as the Kullback–Leibler divergence between different GPDs is infinite. Thus, in the case where we model a light tail of a distribution by a GPD, the uniform prior would be the only choice between the two proposed discrete prior distributions.

As a final remark, we would like to highlight that, although the focus of the paper has been on observations that can take positive values only, the overall approach can easily be extended to quantities over the whole real line. For example, if we consider the logarithmic returns of some financial index (or price), the part of the data below the threshold could be modelled by a finite mixture of normal densities. Similarly, if we were interested in analysing the negative returns (which is common practice in risk management, for example), the prior would be defined over the negative order statistics, and the bulk data would be represented by the observations above the threshold. Another possible development of the model and the inferential procedure is the case where both tails of the distribution are of interest and, therefore, each is represented by a separate, but not necessarily independent, GPD. In all the above cases, the support of the prior has to be defined so as to reflect the nature of the problem.

## Acknowledgments

The author would like to thank the Editor-in-Chief and two anonymous reviewers for comments and criticism on earlier versions of the paper. The author is also very thankful to Professor Philip Brown for his valuable feedback and comments during the drafting of the paper.

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.