Risk of fraud classification

Jesús Enrique García; Verónica Andrea González-López; Hugo Helito da Silva; Thainá Soares Silva

doi:10.1051/fopen/2020010

4open, 3 (2020) 9

Open Access

Issue		4open Volume 3, 2020


Article Number		9
Number of page(s)		11
Section		Mathematics - Applied Mathematics
DOI		https://doi.org/10.1051/fopen/2020010
Published online		21 August 2020

Research Article

Jesús Enrique García¹, Verónica Andrea González-López¹, Hugo Helito da Silva² and Thainá Soares Silva¹^*

¹ Department of Statistics, University of Campinas, Sergio Buarque de Holanda, 651, CEP: 13083-859, Campinas, SP, Brazil
² CPFL, Rod. Eng. Miguel Noel Nascentes Burnier, 1755 – Chácara Primavera, CEP: 13088-900, Campinas, SP, Brazil

^* Corresponding author: thainass@outlook.com

Received: 17 March 2020
Accepted: 19 July 2020

Abstract

In this article, we define profiles of electricity consumers who commit fraud. We also compare these profiles with the profiles of users not classified as fraudsters in order to determine which of these clients should receive an inspection. We present a statistically consistent method to classify clients/users as fraudsters or not, according to the profiles of previously identified fraudsters. We show that it is possible to use several characteristics to inspect the classification of fraud; those aspects are represented by the coding performed on the observed series of clients/users. In this way, several encodings can be used, and the client risk can be constructed to integrate complementary aspects. We show that the classification method has success rates that exceed 77%, which supports confidence in the methodology.

Key words: Bayesian Information Criterion / Partition Markov Models / Metric in Markov Processes

© J.E. García et al., Published by EDP Sciences, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

This article is oriented to the solution of a real problem through techniques from stochastic processes. Institutions/companies collect information from users/customers to determine their profiles on consumption practices, preferences, and socio-economic features, among other aspects. That is, in general terms, they seek to establish behavioral profiles. This knowledge can facilitate the placement of products or the rapid adaptation of an institution to meet the needs of its users. The coding of the information allows defining these profiles, which constitute representations of the behavior. Such representations provide information to institutions and companies to form teams that can dedicate themselves to optimizing the relationship with these groups characterized by specific profiles. Those profiles are defined using the knowledge about the performance of certain sequences (user history coding). The problem of determining groups and profiles can be approached with tools from discrete stochastic processes, an area that offers powerful methods for it, see [1], [2] and [3]. The sequences resulting from the coding of user/customer data can be identified as samples coming from discrete stochastic processes. In this article, we develop a method to classify sequences according to $k_I$ previously determined profiles. Then the $k_I$ profiles are compared with the performance of other unclassified sequences (or groups of sequences). For this purpose, two tools are needed: (A) a tool that is capable of (i) discriminating between processes by samples from them and (ii) determining whether the processes represented by their samples follow the same stochastic law; (B) a tool that allows drawing a stochastic profile of the behavior of a process based on a series of sequences (group) that are judged to come from the same process. With these tools, in this article, we address the problem of classifying clients as fraudsters. We employ a set of real data on electricity consumption. The proposal is to attribute to each classified client a risk related to the similarity that its consumption series shows with some group of fraudulent clients, identified through (A)–(B).

This article is organized as follows. Section 2 addresses the theoretical foundations and classification strategy. Section 3 describes the data, the coding, and the calculation of the risk to customers. Also, in this section, the notion of fraud customers and the groups found in the database are discussed. The conclusions and considerations are given in Section 4.

Theoretical background

We begin this section by introducing the notation used in the formalization of stochastic tools. Let $(Z_t)_t$ be a discrete time Markov chain of order $o$ ($o < \infty$) with finite alphabet $A$. Let us call $\mathcal{S} = A^o$ the state space and denote the string $a_m a_{m+1} \dots a_n$ by $a_m^n$, where $a_i \in A$, $m \le i \le n$. For each $a \in A$ and $s \in \mathcal{S}$, define the transition probability $P(a|s) = \mathrm{Prob}(Z_t = a \,|\, Z_{t-o}^{t-1} = s)$. In a given sample $z_1^n$ coming from the stochastic process, the number of occurrences of $s$ is denoted by $N_n(s)$ and the number of occurrences of $s$ followed by $a$ is denoted by $N_n(s, a)$. In this way, $\frac{N_n(s, a)}{N_n(s)}$ is the maximum likelihood estimator of $P(a|s)$. Consider now two Markov chains $(Z_{1,t})_t$ and $(Z_{2,t})_t$ of order $o$, on the finite alphabet $A$ with state space $\mathcal{S}$. Given $s \in \mathcal{S}$, denote by $\{P(a|s)\}_{a \in A}$ and $\{Q(a|s)\}_{a \in A}$ the sets of transition probabilities of $(Z_{1,t})_t$ and $(Z_{2,t})_t$, respectively. Consider now the local metric $d_s$ introduced by [1]; note that $d_s$ is a metric in $\mathcal{S}$ (non-negative, symmetric and satisfying the triangle inequality) and it allows defining a global notion (in $\mathcal{S}$) of similarity between sequences.
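The counts $N_n(s)$, $N_n(s, a)$ and the resulting estimator can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `transition_counts` and `mle` are hypothetical, and the sketch assumes that $N_n(s)$ counts only the occurrences of $s$ that are followed by a symbol, so that the estimated probabilities add up to one for each state.

```python
from collections import Counter


def transition_counts(z, order):
    """Return (N_s, N_sa) for the sample z.

    N_s[s]       : occurrences of the state s (a tuple of `order` symbols)
    N_sa[(s, a)] : occurrences of s immediately followed by the symbol a
    Assumption: only occurrences of s that have a successor are counted.
    """
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(z)):
        s = tuple(z[t - order:t])
        n_s[s] += 1
        n_sa[(s, z[t])] += 1
    return n_s, n_sa


def mle(z, order):
    """Maximum likelihood estimates N(s, a) / N(s) of P(a | s)."""
    n_s, n_sa = transition_counts(z, order)
    return {(s, a): c / n_s[s] for (s, a), c in n_sa.items()}


# Toy sample on the alphabet {1, 2} with order 1 (illustrative only).
sample = [1, 2, 1, 2, 1, 1]
estimates = mle(sample, order=1)
```

In the toy sample the state 1 occurs three times with a successor, twice followed by 2, so the estimate of $P(2|1)$ is 2/3.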

Definition 1

Consider two Markov chains $(Z_{1,t})_t$ and $(Z_{2,t})_t$ of order $o$, with finite alphabet $A$, state space $\mathcal{S} = A^o$ and independent samples $z_{1,1}^{n_1}$, $z_{2,1}^{n_2}$ respectively. Then, set

i. for each $s \in \mathcal{S}$,

$d_s(z_{1,1}^{n_1}, z_{2,1}^{n_2}) = \frac{\alpha}{(|A| - 1)\ln(n_1 + n_2)} \sum_{a \in A} \left\{ \sum_{k = 1,2} N_{n_k}(s, a) \ln\left(\frac{N_{n_k}(s, a)}{N_{n_k}(s)}\right) - N_{n_1 + n_2}(s, a) \ln\left(\frac{N_{n_1 + n_2}(s, a)}{N_{n_1 + n_2}(s)}\right) \right\},$

ii. $\mathrm{dmax}(z_{1,1}^{n_1}, z_{2,1}^{n_2}) = \max_{s \in \mathcal{S}} \{ d_s(z_{1,1}^{n_1}, z_{2,1}^{n_2}) \},$

with $N_{n_1 + n_2}(s, a) = N_{n_1}(s, a) + N_{n_2}(s, a)$ and $N_{n_1 + n_2}(s) = N_{n_1}(s) + N_{n_2}(s)$, where $N_{n_1}$ and $N_{n_2}$ are given as usual, computed from the samples $z_{1,1}^{n_1}$ and $z_{2,1}^{n_2}$ respectively. Moreover, $\alpha$ is a real and positive value.

Definition 1 introduces two notions of proximity between sequences: i. is local and ii. is global. Both are statistically consistent since, as $\min\{n_1, n_2\}$ increases, their capacity to detect discrepancies (when the underlying laws are different) and similarities (when the underlying laws are the same) grows. To decide whether the sequences follow the same law, it is only necessary to check that $d_s < 1$ for each $s \in \mathcal{S}$, that is, dmax < 1. This threshold is derived from the Bayesian Information Criterion (BIC), see [1]. In the application, we use $\alpha = 2$; with this value, we recover the usual expression of the BIC, given by [4].
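A minimal Python sketch of the two notions in Definition 1 may help to fix ideas. It is an illustration under stated assumptions, not the authors' code: the names `counts`, `d_s` and `dmax` are hypothetical, terms with a zero count are taken as zero (the convention $0 \ln 0 = 0$), and the maximum is taken over the states observed in at least one sample, since unobserved states contribute zero.

```python
import math
from collections import Counter


def counts(z, order):
    """N(s) and N(s, a) for one sample z (states are tuples of length `order`)."""
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(z)):
        s = tuple(z[t - order:t])
        n_s[s] += 1
        n_sa[(s, z[t])] += 1
    return n_s, n_sa


def _term(num, den):
    # N * ln(N / D), with the assumed convention 0 * ln(0) = 0.
    return num * math.log(num / den) if num > 0 else 0.0


def d_s(s, c1, c2, alphabet, n1, n2, alpha=2.0):
    """Local dissimilarity at state s (item i of Definition 1)."""
    (ns1, nsa1), (ns2, nsa2) = c1, c2
    total = 0.0
    for a in alphabet:
        k1, k2 = nsa1[(s, a)], nsa2[(s, a)]
        total += _term(k1, ns1[s]) + _term(k2, ns2[s])
        total -= _term(k1 + k2, ns1[s] + ns2[s])
    return alpha * total / ((len(alphabet) - 1) * math.log(n1 + n2))


def dmax(z1, z2, order, alphabet, alpha=2.0):
    """Global dissimilarity (item ii of Definition 1)."""
    c1, c2 = counts(z1, order), counts(z2, order)
    states = set(c1[0]) | set(c2[0])  # unobserved states contribute 0
    return max(d_s(s, c1, c2, alphabet, len(z1), len(z2), alpha) for s in states)


# Toy binary samples (illustrative only): one alternating, one constant.
alternating = [0, 1] * 50
constant = [0] * 100
same = dmax(alternating, alternating, order=1, alphabet=[0, 1])
different = dmax(alternating, constant, order=1, alphabet=[0, 1])
```

With $\alpha = 2$, the identical pair gives a value of zero and the alternating/constant pair gives a value far above the threshold 1, which is the behavior the definition is designed to produce.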

The next notion (Partition Markov Model-PMM) allows postulating a parsimonious model for a Markov process, aiming at the identification of states in the state space, which have in common their transition probabilities. Through this model we build the stochastic profiles.

Definition 2

Let $(Z_t)_t$ be a discrete time Markov chain of order $o$ on a finite alphabet $A$, with state space $\mathcal{S} = A^o$,

i. $s, r \in \mathcal{S}$ are equivalent if $P(a|s) = P(a|r)$ $\forall a \in A$.
ii. $(Z_t)_t$ is a Markov chain with partition $\mathcal{L} = \{L_1, L_2, \dots, L_{|\mathcal{L}|}\}$ if this partition is the one defined by the equivalence introduced by item i.

The model given by Definition 2 was introduced in reference [2], together with the strategy for its consistent estimation, which is also based on a metric defined on the state space and on the BIC. The parameters to be estimated are (a) the partition $\mathcal{L}$, (b) the transition probabilities of each part $L$ to any element of $A$, $P(\cdot|L) = \sum_{s \in \mathcal{S}} P(\cdot|s)$. Given a sample $z_1^n$ of $(Z_t)_t$, according to [2] the partition is estimated by means of $d_{\mathcal{L}}$ given by Definition 3.

Definition 3

Let $(Z_t)_t$ be a Markov chain of order $o$, with finite alphabet $A$ and state space $\mathcal{S} = A^o$, $z_1^n$ a sample of the process and let $\mathcal{L} = \{L_1, L_2, \dots, L_{|\mathcal{L}|}\}$ be a partition of $\mathcal{S}$ such that for all $s, r \in L$, $P(\cdot|s) = P(\cdot|r)$. Then, set $d_{\mathcal{L}}(i, j)$ between parts $L_i$ and $L_j$ as

$d_{\mathcal{L}}(i, j) = \frac{\alpha}{(|A| - 1)\ln(n)} \sum_{a \in A} \left\{ \sum_{k = i, j} N_n(L_k, a) \ln\left(\frac{N_n(L_k, a)}{N_n(L_k)}\right) - N_n(L_{ij}, a) \ln\left(\frac{N_n(L_{ij}, a)}{N_n(L_{ij})}\right) \right\},$

with $N_n(L) = \sum_{s \in L} N_n(s)$, $N_n(L, a) = \sum_{s \in L} N_n(s, a)$, for $a \in A$, $L \in \mathcal{L}$, $L_{ij} = L_i \cup L_j$, $N_n(L_{ij}) = N_n(L_i) + N_n(L_j)$ and $N_n(L_{ij}, a) = N_n(L_i, a) + N_n(L_j, a)$, for $a \in A$. Here $\alpha$ is a real and positive value.

The metric $d_{\mathcal{L}}$ is designed to build a structure in the state space by identifying equivalent states. It is applied, for example, to an initial set consisting of the entire state space $\mathcal{S}$, and whenever $d_{\mathcal{L}}(i, j) < 1$ the elements $L_i$ and $L_j$ must be in the same part (see properties of $d_{\mathcal{L}}$ in [2]). For each part $L$ of $\widehat{\mathcal{L}}$ (estimated partition) the transition probability is estimated by $\widehat{P}(a|L) = \frac{N_n(L, a)}{N_n(L)}$. Note that all equivalent states are used to estimate these probabilities; in this way, the total number of probabilities to be estimated is reduced.

In the next subsection we show how to integrate the tools presented here to build sequence groups (clusters) with the same stochastic law. We also explain how to define the stochastic profile of each cluster.

Clusters of sequences and partition by cluster

Given a collection of $p$ sequences $\mathcal{C} = \{z_{i,1}^{n_i}\}_{i=1}^{p}$, under the assumptions of Definition 1, the notion dmax (item ii of Definition 1) is used to define clusters in $\mathcal{C}$. We introduce an algorithm that shows how this is done.

Algorithm 1

• Input $\mathcal{C} = \{z_{i,1}^{n_i}\}_{i=1}^{p}$

1. $M = \mathcal{C}$

2. $M = \{m_1, \dots, m_{|M|}\}$,

3. $(i_*, j_*) = \arg\min\{\mathrm{dmax}(m_i, m_j),\ i \ne j,\ i, j \in \{1, 2, \dots, |M|\}\}$

* if $\mathrm{dmax}(m_{i_*}, m_{j_*}) < 1$, $M = \{\{M \setminus \{m_{i_*}\}\} \setminus \{m_{j_*}\}\} \cup m_{i_* j_*}$ with $m_{i_* j_*} = \{m_{i_*}, m_{j_*}\}$ and go back to 2,

* otherwise the procedure ends.

• Output (clusters of $\mathcal{C}$) $M = \{C_1, \dots, C_k\}$

That is, the initial $M$ is composed of all the separate sequences and the final $M$ corresponds to the groups of sequences or clusters. Note that given two different sequences $z_{i_*,1}^{n_{i_*}}, z_{j_*,1}^{n_{j_*}} \in \mathcal{C}$, the occurrence of each $s \in \mathcal{S}$ is recorded by $N_{n_{i_*}}(s)$ and $N_{n_{j_*}}(s)$ respectively, and the occurrence of $s$ followed by $a \in A$ is computed by $N_{n_{i_*}}(s, a)$ and $N_{n_{j_*}}(s, a)$. When defining the new unit $m_{i_* j_*} := \{z_{i_*,1}^{n_{i_*}}, z_{j_*,1}^{n_{j_*}}\}$ (after verifying $\mathrm{dmax}(z_{i_*,1}^{n_{i_*}}, z_{j_*,1}^{n_{j_*}}) < 1$), the count of the occurrences of $s$ is given by $N_{n_{i_*}}(s) + N_{n_{j_*}}(s)$ and, for $a \in A$, the count of the occurrences of $s$ followed by $a$ is $N_{n_{i_*}}(s, a) + N_{n_{j_*}}(s, a)$. That is, in the case of $m_{i_* j_*}$, both sequences, $z_{i_*,1}^{n_{i_*}}$ and $z_{j_*,1}^{n_{j_*}}$, contribute to the count attributed to $m_{i_* j_*}$.
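The agglomerative procedure of Algorithm 1, with counts pooled when two units are merged, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the names `make_unit`, `merge`, `dmax_units` and `cluster_sequences` are hypothetical, the sample size of a merged unit is assumed to be the sum of the sizes of its sequences, and zero counts are handled with the convention $0 \ln 0 = 0$.

```python
import math
from collections import Counter
from itertools import combinations


def make_unit(index, z, order):
    """A unit holds its member indices, its sample size and its counts."""
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(z)):
        s = tuple(z[t - order:t])
        n_s[s] += 1
        n_sa[(s, z[t])] += 1
    return {"members": [index], "n": len(z), "n_s": n_s, "n_sa": n_sa}


def merge(u, v):
    """Merged unit: both sequences (or groups) contribute to the counts."""
    return {"members": u["members"] + v["members"], "n": u["n"] + v["n"],
            "n_s": u["n_s"] + v["n_s"], "n_sa": u["n_sa"] + v["n_sa"]}


def _term(num, den):
    # N * ln(N / D), with the assumed convention 0 * ln(0) = 0.
    return num * math.log(num / den) if num > 0 else 0.0


def dmax_units(u, v, alphabet, alpha=2.0):
    scale = alpha / ((len(alphabet) - 1) * math.log(u["n"] + v["n"]))
    best = 0.0
    for s in set(u["n_s"]) | set(v["n_s"]):
        total = 0.0
        for a in alphabet:
            ku, kv = u["n_sa"][(s, a)], v["n_sa"][(s, a)]
            total += _term(ku, u["n_s"][s]) + _term(kv, v["n_s"][s])
            total -= _term(ku + kv, u["n_s"][s] + v["n_s"][s])
        best = max(best, scale * total)
    return best


def cluster_sequences(sequences, order, alphabet, alpha=2.0):
    """Merge the closest pair of units while its dmax is below 1."""
    units = [make_unit(i, z, order) for i, z in enumerate(sequences)]
    while len(units) > 1:
        d, i, j = min((dmax_units(units[i], units[j], alphabet, alpha), i, j)
                      for i, j in combinations(range(len(units)), 2))
        if d >= 1:
            break  # no remaining pair is judged to follow the same law
        merged = merge(units[i], units[j])
        units = [u for k, u in enumerate(units) if k not in (i, j)] + [merged]
    return [sorted(u["members"]) for u in units]


# Toy input (illustrative only): two alternating and two constant sequences.
toy = [[0, 1] * 50, [0, 1] * 50, [0] * 100, [0] * 100]
clusters = cluster_sequences(toy, order=1, alphabet=[0, 1])
```

On the toy input the two alternating sequences end up in one cluster and the two constant sequences in another. The sketch recomputes all pairwise values at each step for readability; a real application with thousands of sequences would need caching.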

Once the proximity between sequences has been determined and the clusters $\{C_1, \dots, C_k\}$ have been built, we can build a PMM for each cluster, representing the cluster. In addition, it is possible to quantify the dissimilarity between clusters using the notion dmax. Suppose the cluster $i$ is $C_i$ and it is composed of $m_i$ independent sequences,

$C_i = \{z_{i_1,1}^{n_{i_1}}, z_{i_2,1}^{n_{i_2}}, \dots, z_{i_{m_i},1}^{n_{i_{m_i}}}\} = \{z_{i_m,1}^{n_{i_m}}\}_{m=1}^{m_i},$

the sample size related to $C_i$ is $\sum_{m=1}^{m_i} n_{i_m}$. For each $s \in \mathcal{S}$, compute the occurrences of $s$ in $C_i$ as

$N(C_i, s) = \sum_{m=1}^{m_i} N_{n_{i_m}}(s)$ (1)

and the occurrences of $s$ followed by $a$ as

$N(C_i, (s, a)) = \sum_{m=1}^{m_i} N_{n_{i_m}}(s, a),$ (2)

for $N_{n_{i_m}}(\cdot)$ related to the sample $z_{i_m,1}^{n_{i_m}}$.

The following remark shows how to determine the stochastic profile of each cluster.

Remark 1

If, in Definition 3, we replace the sample size $n$ by $\sum_{m=1}^{m_i} n_{i_m}$ and apply Algorithm 1, substituting (i) $\mathcal{C} = \{z_{i,1}^{n_i}\}_{i=1}^{p}$ by $\mathcal{S} = A^o$ and (ii) dmax by $d_{\mathcal{L}}$, the output of the algorithm will be the partition $\widehat{\mathcal{L}}_i$ of $\mathcal{S}$ related to the cluster $C_i$.
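The substitution described in Remark 1 can be sketched in Python as an agglomeration of states instead of sequences. This is only an illustration under assumptions: the names `pooled_counts`, `d_L` and `estimate_pmm` are hypothetical, $N_n(L)$ is computed as $\sum_a N_n(L, a)$, states never observed in the cluster are left out of the partition, and zero counts use the convention $0 \ln 0 = 0$.

```python
import math
from collections import Counter
from itertools import combinations


def pooled_counts(sequences, order):
    """Counts of (s, a) pooled over the sequences of a cluster, as in equation (2)."""
    n_sa = Counter()
    for z in sequences:
        for t in range(order, len(z)):
            n_sa[(tuple(z[t - order:t]), z[t])] += 1
    return n_sa


def _term(num, den):
    # N * ln(N / D), with the assumed convention 0 * ln(0) = 0.
    return num * math.log(num / den) if num > 0 else 0.0


def d_L(part_i, part_j, n_sa, alphabet, n, alpha=2.0):
    """Dissimilarity between two parts (Definition 3)."""
    ci = [sum(n_sa[(s, a)] for s in part_i) for a in alphabet]
    cj = [sum(n_sa[(s, a)] for s in part_j) for a in alphabet]
    ni, nj = sum(ci), sum(cj)
    total = sum(_term(x, ni) + _term(y, nj) - _term(x + y, ni + nj)
                for x, y in zip(ci, cj))
    return alpha * total / ((len(alphabet) - 1) * math.log(n))


def estimate_pmm(sequences, order, alphabet, alpha=2.0):
    """Return the estimated parts and the estimated P(a | L) of each part."""
    n_sa = pooled_counts(sequences, order)
    n = sum(len(z) for z in sequences)  # cluster sample size
    parts = [frozenset([s]) for s in sorted({s for s, _ in n_sa})]
    while len(parts) > 1:
        d, i, j = min((d_L(parts[i], parts[j], n_sa, alphabet, n, alpha), i, j)
                      for i, j in combinations(range(len(parts)), 2))
        if d >= 1:
            break
        merged = parts[i] | parts[j]
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    probs = {}
    for part in parts:
        c = [sum(n_sa[(s, a)] for s in part) for a in alphabet]
        probs[part] = {a: x / sum(c) for a, x in zip(alphabet, c)}
    return parts, probs


# Toy clusters (illustrative only), binary alphabet, order 1.
parts_a, probs_a = estimate_pmm([[0, 0, 1, 1] * 50], 1, [0, 1])
parts_b, probs_b = estimate_pmm([[0, 1] * 100], 1, [0, 1])
```

In the first toy cluster both states move to 0 or 1 about half of the time, so they are joined in a single part; in the second the two states have opposite transitions and remain separate.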

The following remark shows how to measure the similarity between two clusters.

Remark 2

To establish the dissimilarity between the clusters $C_1$ and $C_2$ (since by construction those are different) we use item ii of Definition 1. In the calculation of item i of Definition 1, we replace $N_{n_k}(s)$ ($N_{n_k}(s, a)$) by equation (1) ((2)), with $i = k$. We also replace $N_{n_1 + n_2}(s)$ by $N(C_1, s) + N(C_2, s)$ and $N_{n_1 + n_2}(s, a)$ by $N(C_1, (s, a)) + N(C_2, (s, a))$. Using those occurrences we can compute the dissimilarity between the clusters.

The next section is intended to apply the concepts detailed here as well as the strategies presented to real data.

Risk through Discretized Information

Data and structure of the analysis

In Table 1, we describe the data inspected in this paper. The data correspond to serial records of energy consumption of clients of a power supply company (CPFL) during the period from January 2011 to June 2019. We have two types of records: Irregular, classified by specialists in fraud, and Other, which are not classified as Irregular. That is, irregular cases have already been classified since they were identified by the fraud detection system of the company. The other cases appear to be normal but could have been missed by the system used in fraud detection.

Table 1

Biphasic clients reported by CPFL, period: January, 2011 to June, 2019.

The monthly energy consumption sequence of each client $i$, $x_{i,1}^{q_i}$, is discretized in order to identify it with a sample of a Markov stochastic process $(Z_{i,t})_t$ of finite order $o$ on the discrete and finite alphabet $A$, for $i = 1, \dots, 8381$. The first inspection to be carried out seeks to identify clusters among the Irregular clients; this classification could point to specific fraud practices. So we determine the clusters $\{I_1, \dots, I_{k_I}\}$ of irregular clients by applying Algorithm 1 to the set of irregular clients. For the group Other, we also determine the clusters, say $\{O_1, \dots, O_{k_O}\}$ (by applying Algorithm 1 to Other). In this way, we can classify customers according to consumption practices. Once the Irregular clusters have been constructed, it is possible to quantify the dissimilarity between them; this is done by means of dmax as described in the previous section (see Remark 2), computing $\mathrm{dmax}(I_i, I_j)$, $i \ne j$, $i, j \in \{1, 2, \dots, k_I\}$. In a second instance, we compare the behavior of the clusters $O_l$, $l = 1, \dots, k_O$, with the irregular ones, computing $\mathrm{dmax}(I_i, O_l)$, $i \in \{1, \dots, k_I\}$, $l \in \{1, \dots, k_O\}$. We make this comparison in order to identify which clusters could be considered indistinguishable from some irregular cluster; this happens when $\mathrm{dmax}(I_i, O_l) < 1$. Such a comparison generates a risk index in the class $\{O_1, \dots, O_{k_O}\}$ that can guide the inspections of the company in that class. For each client $\upsilon \in O_{i_\upsilon}$ we define:

$a_\upsilon = \min_{1 \le j \le k_I} \{\mathrm{dmax}(O_{i_\upsilon}, I_j)\}$ (3)

In this way, the values obtained from equation (3) for the clients in the class Other (Tab. 1) are $\{a_\upsilon\}_{\upsilon=1}^{7828}$. By construction, for each client $\upsilon$ in Other, $\exists!\, i_\upsilon \in \{1, \dots, k_O\}$ such that $\upsilon \in O_{i_\upsilon}$, so equation (3) is well defined. Denote by $a_{\upsilon(1)}, a_{\upsilon(2)}, \dots, a_{\upsilon(7828)}$ the values sorted in increasing order. Thus, the client that receives the value $a_{\upsilon(1)}$ is the one with the highest risk and the one that receives the value $a_{\upsilon(7828)}$ is the client with the lowest risk; the threshold equal to 1 allows us to pay attention only to the clients whose values fall in $[0, 1)$. Figure 1 illustrates the situation.
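The risk index of equation (3) and the ordering of the clients can be illustrated with a short Python sketch. The function name `risk_ranking`, the data layout and all numbers below are hypothetical and purely illustrative; they are not taken from the tables of the paper.

```python
def risk_ranking(dmax_to_irregular, cluster_of, threshold=1.0):
    """Compute a_v = min_j dmax(O_{i_v}, I_j) and rank clients by risk.

    dmax_to_irregular[l] : list of dmax(O_l, I_j) over the irregular clusters j
    cluster_of[client]   : label l of the cluster O_l containing the client
    """
    a = {client: min(dmax_to_irregular[l]) for client, l in cluster_of.items()}
    ranked = sorted(a, key=a.get)  # increasing a: highest risk first
    flagged = [c for c in ranked if a[c] < threshold]  # values in [0, 1)
    return a, ranked, flagged


# Hypothetical dmax values between three Other clusters and two Irregular ones.
toy_dmax = {"O1": [0.2, 3.1], "O2": [4.0, 1.7], "O3": [0.9, 0.05]}
toy_clients = {"u1": "O1", "u2": "O2", "u3": "O3", "u4": "O1"}
a, ranked, flagged = risk_ranking(toy_dmax, toy_clients)
```

All clients of the same cluster share the same value, so in the toy data `u1` and `u4` receive 0.2, while `u2` falls above the threshold and is not flagged.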

Figure 1

Scheme of the organization of the ordered values $\{a_{\upsilon(i)}\}_{i=1}^{7828}$: on the left those that indicate greater risk, on the right those of lower risk. Cut = 1 indicates the threshold given by the BIC.

Results

We compare the series of consumption through a discretization that considers four possible states and reports the performance of the series in relation to the magnitude of the consumption in the last measurement (at time $t$) when compared with the two previous measurements (times $t - 1$ and $t - 2$). For each client $i$ with consumption series $x_{i,1}^{q_i}$, define the sample $z_{i,1}^{n_i}$ of the discrete process $(Z_{i,t})_t$ as

$Z_{i,t} = \begin{cases} 1, & x_{i,t} < x_{i,t-1},\ x_{i,t} < x_{i,t-2} \\ 2, & x_{i,t} < x_{i,t-1},\ x_{i,t} \ge x_{i,t-2} \\ 3, & x_{i,t} \ge x_{i,t-1},\ x_{i,t} < x_{i,t-2} \\ 4, & x_{i,t} \ge x_{i,t-1},\ x_{i,t} \ge x_{i,t-2}. \end{cases}$ (4)

Then, $A = \{1, 2, 3, 4\}$ and $|A| = 4$. In the set of sequences, the smallest one has a sample size equal to 39; the indicated order $o$ follows the rule $o < \log_4(39) = 2.643$, so $o = 2$. The application of Algorithm 1 to the Irregular group generates $k_I = 12$ clusters. Table 2 shows the maximum value of dmax reported by the algorithm inside each cluster, denoted by $d_i^*$. These results measure the homogeneity within each irregular cluster. Lower values of $d_i^*$ indicate greater homogeneity, and values of $d_i^*$ close to 1 indicate greater heterogeneity.
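The coding of equation (4) and the rule for the order can be written as a small Python sketch. The names `encode_z` and `max_order` are hypothetical, and the sketch assumes that the coded sample starts at the third observation, the first one with two predecessors.

```python
import math


def encode_z(x):
    """Four-state coding of equation (4) for a consumption series x."""
    z = []
    for t in range(2, len(x)):
        below_prev = x[t] < x[t - 1]   # compared with time t - 1
        below_prev2 = x[t] < x[t - 2]  # compared with time t - 2
        if below_prev and below_prev2:
            z.append(1)
        elif below_prev:
            z.append(2)
        elif below_prev2:
            z.append(3)
        else:
            z.append(4)
    return z


def max_order(n, alphabet_size):
    """Largest integer o with o < log_|A|(n)."""
    return math.ceil(math.log(n, alphabet_size)) - 1


# Toy consumption series (illustrative only).
coded = encode_z([5, 3, 4, 2, 6])
order = max_order(39, 4)
```

For the toy series the coded sample is `[3, 1, 4]`, and for the shortest sequence of the data set (size 39, $|A| = 4$) the rule returns order 2, in agreement with the text.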

Table 2

Description of the $k_I = 12$ irregular clusters $I_i$, $i = 1, \dots, 12$. $n_{I_i}$: clients in $I_i$; $d_i^*$: maximal dmax attained by $I_i$; values reported from left to right in increasing order. Using the process $(Z_{i,t})_t$, see equation (4).

After identifying 12 Irregular clusters, we can explore the dissimilarity between them computing the values of dmax between the clusters (see Remark 2). Table 3 shows the results.

Table 3

dmax between the irregular clusters $\{I_1, \dots, I_{k_I}\}$ (see Eq. (4) and Remark 2). In bold type, the three highest values.

In a stochastic way, the table quantifies the differences in fraud practices classified by the company. With the purpose of exploring the dynamics of each irregular cluster, we fit a PMM for each cluster. This leads us to describe the meaning of each possible state for the $(Z_{i,t})_t$ process. In Table 4, we report the relation of the states $s \in \mathcal{S}$ with the consumption. Each possible state is composed of the concatenation of $a$ and $b$ in $A$, so the state is $ab$. By construction, the states relate the magnitudes of the energy consumption at times $t - 3$, $t - 2$, $t - 1$ and $t$; thus, for example, the state $ab = 13$ means $\{X_t \ge X_{t-1}\ \&\ X_t < X_{t-2}\}$ (associated with $b = 3$) and $\{X_{t-1} < X_{t-2}\ \&\ X_{t-1} < X_{t-3}\}$ (associated with $a = 1$). Note that some combinations are not allowed by construction; those are 12, 22, 33, 43 (see Tab. 5).
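The list of combinations that cannot occur can be checked by brute force. The Python sketch below is an illustration with hypothetical names (`code`, `reachable`, `impossible`): it enumerates every weak ordering of four consecutive consumption values, codes the last two time points with equation (4), and collects the pairs $ab$ that actually appear.

```python
from itertools import product


def code(x_t, x_prev, x_prev2):
    """Symbol of equation (4) for x_t given the two previous values."""
    if x_t < x_prev:
        return 1 if x_t < x_prev2 else 2
    return 3 if x_t < x_prev2 else 4


# Values in {0, 1, 2, 3} are enough to represent every ordering (with ties)
# of four consecutive observations x_{t-3}, x_{t-2}, x_{t-1}, x_t.
reachable = set()
for x3, x2, x1, x0 in product(range(4), repeat=4):
    a = code(x1, x2, x3)  # symbol at time t - 1
    b = code(x0, x1, x2)  # symbol at time t
    reachable.add((a, b))

all_pairs = set(product(range(1, 5), repeat=2))
impossible = sorted(all_pairs - reachable)
```

The enumeration leaves exactly the pairs (1, 2), (2, 2), (3, 3) and (4, 3) unreachable, that is, the states 12, 22, 33 and 43, so 12 of the 16 states of $A^2$ can occur.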

Table 4

States $s \in \mathcal{S}$ and consumption behavior, coding $Z_{i,t}$, see equation (4).

Table 5

Impossible states, coding $Z_{i,t}$, see equation (4).

Note that the states described in Table 4 indicate a decreasing/increasing trajectory ending, and this behavior is reflected in the partitions generated for each irregular cluster (see Tabs. 6 and 7), with only two possible exceptions, for states that allow consumption maintenance. Consider two large groups: (a) those of increasing final trajectories, 13, 14, 23, 24, 34, 44 (including increasing/maintenance), and (b) those of decreasing final trajectories, 11, 21, 31, 32, 41, 42. We see that all models (except in two situations) have separated the states into those two large classes. That is, in each part of each model we only find states of one type. For example, take $I_{11}$: it is composed of 5 parts, 2 parts composed of states with a decreasing trajectory ending, $L_1 = \{11, 31, 42, 21, 32\}$ and $L_5 = \{41\}$, and 3 parts composed of states with an increasing trajectory ending, $L_2 = \{13, 23\}$, $L_3 = \{14, 24\}$, $L_4 = \{34, 44\}$. The exceptions are, for $I_1$, the part $L_4 = \{31, 42, 34\}$, where 31 and 42 have decreasing endings and 34 allows an increasing ending, and, for $I_3$, the part $L_3 = \{21, 23\}$, where 21 has a decreasing ending and 23 allows an increasing ending.

Table 6

PMM for the irregular clusters $I_i$, $i = 1, \dots, 6$, see equation (4) and Remark 1. In bold type the highest probability for the cluster.

Table 7

PMM for the irregular clusters $I_i$, $i = 7, \dots, 12$, see equation (4) and Remark 1. In bold type the highest probability for the cluster.

From the magnitudes of the estimated probabilities (Tabs. 6 and 7) we see that the clusters show two preferences (in bold): state 1 for the cases $I_1$, $I_2$, $I_3$, $I_4$, $I_7$, $I_9$, $I_{12}$, and state 4 for the remaining cases, $I_5$, $I_6$, $I_8$, $I_{10}$, $I_{11}$. State 1 indicates a decrease in consumption at time $t$ in relation to the previous instants $t - 1$ and $t - 2$, and state 4 indicates an increase/maintenance of consumption at time $t$ in relation to the previous instants $t - 1$ and $t - 2$. Moreover, for all cases $I_i$, $i = 1, \dots, 12$, the first two choices (the two highest probabilities) fall in states 1 or 4. Note that when the preference is state 1, the past states (elements of the parts) end in 1 or 2; that is, according to the classification given in Table 4, the process was already in a decreasing final trajectory (except $I_3$). When the preference is state 4, the past states (elements of the parts) end in 3 or 4; that is, according to the classification (Tab. 4), the process was in a maintenance or increasing trajectory.

The group Other is divided by Algorithm 1 (coding $Z_{i,t}$, equation (4)) into 391 clusters, so $k_O = 391$. As the purpose of this paper is to identify those customers in the Other category that resemble an irregular cluster, we proceed to measure this similarity. For each $I_i$ we calculate dmax between such irregular cluster and the clusters $O_j$, $j = 1, \dots, k_O$. Table 8 summarizes the obtained values.

Table 8

By line (for $i = 1, \dots, k_I$), for each cluster $I_i$ we report the minimum, median and maximum value of dmax computed between $I_i$ and each group $O_j$, $j = 1, \dots, k_O$, see equation (4) and Remark 2. With * we indicate when similarity was detected for some $O_j$.

In Table 9, we report which $O_j$ clusters behave as irregular. The lower the value of dmax on the right, the higher the risk of group $O_j$, as it becomes indistinguishable from an irregular cluster.

Table 9

For $I_i$, $i = 1, 3, 4, 5, 7, 8$, the clusters $O_j$ with dmax < 1 are reported, see equation (4) and Remark 2. With * we indicate the highest risk cases.

We note that there are 63 clients that deserve a detailed inspection, since their risks are pronounced (dmax values below 0.7). And certainly, the priority is for the 36 with approximately zero dmax.

As set forth in Table 4, Irregular processes define their minimum units, parts of the partitions, according to the types of final trajectories (a) increasing/maintenance final trajectories and (b) decreasing final trajectories, which leads us to inspect the consumption series via a representation showing that trend. The following subsection is intended for this purpose.

Increasing and decreasing movements

Based on the findings, we introduce a complementary coding that allows us another perspective of the study. For that, we consider two movements in the consumption series. For each client $i$ with series $x_{i,1}^{q_i}$, define the sample $y_{i,1}^{n_i}$ of the discrete process $(Y_{i,t})_t$ as

$Y_{i,t} = \begin{cases} 0, & x_{i,t} < x_{i,t-1} \\ 1, & x_{i,t} \ge x_{i,t-1}. \end{cases}$ (5)

Then, $A = \{0, 1\}$ and $|A| = 2$. We adopt the memory $o = 2$ in order to facilitate the interpretation in concordance with the previous inspection. See the meaning of the states of the process $(Y_{i,t})_t$ in Table 10.
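A short Python sketch of the coding in equation (5), together with the order 2 states visited by a coded series, is given below. The names `encode_y` and `states_of_order_2` are hypothetical, and the toy series is illustrative only.

```python
def encode_y(x):
    """Binary coding of equation (5): 0 for a decrease, 1 for increase/maintenance."""
    return [0 if x[t] < x[t - 1] else 1 for t in range(1, len(x))]


def states_of_order_2(y):
    """Consecutive pairs of symbols, written as strings such as '01'."""
    return [f"{y[t - 1]}{y[t]}" for t in range(1, len(y))]


# Toy consumption series (illustrative only).
y = encode_y([5, 3, 4, 4, 2])
visited = states_of_order_2(y)
```

Here the series decreases, increases, stays level and decreases again, so the coded sample is `[0, 1, 1, 0]` and the visited states are `01`, `11` and `10`.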

Table 10

States $s \in \mathcal{S}$ and consumption behavior, codification $Y_{i,t}$, see equation (5).

In Table 11 we show the $k_I = 22$ clusters defined by Algorithm 1 in the Irregular class (Tab. 1), $I_i^{0-1}$, $i = 1, \dots, 22$, derived from the codification $Y_{i,t}$, see equation (5).

Table 11

Description of the $k_I = 22$ irregular clusters $I_i^{0-1}$, $i = 1, \dots, 22$. $n_{I_i^{0-1}}$: clients in $I_i^{0-1}$; $d_i^*$: maximal dmax attained by $I_i^{0-1}$; values reported from left to right in increasing order. Using the process $(Y_{i,t})_t$, see equation (5).

We see that, in relation to the irregular clusters obtained via the $Z_{i,t}$ encoding, the $Y_{i,t}$ encoding almost doubles the irregularity modalities. While Table 2 reports only 3 in 12 (25%) cases with $d^* < 0.5$, Table 11 reports 12 in 22 (55%) cases with $d^* < 0.5$, which explains the increase in the number of groups reported in Table 11. In the Appendix, Tables 16 and 17, we report the PMM for each Irregular cluster derived from the codification $(Y_{i,t})_t$. We report 14 models with only two parts, 7 with 3 parts and 1 model with 4 parts. States 00 and 11, which reiterate a trend of consecutive decrease in energy consumption or consecutive increase/maintenance, are found in separate parts in all models except in four cases: $I_i^{0-1}$, $i = 11, 12, 13, 20$.

The group Other, under the representation $Y_{i,t}$ of equation (5), is divided by Algorithm 1 into 128 clusters, $O_1^{0-1}, \dots, O_{k_O}^{0-1}$, with $k_O = 128$. That is, we obtain an increase in the number of profiles of the Irregular clusters, from 12 to 22, and a decrease in the number of clusters in the class Other, from 391 to 128.

For each $I_i^{0-1}$ we calculate the dmax between such irregular cluster and the clusters $O_j^{0-1}$, $j = 1, \dots, k_O$; Table 12 shows the results. According to coding 0−1, the only cluster $I_i^{0-1}$ that is not associated with any element of the class Other (see Tab. 1) is $I_{20}^{0-1}$, which has 68 clients. We must not lose sight of the fact that the risk increases when obtaining values of dmax close to zero, and only those cases need to be identified. All criteria are asymptotic, so they should be considered with caution; that is to say, cases with dmax near the threshold 1 can wrongly flag clients that are in fact regular.

Table 12

By line (for i = 1, …, k _I) for each cluster $I_{i}^{0 - 1}$ ${I}_i^{0-1}$ is reported the minimum, median and maximum value of dmax computed between $I_{i}^{0 - 1}$ ${I}_i^{0-1}$ and each group $O_{j}^{0 - 1}$ ${O}_j^{0-1}$ , j = 1, …, k _O, see equation (5) and Remark 2. With * we indicate when similarity was detected, for some $O_{j}^{0 - 1} .$ ${O}_j^{0-1}.$

From the results reported in Table 13, we see that, in relation to the meaning given by the $Y_{i,t}$ coding (see Tab. 10), the total number of cases in the class Other that can be identified with irregular clusters increases considerably. These results could indicate the relevance of the memory of the process, since coding 1–4 reaches further into the past than coding 0−1 (compare Tabs. 4 and 10) and is able to separate the class Other from the class Irregular in a more realistic way.

Table 13

For $I_i^{0-1}$, $i = 1, \dots, 22$, $i \neq 20$, the clusters $O_j^{0-1}$ with dmax < 1 are reported; see equation (5) and Remark 2.

As the discretizations given by equations (4) and (5) lead to simplifications of the original information, we consider both for the classification of clients. In the next subsection we take both codifications into account and propose a strategy for the inspection of potentially fraudulent customers.

Risk of clients from two codifications

It is always wise to consider that the representations given by the $(Z_{i,t})_t$ and $(Y_{i,t})_t$ processes (see Eqs. (4) and (5)) only capture certain aspects of the original consumption series. As they reveal complementary information, in this subsection we consider both to guide the decision-making process in the search for undetected frauds. We introduce a function that allows a risk classification integrating both codifications, generated by equation (3). For each client $\upsilon \in O_{i_\upsilon} \cap O_{k_\upsilon}^{0-1}$ we compute:

$a_\upsilon = \min_{1 \le j \le 12} \{ d\mathrm{max}(O_{i_\upsilon}, I_j) \}$ (6)

$b_\upsilon = \min_{1 \le j \le 22} \{ d\mathrm{max}(O_{k_\upsilon}^{0-1}, I_j^{0-1}) \}.$ (7)

Note that, by construction, for each client $\upsilon$ in Other, $\exists!\, i_\upsilon \in \{1, \dots, 391\}$ such that $\upsilon \in O_{i_\upsilon}$ and $\exists!\, k_\upsilon \in \{1, \dots, 128\}$ such that $\upsilon \in O_{k_\upsilon}^{0-1}$, so equations (6) and (7) are well defined. $a_\upsilon$ and $b_\upsilon$ represent marginal risks of client $\upsilon$, since $a_\upsilon$ depends on $Z_{i,t}$ (see equation (4)) and $b_\upsilon$ depends on $Y_{i,t}$ (see equation (5)). Moreover, other representations can be included in the risk definition process, according to the information provided by the inspection. In Table 14 we report the number of cases in the class Other (see Tab. 1) by risk bands.
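As a minimal sketch of equations (6) and (7), the marginal risks of a client can be read off once the dmax values between each Other cluster and the irregular clusters are available. The function name `client_risks`, the dictionary layout and all numbers below are assumptions made for illustration; computing dmax itself is outside the scope of this sketch.

```python
def client_risks(membership, dmax_z, dmax_y):
    """Return {client: (a, b)} following equations (6) and (7).

    membership[client] = (i_v, k_v): the unique Other cluster of the client
        under coding Z and under coding Y (hypothetical layout).
    dmax_z[i_v]: dmax values between O_{i_v} and every irregular cluster I_j.
    dmax_y[k_v]: dmax values between O_{k_v}^{0-1} and every I_j^{0-1}.
    """
    risks = {}
    for client, (i_v, k_v) in membership.items():
        a = min(dmax_z[i_v])  # equation (6): smallest dmax under coding Z
        b = min(dmax_y[k_v])  # equation (7): smallest dmax under coding Y
        risks[client] = (a, b)
    return risks


# Made-up values, for illustration only.
dmax_z = {1: [0.3, 1.4], 2: [2.0, 1.6]}
dmax_y = {1: [0.8, 1.1, 0.5], 2: [1.9, 2.2, 1.3]}
membership = {"u1": (1, 1), "u2": (2, 1), "u3": (2, 2)}

risks = client_risks(membership, dmax_z, dmax_y)
print(risks)
```

Because each client belongs to exactly one cluster under each coding, the lookup by `(i_v, k_v)` is unambiguous, which mirrors the well-definedness remark in the text.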

Table 14

Number of clients by interval, $a_\upsilon$ given by equation (6) and $b_\upsilon$ given by equation (7). In bold, the number of cases with high risk.

If we consider as low risk those clients with dmax near to 1 or more, there are 79 risk cases that should be inspected: the cases inside the set [0, 0.9] × [0, 0.9] (in bold in Tab. 14). As exemplified in Table 14, various representations of the original information can be integrated into the definition of a client’s risk; in this case we have adopted two, which reveal the need to first inspect 79 clients, according to both representations. If the cases indicated for inspection are too many for the availability of the company, customer selection criteria such as the one described in [5] may be applied. Reference [5] shows that, through a robust criterion, it is possible to select a representative client of the cluster that could be inspected first.
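The selection of the high-risk region can be sketched as a simple filter on the pair of marginal risks. The function name `flag_for_inspection` and the toy values are assumptions; the cutoff 0.9 corresponds to the set [0, 0.9] × [0, 0.9] mentioned above.

```python
def flag_for_inspection(risks, cutoff=0.9):
    """Return the clients whose two marginal risks both lie in [0, cutoff].

    risks maps a client identifier to the pair (a, b) of equations (6)-(7).
    """
    return sorted(
        client
        for client, (a, b) in risks.items()
        if 0.0 <= a <= cutoff and 0.0 <= b <= cutoff
    )


# Made-up marginal risks, for illustration only.
toy_risks = {
    "u1": (0.3, 0.5),   # both small: flagged
    "u2": (1.6, 0.5),   # a above the cutoff: not flagged
    "u3": (0.9, 0.9),   # on the boundary of the closed set: flagged
    "u4": (0.2, 1.0),   # b above the cutoff: not flagged
}
flagged = flag_for_inspection(toy_risks)
print(flagged)
```

Requiring both coordinates to be small reflects the choice made in the text of inspecting first the clients that look risky under both representations.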

In the following subsection, we analyze the ability of the proposed strategy (see Eq. (3)) to detect fraud under each type of discretization, (4) and (5).

Assertiveness of classification

We reserve this subsection to assess the predictive capacity of the classification given by Algorithm 1. What interests us is the quality of classification of Irregular customers, as these have gone through rigorous inspection processes and have been defined as fraud. The database of our inspection is given in Table 1; we proceed as follows. Consider the clusters defined by Algorithm 1 in the Irregular class, $I_1, I_2, \dots, I_{k_I}$: (i) randomly select s% of the irregular customers, say $\upsilon_{i_1}, \dots, \upsilon_{i_{s k_I/100}}$; (ii) apply Algorithm 1 in the Irregular class (without the clients selected in (i)) and denote the clusters as $I'_1, I'_2, \dots, I'_{k'_I}$; (iii) for each element $\upsilon_{i_j} \in I_{\upsilon_{i_j}}$ find the cluster $I'_{\upsilon_{i_j}}$ such that $|I'_{\upsilon_{i_j}} \cap I_{\upsilon_{i_j}}| \ge |I'_i \cap I_{\upsilon_{i_j}}|, \ \forall i \in \{1, \dots, k'_I\}$; (iv.a) compute $\delta(\upsilon_{i_j}) = d\mathrm{max}(\upsilon_{i_j}, I'_{\upsilon_{i_j}})$ and (iv.b) record $|\{ j \in \{i_1, \dots, i_{s k_I/100}\} : \delta(\upsilon_j) < 1 \}|$.

Note that the cluster $I'_{\upsilon_{i_j}}$ for which (iii) holds can be considered the most indicated cluster for the client $\upsilon_{i_j}$, since the client $\upsilon_{i_j}$ is a member of $I_{\upsilon_{i_j}}$ and the sets $I'_{\upsilon_{i_j}}$ and $I_{\upsilon_{i_j}}$ share the largest number of customers.
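Steps (iii) and (iv.b) can be sketched with plain sets. The helper names `most_indicated_cluster` and `success_percentage`, the client labels and the δ values are all hypothetical; Algorithm 1 and the computation of dmax are not reproduced here, so the δ values are taken as given.

```python
def most_indicated_cluster(original_cluster, new_clusters):
    """Step (iii): index of the re-estimated cluster sharing the largest
    number of customers with the client's original cluster.
    Ties are resolved in favour of the first such cluster (an assumption)."""
    overlaps = [len(original_cluster & c) for c in new_clusters]
    return max(range(len(new_clusters)), key=overlaps.__getitem__)


def success_percentage(deltas, threshold=1.0):
    """Step (iv.b): percentage of held-out clients with delta < threshold."""
    hits = sum(1 for d in deltas if d < threshold)
    return 100.0 * hits / len(deltas)


# Made-up clusters, for illustration only. "c1" plays the held-out client,
# so it belongs to the original cluster but to none of the new clusters.
original = {"c1", "c2", "c3", "c4"}
new_clusters = [{"c2"}, {"c3", "c4", "c9"}, {"c7", "c8"}]
best = most_indicated_cluster(original, new_clusters)

# Made-up delta values for four held-out clients.
deltas = [0.2, 0.95, 1.3, 0.7]
rate = success_percentage(deltas)
print(best, rate)
```

Here the second re-estimated cluster shares two customers with the original one, so it is selected, and three of the four toy δ values fall below 1.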

Note that under both discretizations, the average percentage of successes is greater than 77% and the minimum percentages exceed 65% in the three settings for each discretization; see Table 15.

Table 15

Percentage of cases classified in the most indicated cluster. N: number of simulations with s = 15. Up: coding $Z_{i,t}$ (equation (4)); Down: coding $Y_{i,t}$ (equation (5)).

4 Conclusion

In practical terms, this article deals with the capacity of discretizations to extract relevant information contained in observations in series. Such discretizations make it possible to use and adapt tools from discrete stochastic processes. Through the metric of Definition 1 (see [1]), it is possible to measure the similarity/discrepancy between samples of discrete stochastic processes. Such a metric is statistically consistent for establishing the similarity/discrepancy. Based on the metric, this article proposes Algorithm 1, which defines clusters of samples, where each cluster contains those sequences that follow the same stochastic law. From the previously demonstrated properties (see [1]), the clusters are assembled consistently and represent different profiles associated with the sequences inside. To identify how these profiles operate, we use the Partition Markov Models (Definition 2, [2]), which are consistently estimated, by means of a metric (Definition 3), using the sequences located in the cluster. We generate a model for each cluster, which gives the minimal representation of the state space (partition) and the transition probabilities for any element of the alphabet. Based on all these elements, we deal with a real problem where there are sequences of observations of energy consumption of two groups: (i) Irregular and (ii) Other (see Table 1). We define two types of discretization, (4) and (5); through them we identify the clusters of (i) that group similar consumption practices, and we do the same with (ii). Through Partition Markov Models, we represent the stochastic profile of each cluster of (i) (see Remark 1). We identify which clusters of (ii) are confused with the clusters of (i) (see Remark 2), which allows us to point out the cases in (ii) that deserve inspection. The success rates of the classification given by the procedure are high, as shown in the study of Section 3.5, and on average they exceed 77%.

This whole procedure allows us to establish risk indicators for (ii) and also an order that indicates the most and least serious cases. We see then that, by means of two discretizations, it is possible to point out cases to be reviewed according to the magnitudes of the notion (3); our results (Table 14) state that 79 cases should go through revision, according to (6) and (7). For additional details, see [6].

Acknowledgments

The authors Hugo Helito da Silva and T. Soares Silva gratefully acknowledge the financial support provided by ANEEL R&D Program under grant PD-00063-3037/2018, developed by CPFL Energia. The authors also wish to thank the Editor-led peer review process and the reviewers, who provided many helpful comments and suggestions on an earlier draft of this paper.

Appendix

Table A1

PMM for the irregular clusters $I_i^{0-1}$, i = 1, …, 12, see equation (5) and Remark 1.

Table A2

PMM for the irregular clusters $I_i^{0-1}$, i = 13, …, 22, see equation (5) and Remark 1.

References

[1] García Jesús E, Gholizadeh R, González-López VA (2018), A BIC-based consistent metric between Markovian processes. Appl Stoch Models Bus Ind 34, 6, 868–878.
[2] García Jesús E, González-López VA (2017), Consistent estimation of Partition Markov Models. Entropy 19, 4, 160.
[3] Cordeiro MTA, García Jesús E, González-López VA, Londoño SLM (2019), Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014. 4open 2, 20.
[4] Schwarz G (1978), Estimating the dimension of a model. Ann Stat 6, 2, 461–464.
[5] Fernández M, García Jesús E, Gholizadeh R, González-López VA (2019), Sample selection procedure in daily trading volume processes. Math Meth Appl Sci 43, 7537–7549. https://doi.org/10.1002/mma.5705.
[6] Soares Silva T (2020), Similaridades entre Processos de Markov (unpublished master’s thesis).

Cite this article as: García JE, González-López VA, da Silva HH & Silva TS 2020. Risk of fraud classification. 4open, 3, 9


[1] García Jesús E, Gholizadeh R, González-López VA (2018), A BIC-based consistent metric between Markovian processes. Appl Stoch Models Bus Ind 34, 6, 868–878. [Google Scholar]

[2] García Jesús E, González-López VA (2017), Consistent estimation of Partition Markov Models. Entropy 19, 4, 160. [CrossRef] [Google Scholar]

[3] Cordeiro MTA, García Jesús E, González-López VA, Londoño SLM (2019), Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014. 4open 2, 20. [CrossRef] [EDP Sciences] [Google Scholar]

[4] Schwarz G (1978), Estimating the dimension of a model. Ann Stat 6, 2, 461–464. [Google Scholar]

[5] Fernández M, García Jesús E, Gholizadeh R, González-López VA (2019), Sample selection procedure in daily trading volume processes. Math Meth Appl Sci 43, 7537–7549. https://doi.org/10.1002/mma.5705. [CrossRef] [Google Scholar]

[6] Soares Silva T (2020), Similaridades entre Processos de Markov (unpublished master’s thesis). [Google Scholar]