Partition Markov Model for Covid-19 Virus

Jesús Enrique García; Verónica Andrea González-López; Gustavo Henrique Tasca

doi:10.1051/fopen/2020013

4open, 3 (2020) 13

Issue		4open Volume 3, 2020 COVID-19 Articles


Article Number		13
Number of page(s)		11
Section		Mathematics - Applied Mathematics
DOI		https://doi.org/10.1051/fopen/2020013
Published online		30 September 2020

Research Article

Jesús Enrique García, Verónica Andrea González-López and Gustavo Henrique Tasca^*

Department of Statistics, University of Campinas, Sergio Buarque de Holanda, 651, 13083-859 Campinas, S.P., Brazil

^* Corresponding author: tasca_gustavo@hotmail.com

Received: 12 March 2020
Accepted: 13 August 2020

Abstract

In this paper, we investigate a specific structure within the theoretical framework of Partition Markov Models (PMM) [see García Jesús and González-López, Entropy 19, 160 (2017)]. The structure of interest lies in the formulation of the underlying partition, which defines the process, in which, in addition to a finite memory o associated with the process, a parameter G is introduced, allowing an extra dependence on the past complementing the dependence given by the usual memory o. We show, by simulations, how algorithms designed for the classic version of the PMM can have difficulties in recovering the structure investigated here. This specific structure is efficient for modeling a complete genome sequence, coming from the newly decoded Coronavirus Covid-19 in humans [see Wu et al., Nature 579, 265–269 (2020)]. The sequence profile is represented by 13 units (parts of the state space’s partition), for each of the 13 units, their respective transition probabilities are computed for any element of the genetic alphabet. Also, the structure proposed here allows us to develop a comparison study with other genomic sequences of Coronavirus, collected in the last 25 years, through which we conclude that Covid-19 is shown next to SARS-like Coronaviruses (SL-CoVs) from bats specimens in Zhoushan [see Hu et al., Emerg Microb Infect 7, 1–10 (2018)].

Key words: Bayesian information criterion / Partition Markov Models / Metric between Markov processes

© J.E. García et al., Published by EDP Sciences, 2020

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

A Partition Markov Model (PMM) is a model established in a discrete stochastic process on a finite alphabet, with finite order, see [1]. PMM generalizes other models, including Variable Length Markov chains and Markov chains with finite order. A PMM identifies a partition in the state space, bringing together in each part of the partition, the states that share the same transition probabilities for all the elements in the alphabet. All the states of a part are used to compute the transition probabilities, allowing us to use several states (all those included in the part) to estimate a unique set of transition probabilities. By construction, this is a parsimonious model, since it reduces the number of probabilities to estimate by identifying equivalent states (found in the same part). PMM models have already shown sufficient flexibility in the field of genomic structure modeling, for several purposes, such as determining similarities between Zika’s genomic sequences and modeling the genomic Zika’s profile [2]. This family of models has also been used to define the genomic profile of the Epstein–Barr virus [3]. The consistent estimation of a PMM [1], is achieved by the Bayesian Information Criterion, BIC, which has led to the definition of a BIC-based metric, see also [4]. The metric has allowed the use of the dynamics of the PMM models for other open problems; it has been efficiently applied in subjects such as (1) the comparison between Dengue’s genomic sequences of different origins [5] and (2) to compare genomic sequences of Burkitt lymphoma/leukemia [6]. Since a PMM is defined by a partition on the state space of the process, we understand that the partition’s structure could be the key to modeling certain phenomena. Then, in this paper, we investigate specific impositions on the partition of a PMM, with the purpose of improving the modeling of genomic sequences. Given a memory o, the states are sequences of o elements of the alphabet. 
The state space is the set of states, and the partition of the PMM is a partition of the state space. Each state is a configuration that occurs in a consecutive interval of time, so in order to define the next element of the process (the transition) it is necessary to observe a past of size o. In addition to a finite memory o, this paper introduces a parameter G, which allows the model to consider dependence on earlier events in the past of the process realization that are not accounted for by the memory o. The PMM thus specified could be used to achieve more intricate dependence structures for the representation of genomic sequences. The objective of this paper is to verify whether we can in fact achieve a finer model for the genomic structure of a complete DNA sequence of the outbreak of a novel Coronavirus (Covid-19), collected in Wuhan of Hubei province, China. In addition to tracing the genomic profile of Covid-19, we also want to compare genomic sequences that, according to the literature, could point to the origins of the sequence used in this article, considered one of the first records of the virus.

This article is organized as follows. Section Theoretical Background addresses the theoretical foundations of Partition Markov Models, the estimation process, and the specific case of PMM that we investigate in this article. Section Covid-19 DNA Model describes a real problem associated with the identification of the profile of a new Coronavirus named Covid-19. This section describes the data and how the specific case of PMM shows an improved performance to describe the stochastic behavior of the new virus. The conclusions and considerations are given in Section Conclusion.

Theoretical background

In this section, we present the notation as well as the formalization of the model on which we develop our discussion, the Partition Markov Model. We also show how this model can be consistently estimated. Let (Z_t)_t≥1 be a discrete time Markov chain of order o on a finite alphabet A, such that o < ∞. Let us call 𝒮 = A^o the state space and denote the string a_m a_m+1 … a_n by $a_m^n$, where a_i ∈ A, m ≤ i ≤ n. For each a ∈ A and s ∈ 𝒮 define the transition probability $P(a|s) = \mathrm{Prob}(Z_t = a \mid Z_{t-o}^{t-1} = s)$. In a given sample $z_1^n$, coming from the stochastic process, the number of occurrences of s is denoted by N_n(s) and the number of occurrences of s followed by a is denoted by N_n(s, a). In this way, $\frac{N_n(s,a)}{N_n(s)}$ is the maximum likelihood estimator of P(a|s). The Partition Markov Model introduced in the next definition is designed to obtain a parsimonious model for a Markov process with finite memory on a finite alphabet. This model groups the states of the state space into units called parts (of a partition); each part is composed of states that share the same transition probabilities.
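As a minimal sketch of the estimator above, the following Python snippet counts N_n(s) and N_n(s, a) in a sample and forms the ratio N_n(s, a)/N_n(s). The function names and the toy sample are illustrative assumptions and do not come from [1]. The sketch counts only the occurrences of s that have a successor in the sample, so that the estimated probabilities of each state sum to one.

```python
from collections import Counter


def transition_counts(sample, order):
    """Return (N_s, N_sa): counts of each state s and of s followed by a."""
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(sample)):
        s = sample[t - order:t]  # the `order` symbols preceding position t
        a = sample[t]            # the symbol observed at position t
        n_s[s] += 1
        n_sa[(s, a)] += 1
    return n_s, n_sa


def mle(sample, order):
    """Maximum likelihood estimate N_n(s, a) / N_n(s) for every observed pair."""
    n_s, n_sa = transition_counts(sample, order)
    return {(s, a): c / n_s[s] for (s, a), c in n_sa.items()}


# Toy sample on the alphabet {a, b, c} with memory o = 1 (illustrative only).
p_hat = mle("abcabcabba", order=1)
```

In this toy sample the state b is followed by c twice, by b once and by a once, so `p_hat[("b", "c")]` is 0.5.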

Definition 2.1

Let (Z_t)_t≥1 be a discrete time Markov chain of order o on a finite alphabet A, with state space 𝒮 = A^o,

(i) s, r ∈ 𝒮 are equivalent if P(a|s) = P(a|r) ∀a ∈ A.
(ii) (Z_t)_t≥1 is a Markov chain with partition ℒ = {L₁, L₂, …, L_|ℒ|} if this partition is the one defined by the equivalence relationship introduced by item (i).

The model given by Definition 2.1 was introduced in [1]. The parameters to be estimated are (a) the partition ℒ, (b) the transition probabilities of each part L to any element of A, P(·|L) which is P(·|L) = P(·|s), ∀s ∈ L. We note that the partition of 𝒮 that responds to item (ii) of Definition 2.1 is minimal in relation to the number of parts |ℒ|. Given a sample of (Z_t)_t≥1, $z_1^n$, according to [1] the partition can be consistently estimated by means of d_ℒ given by Definition 2.2.

Definition 2.2

Let (Z_t)_t≥1 be a discrete time Markov chain of order o on a finite alphabet A, with state space 𝒮 = A^o, $z_1^n$ a sample of the process and let ℒ = {L₁, L₂, …, L_|ℒ|} be a partition of 𝒮 such that for all s, r ∈ L_l, P(·|s) = P(·|r) for each l = 1, 2, …, |ℒ|,

$d_{\mathcal{L}}(i,j) = \frac{\alpha}{(|A|-1)\ln(n)} \sum_{a \in A} \left\{ \sum_{l=i,j} N_n(L_l,a) \ln\left(\frac{N_n(L_l,a)}{N_n(L_l)}\right) - N_n(L_{ij},a) \ln\left(\frac{N_n(L_{ij},a)}{N_n(L_{ij})}\right) \right\},$

with N_n(L_i) = ∑_{s∈L_i} N_n(s), N_n(L_i, a) = ∑_{s∈L_i} N_n(s, a), N_n(L_ij) = N_n(L_i) + N_n(L_j), N_n(L_ij, a) = N_n(L_i, a) + N_n(L_j, a), for a ∈ A, L_i, L_j ∈ ℒ, α a real and positive value.
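To make Definition 2.2 concrete, the following Python sketch evaluates d_ℒ(i, j) from the counts N_n(L_i, a) and N_n(L_j, a) of two parts. The function names, the dictionary layout of the counts, and the numbers in the example are assumptions made for illustration; the convention 0 · ln(0) = 0 is also assumed here for symbols that never follow a part.

```python
import math


def xlogx_ratio(num, den):
    """num * ln(num / den), with the convention 0 * ln(0) = 0."""
    return num * math.log(num / den) if num > 0 else 0.0


def d_parts(counts_i, counts_j, alphabet, n, alpha=2.0):
    """d(i, j) of Definition 2.2 from per-symbol counts of two parts."""
    tot_i = sum(counts_i.get(a, 0) for a in alphabet)  # N_n(L_i)
    tot_j = sum(counts_j.get(a, 0) for a in alphabet)  # N_n(L_j)
    acc = 0.0
    for a in alphabet:
        ni, nj = counts_i.get(a, 0), counts_j.get(a, 0)
        acc += (xlogx_ratio(ni, tot_i) + xlogx_ratio(nj, tot_j)
                - xlogx_ratio(ni + nj, tot_i + tot_j))
    return alpha * acc / ((len(alphabet) - 1) * math.log(n))


# Two parts with the same empirical law, and two with opposite laws.
same = d_parts({"a": 10, "b": 30}, {"a": 20, "b": 60}, "ab", n=120)
diff = d_parts({"a": 40, "b": 0}, {"a": 0, "b": 40}, "ab", n=80)
```

In the first call the two parts have identical empirical transition probabilities, so the value is zero up to rounding; in the second the value is far above 1.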

The estimation of Partition Markov Models can be carried out via algorithms such as the one introduced in [7], which uses d_ℒ. Note that d_ℒ is a metric designed to build a structure in the state space by identifying equivalent states. It is applied (see the algorithm of [7]), for example, to an initial set consisting of the entire state space 𝒮, and whenever d_ℒ(i, j) < 1 the elements L_i and L_j must be in the same part (see properties of d_ℒ in [1]). The metric d_ℒ is derived from the Bayesian Information Criterion (BIC), as proved in [1], and the BIC indicates the junction of two elements in the same part of the partition if and only if d_ℒ < 1. For each part L of $\widehat{\mathcal{L}}$ the transition probability is estimated by $\widehat{P}(a|L) = \frac{N_n(L,a)}{N_n(L)}$. Note that all equivalent states are used to estimate each probability. This produces an economy in the total number of probabilities to be estimated, since identifying the partition reduces that number, and each probability, now related to a part of the partition, P(·|L), can be better estimated because the occurrences of several states are used for its estimation.

In the next subsection, we discuss a specific structure of the partition ℒ and how it might make sense in practice. We also discuss the impact of such specificity on estimation algorithms, such as the one exposed in [7].

A further memory for the process

When fitting a PMM, the state space is restricted by the maximum value for the memory o allowed by the sample size and, if the alphabet is A, the space where the partition is arranged is built in 𝒮 = A^o. This means that to define the next step of the process, it is enough to know the past of size o. It could happen, however, that in addition to a past of size o, the process depends on a more distant jump, say G, with G > o. It is then natural to think of defining the state space as A^G, where we would again have the structure of a Partition Markov Model, but with a redundancy of information, since the values between times t − G + 1 and t − o − 1 are not relevant for the future; see Figure 1 for an illustration of the idea (the zigzag part is irrelevant). To formalize the situation, suppose that there is a value G > o such that the transition probability to Z_t = a ∈ A from $Z_{t-G}^{t-1} = z \dots s$, z ∈ A, s ∈ A^o, where “…” is any concatenation of elements of A of size G − o − 1, is given by,

$\mathrm{Prob}(Z_t = a \mid Z_{t-G}^{t-1} = z \dots s) = \mathrm{Prob}(Z_t = a \mid Z_{t-G} = z,\ Z_{t-o}^{t-1} = s), \quad z \in A,\ s \in A^o.$ (1)

Figure 1

Scheme of the past necessary to determine the state of the process at time t, according to equation (1). In zigzag the irrelevant period with limits on top of the scheme [t – G + 1, t − o − 1].

Then, the space is given by A × A^o, where A records all possibilities for z and A^o records all possibilities for s, (z, s) of equation (1).

The kind of process on which we want to identify the partition of the state space is given next.

Definition 2.3

A G-Markov Model (Z_t)_t≥1 is a discrete time Markov chain on a finite alphabet A, with state space 𝒲 = A × A^o, where o < ∞, transition probabilities following equation (1), for an adequate and finite G such that G > o.

Given a sample $z_1^n$, the number of occurrences of (z, s) ∈ A × A^o in the sample is $N_n(z,s) = |\{t : G < t \le n,\ z_{t-G} = z,\ z_{t-o}^{t-1} = s\}|$ and the number of occurrences of (z, s) ∈ A × A^o followed by a ∈ A is $N_n((z,s),a) = |\{t : G < t \le n,\ z_{t-G} = z,\ z_{t-o}^{t-1} = s,\ z_t = a\}|$.
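The two counts above can be sketched in Python as follows. The function name and the toy sample are assumptions for illustration; the sample is a 0-indexed string, so the condition G < t ≤ n of the 1-indexed definition becomes `range(G, len(sample))`.

```python
from collections import Counter


def g_markov_counts(sample, o, G):
    """Return (N_state, N_trans) for pairs (z, s) with z = z_{t-G}, s = z_{t-o}^{t-1}."""
    if G <= o:
        raise ValueError("Definition 2.3 requires G > o")
    n_state, n_trans = Counter(), Counter()
    for t in range(G, len(sample)):
        # Only the symbol G steps back and the last o symbols are read;
        # the positions strictly between them are ignored.
        state = (sample[t - G], sample[t - o:t])
        n_state[state] += 1
        n_trans[(state, sample[t])] += 1
    return n_state, n_trans


# Toy sample (illustrative only) with o = 1 and G = 4.
n_state, n_trans = g_markov_counts("abcabcabca", o=1, G=4)
```

With a sample of length n, each of the n − G positions t > G contributes exactly one occurrence, so the counts in `n_state` add up to n − G (here 6).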

Remark 2.1

Items (i) and (ii) of Definition 2.1 allow defining the partition of (Z_t)_t≥1, say ℐ (partition of 𝒲 of Definition 2.3). We can also adapt the metric of Definition 2.2 to this situation; to do that it is enough to replace 𝒮 by 𝒲, defining for each part I of the partition ℐ of 𝒲, N_n(I) = ∑_(z,s)∈I N_n(z, s) and N_n(I, a) = ∑_(z,s)∈I N_n((z, s), a), for a ∈ A. Denote by d_ℐ the metric on the state space 𝒲. So, to estimate ℐ we can use the algorithm introduced in [7].

Given a sample $z_1^n$ of the process (Z_t)_t≥1, denote by P(a|(z, s)) the transition probability given by equation (1); then the likelihood of the sample, $P(z_1^n) = \mathrm{Prob}(Z_1^n = z_1^n)$, is

$P(z_1^n) = P(z_1^G) \prod_{a \in A,\, I \in \mathcal{I}} P(a|I)^{N_n(I,a)},$ (2)

which is the same expression of equation (2) of [1]. Then, the procedure is to compute the BIC for each model ℐ, which is given by equation (3),

$\mathrm{BIC}(z_1^n, \mathcal{I}) = \ln\left(\prod_{a \in A,\, I \in \mathcal{I}} \left(\frac{N_n(I,a)}{N_n(I)}\right)^{N_n(I,a)}\right) - \frac{(|A|-1)\,|\mathcal{I}|\,\ln(n)}{\alpha}.$ (3)

As the BIC definition itself shows, the models are characterized (Eq. (3)) by the logarithm of the maximum likelihood, $\ln\left(\prod_{a \in A,\, I \in \mathcal{I}} \left(\frac{N_n(I,a)}{N_n(I)}\right)^{N_n(I,a)}\right)$, penalized by the number of parameters to be estimated, (|A| − 1)|ℐ|, properly scaled by the size of the data set (by the term ln(n)/α). Thus, the model indicated by the BIC is the one with the highest BIC value, which corresponds to the most plausible model once its complexity (number of parameters) is taken into account. The criterion is derived in [8], using α = 2, and it corresponds to the maximization of a posterior distribution assuming a non-informative prior distribution on the dimension of the parametric space. We note that the BIC continues to be valid when the constant 2 is replaced with any positive constant α, as given in equation (3).
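The following Python sketch evaluates equation (3) for a candidate partition, given the counts N_n(I, a) of each part. The function name, the list-of-dictionaries layout, and the numbers are illustrative assumptions. The example compares a partition that keeps two states with the same empirical law apart against one that merges them.

```python
import math


def bic(part_counts, alphabet_size, n, alpha=2.0):
    """Equation (3): log maximum likelihood minus (|A|-1)|I| ln(n) / alpha.

    part_counts: one dict per part I, mapping symbol a -> N_n(I, a).
    """
    log_lik = 0.0
    for counts in part_counts:
        total = sum(counts.values())          # N_n(I)
        for c in counts.values():
            if c > 0:                         # convention 0 * ln(0) = 0
                log_lik += c * math.log(c / total)
    penalty = (alphabet_size - 1) * len(part_counts) * math.log(n) / alpha
    return log_lik - penalty


# Same data, two candidate partitions (illustrative counts, n = 120).
bic_split = bic([{"a": 10, "b": 30}, {"a": 20, "b": 60}], alphabet_size=2, n=120)
bic_merged = bic([{"a": 30, "b": 90}], alphabet_size=2, n=120)
```

Because the two parts have identical empirical probabilities, merging them leaves the likelihood term unchanged and removes one part from the penalty, so `bic_merged` exceeds `bic_split` by ln(120)/2.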

By means of the maximization of equation (3) the partition can be estimated, obtaining $\widehat{\mathcal{I}}$ as in equation (4),

$\widehat{\mathcal{I}} = \mathrm{argmax}_{\mathcal{I} \in \mathcal{P}} \{\mathrm{BIC}(z_1^n, \mathcal{I})\},$ (4)

where 𝒫 is the set of all the partitions of 𝒲.

As the set 𝒫 can be huge, to obtain $\widehat{\mathcal{I}}$ it is necessary to use the metric d_ℐ with the algorithm introduced by [7].
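The algorithm of [7] is not reproduced in this paper. As a stand-in, the Python sketch below uses a simple agglomerative rule, which is an assumption of this example and not the procedure of [7]: start from singleton parts, repeatedly merge the closest pair, and stop when the smallest distance is no longer below 1. All names and counts are hypothetical; the point is only to illustrate how the threshold on the metric avoids enumerating every partition in 𝒫.

```python
import math
from itertools import combinations


def _xlog(num, den):
    """num * ln(num / den), with the convention 0 * ln(0) = 0."""
    return num * math.log(num / den) if num > 0 else 0.0


def distance(ci, cj, alphabet, n, alpha=2.0):
    """BIC-based distance between two parts, from their per-symbol counts."""
    ti, tj = sum(ci.values()), sum(cj.values())
    acc = 0.0
    for a in alphabet:
        ni, nj = ci.get(a, 0), cj.get(a, 0)
        acc += _xlog(ni, ti) + _xlog(nj, tj) - _xlog(ni + nj, ti + tj)
    return alpha * acc / ((len(alphabet) - 1) * math.log(n))


def greedy_partition(state_counts, alphabet, n, alpha=2.0):
    """Merge the closest pair of parts while their distance is below 1."""
    parts = [([s], dict(c)) for s, c in state_counts.items()]
    while len(parts) > 1:
        i, j = min(
            combinations(range(len(parts)), 2),
            key=lambda ij: distance(parts[ij[0]][1], parts[ij[1]][1], alphabet, n, alpha),
        )
        if distance(parts[i][1], parts[j][1], alphabet, n, alpha) >= 1:
            break  # no remaining pair should share a part
        counts = {a: parts[i][1].get(a, 0) + parts[j][1].get(a, 0) for a in alphabet}
        merged = (parts[i][0] + parts[j][0], counts)
        parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [merged]
    return parts


# Hypothetical counts N_n((z, s), a) for three states (z, s).
state_counts = {
    ("a", "a"): {"a": 100, "b": 300},
    ("b", "a"): {"a": 110, "b": 290},
    ("a", "b"): {"a": 300, "b": 100},
}
parts = greedy_partition(state_counts, "ab", n=1200)
```

Here the first two states have similar empirical laws and end up in one part, while the third stays alone.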

Remark 2.2

Note that under the Definition 2.3, for the construction of equation (3) and subsequent derivation of the maximum given in equation (4), the parameters G and o are previously set.

The BIC criterion shows great advantages, and therefore its use is recommended. We quote some of its qualities. (i) It is a consistent method for the estimation of models given by Definition 2.3 (Theorem 3 – [1]). Particular cases of Definition 2.3 had already anticipated the consistency of the BIC for estimation; see, for example, [9], in the framework of variable-length Markov chains. (ii) The BIC allows creating a metric like the one detailed in Remark 2.1 (Theorem 2, Corollary 2 – [1]), which facilitates the implementation of algorithms [7]. The second property favors the BIC over other criteria such as the Krichevsky–Trofimov (KT) criterion [9].

The structure that we want to investigate in this paper is about the form of the partition, which responds to the rule illustrated in Figure 1. In the next example, we present a case.

Example 2.1

Consider a G-Markov model (Z_t)_t≥1 (Definition 2.3) with A = {a, b, c}, o = 1, G = 4, partition ℐ = {I₁, I₂, I₃} and transition probabilities given by Table 1. For instance, since P(a|I₁) = 0.2 we have that $P(Z_t = a \mid Z_{t-4}^{t-1} = a \star * a) = 0.2$, $\forall \star \in A$ and $\forall * \in A$. Then, the values $\star$ and $*$ at positions t − 3 and t − 2 respectively (between the times t − 4 and t − 1) are irrelevant for the state at time t.

Table 1

Partition ℐ and transition probability to each element in the alphabet A.
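To illustrate how a process of the kind described in Example 2.1 can be simulated, the Python sketch below draws each new symbol using only the symbol G steps back and the last o symbols. Apart from P(a|I₁) = 0.2 for the pair (a, a), which is quoted in the example, the probabilities and the function names are hypothetical placeholders and do not reproduce Table 1.

```python
import random

ALPHABET = ["a", "b", "c"]


def prob_of(z, s):
    """Hypothetical transition law: weights aligned with ALPHABET."""
    if z == "a" and s == "a":
        return [0.2, 0.3, 0.5]   # P(a | (a, a)) = 0.2 as in Example 2.1; rest assumed
    return [0.4, 0.4, 0.2]       # assumed law for every other pair (z, s)


def simulate_g_markov(n, alphabet, o, G, prob_of, seed=0):
    """Simulate n symbols; the first G symbols are an arbitrary uniform start-up."""
    rng = random.Random(seed)
    seq = [rng.choice(alphabet) for _ in range(G)]
    while len(seq) < n:
        z = seq[-G]                                # symbol at time t - G
        s = "".join(seq[-o:]) if o > 0 else ""     # symbols at times t - o, ..., t - 1
        seq.append(rng.choices(alphabet, weights=prob_of(z, s))[0])
    return "".join(seq[:n])


sample = simulate_g_markov(1000, ALPHABET, o=1, G=4, prob_of=prob_of, seed=1)
```

The symbols at positions t − 3 and t − 2 are never read when drawing Z_t, which mirrors the irrelevant zigzag segment of Figure 1.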

In the following two simulations we see the impact of the structure of a G-Markov model (Definition 2.3) when we ignore it and apply the algorithm of [7] with the help of d_ℒ, assuming only the usual PMM structure (Definition 2.1).

Example 2.2

We apply the algorithm of [7] to data simulated from the law given by Table 1. The algorithm is applied in two settings: (i) using o = 4, initial set {a, b, c}⁴ and d_ℒ as given by Definition 2.2, and (ii) using o = 1, G = 4, initial set {a, b, c} × {a, b, c} and d_ℐ with the modifications imposed by Remark 2.1. This means that in (i) we fit a Partition Markov Model without a parameter G, as usual, and in (ii) a Partition Markov Model with a parameter G (Definition 2.3). With a sample size n = 5 × 10⁴, we obtain by (ii) the original partition (Tab. 1). By (i) the partition obtained is given in Table 2. Note that under setting (i) each element of the state space is a concatenation of o = 4 consecutive elements of A, for example accb, and this element under setting (ii) is denoted by (a, b), where the memory o = 1 is related to the element b and G = 4 is related to the element a, the central elements cc being irrelevant. Also observe that there is a relationship between some parts of Tables 1 and 2, L₁ = I₂ and L₅ = I₃, but the part I₁ is distributed among the parts L₂, L₃ and L₄. That is to say, fitting via (i) blurs the original structure (Tab. 1).

Table 2

Estimated partition for the state space {a, b, c}⁴ – procedure (i). The elements of L₁ have the format b … b or c … b and the elements of L₅ have the format c … a.

To visualize the behavior of the settings (i) and (ii) when increasing the sample size, for each sample size n = 5 × 10⁴, 10⁶, 5 × 10⁶, 10⁷ we perform 100 simulations of the model – Table 1. We record the performance of the settings (i), (ii) in recovering the appropriate number of parts of the partition (which is 3). The results are shown in Table 3. We see how (ii) has a more efficient behavior, showing that a usual procedure, such as (i) can show difficulties in recovering the structure given by Definition 2.3. Note also that all the parts recovered by (ii) in the last 3 sample sizes are the original ones – see Table 1.

Table 3

Number of parts of the partitions estimated for settings (i) and (ii) in 100 simulations of size n each.

In practical terms, when modeling with real data, a specific sample size is available, say n. In general, the memory of the process that can be used in the model depends on n as well as the cardinal |A| of the process alphabet. Usually, the memory must be less than log_|A|(n), in the next example, we see the effect of this condition on G.

Example 2.3

Consider a G-Markov model (Z_t)_t≥1 (Definition 2.3) with A = {a, b, c}, o = 0 and G = 10, with partition ℐ = {I₁, I₂} and transition probabilities given by Table 4. We perform simulations of the process following Table 4, with n = 2000. The algorithm of [7] is applied in two settings: (i) using $o = 4 < \lfloor \log_{|A|}(n) \rfloor - 1 = 5$, initial set {a, b, c}⁴ and d_ℒ as given by Definition 2.2, and (ii) using o = 0, G = 10, initial set {a, b, c} and d_ℐ with the modifications imposed by Remark 2.1. In other words, case (i) reflects the conditions that are generally applied to proceed with the adjustment and determination of the process memory.

Table 4

Partition ℐ and transition probability of each part I_i, i = 1, 2 to each element in the alphabet A.

In Table 5 we show the resulting partition of (i). We see that it is not possible to recover the original structure given in Table 4. As expected, (ii) recovers the structure given by Table 4. These results reflect the insufficiency for n = 2000 to reach a memory that encompasses G = 10, which is the period that the process needs to determine the choice of the next step.

Table 5

Estimated partition for the state space {a, b, c}⁴ – procedure (i).

Section Covid-19 DNA Model shows how the model stands out when it comes to representing the genome of Covid-19.

Covid-19 DNA model

In this section, we investigate the stochastic behavior of a complete DNA sequence of the outbreak of a novel Coronavirus (Covid-19) associated with a respiratory disease in Wuhan of Hubei province, China. The sequence was extracted from a patient coming from the Wuhan seafood market, a place that was associated with the origin of the outbreak; its accession number is MN908947 (version MN908947.3). The sequence comes from a 41-year-old man with no history of hepatitis, tuberculosis or diabetes [10]. The patient was admitted and hospitalized in Wuhan Central Hospital on December 26, 2019, 6 days after the onset of illness. He reported fever, chest tightness, cough, pain, and weakness for one week. The cardiovascular, abdominal, and neurologic examinations were normal; see more details in [10]. The sequence can be obtained from https://www.ncbi.nlm.nih.gov/nuccore/MN908947. In this paper we use the FASTA format of MN908947, which is composed of 29 903 bases: a, c, g, t.

For the construction of the model, we must choose a memory o. In a discrete Markov process with a discrete alphabet, the criterion used is $o < \lfloor \log_{|A|}(n) \rfloor - 1$. So, for the alphabet A = {a, c, g, t} with n = 29 903, we have the restriction o < 6. Since the bases in a DNA structure are organized in triples, o = 3 is recommended. This organization also applies to the parameter G: values of G that are multiples of 3 are expected to show better performance.
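The bound on the memory can be checked with a few lines of Python. The helper names are assumptions of this sketch; integer arithmetic is used for the floor of the logarithm to avoid floating-point rounding near exact powers.

```python
def floor_log(n, base):
    """Largest integer k with base**k <= n, computed exactly."""
    k = 0
    while base ** (k + 1) <= n:
        k += 1
    return k


def memory_bound(n, alphabet_size):
    """Right-hand side of the criterion o < floor(log_|A|(n)) - 1."""
    return floor_log(n, alphabet_size) - 1


bound_covid = memory_bound(29903, 4)   # 4**7 = 16384 <= 29903 < 4**8, so 7 - 1 = 6
bound_ex23 = memory_bound(2000, 3)     # 3**6 = 729 <= 2000 < 3**7, so 6 - 1 = 5
```

The first value gives the restriction o < 6 used for the Covid-19 sequence, and the second matches the value 5 quoted in Example 2.3.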

We apply the algorithm of [7] in two settings (i) using o ∈ {1, 2, 3, 4}, initial set {a, c, g, t}^o and d_ℒ as given by Definition 2.2, (ii) using o = 3, G ∈ {5, 6, 7, 8, 9, 10, 11, 12}, initial set {a, c, g, t} × {a, c, g, t}³ and d_ℐ with the modifications imposed by Remark 2.1. To identify the best model for the sequence, we use the BIC criterion (with α = 2), see equation (3). And, therefore, the higher the BIC value, the better the model represents the sequence. To compare the results, we also report the Krichevsky–Trofimov (KT) criterion [9]. According to the KT definition, the smaller the value, the better the model will be for representing the data. According to Table 6, the best models are given by two models in (ii). This shows us the convenience of assuming the existence of an extra parameter G. Note that in the three best cases of (ii) G is a multiple of 3, G = 9, 12, 6, which confirms the nature of the genomic organization in triples formed by elements of the alphabet A.

Table 6

On top, settings under the perspective (i): memory o, cardinal of partition |ℒ|, BIC and KT values. On bottom, settings under the perspective (ii): parameter G, cardinal of partition |ℐ|, BIC and KT values. In bold letter the 2 best cases.

Table 7

Part composition for the model – (ii) with o = 3 and G = 9, see Table 6.

As is usual in DNA sequences [2], the transition probabilities are moderate (in this case ≤0.44, see Tab. 8). Note that there is a predilection of the process to choose as the next state a or t. Sequences of other viruses lead to other predilections, see for example [2], in which the Zika process is modeled, revealing a predilection for the states a or g. We observe that under Definition 2.3 and without imposing the partition structure, the total number of parameters to be estimated is unfeasible. For example, with o = 3 and G = 9 we have |A × A^o| × (|A| − 1) = 768 parameters to estimate, and when using the strategy given by equation (2) and Remark 2.1, it is necessary to estimate |ℐ| × (|A| − 1) = 39 parameters. Table 7 shows the composition of each part, for example, part I₁ is composed of 31 elements of type (z, s) where z ∈ A and s ∈ A^o. The elements are listed from left to right according to how they have been inserted in the part by the algorithm of [7] and following Remark 2.1.
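The two parameter counts quoted above follow from a short computation, written out here for |A| = 4, o = 3 and the |ℐ| = 13 parts of the selected model:

```latex
|A \times A^{o}| \, (|A| - 1) = 4 \cdot 4^{3} \cdot 3 = 768,
\qquad
|\mathcal{I}| \, (|A| - 1) = 13 \cdot 3 = 39 .
```

The value of G does not enter either count, since the state space 𝒲 = A × A^o records only one extra symbol regardless of how far back it lies.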

Table 8

$\widehat{P}(\cdot|I_i)$, i = 1, …, 13, for the model – (ii) with o = 3 and G = 9, see Table 6. In bold the highest probabilities by part.

Covid-19 and coronaviruses

In this subsection, we incorporate 11 sequences into the study to compare how distant or close to them the sequence investigated is. It is speculated that the new sequence is the product of mutations of other types of Coronaviruses, and a way to deal with it could be to determine those sequences that are the closest. Although Coronaviruses similar to severe acute respiratory syndrome (SARS [11]) have been widely identified in mammals, including bats, since 2005 in China, the exact origin of Coronaviruses infecting humans remains unclear. Therefore, it is necessary to determine the natural reservoir and any intermediate hosts of Coronavirus in its current version (Covid-19). We describe the sequences in Table 9, those are complete sequences of the Coronavirus genome of different types that occurred in the last 25 years.

Table 9

Complete genome sequences coming from https://www.ncbi.nlm.nih.gov/nuccore/Y. For each sequence Y we report its version, sample size n, organism and reference.

The metric introduced next makes this comparison possible.

Definition 3.1

Consider two G-Markov chains (Z_1,t) and (Z_2,t) following Definition 2.3 with alphabet A, parameters o and G, state space 𝒲 = A × A^o and independent samples $z_{1,1}^{n_1}$, $z_{2,1}^{n_2}$ respectively,

(i) For $(z,s) \in \mathcal{W}$,

$d_{(z,s)}(z_{1,1}^{n_1}, z_{2,1}^{n_2}) = \frac{\alpha}{(|A|-1)\ln(n_1+n_2)} \sum_{a \in A} \left\{ \sum_{l=1,2} N_{n_l}((z,s),a) \ln\left(\frac{N_{n_l}((z,s),a)}{N_{n_l}(z,s)}\right) - N_{n_1+n_2}((z,s),a) \ln\left(\frac{N_{n_1+n_2}((z,s),a)}{N_{n_1+n_2}(z,s)}\right) \right\},$

(ii) $d_{\max}(z_{1,1}^{n_1}, z_{2,1}^{n_2}) = \max_{(z,s) \in \mathcal{W}} \{ d_{(z,s)}(z_{1,1}^{n_1}, z_{2,1}^{n_2}) \},$

with $N_{n_1+n_2}((z,s),a) = N_{n_1}((z,s),a) + N_{n_2}((z,s),a)$, $N_{n_1+n_2}(z,s) = N_{n_1}(z,s) + N_{n_2}(z,s)$, where $N_{n_1}$ and $N_{n_2}$ are given as usual, computed from the samples $z_{1,1}^{n_1}$ and $z_{2,1}^{n_2}$ respectively, and with α a real and positive value.
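A minimal Python sketch of Definition 3.1 follows, computing d_(z,s) for every observed pair (z, s) and returning the maximum. The function names and the two toy samples are assumptions for illustration; the convention 0 · ln(0) = 0 is assumed, so a pair (z, s) observed in only one of the samples contributes zero.

```python
import math
from collections import Counter


def g_counts(sample, o, G):
    """Map each state (z, s) to a Counter of the symbols that follow it."""
    counts = {}
    for t in range(G, len(sample)):
        state = (sample[t - G], sample[t - o:t])
        counts.setdefault(state, Counter())[sample[t]] += 1
    return counts


def _xlog(num, den):
    """num * ln(num / den), with the convention 0 * ln(0) = 0."""
    return num * math.log(num / den) if num > 0 else 0.0


def d_max(sample1, sample2, alphabet, o, G, alpha=2.0):
    """Largest local distance d_(z,s) between two samples."""
    c1, c2 = g_counts(sample1, o, G), g_counts(sample2, o, G)
    scale = alpha / ((len(alphabet) - 1) * math.log(len(sample1) + len(sample2)))
    best = 0.0
    for state in set(c1) | set(c2):
        k1, k2 = c1.get(state, Counter()), c2.get(state, Counter())
        t1, t2 = sum(k1.values()), sum(k2.values())
        acc = 0.0
        for a in alphabet:
            acc += _xlog(k1[a], t1) + _xlog(k2[a], t2) - _xlog(k1[a] + k2[a], t1 + t2)
        best = max(best, scale * acc)
    return best


# Toy samples (illustrative only) with o = 1 and G = 2.
x = "ab" * 50
y = "aabb" * 25
d_same = d_max(x, x, "ab", o=1, G=2)
d_diff = d_max(x, y, "ab", o=1, G=2)
```

Comparing a sample with itself gives zero, while the two toy samples follow opposite transitions after the pair (a, b) and give a value well above 1.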

Definition 3.1 is an adaptation of the notion introduced in [4]. It has the same properties; that is to say, (i) is a metric and both (i) and (ii) are statistically consistent for detecting whether or not the samples come from the same stochastic law. Moreover, Definition 3.1-(i) is a local notion while d_max, Definition 3.1-(ii), is global, so in this comparison we use d_max to have a general representation of the similarity/dissimilarity between the samples; see the results in Table 10.

Table 10

d_max values (see Definition 3.1) between each pair of sequences, o = 3, G = 9, α = 2. In bold type, the values between MN908947.3 (Covid-19) and the other sequences, with * the smaller ones.

In Figure 2, we show the dendrograms built from the d_max values between each pair of sequences, see Table 10. The dendrograms confirm that, based on the available sequences, the sequences MN908947.3 (Covid-19), MG772934.1 and MG772933.1 could be considered as a cluster. The sequences MG772934.1 and MG772933.1 are records from July 2015 and February 2017, respectively, and both come from Zhoushan, China. We also see that d_max(MG772933.1, MG772934.1) = 0.0902 is close to zero, and that proximity is confirmed in [12]. These findings allow us to speculate on certain aspects; one of them is that bats are consolidated as efficient transmitters and are a risk to the human immune system, at least regarding Coronavirus and its latest versions [13]. We note that in [10] the similarity between MN908947.3 and MG772933.1 is mentioned, and this is confirmed here by means of the G-Markov model (Definition 2.3).

Figure 2

Dendrograms built from the d_max values reported in Table 10. MN908947.3: Covid-19 sequence.

Conclusion

Partition Markov Models allow great economy in the construction and representation of phenomena since, through Definition 2.1, they establish units (parts) in the state space that share the same transition probabilities. Thus, several states contribute to the determination of a single transition probability. The parts (elements of the process’ partition) consider a finite memory o; that is, the next step of the process is determined by a past consisting of the concatenation of o elements of the alphabet. Thus, the step at time t is determined by the occurrences at times t − o, …, t − 1. In this paper, we investigate a specific structure within the theoretical framework of Partition Markov Models. The structure of interest lies in the formulation of the partition that defines the process: in addition to the finite memory o associated with the process, a parameter G is introduced, which allows an extra dependence on the past that complements the one given by the memory o (see Definition 2.3). We show that algorithms designed for the classic version of Partition Markov Models can have difficulties in recovering the structure investigated here (see Examples 2.2 and 2.3). Once the parameters o and G have been determined, all the estimation tools of the usual Partition Markov Models (see [1] and [4]) can be adapted (see Remarks 2.1 and 2.2). This specific structure in the process’ partition (see Definition 2.3, Eq. (2)) proves efficient for modeling the complete, newly decoded genome sequence [10], GenBank MN908947, of the newly discovered Coronavirus Covid-19, from a patient in Wuhan, China. A partition in a G-model allows a huge reduction in the number of parameters to be estimated, from 768 to 39 (Tabs. 7 and 8), leading to an increase in the estimation quality of the parameters.
In more general terms, the inclusion of the parameter G generates flexibility that is rated very well by the model selection criteria (Tab. 6), giving credibility to partition models with more specific structures. Table 7 shows that the stochastic behavior of sequence MN908947 can be reduced to 13 stochastic units that are distinguished by how the next state is selected (transition probabilities). Such a configuration could be used to design a Covid-19 profile. The model given by Definition 2.3 also allows us to develop a comparison study with 11 other genomic sequences of Coronavirus, collected in the last 25 years. We conclude that Covid-19 lies close to the Bat SARS-like Coronavirus sequences GenBank MG772934 and MG772933, from Zhoushan, China (period: 2015–2017); see Table 10 and Figure 2, and also [13]. Our results are in accordance with the indications given in [10]. This evidence could point to one of the best vectors of the virus and help in the search for vaccines against it.
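
The way a G-model reads its past, and the parameter counts 768 and 39, can be illustrated with a short sketch. It assumes, following the scheme of Figure 1, that the state used at time t is the symbol at time t − G followed by the o symbols at times t − o, …, t − 1. It also assumes that each part (or each state, in the unpartitioned model) carries |A| − 1 free transition probabilities. The function names and the toy string are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def g_state(seq, t, o, G):
    """State assumed to determine seq[t]: the symbol at t - G plus the last o symbols."""
    if not (o < G <= t):
        raise ValueError("need o < G <= t")
    return seq[t - G] + seq[t - o:t]


def transition_counts(seq, o, G):
    """Count, for every observed state, how often each symbol follows it."""
    counts = defaultdict(Counter)
    for t in range(G, len(seq)):
        counts[g_state(seq, t, o, G)][seq[t]] += 1
    return counts


def free_parameters(n_units, alphabet_size):
    """Each unit has alphabet_size - 1 free probabilities (they sum to one)."""
    return n_units * (alphabet_size - 1)


# Toy string over the genetic alphabet (hypothetical, not the MN908947 genome).
toy = "acgtacgtacgt"
print(g_state(toy, 9, o=3, G=9))            # symbol at time 0 plus symbols at times 6, 7, 8
print(dict(transition_counts(toy, o=3, G=9)))

# Parameter counts for |A| = 4, o = 3: one extra symbol gives 4**(3 + 1) states.
print(free_parameters(4 ** 4, 4))           # all states kept separate
print(free_parameters(13, 4))               # 13 parts, as in Table 7
```

Under these assumptions the unpartitioned model has 256 states with 3 free probabilities each, that is 768 parameters, while the 13 parts need only 39, which matches the reduction quoted in the text.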

Acknowledgments

G. Tasca gratefully acknowledges the partial financial support provided by CAPES through a fellowship from the Ph.D. Program in Statistics, University of Campinas. The authors also wish to thank the two referees for their many helpful comments and suggestions on an earlier draft of this paper.

References

[1] García Jesús E, González-López VA (2017), Consistent estimation of Partition Markov models. Entropy 19, 4, 160. https://doi.org/10.3390/e19040160.
[2] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2020), Partition Markov model for multiple processes. Math Meth Appl Sci 43, 13, 7677–7691. https://doi.org/10.1002/mma.6079.
[3] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2019), Stochastic profile of Epstein-Barr virus in nasopharyngeal carcinoma settings. 4open 2, 25.
[4] García Jesús E, Gholizadeh R, González-López VA (2018), A BIC-based consistent metric between Markovian processes. Appl Stoch Models Bus Ind 34, 6, 868–878. https://doi.org/10.1002/asmb.2346.
[5] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2019), Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014. 4open 2, 20.
[6] García Jesús E, Gholizadeh R, González-López VA (2018), Stochastic distance between Burkitt lymphoma/leukemia strains, in: Demography and Health Issues, Springer, Cham, pp. 143–153.
[7] García Jesús E, González-López VA (2011, November), Minimal Markov models, in: Fourth Workshop on Information Theoretic Methods in Science and Engineering, p. 25.
[8] Schwarz G (1978), Estimating the dimension of a model. Ann Stat 6, 2, 461–464.
[9] Csiszár I, Talata Z (2006), Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans Inf Theory 52, 3, 1007–1016.
[10] Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ (2020), A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269. https://doi.org/10.1038/s41586-020-2008-3.
[11] Hu B, Zeng LP, Yang XL, Ge XY, Zhang W, Li B, Xie JZ, Shen XR, Zhang YZ, Wang N, Luo DS, Zheng XS, Wang MN, Daszak P, Wang LF, Cui J, Shi ZL (2017), Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus. PLoS Pathogens 13, 11, e1006698.
[12] Hu D, Zhu C, Ai L, He T, Wang Y, Ye F, Yang L, Ding C, Zhu X, Lv R, Zhu J, Hassan B, Feng Y, Tan W, Wang C (2018), Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats. Emerg Microb Infect 7, 1, 1–10.
[13] Li W, Shi Z, Yu M, Ren W, Smith C, Epstein JH, Wang H, Crameri G, Hu Z, Zhang H, Zhang J, McEachern J, Field H, Daszak P, Eaton BT, Zhang S, Wang LF (2005), Bats are natural reservoirs of SARS-like coronaviruses. Science 310, 5748, 676–679.
[14] Guan Y, Zheng BJ, He YQ, Liu XL, Zhuang ZX, Cheung CL, Luo SW, Li PH, Zhang LJ, Guan YJ, Butt KM, Wong KL, Chan KW, Lim W, Shortridge KF, Yuen KY, Peiris JS, Poon LL (2003), Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China. Science 302, 5643, 276–278.
[15] Roberts A, Deming D, Paddock CD, Cheng A, Yount B, Vogel L, Herman BD, Sheahan T, Heise M, Genrich GL, Zaki SR, Baric R, Subbarao K (2007), A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice. PLoS Pathogens 3, 1, e5.
[16] Leparc-Goffart I, Hingley ST, Chua MM, Jiang X, Lavi E, Weiss SR (1997), Altered pathogenesis of a mutant of the murine coronavirus MHV-A59 is associated with a Q159L amino acid substitution in the spike protein. Virology 239, 1, 1–10.
[17] He R, Dobie F, Ballantine M, Leeson A, Li Y, Bastien N, Cutts T, Andonov A, Cao J, Booth TF, Plummer FA, Tyler S, Baker L, Li X (2004), Analysis of multimerization of the SARS coronavirus nucleocapsid protein. Biochem Biophys Res Commun 316, 2, 476–483.
[18] van Boheemen S, de Graaf M, Lauber C, Bestebroer TM, Raj VS, Zaki AM, Osterhaus AD, Haagmans BL, Gorbalenya AE, Snijder EJ, Fouchier RA (2012), Genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans. MBio 3, 6, e00473-12.

Cite this article as: García JE, González-López VA & Tasca GH 2020. Partition Markov Model for Covid-19 Virus. 4open, 3, 13.

[1] García Jesús E, González-López VA (2017), Consistent estimation of Partition Markov models. Entropy 19, 4, 160. https://doi.org/10.3390/e19040160. [CrossRef] [Google Scholar]

[2] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2020), Partition Markov model for multiple processes. Math Meth Appl Sci 43, 13, 7677–7691. https://doi.org/10.1002/mma.6079. [CrossRef] [Google Scholar]

[3] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2019), Stochastic profile of Epstein-Barr virus in nasopharyngeal carcinoma settings. 4open 2, 25. [CrossRef] [EDP Sciences] [Google Scholar]

[4] García Jesús E, Gholizadeh R, González-López VA (2018), A BIC-based consistent metric between Markovian processes. Appl Stoch Models Bus Ind 34, 6, 868–878. https://doi.org/10.1002/asmb.2346. [Google Scholar]

[5] Cordeiro MTA, García Jesús E, González-López VA, Mercado Londoño SL (2019), Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014. 4open 2, 20. [CrossRef] [EDP Sciences] [Google Scholar]

[6] García Jesús E, Gholizadeh R, González-López VA (2018), Stochastic distance between Burkitt lymphoma/leukemia strains, in: Demography and Health Issues, Springer, Cham, pp. 143–153. [Google Scholar]

[7] García Jesús E, González-López VA (2011, November), Minimal Markov models, in: Fourth Workshop on Information Theoretic Methods in Science and Engineering, p. 25. [Google Scholar]

[8] Schwarz G (1978), Estimating the dimension of a model. Ann Stat 6, 2, 461–464. [Google Scholar]

[9] Csiszár I, Talata Z (2006), Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans Inf Theory 52, 3, 1007–1016. [Google Scholar]

[10] Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian JH, Pei YY, Yuan ML, Zhang YL, Dai FH, Liu Y, Wang QM, Zheng JJ, Xu L, Holmes EC, Zhang YZ (2020), A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269. https://doi.org/10.1038/s41586-020-2008-3. [PubMed] [Google Scholar]

[11] Hu B, Zeng LP, Yang XL, Ge XY, Zhang W, Li B, Xie JZ, Shen XR, Zhang YZ, Wang N, Luo DS, Zheng XS, Wang MN, Daszak P, Wang LF, Cui J, Shi ZL (2017), Discovery of a rich gene pool of bat SARS-related coronaviruses provides new insights into the origin of SARS coronavirus. PLoS Pathogens 13, 11, e1006698. [CrossRef] [PubMed] [Google Scholar]

[12] Hu D, Zhu C, Ai L, He T, Wang Y, Ye F, Yang L, Ding C, Zhu X, Lv R, Zhu J, Hassan B, Feng Y, Tan W, Wang C (2018), Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats. Emerg Microb Infect 7, 1, 1–10. [CrossRef] [Google Scholar]

[13] Li W, Shi Z, Yu M, Ren W, Smith C, Epstein JH, Wang H, Crameri G, Hu Z, Zhang H, Zhang J, McEachern J, Field H, Daszak P, Eaton BT, Zhang S, Wang LF (2005), Bats are natural reservoirs of SARS-like coronaviruses. Science 310, 5748, 676–679. [Google Scholar]

[14] Guan Y, Zheng BJ, He YQ, Liu XL, Zhuang ZX, Cheung CL, Luo SW, Li PH, Zhang LJ, Guan YJ, Butt KM, Wong KL, Chan KW, Lim W, Shortridge KF, Yuen KY, Peiris JS, Poon LL (2003), Isolation and characterization of viruses related to the SARS coronavirus from animals in southern China. Science 302, 5643, 276–278. [Google Scholar]

[15] Roberts A, Deming D, Paddock CD, Cheng A, Yount B, Vogel L, Herman BD, Sheahan T, Heise M, Genrich GL, Zaki SR, Baric R, Subbarao K (2007), A mouse-adapted SARS-coronavirus causes disease and mortality in BALB/c mice. PLoS Pathogens 3, 1, e5. [CrossRef] [PubMed] [Google Scholar]

[16] Leparc-Goffart I, Hingley ST, Chua MM, Jiang X, Lavi E, Weiss SR (1997), Altered pathogenesis of a mutant of the murine coronavirus MHV-A59 is associated with a Q159L amino acid substitution in the spike protein. Virology 239, 1, 1–10. [CrossRef] [PubMed] [Google Scholar]

[17] He R, Dobie F, Ballantine M, Leeson A, Li Y, Bastien N, Cutts T, Andonov A, Cao J, Booth TF, Plummer FA, Tyler S, Baker L, Li X (2004), Analysis of multimerization of the SARS coronavirus nucleocapsid protein. Biochem Biophys Res Commun 316, 2, 476–483. [Google Scholar]

[18] van Boheemen S, de Graaf M, Lauber C, Bestebroer TM, Raj VS, Zaki AM, Osterhaus AD, Haagmans BL, Gorbalenya AE, Snijder EJ, Fouchier RA (2012), Genomic characterization of a newly discovered coronavirus associated with acute respiratory distress syndrome in humans. MBio 3, 6, e00473-12. [CrossRef] [PubMed] [Google Scholar]