Partition Markov Model for Covid-19 Virus

In this paper, we investigate a specific structure within the theoretical framework of Partition Markov Models (PMM) [see García and González-López, Entropy 19, 160 (2017)]. The structure of interest lies in the formulation of the partition that defines the process: in addition to a finite memory o associated with the process, a parameter G is introduced, allowing an extra dependence on the past that complements the dependence given by the usual memory o. We show, by simulations, how algorithms designed for the classic version of the PMM can have difficulties in recovering the structure investigated here. This specific structure is efficient for modeling a complete genome sequence coming from the newly decoded Coronavirus Covid-19 in humans [see Wu et al., Nature 579, 265–269 (2020)]. The sequence profile is represented by 13 units (parts of the state space's partition), and for each of the 13 units the respective transition probabilities to every element of the genetic alphabet are computed. The structure proposed here also allows us to develop a comparison study with other genomic sequences of Coronavirus, collected over the last 25 years, through which we conclude that Covid-19 appears next to SARS-like Coronaviruses (SL-CoVs) from bat specimens in Zhoushan [see Hu et al., Emerg Microb Infect 7, 1–10 (2018)].


Introduction
A Partition Markov Model (PMM) is a model for a discrete-time stochastic process on a finite alphabet, with finite order, see [1]. The PMM generalizes other models, including Variable Length Markov chains and Markov chains of finite order. A PMM identifies a partition of the state space, bringing together in each part of the partition the states that share the same transition probabilities for all the elements of the alphabet. All the states of a part are used to compute the transition probabilities, allowing us to use several states (all those included in the part) to estimate a single set of transition probabilities. By construction, this is a parsimonious model, since it reduces the number of probabilities to estimate by identifying equivalent states (those found in the same part). PMM models have already shown sufficient flexibility in the field of genomic structure modeling, for several purposes, such as determining similarities between Zika genomic sequences and modeling the Zika genomic profile [2]. This family of models has also been used to define the genomic profile of the Epstein-Barr virus [3]. The consistent estimation of a PMM [1] is achieved by the Bayesian Information Criterion (BIC), which has led to the definition of a BIC-based metric, see also [4]. The metric has allowed the use of the dynamics of PMM models for other open problems; it has been efficiently applied in subjects such as (1) the comparison between Dengue genomic sequences of different origins [5] and (2) the comparison of genomic sequences of Burkitt lymphoma/leukemia [6]. Since a PMM is defined by a partition of the state space of the process, we understand that the partition's structure could be the key to modeling certain phenomena. Thus, in this paper, we investigate specific impositions on the partition of a PMM, with the purpose of improving the modeling of genomic sequences. Given a memory $o$, the states are sequences of $o$ elements of the alphabet.
The state space is the set of states, and the partition of the PMM is a partition of the state space. Each state is a configuration that occurs in a consecutive interval of time, so in order to define the next element of the process (the transition), it is necessary to observe a past of size $o$. In addition to a finite memory $o$, this paper introduces a parameter $G$. $G$ allows considering dependence on events in the past of the process realization that are not accounted for by the memory $o$. The PMM model thus specified can capture more intricate dependence structures for the representation of genomic sequences and, this being the objective of this paper, allows us to verify whether we can in fact achieve a finer model for the genomic structure of a complete DNA sequence from the outbreak of a novel Coronavirus (Covid-19), collected in Wuhan, Hubei province, China. In addition to tracing the genomic profile of Covid-19, we also want to compare genomic sequences that, according to the literature, could point to the origins of the sequence used in this article, considered one of the first records of the virus.
This article is organized as follows. Section Theoretical Background addresses the theoretical foundations of Partition Markov Models, the estimation process, and the specific case of PMM that we investigate in this article. Section Covid-19 DNA Model describes a real problem associated with the identification of the profile of a new Coronavirus, named Covid-19. This section describes the data and how the specific case of PMM shows improved performance in describing the stochastic behavior of the new virus. The conclusions and considerations are given in Section Conclusion.

Theoretical background
In this section, we present the notation as well as the formalization of the model on which we develop our discussion, the Partition Markov Model. We also show how this model can be consistently estimated. Let $(Z_t)_{t \ge 1}$ be a discrete time Markov chain of order $o$ on a finite alphabet $A$, with $o < \infty$. Let us call $S = A^o$ the state space and denote the string $a_m a_{m+1} \ldots a_n$ by $a_m^n$, where $a_i \in A$, $m \le i \le n$. For each $a \in A$ and $s \in S$ define the transition probability $P(a|s) = \mathrm{Prob}(Z_t = a \mid Z_{t-o}^{t-1} = s)$. In a given sample $z_1^n$, coming from the stochastic process, the number of occurrences of $s$ is denoted by $N_n(s)$ and the number of occurrences of $s$ followed by $a$ is denoted by $N_n(s,a)$. In this way, $\frac{N_n(s,a)}{N_n(s)}$ is the maximum likelihood estimator of $P(a|s)$. The Partition Markov Model, introduced in the next definition, is designed to obtain a parsimonious model for a Markov process with finite memory on a finite alphabet. This model proposes the identification of states of the state space with units called parts (of a partition); the parts are composed of states which have their transition probabilities in common.
Definition 2.1. Let $(Z_t)_{t \ge 1}$ be a discrete time Markov chain of order $o$ on a finite alphabet $A$, with state space $S = A^o$. (i) $s, r \in S$ are equivalent if $P(\cdot|s) = P(\cdot|r)$; (ii) $(Z_t)_{t \ge 1}$ is a Markov chain with partition $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ if this partition is the one defined by the equivalence relationship introduced by item (i).
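The counts $N_n(s)$, $N_n(s,a)$ and the resulting maximum likelihood estimator of $P(a|s)$ can be illustrated with a short Python sketch (a toy sequence and helper names of our own, used for illustration only):

```python
from collections import Counter

def transition_counts(z, o):
    """Count N_n(s) and N_n(s, a) for every state s of length o in a sample z."""
    N_s, N_sa = Counter(), Counter()
    for t in range(o, len(z)):
        s, a = z[t - o:t], z[t]
        N_s[s] += 1
        N_sa[(s, a)] += 1
    return N_s, N_sa

def mle(N_s, N_sa, s, a):
    """Maximum likelihood estimator N_n(s, a) / N_n(s) of P(a | s)."""
    return N_sa[(s, a)] / N_s[s]

# Toy sample over A = {a, c, g, t}, memory o = 2 (illustration only).
N_s, N_sa = transition_counts("acgtacgtacgt", 2)
print(mle(N_s, N_sa, "ac", "g"))  # 1.0: every occurrence of "ac" is followed by "g"
```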
The model given by Definition 2.1 was introduced in [1]. The parameters to be estimated are (a) the partition $\mathcal{L}$ and (b) the transition probabilities from each part $L$ to any element of $A$, $P(\cdot|L)$, where $P(\cdot|L) = P(\cdot|s)$, $\forall s \in L$. We note that the partition of $S$ that responds to item (ii) of Definition 2.1 is minimal in relation to the number of parts $|\mathcal{L}|$. Given a sample $z_1^n$ of $(Z_t)_{t \ge 1}$, according to [1] the partition can be consistently estimated by means of $d_{\mathcal{L}}$, given in Definition 2.2.
Definition 2.2. Let $(Z_t)_{t \ge 1}$ be a discrete time Markov chain of order $o$ on a finite alphabet $A$, with state space $S = A^o$, let $z_1^n$ be a sample of the process, and let $\mathcal{L} = \{L_1, L_2, \ldots, L_{|\mathcal{L}|}\}$ be a partition of $S$ such that for all $s, r \in L_l$, $P(\cdot|s) = P(\cdot|r)$, for each $l = 1, 2, \ldots, |\mathcal{L}|$. Define
$$d_{\mathcal{L}}(i,j) = \frac{\alpha}{(|A|-1)\ln(n)} \sum_{a \in A} \left\{ \sum_{l=i,j} N_n(L_l,a) \ln\!\left(\frac{N_n(L_l,a)}{N_n(L_l)}\right) - N_n(L_{ij},a) \ln\!\left(\frac{N_n(L_{ij},a)}{N_n(L_{ij})}\right) \right\},$$
with $N_n(L_i) = \sum_{s \in L_i} N_n(s)$, $N_n(L_i,a) = \sum_{s \in L_i} N_n(s,a)$, $N_n(L_{ij}) = N_n(L_i) + N_n(L_j)$, $N_n(L_{ij},a) = N_n(L_i,a) + N_n(L_j,a)$, for $a \in A$, $L_i, L_j \in \mathcal{L}$, and $\alpha$ a real and positive value.
The estimation of Partition Markov Models can be carried out via algorithms such as the one introduced in [7], which uses $d_{\mathcal{L}}$. Note that $d_{\mathcal{L}}$ is a metric designed to build a structure in the state space by identifying equivalent states; it is applied (see the algorithm of [7]), for example, to an initial set consisting of the entire state space $S$, and whenever $d_{\mathcal{L}}(i,j) < 1$ the elements $L_i$ and $L_j$ must be placed in the same part (see the properties of $d_{\mathcal{L}}$ in [1]). The metric $d_{\mathcal{L}}$ is derived from the Bayesian Information Criterion (BIC), as proved in [1], and the BIC indicates the junction of two elements into the same part of the partition if and only if $d_{\mathcal{L}} < 1$. For each part $L$ of $\hat{\mathcal{L}}$ the transition probability is estimated by $\hat{P}(a|L) = \frac{N_n(L,a)}{N_n(L)}$. Note that all equivalent states are used to estimate each probability. An economy is produced in the total number of probabilities to be estimated, since the identification of the partition reduces their number, and each probability can be better estimated, since the occurrences of several states contribute to the estimation of each probability, now associated with the part of the partition, $P(\cdot|L)$.
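A minimal Python sketch of the metric $d_{\mathcal{L}}$ and the merge rule $d_{\mathcal{L}} < 1$, assuming the prefactor $\alpha/((|A|-1)\ln n)$ of Definition 2.2; the function and variable names are ours, not taken from the algorithm of [7]:

```python
import math

def d_L(counts_i, counts_j, alphabet, n, alpha=2.0):
    """BIC-based distance between two candidate parts (sketch of Definition 2.2).

    counts_i, counts_j: dicts mapping a -> N_n(L, a); the two parts should be
    merged into one whenever d_L < 1.
    """
    def term(counts):
        # sum_a N(L, a) * ln(N(L, a) / N(L)), with the convention 0 ln 0 = 0
        tot = sum(counts.values())
        return sum(c * math.log(c / tot) for c in counts.values() if c > 0)

    merged = {a: counts_i.get(a, 0) + counts_j.get(a, 0) for a in alphabet}
    gain = term(counts_i) + term(counts_j) - term(merged)
    return alpha * gain / ((len(alphabet) - 1) * math.log(n))

# Two parts with nearly identical empirical transitions: d_L < 1, so merge.
close = d_L({"a": 50, "b": 50}, {"a": 51, "b": 49}, ["a", "b"], n=200)
# Two clearly different parts: d_L > 1, so keep them apart.
far = d_L({"a": 50, "b": 50}, {"a": 90, "b": 10}, ["a", "b"], n=200)
print(close < 1 < far)  # True
```

An agglomerative pass repeats this test over all pairs, merging while some pair satisfies $d_{\mathcal{L}} < 1$, in the spirit of the algorithm of [7].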
In the next subsection, we discuss a specific structure of the partition L and how it might make sense in practice. We also discuss the impact of such specificity on estimation algorithms, such as the one exposed in [7].

A further memory for the process
When fitting a PMM, the state space is restricted by the maximum value of the memory $o$ allowed by the sample size and, if the alphabet is $A$, the space on which the partition is arranged is built as $S = A^o$. This means that to define the next step of the process it is enough to know the past of size $o$, but it could happen that, in addition to a past of size $o$, the process depends on a more distant jump, say $G$, with $G > o$. It is natural then to think of defining the state space as $A^G$, where we would again have the structure of a Partition Markov Model, but with a redundancy of information, since the values between times $t-G+1$ and $t-o-1$ are not relevant for the future; see Figure 1 for an illustration of the idea (the zigzag part is irrelevant). Let us formalize the situation: suppose that there is a value $G > o$ such that the transition probability to $Z_t = a \in A$ from $Z_{t-G}^{t-1} = z \ldots s$, $z \in A$, $s \in A^o$, where "$\ldots$" is any concatenation of elements of $A$ of size $G - o - 1$, is given by
$$P(a|(z,s)) = \mathrm{Prob}(Z_t = a \mid Z_{t-G}^{t-1} = z \ldots s). \quad (1)$$
Then, the space is given by $W = A \times A^o$, where $A$ records all the possibilities for $z$ and $A^o$ records all the possibilities for $s$, with $(z,s)$ as in equation (1). The kind of process on which we want to identify the partition of the state space is given next.
Definition 2.3. Let $(Z_t)_{t \ge 1}$ be a discrete time Markov chain on a finite alphabet $A$, with state space $W = A \times A^o$, where $o < \infty$, and transition probabilities following equation (1), for an adequate and finite $G$ such that $G > o$.

Given a sample $z_1^n$, the number of occurrences of $(z,s) \in A \times A^o$ in the sample is $N_n(z,s) = |\{t : G < t \le n,\ z_{t-G} = z,\ z_{t-o}^{t-1} = s\}|$ and the number of occurrences of $(z,s) \in A \times A^o$ followed by $a \in A$ is $N_n((z,s),a) = |\{t : G < t \le n,\ z_{t-G} = z,\ z_{t-o}^{t-1} = s,\ z_t = a\}|$. Items (i) and (ii) of Definition 2.1 allow defining the partition of $(Z_t)_{t \ge 1}$, say $\mathcal{I}$ (a partition of the space $W$ of Definition 2.3). We can also adapt the metric of Definition 2.2 to this situation; to do so it is enough to replace $S$ by $W$, defining for each part $I$ of the partition $\mathcal{I}$ of $W$, $N_n(I) = \sum_{(z,s) \in I} N_n(z,s)$ and $N_n(I,a) = \sum_{(z,s) \in I} N_n((z,s),a)$, for $a \in A$. Denote by $d_{\mathcal{I}}$ the metric on the state space $W$. So, to estimate $\mathcal{I}$ we can use the algorithm introduced in [7].
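The occurrence counts of the G-model can be sketched as follows (0-based indexing and a toy sequence, for illustration only):

```python
from collections import Counter

def g_model_counts(seq, o, G):
    """Count N_n((z, s)) and N_n((z, s), a) for the G-model states (z, s).

    z = seq[t - G] is the distant symbol, s = seq[t - o:t] the recent memory;
    the symbols strictly between them are ignored (the zigzag in Fig. 1).
    """
    N_zs, N_zsa = Counter(), Counter()
    for t in range(G, len(seq)):  # 0-based: position t needs a full past of size G
        z, s, a = seq[t - G], seq[t - o:t], seq[t]
        N_zs[(z, s)] += 1
        N_zsa[((z, s), a)] += 1
    return N_zs, N_zsa

# Toy sequence with o = 1, G = 4 (illustration only).
N_zs, N_zsa = g_model_counts("acgtacgtac", o=1, G=4)
print(sum(N_zs.values()))  # 6 positions have a full past of size G = 4
```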
Given a sample $z_1^n$ of the process $(Z_t)_{t \ge 1}$, denote by $P(a|(z,s))$ the transition probability given by equation (1); then the likelihood of the sample is
$$P(z_1^n) = \prod_{a \in A,\, I \in \mathcal{I}} P(a|I)^{N_n(I,a)}, \quad (2)$$
which is the same expression as equation (2) of [1]. Then, the procedure is to compute the BIC for each model $\mathcal{I}$, which is given by equation (3),
$$\mathrm{BIC}(\mathcal{I}, z_1^n) = \ln\!\left( \prod_{a \in A,\, I \in \mathcal{I}} \left(\frac{N_n(I,a)}{N_n(I)}\right)^{N_n(I,a)} \right) - (|A|-1)\,|\mathcal{I}|\,\frac{\ln(n)}{\alpha}. \quad (3)$$
As the BIC definition itself shows, the models are characterized (Eq. (3)) by the logarithm of the maximum likelihood, penalized by the number of parameters to be estimated, $(|A|-1)|\mathcal{I}|$, properly scaled by the size of the data set (through the term $\ln(n)/\alpha$). Thus, the model indicated by the BIC is the one with the highest BIC value, which corresponds to the most plausible model taking into account its complexity (number of parameters). The criterion is derived in [8], using $\alpha = 2$, and it corresponds to the maximization of a posterior distribution assuming a non-informative prior distribution on the dimension of the parametric space. We note that the BIC remains valid when the constant 2 is replaced by any positive constant $\alpha$, as in equation (3). By means of the maximization of equation (3) the partition can be estimated, obtaining $\hat{\mathcal{I}}$ as in equation (4),
$$\hat{\mathcal{I}} = \arg\max_{\mathcal{I} \in \mathcal{P}} \mathrm{BIC}(\mathcal{I}, z_1^n), \quad (4)$$
where $\mathcal{P}$ is the set of all the partitions of $W$. As the set $\mathcal{P}$ can be huge, to obtain $\hat{\mathcal{I}}$ it is necessary to use the metric $d_{\mathcal{I}}$ with the algorithm introduced by [7].

[Figure 1. Scheme of the past necessary to determine the state of the process at time $t$, according to equation (1). In zigzag, the irrelevant period.]
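Equation (3) can be computed directly from the part-level counts. A hedged sketch, where `partition_counts` (one dict of counts per part) is our own representation, not notation from the paper:

```python
import math

def bic(partition_counts, n_symbols, n, alpha=2.0):
    """Sketch of equation (3): log-likelihood minus (|A|-1)|I| ln(n)/alpha.

    partition_counts: one dict per part I, mapping a -> N_n(I, a).
    Higher BIC means a better fit/complexity trade-off.
    """
    loglik = 0.0
    for counts in partition_counts:
        tot = sum(counts.values())
        loglik += sum(c * math.log(c / tot) for c in counts.values() if c > 0)
    penalty = (n_symbols - 1) * len(partition_counts) * math.log(n) / alpha
    return loglik - penalty

# With genuinely distinct parts, the two-part model attains the higher BIC,
# despite its larger penalty.
two = bic([{"a": 90, "b": 10}, {"a": 10, "b": 90}], n_symbols=2, n=200)
one = bic([{"a": 100, "b": 100}], n_symbols=2, n=200)
print(two > one)  # True
```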

Remark 2.2
Note that, under Definition 2.3, for the construction of equation (3) and the subsequent derivation of the maximum given in equation (4), the parameters $G$ and $o$ are set beforehand.
The BIC criterion shows great advantages, and therefore its use is recommended. We quote some of its qualities: (i) it is a consistent method for the estimation of models given by Definition 2.3 (Theorem 3 of [1]); particular cases of Definition 2.3 already anticipated the consistency of the BIC for the estimation, see for example [9], in the framework of variable-length Markov chains. (ii) The BIC allows creating a metric like the one detailed in Remark 2.1 (Theorem 2, Corollary 2 of [1]), which facilitates the implementation of algorithms [7]. The second property motivates the preference for the BIC over other criteria such as the Krichevsky-Trofimov (KT) criterion [9].
The structure that we want to investigate in this paper concerns the form of the partition, which responds to the rule illustrated in Figure 1. In the next example, we present a case.

Example 2.1. Consider a process with memory $o = 1$, $G = 4$, partition $\mathcal{I} = \{I_1, I_2, I_3\}$ and transition probabilities given by Table 1. For instance, since $P(a|I_1) = 0.2$ we have that $P(Z_t = a \mid Z_{t-4}^{t-1} = aH{*}a) = 0.2$, $\forall H \in A$ and $\forall * \in A$. Then, the values $H$ and $*$ at positions $t-3$ and $t-2$, respectively (between the times $t-4$ and $t-1$), are irrelevant for the state at time $t$.
In the following two simulations we see the impact of the structure of a G-Markov model (Definition 2.3) when we ignore it and apply the algorithm of [7] with the help of $d_{\mathcal{L}}$, assuming only the usual PMM structure (Definition 2.1).

Example 2.2
We apply the algorithm of [7] to data simulated from the law given by Table 1; the algorithm is applied in two settings, see Table 2. Note that under setting (i) each element of the state space is a concatenation of $o = 4$ consecutive elements of $A$, for example $accb$, and this element under setting (ii) is denoted by $(a, b)$, where the memory $o = 1$ is related to the element $b$ and $G = 4$ is related to the element $a$, the central elements $cc$ being irrelevant. Also observe that there is a relationship between some parts of Tables 1 and 2: $L_1 = I_2$ and $L_5 = I_3$, but the part $I_1$ is distributed over the parts $L_2$, $L_3$ and $L_4$. That is to say, when fitting via (i), the original structure (Tab. 1) is confounded.
To visualize the behavior of settings (i) and (ii) as the sample size increases, for each sample size $n = 5 \times 10^4, 10^6, 5 \times 10^6, 10^7$ we perform 100 simulations of the model of Table 1. We record the performance of settings (i) and (ii) in Table 3.

[Table 1. Partition $\mathcal{I}$ and transition probability to each element in the alphabet $A$. Columns: Part, Elements, $P(a|I)$, $P(b|I)$.]

We see how (ii) has a more efficient behavior, showing that a usual procedure such as (i) can have difficulties in recovering the structure given by Definition 2.3. Note also that all the parts recovered by (ii) for the last three sample sizes are the original ones, see Table 1.
In practical terms, when modeling with real data, a specific sample size is available, say $n$. In general, the memory of the process that can be used in the model depends on $n$ as well as on the cardinality $|A|$ of the process alphabet. Usually, the memory must be smaller than $\log_{|A|}(n)$; in the next example, we see the effect of this condition on $G$.

Example 2.3. Consider the process given by Table 4. We perform simulations of this process with $n = 2000$. The algorithm of [7] is applied in two settings: (i) using $o = 4 < \lfloor \log_{|A|}(n) \rfloor - 1 = 5$, initial set $\{a, b, c\}^4$ and $d_{\mathcal{L}}$ as given by Definition 2.2; and (ii) using $o = 0$, $G = 10$, initial set $\{a, b, c\}$ and $d_{\mathcal{I}}$ with the modifications imposed by Remark 2.1. In other words, case (i) reflects the conditions that are generally applied to proceed with the adjustment and the determination of the process memory.
In Table 5 we show the partition resulting from (i). We see that it is not possible to recover the original structure given in Table 4. As expected, (ii) recovers the structure given by Table 4. These results reflect that $n = 2000$ is insufficient to reach a memory that encompasses $G = 10$, which is the period that the process needs to determine the choice of the next step.
Section Covid-19 DNA Model shows how the model stands out when it comes to representing the genome of Covid-19.

Covid-19 DNA model
In this section, we investigate the stochastic behavior of a complete DNA sequence of the outbreak of a novel Coronavirus (Covid-19) associated with a respiratory disease in Wuhan, Hubei province, China. The sequence was extracted from a patient coming from the Wuhan seafood market, a place associated with the origin of the outbreak; its accession number is MN908947 (version MN908947.3). The sequence comes from a 41-year-old man with no history of hepatitis, tuberculosis or diabetes [10]. The patient was admitted and hospitalized in Wuhan Central Hospital on December 26, 2019, 6 days after the onset of illness. He reported fever, chest tightness, cough, pain, and weakness for one week. The cardiovascular, abdominal, and neurologic examinations were normal; see more details in [10]. The sequence can be obtained from https://www.ncbi.nlm.nih.gov/nuccore/MN908947. We use in this paper the FASTA format of MN908947, which is composed of 29 903 bases: a, c, g, t.

[Table 4. Partition $\mathcal{I}$ and transition probability of each part $I_i$, $i = 1, 2$, to each element in the alphabet $A$. Columns: Part, Elements, $P(a|I)$, $P(b|I)$.]
For the construction of the model, we must choose a memory $o$. In a discrete Markov process with a discrete alphabet, the criterion used is $o < \lfloor \log_{|A|}(n) \rfloor - 1$. So, for the alphabet $A = \{a, c, g, t\}$ with $n = 29\,903$, we have the restriction $o < 6$. Since the bases in a DNA structure are organized in triples, $o = 3$ is recommended. This organization also applies to the memory $G$: the values of $G$ that are multiples of 3 are expected to show better performance.
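The memory bound quoted above can be checked directly:

```python
import math

# Memory bound o < floor(log_|A|(n)) - 1 for the Covid-19 sequence MN908947.
n, A_size = 29903, 4
bound = math.floor(math.log(n, A_size)) - 1
print(bound)  # 6, hence the restriction o < 6
```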
We apply the algorithm of [7] in two settings: (i) using $o \in \{1, 2, 3, 4\}$, initial set $\{a, c, g, t\}^o$ and $d_{\mathcal{L}}$ as given by Definition 2.2; (ii) using $o = 3$, $G \in \{5, 6, 7, 8, 9, 10, 11, 12\}$, initial set $\{a, c, g, t\} \times \{a, c, g, t\}^3$ and $d_{\mathcal{I}}$ with the modifications imposed by Remark 2.1. To identify the best model for the sequence, we use the BIC criterion (with $\alpha = 2$), see equation (3); therefore, the higher the BIC value, the better the model represents the sequence. For comparison, we also report the Krichevsky-Trofimov (KT) criterion [9]; by the KT definition, the smaller the value, the better the model represents the data. According to Table 6, the best models are two models from (ii). This shows the convenience of assuming the existence of an extra parameter $G$. Note that in the three best cases of (ii), $G$ is a multiple of 3 ($G = 9, 12, 6$), which confirms the nature of the genomic organization in triples formed by elements of the alphabet $A$.
As is usual in DNA sequences [2], the transition probabilities are moderate (in this case 0.44, see Tab. 8). Note that there is a predilection of the process to choose $a$ or $t$ as the next state. Sequences of other viruses lead to other predilections; see for example [2], in which the Zika process is modeled, revealing a predilection for the states $a$ or $g$. We observe that, under Definition 2.3 and without imposing the partition structure, the total number of parameters to be estimated is unfeasible. For example, with $o = 3$ and $G = 9$ we have $|A \times A^o| \times (|A| - 1) = 768$ parameters to estimate, while when using the strategy given by equation (2) and Remark 2.1 it is necessary to estimate $|\mathcal{I}| \times (|A| - 1) = 39$ parameters. Table 7 shows the composition of each part; for example, part $I_1$ is composed of 31 elements of type $(z, s)$, where $z \in A$ and $s \in A^o$. The elements are listed from left to right according to how they were inserted into the part by the algorithm of [7], following Remark 2.1.
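The parameter counts quoted above follow from simple arithmetic:

```python
# Without the partition: one free distribution per state (z, s) in A x A^o,
# each with |A| - 1 free probabilities; the fitted partition has 13 parts.
A_size, o, parts = 4, 3, 13
full = (A_size * A_size**o) * (A_size - 1)   # |A x A^o| * (|A| - 1)
reduced = parts * (A_size - 1)               # |I| * (|A| - 1)
print(full, reduced)  # 768 39
```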

Covid-19 and coronaviruses
In this subsection, we incorporate 11 sequences into the study, to compare how distant from, or close to, them the investigated sequence is. It is speculated that the new sequence is the product of mutations of other types of Coronaviruses, and a way to address this is to determine the sequences that are closest to it. Although Coronaviruses similar to the severe acute respiratory syndrome virus (SARS, see [11]) have been widely identified in mammals, including bats, since 2005 in China, the exact origin of the Coronaviruses infecting humans remains unclear. Therefore, it is necessary to determine the natural reservoir and any intermediate hosts of the Coronavirus in its current version (Covid-19). We describe the sequences in Table 9; they are complete Coronavirus genome sequences of different types, recorded over the last 25 years.
The metric introduced next, $d_{\max}$, makes this comparison possible.

[Table 7. Part composition for the model (ii) with $o = 3$ and $G = 9$, see Table 6.]

[Table 8. $\hat{P}(\cdot|I_i)$, $i = 1, \ldots, 13$, for the model (ii) with $o = 3$ and $G = 9$, see Table 6. In bold, the highest probabilities by part.]
In Figure 2, we show the dendrograms built from the $d_{\max}$ values between pairs of sequences, see Table 10. The dendrograms confirm that, based on the available sequences, MN908947.3 (Covid-19), MG772934.1 and MG772933.1 can be considered a cluster. The sequences MG772934.1 and MG772933.1 are records from July 2015 and February 2017, respectively, and both come from Zhoushan, China. We also see that $d_{\max}$(MG772933.1, MG772934.1) = 0.0902 is close to zero, and that proximity is confirmed in [12]. These findings allow us to speculate on certain aspects; one of them is that bats are consolidated as efficient transmitters and are a risk to the human immune system, at least with regard to the Coronavirus and its latest versions [13]. We note that in [10] the similarity between MN908947.3 and MG772933.1 is mentioned, and this is confirmed here by means of the G-model conception.
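The clustering step can be sketched with a simple single-linkage agglomeration over the pairwise $d_{\max}$ values. Except for $d_{\max}$(MG772933.1, MG772934.1) = 0.0902, taken from Table 10, the distances and the label "OTHER" below are purely illustrative placeholders, not values from the paper:

```python
def single_linkage(labels, dist):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [{lab} for lab in labels]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(dist[(a, b)] for a in clusters[p[0]] for b in clusters[p[1]]),
        )
        merges.append(clusters[i] | clusters[j])
        clusters[i] |= clusters[j]
        del clusters[j]
    return merges

labels = ["MN908947.3", "MG772933.1", "MG772934.1", "OTHER"]
d = {("MG772933.1", "MG772934.1"): 0.0902,   # from Table 10
     ("MN908947.3", "MG772933.1"): 0.30,     # illustrative values below
     ("MN908947.3", "MG772934.1"): 0.32,
     ("MN908947.3", "OTHER"): 0.90,
     ("MG772933.1", "OTHER"): 0.95,
     ("MG772934.1", "OTHER"): 0.93}
dist = {**d, **{(b, a): v for (a, b), v in d.items()}}  # symmetrize
merges = single_linkage(labels, dist)
print(merges[0])  # the two bat SL-CoV sequences merge first
```

With these placeholder distances, the first merge joins MG772933.1 and MG772934.1, and the second attaches MN908947.3, mirroring the cluster reported in the text.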

Conclusion
Partition Markov Models allow a vast economy in the construction and representation of phenomena since, through Definition 2.1, they establish units (parts) in the state space that share the same transition probabilities. Thus, several states contribute to the determination of a single transition probability. The parts (elements of the process' partition) consider a finite memory $o$; that is to say, the next step of the process is determined by knowing a past constituted by the concatenation of $o$ elements coming from the alphabet. Thus, the step to time $t$ is determined with the knowledge of the occurrences at times $t-o, \ldots, t-1$. In this paper, we investigate a specific structure within the theoretical framework of Partition Markov Models. The structure of interest lies in the formulation of the partition that defines the process, in which, in addition to a finite memory $o$ associated with the process, a parameter $G$ is introduced, which allows a dependence on the past that complements the one given by the memory $o$, see Definition 2.3. We show how algorithms designed for the classic version of Partition Markov Models can have difficulties in recovering the structure investigated here, see Examples 2.2 and 2.3. Under a previous determination of the parameters $o$ and $G$, it is possible to adapt all the estimation tools of the usual Partition Markov Models (see [1] and [4]), see Remarks 2.1 and 2.2. This specific structure in the process' partition (see Definition 2.3, Eq. (2)) proves efficient for modeling a complete sequence of newly decoded DNA [10], Genbank MN908947, from the newly discovered Coronavirus Covid-19, from a patient of Wuhan, China. A partition in a G-model allows a huge reduction of the number of parameters to be estimated, from 768 to 39 (Tabs. 7 and 8), leading to an increase in the estimation quality of the parameters.
In more general terms, we see that the inclusion of the parameter $G$ generates flexibility that is very well evaluated by model selection criteria (Tab. 6), giving credibility to partition models with more specific structures. Table 7 shows that the stochastic performance of sequence MN908947 can be reduced to 13 stochastic units that are discriminated by how the next state is selected (transition probabilities). Such a configuration could be used to design a Covid-19 profile. The model given by Definition 2.3 also allows us to develop a comparison study with 11 other genomic sequences of Coronavirus, collected over the last 25 years. We conclude that Covid-19 appears next to the Bat SARS-like Coronavirus sequences, Genbanks MG772934 and MG772933, coming from Zhoushan, China (period: 2015-2017), see Table 10 and Figure 2, see also [13]. Our results are in accordance with the indications given in [10]. This evidence could point to one of the best vectors of the virus, and help in the search for vaccines for its treatment.