4open, Volume 2, 2019. Issue: Statistical Inference in Copula Models and Markov Processes, Case Studies and Insights. Article number: 20. Number of pages: 8. Section: Mathematics - Applied Mathematics. DOI: https://doi.org/10.1051/fopen/2019018. 03 July 2019.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## 1 Introduction

The article is organized as follows. Section 2 presents the notions we use and defines the criterion used to classify the sequences. Section 3.1 describes the database, Sections 3.2 and 3.3 report the results, and Section 4 gives the conclusions.

## 2 Theoretical basis

In this section we give the theoretical framework underlying (i) the notion of proximity between sequences and (ii) the criterion used to classify the sequences.

Let $(X_t)$ be a discrete time Markov chain of order $o$ (with $o < \infty$) on a finite alphabet $A$. Let us call $\mathcal{S} = A^o$ the state space and denote the string $a_m, a_{m+1}, \ldots, a_n$ by $a_m^n$, where $a_i \in A$, $m \le i \le n$. For each $a \in A$ and $s \in \mathcal{S}$ define the conditional probability $P(a|s) = \mathrm{Prob}(X_t = a \mid X_{t-o}^{t-1} = s)$. In a given sample $x_1^n$ coming from the stochastic process, the number of occurrences of $s$ in the sample is denoted by $N_n(s)$ and the number of occurrences of $s$ followed by $a$ in the sample is denoted by $N_n(s, a)$. In this way, $N_n(s, a)/N_n(s)$ is the estimator of $P(a|s)$.

Consider now two Markov chains $(X_{1,t})$ and $(X_{2,t})$, of order $o$, on the finite alphabet $A$ with state space $\mathcal{S} = A^o$. Given $s \in \mathcal{S}$, denote by $\{P(a|s)\}_{a \in A}$ and $\{Q(a|s)\}_{a \in A}$ the sets of conditional probabilities of $(X_{1,t})$ and $(X_{2,t})$ respectively. We define a local metric $d_s$ (introduced in [2]) that, evaluated at a given string $s$, quantifies how far apart or how near the processes are.
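The counting behind the estimator of P(a|s) can be sketched in a few lines of Python. This is only an illustration: the function names are ours, the toy string is not real genomic data, and we assume the convention that an occurrence of s is counted only when it is followed by some symbol, so that the counts of (s, a) over a add up to the count of s.

```python
from collections import Counter


def transition_counts(sample, order):
    """Return (N_s, N_sa): how often each state s occurs as a context and
    how often each state s is followed by the symbol a."""
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(sample)):
        s, a = sample[t - order:t], sample[t]
        n_s[s] += 1          # assumed convention: s counted only when followed by a symbol
        n_sa[(s, a)] += 1
    return n_s, n_sa


def estimate(sample, order, s, a):
    """Estimate P(a|s) by N(s, a) / N(s); None when s never occurs."""
    n_s, n_sa = transition_counts(sample, order)
    if n_s[s] == 0:
        return None
    return n_sa[(s, a)] / n_s[s]


toy = "acgacgact"  # hypothetical toy sample over the alphabet {a, c, g, t}
print(estimate(toy, 3, "gac", "t"))
```

In the toy sample the state "gac" occurs twice as a context, once followed by "g" and once by "t", so the estimate for "t" is one half.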

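Definition 2.1. Consider two Markov chains $(X_{1,t})$ and $(X_{2,t})$, of order $o$, with finite alphabet $A$, state space $\mathcal{S} = A^o$ and independent samples $x_{1,1}^{n_1}$ and $x_{2,1}^{n_2}$ respectively.

i. For a string $s \in \mathcal{S}$,

$$d_s(x_{1,1}^{n_1}, x_{2,1}^{n_2}) = \frac{\alpha}{(|A| - 1)\ln(n_1 + n_2)} \sum_{a \in A} \left\{ N_{n_1}(s, a) \ln\left(\frac{N_{n_1}(s, a)}{N_{n_1}(s)}\right) + N_{n_2}(s, a) \ln\left(\frac{N_{n_2}(s, a)}{N_{n_2}(s)}\right) - N_{n_1 + n_2}(s, a) \ln\left(\frac{N_{n_1 + n_2}(s, a)}{N_{n_1 + n_2}(s)}\right) \right\},$$

ii.

$$d_{\max}(x_{1,1}^{n_1}, x_{2,1}^{n_2}) = \max_{s \in \mathcal{S}} \left\{ d_s(x_{1,1}^{n_1}, x_{2,1}^{n_2}) \right\},$$

with $N_{n_1 + n_2}(s, a) = N_{n_1}(s, a) + N_{n_2}(s, a)$ and $N_{n_1 + n_2}(s) = N_{n_1}(s) + N_{n_2}(s)$, where $N_{n_1}$ and $N_{n_2}$ are given as usual, computed from the samples $x_{1,1}^{n_1}$ and $x_{2,1}^{n_2}$ respectively, and $\alpha$ is a real and positive value.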
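The following Python sketch shows one way the local and global distances could be computed. It assumes that ds has the log-likelihood-ratio form used in [2], namely α / ((|A| − 1) ln(n1 + n2)) times the sum over a of N1(s, a) ln(N1(s, a)/N1(s)) + N2(s, a) ln(N2(s, a)/N2(s)) − N12(s, a) ln(N12(s, a)/N12(s)), where N12 denotes the pooled counts, and that dmax is the maximum of ds over the state space. The function names, the convention 0 ln 0 = 0 and the toy samples are our own choices for illustration.

```python
import math
from collections import Counter
from itertools import product


def transition_counts(sample, order):
    """Count each state s (as a context) and each pair (s, a) in the sample."""
    n_s, n_sa = Counter(), Counter()
    for t in range(order, len(sample)):
        s, a = sample[t - order:t], sample[t]
        n_s[s] += 1
        n_sa[(s, a)] += 1
    return n_s, n_sa


def _n_log_ratio(n_sa, n_s):
    # Assumed convention 0 * ln(0) = 0: unseen transitions contribute nothing.
    return n_sa * math.log(n_sa / n_s) if n_sa > 0 else 0.0


def d_s(counts1, counts2, s, n1, n2, alphabet, alpha=2.0):
    """Local distance at state s under the assumed form of Definition 2.1-i."""
    (ns1, nsa1), (ns2, nsa2) = counts1, counts2
    total = 0.0
    for a in alphabet:
        c1, c2 = nsa1[(s, a)], nsa2[(s, a)]
        total += (_n_log_ratio(c1, ns1[s]) + _n_log_ratio(c2, ns2[s])
                  - _n_log_ratio(c1 + c2, ns1[s] + ns2[s]))
    return alpha * total / ((len(alphabet) - 1) * math.log(n1 + n2))


def d_max(x1, x2, order, alphabet="acgt", alpha=2.0):
    """Global distance: the largest d_s over all states of A^order."""
    counts1 = transition_counts(x1, order)
    counts2 = transition_counts(x2, order)
    states = ("".join(p) for p in product(alphabet, repeat=order))
    return max(d_s(counts1, counts2, s, len(x1), len(x2), alphabet, alpha)
               for s in states)


x1 = "acgt" * 50  # hypothetical toy samples, not real genomic data
x2 = "aacc" * 50
print(d_max(x1, x1, order=1))  # identical samples
print(d_max(x1, x2, order=1))  # samples with clearly different transitions
```

Counting each sample once and reusing the counts for every state keeps the cost linear in the sample sizes plus the number of states.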

Definition 2.1 offers two notions of proximity between sequences: "i." is local and "ii." is global. Both are statistically consistent, that is, as $\min\{n_1, n_2\}$ increases, so does their ability to detect (a) discrepancies (when the underlying laws are different) and (b) similarities (when the underlying laws are the same). In the application we use $\alpha = 2$ (see Definition 2.1-i.); with this value, deciding that the sequences follow the same law when $d_s < 1$ is equivalent to using the Bayesian Information Criterion (see [2, 10]). It is also proved in [2] that $d_s$ is a metric.

Next we introduce a notion that makes it possible to classify the sequences that belong to a group of sequences.

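Definition 2.2. Given a finite collection $\{x_{i,1}^{n_i}\}_{i=1}^{m}$ of samples from the processes $\{(X_{i,t})\}_{i=1}^{m}$ with probabilities $\{P_i\}_{i=1}^{m}$ over the finite alphabet $A$, with state space $\mathcal{S} = A^o$ ($o < \infty$). For a fixed $i \in \{1, 2, \ldots, m\}$ define

$$V(x_{i,1}^{n_i}) = \mathrm{median}\left\{ d_{\max}(x_{i,1}^{n_i}, x_{j,1}^{n_j}) : j \neq i,\ 1 \le j \le m \right\},$$

where, given a sequence $\{z_i\}_{i=1}^{l}$, $\mathrm{median}\{z_i\}_{i=1}^{l} = z_{(k+1)}$ if $l = 2k + 1$ and $\mathrm{median}\{z_i\}_{i=1}^{l} = (z_{(k)} + z_{(k+1)})/2$ if $l = 2k$, for $k$ an integer and $z_{(j)}$ denoting the $j$th order statistic of the collection $\{z_i\}_{i=1}^{l}$.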
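As a minimal sketch, assuming that V of a sample is the median of its dmax values against every other sample in the collection, the classifier can be computed from a symmetric matrix of pairwise dmax values. The matrix below is hypothetical and serves only to illustrate the computation; it does not reproduce any value reported in this article.

```python
from statistics import median


def representativeness(dmax):
    """Given a symmetric matrix of pairwise dmax values, return the list of V
    values: for each sample, the median of its distances to all the others."""
    m = len(dmax)
    return [median(dmax[i][j] for j in range(m) if j != i) for i in range(m)]


# Hypothetical pairwise distances between four samples (illustration only).
toy_dmax = [
    [0.00, 0.01, 0.02, 0.50],
    [0.01, 0.00, 0.03, 0.60],
    [0.02, 0.03, 0.00, 0.55],
    [0.50, 0.60, 0.55, 0.00],
]

v = representativeness(toy_dmax)
ranking = sorted(range(len(v)), key=lambda i: v[i])  # most representative first
print(v, ranking)
```

`statistics.median` averages the two central order statistics when the number of values is even, which matches the median used in Definition 2.2. In the toy matrix the fourth sample is far from the other three, so it receives the largest V and is ranked last.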

With the V values attributed to each sample, we can order the samples from lowest to highest value of V in order to identify their classification. As Definition 2.2 shows, low values of V indicate that a sample represents the whole group well, while high values of V indicate little representativeness. The next result (proved in [1]) gives us an adequate tool to classify sequences according to their underlying laws; it consolidates V as a robust and consistent classifier.

Theorem 2.1. Under the assumptions of Definition 2.2, for each i, 1 ≤ i ≤ m, set ,

where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. Theorem 2.1 guarantees that if at least 50% of the samples in the collection follow the same law, each of them receives a value of V close to zero. If this does not happen, V takes arbitrarily large values, identifying discrepancies in the generating laws of the sequences.

## 3 Data and results

First we describe the data, their source and structure; then we measure the distance between the sequences and classify them by representativeness.

### 3.1 Dengue virus type 1

The complete sequences were obtained from http://www.ncbi.nlm.nih.gov/ (NCBI – National Center for Biotechnology Information), sequenced and studied for the first time in [3]. We describe the sequences in Table 1.

Table 1

Complete sequences of Dengue virus type 1. Columns from left to right: (1) the identification of the sequence/strain, (2) the NCBI accession number, (3) the patient from whom the sequence comes, (4) the possible place of infection of the patient.

The epicenter of the Dengue Fever (DF) outbreak during 2014 was possibly Yoyogi Park in Tokyo. Some of the sequences come from patients who passed through the park and nearby locations. Details of each patient listed in Table 1 are given in [3]; here we emphasize some of them. The last column of Table 1 reports the place where each patient is suspected to have been infected. Patient 14-149J was infected in a place near Yoyogi Park. Patient 14-153J did not visit Yoyogi Park for at least two weeks before the onset of DF and was likely infected in Chiba prefecture. Patient 14-181J lives in Shizuoka prefecture and never visited Yoyogi Park or the other affected areas, but visited other places in Tokyo before the onset of DF. Patient 14-188J lives in Nishinomiya city, Hyogo prefecture, over 500 km from Tokyo, and never visited the Tokyo area before the onset of DF. He visited Malaysia for seven days and had the onset of DF 12 days later. To illustrate the structure of the data, consider the beginning of the sequence LC011945,

then, the alphabet is A = {a, c, g, t}, with cardinality |A| = 4 and elements adenine (a), cytosine (c), guanine (g) and thymine (t). All the sequences have a length of about 10 700 elements.

Figure 1 shows a map of Japan with the regions from which the patients listed in Table 1 come. To classify the sequences and establish the similarity between them, in the next section we first calculate the values of ds for each pair of sequences, where s is a state of the state space $\mathcal{S}$. As usual, the elements of the alphabet A are organized in triples, so we can choose a memory o = 3, 6, 9, etc.; the state space is therefore composed of the concatenations of o elements of A ($\mathcal{S} = A^o$). In this case, the size of the sequences is approximately 10 700, so the recommended memory is , where $\lfloor x \rfloor$ is the greatest integer less than or equal to $x$. We can then use memory three or six; to simplify, we use the memory o = 3.
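To give a feeling for what the choice of memory implies, the sketch below enumerates the state space for a given memory o and computes how many transitions per state a sample of length 10 700 provides on average. The helper names are ours, and the average assumes, purely for illustration, that the transitions are spread evenly over the states.

```python
from itertools import product

ALPHABET = "acgt"


def state_space(order, alphabet=ALPHABET):
    """All strings of `order` symbols of the alphabet, i.e. the states of A^order."""
    return ["".join(p) for p in product(alphabet, repeat=order)]


def mean_transitions_per_state(n, order, alphabet=ALPHABET):
    """Average number of observed transitions per state for a sample of size n.
    A sample of size n contains n - order transitions."""
    return (n - order) / len(alphabet) ** order


for o in (3, 6):
    print(o, len(state_space(o)), round(mean_transitions_per_state(10_700, o), 1))
```

With memory 3 there are 64 states and about 167 transitions per state on average, whereas memory 6 gives 4096 states and fewer than 3 transitions per state, so the conditional probabilities are estimated from far more data when o = 3.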

 Figure 1 Map of Japan with the regions listed in Table 1.

### 3.2 Similarity between the genomic sequences

Since we want global measurements between the sequences, we calculate the values of dmax (Definition 2.1-ii.) between each pair of sequences. We found that the three Tokyo sequences, LC011945, LC011946 and LC011947, have dmax = 0, so they will be represented by LC011945. We therefore work with four sequences, whose names and index numbers are shown in Table 2. Table 3 shows the value of dmax for each pair of sequences in Table 2. Computing dmax requires computing ds for each state s of the state space, with memory o = 3.
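The reduction from six to four sequences amounts to keeping one representative of every group of sequences at distance zero from each other. A minimal sketch of that step follows; the labels and the distance values are hypothetical placeholders (not accession numbers or values from this study), and the pairwise dmax values are assumed to be already computed.

```python
from itertools import combinations


def collapse_zero_distance(names, dmax):
    """Keep the first sequence of every group whose pairwise dmax is zero.
    `dmax` maps frozenset({name1, name2}) to the distance between the two."""
    representatives = []
    for name in names:
        if all(dmax[frozenset((name, rep))] > 0 for rep in representatives):
            representatives.append(name)
    return representatives


# Hypothetical example: t1, t2 and t3 are indistinguishable, x and y are not.
names = ["t1", "t2", "t3", "x", "y"]
identical = {"t1", "t2", "t3"}
toy = {
    frozenset(pair): (0.0 if set(pair) <= identical else 0.1)
    for pair in combinations(names, 2)
}

print(collapse_zero_distance(names, toy))
```

Because a metric satisfies the triangle inequality, two sequences that are both at distance zero from a representative are also at distance zero from each other, so comparing each candidate only with the representatives kept so far is enough.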

Table 2

Genomic sequences and index number. The original set of six genomic sequences (Table 1) was reduced to four genomic sequences.

Table 3

dmax values (see Definition 2.1-ii.). Columns 1 and 2 list the combinations of two sequences, from Table 2. Column 3 shows the value of dmax for the sequences to its left.

The lowest value of dmax is attained by the sequences LC011945 and LC011948, and the second lowest by the sequences LC011945 and LC016760. The highest values of dmax occur between the sequence LC011949 and each of the sequences LC011945, LC011948 and LC016760. It is useful to represent the values of dmax through a dendrogram, as in Figure 2. We built different dendrograms (average, median, single and complete) and they all point to the same organization of the sequences, see http://www.ime.unicamp.br/~jg/cadvj/. The dendrogram exposes the homogeneity of three of the four sequences, LC011945, LC011948 and LC016760, and the disparity between the sequence LC011949 and this group of three.

 Figure 2 Dendrogram by average criterion built from the dmax values reported in Table 3.
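For readers who wish to reproduce a dendrogram by the average criterion from a table of pairwise distances, the following pure-Python sketch performs average-linkage agglomerative clustering and returns the sequence of merges with their heights. It is a generic illustration, not the software used for Figure 2, and the labels and distances are hypothetical.

```python
def average_linkage(labels, dist):
    """Agglomerative clustering with the average criterion.
    `dist` maps frozenset({a, b}) to the distance between labels a and b.
    Returns the merges as (cluster1, cluster2, height), in the order performed."""
    clusters = [(label,) for label in labels]
    merges = []

    def between(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        height, i, j = min(
            (between(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], height))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical distances between four samples (illustration only).
toy = {
    frozenset(("s1", "s2")): 0.01,
    frozenset(("s1", "s3")): 0.02,
    frozenset(("s2", "s3")): 0.03,
    frozenset(("s1", "s4")): 0.50,
    frozenset(("s2", "s4")): 0.60,
    frozenset(("s3", "s4")): 0.55,
}

for left, right, height in average_linkage(["s1", "s2", "s3", "s4"], toy):
    print(left, right, round(height, 3))
```

In the toy example the three close samples are merged first and the distant sample joins last, at a much larger height, which is the kind of pattern a dendrogram makes visible.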

Observe that dmax < 1 in all cases (Table 3). This implies that ds < 1 for every state s, i.e. the four sequences are considered as generated by the same stochastic law, although a certain heterogeneity exists between them, detected by the magnitudes of dmax. This allows us to investigate which of them is more or less representative of the set, which is the approach of the following subsection.

### 3.3 Classification of each sequence by means of V

We determine the classification attributed to each sequence, according to criterion V (see Definition 2.2). Table 4 shows the results.

Table 4

Value of V (Definition 2.2) for each sequence (see Table 2), ordered by increasing magnitude from top to bottom. In bold, the most representative sequence (top) and the least representative sequence (bottom).

The sequence that best represents the set of sequences (listed in Table 2) is LC011945, coming from Tokyo. The most discrepant sequence (largest V) is LC011949, the least representative sample, which indicates that LC011949 may have a different origin than the other sequences. That is, patient 14-181J was probably infected by a different strain from the other Japanese patients of Table 1. Looking at the V value of LC011949, which is 0.04200, we see clearly the impact of the dmax values in Table 3: each time the sequence LC011949 is compared with another one in the list, the value of dmax increases by one decimal place.

To better understand the meaning of this classification, we compared all the sequences found in the database http://www.ncbi.nlm.nih.gov/ that match the profile of complete sequences of Dengue type 1, year 2014, coming from Asia. The list of accession numbers is given in Table 5. To each accession number we attach two letters in order to easily identify the country. Figure 3 shows a dendrogram built by the average criterion with all the complete sequences.

 Figure 3 Dendrogram by average criterion for the sequences listed in Table 5, built from the dmax values (Definition 2.1-ii.). Arrows indicate the four Japanese sequences from Table 2.

Table 5

List of accession numbers (NCBI database) of complete sequences of Dengue virus type 1, year 2014, from Asia. The first column shows the country and the second column shows the sequences coming from that country.

The circular dendrogram shows that the Japanese sequence with the highest V among those in Table 4 (LC011949) is in a cluster quite far away from the others in the list (LC011945, LC011948, LC016760). Some observations can be made from Figure 3. For instance, the Japanese sequence LC011949 is next to the Chinese sequence KT827371, while the other Japanese sequences (Table 4) are closer to a variety of sequences from various countries including Japan, Malaysia and Singapore. Moreover, from the organization of the dendrogram, we verify that the sequence LC011949 is considerably more distant from the group {LC011945, LC011948, LC016760} than other foreign sequences are, such as sequences coming from China, Malaysia and Singapore. As in Figure 2, the dendrogram of Figure 3 also shows the proximity of the Japanese sequences LC011945 and LC011948, which supports the argument for their representativeness in the group of Table 1. See also http://www.ime.unicamp.br/~jg/cadvj/ for dendrograms built with several criteria, which corroborate the results. The Japanese sequence LC011949 (Shizuoka patient who never visited Yoyogi Park), besides being the least representative (higher V and dmax), also appears in Figure 3 closer to sequences of Chinese origin, which could imply an infection of different origin.

## 4 Conclusion

In this paper we use two stochastic and statistically consistent notions to (i) establish the proximity between genomic sequences (see [2]) and (ii) classify the sequences in terms of their representativeness (see [1]). The classification rule gives low values to more representative sequences and high values to less representative ones. We classify genomic sequences of Dengue virus type 1 originating in Japan, all corresponding to the outbreak that occurred in Japan during 2014 (see Table 4). We identify the most representative sequences of the outbreak (those from Tokyo) and verify that they resemble other sequences (of 2014) coming from countries like Malaysia, Singapore and China. The least representative sequence of the outbreak (from Shizuoka) could resemble another sequence of Chinese origin (from 2014), the latter being distant from the representative sequences of the outbreak. According to the classification obtained and the evidence (see Figs. 2 and 3), we tend to agree with [3] that the outbreak in Japan during 2014 could have involved more than one strain of Dengue virus type 1. This type of approach makes it possible to quantify the representativeness of sequences when compared with groups of sequences. This way of classifying is a genuinely stochastic tool, as explained in Section 2, that reports how close or distant the stochastic laws of the sequences under consideration are.

Future research could include the various serotypes of Dengue virus in order to (a) establish whether the notion dmax/ds is capable of discriminating between the serotypes, (b) identify the spectrum of variation of the classifier V in each serotype, and (c) establish the impact of the constant α (see Definition 2.1) on (a) and (b).

## Acknowledgments

M. Cordeiro and S. Londoño gratefully acknowledge the financial support provided by CAPES with fellowships from the PhD Program in Statistics – University of Campinas. J.E. García and V.A. González-López gratefully acknowledge the support provided by the project Inhibitory deficit as a marker of neuroplasticity in rehabilitation grant 2017/12943-8, São Paulo Research Foundation (FAPESP). Also, the authors wish to thank the three referees for their many helpful comments and suggestions on an earlier draft of this paper.

## References

1. Fernández M, García Jesús E, Gholizadeh R, González-López VA (2019), Sample selection procedure in daily trading volume processes. Math Meth Appl Sci, 1–13. https://doi.org/10.1002/mma.5705.
2. García Jesús E, Gholizadeh R, González-López VA (2018), A BIC-based consistent metric between Markovian processes. Appl Stoch Models Bus Ind 34, 6, 868–878.
3. Tajima S, Nakayama E, Kotaki A, Moi ML, Ikeda M, Yagasaki K, Saito Y, Shibasaki K, Saijo M, Takasaki T (2017), Whole genome sequencing–based molecular epidemiologic analysis of autochthonous dengue virus type 1 strains circulating in Japan in 2014. Jpn J Infect Dis 70, 1, 45–49.
4. Liu P, Fang X, Feng Z, Guo YM, Peng RJ, Liu T, Huang Z, Feng Y, Sun X, Xiong Z, Guo X, Pang SS, Wang B, Lv X, Feng FT, Li DJ, Chen LZ, Feng QS, Huang WL, Zeng MS, Bei JX, Zhang Y, Zeng YX (2011), Direct sequencing and characterization of a clinical isolate of Epstein-Barr virus from nasopharyngeal carcinoma tissue by using next-generation sequencing technology. J Virol 85, 21, 11291–11299.
5. Zeng MS, Li DJ, Liu QL, Song LB, Li MZ, Zhang RH, Yu XJ, Wang HM, Emberg I, Zeng YX (2005), Genomic sequence analysis of Epstein-Barr Virus strain GD1 from a nasopharyngeal carcinoma patient. J Virol 79, 24, 15323–15330.
6. Kwok H, Tong AH, Lin CH, Lok S, Farrel PJ, Kwong DL, Chiang AK (2012), Genomic sequencing and comparative analysis of Epstein-Barr Virus genome isolated from primary nasopharyngeal carcinoma biopsy. PLoS One 7, 5, e36939.
7. García Jesús E, González-López VA (2016), Markov partition models for Epstein Barr virus, in: JR Bozeman Jr, T Oliveira, CH Skiadas (Eds.), Stochastic and Data Analysis Methods and Applications in Statistics and Demography, International Society for the Advancement of Science and Technology (ISAST), Athens.
8. García Jesús E, Gholizadeh R, González-López VA (2018), Stochastic distance between Burkitt lymphoma/leukemia strains, in: C Skiadas, C Skiadas (Eds.), Demography and Health Issues. The Springer Series on Demographic Methods and Population Analysis, Vol. 46, Springer, Cham.
9. Baer R, Bankier AT, Biggin MD, Deininger PL, Farrell PJ, Gibson TJ, Hatfull G, Hudson GS, Satchwell SC, Seguin C, Tuffnell PS, Barrell BG (1984), DNA sequence and expression of the B95–8 Epstein-Barr virus genome. Nature 310, 5974, 207.
10. Schwarz G (1978), Estimating the dimension of a model. Ann Stat 6, 2, 461–464.

Cite this article as: Cordeiro MTA, García JE, González-López VA & Londoño SLM 2019. Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014. 4open, 2, 20.
