Classification of autochthonous dengue virus type 1 strains circulating in Japan in 2014

In this paper, we classify by representativeness the elements of a set of complete genomic sequences of Dengue Virus Type 1 (DENV-1), corresponding to the outbreak in Japan during 2014. The set is coming from four regions: Chiba, Hyogo, Shizuoka and Tokyo. We consider this set as composed of independent samples coming from Markovian processes of finite order and finite alphabet. Under the assumption of the existence of a law that prevails in at least 50% of the samples of the set, we identify the sequences governed by the predominant law (see [1, 2]). The rule of classification is based on a local metric between samples, which tends to zero when we compare sequences of identical law and tends to infinity when comparing sequences with different laws. We found that the order of representativeness, from highest to lowest and according to the origin of the sequences is: Tokyo, Chiba, Hyogo, and Shizuoka. When comparing the Japanese sequences with their contemporaries from Asia, we find that the less representative sequence (from Shizuoka) is positioned in groups considerably far away from that which includes the sequences from the other regions in Japan, this offers evidence to suppose that the outbreak in Japan could be produced by more than one type of DENV-1.


Introduction
In the present study, we analyze the entire genome of six autochthonous DENV-1 (Dengue Virus Type 1) strains isolated from patients during the 2014 outbreak in Japan (see [3]). Our objective is to identify the most representative sequence and the least representative sequence of the set. The virus is transmitted to humans by infected mosquitoes of the Aedes genus. The Dengue virus, in their four types, can be contracted more than once, what makes it extremely efficient. Individuals who have already contracted Dengue present risk factors that, by contracting the virus for the second time in a different variant/type, can develop severe forms such as Dengue hemorrhagic fever and Dengue shock syndrome both potentially fatal. In recent years it has been possible to have access to a vaccine, despite this, it is only recommended for those who have already contracted Dengue before. Since the Second World War, the cases of Dengue Fever identified in Japan have been imported, but between 2013 and 2014 more than 160 autochthonous cases were identified. This has demanded a deep investigation of the nature of this outbreak. The contamination has been caused by DENV-1. The findings in [3] suggest that there were at least two independent autochthonous epidemics in Japan in 2014 caused by DENV-1. In this work our objective is to classify the Dengue samples considering the representativeness that each sample has in relation to the group. That is, we know that the samples belong to different individuals and as a consequence are subject to variations in their genomic construction. We wish to identify the most representative sample of the group, this being the most similar to all other samples of the group. Also, we want to identify the most discrepant sample in the group. To achieve our goal, we will identify each sequence with a sample of a stochastic process. Then we will measure the distance between the sequences using a specially designed metric, and applying a robust method (introduced in [1]), we can identify the most representative sample and we can also classify all the samples in order of representativeness. The problem of establishing the proximity between genomic sequences has aroused the interest of several areas and with different objectives. For example, in the area of virology a relevant issue is the characterization of the Epstein Barr virus under different diagnoses, such as Burkitt's lymphoma, nasopharyngeal carcinoma or even among several types of associated diagnoses [4]. See for examples [5][6][7][8], in each case some notion of similarity between strains is explored. An aspect that differentiates this article is that the notion used to establish the proximity between the genomic sequences is a metric (see [2]), that is, it is symmetric, not negative and verifies the triangular inequality. This metric also has very convenient statistical characteristics, such as being statistically consistent. The second aspect and perhaps the most relevant one is the construction of a metric-based classification between the sequences (see [1]). A criterion for classifying samples, statistically consistent as is the case, could be used in the future to construct standard representations for genomic structures. For example, it could certify that the sequence currently used as a gold standard in the Epstein Barr virus study, sequence B95-8, from [9] (see also [5]), in fact, serves impartial criteria such as the one used in this paper.
The sections and topics that compound this article are detailed below. The notions that we use as well as the definition of the criterion to classify the sequences are given in Section 2. We detail the database in Section 3.1. The results are in Section 3.2 and the conclusions in Section 4.

Theoretical basis
In this section we give the theoretical framework on which are established, (i) the notion of proximity between sequences as well as (ii) the criterion of classification of the sequences.
Let (X t ) be a discrete time, order o (with o < 1) Markov chain on a finite alphabet A. Let us call S ¼ A o the state space, denote the string a m , a m+1 ,. . ., a n by a n m where a i 2 A, m i n. For each a 2 A and s 2 S define the conditional probability P ðajsÞ ¼ Prob ðX t ¼ ajX tÀ1 tÀo ¼ sÞ: In a given sample x n 1 ; coming from the stochastic process, the number of occurrences of s in the sample x n 1 is denoted by N n (s) and the number of occurrences of s followed by a in the sample x n 1 is denoted by N n (s, a). In this way, N n ðs;aÞ N n ðsÞ is the estimator of P(a|s). Consider now, two Markov chains (X 1,t ) and (X 2,t ), of order o, arranged on the finite alphabet A with state space S: Given s 2 S denote by {P(a|s)} a2A and {Q(a|s)} a2A the sets of conditional probabilities of (X 1,t ) and (X 2,t ) respectively. We define a local metric d s (introduced by [2]) that, when evaluated in a given string s, allows us to define how far or near the processes are. (i) For a string s 2 S; d s ðx n 1 1;1 ; x n 2 2;1 Þ ¼ a ðjAj À 1Þ lnðn 1 þ n 2 Þ X a2A N n 1 ðs; aÞ ln N n 1 ðs; aÞ N n 1 ðsÞ þ N n 2 ðs; aÞ ln N n 2 ðs; aÞ N n 2 ðsÞ ÀN n 1 þn 2 ðs; aÞ ln N n 1 þn 2 ðs; aÞ N n 1 þn 2 ðsÞ ; with N n 1 þn 2 ðs; aÞ ¼ N n 1 ðs; aÞ þ N n 2 ðs; aÞ; N n 1 þn 2 ðsÞ ¼ N n 1 ðsÞ þ N n 2 ðsÞ; where N n 1 and N n 2 are given as usual, computed from the samples x n 1 1;1 and x n 2 2;1 respectively. With a a real and positive value. The Definition 2.1 offers us two notions of proximity between sequences, "i." is local and "ii." is global, "i." and "ii." are statistically consistent, that is, by increasing the min{n 1 , n 2 } grows their ability to detect (a) discrepancies (when the underlying laws are different) and (b) similarities (when the underlying laws are the same). In the application we use a = 2 (see Definition 2.1-i.), with this value (a = 2), to decide that the sequences follow the same law when d s < 1, is equivalent to use the Bayesian Information Criterion (see [2,10]). In [2] is proved also that d s is a metric: (b) d s ðx n 1 1;1 ; x n 2 2;1 Þ ¼ d s ðx n 2 2;1 ; x n 1 1;1 Þ; (c) d s ðx n 1 1;1 ; x n 2 2;1 Þ d s ðx n 1 1;1 ; x n 3 3;1 Þ þ d s ðx n 3 3;1 ; x n 2 2;1 Þ: To follow is introduced a notion that makes possible the classification of sequences that belong to a group of sequences.
Definition 2.2. Given a finite collection fx n j j;1 g m j¼1 of samples from the processes fðX j;t Þg m j¼1 with probabilities fP j g m j¼1 ; over the finite alphabet A, with state space where, given a sequence fz j g l j¼1 ; median fz j ; 1 j lg ¼ z ðkþ1Þ if l = 2k + 1 and median fz j ; 1 j lg ¼ z ðkÞ þz ðkþ1Þ 2 if l = 2k, for k an integer and z (j) denoting the jth order statistic of the collection fz j g l j¼1 .
With the V values attributed to each sample, we can proceed to order the samples, from lowest to highest value of V, in order to identify their classification. As we can perceive from the Definition 2.2, low values of V indicate that these samples represent the whole group better, while high values of V indicate little representativeness. The next result (proved in [1]), give us an adequate tool to classify sequences, according to their underlying laws, it allows to consolidate V as a robust and consistent classifier. l m : where x d e is the smallest integer greater than or equal to x. Theorem 2.1 guarantees that if at least 50% of the samples of the set follow the same law, each of them receives a value of V close to zero. And if this does not happen, V takes arbitrarily large values identifying discrepancies in the generating laws of the sequences.

Data and results
First, we describe the data, its source and structure, afterwards we proceed to measure the distance between the sequences and to classify them by representativeness.

Dengue virus type 1
The complete sequences were obtained from http://www.ncbi.nlm.nih.gov/ (NCBI -National Center for Biotechnology Information), sequenced and studied for the first time in [3]. We describe the sequences in Table 1.
The epicenter of Dengue Fever (DF) outbreak during 2014 was possible in the Yoyogi Park in Tokyo. Part of the sequences are coming from patients who pass through there and nearby locations. Details of each patient listed in Table 1 are given in [3], here we emphasize some of them. In the last column of Table 1 we inform the place in where it is suspected that the contamination happened to each patient. The contamination of patient 14-149J occurred in a place near to Yoyogi Park. 14-153J did not visit Yoyogi Park for at least two weeks before the onset of DF and was likely infected in Chiba prefecture. 14-181J lives in Shizuoka prefecture and never visited Yoyogi Park or the other affected areas, visited other places in Tokyo before the onset of DF. Patient 14-188J lives in Nishinomiya city, Hyogo prefecture, over 500 km of Tokyo and never visited the Tokyo area before the onset of DF. He visited Malaysia for seven days and had the onset of DF 12 days after. To illustrate the structure of the data, consider the beginning of the sequence LC011945, gacaagaacagtttcgaatcggaagcttgcttaacgtagttctaacagtt . . . then, the alphabet is A = {a, c, g, t} with cardinal |A| = 4 and elements: adenine (a), cytosine (c), guanine (g) and thymine (t). All the sequences have around a size of 10 700 elements. In Figure 1 we see a map of Japan with the regions from are coming the patients listed in Table 1. To calculate the classification of the sequences and establish the similarity between them, in the next section we first calculate the values of d s for each pair of sequences, where s is a state of the state space S. As usual, the elements of the alphabet A are organized in triples, then we can choose a memory o = 3, 6, 9, etc., therefore, the state space is composed of o concatenations of elements of A (S ¼ A o ). In this case, the size of the sequences is approximately 10 700, so the recommended memory is o < log jAj ð10700Þ j k À 1 ¼ 7; where x b c is the greatest integer less than or equal to x. Then we can use memory three or six, to simplify, we use the memory o = 3.

Similarity between the genomic sequences
Since we want to obtain global measurements between the sequences we calculate the values of d max (Definition 2.1-ii.) between each pair of sequences. From this, we found that the three Tokyo sequences, LC011945, LC11946 and LC11947 have d max = 0. So, the three Tokyo sequences will be represented by LC011945. Then, we will work with four sequences, the sequence names an index number are shown in Table 2. Table 3 shows the value of d max for each pair of sequences. That is, for each pair of sequences in the Table 2 we compute d max , the computation of d max requires the computation of d s for each s of the state space. And in that case the memory used is o = 3.
We see that the lowest value of d max is caused by the sequences LC011945 and LC011948, with the second lowest value being the d max between the sequences LC011945 and LC016760. Already the highest value of d max occurs between the sequence LC011949 in relation to the sequences LC011945, LC011948 and LC016760 respectively. It is useful to represent the values of d max through a dendrogram as seen on Figure 2. We build different dendrograms (average, median, single and complete) and they all point to the same organization between the sequences, see http://www.ime.unicamp.br/~jg/cadvj/. As we can see, in fact the dendrogram exposes the homogeneity between three of the four sequences: LC011945, LC011948 and LC016760, leaving exposed the disparity between the sequence LC011949 and the group of three sequences.
Observe that d max < 1 in all cases (Table 3), this implies that all values of d s < 1 in all states s 2 S; i.e. the four sequences are considered as generated by the same stochastic law, but between them exist certain heterogeneity, detected by the magnitudes of d max . This fact allows us to carry out investigations that answer which of them is more or less representative in the set, which is the approach of the following subsection.

Classification of each sequence by means of V
We determine the classification attributed to each sequence, according to criterion V (see Definition 2.2). Table 4 shows the results.
The sequence that best represent de set of sequences (listed in Table 2) is coming from Tokyo LC011945. The most discrepant sequence (larger V) is LC011949, being the less representative sample and it indicates that LC011949 may have a different origin than the other sequences. This is, patient 14-181J was probably infected by a different strain from the other Japanese patients of Table 1. Comparing the classification of LC011949, which is 0.04200, we see clearly the impact of the d max values coming from Table 3. Each time the sequence LC011949 is compared with another one in the list, the value of d max increases by one decimal.  Table 1.
To identify more strongly the meaning of this classification, we have compared all the sequences found in the base http://www.ncbi.nlm.nih.gov/ with the profile of being complete sequences of Dengue Type 1, year 2014, and coming from Asia. The list of accession numbers is given in the Table 5. For each sequence identified through its accession number we attach two letters to that number, in order to easily identify the country. In Figure 3 we show a dendrogram build by the average criterion with all the complete sequences.

Cluster Dendrogram
Height Figure 2. Dendrogram by average criterion build from the d max values, reported in the Table 3.   The circular dendrogram shows that the sequence of Japan (LC011949) with the highest V (among those in the list of Table 4) is in a cluster quite far away from the others in the list (LC011945, LC011948, LC016760). Some observations from Figure 3 can be done, for instance, the Japanese sequence LC011949 is next to the Chinese sequence KT827371, while the other Japanese sequences (Table 4) are closer to a variety of sequences from various countries including Japan, Malaysia and Singapore. Moreover, by the form of organization of the dendrogram, we verified that the sequence LC011949 is considerably more distant from the group {LC011945, LC011948, LC016760} in comparison with other foreign sequences, such as sequences coming from China, Malaysia and Singapore. As seen in Figure 2, the dendrogram of Figure 3 also shows the proximity of the Japanese sequences LC011945 and LC011948, which supports the argument of its representativeness in the group of Table 1. See also http://www.ime.unicamp.br/~jg/cadvj/, in order to corroborate the results with dendrograms build applying several criteria. The Japanese sequence LC011949 (Shizuoka patient who never visited Yoyogi Park) besides being the least representative (V and d max higher) is also shown in Figure 3 closer to those of Chinese origin, which could implies a contamination of different origin.

Conclusion
In this paper we use two stochastic and statistically consistent notions to, (i) establish the proximity between genomic sequences (see [2]), (ii) classify the sequences in terms of their representativeness (see [1]). The classification rule gives low values to more representative sequences and it gives high values to less representative sequences. We classify genomic sequences of Dengue Virus Type I, originating in Japan and all corresponding to the outbreak occurred in Japan during 2014 (see Table 4). We identify the most representative sequences of the outbreak (those are from Tokyo), and we verify that these resemble other sequences (of 2014) coming from countries like Malaysia, Singapore, and China. The less representative sequence of the outbreak (from Shizuoka) is also a sequence that could resemble another one of Chinese origin (from 2014), but the latter being distant from the representative sequences of the outbreak. According to the classification that we have obtained and because of the evidence (see Figs. 2 and 3) we tend to agree with [3] in the sense of affirming that the outbreak in Japan during 2014 could involve more than one type of Dengue Virus Type I. By means of this type of approach it is possible to quantify the representativity of sequences, when compared with groups of sequences. This way of classifying is a genuinely stochastic tool, as explained in Section 2, that reports how close or distant are the stochastic laws of the sequences under consideration.
Future research could include the various serotypes of Dengue virus, in order to, (a) establish whether the notion d max /d s is capable of discriminating between the serotypes, (b) identify the spectrum of variation of the classifier (V) in each serotype, (c) establish the impact of the a constant (see Definition 2.1) in (a) and (b).