Establishing a core collection from the integration of morpho-agronomical, phytopathological and molecular data

The aim of this study was to establish core collections from quantitative, multi-category and molecular data, and from all of this information simultaneously, and to compare their representativeness. Ten sub-collections were established from 67 tomato accessions of the Germplasm Bank of the Universidade Federal de Viçosa (BGH-UFV), characterised for 19 quantitative traits, 30 multi-category characters, 52 ISSR loci and their reaction to three pathogens. These sub-collections were defined by the combination of the type of data collected and the sampling rate. At a sampling intensity of 20%, the COD-20 sub-collection stood out, having higher amplitude coincidence indices together with more appropriate variance values. At 30% intensity, sub-collection MOL-30 was as efficient as sub-collection COD-30 when considering only the amplitude coincidence indices and the variances. However, graphical analysis of the variability showed a slight superiority of COD-30 in maintaining variability, especially for the multi-category characters. Whenever data from different sources are available, therefore, the establishment of core collections from the integration of these data should be prioritised, since such collections were more representative when the amplitude coincidence index, variance and retention of variability index are considered simultaneously.


INTRODUCTION
The emphasis given to the importance of preserving genetic resources has led to the formation and maintenance of large germplasm collections around the world. However, the large size of these collections has often been an obstacle to their use, conservation and management (VASCONCELOS et al., 2010). With the objective of enhancing utilisation and accessibility, and minimising maintenance difficulties, Frankel (1984) proposed the concept of the core collection: a subsample derived from a set of germplasm, chosen to represent, with minimum redundancy, the maximum genetic variability of the initial or base collection of a particular species.
Establishing a core collection requires an integrated effort involving curators, breeders and geneticists to define the size of the collection and the accessions that will make it up (ABADIE et al., 2005). Much work has been done to establish core collections, and several strategies have been outlined for their formation, ranging from simple random, non-random and stratified sampling (UPADHYAYA et al., 2007; XU et al., 2006) to more sophisticated sampling methods (JANSEM; VAN HINTUM, 2007; VASCONCELOS et al., 2007; WANG et al., 2007).
Studies of genetic diversity show that there is not always a correlation between the dissimilarity distances established from different types of data, so that the diversity observed from one set of characteristics cannot always be extrapolated to the rest (GOMES, 2007; MARTINS et al., 2011). It is therefore prudent to consider the maximum amount of information characterising the germplasm when establishing a core collection.
In general, studies of genetic diversity have been carried out, and core collections established, based on quantitative, qualitative or molecular characteristics alone. Uniting all the data into a single study has been hampered by the absence of germplasm banks that contain such a detailed characterisation of the germplasm, due to limited financial and human resources (MARTINS et al., 2011).
Although scarce, methodologies for the integration of different types of data into a single analysis are required to establish more effective core collections, especially when the primary purpose is the preservation of genetic variability.
The aim of this study, therefore, was to establish and validate core collections obtained from quantitative, multi-category or molecular data, and from collections that include all of this information.

MATERIAL AND METHODS
The study was carried out on a data set of 67 tomato accessions from the Vegetable Germplasm Bank of the Federal University of Viçosa (BGH-UFV); all had been previously characterised for 19 quantitative features, 30 multi-category features and 53 ISSR loci, as determined by Aguilera et al. (2011), and for their reaction to Alternaria solani, Pseudomonas syringae pv. tomato, and the begomovirus Tomato yellow spot virus (ToYSV).
Quantitative characteristics were evaluated at different development phases of the tomato. At the seedling stage, the diameter of the hypocotyl and length of the cotyledon were evaluated. At the vegetative stage, the thickness of the main petiole and the lengths of the leaf and internode were evaluated. In the fruit, the following characteristics were assessed: fruit length, width of the fruit and central axis, thickness of the endocarp, number of locules, total soluble solids content, total titratable acidity and organoleptic quality. In addition, agronomic characteristics such as the weight and number of good fruit, the weight and total number of fruit, average fruit weight and precocity index were evaluated.
Data for the reaction of the accessions to A. solani and P. syringae are by nature quantitative, as they are obtained respectively from measurement of the leaf area damaged by the fungus, and by counting the total number of bacterial pustules on each plant. The reaction of the accessions to ToYSV was characterised into five classes: highly resistant, resistant, moderately resistant, susceptible and highly susceptible. The data from this evaluation were analysed together with the multi-category data: hypocotyl colour; type of plant growth; density of pilosity on the stem and foliage; attitude and type of leaf; type of corolla; external colour of the immature fruit; the presence and frequency of green shoulder on the fruit; the shape, homogeneity and size of the fruit; colouration and colouration intensity in the mature fruit; secondary fruit shape and shape of the shoulder; size of the area of corking around the pedicel scar; ease of removal of the epicarp; colour of the epicarp; colour and colour intensity of the mesocarp; shape of the cross section of the fruit and stylar scar; shape of the distal extremity of the fruit; condition of the stylar scar; colour of the central axis; and radial and concentric cracking.
The accessions were previously separated or stratified by commercial group: Saladinha, Santa Cruz, Italian or Saladette, Apple or Persimmon and Cherry, according to an analysis of the shape of the fruit, assessed by grading (IPGRI, 1996) (Table 1).
Ten sub-collections of tomato accessions from BGH-UFV were evaluated, defined by the combination of the type of data evaluated and a sampling density of 20 or 30%, representing 14 and 20 accessions per collection respectively. A logarithmic strategy was employed for deciding the number of entries selected for each stratum, calculated from the expression:
$$NAC = nt \times \frac{\log(a_i)}{\sum_{j=1}^{ns} \log(a_j)} \qquad (1)$$
where: NAC is the number of accessions sampled per stratum; $a_i$ is the number of accessions of the i-th class; nt is the total number of accessions sampled, defined by the sampling density; and ns is the number of strata.
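The allocation rule described above can be sketched as follows, assuming the usual form of the logarithmic strategy, in which each stratum receives a share of the sample proportional to the logarithm of its size. The stratum sizes below are hypothetical (they are not the values in Table 1), the function name is illustrative, and the rounding of fractional allocations to whole accessions is left open because the text does not specify it.

```python
import math


def logarithmic_allocation(stratum_sizes, n_total):
    """Share n_total accessions among strata in proportion to log(stratum size).

    Returns fractional allocations; rounding to whole accessions is left
    to the caller because the rounding rule is an open assumption.
    """
    logs = {name: math.log(size) for name, size in stratum_sizes.items()}
    log_sum = sum(logs.values())
    return {name: n_total * value / log_sum for name, value in logs.items()}


# Hypothetical stratum sizes adding up to 67 accessions (NOT the Table 1 values).
sizes = {"SAL": 30, "SC": 20, "ITA": 4, "AP": 6, "CHE": 7}
allocation = logarithmic_allocation(sizes, n_total=14)  # 20% sampling density
for name, value in allocation.items():
    print(f"{name}: {value:.2f}")
```

Under this rule the largest stratum receives a smaller share than it would under proportional sampling, which is the behaviour the logarithmic strategy is meant to produce. A stratum with a single accession would receive zero, since log(1) = 0, so such cases would need separate handling.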
The choice of accessions to comprise each of the sub-collections was based on diversity analysis within each stratum. Dissimilarity matrices between accessions within each stratum were therefore obtained from the quantitative, multi-category or molecular data alone, or from the three types of data integrated either by matrix addition or by encoding the quantitative data into multi-category data through a strategy of equal division of the amplitude into three classes, as described by Martins et al. (2011). In obtaining the dissimilarity matrices, the standardised mean Euclidean distance was used for the quantitative data, the arithmetic complement of the simple index of coincidence for the multi-category data, and the arithmetic complement of the Jaccard similarity coefficient for the molecular data.
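A minimal sketch of the three dissimilarity measures and of the three-class encoding is given below. It assumes textbook definitions: traits are standardised by their sample standard deviation, the simple coincidence index is the proportion of matching classes, and the Jaccard coefficient ignores joint absences. Function names and example values are illustrative only, and the handling of values that fall exactly on a class boundary is an assumption.

```python
import math
from statistics import stdev


def standardised_mean_euclidean(x, y, sds):
    """Mean Euclidean distance after dividing each trait difference by its SD."""
    total = sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, sds))
    return math.sqrt(total / len(x))


def simple_matching_dissimilarity(x, y):
    """Complement of the proportion of traits in which two accessions coincide."""
    matches = sum(a == b for a, b in zip(x, y))
    return 1 - matches / len(x)


def jaccard_dissimilarity(x, y):
    """Complement of the Jaccard coefficient for 0/1 marker scores."""
    both = sum(a == 1 and b == 1 for a, b in zip(x, y))
    either = sum(a == 1 or b == 1 for a, b in zip(x, y))
    return 1 - both / either if either else 0.0


def encode_three_classes(values):
    """Split the amplitude (max - min) into three equal classes, coded 1 to 3."""
    low, high = min(values), max(values)
    width = (high - low) / 3
    if width == 0:
        return [1] * len(values)
    # The maximum would fall in a fourth class, so it is capped at class 3.
    return [min(int((v - low) / width) + 1, 3) for v in values]


# Hypothetical data: two quantitative traits scored on three accessions.
lengths = [4.0, 6.0, 8.0]
widths = [3.0, 3.0, 6.0]
sds = [stdev(lengths), stdev(widths)]
d_quant = standardised_mean_euclidean([4.0, 3.0], [8.0, 6.0], sds)
d_multi = simple_matching_dissimilarity(["a", "b", "c", "d"], ["a", "b", "x", "y"])
d_mol = jaccard_dissimilarity([1, 1, 0, 0], [1, 0, 1, 0])
classes = encode_three_classes([0.0, 1.0, 2.5, 6.0])
print(d_quant, d_multi, d_mol, classes)
```

Once the quantitative traits are encoded this way, they can be scored with the same coincidence-based measure as the multi-category traits, which is what makes the encoding route to integration possible.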
For each dissimilarity matrix from each stratum, the maximum distance value was found and the distances were converted into similarity values using the equation:
$$s_{ii'} = d_{max} - d_{ii'} \qquad (2)$$
where: $s_{ii'}$ is the similarity; $d_{ii'}$ is the distance value between the i and i' individuals on the dissimilarity matrix; and $d_{max}$ is the maximum distance value.
In this way, the maximum distance was equated to zero. Because the dissimilarity matrix became a similarity matrix, grouping accessions by the Tocher method produced an inverse result, i.e. the most divergent genotypes formed clusters (VASCONCELOS et al., 2007). Selection of the accessions that made up the sub-collections was carried out according to the clustering sequence of the Tocher method, until the number of accessions reached the predetermined number for each stratum, according to sampling density (OLIVEIRA et al., 2010).
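The effect of the conversion can be illustrated with a small sketch. It assumes that the similarity is obtained by subtracting each distance from the maximum distance, which is one way of equating the maximum distance to zero, and it shows only the starting pair of a Tocher-style grouping (the pair with the smallest value in the matrix), not the full clustering procedure. The matrix and function names are hypothetical.

```python
def to_similarity(dist):
    """Replace every distance d by d_max - d (assumed form of the conversion)."""
    n = len(dist)
    d_max = max(dist[i][j] for i in range(n) for j in range(n) if i != j)
    return [[d_max - dist[i][j] for j in range(n)] for i in range(n)]


def seed_pair(matrix):
    """Off-diagonal pair with the smallest value, where the grouping would start."""
    n = len(matrix)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda p: matrix[p[0]][p[1]])


# Hypothetical distance matrix for three accessions.
dist = [
    [0, 2, 9],
    [2, 0, 4],
    [9, 4, 0],
]
sim = to_similarity(dist)
print(seed_pair(dist))  # closest pair on the original distances
print(seed_pair(sim))   # most divergent pair once the matrix is converted
```

On the original matrix the grouping would start from the two most similar accessions; on the converted matrix it starts from the two most divergent ones, which is why following the clustering sequence selects divergent accessions first.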
The process of validation of the sub-collections involved comparing them to the initial collection. Comparisons were made taking into consideration the amplitude coincidence index (AC) and the variance for each group of characteristics, whether quantitative, multi-category or molecular. In addition, a graphical analysis of the variability was proposed, for which the sub-collections were also compared using a retention of variability index (RVI).
The amplitude coincidence index (AC) for each sub-collection was obtained for each group of characteristics, whether quantitative, multi-category or molecular, by means of the equation (HU; ZHU; XU, 2000; WANG et al., 2007):
$$AC = \frac{1}{n} \sum_{i=1}^{n} \frac{A_{iSC}}{A_{iCI}} \qquad (3)$$
where: AC is the amplitude coincidence index; $A_{iSC}$ is the amplitude of the i-th characteristic in the sub-collection; $A_{iCI}$ is the amplitude of the i-th characteristic in the initial collection; and n is the number of characteristics for a particular group.
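As an illustration, the index can be computed as the mean, over the characteristics of a group, of the ratio between the amplitude in the sub-collection and the amplitude in the initial collection. This sketch assumes that reading of the index and expresses it as a fraction, in line with the values reported in the text; the trait names and values are hypothetical.

```python
def amplitude(values):
    """Amplitude (range) of a characteristic."""
    return max(values) - min(values)


def amplitude_coincidence(initial, sub):
    """Mean ratio of sub-collection amplitude to initial-collection amplitude.

    Both arguments map a trait name to its list of values. Every trait is
    assumed to vary in the initial collection (non-zero amplitude).
    """
    ratios = [amplitude(sub[trait]) / amplitude(initial[trait]) for trait in initial]
    return sum(ratios) / len(ratios)


# Hypothetical values for two quantitative traits.
initial = {"fruit_length": [2, 4, 6, 10], "locules": [2, 3, 5, 6]}
sub = {"fruit_length": [2, 10], "locules": [3, 5]}
ac = amplitude_coincidence(initial, sub)
print(f"AC = {ac:.2f}")
```

In this example the sub-collection keeps the full amplitude of the first trait but only half that of the second, giving an index below the 0.80 level used in the text as the criterion of representativeness.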
The variance was estimated for each characteristic, both in the initial collection and the core sub-collections. The average of these variances was obtained for each set of characteristics in each of the sub-collections and in the main collection. All comparisons between variances were carried out by F-test (SNEDECOR; COCHRAN, 1980).
To assess the representativeness of the core sub-collections as to retention of variability, the encoded quantitative data and the multi-category data were recoded into binary data, i.e. each class was regarded as one characteristic and the accessions were rated as 1 when belonging to that class, and 0 when not.
The frequency of accessions in each class was estimated for all the core sub-collections. A class present in all individuals of a sub-collection was considered a fixed characteristic, while classes with zero frequency in the sub-collections were considered extinct characteristics. In short, this showed the fixation or loss of alleles for the different phenotypes of a characteristic. Once the frequency of each characteristic in the initial collection and in the sub-collections was estimated, the values were plotted on a graph and the retention of variability index (RVI) estimated: RVI = (number of classes kept in the sub-collection / total number of classes) x 100 (4)
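The recoding and the index can be sketched as below. The example assumes that a class counts as kept whenever its frequency in the sub-collection is above zero; whether a fixed marker class should also be counted as lost is left open here. The trait, its classes and the selected accessions are hypothetical.

```python
def recode_binary(column):
    """Turn one multi-category trait into {class: [0/1 flag per accession]}."""
    classes = sorted(set(column))
    return {cls: [1 if value == cls else 0 for value in column] for cls in classes}


def class_status(flags, sub_indices):
    """Classify a class as extinct, fixed or retained in the sub-collection."""
    count = sum(flags[i] for i in sub_indices)
    if count == 0:
        return "extinct"
    if count == len(sub_indices):
        return "fixed"
    return "retained"


def retention_of_variability(binary, sub_indices):
    """Percentage of classes with non-zero frequency in the sub-collection."""
    kept = sum(
        1 for flags in binary.values()
        if class_status(flags, sub_indices) != "extinct"
    )
    return 100 * kept / len(binary)


# Hypothetical trait scored on five accessions; accessions 0 and 2 are sampled.
colour = ["red", "red", "yellow", "orange", "red"]
binary = recode_binary(colour)
sub = [0, 2]
rvi = retention_of_variability(binary, sub)
print({cls: class_status(flags, sub) for cls, flags in binary.items()})
print(f"RVI = {rvi:.1f}%")
```

Here the class "orange" is extinct in the sub-collection, so two of the three classes are kept.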

RESULTS AND DISCUSSION
In the samples at a density of 20%, the numbers of accessions in the Saladinha, Santa Cruz, Italian, Apple and Cherry strata were 5, 4, 1, 2 and 2 respectively, while in the samples at 30% density, the numbers of accessions for the same strata were 8, 6, 2, 2 and 2 respectively.
From the grouping for each stratum by the inverse Tocher method based on the dissimilarity matrices obtained from the different types of data, and considering the adopted sampling densities of 20% and 30%, 10 core sub-collections were established (Table 2).
Only accessions 980, 2216 and 2234, from the Cherry, Apple and Apple strata respectively, are present in all the sub-collections formed. For those strata with a higher number of accessions, Saladinha and Santa Cruz, no accessions were common to all the sub-collections. The differences found arise from the use of different types of data in establishing the sub-collections. This suggests that sub-collections based on isolated groups of characteristics do not cover the genetic diversity as a whole.
The logarithmic strategy ensured that groups containing few accessions, such as the Italian, Apple and Cherry strata, were represented in the sub-collection, and also that those groups with a large number of accessions contributed relatively fewer accessions to the core collection. Logarithmic sampling therefore increases the probability of capturing the less frequent alleles compared to random sampling (BROWN, 1989). This strategy avoids the excessive sampling of accessions from large strata and increases the number of accessions sampled in the smaller strata, reducing bias due to group size (OLIVEIRA et al., 2010).
Once established, each sub-collection was evaluated as to its representativeness, i.e. its capacity to retain the variability of the initial collection. For the purposes of conservation, the representativeness of the core collection means maintaining the genetic variability. When comparing mean values, amplitudes, frequencies and variances for specific characteristics among the different members of the core and initial collections, it is expected that the intervals remain similar, while mean values move toward the median, and variances increase in the core collection (VAN HINTUM et al., 2000).
[Table 2, surviving entries without their row and column labels: 489, 1987, 2203, 989 and 2213; 489, 1987, 2203, 989, 2213, 1989, 2202; 489, 1993, 2202, 181 and 1987; 489, 1993, 2202, 181, 1987, 850, 2208]
¹ QUANT: sub-collection formed from quantitative data; MULT: sub-collection formed from multi-category data; MOL: sub-collection formed from molecular data; COD: sub-collection formed from the integration of quantitative, multi-category and molecular data by the encoding of quantitative into multi-category data; SUM: sub-collection formed from the integration of quantitative, multi-category and molecular data through algebraic matrix addition.
² Stratum: SAL - Saladinha; SC - Santa Cruz; ITA - Italian or Saladette; AP - Apple or Persimmon; CHE - Cherry.
At the sampling density of 20%, the sub-collections COD-20 and SUM-20, obtained by data integration, stood out, as they displayed values for the amplitude coincidence index (AC) that were greater than 0.80 in relation to all the groups of characteristics, whether quantitative, multi-category or molecular (Table 3).
According to Hu, Zhu and Xu (2000) and Wang et al. (2007), a sub-collection is representative when its amplitude coincidence index is at least 80%. Sub-collections established from a single set of characteristics, QUANT-20, MULT-20 or MOL-20, were therefore not considered representative, as they presented values for AC of 0.79, 0.77 and 0.78 for the molecular, quantitative and multi-category groups of characteristics respectively.
The variance of a sub-collection for each group of characteristics is another parameter that should be taken into account when assessing the representativeness of the sub-collections. By maintaining the amplitude of the characteristics, an increase in variance is expected in the sub-collections relative to the initial collection, since the number of individuals sampled is smaller. However, a significant decrease was seen in variance for the sub-collections, , in relation to the group of quantitative characteristics (Table 3), indicating that individuals with extreme phenotypic values were not included in the sampling process for these sub-collections.
For the sub-collection QUANT-20, a significant increase was seen in variance for the group of quantitative characteristics. As this strategy gave a high value for AC, it can be inferred that this increase in variance is due to the greater frequency of extreme classes in this sub-collection, and furthermore, that the intermediate classes were underrepresented. Consequently, the evaluation of AC and variance should be complementary when validating and choosing the best strategy for obtaining core sub-collections, allowing the conclusion that for sub-collections with the same value for AC, those with less variance are the most representative. Considering the values of AC and variance for the quantitative characteristics, the sub-collections COD-20 and SUM-20 were the most efficient.
Notes to Table 3:
¹ QUANT-20: sub-collection formed from quantitative data; MULT-20: sub-collection formed from multi-category data; MOL-20: sub-collection formed from molecular data; COD-20: sub-collection formed from the integration of quantitative, multi-category and molecular data by the encoding of quantitative into multi-category data; SUM-20: sub-collection formed from the integration of quantitative, multi-category and molecular data through algebraic matrix addition; IC: initial collection.
² Corresponds to the group of characteristics used to estimate AC and variance.
³ Values in parentheses: variances; * and ** significant at 5% and 1% probability, ns not significant at 5% probability, by the F-max test.
For multi-category characteristics, where the intermediate classes should also be represented in the sub-collections, the considerations relative to AC and the change in variance are the same as discussed above for quantitative characteristics. It is desirable to have sub-collections with a high amplitude coincidence index, together with a variance that does not differ significantly from that of the initial collection.
Regarding the group of molecular characteristics, however, the core collections should be established in such a way as to preserve the greatest number of alleles and increase the frequency of rare alleles (alleles with a frequency of less than 5%). The ideal core collection is one that can maintain, for a given locus, the highest number of alleles at the same frequency. In the case of dominant molecular markers such as ISSR, the most suitable sub-collection is that in which the frequency of individuals possessing a mark equals the frequency of those lacking it. Thus, the greater its variance, the more efficient will be the sub-collection.
Considering the three groups of characteristics, quantitative, multi-category and molecular, the COD-20 strategy stood out in relation to SUM-20, with high values for AC, lower and non-significant estimates of variance for the quantitative and multi-category groups of characteristics, and greater variance relative to the molecular characteristics.
Where the sampling density was 30%, all the sub-collections presented values for AC of more than 80% (Table 4). However, QUANT-30 and MULT-30 gave large and significant estimates for variance in relation to the groups of quantitative and multi-category characteristics respectively, and were therefore considered less efficient.
Notes to Table 4:
¹ QUANT-30: sub-collection formed from quantitative data; MULT-30: sub-collection formed from multi-category data; MOL-30: sub-collection formed from molecular data; COD-30: sub-collection formed from the integration of quantitative, multi-category and molecular data by the encoding of quantitative into multi-category data; SUM-30: sub-collection formed from the integration of quantitative, multi-category and molecular data through algebraic matrix addition; IC: initial collection.
² Corresponds to the group of characteristics used to estimate AC and variance.
³ Values in parentheses: variances; * and ** significant at 5% and 1% probability, ns not significant at 5% probability, by the F-max test.
Only the sub-collections MOL-30 and COD-30 were representative, with values for AC over 80%, together with variance values of the appropriate magnitude for the three groups of characteristics. This result demonstrates that, at this sampling density, molecular characterisation alone was enough to establish a core sub-collection as efficient as COD-30 in maintaining the maximum germplasm variability.
For the graphical analysis of the retention of variability, encoding the quantitative and multi-category data into binary data, together with the molecular data, resulted in 288 classes, with 63 for the quantitative data, 173 for the multi-category data and 52 for the molecular data.
Figure 1 shows the frequencies of the classes relating to the quantitative, multi-category and molecular characteristics respectively, for the COD-20 sub-collection.
In each graph, for each class of characteristics, the frequency of accessions belonging to that class is shown, both for the initial collection and the sub-collection, with their deviation. It can be seen that only two of the quantitative classes were not retained in the sub-collection (indicated by arrows), i.e. none of the accessions chosen to make up COD-20 had that characteristic (Figure 1A).
For the multi-category characteristics, 142 of the 173 classes were represented in the sub-collection, that is, 82% of the phenotypes were retained, as they were present in at least one of the accessions that comprised COD-20 (Figure 1B). For the molecular data, variability was maintained for 92% of the loci, which represents the loss of five alleles (indicated by arrows) in the sampling process, either by fixing the presence of a mark or fixing the absence of a mark (Figure 1C). Accordingly, the retention of variability index (RVI) for this collection was 86.8%.
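The reported index for COD-20 can be checked from the counts given in the text, assuming that each of the five lost alleles corresponds to one of the 52 molecular classes and that the total is the 288 classes obtained after binary recoding:

```latex
\mathrm{RVI}_{\mathrm{COD\text{-}20}}
  = \frac{(63 - 2) + 142 + (52 - 5)}{288} \times 100
  = \frac{250}{288} \times 100
  \approx 86.8\%
```

The three terms in the numerator are the quantitative, multi-category and molecular classes retained in the sub-collection.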
In the graphical analysis of the sub-collections, MOL-30 and COD-30, variability was retained for all classes of quantitative characteristics (Figure 2A), and no classes with a frequency of zero were seen.For the multi-category characteristics, the sub-collection, MOL-30, was less efficient than COD-30, as more classes of zero frequency were noted in MOL-30, where 29 of the 173 characteristics present in the initial collection were not represented, whereas in COD-30, the number of characteristics not sampled was 25 (Figure 2B).
Comparing the graphs of variation in allele frequency (classes of molecular characteristics) for MOL-30 and COD-30, a loss of one and three alleles was seen respectively for these sub-collections (Figure 2C). As these are dominant molecular markers, the loss of alleles was recorded by fixing the presence or the absence of a mark (highlighted by arrows). In general, from the graphical analysis it was concluded that, although the amplitude coincidence index and the mean of the variances of the characteristics in the sub-collections MOL-30 and COD-30 demonstrate that both are equally efficient, COD-30 was slightly more so, as it presented an RVI of 93.75%, retaining a greater number of classes of characteristics in comparison with the sub-collection MOL-30 (RVI = 89.6%).
A core collection is defined as a set of accessions from a sample of germplasm, chosen to represent the maximum genetic variability of the initial collection with the minimum of redundancy (BROWN, 1989; FRANKEL, 1984). Whenever data of different types are available, therefore, the establishment of core collections from the integration of such data should be prioritised. In this context, encoding quantitative into multi-category
v. 53, n. 3, p. 515-521, 2006.
ZEWDIE, Y.; TONG, N.; BOSLAND, P. Establishing a core collection of Capsicum using a cluster analysis with enlightened selection of accessions. Genetic Resources and Crop Evolution, v. 51, n. 2, p. 147-151, 2004.

F. A. Martins et al.

Figure 1 - Frequency variation for each class of characteristics: quantitative (A), multi-category (B) and molecular (C). Shaded circles indicate the frequency of classes in the initial collection; empty circles indicate the frequency of classes in the established core sub-collection, COD-20 (integration of quantitative, multi-category and molecular data by the encoding of quantitative into multi-category data, at a sampling density of 20%). Arrows indicate classes of characteristics with zero frequency in the sub-collection (A) and alleles that were fixed, presenting either zero or maximum frequency in the sub-collection (C).

XU, H. M. et al. Sampling a core collection of island cotton (Gossypium barbadense L.) based on the genotypic values of fiber traits. Genetic Resources and Crop Evolution,

Table 1 - Classification of 67 tomato accessions from the Vegetable Germplasm Bank of UFV (BGH-UFV) by commercial group

Table 2 - Core sub-collections of tomato from BGH-UFV, established at sampling densities of 20% and 30%, from integrated data and from different types of data, by the method of logarithmic stratified sampling

Table 3 - Amplitude coincidence index (AC) and mean variance of characteristics in the sub-collections formed at a sampling density of 20% of the data

Table 4 - Amplitude coincidence index (AC) and variance for the sub-collections formed at a sampling density of 30% of the data