Non-hierarchical clustering of Manihot esculenta Crantz germplasm based on quantitative traits

Knowledge of the phenotypic variation of cassava (Manihot esculenta Crantz) germplasm allows the estimation of genetic variability to support the selection of contrasting parents. Therefore, the aim of this work was to define homogeneous groups of cassava germplasm based on yield traits, disease resistance and root quality, using K-means as a non-hierarchical method. Breeding values estimated by the best linear unbiased predictor (BLUP) were used for the cluster analysis. The number of groups was defined according to the stabilization of the within-group sum of squares. Seventeen clusters were defined to represent the diversity of the germplasm, with the number of accessions per cluster ranging from 7 (Group 15) to 69 (Group 9). In general, accessions belonging to Groups 1, 4, 7, 12, 15 and 16 showed good agronomic traits, such as high fresh root yield and starch yield (> 60.7 t ha-1 and 18.6 t ha-1, respectively). In contrast, only Group 15 presented low bacterial blight severity. The groups obtained differed strongly, as evidenced by the within-group sums of squares, which ranged from 215.1 (Group 15) to 2,338.3 (Group 8). The K-means algorithm allowed the formation of consistent groups based on yield traits, disease resistance and root quality. Therefore, the K-means algorithm was efficient in forming groups with low within-group genotypic variation, especially for large data sets such as those from cassava germplasm banks.


INTRODUCTION
Cassava (Manihot esculenta Crantz) is the third most important source of calories in Africa, Asia and Latin America (after rice and maize) and is characterized as a food security crop in these regions, where it is often grown by small farmers on marginal lands. The enormous potential of cassava as a raw material for a range of industrial products can generate increased interest in the crop, raising demand and contributing to agricultural transformation and economic growth in developing countries (FAO, 2013).
Genetic improvement of cassava has achieved significant progress, especially in increasing the crop's yield potential through the development of new varieties. Furthermore, in addition to productivity and root quality, the variety, as a component of the production system, plays an important role in market diversification because specific starch characteristics allow its use in several areas (ZHANG et al., 2010).
From a breeding standpoint, cassava genetic resources are used as a source of genetic variation for incorporating genes of high agronomic value into new cassava varieties. Many important traits have not yet been discovered in this germplasm. In general, however, the search for new alleles for traditionally important characteristics in the cassava production system (higher productivity, starch content and quality, and resistance to drought, pests and diseases) remains an extremely important line of research and a step that precedes their use in breeding programs.
In general, cassava genetic improvement is limited to the evaluation and selection of promising clones in segregating F1 populations involving only a few dozen parents. However, to broaden the crop's genetic base, breeding programs should incorporate new alleles through the use of new germplasm, whether local varieties or even wild species. Thus, an early and important step is the evaluation of new cassava genetic resources to define the most promising accessions for use by plant breeders. Previous studies of cassava accessions revealed that there is enough genetic variability to be exploited in breeding programs (ESUMA et al., 2012; KIZITO et al., 2007; KAWUKI et al., 2013; TURYAGYENDA et al., 2012).
Knowledge of the phenotypic variation in cassava germplasm makes it possible to assess the actual genetic variability and to support the selection of genetically different parents, which can be used in intercrosses to obtain a high heterosis effect, thereby increasing the probability of recovering superior segregants in further generations. Although several methods are available to analyze phenotypic variation, non-hierarchical methods such as K-means have emerged for clustering genotypes, especially when the data set is large (RONNING et al., 2003).
This study is part of an effort to characterize, evaluate and use one of the largest collections of cassava germplasm in Brazil, which holds thousands of accessions with distinct geographical origins, selection patterns and uses. The proposal is to maximize the potential contribution of the collection by developing a database that will serve as a guide in choosing parents to be used in breeding programs. Therefore, the objective of this study was to define homogeneous groups of cassava germplasm accessions based on yield and root quality traits.

Plant material
Six hundred and twenty-nine accessions from the cassava germplasm bank (CGB) of Embrapa Mandioca e Fruticultura (Cruz das Almas, BA, Brazil) were evaluated. This collection consists of landraces and improved varieties resulting from conventional breeding procedures, such as crossing and selection, as well as local varieties with high yield potential identified by farmers or research institutions.

Experimental design
The field experiment was established in September 2011 at the Starch Producers Cooperative (Coopamido) in the municipality of Laje (BA). The experiment was conducted in an augmented block design with 629 cassava germplasm accessions as non-common treatments and 11 checks as common treatments, distributed in 10 blocks with 10 plants per plot. The spacing used was 0.90 m between rows and 0.80 m between plants, and cultivation followed the recommendations for the crop. Plants were harvested 20 months after planting.

Evaluated characteristics
The traits evaluated in t ha-1 were yield of commercial roots (YiComRoo), yield of non-commercial roots (YiNComRoo), shoot weight (ShoWe) and starch yield (StaYi, which considers the starch content and total root yield). The other traits were harvest index (HI), which characterizes the relation between root production and aerial biomass, in %; plant height (PlHei), in m; starch weight (StaWe), measured by specific gravimetric analysis; dry matter content (DMC), in %; and starch content (StaC), in %.
The severity of anthracnose (Colletotrichum gloeosporioides) and bacterial blight (Xanthomonas axonopodis pv. manihotis) was evaluated under field conditions, with disease symptoms observed 10 months after planting. For bacterial blight (RBB), the following rating scale was used: 0 = no symptoms; 1 = symptoms on leaves only (blight); 2 = presence of necrotic lesions on the stem or petiole; 3 = most severe symptoms on leaves and/or presence of necrotic lesions with gum exudation; 4 = complete loss of leaves with apical death or death of the plant. For anthracnose (RAn), the scale used was: 0 = no anthracnose symptoms; 1 = presence of small cankers or older lesions in the lower half of the plant; 2 = presence of deep cankers in the upper half of the plant; 3 = presence of deep cankers with sporulation, distortion or wilting of leaves and drying up of shoots; 4 = dieback or death of the plant.
Genotype recoverability after disease infestation (RADI) was also evaluated by means of a scale (0, 1 and 2 = low, medium and high resilience, respectively). Scores were taken on the 10 plants of each plot, and the mode was used to represent the score of each accession.
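The derived traits and the plot-level scoring described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, the harvest index is assumed to be root yield divided by root plus shoot weight (the text only says it relates root production to aerial biomass), and the input values are invented, not data from the trial.

```python
from statistics import mode


def starch_yield(total_root_yield_t_ha, starch_content_pct):
    """Starch yield (t/ha) = total root yield x starch content (as a fraction)."""
    return total_root_yield_t_ha * starch_content_pct / 100.0


def harvest_index(root_yield_t_ha, shoot_weight_t_ha):
    """Harvest index (%), ASSUMED here to be roots / (roots + shoots) * 100."""
    total_biomass = root_yield_t_ha + shoot_weight_t_ha
    if total_biomass <= 0:
        raise ValueError("total biomass must be positive")
    return 100.0 * root_yield_t_ha / total_biomass


def plot_score(plant_scores):
    """Represent a plot by the most frequent (modal) score among its plants."""
    return mode(plant_scores)


# Invented example plot: 60 t/ha of roots, 40 t/ha of shoots, 30% starch.
example_starch_yield = starch_yield(60.0, 30.0)      # 18.0 t/ha
example_harvest_index = harvest_index(60.0, 40.0)    # 60.0 %
# Invented bacterial blight scores (0-4 scale) for the 10 plants of one plot.
example_score = plot_score([1, 1, 2, 1, 0, 1, 2, 1, 1, 3])  # modal score is 1
```

Using the mode rather than the mean keeps the plot value on the original ordinal scale, which matches the criterion described in the text.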

Data Analysis
We used the restricted maximum likelihood (REML) method, as described by Resende (2007), to estimate the variance components. Breeding values were obtained from the best linear unbiased predictor (BLUP) using the software Selegen (RESENDE, 2007). Then, the data were subjected to K-means cluster analysis using R version 3.0.1 (R DEVELOPMENT CORE TEAM, 2013). This non-hierarchical method iteratively minimizes the distance between the elements and a set of k centers, $\chi = \{x_1, x_2, \ldots, x_k\}$. The distance between a point $p_i$ and the set of centers, $d(p_i, \chi)$, is defined as the distance from the point to its closest center. For n points, the function to be minimized is given by (EVERITT; HOTHORN, 2010):

$$f(\chi) = \frac{1}{n} \sum_{i=1}^{n} d(p_i, \chi)^2$$
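To make the clustering step concrete, the following is a minimal pure-Python sketch of the standard K-means iteration (Lloyd's algorithm), which alternates between assigning each point to its closest center and recomputing each center as the mean of its members. It is not the authors' R code: the function names, the random initialization and the toy data are assumptions for illustration, and any scaling of the traits before clustering is left out because the text does not describe it. The sketch reports the total within-group sum of squares, which has the same minimizer as a per-point average for a fixed number of points.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Return (labels, centers, total within-group sum of squares)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct observations drawn at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in points]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # keep the old center if a group becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wss


# Toy data (invented): two well-separated groups of three points each.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centers, toy_wss = kmeans(toy, k=2)
```

Because the result depends on the starting centers, practical analyses usually repeat the procedure from several random starts and keep the solution with the smallest within-group sum of squares.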

RESULTS AND DISCUSSION
The number of groups was identified based on the stabilization of the within-group sum of squares, that is, the point beyond which increasing the number of groups contributed little to reducing the sum of squares. Thus, according to Figure 1, 17 groups were identified to represent the genotypic diversity of the cassava accessions.
This number of groups indicates high genotypic variability in the germplasm set analyzed; the mean values of each group are presented in Table 1. The number of accessions in each group ranged from 7 (Group 15) to 69 (Group 9), averaging 37.6 accessions per group.
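The stabilization criterion described above can be expressed as a simple rule, sketched below in Python. This is an illustration only: the text indicates that the number of groups was read from the curve in Figure 1, so the numeric tolerance, the function name and the example values are all assumptions and not the authors' procedure or data.

```python
def choose_k(wss_by_k, tol=0.05):
    """Pick the number of groups at which the curve stabilizes.

    wss_by_k[i] is the within-group sum of squares obtained with k = i + 1.
    Returns the smallest k for which adding one more group lowers the sum of
    squares by less than `tol` times the value for k = 1 (assumed rule).
    """
    baseline = wss_by_k[0]
    for i in range(len(wss_by_k) - 1):
        relative_drop = (wss_by_k[i] - wss_by_k[i + 1]) / baseline
        if relative_drop < tol:
            return i + 1
    return len(wss_by_k)


# Invented curve: large gains up to k = 4, then almost flat.
example_curve = [100.0, 60.0, 35.0, 20.0, 18.0, 17.0, 16.5]
example_k = choose_k(example_curve)  # stabilizes at k = 4
```

A fixed tolerance makes the choice reproducible, but it remains a judgment call; plotting the curve, as in Figure 1, is still useful for checking that the selected point is a sensible elbow.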
The partition produced groups of accessions with similar traits that were, at the same time, well separated from one another. This is demonstrated by the low values of the within-group sum of squares, which ranged from 215.1 (Group 15) to 2,338.3 (Group 8), with an average of 1,372.2. On average, this represents only 1.2% of the total sum of squares (109,754.7). K-means minimizes the sum, over all groups, of the distances from each observation to its group's centroid, which results in quite consistent groups, as observed in the analysis of the cassava germplasm. Moreover, according to Ronning et al. (2003), K-means clustering of the potato transcriptome using expressed sequence tags made it possible to correlate changes in gene expression with major physiological events in potato biology, such as tuber initiation, dormancy and sprouting.
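The comparison between within-group and total sums of squares can be sketched as follows. The helper functions and the toy data are hypothetical; only the two figures at the end (1,372.2 and 109,754.7) are taken from the text, and the last line assumes that 1,372.2 is the average over the 17 groups.

```python
def total_ss(points):
    """Sum of squared distances from each point to the overall mean."""
    n = len(points)
    mean = [sum(col) / n for col in zip(*points)]
    return sum(sum((a - m) ** 2 for a, m in zip(p, mean)) for p in points)


def within_ss(points, labels):
    """Within-group sum of squares for each group label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: total_ss(members) for lab, members in groups.items()}


# Toy check (invented data): two tight groups that are far apart.
toy = [(0, 0), (0, 2), (10, 0), (10, 2)]
toy_within = within_ss(toy, [0, 0, 1, 1])   # {0: 2.0, 1: 2.0}
toy_total = total_ss(toy)                   # 104.0

# Figures reported in the text.
mean_within = 1372.2
total = 109754.7
share_per_group = 100.0 * mean_within / total   # about 1.25% per group
share_all_groups = 17 * share_per_group         # about 21% summed over groups
```

Read this way, a single group accounts on average for a little over 1% of the total variation, while the remainder of the total sum of squares lies between groups.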
Groups 1 and 7 showed the highest values for YiComRoo (65.63 and 74.20 t ha-1, respectively), constituting promising groups from which to select genotypes for commercial use per se or for use in breeding programs. K-means clustering was also used in soybean breeding programs to select F3 progenies, owing to its ability to group progenies according to the most important characteristics and the within-group sum of squares in each cluster, being effective in aiding progeny selection (DALLASTRA et al., 2014). Regarding YiNComRoo, although the groups are quite homogeneous, group 3 had the highest standard deviation (0.76), ranging from 8.46 to 13.88 t ha-1 (Table 2). For ShoWe, greater variability was observed between the groups formed by the K-means method, with average values ranging from 23.78 to 44.79 t ha-1. The averages for groups 3 and 7 are above 43.60 t ha-1, reaching more than 56.00 t ha-1 for some genotypes in these groups.
More than 58% of the groups showed a harvest index above 60%, with the means of groups 11, 13, 16 and 17 above 65%. An important aspect of this trait is that, even with accessions of high shoot production, group 7 also has a high YiComRoo, as evidenced by its high harvest index (average of 65.47%). Using phenotypic means, Ojulong et al. (2010) also observed a wide range for this trait (5 to 90%). Moreover, considering the high heritability of HI, as well as its correlation with fresh root production, some authors have used HI as an indirect measure of root yield in the early breeding stages (KAWANO et al., 1998). The balance between total plant weight and HI is of great importance in variety selection programs, especially in low-yield environments.
Plant height is an important trait in cassava because cuttings taken from the stems are the main means of propagating the species. Therefore, it is desirable to have accessions of adequate height for rapid multiplication of plant material and for increasing the flow of propagation material to a greater number of farmers. In this respect, groups 15 and 7 showed average values of 2.43 m and 2.55 m, respectively, with accessions in these two groups ranging from 2.20 m to 2.85 m (Table 2).
Another trait of utmost importance for cassava, particularly in crops grown for industrial production, is starch weight. For this trait, the group averages ranged from 289.4 g to 551 g (groups 14 and 6, respectively) (Table 2). The means for groups 4, 12, 8, 7 and 6 are above 500 g, with some accessions reaching over 600 g, making these groups of agronomic interest.
For DMC and StaC, groups 7 and 6 were also the most promising, with dry matter content of around 34% and starch content above 29% in the roots (Table 3). Ojulong et al. (2010) reported averages for DMC ranging from 16.3% to 69.1%. However, these values refer to phenotypic averages of segregating populations in the early stages of improvement and must, therefore, be confirmed in further trials with a greater number of plants per plot.
However, even with high starch content, a good cassava variety must also have high root yield potential. The starch yield (StaYi) reflects the relation between these two traits well. In general, accessions in groups 1 and 7 showed the highest starch yield. Group 1 presented high root yield, making it a good choice for industrial systems aimed at exploiting starch as a raw material, whereas accessions from group 7 seem to have a better balance between fresh root yield and starch content (Table 3).
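Regarding diseases, groups 14 and 15 presented lower bacterial blight severity (RBB) under field conditions, with average scores below 1 (resistant). In contrast, groups 1, 4 and 7 had high average scores, with some accessions presenting values near 3 (susceptible). Bacterial blight is one of the most important diseases of cassava in Brazil. In regions where climatic conditions are favorable, such as the South, Southeast and Midwest, this disease becomes limiting and can cause total losses when susceptible varieties are grown. In the Northeast, the disease is concentrated in Bahia state and is found particularly in crops in the Cerrado ecosystems, Low and Middle São Francisco, Chapada Diamantina and, more recently, the Falcão River Valley (FUKUDA; GOMES, 2005). As the development of resistant varieties is an effective measure for controlling the disease, the identification of sources of resistance to the pathogen is an important step toward incorporating these alleles into commercial varieties.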
For anthracnose (RAn), the mean genotypic values of the different groups were very similar, although accessions from group 8 showed an average score of 3.13 (range 2.81 to 3.30). Anthracnose is widespread in all producing regions but is especially severe in the Northeast and Southeast regions of Brazil, where environmental conditions are more favorable for its development. Long rainy periods with temperatures between 18 and 28 °C are the ideal conditions for the occurrence and spread of the disease. Although records of economic damage caused by anthracnose are scarce, preventive control measures are necessary to avoid risks to production and infection of the propagation material for the next cycles.
Genotypic data on the severity of anthracnose and bacterial blight (RADI) are an initial indication of the resistance of the cassava accessions evaluated in this work, because the incidence of these diseases in the field resulted from natural pathogen dispersal, so high uniformity in inoculum distribution and concentration is not to be expected. Hence, further evaluations should be performed before this information is used to select sources of resistance to these important cassava diseases.
In genotypic terms, no differences were observed between K-means clusters for RADI (Table 3). However, as the heritability of this trait was very low (0.02; data not shown), further experiments under these conditions are necessary to verify the effectiveness of this trait for selecting accessions more tolerant to infection by foliar diseases.
The K-means method partitioned the data into k mutually exclusive clusters, without building dendrograms to describe the data grouping. However, it is possible to represent the distribution of individuals and the evaluated characteristics based on principal component analysis (Figure 2). Genotype grouping based on multivariate methods constitutes a strategy to summarize the information for breeders. In this case, representative cassava accessions can be easily selected from the groups with specific characteristics of interest and used in hybridization programs with improved cultivars or with accessions belonging to other groups holding other important attributes.

Figure 2 - K-means clustering analysis for the 1st and 2nd components of PCA. Numbers represent different clusters. Grey and colored numbers represent the non-selected and selected groups, respectively. Arrows indicate the projection of the traits onto the principal components dimension and their relationships with differences between and within groups
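The projection of accessions onto the first two principal components can be sketched in pure Python as follows. This is a minimal illustration, not the authors' analysis: it assumes the traits are standardized before the decomposition (the text does not say), uses power iteration with deflation as a simple way to obtain the two leading eigenvectors, and all function names and data are hypothetical. In practice a numerical library would normally be used instead.

```python
import math


def standardize(rows):
    """Center each trait and scale it to unit sample standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
           for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, sds)] for r in rows]


def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(p)]
            for i in range(p)]


def leading_eigenpair(mat, n_iter=1000):
    """Largest eigenvalue and its unit eigenvector by power iteration."""
    p = len(mat)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p))
                for i in range(p))
    return value, v


def first_two_components(rows):
    """Scores on PC1 and PC2, the two loading vectors and their variances."""
    z = standardize(rows)
    mat = correlation_matrix(z)
    p = len(mat)
    val1, vec1 = leading_eigenpair(mat)
    # Deflation: remove the first component, then repeat for the second.
    deflated = [[mat[i][j] - val1 * vec1[i] * vec1[j] for j in range(p)]
                for i in range(p)]
    val2, vec2 = leading_eigenpair(deflated)
    scores = [(sum(a * b for a, b in zip(r, vec1)),
               sum(a * b for a, b in zip(r, vec2))) for r in z]
    return scores, (vec1, vec2), (val1, val2)


# Invented data: two strongly correlated traits measured on five accessions.
toy = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
toy_scores, toy_loadings, toy_variances = first_two_components(toy)
```

Plotting the scores labeled by K-means group, together with the loading vectors as arrows, gives the kind of display described in the Figure 2 caption.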
In general, accessions belonging to Groups 1, 4, 7, 12, 15 and 16 present agronomic characteristics suitable for direct use in the production system or as parents in crossing blocks. For example, in these groups the average yields of commercial roots and starch were 60.7 t ha-1 and 18.6 t ha-1, which are approximately 26.7% and 22.3% higher, respectively, than the averages for the other groups (Table 2). Regarding disease severity, no major differences were identified between the genotypic means of the selected groups and the others, except for Group 15, which was chosen mainly because it was less severely affected by bacterial blight, although its accessions have lower root yield and therefore less dry matter and starch. Thus, considering that the success of breeding programs depends on the availability of sufficient genetic variation to promote new genotypic combinations that may bring some agronomic advantage over the available varieties, the evaluation of the degree of genetic variability in cassava germplasm carried out in this paper is expected to lead to a more effective use of this important genetic resource.
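The percentage comparison above can be unpacked with a short arithmetic sketch. The function names are hypothetical, and the implied averages for the non-selected groups are derived here from the reported figures; they are not values stated in the text.

```python
def pct_higher(value, baseline):
    """Percentage by which `value` exceeds `baseline`."""
    return 100.0 * (value - baseline) / baseline


def implied_baseline(value, pct):
    """Baseline for which `value` is `pct` percent higher."""
    return value / (1.0 + pct / 100.0)


# Reported: selected groups average 60.7 t/ha (roots) and 18.6 t/ha (starch),
# about 26.7% and 22.3% above the other groups.
other_roots = implied_baseline(60.7, 26.7)    # roughly 47.9 t/ha
other_starch = implied_baseline(18.6, 22.3)   # roughly 15.2 t/ha
```

Under this reading, the non-selected groups would average close to 48 t ha-1 of commercial roots and about 15 t ha-1 of starch.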

CONCLUSIONS
The genetic variability of the 629 cassava accessions is high enough to be exploited in breeding programs. Based on this variability, it was possible to define groups with clearly distinct traits from which genotypes of interest can be selected. Therefore, non-hierarchical clustering methods such as K-means are efficient for forming groups with low within-group genotypic variation, especially with large amounts of data, such as those from cassava germplasm banks.

Figure 1 - Distribution of the sum of squares according to the number of clusters. The red circle shows the number of clusters selected

E. J. Oliveira et al.

Table 1 - Distribution of cassava accessions based on genotypic values predicted by best linear unbiased predictor (BLUP), through the analysis of 12 agronomic traits and disease resistance

Table 2 - Variation of agronomic traits present in clusters formed by K-means analysis, based on the analysis of genotypic values obtained by best linear unbiased predictor (BLUP) in cassava accessions

Table 3 - Variation of agronomic traits and disease resistance present in clusters formed by K-means analysis, based on the analysis of genotypic values obtained by best linear unbiased predictor (BLUP) in cassava accessions