Results

Here are some taxonomy trees built from matrices of distances between sequences. The distance between two sequences is the difference of their relative abundances, computed from the CGR.

The experiments were carried out with the CGR in the square.
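
As an illustration of the CGR in the square, here is a minimal sketch of the usual chaos game construction: start at the centre of the unit square and, for each nucleotide, move halfway towards the corner assigned to that nucleotide. The corner assignment and the function name `cgr_points` are assumptions made for this example; the text above does not specify them.

```python
# Corner assignment is an assumption (a common convention), not taken from the text.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the list of CGR points of a DNA sequence in the unit square."""
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        corner = CORNERS.get(base)
        if corner is None:
            continue  # skip ambiguous symbols such as N
        # Move halfway from the current point to the corner of this base.
        x = (x + corner[0]) / 2.0
        y = (y + corner[1]) / 2.0
        points.append((x, y))
    return points
```

Each point encodes the suffix of the sequence read so far, so counting the points that fall in a zone of the square amounts to counting the suffixes associated with that zone.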

Used sequences

For these experiments, we used sequences retrieved from genome banks available on the Internet. For each species, we took 10 sub-sequences of length 100000 and added their reverse complements.
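
The preparation step can be sketched as follows. How the 10 sub-sequences were chosen is not stated, so drawing random start positions is an assumption, as are the function names `reverse_complement` and `sample_subsequences`.

```python
import random

# Translation table mapping each nucleotide to its complement.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (other symbols are kept)."""
    return sequence.translate(COMPLEMENT)[::-1]


def sample_subsequences(genome, count=10, size=100000, seed=0):
    """Pick `count` windows of length `size`; random start positions are an assumption."""
    if len(genome) < size:
        raise ValueError("genome shorter than the requested window size")
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - size + 1) for _ in range(count)]
    return [genome[s:s + size] for s in starts]
```

For example, `reverse_complement("AACG")` gives `"CGTT"`. Each sampled window can then be paired with its reverse complement before the CGR is computed.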

Abbreviation  Species                                 Genbank Id
homsa1        Homo sapiens                            NT_022184.13
homsa2        Homo sapiens                            NT_005403.14
homsa3        Homo sapiens                            NT_025741.13
homsa4        Homo sapiens                            NT_011520.9
homsa5        Homo sapiens                            NT_011757.13
mmus          Mus musculus                            NT_078586.1
ratn1         Rattus norvegicus                       NC_005118
ratn2         Rattus norvegicus                       NC_005117
ratn3         Rattus norvegicus                       NC_005107
ratn4         Rattus norvegicus                       NC_005105
gal1          Gallus gallus                           NC_006097.1
gal2          Gallus gallus                           NC_006096.1
gal3          Gallus gallus                           NC_006095.1
gal4          Gallus gallus                           NC_006094.1
gal5          Gallus gallus                           NC_006093.1
gal6          Gallus gallus                           NC_006092.1
gal7          Gallus gallus                           NC_006091.1
agam1         Anopheles gambiae                       NW_045719.1
agam2         Anopheles gambiae                       NW_045746.1
agam3         Anopheles gambiae                       NW_045763.1
agam4         Anopheles gambiae                       NW_045815.1
dmela1        Drosophila melanogaster                 NC_004354.1
dmela2        Drosophila melanogaster                 NT_033779.2
dmela3        Drosophila melanogaster                 NT_033778.1
dmela4        Drosophila melanogaster                 NT_037436.1
dmela5        Drosophila melanogaster                 Arm X
dmela6        Drosophila melanogaster                 Arm 2R
dmela7        Drosophila melanogaster                 Arm 2L
dmela8        Drosophila melanogaster                 Arm 3L
dmela9        Drosophila melanogaster                 Arm 4
dmela10       Drosophila melanogaster                 Arm 3R
celeg1        Caenorhabditis elegans                  CHR_I
celeg2        Caenorhabditis elegans                  CHR_II
celeg3        Caenorhabditis elegans                  CHR_III
celeg4        Caenorhabditis elegans                  CHR_IV
celeg5        Caenorhabditis elegans                  CHR_V
celeg6        Caenorhabditis elegans                  CHR_X
pfal          Plasmodium falciparum                   NC_004317
osat1         Oryza sativa                            NT_036323
osat2         Oryza sativa                            NT_079973
osat3         Oryza sativa                            NT_080060
osat4         Oryza sativa                            NT_080067
osat5         Oryza sativa                            NT_080068
athal1        Arabidopsis thaliana                    NC_003070
athal2        Arabidopsis thaliana                    NC_003071.3
athal3        Arabidopsis thaliana                    NC_003074.4
athal4        Arabidopsis thaliana                    NC_003075.3
athal5        Arabidopsis thaliana                    NC_003076.4
bacc          Bacillus cereus                         NC_004722
braj          Bradyrhizobium japonicum                NC_004463
ccre          Caulobacter crescentus                  NC_002696
mlot          Mesorhizobium loti                      NC_002678
mbov          Mycobacterium bovis                     AF2122/97
save          Streptomyces avermitilis                NC_003155
scoe          Streptomyces coelicolor                 NC_003888
mace          Methanosarcina acetivorans C2A          NC_003552
maze          Methanosarcina mazei                    NC_003901
ssol          Sulfolobus solfataricus P2              NC_002754.1
paer          Pyrobaculum aerophilum                  NC_003364
stok          Sulfolobus tokodaii                     NC_003106
aful          Archaeoglobus fulgidus                  NC_000917
halo          Halobacterium sp.                       NC_002607
mkan          Methanopyrus kandleri                   NC_003551
mther         Methanothermobacter thermautotrophicus  NC_000916
paby          Pyrococcus abyssi                       NC_000868
phor          Pyrococcus horikoshii                   NC_000961
taci          Thermoplasma acidophilum                NC_002578
tvol          Thermoplasma volcanium                  NC_002689
bthe          Bacteroides thetaiotaomicron            NC_004663
viol          Chromobacterium violaceum               NC_005085
ecol          Escherichia coli                        NC_004431
rbal          Rhodopirellula baltica                  NC_005027
vibp          Vibrio parahaemolyticus                 NC_004603
xcam          Xanthomonas campestris                  NC_003902
ypse          Yersinia pseudotuberculosis             NC_006155
Partitions
The first partition used is composed of 16 zones corresponding to 3-letter words.
Picture of the first partition used
The second partition used is composed of 20x20 regular sub-squares of the unit square, grouped into 16 equiprobable sets, i.e. each sub-square has probability 1/16 of belonging to any given set. This partition therefore also has 16 zones.
Picture of the second partition used
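
A partition of this second kind, and the distance derived from it, can be sketched as below. This is only an illustration under stated assumptions: "relative abundance" is taken to be the fraction of CGR points falling in each zone, and the distance is taken to be the sum of absolute differences between these fractions. The names `random_partition`, `zone_frequencies` and `abundance_distance` are hypothetical, and `points` is any list of (x, y) coordinates in the unit square.

```python
import random


def random_partition(grid=20, zones=16, seed=0):
    """Assign each cell of a grid x grid regular subdivision to one of
    `zones` sets, each set being chosen with probability 1/zones."""
    rng = random.Random(seed)
    return [[rng.randrange(zones) for _ in range(grid)] for _ in range(grid)]


def zone_frequencies(points, partition, zones=16):
    """Fraction of points falling in each zone (assumed meaning of relative abundance)."""
    grid = len(partition)
    counts = [0] * zones
    for x, y in points:
        col = min(int(x * grid), grid - 1)  # clamp x == 1.0 into the last column
        row = min(int(y * grid), grid - 1)
        counts[partition[row][col]] += 1
    total = len(points)
    return [c / total for c in counts] if total else [0.0] * zones


def abundance_distance(freq_a, freq_b):
    """Sum of absolute differences between two frequency vectors (assumed distance)."""
    return sum(abs(a - b) for a, b in zip(freq_a, freq_b))
```

Applying `abundance_distance` to every pair of sequences yields the symmetric distance matrix from which a tree is built.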
Trees obtained
Computing the distances between sequences with the first partition, then applying the "Neighbour-Joining" method to these distances, gives the tree below. The Eukaryotes are represented in green, the Archaea in blue and the Bacteria in orange, yellow and brown. We can see that several species are "not well classified".
Taxonomy tree obtained with the first partition
Using the second partition, where zones do not correspond to words, the species in the resulting tree (below) are "better classified": Eukaryotes, Bacteria and Archaea are clearly distinguished. Only three species of Archaea are mixed in with the Bacteria.
Taxonomy tree obtained with the second partition
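
For readers unfamiliar with the Neighbour-Joining step, here is a compact sketch of the standard algorithm applied to a symmetric distance matrix. It is not the program used to produce the trees above; the function name `neighbour_joining`, the Newick output and the placement of the final root are choices made for this example.

```python
def neighbour_joining(names, dist):
    """Build a tree from a symmetric distance matrix; return a Newick string."""
    if len(names) < 2:
        raise ValueError("at least two sequences are required")
    nodes = list(names)
    d = [list(row) for row in dist]  # work on a copy of the matrix
    while len(nodes) > 2:
        n = len(nodes)
        totals = [sum(row) for row in d]
        # Pick the pair (i, j) that minimises the Q criterion.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * d[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        # Branch lengths from the new internal node to i and to j.
        li = 0.5 * d[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        merged = "(%s:%.4f,%s:%.4f)" % (nodes[i], li, nodes[j], lj)
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in keep]
        d = [[d[a][b] for b in keep] + [new_row[idx]]
             for idx, a in enumerate(keep)]
        d.append(new_row + [0.0])
        nodes = [nodes[k] for k in keep] + [merged]
    # Join the last two nodes; splitting the edge in half is an arbitrary rooting.
    half = d[0][1] / 2.0
    return "(%s:%.4f,%s:%.4f);" % (nodes[0], half, nodes[1], half)
```

The resulting Newick string can be loaded into any tree viewer. The tree is unrooted in principle, so the position of the root in the output carries no meaning.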
Choice of the partition

The choice of the partition is crucial: some partitions that do not correspond to word counting give worse results than the 20x20 partition above.

Indeed, a partition created with 20x20 non-regular rectangles grouped into 8 equiprobable sets (see picture below) gives the tree on the right. Although Bacteria are well separated from Eukaryotes, many Archaea are mixed with other species.
Picture of the partition
Taxonomy tree obtained with another 20x20 partition
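
A partition into non-regular rectangles can be sketched by cutting each axis at unevenly spaced positions and assigning every resulting rectangle to one of 8 sets at random. How the rectangles were actually constructed is not described here, so the random cut positions and the names `irregular_partition` and `zone_of` are assumptions made for this example.

```python
import bisect
import random


def irregular_partition(grid=20, zones=8, seed=0):
    """Cut the unit square into grid x grid rectangles of uneven size and
    assign each rectangle to one of `zones` sets uniformly at random."""
    rng = random.Random(seed)
    # Interior cut positions along each axis; random placement is an assumption.
    x_cuts = sorted(rng.random() for _ in range(grid - 1))
    y_cuts = sorted(rng.random() for _ in range(grid - 1))
    labels = [[rng.randrange(zones) for _ in range(grid)] for _ in range(grid)]
    return x_cuts, y_cuts, labels


def zone_of(point, x_cuts, y_cuts, labels):
    """Return the zone containing a point of the unit square."""
    x, y = point
    col = bisect.bisect_right(x_cuts, x)  # index of the column holding x
    row = bisect.bisect_right(y_cuts, y)  # index of the row holding y
    return labels[row][col]
```

Counting the CGR points of a sequence per zone with `zone_of` then gives a frequency vector of length 8, which can be compared between sequences as with the regular partitions.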
With another partition, composed of 20x20 regular sub-squares distributed into 8 equiprobable sets (see picture below), we obtain the other tree on the right, in which Archaea and Eukaryotes are isolated but some Bacteria remain mixed with other groups.
Picture of the partition
Taxonomy tree obtained with yet another 20x20 partition