Tools overview

Here is a non-exhaustive list of the programs included in MyCGR, with the main features of each.

Manipulating sequences

mycgr_seq.x is used to

  • generate i.i.d. or Markovian sequences, according to a distribution and a length given as parameters,
  • compute the empirical frequencies of nucleotides in a sequence,
  • cut a sequence into several sub-sequences of a given length.
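
The three operations above can be sketched in a few lines of Python. This is only an illustration of the ideas, not the code of mycgr_seq.x: the function names, the order-1 Markov model, and the choice to drop an incomplete final sub-sequence are all assumptions made for the example.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def iid_sequence(length, probs, rng):
    """Draw `length` independent letters; probs[i] is the weight of ALPHABET[i]."""
    return "".join(rng.choices(ALPHABET, weights=probs, k=length))

def markov_sequence(length, initial, transition, rng):
    """Order-1 Markov chain (assumed model): transition[a] holds the
    weights of the next letter given that the current letter is a."""
    if length == 0:
        return ""
    seq = rng.choices(ALPHABET, weights=initial, k=1)
    while len(seq) < length:
        seq.append(rng.choices(ALPHABET, weights=transition[seq[-1]], k=1)[0])
    return "".join(seq)

def frequencies(seq):
    """Empirical frequency of each nucleotide in a non-empty sequence."""
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in ALPHABET}

def cut(seq, size):
    """Consecutive sub-sequences of exactly `size` letters
    (an incomplete tail is dropped, which is an assumption)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
```

For instance, `cut("ACGTACGTAC", 4)` keeps the two complete blocks `"ACGT"` and `"ACGT"` and discards the trailing `"AC"`.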

Computing the CGR

mycgr_square.x performs various computations on the points of the CGR in the square:

  • compute and display the coordinates of the points in the CGR, from a given sequence,
  • count the points of the CGR in various given zones, from a given sequence,
  • compute various test statistics.
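
As a rough sketch of the first two items, the CGR in the square can be computed by moving, for each letter, halfway from the current point towards the corner assigned to that letter. The corner assignment, the starting point at the center, and the rectangular zone format below are assumptions chosen for the example; they are not taken from mycgr_square.x.

```python
# Assumed corner assignment (a common convention, not confirmed for MyCGR).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr_points(seq, start=(0.5, 0.5)):
    """Coordinates of the CGR points of `seq`, one point per letter."""
    x, y = start
    points = []
    for letter in seq:
        cx, cy = CORNERS[letter]
        # Move halfway from the current point to the letter's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points

def count_in_zone(points, x_min, x_max, y_min, y_max):
    """Number of points in the half-open rectangle [x_min, x_max) x [y_min, y_max)."""
    return sum(1 for x, y in points
               if x_min <= x < x_max and y_min <= y < y_max)
```

With these assumptions, the sequence "AG" gives the points (0.25, 0.25) and (0.625, 0.625), and only the second one falls in the upper-right quarter of the square.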

Various options modify the behavior of the program. For example, one option runs the tests on already existing sequences instead of regenerating sequences for each simulation, which makes it possible to compare the various tests by applying them to the same sequences. One can also set cache parameters in order to optimize computations in some cases. The programs mycgr_segment.x and mycgr_tetra.x offer the same functionalities for the points of the CGR on the segment and in the tetrahedron.

Distances

mycgr_square_dist.x computes distances between sequences by generalizing the dinucleotide relative abundance profile to the CGR. Several options control the computation of these distances; in particular, one can choose to use the absolute values or the squares of the differences. The distances can also be grouped by species. The results can be generated in the Graphviz, Newick [1], PHYLIP, or LaTeX formats. Computations on several hundred long sequences, with partitions composed of many zones, take time. To address this, the program can run them in parallel on several machines, provided all the files are placed on a shared file system (NFS for example). The cluster of INRIA Rocquencourt was used in this way to divide the time taken by long computations by 15. A database can also be used to store intermediate results more efficiently. The programs mycgr_segment_dist.x and mycgr_tetra_dist.x offer the same functionalities for the CGR on the segment and in the tetrahedron.
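
To make the choice between absolute values and squares concrete, here is a minimal sketch of a distance between two profiles, where a profile is assumed to be a list with one value per zone of the partition. The averaging over zones mirrors the usual dinucleotide relative abundance distance; the exact profile definition and normalization used by mycgr_square_dist.x are not given here, so both are assumptions.

```python
def profile_distance(p, q, squared=False):
    """Mean absolute (or squared) difference between two zone profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must be built on the same partition")
    if squared:
        diffs = [(a - b) ** 2 for a, b in zip(p, q)]
    else:
        diffs = [abs(a - b) for a, b in zip(p, q)]
    return sum(diffs) / len(p)

def distance_matrix(profiles, squared=False):
    """Pairwise distances; `profiles` maps a sequence name to its profile."""
    names = sorted(profiles)
    return {(a, b): profile_distance(profiles[a], profiles[b], squared)
            for a in names for b in names}
```

A matrix of this kind is the usual starting point for the tree formats mentioned above, such as Newick or PHYLIP.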

Drawing

mycgr_square_draw.x draws the CGR in the square. It reads the coordinates of the points to draw on its standard input (these coordinates are generated by the program mycgr_square.x). Various options make it possible to:

  • display a grid of a given size,
  • draw a partition defined in a given file,
  • show the construction of points with arrows,
  • hide coordinates and/or letters corresponding to the corners,
  • display, instead of points, the frequencies of points in sub-squares of a given size. The color of each sub-square is darker when the frequency is higher.
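
The last option can be illustrated by the following sketch, which bins points into a regular grid of sub-squares and maps each frequency to a gray level. The function names and the linear gray scale are assumptions; the value convention (0 for black, 1 for white) is the one used by PostScript's setgray operator.

```python
def frequency_grid(points, cells_per_side):
    """grid[row][col] = fraction of points falling in that sub-square.
    Row 0 is the bottom row (y close to 0)."""
    n = cells_per_side
    grid = [[0.0] * n for _ in range(n)]
    for x, y in points:
        col = min(int(x * n), n - 1)  # clamp points lying exactly on the edge x = 1
        row = min(int(y * n), n - 1)
        grid[row][col] += 1 / len(points)
    return grid

def gray_level(freq, max_freq):
    """Gray value in [0, 1]: the higher the frequency, the darker the cell."""
    if max_freq == 0:
        return 1.0
    return 1.0 - freq / max_freq
```

For example, with four points of which two fall in the lower-left sub-square of a 2 x 2 grid, that cell has frequency 0.5 and is drawn black, while an empty cell stays white.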

The generated files are in PostScript or Encapsulated PostScript (EPS) format.

Tests of the structure of sequences

mycgr_square_test_markov.x is used to empirically evaluate the level and power of the tests of Markovian structure of order m. As parameters, one can choose the number of experiments, the partitions, the lengths and the types (i.i.d., Markovian, Markovian mixed) of the sequences that one wants to test. One can also run these simulations with the Bonferroni method [2]. The results are generated in the LaTeX format. Equivalent programs exist for the CGR on the segment (mycgr_segment_test_markov.x) and in the tetrahedron (mycgr_tetra_test_markov.x).
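
As a reminder of what is being estimated: the empirical level is the fraction of rejections over sequences simulated under the null hypothesis, and the empirical power is the same fraction over sequences simulated under an alternative. The sketch below shows this together with the classical Bonferroni rule for combining several tests (for example one per partition). It assumes that each individual test yields a p-value; the procedure actually implemented in the program may differ.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Classical Bonferroni rule: reject the global null hypothesis
    if at least one p-value is at most alpha / (number of tests)."""
    return min(p_values) <= alpha / len(p_values)

def empirical_rejection_rate(decisions):
    """Fraction of experiments in which the test rejected.
    Under the null this estimates the level, under an alternative the power."""
    return sum(decisions) / len(decisions)
```

With two tests at alpha = 0.05, the p-values 0.04 and 0.03 do not lead to rejection, since both exceed 0.025.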

mycgr_square_vn.x empirically evaluates the level and power of the independence test. One chooses the number of experiments, the partitions, the lengths and the type (i.i.d., Markovian, Markovian mixed) of the sequences to test. Equivalent programs exist for the CGR on the segment (mycgr_segment_vn.x) and in the tetrahedron (mycgr_tetra_vn.x).

Construction of zones and partitions

mycgr_square_zones.x generates files of zones or partitions on the unit square. These can be zones corresponding to words, or to regular or random subdivisions of the square. One can also generate random rectangles and circles. To build a partition, it is also possible to define one of the zones as the complement of the others. Finally, a partition can be defined by cutting the square regularly or randomly into a multitude of zones and then grouping them into N sets, possibly in a non-equiprobable way. Equivalent programs exist for the CGR on the segment (mycgr_segment_zones.x) and in the tetrahedron (mycgr_tetra_zones.x).
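
A zone "corresponding to a word" can be understood as follows: every sequence ending with a given word of length k has its last CGR point in the same sub-square of side 2^-k. The sketch below computes that sub-square. It assumes the corner assignment shown in the code and a (x, y, side) output format, neither of which is taken from the zone files of mycgr_square_zones.x.

```python
# Assumed corner assignment (a common convention, not confirmed for MyCGR).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def word_zone(word):
    """Sub-square containing the CGR point of any sequence ending with `word`.
    Returns (x, y, side), where (x, y) is the lower-left corner."""
    x, y, side = 0.0, 0.0, 1.0
    for letter in word:
        cx, cy = CORNERS[letter]
        # Each letter maps the current square halfway towards its corner.
        x, y, side = (x + cx) / 2, (y + cy) / 2, side / 2
    return x, y, side
```

Under these assumptions, `word_zone("AG")` is the square with lower-left corner (0.5, 0.5) and side 0.25.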

Coherent use, naming convention

Because of the many experiments made on various sizes of sequences, with various zones and various methods to compute the distances, some naming conventions for the files became necessary. All the files are thus placed in a directory tree whose root (meta_root) is a compile-time parameter. The files are then organized in the following way:
meta_root/sequences contains the original sequences.
meta_root/size contains the sequences of size size extracted from the original sequences.
meta_root/cache contains the cache files.
meta_root/zones/segment contains the files of partitions on the segment.
meta_root/zones/square contains the files of partitions in the square.
meta_root/zones/tetra contains the files of partitions in the tetrahedron.
meta_root/results/size/dists contains the result files of distance computations with sequences of size size. In this directory, the names of the files follow a naming convention that indicates the method used to compute the distances, the partition file used, whether the reverse-complement sequence was appended to the sequences, and whether the CGR was on the segment, in the square or in the tetrahedron.
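
The directory layout above can be summarized by a small helper that builds each path from the root and the sequence size. The helper and its dictionary keys are hypothetical and only restate the tree described above; the naming convention of the individual result files is not reproduced.

```python
from pathlib import PurePosixPath

def layout(meta_root, size):
    """Directories of the MyCGR tree for a given root and sequence size."""
    root = PurePosixPath(meta_root)
    return {
        "sequences": root / "sequences",        # original sequences
        "sized_sequences": root / str(size),    # sequences of the given size
        "cache": root / "cache",
        "zones_segment": root / "zones" / "segment",
        "zones_square": root / "zones" / "square",
        "zones_tetra": root / "zones" / "tetra",
        "dists": root / "results" / str(size) / "dists",
    }
```

For a root of /data/mycgr (an example value) and sequences of size 1000, the distance results would be located in /data/mycgr/results/1000/dists.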

The program mycgr_meta.x is used to launch the other MyCGR tools with the correct options and filenames, so that the naming conventions are respected. This simplifies the commands needed to use the tools and places the result files in the correct directories.

mycgr.x is a graphical interface that gives access to the main functionalities of the other tools while respecting the naming conventions of the files:

  • handling of original sequences and extraction of smaller sequences to use them in simulations,
  • display of the CGR in the square for a given sequence or a given partition file; one can also merge the two representations and then visualize the frequency of points in the zones of a partition; the user can save the resulting image in a file,
  • handling of the files of zones defined for the segment, the square and the tetrahedron,
  • browsing of the results already obtained and display of each file with the appropriate tool (which can be configured),
  • computation of distances between species, for a given length of sequence and other parameters.

Screenshots of mycgr.x:

[MyCGR screenshot] [MyCGR screenshot]

[1] In particular, this is the input format of the program NJPLOT, used to draw unrooted trees.
[2] Y. Baraud, S. Huet, and B. Laurent. Adaptive tests of linear hypotheses by model selection. Ann. Statist., 31(1):225-251, 2003. ISSN 0090-5364.