First steps

This page guides the first steps of the user of MyCGR.

We call meta_root the directory containing the files following the naming convention. We suppose that the directory containing the programs mycgr[...].x is in the PATH environment variable of the user, so that he can launch them directly without having to indicate this directory.

Sequences

A sequence is represented by a succession of ASCII characters. Blanks (white spaces, carriage return, new line, tabulation) are ignored. The following characters are accepted:

  • a, A or 1 for adenine,
  • c, C or 2 for cytosine,
  • g, G or 3 for guanine,
  • t, T or 4 for thymine.

Other characters cause an error.

For the moment, the internal representation of the sequences is an array of integers. This can be a problem if we must load a whole sequence bigger than the maximum size allowed for an array of integers in OCaml (a little more than 4.000.000 on a 32 bits architecture). It is sufficient for our applications, except in some cases (for example the generation of sequences to build the CGR-trees) where another representation with larger but slower arrays was developed (using the Bigarray module of the OCaml standard library).

The sequences handled are either retrieved from the databases of sequenced genomes then put in the format indicated above, or generated according to a given law.

The tool mycgr_seq.x is used to manipulate files of sequences. Here are some examples.

The command below generates an i.i.d. sequence of length 10000 and of law (pA,pC,pG,pT)=(0.1,0.2,0.3,0.4); the sequence is stored in the file file.seq instead of being printed in the standard output (the screen):

mycgr_seq.x --iid 10000 --probs 0.1,0.2,0.3,0.4 -o file.seq

To generate a markovian sequence or order 1 and of transition matrix (pAA,pAC,pAG,pAT,pCA,pCC,pCG,pCT,pGA,pGC,pGG,pGT,pTA,pTC,pTG,pTT)= (0.1,0.2,0.3,0.4,0.12,0.28,0.3,0.3,0.15,0.35,0.1,0.4,0.1,0.2,0.3,0.4), we would rather do:

mycgr_seq.x --markov 1,10000 \
  --mprobs 0.1,0.2,0.3,0.4,0.12,0.28,0.3,0.3,0.15,0.35,0.1,0.4,0.1,0.2,0.3,0.4 -o file.seq

We can also generate a sequence with a random law, for example a markovian sequence of order 2:

mycgr_seq.x --markov 2,10000 --random-probs -o file.seq

It is also easy to generate N sequences in a row in distinct files with a given prefix, here foo. The following command generates 100 files foo_1, foo_2, ..., foo_100, each one containing an i.i.d. sequence of length 5000 with a different random law:

mycgr_seq.x --iid 5000 --random-probs --number foo,100

From empirical frequencies of letters and words in a sequence, we can compute a law as if the sequence was i.i.d. or markovian. The following command generates an i.i.d. sequence with the law (0.35,0.15,0.1,0.4) and gives the sequence in input to a second call to mycgr_seq.x to get the frequencies:

mycgr_seq.x --probs 0.35,0.15,0.1,0.4 --iid 10000 | mycgr_seq.x --get-probs

In the same way, it is possible to get a transition matrix corresponding to a markovian sequence or order 3, once again from an i.i.d sequence:

mycgr_seq.x --probs 0.35,0.15,0.1,0.4 --iid 10000 | mycgr_seq.x --get-mprobs 3

... or using a sequence stored in a file file.seq:

mycgr_seq.x -f file.seq --get-mprobs 3

Another useful functionality is cutting a big sequence in several smaller sequences, for example after having retrieved a long squence from a database on the net, to do some computation on pieces of this sequence. The following command reads the sequence from the file bigseq.seq and cut it in 20 sequence of size 20000, storing them in the files bigseq_part-0.seq, ..., bigseq_part-19.seq:

mycgr_seq.x -f bigseq.seq --isolate bigseq_part,20000,20000 -n 20 -v

The option -v sets the verbose mode on. The parameters of the option --isolate are the prefix of the generated files, the minimal and maximal length of isolated sequences. This can be useful when a retrieved sequence contains invalid characters (for example U to indicate several possibilities for a nucleotide); if we have already found the minimal length when we encounter this character, we generate a file, else we start from zero after the invalid character.

Computing the CGR

In the rest of this guide, we describe the commands corresponding to the Chaos Game Representation in the square. Equivalent commands exist for the CGR on the segment or in the tetrahedron, by replacing square by segment or tetra in the command names.

The coordinates of the points of the CGR in the square for a sequence given in the file file.seq are printed with the following commande:

mycgr_square.x -f file.seq

Please note that the version above reads the whole sequence, compute all coordinates then print them. This can be problematic for big squences (cf. Sequences). The option --stream can be used to compute the coordinates on the fly, i.e. the program reads one nucleotide, compute and print the coordinates of the point, then continue with the next nucleotide. Coordinates are not conserved (except for the previous point, of course).

mycgr_square.x -f file.seq --stream
Drawing the CGR

From coordinates generated by mycgr_square.x, the program mycgr_square_draw.x draws the points of the CGR in the square. So, to create a PostScript file file.ps containing the picture of the CGR in the square for the sequence in the file file.seq, we will do:

mycgr_square.x -f file.seq | mycgr_square_draw.x -o file

Several options of mycgr_square_draw.x allow to change the size of the picture, the font, to display of not the corner coordinates and corner letters, print a view with frequencies by sub-squares, ...

Zones and partitions

To carry out the tests of structure and to calculate the distances between sequences, we use partitions of the square, segment or tetrahedron. A partition is a list of zones, each zone of the partition being a list of simple zones.

The partitions are defined by some text with a specific syntax. A simple zone can be:

  • a rectangle, defined by the coordinates of its bottom-left and top-right corners, in the following way: [(x1,y1);(x2,y2)],
  • a circle, defined by its center and its radius, with the syntax C((x,y),r),
  • the intersection of two simple zones, noted zone1&zone2,
  • a predicate, indicated by the name of a predefined function and an optional sequence. To know if a point satisfies the predicate, we apply the reversed CGR of the sequence on the point and check whether the function applied to the coordinates returns true or false. The predefined function are not documented yet; in particular, they are based on sinus or lines, but we can imagine a lot of other constructions. The syntax is the following: name_of_function(), or name_of_function(sequence). For example: sin1(AC),
  • the complementary of a predicate, as indicated above but with a '-' before, for example: -sin2(CGT).

A partition zone is a list of simple zones separated by commas, on a single line. The zone can be named, by putting this name at the beginning of the line, followed by a column. For example: with Z1:[(0.0,0.0);(0.1,0.2)],C((0.5,0.5),0.4)&[(0.0,0.5);(1.0,1.0)], we define a zone "Z1" composed of two simple zones, the first being a rectangle of coordinates [(0.0,0.0);(0.1,0.2)], and the other the intersection of the circle of center (0.5,0.5) and radius 0.4 and the rectangle [(0.0,0.5);(1.0,1.0)]. A point will be in the zone "Z1" if it belongs to any of the two simple zones composing "Z1" (so the comma between two simple zones is the union of these simple zones).

At last, a partition is a list of partition zones, each one on a different line. The user must be sure that these zones form a partition (no overlapping, in particular). Indeed, the methods of tests and distances used are defined for partitions. The user can add an "empty" zone of the form name: (with no simple zone) to indicate that this partition zone is the complementary of the others. Only one complementary zone is accepted by partition.

Example

We create the file example_partition.txt with the following contents:

Z1:[(0.0,0.0);(0.2,0.7)]
Z2:C((0.6,0.5),0.35)
Z3:[(0.1,0.8);(0.3,1.0)]&C((0.1,0.9),0.1)
Z4:

The following command generates an Embedded-PostScript file example.ps containing the picture of this partition in the square. The square has 150 pixels of side, the coordinates of letters and corners are not displayed, and we use 100x100 points to display the zones. The colors of the zones being randomly assigned, the user can launch the command again to get other colors if the result is not satisfying.

echo -n | mycgr_square_draw.x -o example --guess-draw-zones 100 -I example_partition.txt \
  -b 150 --hide-coords --hide-letters --eps
This gives the following image:
Picture of partition
Generating partitions

It is possible to automatically generate zones or partitions, with the program mycgr_square_zones.x. Various options can be used to generate random or regular partitions. For example, this command generates a file random.txt containing a random partition composed of 5x5 zones:

mycgr_square_zones.x --random 5 -o random.txt

Use the option --help of mycgr_square_zones.x to get the list of supported options.

Zones and partitions on the segment and in the tetrahedron

What differs from the square is how simple zones are defined.

On the segment, the simple zones can only be intervals in the form [x1;x2].

For the moment in the tetrahedron, the simple zones can only be smaller tetrahedrons, defined by their four vertices with the following syntax: [(x1,y1,z1);(x2,y2,z2);(x3,y3,z3);(x4,y4,z4)].

Tests of structure

The simulations for the new family of tests introduced in the thesis of Peggy CÚnac are carried out with the programs mycgr_square_vn.x (independence test) and mycgr_square_test_markov.x (markovian dependence test for a given order). Here are some usage examples.

The following command performs the test of independence on 1000 generated i.i.d. sequences with a random law of generation for each one. For these sequences, we compute the statistics of the independence test and we accept or reject each sequence with a quantile corresponding to a level of acceptance of 5%. We carry out these calculations for various sizes of sequences (500,1000,2000,5000 and 10000). At last, we carry out these tests for 3 different partitions which are in the files zones_pearson.txt, zones_square_9_random.txt and zones_square_9_equi.txt. The result is added at the end of the file test.tex.

mycgr_square_vn.x --nb 1000 --lengths 500,1000,2000,5000,10000 --gen-order 0 \
   -I zones_pearson.txt -I zones_square_9_random.txt -I zones_square_9_equi.txt \
   -O test.tex

We can carry out the same test but generating markovian sequences of order 1:

mycgr_square_vn.x --nb 1000 --lengths 500,1000,2000,5000,10000 --gen-order 1 \
   -I zones_pearson.txt -I zones_square_9_random.txt -I zones_square_9_equi.txt \
   -O test.tex

The contents of the file test.tex can be included in a LaTeX document and compiled. We obtain results in this form.

To carry out the test of markovian structure of a given order, we use in a smiliar way the program mycgr_square_test_markov.x. For example, to test if sequences are markovian of order 1 by generating markovian sequences of order 1:

mycgr_square_test_markov.x --order 1 --nb 1000 --lengths 500,1000,2000,5000 \
   --gen-order 1 \
   -I zones_pearson.txt -I zones_square_9_random.txt -I zones_square_9_equi.txt \
   -O test.tex

One last example, generating mixed markovian sequences:

mycgr_square_test_markov.x --order 1 --nb 1000 --lengths 500,1000,2000,5000 \
   --mixed-markov 1,3 \
   -I zones_pearson.txt -I zones_square_9_random.txt -I zones_square_9_equi.txt \
   -O test.tex
Higher level programs

mycgr_meta.x and mycgr.x are "higher-level" programs because they given access to the functionalities of the other programs in a simplified way and respect the naming convention of files and directories.

During the compilation of MyCGR, the default directory meta_root was set. We can indicate another one at runtime through the environment variable MYCGR_META_ROOT.

The first thing to do before using the two tools above is to create the directory tree under meta_root. This is achieved with the following command:

mycgr_meta.x --build-root -v

The directories created are printed. Then it is possible to put in the directory meta_root/sequences files containing sequences retrieved from genome banks. Only the files with the extension .seq are considered as sequences in this directory. Once it is done, we can cut these sequences in smaller sequences to use them to compute distances between species (cf. Taxonomy). The following command cuts "original" sequences in sequences of 100000 nucleotides, and creates only 5 sub-sequences for each original file (-v sets the verbose mode on):

mycgr_meta.x --add-size 100000,5 -v

Using the graphical interface of mycgr.x is easier and allows to browse the directories under meta_root. This program requires a MySQL database to store some information to manipulate CGR-trees (cf. Chapter 3 of the thesis). By default, the database that mycgr.x connects to is mycgr on localhost, with the user login as database user name. Ths options -d, -h and -u can be used to specify a different database name, machine or user. To create the database mydb and be able to access it as user myuser from the machine localhost, we must connect to the MySQL database as administator and enter the following commands:

create database mydb;
grant all privileges on mydb.* to myuser@localhost;

Please consult the MySQL documentation for details.

Once the database is set up, we must create the required tables:

mycgr_meta.x --create-db [-h ...] [-u ...] [-d ...]

Then, we can launch mycgr.x to use the graphical interface and manipulate the sequences, zones, results and CGR-trees. Let's see how to create taxonomy trees in next section.

Taxonomy

We will create a taxonomy tree of species from sequences available in meta_root/sequences.

The names of the files containing original sequences are important, because they are used when cutting these sequences in smaller ones (we append -n with n being a number assigned while cutting sequences). Then while grouping by species the distances between sequences, we use the filename to know what sequences are parts of the same species. From a filename, the species name if retrieved in the following way: We look in the name for the first character which is not a letter or '-' and we isolate the part before (minus the final '-') to use as the species name. For example, from the filenames Euk-mae-homsa1-0.seq and Bac-p-braj-0.seq, the species names are respectively Euk-mae-homsa and Bac-p-braj.

We start by cutting the original sequences available in meta_root/sequences to create 5 sequences of length 100000 from each original file. To do so, we use the menu Sequences/Add size and we enter 100000 as length and 5 as maximum number of files create from each original sequence.

Then, we must define a partition of the square, if we want to compute the distances with the CGR on the square. To be completed for the creation of zones from the interface. By now, files must be added by hand in the correct directory.

Now we can launch the computation with the menu Dists/Cenac dists. In the dialog box which appears, we indicate

  • the length of sequences to use (the one we just added, for example),
  • the CGR method (here it is the square),
  • the partition file,
  • whether we want to add to sequences, during the computation, their reversed complementary,
  • the method of distance computation, for example AbsRn.

Now we validate and wait. When the computation is done, the results are available in the "Results" tab, in the chosen length. A double-click on a result file opens it in the appropriate application. Please note that we may need to use the menu "View/Refresh" to display the last results files created.