Building Trees with calcdist, qclust, and drawtree
7-Oct-1998
The latest version of this document is at
ftp://gause.biology.ualberta.ca/pub/jbrzusto/trees/treedoc.html
Documentation on a utility for converting data into Cornell Condensed Format
is at ftp://gause.biology.ualberta.ca/pub/jbrzusto/trees/toccfdoc.html
Bugs and changes are listed at the end of this document.
drawtree is from Joe Felsenstein's Phylip package, available at
http://evolution.genetics.washington.edu/phylip.html
These programs perform the same functions as the clustering web page at http://www.biology.ualberta.ca/jbrzusto/cluster.html, which also documents the distance measures and clustering methods.
Getting the programs:
ftp (with userid anonymous) to gause.biology.ualberta.ca and change directory to:
/pub/jbrzusto/trees/dos32 for 32-bit DOS versions (these run on a '386 or higher, with or without a math processor, in DOS or the DOS box of Windows 3.1 or '95)
/pub/jbrzusto/trees/source for ANSI-C source code and makefile, for recompiling on your machine
You will need to obtain all files from the appropriate directory. Copy them into a single directory on your machine. Make sure you transfer files in the appropriate mode: binary mode for .exe files, and ASCII mode for all other files.
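If you would rather script the download than type ftp commands by hand, the following is a minimal sketch using Python's standard ftplib module. The helper names is_binary and fetch_all are made up for this illustration, and the sketch assumes the server still accepts anonymous logins and that the directory contains only plain files.

```python
from ftplib import FTP
import os

HOST = "gause.biology.ualberta.ca"
REMOTE_DIR = "/pub/jbrzusto/trees/dos32"


def is_binary(filename):
    """Only .exe files need binary mode; everything else goes as ASCII."""
    return filename.lower().endswith(".exe")


def fetch_all(host=HOST, remote_dir=REMOTE_DIR, local_dir="."):
    """Download every file in remote_dir, choosing the mode per file.

    Assumption: nlst() returns bare file names and no subdirectories.
    """
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            path = os.path.join(local_dir, name)
            if is_binary(name):
                with open(path, "wb") as out:
                    ftp.retrbinary("RETR " + name, out.write)
            else:
                with open(path, "w") as out:
                    # retrlines strips line endings, so add one back
                    ftp.retrlines("RETR " + name,
                                  lambda line: out.write(line + "\n"))
```

Calling fetch_all() with no arguments would copy the DOS versions into the current directory; pass remote_dir="/pub/jbrzusto/trees/source" for the source code instead.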
All commands for using these programs are typed at the DOS command prompt, so if you are using Windows, you must first open a DOS window, and change to the directory into which you downloaded the programs.
If you want to run these programs in plain DOS (i.e. not within a Windows DOS box), you must have CWSDPMI.EXE in the directory from which the programs run, unless you are already running a DPMI server. This driver gives programs access to extended memory.
Getting a text file:
Export your data matrix from a spreadsheet in text or ASCII format, or use a text editor like EDIT or NOTEPAD to create the file from scratch.
Making a distance matrix:
You start with one or more text files containing a matrix, for example the one in the file called sample2.txt, that comes with the programs:
OTU1 1 1 2 0 3 4
OTU2 0 2 0 3 0 1
OTU3 2 0 0 1 1 0
OTU4 0 3 3 0 2 0
OTU5 3 1 0 1 0 1
In this matrix, the rows represent items between which you want distances calculated. To compute the Euclidean distance between all pairs of rows in this file, you would type:
calcdist 5 6 -nr -r -d0 sample2.txt > dist2.txt
The parameters following the command calcdist mean the following:
5 6
the input matrix has 5 rows and 6 columns, not counting names. In sample2.txt, there are 7 columns, but the first is a column of names.
-nr
the input file has names of rows; these names are the first item in each row. If your file doesn't have names for rows, don't specify this option; rows will be assigned numbers, if necessary. If your columns have names, then specify -nc instead (or as well, if both rows and columns are named).
-r
the rows are the elements between which distances are to be calculated. Your input matrix can also be arranged to have the columns be these elements, in which case you should not specify this option.
-d0
the distance measure to use is the Euclidean distance. You can get a list of the other possible measures by running calcdist -h
sample2.txt
the name of the input file. If you specify several names, separated by spaces, then calcdist will paste these files together horizontally. That is, if your matrix is very wide and you export it piece-by-piece as several text files, calcdist will reassemble your matrix for you (make sure to put the names of the files in the correct order, left to right). For example, the 3rd row in your matrix will come from the collection of 3rd lines in your input files.
> dist2.txt
this sends the output from calcdist into the file dist2.txt. The output is a lower triangular matrix of distances between all pairs of rows in the input.
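To make the calculation concrete, here is a small Python sketch of what the command above computes: the Euclidean distance between every pair of named rows, arranged as a lower triangular matrix with the zero diagonal included. It is an illustration only, not calcdist's own code; the function names are invented here, and the exact layout calcdist uses for names and numbers in its output is not reproduced.

```python
import math

# The contents of sample2.txt, as shown above
SAMPLE = """\
OTU1 1 1 2 0 3 4
OTU2 0 2 0 3 0 1
OTU3 2 0 0 1 1 0
OTU4 0 3 3 0 2 0
OTU5 3 1 0 1 0 1
"""


def read_named_rows(text):
    """Split each line into a row name followed by numeric values."""
    names, rows = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


def euclidean(a, b):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def lower_triangle(rows):
    """Row i holds the distances from row i to rows 0..i (diagonal included)."""
    return [[euclidean(rows[i], rows[j]) for j in range(i + 1)]
            for i in range(len(rows))]


names, rows = read_named_rows(SAMPLE)
for name, dists in zip(names, lower_triangle(rows)):
    print(name, " ".join("%.5g" % d for d in dists))
```

For example, OTU1 and OTU2 differ by 1, -1, 2, -3, 3 and 3 in their six columns, so their distance is the square root of 1 + 1 + 4 + 9 + 9 + 9 = 33.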
Building a Tree:
Now that you have a distance matrix (or if you had one to begin with), you can build a tree by running qclust:
qclust -m1 -c2 -i dist2.txt -o out2.txt -t tree2.txt > progress.txt
The parameters following the command qclust mean the following:
-m1
the matrix is in lower triangular format. If you have a square matrix, use -m0, and if you have an upper triangular one, use -m2. The program always expects diagonal zero values to be present in the input.
-c2
the tree-building (i.e. clustering) method to use is UPGMA. If you want to do Neighbour joining, use -c7 (or leave this option out, because that is the default). For a complete list of methods, run qclust -h
-i dist2.txt
the input comes from file dist2.txt. If you leave out this option, the program reads from a file called infile.
-o out2.txt
the output will be put in file out2.txt. If you leave out this option, the program puts output in a file called outfile. This file contains a textual picture of the tree, with topology but not lengths represented. A table of lengths of edges is included.
-t tree2.txt
the drawtree-usable output will be put in file tree2.txt. If you leave out this option, this output is put in a file called treefile.
> progress.txt
the progress of the clustering process is output to the file progress.txt. If you leave out this redirection, progress is displayed on the screen as the program runs. You can specify the -q option to turn off progress output.
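For readers curious about what UPGMA does with the distance matrix, the following Python sketch shows the textbook version of the method. It is not qclust's implementation, and the function names are invented for this illustration. It takes a lower triangular matrix (diagonal zeros included), repeatedly joins the two closest clusters, and returns a Newick-style string of the kind that Phylip tree files use.

```python
def square_from_lower(lower):
    """Expand a lower triangular matrix (with diagonal) to a full square one."""
    n = len(lower)
    return [[lower[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]


def upgma(names, lower):
    """Textbook UPGMA; returns the tree as a Newick-style string."""
    d = square_from_lower(lower)
    # Each active cluster: id -> (tree text, number of leaves, node height)
    clusters = {i: (names[i], 1, 0.0) for i in range(len(names))}
    # Distances between active clusters, keyed as (smaller id, larger id)
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the closest pair of clusters
        (a, b), dab = min(dist.items(), key=lambda kv: kv[1])
        ta, na, ha = clusters.pop(a)
        tb, nb, hb = clusters.pop(b)
        height = dab / 2.0  # the new node sits halfway between a and b
        text = "(%s:%.4f,%s:%.4f)" % (ta, height - ha, tb, height - hb)
        # Distance to the merged cluster is the size-weighted average
        for k in clusters:
            dak = dist.pop((min(a, k), max(a, k)))
            dbk = dist.pop((min(b, k), max(b, k)))
            dist[(k, next_id)] = (na * dak + nb * dbk) / (na + nb)
        del dist[(a, b)]
        clusters[next_id] = (text, na + nb, height)
        next_id += 1
    (text, _, _), = clusters.values()
    return text + ";"


print(upgma(["A", "B", "C"], [[0], [2, 0], [6, 6, 0]]))
```

In the three-item example, A and B are joined first because they are closest, and C is then attached to that pair.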
Drawing a Tree:
The drawtree-usable output, in the file called tree2.txt, can be converted to a variety of graphical file formats (in both versions). For subsequent editing of the image, the best format to use is the MacDraw PICT format. To run this program, simply type
drawtree
This program prompts for options, as explained in the file drawtree.doc. If there is no file called treefile, the program asks you what file to get its input from. In the example above, that would be tree2.txt. This program creates a plot in a file called plotfile. You might have to rename this file to get graphic manipulation programs to recognize it. For example, if you chose PC Paintbrush PCX format for output, you should rename the file as follows:
rename plotfile plotfile.pcx
in order that programs such as PhotoShop and PhotoStyler can recognize it. drawtree is completely documented in its accompanying documentation file, a copy of which is available at ftp://gause.biology.ualberta.ca/pub/jbrzusto/trees/drawtree.txt
Other program options:
calcdist:
-t
items in the input matrix are separated by TAB characters (as is often the case when exporting data from spreadsheets in Text or ASCII format). Blank items will be taken as zeros.
-s
strip names; even if there are names in the input, don't put them in the output matrix
-f0
output distances in a square matrix
-f1
output distances in a lower triangular matrix (the default)
-f2
output distances in an upper triangular matrix
-f3
output a table; each line is a pair of item names, followed by the distance between them
-p.5g
output numbers to 5 significant digits; the characters following the p are a C printf-style format control string. Using -p10.3f means numbers will always take up at least 10 characters, with three digits printed after the decimal point.
-u
the program will use the prefix No., or any other prefix you specify, when printing item numbers; this is so that when the output is used in qclust, the item numbers will be distinguishable from numbers representing steps in the clustering process. If the prefix includes the two characters %d, the item number will be substituted for this, rather than simply appended to the prefix. For example, the option -u (%d) would print item numbers as (1), (2), and so on.
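The two formatting options above can be tried out in Python, whose % operator follows the C printf conventions for these format strings. The item_label function is a hypothetical stand-in that mimics the prefix rule as described; calcdist's exact output may differ in detail.

```python
value = 5.744562646538029  # square root of 33

# .5g keeps five significant digits
print("%.5g" % value)

# 10.3f pads to a width of 10 and keeps three digits after the decimal point
print("%10.3f" % value)


def item_label(number, prefix="No."):
    """Build an item name from its number, following the rule described above."""
    if "%d" in prefix:
        # The item number replaces %d inside the prefix
        return prefix.replace("%d", str(number))
    # Otherwise the number is simply appended to the prefix
    return prefix + str(number)


print(item_label(1, "(%d)"))
print(item_label(2))
```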
qclust:
-gN
makes N groups from the input items by clustering and then chopping off the tree at the height where there will be N leaf nodes. A table of groups and their elements is printed. This allows for a coarse-scale view of the classification (as suggested by Kjersti Aas (kjersti.aas@nr.no))
-d
don't calculate the RMS error for the tree (for Neighbour-joining method only); this prevents the program from doing a calculation that asks for a big chunk of memory, and takes some time. (the program should be able to detect when there is not enough memory to perform this calculation, so you should never need this, but programmers being the way we are, try specifying this option if you get an 'out of memory' error)
-z
for Neighbour-joining, disallow negative edges. Any negative edges are forced to zero, with the negative value added to the adjoining edge.
-n0
the names in the input distance matrix all appear before the matrix
-n1
the names in the input distance matrix appear one per row, before the distance values
-n2
there are no names in the input distance matrix
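As an illustration of the idea behind the -gN option, the sketch below forms N groups by average-linkage (UPGMA-style) clustering that simply stops once N clusters remain. For a method like UPGMA, where merge heights never decrease, this should give the same groups as building the whole tree and cutting it at the matching height. This is an assumption-laden sketch with an invented function name, not the code qclust uses.

```python
def groups_by_average_linkage(names, lower, n_groups):
    """Merge the closest clusters until only n_groups remain.

    lower is a lower triangular distance matrix with the diagonal included.
    """
    def d(i, j):
        return lower[max(i, j)][min(i, j)]

    groups = [[i] for i in range(len(names))]
    while len(groups) > max(n_groups, 1):
        best = None
        for x in range(len(groups)):
            for y in range(x + 1, len(groups)):
                # Average of all pairwise distances between the two groups
                pairs = [(i, j) for i in groups[x] for j in groups[y]]
                avg = sum(d(i, j) for i, j in pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, x, y)
        _, x, y = best
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return [[names[i] for i in g] for g in groups]


example = [[0], [1, 0], [8, 8, 0], [9, 9, 2, 0]]
print(groups_by_average_linkage(["A", "B", "C", "D"], example, 2))
```

In the example, A and B are close to each other, as are C and D, so asking for two groups separates the items along those lines.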
drawtree: (these options are chosen from the menu, not specified on the command line)
5
choose this option to turn off "iterate to improve tree", as this sometimes results in crossed edges
8
increase the "relative character height" to make larger labels on the tree
Other Tools:
The TreeToy utility runs over the Web on Java-enabled browsers (such as Netscape version 2.0 or later). If you have a big tree, you can paste the TREEFILE output from qclust into TreeToy, which lets you selectively shrink or expand subtrees. You can copy the TREEFILE-style descriptions of these reduced trees back out, and run drawtree on them, to print your tree in sections. TreeToy is at: http://www.biology.ualberta.ca/jbrzusto/TreeToy.html
Bugs and changes:
Date reported | Date Fixed/Made | Program | Change/Bug/Fix |
7-Oct-1998 | 7-Oct-1998 | qclust | added the -gN option for printing the tree as N groups |
17-Sept-1997 | 17-Sept-1997 | calcdist | didn't skip over the empty top-left corner cell in a tab-delimited file with row and column names (so the file wasn't read correctly), and an inappropriate (or no) error message was given |
17-Sept-1997 | calcdist | added the -u option (see above) to create item names from numbers | |
21-Oct-1997 | 21-Oct-1997 | calcdist | several of the distance measures bombed if both input vectors were all zero; the distance is now set to 0 in these cases |
Bugs, comments, etc. to John Brzustowski jbrzusto@gpu.srv.ualberta.ca
There are more free programs at http://www.biology.ualberta.ca/jbrzusto/index.html