Review of: T.N.T.—Tree Analysis Using New Technology. Version 1.0, by P. Goloboff, J. S. Farris and K. Nixon. Available from the authors and from http://www.zmuc.dk/public/phylogeny
2004; Wiley; Volume: 20; Issue: 4; Language: English
10.1111/j.1096-0031.2004.00026.x
ISSN 1096-0031
Topic(s): Plant Taxonomy and Phylogenetics
Abstract: TNT, Tree Analysis Using New Technology, is a collaborative project by Steve Farris, author of Hennig86, Pablo Goloboff, author of PeeWee and Nona, and Kevin Nixon, author of Winclada, and can therefore be considered the successor to each of these programs. From Hennig86 it inherits the terse style and the general format of the datafiles, from PeeWee/Nona the general tendency to consider the search for optimal trees as something that can and must be directed by the user, and from Winclada, to a large extent, the graphical interface for the display of trees and character optimizations (though not the data-editor).

Compared to its predecessors Nona and PeeWee, TNT has an impressive array of new features. Some of these are based on the integration of features so far available in separate programs only, some are entirely new, and some are more gradual improvements on existing features. It is obvious that the research interests of the authors have, to a large degree, guided the selection of new features. It should therefore come as no surprise that TNT does not include maximum likelihood as an optimality criterion and performs neither neighbor-joining nor any other phenetic clustering. Some other features are also conspicuously absent, such as support for third-position characters as a separate partition, or an Incongruence Length Difference test.

Among the features that TNT does support, two are immediately obvious. One is the new windowed interface, which does away with the need to remember syntax for commands, parameters and options, and allows the examination and manipulation of trees and character mapping. The other consists of the improvements in tree space exploration, under the heading “New Technology”. These improvements allow the analysis, in a reasonable time, of datasets far larger than was previously possible.
In addition, there are a number of other features integrated into TNT which were absent from the predecessor programs, or available only by way of macro-instruction files.

TNT finds one of its major justifications in its capability to handle large datasets. Investigators facing the more than astronomically large tree spaces for large datasets have been increasingly relinquishing the search for shortest trees in favor of trees composed of well-supported nodes (Farris et al., 1996; see also, e.g., Salamin et al., 2003; Freudenstein et al., 2004), produced by methods such as parsimony jackknifing or quick bootstrap. Alternatively, the search for shortest trees has been stubbornly pursued, aided by algorithmic improvements such as those described by Nixon (1999) and Goloboff (1999). Most of the improvement in efficiency of these methods is due to the abandonment of the principle that each round of branch swapping should start on a tree built from scratch: these methods explore tree space with fewer starting points, but with wider exploration than can be achieved by simple branch swapping, and thus they avoid getting stuck on a single large but suboptimal island. With TNT, the most efficient implementations of many of these approaches are available for the first time, in an easy-to-handle, user-friendly and compact package.

A major improvement has also been made in the scripting language. TNT contains a fully developed programming language, which, apart from the standard TNT commands, gives access to a large number of internal variables used in branch swapping, tree evaluation, tree comparison and character optimization. It allows manipulation of arrays of up to five dimensions, virtually unlimited numbers of user variables, and recursive calls. The macro language is in fact the core of the TNT program, and allows extensions of its possibilities limited only by the user's programming capabilities.
On the downside, of course, it is not an easy language to learn, but to users with any experience in programming it should not be a major obstacle. Sample files with macro instructions are distributed with TNT, which may serve to give some idea of the power of this language and of the way that it can be applied.

Another novelty is that data entry and editing are possible from within TNT. This does not take place in a spreadsheet-like environment (as in Winclada), but by listing all character states for a selected taxon, or all taxa with a selected character state. In principle, it is even possible to start with an empty data matrix and to enter all data without leaving TNT, but this option is primarily useful to look up and correct data entry errors when they become apparent in the course of an analysis. I found the option highly convenient, especially for the possibility of having a quick look at the distribution of a particular character state, but not really a replacement for dedicated spreadsheet-like data editors.

When TNT starts, it opens as a single window, with a menu bar, a configurable toolbar, and a status indication at the bottom. The window itself serves as the display buffer. In standard mode, it echoes commands and results, and can be used to list trees in a character-based display. In addition, a command-line can be opened at the bottom of the window, but as this hides the lower part of the display, it is often easier to open it only when it is needed. Results are primarily written to the display buffer, and it is the user's responsibility to ensure that they are also saved to an output file. From the menu bar, submenus and windows open in a way that sometimes appears a little erratic, but that one gets quickly accustomed to. More intimidating is the way that some of the windows are crammed with text- and checkboxes; but after all, most of the options and parameters must have a place somewhere in a menu.
Despite the appearance of a standard Windows interface, TNT partly uses its own conventions, which differ occasionally from the standard Windows conventions. For example, checking a check-box or radio button to indicate the use of a selection of taxa or characters will often immediately bring up a selection window, but in other cases, there may be a separate selection button to do this. In following its own conventions, TNT is not always consistent. Thus, it is best not to look for deeper meanings behind the different appearances that windows may take: the presence or absence of borders or title bars in windows is erratic rather than dependent on function. Some of the deviations from Windows standards are more annoying: the function of the tool buttons is not displayed as a standard “tooltip”-text, but appears in the status line, and then only when the button is pressed. It takes some agility not to execute the button after having pressed it in order to find out its function.

Despite the presence of a menu-driven mode, the command-driven mode has been retained and is available both to procedure files and by way of a command line accessible from the graphical interface. Using a procedure file is the easiest way to enter data, trees and other complicated commands. All search procedures can be similarly run from a procedure file, but here the use of the menu interface is easier. The menu interface allows (as far as I have been able to ascertain) access to very nearly the same possibilities as the command line, and it is only rarely the case that some operations are more easily performed using commands. Setting constraints is the one example I found which can be performed with more flexibility using commands than using menus. In addition, there is a “batch mode” which stores and executes sequences of the menu choices. However, this batch mode does not allow for interactive menu choices, and can therefore only be used to automate relatively simple tasks.
Most users who are intimidated by command-driven programs will be grateful for the presence of the menus. On the other hand, users who are accustomed to batch mode operations will not lose this ability.

TNT accepts data in a format that is recognizably derived from Hennig86/Nona format, but is only partly backwards compatible, due to various modifications. These include a number of major improvements, such as the ability to read data in interleaved format, or data of mixed type, and some relatively minor ones, such as the inclusion of a separate command specifying the type of data, or the support for taxon names of up to 32 characters, and characters of up to 32 states. Together, the modifications are sufficient to disable the reading of most existing Hennig86 or Nona files. From Hennig86 and Nona, TNT inherits the still somewhat annoying habit of starting the numbering of most entities with 0. While I appreciate how this appeals to programmers, it is not the way most users think about numbers, and the best one can say about it is that one gets used to it. Confusingly, TNT does not consistently follow this convention in the case of numbering data blocks, where the number 0 is reserved and the numbering of the separate blocks starts with 1.

In addition to ordered or unordered multistate characters, TNT accepts step-matrix (Sankoff) characters, and even offers the possibility of entering these in the shape of a character-state tree, either in a procedure file or using the menus. This is by far the most intuitive way that this type of data can be represented and entered, and this facility will no doubt stimulate the use of step matrices. For entering data, there is also a basic compatibility with the Nexus-format, which means that TNT can both read and write plain-vanilla Nexus files. So far, this only works for very simple files, and many of the elements that Nexus is intended to handle, such as comments interspersed with the data, will lead to error messages.
Thus, although it is possible to present data in a way that allows an exchange of data and results between TNT and Nexus-based programs, this is not particularly easy, and requires careful consideration of the limitations of the programs involved. Since, in practice, all existing data files must be revised before they can be read by TNT, a list of the most important differences between the data format required by TNT and the Hennig86, Nona and Nexus formats would have been a welcome addition to the documentation. Procedure files containing instructions for analysis must in all cases undergo major revision, as most of the commands used for performing an analysis have changed from their predecessors. Procedure files can only be run unattended when some elements in the menu interface (such as confirmation windows) are disabled.

TNT implements three basic tree search strategies. First, it can do an exhaustive branch-and-bound search, or implicit enumeration. As we have become used to acknowledging, this option is impractical for datasets of over 20–30 taxa, but it is still there. Secondly, it can do traditional heuristic searches. For neither of these two strategies does the TNT approach differ dramatically from those of other programs. The algorithms for implicit enumeration are the same as those used in Hennig86 or Nona, and heuristic searches are performed using multiple random addition sequences followed by branch swapping. The inclusion of new tree search methods has not made these two methods obsolete: for smallish datasets, an exhaustive search still is the only method that actually guarantees finding the shortest trees, and for the trees found using the newly introduced methods, traditional branch-swapping is still the best way to find more equally parsimonious trees.

Under the catch-all term “New Technology”, search methods are included that have been developed to search for shorter trees in more efficient ways.
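To put a number on why exhaustive search stops being practical somewhere around 20 to 30 taxa, the size of the tree space can be computed directly. The sketch below is not TNT code; it is a small Python illustration (the function name is hypothetical) that counts the fully resolved unrooted trees for n taxa using the standard double factorial (2n − 5)!!. Branch-and-bound does not visit every one of these trees, but the count gives an idea of the space it has to prune.

```python
def unrooted_binary_trees(n_taxa):
    """Number of fully resolved unrooted trees for n_taxa labelled taxa.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    For fewer than three taxa there is only one possible topology.
    """
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty range for n <= 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, unrooted_binary_trees(n))
```

For 10 taxa this gives 2 027 025 trees, and for 20 taxa roughly 2.2 × 10^20, which is why guaranteed-optimal searches are confined to small datasets.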
It is in the implementation of these methods that a real quantum leap in computational speed has been made. The methods incorporated are sectorial search, ratchet (although in a somewhat different implementation than in Nixon's original description), tree drifting, and tree fusing. All of these methods can be used either singly or in combination, or they can be carried out as a “driven search”, where a driver program tracks the results of the analysis and decides on or adapts the next step to be performed. It is possible to set this driver to different levels of “aggressiveness”, which will change the relevant settings in a coordinated way.

To illustrate the power of this New Technology, the distribution package includes the dataset first analyzed by Chase et al. (1993), and later reanalyzed by Rice et al. (1997). In the form distributed (“zilla”) it consists of 500 taxa and 759 characters. Using the “driven” search with standard settings, I found trees of length 16 220 (the optimal trees reported by Rice et al.) for this Zilla dataset within 2 min, and a length of 16 218 (reported to be the shortest length for this dataset) in the next minute of running. Some luck may have been involved here, however: the next tree of length 16 218 was not found for another 5 min. In another case, with a less “clean” dataset of 172 taxa and 69 morphological (often multistate) characters, I found that with a New Technology search, better trees than result from a multiple addition sequence are often quickly found, but that continued searches would lead to a slow, gradual and continuous improvement on these first trees. Thus, the New Technology methods appear to be very effective in finding short trees in what most investigators will be inclined to regard as a reasonable time. However, by their nature, they are not very effective in indicating the relative optimality of this tree.
There is no indication whether the optimal trees represent single most parsimonious trees, or are trees from a large set of equally parsimonious trees. TNT offers three ways of solving this problem. First, one can always simply use the result of the New Technology search as the input tree for a round of standard branch swapping, either with or without the retention of suboptimal trees (for the computation of Bremer supports). Second, it is possible to run the “driven” search until a specified number of optimal trees have been found in independent replications, a very effective way of constructing consensus trees (Nixon, 1999). Lastly, there is the possibility of doing a “quick consensus” estimation using the method of Goloboff and Farris (2001). With poorly decisive datasets, I found all three methods very effective in removing spurious resolution from the initial consensus of the few trees returned from a single search.

One of the problems that will arise for many users will center on the reproducibility of their results. Working with TNT, it quickly becomes clear that results of the New Technology search procedures can be difficult to reproduce. When using the “driven” search, it is impossible to predict at what time and at what stage during which round of one of the four basic methods a tree of optimal length will be found. When taking a more manual approach, it is equally difficult to tell what exact sequence of search commands is most effective and will lead to increasingly better trees. Users who, out of a concern for repeatability, have become accustomed to specifying search parameters in full detail when reporting the results of a search, face two options. Either they continue the practice of reporting all heuristic search settings, up to and including the seed for the random number generator, but that means that, for a basic New Technology search, they would have to specify some 50 parameters. Would anything useful be gained by doing that?
Alternatively, they can simply state that New Technology searches were conducted until the reported tree was found, leaving it to others to find better trees (perhaps using other parameter settings), but this may offend their sensitivities about the repeatability of their work. The problem is, of course, related to the meaning of the word “repeatability”. Working with TNT made me realize that it is futile to specify the exact parameters for a search, and that my belief that “phylogenetic reconstruction is highly repeatable, because cladistic methods are explicit and rigorous” (Lipscomb, 1998) was incomplete, and had been so ever since it became clear that large datasets required heuristic search procedures. The best trees may be found with methods that may be rigorous, but are intrinsically unrepeatable due to the incorporation of random effects in the search algorithm. The only element in the entire procedure that is rigidly repeatable is the optimization algorithm, used to assess the length of a tree. For those who feel that this is all irrelevant, and that the scientific requirement of repeatability of methods should be upheld rigidly, there is of course always the option to start any search with a specified seed for the random generator and to list all search parameters in detail.

One of the limitations of TNT is that it can only carry out parsimony analysis. It is not possible to apply the algorithmic improvements used in TNT to tree search under different optimality settings. It is, however, possible to apply differential weight vectors to characters. TNT accepts user-defined weights and it implements Goloboff's (1993) implied weighting method (even with the extension that a user-defined weighting function can be entered).

TNT offers options to compute strict, majority and combinable components (but not Adams) consensus trees. However, it does not have an option (as in Nona) to save the consensus directly to a tree-file.
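The implied weighting method of Goloboff (1993), mentioned above, can be made concrete with a small worked example. In the usual formulation, each character receives a fit of k / (k + h), where h is its number of extra steps (homoplasy) on a given tree and k is a concavity constant, and trees are compared by their total fit. The Python sketch below is not TNT code: the function names are hypothetical, the step counts are made up, and k = 3 is only an illustrative value.

```python
def character_fit(extra_steps, k=3.0):
    """Fit of one character: k / (k + extra_steps); 1.0 means no homoplasy."""
    return k / (k + extra_steps)


def total_fit(extra_steps_per_character, k=3.0):
    """Sum of character fits for one tree; higher is better under this scheme."""
    return sum(character_fit(h, k) for h in extra_steps_per_character)


# Hypothetical extra-step counts for three characters on two candidate trees.
tree_a = [0, 0, 4]  # all homoplasy concentrated in one character
tree_b = [1, 1, 1]  # less homoplasy in total, but spread over all characters

if __name__ == "__main__":
    print("extra steps:", sum(tree_a), sum(tree_b))
    print("total fit:  ", total_fit(tree_a), total_fit(tree_b))
```

In this made-up case tree_b is the shorter tree under equal weights (3 extra steps against 4), while tree_a has the higher total fit, because the weighting discounts additional steps in a character that is already homoplastic.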
To save a consensus tree, it must be saved to RAM first, in which case it is added to the tree buffer after all the other trees, and there seems to be no simple way after that to isolate the consensus and save it to a file using only the menu options. This appears to be one of the very few points where useful functionality is more easily accessed by using commands than by using the menus. In addition to the standard ways of constructing consensus trees, TNT implements Goloboff and Pol's (2002) method of constructing supertrees. This method differs from the conventional way of computing supertrees, but it is also possible to construct supertrees in the usual way: one of the features of TNT is that any tree can be converted into its MRP-representation at the click of a button.

For assessing branch supports, TNT offers a number of resampling measures, all of which resample or perturb characters, but in different ways. Apart from standard bootstrap and jackknife analysis, the possibilities offered include Poisson weighting and symmetric resampling, methods developed by Goloboff et al. (2003) to overcome a number of perceived disadvantages of the standard bootstrap procedure. Although it is nice to have these resampling measures available, I wonder if they will make as much impact as the improvements in search strategies. The various improvements appear to have a minor impact on the actual support figures: as far as I have been able to ascertain, ordinary bootstrap supports appear fairly heavily correlated with Poisson corrected bootstrap support, and, taking into account the degree of disturbance, also with the support values that can be obtained with symmetric resampling.

Using the graphical display options, characters can be mapped onto trees using any choice of colors, with the added option to map the characters on individual trees, or to view the mappings common to all memory trees (basically, viewing the mappings on the strict consensus).
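The MRP (matrix representation with parsimony) conversion mentioned earlier in this passage is simple enough to sketch. In the usual coding, every clade of a source tree becomes one binary character: taxa inside the clade score 1, other taxa in that tree score 0, and taxa absent from the tree score "?". The Python below is a minimal illustration under those assumptions; the function names and the nested-tuple tree format are chosen only for this example, trees are treated as rooted, and nothing here describes how TNT itself performs the conversion.

```python
def clades(tree):
    """Return (leaves, clade_list) for a tree given as nested tuples of names."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the clade subtended by this internal node
    return leaves, found


def mrp_matrix(tree, all_taxa):
    """Code each informative clade of `tree` as one binary character."""
    tree_taxa, found = clades(tree)
    # Skip the root clade, which contains every taxon in the tree.
    informative = [c for c in found if 1 < len(c) < len(tree_taxa)]
    matrix = {}
    for taxon in all_taxa:
        if taxon not in tree_taxa:
            matrix[taxon] = "?" * len(informative)  # taxon missing from tree
        else:
            matrix[taxon] = "".join(
                "1" if taxon in clade else "0" for clade in informative
            )
    return matrix


if __name__ == "__main__":
    source_tree = (("A", "B"), ("C", "D"))
    for name, row in mrp_matrix(source_tree, ["A", "B", "C", "D", "E"]).items():
        print(name, row)
```

Matrices built this way from several source trees can be concatenated and analyzed like any other character matrix, which is the conventional route to a supertree.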
For this character mapping, TNT is not limited to mapping either under ACCTRAN or DELTRAN optimization. Instead, with one option it displays all ambiguous parts of the trees as such, with another option, all possible different optimizations. While many users may be bewildered by the many possible character mappings, they will at least be forced to consider a number of alternatives. The results can be saved to the text-based display or, in graphical format, to an EMF-file, in which case the tree can be compressed vertically or horizontally, and lines can be made thinner or thicker. I found this to be one of the most attractive features of TNT, and certainly one of the elements contributing to the Wow!-factor. One may be dismayed by the refusal of CorelDraw to load the resulting metafile (a defect of CorelDraw, I am told), but the resulting files are easily displayed in Word or PowerPoint.

Using a slightly different graphical display, it is possible to view and manipulate trees in several ways. Nodes can be shrunk and replaced by clade names, they can be moved around the tree, and they can be examined for character support without leaving the graphical display. By selecting nodes or deselecting nodes, one can make selections of taxa and so build up sets of taxa to in- or exclude, which are not limited to nodes in a tree, but may also be non-monophyletic groups.

TNT has on-line documentation in HTML-format, which deals with some basic issues and gives short explanations of some of the more complex possibilities, but does not contain a point-by-point treatment of all options and possibilities. In addition, there is an internal basic help-system, accessible from the command line, which gives an exhaustive, point-by-point listing of all the commands and their parameters, but does not refer back to menu options. Additional information is included in an introductory and helpful PowerPoint presentation distributed with TNT.
The documentation provided with TNT is thus scattered over three different systems, and in many cases offers only background information coupled with general advice, such as that the time taken to find the best score depends on the aggressiveness of the search. For a program with the complexity of TNT, dealing with subjects as complex as tree search, character optimization and nodal support, it is, of course, virtually impossible to write a comprehensive manual or help system without making it into an advanced course on phylogenetic analysis. Nevertheless, for the (currently) 117 commands, many of which take between 5 and 20 parameters, users will need some guidance for the values of the parameters they have to enter. Such guidance is now difficult to find, or is lacking. Fortunately, for most purposes, simply accepting the default settings will produce reasonably good results. In order to really understand how one can use TNT and interpret the results it provides, it is necessary to go back to the primary literature in which many of the procedures are first described. There is a very helpful list of these references in the HTML-document. Finding out in more detail what TNT can do and how it works is often a journey of exploration, during which the commands and options under investigation must be applied to simple sample files. I expect that in the near future we will see a community effort in collating all the information into a single self-sufficient documentation. Meanwhile, the best way for fledgling users to obtain help and support is probably the Cladsoft mailing list: http://nature.Berkeley.EDU/mailman/listinfo/cladsoft.

TNT is a formidable program. It expands the possibilities for parsimony analysis of large and difficult datasets with a quantum leap, and it will therefore certainly find its way to the high-end users who are working on large scale phylogeny reconstruction.
The presence of a really powerful macro language will appeal to those engaged in simulations or data exploration. However, TNT is not only useful for such heavy-duty tasks: the facilities for character mapping and data editing will also appeal to researchers engaged in the analysis of smaller groups, be it from a molecular or a morphological viewpoint. It will find its way to the desktop of many who are now dependent on shared Mac workstations dedicated to phylogeny analysis and character interpretation. The functionality of TNT may even be a reason for many of the labs now totally committed to Macs to set up a lone Windows station to run it.

TNT is distributed in a crippled form as a shareware program. Installation is simple and straightforward. When you launch TNT, you enter a username, after which you may use it, without full registration or payment, for 10 sessions of up to 10 h each. This restriction is very strictly upheld, and not easily hacked. The necessary uncrippling can be done simply by entering the combination of the username and a password that is obtained upon payment of the registration fee. Ten sessions of 10 h should be enough to convince anyone of the utility of TNT. Full registration is only US$80, and should not present an obstacle.