Building a tree of life with PyCogent

This is a PyCogent use case that I put together for my lecture a few weeks ago in BIOI 7711 at UC Denver. I thought this was a cool example because it quickly shows some powerful aspects of PyCogent, including:

  • parsers for common file formats (MinimalFastaParser);
  • support for common biological data types (alignments, sequences, and trees);
  • interaction with external applications (MUSCLE and FastTree) via the application controller framework;
  • and visualization (via the UnrootedDendrogram object).

The example illustrates, in about 20 or so lines of code, how to apply PyCogent to evaluate the idea that life on Earth clusters into three related domains, detectable by distances between their small-subunit ribosomal genes (i.e., their 16s rDNA sequences). Using sequences collections derived from the Silva database (filtered with cd-hit-est so the max pairwise identity between any two sequences is 90%), I randomly select sequences, align the sequences with MUSCLE, build a tree with FastTree, visualize the tree via a PDF.

To run this example, in addition to PyCogent, you’ll need MUSCLE, FastTree, and matplotlib installed. You can read more about the ideas behind this example in Woese 1987 and Woese 1990.

Let’s get started. I did this with the ipython interpreter, but the standard python interpreter will work fine too.


[email protected] tol_example> ipython

First, we’ll load up the sequences. We’ll select only sequences that are at least 1000 bases long, and contain no N characters. As I mentioned above, the sequences I am importing here come from the Silva database of 16S sequences, and have been filtered with cd-hit-est at the 90% sequence identity. (Skipping the filtering step would probably result in more sequences being required to see the nice clusters.)


> from cogent.parse.fasta import MinimalFastaParser

> arch16s = []
> for seq_id, seq in MinimalFastaParser(open('archaeal_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> arch16s.append((seq_id,seq))
> len(arch16s)
696

> bac16s = []
> for seq_id, seq in MinimalFastaParser(open('bacterial_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> bac16s.append((seq_id,seq))
> len(bac16s)
16826

> euk16s = []
> for seq_id, seq in MinimalFastaParser(open('eukaryotic_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> euk16s.append((seq_id,seq))
len(euk16s)
3902

Import shuffle from the random module so I can extract a random collection of sequences

> from random import shuffle
> shuffle(arch16s)
> shuffle(bac16s)
> shuffle(euk16s)

Take some random sequences from each domain:

> combined16s = arch16s[:3] + bac16s[:10] + euk16s[:6]
> len(combined16s)

Load the combined sequences into a SequenceCollection object. The SequenceCollection object has many useful attributes and methods associated with it. Call dir(seqs) (where seqs is a SequenceCollection object for a listing. getNumSeqs is one method of the SequenceCollection object.

> from cogent import LoadSeqs, DNA
> seqs = LoadSeqs(data=combined16s,moltype=DNA,aligned=False)
> len(seqs)
> seqs.getNumSeqs()

Get an aligner function — here we’ll align with MUSCLE via the MUSCLE application controller. The result of calling align_unaligned_seqs is an Alignment object.

> from cogent.app.muscle import align_unaligned_seqs
> aln = align_unaligned_seqs(seqs,DNA)

Get a tree-building function — here we’ll use FastTree. The result is a PhyloNode object.

> from cogent.app.fasttree import build_tree_from_alignment
> tree = build_tree_from_alignment(aln,DNA)

Next I import a drawing function to visualize the tree.

> from cogent.draw.dendrogram import UnrootedDendrogram
> dendrogram = UnrootedDendrogram(tree)

I can then generate a PDF of the tree, and save it to file:

> dendrogram.drawToPDF('./tol.pdf')

Here’s the final figure:

A quick-and-dirty tree of life, built using PyCogent and 16s rDNA sequences from the Silva database.

A quick-and-dirty tree of life, built using PyCogent and 16s rDNA sequences from the Silva database.

There you have it: about 20 lines of code to reproduce a classic bioinformatics experiment. As we can see the sequences (or tips in the tree), do appear to fall into three distinct clusters. By looking up these sequence identifiers in Silva, we can see that the three clusters match the three domains of life: archaea, bacteria, and eukarya. (We can also see that the number of tips in the clusters match the numbers of sequences we pulled from each of the sequence collections above.)

PyCogent is a very nice toolkit to work with because it lets you ignore the boring and/or frustrating aspects of the study, like parsing files, interacting with the different interfaces associated with different third-party tools, etc., and focus on the fun parts: the experiments. PyCogent is an open-source project with the core development team split between the University of Colorado at Boulder and Australia National University. I’ve been a developer on the project since about 2003.

Leave a Reply