Script for performing intermolecular coevolutionary analysis with the PyCogent Bioinformatics Toolkit

24 November 2009

I’ve posted a script and a basic readme here which allows users of the python bioinformatics toolkit PyCogent to run intermolecular coevolutionary analyses using PyCogent’s built-in coevolution module.

PyCogent’s coevolution module supports several tree-aware and tree-ignorant methods for identifying pairs of coevolving pairs within and between biological sequences. Additionally it contains support for pre- and post-processing coevolution input and results, such as recoding multiple sequence alignments with reduced-state amino acid alphabets. The script included in PyCogent supports intramolecular coevolutionary analysis, while this new script supports intermolecular coevolutionary analysis.

Building a tree of life with PyCogent

3 October 2009

This is a PyCogent use case that I put together for my lecture a few weeks ago in BIOI 7711 at UC Denver. I thought this was a cool example because it quickly shows some powerful aspects of PyCogent, including:

  • parsers for common file formats (MinimalFastaParser);
  • support for common biological data types (alignments, sequences, and trees);
  • interaction with external applications (MUSCLE and FastTree) via the application controller framework;
  • and visualization (via the UnrootedDendrogram object).

The example illustrates, in about 20 or so lines of code, how to apply PyCogent to evaluate the idea that life on Earth clusters into three related domains, detectable by distances between their small-subunit ribosomal genes (i.e., their 16s rDNA sequences). Using sequences collections derived from the Silva database (filtered with cd-hit-est so the max pairwise identity between any two sequences is 90%), I randomly select sequences, align the sequences with MUSCLE, build a tree with FastTree, visualize the tree via a PDF.

To run this example, in addition to PyCogent, you’ll need MUSCLE, FastTree, and matplotlib installed. You can read more about the ideas behind this example in Woese 1987 and Woese 1990.

Let’s get started. I did this with the ipython interpreter, but the standard python interpreter will work fine too.

[email protected] tol_example> ipython

First, we’ll load up the sequences. We’ll select only sequences that are at least 1000 bases long, and contain no N characters. As I mentioned above, the sequences I am importing here come from the Silva database of 16S sequences, and have been filtered with cd-hit-est at the 90% sequence identity. (Skipping the filtering step would probably result in more sequences being required to see the nice clusters.)

> from cogent.parse.fasta import MinimalFastaParser

> arch16s = []
> for seq_id, seq in MinimalFastaParser(open('archaeal_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> arch16s.append((seq_id,seq))
> len(arch16s)

> bac16s = []
> for seq_id, seq in MinimalFastaParser(open('bacterial_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> bac16s.append((seq_id,seq))
> len(bac16s)

> euk16s = []
> for seq_id, seq in MinimalFastaParser(open('eukaryotic_v11.fasta')):
|..> if len(seq) > 1000 and seq.count('N') < 1:
|..> euk16s.append((seq_id,seq))

Import shuffle from the random module so I can extract a random collection of sequences

> from random import shuffle
> shuffle(arch16s)
> shuffle(bac16s)
> shuffle(euk16s)

Take some random sequences from each domain:

> combined16s = arch16s[:3] + bac16s[:10] + euk16s[:6]
> len(combined16s)

Load the combined sequences into a SequenceCollection object. The SequenceCollection object has many useful attributes and methods associated with it. Call dir(seqs) (where seqs is a SequenceCollection object for a listing. getNumSeqs is one method of the SequenceCollection object.

> from cogent import LoadSeqs, DNA
> seqs = LoadSeqs(data=combined16s,moltype=DNA,aligned=False)
> len(seqs)
> seqs.getNumSeqs()

Get an aligner function — here we’ll align with MUSCLE via the MUSCLE application controller. The result of calling align_unaligned_seqs is an Alignment object.

> from import align_unaligned_seqs
> aln = align_unaligned_seqs(seqs,DNA)

Get a tree-building function — here we’ll use FastTree. The result is a PhyloNode object.

> from import build_tree_from_alignment
> tree = build_tree_from_alignment(aln,DNA)

Next I import a drawing function to visualize the tree.

> from cogent.draw.dendrogram import UnrootedDendrogram
> dendrogram = UnrootedDendrogram(tree)

I can then generate a PDF of the tree, and save it to file:

> dendrogram.drawToPDF('./tol.pdf')

Here’s the final figure:

A quick-and-dirty tree of life, built using PyCogent and 16s rDNA sequences from the Silva database.

A quick-and-dirty tree of life, built using PyCogent and 16s rDNA sequences from the Silva database.

There you have it: about 20 lines of code to reproduce a classic bioinformatics experiment. As we can see the sequences (or tips in the tree), do appear to fall into three distinct clusters. By looking up these sequence identifiers in Silva, we can see that the three clusters match the three domains of life: archaea, bacteria, and eukarya. (We can also see that the number of tips in the clusters match the numbers of sequences we pulled from each of the sequence collections above.)

PyCogent is a very nice toolkit to work with because it lets you ignore the boring and/or frustrating aspects of the study, like parsing files, interacting with the different interfaces associated with different third-party tools, etc., and focus on the fun parts: the experiments. PyCogent is an open-source project with the core development team split between the University of Colorado at Boulder and Australia National University. I’ve been a developer on the project since about 2003.

PyNAST is live!

22 September 2009

My latest open source software project went live today on Sourceforge. PyNAST is biological sequence alignment tool which applies the NAST (Nearest Alignment Space Termination) algorithm to align a set of input (or candidate) sequences against a template alignment. The NAST algorithm guarantees that the number of positions in the output alignment will be identical to the number of positions in the template alignment. This is extremely convenient, for example, when you have a multiple sequence alignment that was built manually, and you want to study newly acquired sequences in the context of data (such as phylogenetic trees) which were derived from the manual alignment.

NAST (originally published here) has primarily been used for aligning newly acquired 16s rDNA sequences against the Greengenes “core sets” via the Greengenes website. NAST has become a popular tool in microbial community analysis, but wider adoption has been limited by the difficulty of running the original implementation locally. Since users may need to align thousands or even hundreds of thousands of sequences, it is important for them to be able to run the software on their own laptops, servers, or clusters. PyNAST, which was developed in collaboration with some of the original NAST authors, provides a command line interface, an API, and a Mac OS X GUI (which will go live shortly) to provide convenient access in all of these environments. Additionally, because users can provide their own template alignments when running locally, PyNAST is not specific to 16s rDNA alignments.

We currently have an Applications Note which provides more details on PyNAST, in addition to some speed benchmarks, under review at Bioinformatics.

Script for counting the number of sequences in a fasta file

7 September 2009

I’ve posted script that I use frequently for quickly finding out how many sequences are in a fasta file. This works for aligned or unaligned protein or xNA sequences. Note that for performance reasons, it does no error checking, so the user is responsible for making sure that the file is a valid fasta file.

You can download and

The way I use these is I have a scripts directory where I keep the script, and then have a symbolic link to it from my $HOME/bin directory. To do this, you could do the following:

> mkdir $HOME/scripts $HOME/bin

You’ll need to make sure that $HOME/bin is in your search path, or choose a different directory (e.g., /usr/local/bin) which will be in your search path.

Then save the script from the link above to $HOME/scripts. Make the file executable, and create a symbolic link in $HOME/bin so it’s in your search path.

> chmod 755 $HOME/scripts/
> ln -s $HOME/scripts/ $HOME/bin/count_seqs

Once you’ve done that, you should be able to apply the script.

Get usage information:

> count_seqs -h
Usage: count_seqs [options] fasta_filepath1 [fasta_filepath2 [...]]

--version show program's version number and exit
-h, --help show this help message and exit
-s, --suppress_errors
Suppress warnings about missing files [default: False]

Count the number of sequences in a set of fasta files in a hypothetical aln_size_tests directory:

> count_seqs aln_size_tests/*fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Some files were not accessible. Do they exist? Do you have read permission?

If you’re not interested in a report about files which couldn’t be found, pass the -s option to suppress error messages.

[email protected] aln_size_tests> count_seqs -s *fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Note that the sequence counting functions exactly as using egrep -c to search for lines beginning with >. For example:

> egrep -c '^>' aln_size_tests/320_seq.fasta

Wrapping it in a python script allows for convenient application to a list of files, and convenient totaling of different sequence files. That’s useful, for example, if you have a 16S database split on domains of life, and you want to see how many fall into each category and how many there are total. Here’s an example using a local copy of the Silva 16S data set:

> count_seqs silva/*fasta

12635 : silva/archaeal.fasta
296754 : silva/bacterial.fasta
42239 : silva/eukaryotic.fasta
0 : silva/unclassified.fasta
351628 : Total

I hope this is useful!