Integrating news feeds into Sphinx pages

26 January 2010

I’ve worked on a few Sphinx-based websites for software projects lately. (I’ve found the matplotlib sphinx tutorial useful for getting started.) A couple of examples are the PyCogent and PyNAST sites.

One thing I’ve needed to do was integrate a news feed that is easy for developers on the project to update, without having to mess with uploading new html, sharing the site password, etc. The solution I came up with was creating a wordpress blog (e.g., PyCogent, PyNAST); using feed.informer to generate a javascript feed digest; and integrating that javascript into the sidebar via a custom layout.html on the two sites I mentioned above.

The steps are as follows:

  1. Create a blog (I chose wordpress, since that’s what I’ve used the most).
  2. Create a free feed.informer account. There are a few different sites out there that will host your javascript-based ‘feed digest’, but I’ve found this one to be fairly customizable and the least obnoxious in terms of advertising (see the ‘Powered by Feed Informer’ note on my sites).
  3. Run through the steps of adding your feed to a digest at feed.informer — for my PyNAST wordpress blog, the feed url is:

    One option you should be sure to use is ‘Show Only Live Items’, which has the effect of refreshing the feed rather than reading it from feed.informer’s cache. This allows you to, for example, delete posts in your blog and have them no longer show up in your feed digest.

  4. Paste the provided javascript into your layout.html (for examples, see the PyCogent layout.html and the PyNAST layout.html).
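As a rough sketch of step 4, the helper below writes a minimal layout.html override. It assumes a theme derived from Sphinx's basic theme (which defines a sidebarsearch block) and that conf.py contains templates_path = ['_templates']. The function name write_layout and the DIGEST_SNIPPET marker are made up for illustration; the script tag itself has to come from your own feed.informer digest.

```python
from pathlib import Path

# Jinja template: extend the theme's own layout ("!layout.html") and append
# a news box after the sidebar search box. DIGEST_SNIPPET is a marker that
# gets swapped for the javascript provided by the digest service.
LAYOUT_TEMPLATE = """{% extends "!layout.html" %}

{% block sidebarsearch %}
{{ super() }}
<h3>News</h3>
DIGEST_SNIPPET
{% endblock %}
"""


def write_layout(templates_dir, snippet):
    """Write layout.html into templates_dir with the digest snippet embedded."""
    path = Path(templates_dir) / "layout.html"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(LAYOUT_TEMPLATE.replace("DIGEST_SNIPPET", snippet))
    return path

# Example (hypothetical script tag; use the one from your digest page):
# write_layout("_templates", '<script type="text/javascript" src="..."></script>')
```

Overriding a single block and calling super() keeps the rest of the theme's sidebar intact, so the feed shows up without copying the whole template.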
    Posted by gregcaporaso

Script for performing intermolecular coevolutionary analysis with the PyCogent Bioinformatics Toolkit

24 November 2009

I’ve posted a script and a basic readme here which allows users of the python bioinformatics toolkit PyCogent to run intermolecular coevolutionary analyses using PyCogent’s built-in coevolution module.

PyCogent’s coevolution module supports several tree-aware and tree-ignorant methods for identifying pairs of coevolving positions within and between biological sequences. Additionally it contains support for pre- and post-processing coevolution input and results, such as recoding multiple sequence alignments with reduced-state amino acid alphabets. The script included in PyCogent supports intramolecular coevolutionary analysis, while this new script supports intermolecular coevolutionary analysis.
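To give a feel for what recoding with a reduced-state alphabet means, here is a small plain-Python sketch. It does not use PyCogent's coevolution module, and the charge-based grouping below is only an illustrative assumption, not one of the alphabets PyCogent provides. Each residue in a group is replaced by the group's representative character; gaps and ungrouped residues pass through unchanged.

```python
# Illustrative grouping only (an assumption for this sketch).
CHARGE_GROUPS = {
    "K": "KRH",  # residues treated here as positively charged
    "D": "DE",   # residues treated here as negatively charged
}


def build_lookup(groups):
    """Map every grouped residue to its group's representative character."""
    return {aa: rep for rep, members in groups.items() for aa in members}


def recode_alignment(aln, groups):
    """Recode each sequence in a {name: sequence} alignment."""
    lookup = build_lookup(groups)
    return {name: "".join(lookup.get(ch, ch) for ch in seq)
            for name, seq in aln.items()}
```

For example, recode_alignment({"s1": "KD-RE"}, CHARGE_GROUPS) gives {"s1": "KD-KD"}: R collapses to K, E collapses to D, and the gap is preserved.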

Building a tree of life with PyCogent

3 October 2009

This is a PyCogent use case that I put together for my lecture a few weeks ago in BIOI 7711 at UC Denver. I thought this was a cool example because it quickly shows some powerful aspects of PyCogent, including:

  • parsers for common file formats (MinimalFastaParser);
  • support for common biological data types (alignments, sequences, and trees);
  • interaction with external applications (MUSCLE and FastTree) via the application controller framework;
  • and visualization (via the UnrootedDendrogram object).

The example illustrates, in about 20 or so lines of code, how to apply PyCogent to evaluate the idea that life on Earth clusters into three related domains, detectable by distances between their small-subunit ribosomal genes (i.e., their 16s rDNA sequences). Using sequence collections derived from the Silva database (filtered with cd-hit-est so the max pairwise identity between any two sequences is 90%), I randomly select sequences, align them with MUSCLE, build a tree with FastTree, and visualize the tree via a PDF.

To run this example, in addition to PyCogent, you’ll need MUSCLE, FastTree, and matplotlib installed. You can read more about the ideas behind this example in Woese 1987 and Woese 1990.

Let’s get started. I did this with the ipython interpreter, but the standard python interpreter will work fine too.

tol_example> ipython

First, we’ll load up the sequences. We’ll select only sequences that are more than 1000 bases long and contain no N characters. As I mentioned above, the sequences I am importing here come from the Silva database of 16S sequences, and have been filtered with cd-hit-est at 90% sequence identity. (Skipping the filtering step would probably result in more sequences being required to see the nice clusters.)

> from cogent.parse.fasta import MinimalFastaParser

> arch16s = []
> for seq_id, seq in MinimalFastaParser(open('archaeal_v11.fasta')):
|..>     if len(seq) > 1000 and seq.count('N') < 1:
|..>         arch16s.append((seq_id,seq))
> len(arch16s)

> bac16s = []
> for seq_id, seq in MinimalFastaParser(open('bacterial_v11.fasta')):
|..>     if len(seq) > 1000 and seq.count('N') < 1:
|..>         bac16s.append((seq_id,seq))
> len(bac16s)

> euk16s = []
> for seq_id, seq in MinimalFastaParser(open('eukaryotic_v11.fasta')):
|..>     if len(seq) > 1000 and seq.count('N') < 1:
|..>         euk16s.append((seq_id,seq))

Import shuffle from the random module so I can extract a random collection of sequences:

> from random import shuffle
> shuffle(arch16s)
> shuffle(bac16s)
> shuffle(euk16s)

Take some random sequences from each domain:

> combined16s = arch16s[:3] + bac16s[:10] + euk16s[:6]
> len(combined16s)
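An alternative to shuffling and slicing is random.sample, which draws without replacement and leaves the original lists untouched. The sketch below also takes an optional seed so the same tree can be rebuilt later; pick_random is a name made up for this example.

```python
import random


def pick_random(records, k, seed=None):
    """Return k records chosen at random without replacement.

    The input list is not modified. Passing a seed makes the
    selection repeatable from run to run.
    """
    rng = random.Random(seed)
    return rng.sample(records, k)
```

Under that assumption, the combined list could be built as pick_random(arch16s, 3) + pick_random(bac16s, 10) + pick_random(euk16s, 6).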

Load the combined sequences into a SequenceCollection object. The SequenceCollection object has many useful attributes and methods associated with it. Call dir(seqs) (where seqs is a SequenceCollection object) for a listing. getNumSeqs is one method of the SequenceCollection object.

> from cogent import LoadSeqs, DNA
> seqs = LoadSeqs(data=combined16s,moltype=DNA,aligned=False)
> len(seqs)
> seqs.getNumSeqs()

Get an aligner function — here we’ll align with MUSCLE via the MUSCLE application controller. The result of calling align_unaligned_seqs is an Alignment object.

> from cogent.app.muscle import align_unaligned_seqs
> aln = align_unaligned_seqs(seqs,DNA)

Get a tree-building function — here we’ll use FastTree. The result is a PhyloNode object.

> from cogent.app.fasttree import build_tree_from_alignment
> tree = build_tree_from_alignment(aln,DNA)

Next I import a drawing function to visualize the tree.

> from cogent.draw.dendrogram import UnrootedDendrogram
> dendrogram = UnrootedDendrogram(tree)

I can then generate a PDF of the tree, and save it to file:

> dendrogram.drawToPDF('./tol.pdf')

Here’s the final figure:

A quick-and-dirty tree of life, built using PyCogent and 16s rDNA sequences from the Silva database.


There you have it: about 20 lines of code to reproduce a classic bioinformatics experiment. As we can see, the sequences (or tips in the tree) do appear to fall into three distinct clusters. By looking up these sequence identifiers in Silva, we can see that the three clusters match the three domains of life: archaea, bacteria, and eukarya. (We can also see that the numbers of tips in the clusters match the numbers of sequences we pulled from each of the sequence collections above.)
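Rather than looking up each identifier by hand, the domain of each tip could be recovered from the collection it was drawn from. This is a sketch only: label_by_domain and count_domains are names invented here, and it assumes the tip names in the tree are the original sequence identifiers.

```python
from collections import Counter


def label_by_domain(collections):
    """Map each sequence id to the name of the collection it came from.

    collections is a dict such as {"archaea": [(seq_id, seq), ...], ...}.
    """
    labels = {}
    for domain, records in collections.items():
        for seq_id, _seq in records:
            labels[seq_id] = domain
    return labels


def count_domains(tip_names, labels):
    """Count how many tips belong to each domain."""
    return Counter(labels[name] for name in tip_names)
```

Given the lists from the session above, labels could be built from {"archaea": arch16s[:3], "bacteria": bac16s[:10], "eukarya": euk16s[:6]}, and the counts should then come out as 3, 10, and 6.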

PyCogent is a very nice toolkit to work with because it lets you ignore the boring and/or frustrating aspects of the study, like parsing files, interacting with the different interfaces associated with different third-party tools, etc., and focus on the fun parts: the experiments. PyCogent is an open-source project with the core development team split between the University of Colorado at Boulder and the Australian National University. I’ve been a developer on the project since about 2003.

PyNAST is live!

22 September 2009

My latest open source software project went live today on Sourceforge. PyNAST is a biological sequence alignment tool which applies the NAST (Nearest Alignment Space Termination) algorithm to align a set of input (or candidate) sequences against a template alignment. The NAST algorithm guarantees that the number of positions in the output alignment will be identical to the number of positions in the template alignment. This is extremely convenient, for example, when you have a multiple sequence alignment that was built manually, and you want to study newly acquired sequences in the context of data (such as phylogenetic trees) which were derived from the manual alignment.
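The fixed-width guarantee is easy to check on any output. The sketch below is not PyNAST's API; it is a plain-Python check, with a made-up function name, that treats an alignment as a dict of name to gapped sequence and confirms every aligned candidate has as many positions as the template.

```python
def same_width_as_template(template_aln, candidate_aln):
    """Return True if every aligned candidate has the template's column count.

    Both arguments are {name: gapped_sequence} dicts.
    """
    widths = {len(seq) for seq in template_aln.values()}
    if len(widths) != 1:
        raise ValueError("template must be non-empty with equal-length rows")
    (width,) = widths
    return all(len(seq) == width for seq in candidate_aln.values())
```

Because the column count never changes, column i in the output refers to the same position as column i in the template, which is what lets new sequences be dropped into analyses derived from the original alignment.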

NAST (originally published here) has primarily been used for aligning newly acquired 16s rDNA sequences against the Greengenes “core sets” via the Greengenes website. NAST has become a popular tool in microbial community analysis, but wider adoption has been limited by the difficulty of running the original implementation locally. Since users may need to align thousands or even hundreds of thousands of sequences, it is important for them to be able to run the software on their own laptops, servers, or clusters. PyNAST, which was developed in collaboration with some of the original NAST authors, provides a command line interface, an API, and a Mac OS X GUI (which will go live shortly) to provide convenient access in all of these environments. Additionally, because users can provide their own template alignments when running locally, PyNAST is not specific to 16s rDNA alignments.

We currently have an Applications Note which provides more details on PyNAST, in addition to some speed benchmarks, under review at Bioinformatics.