Script for counting the number of sequences in a fasta file

7 September 2009

I’ve posted script that I use frequently for quickly finding out how many sequences are in a fasta file. This works for aligned or unaligned protein or xNA sequences. Note that for performance reasons, it does no error checking, so the user is responsible for making sure that the file is a valid fasta file.

You can download count_seqs.py and test_count_seqs.py.

The way I use these is I have a scripts directory where I keep the script, and then have a symbolic link to it from my $HOME/bin directory. To do this, you could do the following:


> mkdir $HOME/scripts $HOME/bin

You’ll need to make sure that $HOME/bin is in your search path, or choose a different directory (e.g., /usr/local/bin) which will be in your search path.

Then save the script from the link above to $HOME/scripts. Make the file executable, and create a symbolic link in $HOME/bin so it’s in your search path.


> chmod 755 $HOME/scripts/count_seqs.py
> ln -s $HOME/scripts/count_seqs.py $HOME/bin/count_seqs

Once you’ve done that, you should be able to apply the script.

Get usage information:

> count_seqs -h
Usage: count_seqs [options] fasta_filepath1 [fasta_filepath2 [...]]

Options:
--version show program's version number and exit
-h, --help show this help message and exit
-s, --suppress_errors
Suppress warnings about missing files [default: False]

Count the number of sequences in a set of fasta files in a hypothetical aln_size_tests directory:


> count_seqs aln_size_tests/*fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Some files were not accessible. Do they exist? Do you have read permission?
not_a_real_file1.fasta
/data/not_a_file.txt

If you’re not interested in a report about files which couldn’t be found, pass the -s option to suppress error messages.

[email protected] aln_size_tests> count_seqs -s *fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Note that the sequence counting functions exactly as using egrep -c to search for lines beginning with >. For example:


> egrep -c '^>' aln_size_tests/320_seq.fasta
320

Wrapping it in a python script allows for convenient application to a list of files, and convenient totaling of different sequence files. That’s useful, for example, if you have a 16S database split on domains of life, and you want to see how many fall into each category and how many there are total. Here’s an example using a local copy of the Silva 16S data set:


> count_seqs silva/*fasta

12635 : silva/archaeal.fasta
296754 : silva/bacterial.fasta
42239 : silva/eukaryotic.fasta
0 : silva/unclassified.fasta
351628 : Total

I hope this is useful!


Costa Rica pictures, finally!

31 August 2009

I’ve finally put some pictures of our honeymoon in Costa Rica on my Flickr account: here they are.

I thought I’d also put a few of my favorite Costa Rica wildlife photos here.

Iguana
Iguana on the beach.

Crocodile
A crocodile, from the bridge above it.

Toucan
Toucan in the forest below Volcano Arenal.

White-faced monkeys
White-faced monkeys in the forest below Volcano Arenal.


When will I ever need to use trigonometry? (Or, how to avoid avalanche terrain.)

21 August 2009

I sometimes tutor high school math, and as anyone who has done this knows, a common question is ‘But when will I ever need to use blank?’. This came up last year, where the blank was trigonometry.

When will I ever need to use trigonometry? It’s a legitimate question. As an instructor, you could explain how it’s a foundation of mathematics, or how you’ll need to understand it for any career in science or math, or so on… (They’re both true, and are probably close to what I said, but neither are an answer that your student is likely to care much about.) Think about it: what practical problem, in day-to-day life, have you solved by applying trigonometry? (Comment if you have a good one!)

So, when I was backcountry skiing last winter and my friend Erik and I actually applied trigonometry to decide if it was safe to ski a certain slope, we were proud.

It’s generally accepted that if a slope is 30 degrees or less, it’s probably not steep enough to avalanche. Slopes around 40-55 degrees are the most dangerous. We were about to cross a slope, but we weren’t really prepared to be in avalanche terrain: we didn’t have our beacons, probes, and shovels. In short, we wanted to avoid avalanche terrain by staying off slopes that were 30 degrees or greater.

We didn’t have a slope meter, but we did know trigonometry. If we could make a right triangle where one of the legs (i.e., sides not opposite the right angle) was double the length of the other leg, the slope of the opposite side (the hypotenuse) would be less than 30 degrees. (We did need a calculator to figure out exactly what that angle would be.) What could we use to make the two legs? Ski poles!

Figure 1: Identifying a 26 degree slope with two ski poles.

Figure 1: Identifying a 26 degree slope with two ski poles.

Here’s how it works. Plant your first pole vertically, and try to make a 90 degree angle with the second pole beginning halfway down the vertical pole. If the tip of the horizontal pole touches when the poles are at 90 degrees with one another, you’re on a roughly 26 degree slope (Figure 1). If you can’t make a 90 degree angle because the horizontal pole hits the slope too high, the slope is steeper than 26 degrees (Figure 2). If you can make an angle smaller than 90 degrees before the tip of the horizontal pole hits the snow, the slope is less than 26 degrees (Figure 3).

Figure 2: Identifying a slope greater than 26 degrees with two ski poles.

Figure 2: Identifying a slope greater than 26 degrees with two ski poles.

The rule-of-thumb is this: If the poles can form a right triangle with the horizontal pole exactly halfway down the vertical pole, and with the slope as the hypotenuse, the slope is less than 30 degrees and you are probably safe. This is the situation depicted in Figures 1 and 3. If you can’t form the 90 degree angle because the slope is too steep, as in Figure 2, you’re in possible avalanche terrain so make sure you know what you’re doing!

Figure 3: Identifying a slope less than 26 degrees with two ski poles.

Figure 3: Identifying a slope less than 26 degrees with two ski poles.

Finally, you can also use this trick to identify a 45 degree slope (which is absolutely in avalanche terrain). To do that, this time place the horizontal pole at the top of the vertical pole rather than halfway down. If the tip of the horizontal pole just touches the slope when you have a 90 degree angle between the poles, the slope is just about 45 degrees (Figure 4). If it doesn’t touch, it’s less than 45 degrees. If you can’t make a 90 degree angle because the second pole hits the slope too high, you’re on some steep stuff!

Figure 4: Identifying a 45 degree slope with two ski poles.

Figure 4: Identifying a 45 degree slope with two ski poles.

There you have it: a field application of trigonometry. We’re not the first ones to figure this out — I’ve since heard that the Norwegians have been doing this forever. It’s a neat trick.

And of course, there is a lot more to avalanche safety than this. This is a basic rule, but by no means definitive. Be scared of avalanches, and don’t assume that if a slope is less than 30 degrees it can’t slide. If you’re planning to travel in avalanche country take a class, check the avalanche reports, and if you’re not scared of avalanches, watch some videos.

Skiing out from Ken's Cabin
Skiing out from Ken’s Cabin

More photos from this trip.


Python optparse example, part 3

11 August 2009

In the third part of my python optparse example we’ll look what my optparse example does, with a focus on how to define and error check the parameters.

The example script takes three optional parameters, -v, -s, -i, and one required parameter, output_filepath. (Note that I don’t need to define the -h or --help option — that functionality is automatically included.)

The -v parameter is a binary verbose flag defined as:


parser.add_option('-v','--verbose',action='store_true',\
        dest='verbose',help='Print information during execution -- '+\
        'useful for debugging [default: %default]')

When the script is called with -v or --verbose, True (because action='store_true') is stored in the verbose variable (because dest='verbose'). The help parameter defines the help string that will be presented when the user passes -h to the script. When the help string is printed, %default is replaced with the default value of the parameter.

Next I define an option which allows the user to pass a value to the script via -s:


parser.add_option('-s','--string_to_write',action='store',\
          type='string',dest='string_to_write',help='a string to write to '+\
          'file [default: %default]')

This allows for passing a -s or --string_to_write option to the script, the string value of which will be stored (because action='store') in a variable called string_to_write. Note that we explicitly define the type here (type='string') — something that is unusual in python.

The next option is similar, except that an int is stored to int_value:


    parser.add_option('-i','--int_value',action='store',\
          type='int',dest='int_value',help='an integer value to store '+\
          '[default: %default]')

After the options are defined I set defaults for them. The value for all options defaults to None, unless otherwise specified. I define the default for verbose to be False and the default for string_to_write as "Some example input.". I don’t define a default for int_value.

Finally, I parse the command line arguments into opts and args. I then have the values that the user provided, and am able to perform any necessary error checking.

The script requires that the user provides a single command line argument, output_filepath. So, I check that they have provided exactly one argument, and raise an error if not. Raising the error via parser.error is useful because it provides the user with the usage information in addition to the error message. Try calling the script with no options or arguments for an example.

If no errors are raised, opts and args are returned to the caller — in this case the main block of the code. The options and arguments can then be accessed as illustrated in the example.

You can see by experimenting with the script that the verbose flag prints some additional information to the terminal at runtime. Try running the script with and without -v, and you’ll see slightly different results. string_to_write is written to output_filepath. If it’s not provided by the user, the default value is used. Then int_value is written to output_file if provided.

That just about wraps up my optparse example. I’ve highlighted some of the optparse functionality in this tutorial, and provided a usage example that you can experiment with on the command line. A good way to learn more would be to start with my optparse example, and experiment with adding new functionality. You can find a lot more information about optparse here.

I previously mentioned that I start all new python scripts with a template file which includes optparse functionality. In my optparse template, which is a simplified version of my optparse example, I leave in some example options so it’s easy to add new ones in my script. Rather than having to remember how to instantiate the OptionParser object and define parameters, I can just modify what’s there and delete what isn’t necessary. Feel free to use and modify my optparse example to make it a template for your own python scripts.

Finally, you may have noticed that python has another module for parsing command line parameters: getopt. You should always use optparse over getopt as optparse provides a lot of additional functionality, and will continue to be supported in Python 3 (unlike getopt). (Scratch that about getopt not being supported in Python 3 — it looks like it is but the Python docs note that: “A more convenient, flexible, and powerful alternative is the optparse module.”.)

I think you’ll find that incorporating optparse-based parameter parsing to your python scripts will make them more useful to yourself and your users. I hope that my example will help you get up and running with optparse very quickly. Good luck!