Script for counting the number of sequences in a fasta file

I’ve posted script that I use frequently for quickly finding out how many sequences are in a fasta file. This works for aligned or unaligned protein or xNA sequences. Note that for performance reasons, it does no error checking, so the user is responsible for making sure that the file is a valid fasta file.

You can download count_seqs.py and test_count_seqs.py.

The way I use these is I have a scripts directory where I keep the script, and then have a symbolic link to it from my $HOME/bin directory. To do this, you could do the following:


> mkdir $HOME/scripts $HOME/bin

You’ll need to make sure that $HOME/bin is in your search path, or choose a different directory (e.g., /usr/local/bin) which will be in your search path.

Then save the script from the link above to $HOME/scripts. Make the file executable, and create a symbolic link in $HOME/bin so it’s in your search path.


> chmod 755 $HOME/scripts/count_seqs.py
> ln -s $HOME/scripts/count_seqs.py $HOME/bin/count_seqs

Once you’ve done that, you should be able to apply the script.

Get usage information:

> count_seqs -h
Usage: count_seqs [options] fasta_filepath1 [fasta_filepath2 [...]]

Options:
--version show program's version number and exit
-h, --help show this help message and exit
-s, --suppress_errors
Suppress warnings about missing files [default: False]

Count the number of sequences in a set of fasta files in a hypothetical aln_size_tests directory:


> count_seqs aln_size_tests/*fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Some files were not accessible. Do they exist? Do you have read permission?
not_a_real_file1.fasta
/data/not_a_file.txt

If you’re not interested in a report about files which couldn’t be found, pass the -s option to suppress error messages.

[email protected] aln_size_tests> count_seqs -s *fasta not_a_real_file1.fasta /data/not_a_file.txt

10 : 10_seq.fasta
160 : 160_seq.fasta
1 : 1_seq.fasta
20 : 20_seq.fasta
2 : 2_seq.fasta
320 : 320_seq.fasta
3 : 3_seq.fasta
40 : 40_seq.fasta
4 : 4_seq.fasta
5 : 5_seq.fasta
80 : 80_seq.fasta
645 : Total

Note that the sequence counting functions exactly as using egrep -c to search for lines beginning with >. For example:


> egrep -c '^>' aln_size_tests/320_seq.fasta
320

Wrapping it in a python script allows for convenient application to a list of files, and convenient totaling of different sequence files. That’s useful, for example, if you have a 16S database split on domains of life, and you want to see how many fall into each category and how many there are total. Here’s an example using a local copy of the Silva 16S data set:


> count_seqs silva/*fasta

12635 : silva/archaeal.fasta
296754 : silva/bacterial.fasta
42239 : silva/eukaryotic.fasta
0 : silva/unclassified.fasta
351628 : Total

I hope this is useful!

Leave a Reply