There are often times in bioinformatics when I find myself needing to sample large fasta files randomly. RefSeq is currently about 250GB compressed; this is not the sort of file you want your scripts making multiple passes through.
Shirley et al. published a preprint in 2015 announcing the Python package Pyfaidx. The idea behind the package is to index fasta files in the faidx format used by Samtools, thereby allowing random, seek-based access to the file. They have continued to maintain the package, and I’ve found it to be fast and pretty straightforward to use.
- Github: https://github.com/mdshw5/pyfaidx
- PyPi: https://pypi.python.org/pypi/pyfaidx
- Anaconda: https://anaconda.org/bioconda/pyfaidx
In its most basic usage you create a Pyfaidx fasta object:
import pyfaidx

genes = pyfaidx.Fasta('tests/data/genes.fasta')
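Conceptually, the faidx trick is just a lookup table of byte offsets plus seek(). Here is a minimal stdlib sketch of the idea — not pyfaidx’s actual implementation, and simplified to single-line sequences:

```python
import io

# Toy FASTA kept in memory; real faidx works on files the same way.
fasta = b">geneA\nACGTACGT\n>geneB\nTTTTCCCC\n"

def build_index(handle):
    """Map each record name to the byte offset of its sequence line."""
    index = {}
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            name = line[1:].split()[0].decode()
            index[name] = offset + len(line)  # sequence starts right after the header
        offset += len(line)
    return index

handle = io.BytesIO(fasta)
index = build_index(handle)

# Random access: seek straight to geneB without reading geneA.
handle.seek(index["geneB"])
seq = handle.readline().strip().decode()
print(seq)  # TTTTCCCC
```

This is why the index makes multiple passes unnecessary: once the offsets are known, any record is one seek away.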
The file you import needs to have sequence lines of consistent length or else Pyfaidx cannot index it. If you need to fix your file I’d recommend Reformat from BBTools.
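If pulling in BBTools feels heavy for a one-off fix, rewrapping to a fixed line width is also a few lines of plain Python. This is a hypothetical helper, not part of Pyfaidx or BBTools:

```python
import textwrap

def rewrap_fasta(lines, width=70):
    """Re-emit FASTA records with sequence lines of a consistent width."""
    out, seq = [], []

    def flush():
        if seq:
            out.extend(textwrap.wrap("".join(seq), width))
            seq.clear()

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            flush()       # finish the previous record
            out.append(line)
        else:
            seq.append(line)
    flush()               # finish the last record
    return out

ragged = [">geneA", "ACGT", "ACGTACGTAC", "GT"]
print(rewrap_fasta(ragged, width=8))  # ['>geneA', 'ACGTACGT', 'ACGTACGT']
```

For genome-scale files you would stream line by line rather than hold lists in memory, which is exactly what Reformat does for you.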
By default Pyfaidx keeps the sequence name but throws away the description. You can retain the description by adding read_long_names=True, but this option only works with uncompressed input data.
The Pyfaidx fasta object is a dictionary-like object with some caveats. I say dictionary-like because the API is a bit different; for instance, to pull the sequence for a record you slice it with an index range, e.g. genes['NM_001282543.1'][:]. I would have expected coordinates to apply only to the .seq attribute, but it is a minor thing to remember. Among the features:
- Access by key
- Slicing of sequences
- Filtering based on keys
- Common operations like reverse / complement
- Support for Fasta Variants (which I did not test)
- A command line interface in addition to the python package
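Reverse/complement is exposed directly on Pyfaidx sequences; the operation itself is simple enough to sketch in plain Python as a stand-alone illustration (this is not the pyfaidx API):

```python
# Reverse complement: complement each base, then reverse the string.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACCGT"))  # ACGGT
```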
More random access
By default Pyfaidx returns keys in the order they were stored in the dictionary. While this is not any particular order, it will return the same order with each call. If you need pseudorandom sampling, you can simply shuffle the keys and then access the records in shuffled order.
import random
import pyfaidx

def shuffle_keys(fastaobj):
    """Take a Pyfaidx Fasta object and return a shuffled list of its keys."""
    keylist = list(fastaobj.keys())
    random.shuffle(keylist)  # shuffles in place and returns None
    return keylist

genes = pyfaidx.Fasta('tests/data/genes.fasta')
rand_keys = shuffle_keys(genes)
for key in rand_keys:
    print(genes[key][:])
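If you only need a subset of records rather than a full shuffle, random.sample avoids reordering the whole key list. Sketched here against a plain dict standing in for the Fasta object:

```python
import random

# A plain dict standing in for a pyfaidx.Fasta object.
records = {"NM_001": "ACGT", "NM_002": "GGCC", "NM_003": "TTAA"}

random.seed(0)  # reproducible sampling for the example
picked = random.sample(list(records), k=2)  # 2 distinct keys, without replacement
for key in picked:
    print(key, records[key])
```

Because dict iteration order is stable, the unsampled keys are untouched and the Fasta index is never read for records you skip.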
Benchmarking data is available in the preprint. I found it to be fast enough, with access speeds comparable to sequential access or Biopython SeqIO in-memory access, but without consuming memory. If you need to use the library for sequential access, there is also an option to read ahead into a buffer: genes = pyfaidx.Fasta('tests/data/genes.fasta', read_ahead=10000).
Reading compressed data
If you are working with fastas large enough to warrant Pyfaidx, chances are you don’t want to unzip those files; besides, gzipped files can be processed more quickly for IO-bound processes. Pyfaidx can read the blocked gzip (BGZF) format used by Samtools, but it can take a long time to convert a large fasta to BGZF. Fortunately this can be done more quickly with pbgzip.
Parallel blocked format gzip (pbgzip)
Pbgzip is a multithreaded implementation of bgzip. It compresses files in a fraction of the time required by the non-parallel version, and using it is simple:
gzip -d -c sequences.fasta.gz | pbgzip -c -n [number_of_threads] -6 > sequences.fasta.bfgz
Be sure to save a thread for gzip when you specify ‘-n’ for pbgzip. Now you have a file compatible with Pyfaidx. You can delete your original file too, since the .bfgz output can still be read by gzip.
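The reason the compressed copy stays readable is that BGZF is a valid multi-member gzip stream, so ordinary gzip tooling decompresses it. Python’s stdlib can demonstrate the multi-member property, with plain gzip members here standing in for BGZF blocks:

```python
import gzip
import io

# BGZF files are a series of independent gzip members; a plain gzip
# reader simply decompresses the members back-to-back.
block1 = gzip.compress(b">geneA\nACGT\n")
block2 = gzip.compress(b">geneB\nGGCC\n")

stream = io.BytesIO(block1 + block2)        # concatenated members
data = gzip.GzipFile(fileobj=stream).read()  # reads across member boundaries
print(data.decode())
```

The independence of the blocks is also what makes seek-based access into compressed data possible: each block can be decompressed on its own.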