Date: 2018-12-01

Unwrap your fasta

DNA sequences are often stored as plain text in FASTA files. The formatting of a FASTA or fasta file is simple: each sequence is preceded by a description line beginning with the greater-than character (>) and containing information that identifies the sequence. A fasta file can contain many sequences, each with its own description line. Here's a 3-sequence example, sample.fasta:

>sequence1
gatgatacacgctgtcactaagcacattttctgtttctaatccataaccagaggcccggc
cccgctttttgcggatgcacaccatg-gataccttcgcggttctagcgatcctcgtac
gagaggtgctggtaaagtaacagtaatgggtactcccgaagtatactgttaccatcatac
accgccgtatttagcgaccgttacttgctcacctagatcctgtgcgacctcttggtcgat
ggtaatctctccactgcaagaagcccttattccgctgccacgactcatagcagaggtcat
acgtaacgaatctaaatccactactcacggctatggggtgaaatatt
>sequence2
gtcttgtatctagcccgctccggaacggcagctacgccacaggcaaacaatccatgcgat
gatcactt-cacaactggtgaatgatcgggctcaccaggacgtacagagaaagtggattg
acggagcatggtcgcaataagatatatacactggacccgatatgccctccattcctttta
cacaccctttactatggggtagttcggaggttgataatttgcggtc-ggtccaacggctt
cacg
>sequence3
gtgaaggtagttttgtgaatactgatgatcaagatgatatctcatcatcgcgttatgcgc
tgcaggctggttagcgtacgact-acaaaccagcacatgtacaagggtatgaccctctcc
aatgtggaacccttcggtagccccacaaag-

There are many specialised software packages for manipulating fasta files, not to mention online services that accept uploaded or pasted-in files. There are also command-line fasta manipulators in the Linux repositories, in R and Python libraries and on GitHub. But since a fasta file is just plain text, you can do many of the common fasta manipulations with everyday text-processing tools like AWK and sed.
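
As a small illustration of what everyday tools can do with a fasta file, here are three quick queries. This is only a sketch: demo.fasta is a made-up two-sequence file created for the example, not a file from this post.

```shell
# demo.fasta is a hypothetical test file with two short wrapped sequences
printf '>seq1\nacgt\nttga\n>seq2\ncc\n' > demo.fasta

# Count the sequences by counting description lines
grep -c '^>' demo.fasta

# List the sequence names without the leading ">"
sed -n 's/^>//p' demo.fasta

# Report the longest sequence line, i.e. the wrap width in use
awk '!/^>/ && length($0) > max {max = length($0)} END {print max + 0}' demo.fasta
```

On this demo file the three commands should report 2 sequences, the names seq1 and seq2, and a longest line of 4 characters.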

In this post I look at 3 AWK-ish ways to do one very useful manipulation, namely converting sequences spread over several lines into one-line sequences. In sample.fasta, above, the sequences are wrapped at a maximum of 60 characters per line. How can each of the 3 sequences be put on a single line, preceded by its description line?

Method 1. This one is by Bruce Horrocks; I published it in 2016 on the Linux Rain blog.

awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print ">"$1; $1=""; print}' sample.fasta

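To see what each part of Method 1 is doing, the same program can be spread over several lines with comments. This is only an illustrative sketch: demo.fasta and demo_unwrapped.fasta are made-up file names, and the comments are my reading of the command rather than anything from its author.

```shell
# demo.fasta is a hypothetical test file with two short wrapped sequences
printf '>seq1\nacgt\nttga\n>seq2\ncc\n' > demo.fasta

awk 'BEGIN {
    RS = ">"      # a record is everything between two ">" characters
    FS = "\n"     # a field is one line of that record
    OFS = ""      # rebuilt records get nothing between their fields
}
NR > 1 {          # record 1 is the empty text before the first ">"
    print ">" $1  # field 1 is the description line
    $1 = ""       # blanking field 1 makes AWK rebuild the record with OFS,
    print         # so the remaining lines come out joined as one line
}' demo.fasta > demo_unwrapped.fasta

cat demo_unwrapped.fasta
```

The result should be four lines: >seq1, acgtttga, >seq2 and cc.
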
Method 2. A similar method was suggested by "melissamlwong" on the Cheatography site in 2015:

awk 'BEGIN {RS=">";FS="\n"} NR>1 {seq=""; for (i=2;i<=NF;i++) seq=seq$i; print ">"$1"\n"seq}' sample.fasta

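Methods 1 and 2 both split the file into one record per sequence and differ only in how they glue the lines back together, so they should give the same result. A quick way to check that, sketched here on a made-up file (demo.fasta, m1.out and m2.out are hypothetical names), is to compare the two outputs with cmp.

```shell
# demo.fasta is a hypothetical test file with two short wrapped sequences
printf '>seq1\nacgt\nttga\n>seq2\ncc\n' > demo.fasta

# Method 1: blank the first field and let OFS="" join the rest
awk 'BEGIN {RS=">";FS="\n";OFS=""} NR>1 {print ">"$1; $1=""; print}' demo.fasta > m1.out

# Method 2: concatenate fields 2..NF in an explicit loop
awk 'BEGIN {RS=">";FS="\n"} NR>1 {seq=""; for (i=2;i<=NF;i++) seq=seq$i; print ">"$1"\n"seq}' demo.fasta > m2.out

# cmp -s is silent and only sets the exit status
if cmp -s m1.out m2.out; then echo "identical"; else echo "different"; fi
```
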
Method 3. The third method was suggested by contributor "Johnsyweb" on Stack Overflow in 2013:

awk '!/^>/ {printf "%s",$0; n="\n"} /^>/ {print n$0; n=""} END {printf "%s",n}' sample.fasta

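Method 3 works line by line instead of record by record, which is easier to follow when the program is laid out with comments. Again this is just a sketch on a made-up file (demo.fasta and demo_m3.fasta are hypothetical names), with comments reflecting my reading of the command.

```shell
# demo.fasta is a hypothetical test file with two short wrapped sequences
printf '>seq1\nacgt\nttga\n>seq2\ncc\n' > demo.fasta

awk '
# Sequence line: print it with no trailing newline, and remember
# that a newline is owed before the next description line.
!/^>/ { printf "%s", $0; n = "\n" }

# Description line: first close off any pending sequence, then print
# the description line itself and clear the pending newline.
/^>/  { print n $0; n = "" }

# End of input: close off the last sequence.
END   { printf "%s", n }
' demo.fasta > demo_m3.fasta

cat demo_m3.fasta
```

Because each line is printed as soon as it is read, this approach does not need to hold a whole sequence in memory, which may matter for very long sequences.
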
Caution! These methods won't work if the wrapped-line fasta file comes from a Windows environment and has a carriage return plus a UNIX linefeed for its line endings, i.e. "\r\n" instead of just "\n". Get rid of the carriage returns with tr -d '\r' < sample.fasta before piping the result to the AWK command.

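Put together, the carriage-return check and the clean-up might look like the following. This is a sketch under assumptions: win.fasta and win_unwrapped.fasta are made-up file names, and Method 3 is used for the unwrapping step only as an example; either of the other methods could be substituted.

```shell
# win.fasta is a hypothetical test file written with Windows (CRLF) line endings
printf '>seq1\r\nacgt\r\nttga\r\n>seq2\r\ncc\r\n' > win.fasta

# Count the carriage returns; zero would mean no clean-up is needed
cr_count=$(tr -cd '\r' < win.fasta | wc -c)
echo "carriage returns found: $((cr_count))"

# Strip the carriage returns, then unwrap the sequences from the pipe
tr -d '\r' < win.fasta |
    awk '!/^>/ {printf "%s",$0; n="\n"} /^>/ {print n$0; n=""} END {printf "%s",n}' > win_unwrapped.fasta

cat win_unwrapped.fasta
```
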
