bash - Script for finding a longest and shortest word or string in a file? - Ask Ubuntu


I am working in genomics. I have a file of reads in FASTA format. These are genes. Each gene is called a read or contig. Each contig starts with a header, followed by letters or nucleotides, e.g. ACTG, of a specific length. I want to determine the longest contig and the shortest contig (or read, or gene) in the file. Please tell me an Ubuntu script to find such contigs. Each contig or read in FASTA format is as follows:

>locus_1000_transcript_1/1_confidence_0.000_length_648 ftbs=645 (header)
ccgccttggtaacctcgccagcatattgagctttggatccggatggtcgtagaatggcaag
gcaggagagagtgtctaatgtggcgccgctctgtacccggggggtaacaatgaatttgcga
cgacgtggtatgcccttcgttgaaacccttattagttggagccgctatgtggcggtccaat
tatcaagtatttcccacatcttgaagcgcttctggatgtacgcatactatgggttgacgtt
agtgtagccgagatttcacagtagctccgaacggtggtagcagacgcccgttcacaaaaac

The header has a defined format that shows the gene locus, the number of genes, and so on. There is a space between each contig or read. Each read or contig in the file starts with a header of the same type as the one above, though the values may differ. Each contig or read starts with a > sign. There may be contigs of the same length. – science 3 mins ago

Assuming the length values in the FASTA headers are correct, extract them from there:

sed -nre 's/^>.*_length_([0-9]+) .*/\1/p' \ 

Then sort them numerically:

| sort -n \ 

Then output the first and last lines:

| sed -ne '1p;$p' 

In one statement:

sed -nre 's/^>.*length_([0-9]+) .*/\1/p' | sort -n | sed -ne '1p;$p' 
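The pipeline above prints only the two lengths. To see which contigs they belong to, the header text can be carried along with each number. The following is a minimal sketch, assuming every header contains `_length_N` followed by a space, as in the example in the question; the function name `header_extremes` is hypothetical, and `-E` is used as the portable spelling of `-r`.

```shell
# Hypothetical helper: print "length header" for the shortest and the
# longest contig, trusting the length declared in each FASTA header.
header_extremes() {
    # Keep only header lines, put the declared length in front of the
    # header text, sort numerically, then keep the first and last line.
    sed -nE 's/^>(.*_length_([0-9]+) .*)$/\2 \1/p' \
        | sort -n \
        | sed -n '1p;$p'
}
```

Like the one-liners, this reads standard input, so it could be run as `header_extremes < reads.fasta`, where `reads.fasta` stands in for your own file. If several contigs share the smallest or largest length, only one of them is shown.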

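If the lengths declared in the headers cannot be trusted, count the length of the FASTA sequences instead: first convert them to unfasta (each sequence on a single line after its header), then print the line length of every second line and apply the same sort | sed filter as above: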

uf | awk 'NR%2==0 {print length}' | sort -n | sed -n '1p;$p' 

where uf is a simple bash script found here.
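The uf script itself is not reproduced in this answer. As a rough stand-in, the conversion can be sketched with awk: print each header on its own line and join the wrapped sequence lines that follow it into a single line. The names `unfasta` and `sequence_extremes` below are hypothetical, and this is an assumption about what the conversion needs to do, not the linked script.

```shell
# Hypothetical stand-in for uf: emit each header line followed by its
# whole sequence on exactly one line, so sequences sit on even lines.
unfasta() {
    awk '
        /^>/ { if (n++) print seq; print; seq = ""; next }  # flush previous record
        { gsub(/[ \t\r]/, ""); seq = seq $0 }                # join wrapped lines
        END { if (n) print seq }                             # flush last record
    '
}

# Shortest and longest sequence length, counted from the sequences.
sequence_extremes() {
    unfasta | awk 'NR % 2 == 0 { print length }' | sort -n | sed -n '1p;$p'
}
```

Run as `sequence_extremes < reads.fasta`, with `reads.fasta` standing in for your own file. Because a sequence line is always printed after each header, even when it is empty, the "every second line" rule stays valid for records with no sequence.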


Note: both one-liners are filters: they read input from standard input and write to standard output. Use cat to feed them files (or wget -O - to feed them off the internet).

