bash - Script for finding a longest and shortest word or string in a file? - Ask Ubuntu
I am working in genomics. I have a file of reads in FASTA format. These are genes, and each gene is called a read or contig. Each contig starts with a header, followed by letters (nucleotides, e.g. ACTG) of a specific length. I want to determine the longest contig and the shortest contig (or read, or gene) in the file. Please tell me an Ubuntu script to find such contigs. Each contig or read in FASTA format is as follows:
>locus_1000_transcript_1/1_confidence_0.000_length_648 ftbs=645 (header)
ccgccttggtaacctcgccagcatattgagctttggatccggatggtcgtagaatggcaag
gcaggagagagtgtctaatgtggcgccgctctgtacccggggggtaacaatgaatttgcga
cgacgtggtatgcccttcgttgaaacccttattagttggagccgctatgtggcggtccaat
tatcaagtatttcccacatcttgaagcgcttctggatgtacgcatactatgggttgacgtt
agtgtagccgagatttcacagtagctccgaacggtggtagcagacgcccgttcacaaaaac
The header has a defined format that shows the gene locus and the number of genes, and there is a space between each contig or read. Each read or contig in the file starts with a header of the same type as mentioned above, but the values may differ. Each contig or read starts with a > sign. There may be contigs of the same length. – science 3 mins ago
Assuming the length values in the FASTA headers are correct, extract them from there:
sed -nre 's/^>.*_length_([0-9]+) .*/\1/p' \
Then sort them numerically:
| sort -n \
Then output the first and the last line:
| sed -ne '1p;$p'
In one statement:
sed -nre 's/^>.*length_([0-9]+) .*/\1/p' | sort -n | sed -ne '1p;$p'
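If you also want to see which contig each length belongs to, a variation along these lines may help. It is only a sketch: header_extremes is a made-up function name, and it assumes every header carries _length_N inside its first space-delimited word, as in the example header from the question.

```shell
# Hypothetical helper: reads FASTA on standard input and prints
# "<declared length> <contig id>" for the shortest and the longest contig.
# Assumes the first word of each header contains "_length_<digits>".
header_extremes() {
  sed -nE 's/^>([^ ]*_length_([0-9]+)).*/\2 \1/p' \
    | sort -n \
    | sed -n '1p;$p'
}
```

It would be used as a filter, for example `header_extremes < contigs.fasta`, where contigs.fasta stands for your own file. With ties, only one of the equally long contigs is shown at each end.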
If the lengths declared in the headers cannot be trusted, count the length of the FASTA sequences themselves: first convert them to unfasta (each sequence on a single line), then print the line length of every second line and apply the same sort | sed filter as above:
uf | awk 'NR%2==0 {print length}' | sort -n | sed -n '1p;$p'
where uf is a simple bash script that can be found here.
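If the uf helper is not at hand, the same measurement can probably be done in awk alone. The following is a sketch, not the uf script itself: seq_length_extremes is a made-up name, and it assumes that every record starts with a line beginning with > and that all following lines up to the next > belong to that sequence.

```shell
# Hypothetical helper: reads FASTA on standard input, sums the sequence
# lines of each record (so wrapped sequences are handled), and prints
# "<measured length> <contig id>" for the shortest and the longest contig.
seq_length_extremes() {
  awk '
    /^>/ {                                   # header line starts a new record
      if (have) print len, name              # emit the previous record first
      have = 1; name = substr($1, 2); len = 0
      next
    }
    have {                                   # sequence line of current record
      gsub(/[ \t\r]/, "")                    # ignore blanks and stray CRs
      len += length($0)
    }
    END { if (have) print len, name }        # emit the final record
  ' | sort -n | sed -n '1p;$p'
}
```

Like the one-liners, it is a filter: `seq_length_extremes < contigs.fasta`, with contigs.fasta standing for your own file.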
Note: both one-liners are filters; they read input from standard input and write to standard output. Use cat to feed them files (or wget -O - to feed them off the internet).