command line - Intergenic Regions? - Ask Ubuntu
i have bed file genes on plus , minus strands. want intergenic regions between genes on positive , negative strands. there way awk command? much!
edit: second , third columns gene start , stop sites. want subtract number in third column , number in preceding second row if there + sign starting bottom. if last column has - sign want start bottom subtract number in second column , number in preceding third row.
desired output
chr1 820983 (860759-820983) ensg00000269308 591 + chr1 818542 (818542-369634) ensg00000235249 587 + ch1 738637 (738637-623056) ensg00000185097 589 - ch1 621546 (621546-140379) ensg00000237683 586 -
input
chr1 13885 140379 ensg00000237683 586 - chr1 36854 369634 ensg00000235249 587 + chr1 621546 623056 ensg00000185097 589 - chr1 738637 740137 ensg00000269831 590 - chr1 818542 820983 ensg00000269308 591 + chr1 860759 875671 ensg00000187634 591 +
i still don't understand explanation , desired output (why column 1 values chr1
, ch1
? why column 2 in first row 820983 rather 860759?), may enough started.
if don't care order of output, simplest thing seem to reverse order of lines in file, keep rolling record of recent column 2 values +
, -
entries, , print/subtract them when next +
or -
encountered:
$ tac file.bed | awk ' > $nf ~ /+/ {if(last2p) print $1, last2p, "(" last2p "-" $3 ")", $4, $5, $6; last2p = $2} > $nf ~ /-/ {if(last2m) print $1, last2m, "(" last2m "-" $3 ")", $4, $5, $6; last2m = $2} > ' chr1 860759 (860759-820983) ensg00000269308 591 + chr1 738637 (738637-623056) ensg00000185097 589 - chr1 818542 (818542-369634) ensg00000235249 587 + chr1 621546 (621546-140379) ensg00000237683 586 -
if do care output order, can construct pair of arrays plus
, minus
say, iterate on them in reverse looking "up" next matching +
or -
. big one-liner presented here executable awk script:
$ cat chr.awk #!/usr/bin/gawk -f function fooprint(a,i, j,p,q) { split(a[i], p); for(j=i-1;j>0;j--) { if(j in a) { split(a[j], q); print q[1], p[2], "(" p[2] "-" q[3] ")", q[4], q[5], q[6]; break; } } } $nf ~ /+/ {plus[fnr] = $0} $nf ~ /-/ {minus[fnr] = $0} end { for(i=nr; i>1; i--) { if (i in plus) fooprint(plus,i); else if (i in minus) fooprint(minus,i); } }
then
$ ./chr.awk file.bed chr1 860759 (860759-820983) ensg00000269308 591 + chr1 818542 (818542-369634) ensg00000235249 587 + chr1 738637 (738637-623056) ensg00000185097 589 - chr1 621546 (621546-140379) ensg00000237683 586 -
Comments
Post a Comment