cutting rows and sorting them in order

Asked by avkapur on 2008-07-15

I have a list of 13 gene identifiersi.e. (>ENSMUSG00000020061). Under each gene identifier there are rows of information and each row has seven alpha-numeric columns. My problem is I want to take all the rows under one gene, sort them by the last column which is numeric, and then go on to the next gene. At the last row of a gene there is a blank space followed by the next gene identifier i.e. (>ENSMUSG00000048285)

Question information

Language:
English Edit question
Status:
Solved
For:
Ubuntu gnome-terminal Edit question
Assignee:
No assignee Edit question
Solved by:
avkapur
Solved:
2008-07-16
Last query:
2008-07-16
Last reply:
2008-07-15
Tobias Wolf (towolf) said : #1

I have a hard time imagining how your text looks like. Can you post a snippet? It would help.

avkapur (avkapur) said : #2

Here you go, hope it helps, thanks.
>ENSMUSG00000000159
Hoxb1(3) 4 - 13 - tcatttctct 3.36
 Hoxb1(2) 4 - 13 - tcatttctct 3.19
 Hoxb1(3) 76 - 85 - acctctatcc 3.51
 Hoxb1(2) 76 - 85 - acctctatcc 3.34
 Hoxb1(2) 80 - 89 - ctatccaggc 2.73
 Hoxb1(3) 80 - 89 - ctatccaggc 2.91
 Hoxb1(2) 84 - 93 - ccaggctcct 2.52
 Hoxb1(3) 84 - 93 - ccaggctcct 2.69
 Hoxb1(2) 91 - 100 - ccttcctcct 4.84
 Hoxb1(3) 91 - 100 - ccttcctcct 4.75
Hoxb1(1) 91 - 100 - ccttcctcct 4.22
 Hoxb1(3) 111 - 120 + agacgcctgc 2.59
 Hoxb1(2) 111 - 120 + agacgcctgc 2.42
 Hoxb1(2) 133 - 142 - caaacgctct 3.24
 Hoxb1(3) 133 - 142 - caaacgctct 3.41
 Hoxb1(3) 162 - 171 + agacaaaagg 3.43
 Hoxb1(2) 162 - 171 + agacaaaagg 3.26
 Hoxb1(2) 166 - 175 + aaaaggatct 3.38
 Hoxb1(3) 166 - 175 + aaaaggatct 3.55
 Hoxb1(3) 172 - 181 - atctctatct 3.55
 Hoxb1(2) 172 - 181 - atctctatct 3.38
 Hoxb1(3) 176 - 185 - ctatctagtt 3.26
 Hoxb1(2) 176 - 185 - ctatctagtt 3.09
 Hoxb1(3) 614 - 623 - tcatgctttt 3.07
 Hoxb1(2) 614 - 623 - tcatgctttt 2.9
 Hoxb1(3) 619 - 628 - cttttcctct 3.14
 Hoxb1(2) 619 - 628 - cttttcctct 2.97
 Hoxb1(2) 670 - 679 - ccagctcact 3.02
 Hoxb1(3) 670 - 679 - ccagctcact 3.19
 Hoxb1(3) 675 - 684 - tcacttatct 3.55
 Hoxb1(2) 675 - 684 - tcacttatct 3.38
 Hoxb1(2) 681 - 690 + atctcgaagg 2.66
 Hoxb1(3) 681 - 690 + atctcgaagg 2.83
 Hoxb1(2) 685 - 694 + cgaaggctgt 2.16
 Hoxb1(3) 685 - 694 + cgaaggctgt 2.33
 Hoxb1(3) 711 - 720 + agagggaacc 2.71
 Hoxb1(2) 711 - 720 + agagggaacc 2.54
 Hoxb1(2) 748 - 757 + agaggcaggt 2.06
 Hoxb1(3) 748 - 757 + agaggcaggt 2.23
Hoxb1(1) 754 - 763 + aggtggatgg 8.4
 Hoxb1(3) 754 - 763 + aggtggatgg 6.69
 Hoxb1(2) 754 - 763 + aggtggatgg 7.32
 Hoxb1(3) 758 - 767 + ggatggccag 2.02
 Hoxb1(3) 831 - 840 + agggcgattg 2.04
 Hoxb1(2) 848 - 857 + atatgggcag 2.37
 Hoxb1(3) 848 - 857 + atatgggcag 2.54
 Hoxb1(2) 879 - 888 + aaaaggctcg 2.85
 Hoxb1(3) 879 - 888 + aaaaggctcg 3.02
 Hoxb1(2) 967 - 976 + agaaggaaag 4.88
 Hoxb1(3) 967 - 976 + agaaggaaag 4.79
Hoxb1(1) 967 - 976 + agaaggaaag 4.26

>ENSMUSG00000023800
 Hoxb1(3) 72 - 81 - ctattcatca 4.01
 Hoxb1(2) 72 - 81 - ctattcatca 4.11
Hoxb1(1) 72 - 81 - ctattcatca 3.49
 Hoxb1(2) 340 - 349 - acatctttca 2.19
 Hoxb1(3) 340 - 349 - acatctttca 2.36
 Hoxb1(3) 343 - 352 - tctttcatca 2.41
 Hoxb1(2) 343 - 352 - tctttcatca 2.24
 Hoxb1(3) 347 - 356 - tcatcaatgg 2.53
 Hoxb1(2) 347 - 356 - tcatcaatgg 2.36
 Hoxb1(2) 347 - 356 + tcatcaatgg 2.36
 Hoxb1(3) 347 - 356 + tcatcaatgg 2.53
 Hoxb1(3) 378 - 387 - ctctgcctct 3.36
 Hoxb1(2) 378 - 387 - ctctgcctct 3.19
 Hoxb1(2) 387 - 396 + tgttgtatag 2.03
 Hoxb1(3) 387 - 396 + tgttgtatag 2.2
 Hoxb1(2) 433 - 442 + aaatggctcc 2.67
 Hoxb1(3) 433 - 442 + aaatggctcc 2.84
 Hoxb1(3) 493 - 502 + aggggggtgg 5.09
 Hoxb1(2) 493 - 502 + aggggggtgg 5.18
Hoxb1(1) 493 - 502 + aggggggtgg 4.56
 Hoxb1(3) 494 - 503 + ggggggtggg 2.07
 Hoxb1(3) 528 - 537 + agaatgaggg 4.23
 Hoxb1(2) 528 - 537 + agaatgaggg 4.33
Hoxb1(1) 528 - 537 + agaatgaggg 3.71
 Hoxb1(2) 554 - 563 - caaaccatgg 2.62
 Hoxb1(3) 554 - 563 - caaaccatgg 2.79
 Hoxb1(2) 606 - 615 - cattccatct 6.38
 Hoxb1(3) 606 - 615 - cattccatct 6.01
Hoxb1(1) 606 - 615 - cattccatct 6.61
 Hoxb1(3) 610 - 619 - ccatctgtgg 3.17
 Hoxb1(2) 610 - 619 - ccatctgtgg 3
Hoxb1(1) 640 - 649 + acaaggatga 3.49
 Hoxb1(3) 640 - 649 + acaaggatga 4.02
 Hoxb1(2) 640 - 649 + acaaggatga 4.11
 Hoxb1(3) 643 - 652 + aggatgatag 2.34
 Hoxb1(2) 643 - 652 + aggatgatag 2.17
 Hoxb1(3) 644 - 653 + ggatgatagg 2.63
 Hoxb1(2) 644 - 653 + ggatgatagg 2.46
 Hoxb1(3) 647 - 656 + tgataggtgt 2.44
 Hoxb1(2) 647 - 656 + tgataggtgt 2.27
 Hoxb1(3) 696 - 705 - gcagccctgg 2.07
 Hoxb1(2) 737 - 746 - ccagactcct 3.09
 Hoxb1(3) 737 - 746 - ccagactcct 3.26
 Hoxb1(3) 745 - 754 - ctatggaact 2.53
 Hoxb1(2) 745 - 754 - ctatggaact 2.36
 Hoxb1(2) 752 - 761 + acttcggtgg 2.93
 Hoxb1(3) 752 - 761 + acttcggtgg 3.1
 Hoxb1(2) 765 - 774 + tggtggattg 4.47
Hoxb1(1) 765 - 774 + tggtggattg 3.85
 Hoxb1(3) 765 - 774 + tggtggattg 4.37
Hoxb1(1) 769 - 778 + ggattgctgg 4.15
 Hoxb1(3) 769 - 778 + ggattgctgg 4.68
 Hoxb1(2) 769 - 778 + ggattgctgg 4.78
 Hoxb1(3) 801 - 810 - gcagccccgt 2.13
 Hoxb1(2) 807 - 816 - ccgtgtctct 3.12
 Hoxb1(3) 807 - 816 - ccgtgtctct 3.29
 Hoxb1(3) 823 - 832 + agcatggtgg 2.98
 Hoxb1(2) 823 - 832 + agcatggtgg 2.81
 Hoxb1(3) 827 - 836 + tggtggcagg 3.27
 Hoxb1(2) 827 - 836 + tggtggcagg 3.1
 Hoxb1(2) 855 - 864 - ccaggcagcc 3.57
 Hoxb1(3) 855 - 864 - ccaggcagcc 3.74
Hoxb1(1) 855 - 864 - ccaggcagcc 2.09
 Hoxb1(3) 903 - 912 - cgttccatca 4.6
 Hoxb1(2) 903 - 912 - cgttccatca 4.7
Hoxb1(1) 903 - 912 - cgttccatca 4.08
 Hoxb1(3) 907 - 916 + ccatcaatgg 2.84
 Hoxb1(2) 907 - 916 - ccatcaatgg 4.58
 Hoxb1(2) 907 - 916 + ccatcaatgg 2.67
Hoxb1(1) 907 - 916 - ccatcaatgg 3.96
 Hoxb1(3) 907 - 916 - ccatcaatgg 4.49
 Hoxb1(2) 961 - 970 - atatccagca 2.38
 Hoxb1(3) 961 - 970 - atatccagca 2.55
 Hoxb1(2) 978 - 987 - ggatccgttt 2.69
 Hoxb1(3) 978 - 987 - ggatccgttt 2.86

>ENSMUSG00000024087
 Hoxb1(3) 72 - 81 + agggagatga 2.6
 Hoxb1(2) 72 - 81 + agggagatga 2.43
 Hoxb1(3) 76 - 85 + agatgactgg 6.04
 Hoxb1(2) 76 - 85 + agatgactgg 6.4
Hoxb1(1) 76 - 85 + agatgactgg 6.63
 Hoxb1(3) 159 - 168 - caacccaacc 2.63
 Hoxb1(2) 159 - 168 - caacccaacc 2.46
 Hoxb1(3) 163 - 172 - ccaaccgttt 4.69
Hoxb1(1) 163 - 172 - ccaaccgttt 4.16
 Hoxb1(2) 163 - 172 - ccaaccgttt 4.78
 Hoxb1(3) 168 - 177 + cgtttgttgg 2.93
 Hoxb1(2) 168 - 177 + cgtttgttgg 2.76
 Hoxb1(3) 189 - 198 - gcctcctcct 2.44
 Hoxb1(2) 189 - 198 - gcctcctcct 2.27
 Hoxb1(3) 195 - 204 - tcctcccact 2.82
 Hoxb1(2) 195 - 204 - tcctcccact 2.65
 Hoxb1(3) 200 - 209 - ccactcaccg 2.32
 Hoxb1(2) 200 - 209 - ccactcaccg 2.14
 Hoxb1(3) 208 - 217 - cgccccctct 2.08
 Hoxb1(2) 301 - 310 + aggtggcaga 2.13
 Hoxb1(3) 301 - 310 + aggtggcaga 2.3
 Hoxb1(3) 337 - 346 - cttgccttct 3.16
 Hoxb1(2) 337 - 346 - cttgccttct 2.99
 Hoxb1(2) 362 - 371 + agaaggatgg 8.17
 Hoxb1(3) 362 - 371 + agaaggatgg 7.55
Hoxb1(1) 362 - 371 + agaaggatgg 9.25
 Hoxb1(3) 402 - 411 - gtttccaaat 2.05
 Hoxb1(2) 415 - 424 + agaggtagag 2.43
 Hoxb1(3) 415 - 424 + agaggtagag 2.6
 Hoxb1(2) 423 - 432 + agatctgtgc 2.71
 Hoxb1(3) 423 - 432 + agatctgtgc 2.88
 Hoxb1(2) 459 - 468 + aaatgctttg 3.45
 Hoxb1(3) 459 - 468 + aaatgctttg 3.62
 Hoxb1(3) 479 - 488 + agataaagat 2.14
 Hoxb1(3) 481 - 490 + ataaagataa 2.51
 Hoxb1(2) 481 - 490 + ataaagataa 2.08
 Hoxb1(2) 485 - 494 + agataaatgc 5.23
 Hoxb1(3) 485 - 494 + agataaatgc 5.14
Hoxb1(1) 485 - 494 + agataaatgc 4.61
 Hoxb1(3) 489 - 498 + aaatgcaagg 4.7
Hoxb1(1) 489 - 498 + aaatgcaagg 4.17
 Hoxb1(2) 489 - 498 + aaatgcaagg 4.79
 Hoxb1(2) 571 - 580 - gcaccaaact 2.74
 Hoxb1(3) 571 - 580 - gcaccaaact 2.91
 Hoxb1(2) 581 - 590 - cgagtcacct 2.43
 Hoxb1(3) 581 - 590 - cgagtcacct 2.6
 Hoxb1(2) 587 - 596 + acctggagga 2.21
 Hoxb1(3) 587 - 596 + acctggagga 2.38
 Hoxb1(2) 593 - 602 + aggaggaggg 3.53
Hoxb1(1) 593 - 602 + aggaggaggg 2.91
 Hoxb1(3) 593 - 602 + aggaggaggg 3.43
 Hoxb1(2) 635 - 644 + agaggcgttg 2.19
 Hoxb1(3) 635 - 644 + agaggcgttg 2.36
 Hoxb1(2) 709 - 718 + gggaggatga 2.05
 Hoxb1(3) 709 - 718 + gggaggatga 2.22
 Hoxb1(2) 713 - 722 + ggatgactga 2.57
 Hoxb1(3) 713 - 722 + ggatgactga 2.74
 Hoxb1(3) 757 - 766 + aggcggatcc 2.08
 Hoxb1(3) 774 - 783 - gctgccagct 2.22
 Hoxb1(2) 774 - 783 - gctgccagct 2.05
 Hoxb1(3) 778 - 787 - ccagctattg 2.96
 Hoxb1(2) 778 - 787 - ccagctattg 2.79
 Hoxb1(3) 801 - 810 - ccagccagag 2.1
 Hoxb1(3) 847 - 856 + gcacggatgt 2.44
 Hoxb1(2) 847 - 856 + gcacggatgt 2.27
 Hoxb1(3) 858 - 867 - agacccaact 2.91
 Hoxb1(2) 858 - 867 - agacccaact 2.74
 Hoxb1(3) 867 - 876 - tcacagatct 3.33
 Hoxb1(2) 867 - 876 - tcacagatct 3.16
 Hoxb1(3) 871 - 880 + agatctgtgg 4.45
 Hoxb1(2) 871 - 880 + agatctgtgg 4.55
Hoxb1(1) 871 - 880 + agatctgtgg 3.93
 Hoxb1(3) 897 - 906 - ccgcccctca 2.02
 Hoxb1(3) 910 - 919 - cccaccccct 2.1
 Hoxb1(3) 932 - 941 - cctccagtct 2.74
 Hoxb1(2) 932 - 941 - cctccagtct 2.57
 Hoxb1(3) 968 - 977 + agagggatgg 7.17
 Hoxb1(2) 968 - 977 + agagggatgg 7.8
Hoxb1(1) 968 - 977 + agagggatgg 8.88

Harvey Muller (hlmuller) said : #3

avkapur,

I am assuming this is text in a console, and not data in openoffice. The solution is awk, and my awk is a little rusty.

If you send a small sample of the data to me, I may be able to work up a solution for you tonight.

You can scrub the data, just make sure the respective fields contain similar representative characters. I.e. numbers where there should be numbers, alphabetic characters where there should be alphabetic characters. Use the same punctuation and whitespace that is used in the original data.

Two gene identifiers with 5 rows of data apiece should be sufficient. Make sure the last row (per gene identifier) is similar to the last row of the actual data.

You can find my email address by clicking my name, and then the overview tab.

Thanks,

Harvey

avkapur (avkapur) said : #4

just posted it, thanks.

avkapur (avkapur) said : #5

The original message was received at Tue, 15 Jul 2008 12:44:22 -0400 (EDT)
from ms4.nss.udel.edu [128.175.1.24]

   ----- The following addresses had permanent delivery errors -----
<hmuller@mlab-1420>

   ----- Transcript of session is unavailable -----

avkapur (avkapur) said : #6

>ENSMUSG00000000159
Hoxb1(3) 4 - 13 - tcatttctct 3.36
 Hoxb1(2) 4 - 13 - tcatttctct 3.19
 Hoxb1(3) 76 - 85 - acctctatcc 3.51
 Hoxb1(2) 76 - 85 - acctctatcc 3.34
 Hoxb1(2) 80 - 89 - ctatccaggc 2.73

>ENSMUSG00000023800
 Hoxb1(3) 72 - 81 - ctattcatca 4.01
 Hoxb1(2) 72 - 81 - ctattcatca 4.11
Hoxb1(1) 72 - 81 - ctattcatca 3.49
 Hoxb1(2) 340 - 349 - acatctttca 2.19
 Hoxb1(3) 340 - 349 - acatctttca 2.36
 Hoxb1(3) 343 - 352 - tctttcatca 2.41

Tobias Wolf (towolf) said : #7

Ya, that was his SSH key id, I think. mlab must be his machine.

Anyhow, I think sorting with Awk is not so quick to implement. Maybe it’s cleaner, but working on the shell, I usually do it quick and dirty. So here is my proposal, maybe Harv can top this.

Unix can sort with "sort". But you have blocks to sort, so break them up and sort those.

rm -f sorted_file
cat unsorted_file | sed -e's/[[:blank:]]\+/\t/g' -e's/^[[:blank:]]*H/ H/' | csplit -k - "/>ENSMU/" {*}
for geneid in xx*
do
  sort -k7 $geneid >> sorted_file
done
rm -v xx???

1. the "sed" part normalizes the white space, you sample was not well aligned in columns. I make them tab separated here.
2. the csplit part now writes a file for every gene id block (xx00, xx01, xx02,...). This may be seen as a disadvantage.
3. sort every small file (i.e., xx* ) and append sorted lines to the new file named sorted_file, or similar.

This is how I'd do it after thinking 10minutes about it.

Tobias Wolf (towolf) said : #8

Maybe a one-liner is better for you?

ruby -00ne'puts split("\n").sort_by{|x|x[/\s\S+$/].to_f};puts' unsorted file.

I googled this here:
http://www.unix.com/shell-programming-scripting/21536-sort-data.html#post83542

And it's so neat and compact that I like it. You need to install the package ruby for this, of course.

avkapur (avkapur) said : #9

thanks guys, this worked really well.

Harvey Muller (hlmuller) said : #10

Great solution Tobias! I'm a basic shell scriptor. Python, perl, and ruby, are scripting languages I have yet to tackle.

avkapur, that was my ssh key id, as Tobias mentioned. If you don't mind, please forward the text again to hlmuller -AT- yahoo -DOT- com.

I still want to refresh my awk skills, and the raw text as an attachment is the best to experiment with. I'll add the awk solution as a bonus.

Thanks!

Harvey