# Statistical calculations in a MultiMap

• 07-15-2010, 02:11 PM
Posthume82
Statistical calculations in a MultiMap
Hi everyone,

I'm new to Java and I'm having some problems with MultiMap tools.
I have a text file that I parse to collect some data. From every line, I collect the sequence ID, the gene name with its corresponding alleles and, optionally, the comments about the sequence (if there are any).

The first aim of my work is to sort the alleles according to their corresponding genes (which is easy with the MultiMap, by taking as a parameter an ArrayList that contains the list of all the alleles).

Here is an example of my text file (I simplified it so it's easier to understand):

Code:

```
Séquence 1 Gène A    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 1 Gène A    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 2 Gène B    Allèle 1, Allèle 2, Allèle 3    None
Séquence 3 Gène C    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 4 Gène D    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 5 Gène A    Allèle 1, Allèle 5, Allèle 6    Comments
Séquence 6 Gène E    Allèle 1, Allèle 2, Allèle 3    None
Séquence 7 Gène C    Allèle 4, Allèle 5, Allèle 6    Comments
```
So I need to get something like this:

Code:

```
Gène A =[Allèle 1, Allèle 2, Allèle 3, Allèle 5, Allèle 6]
Gène C =[Allèle 1, Allèle 2, Allèle 3, Allèle 4, Allèle 5, Allèle 6]
etc
```
which I managed to do.
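For reference, the grouping step described above can also be sketched with plain JDK collections rather than a MultiMap class. This is only a minimal sketch: the class name `AlleleGrouping`, the method `group`, and the two-element `String[]` rows (gene, comma-separated alleles) are assumptions made for illustration, not part of the original program.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class AlleleGrouping {

    /**
     * Groups alleles by gene. Each row is {gene, commaSeparatedAlleles}
     * (a hypothetical layout chosen for this sketch).
     */
    static Map<String, Set<String>> group(List<String[]> rows) {
        // TreeMap keeps genes sorted; TreeSet keeps alleles sorted and unique.
        Map<String, Set<String>> byGene = new TreeMap<>();
        for (String[] row : rows) {
            Set<String> alleles = byGene.computeIfAbsent(row[0], g -> new TreeSet<>());
            for (String allele : row[1].split(",")) {
                alleles.add(allele.trim());
            }
        }
        return byGene;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Gene A", "Allele 1, Allele 2, Allele 3"},
            new String[] {"Gene C", "Allele 1, Allele 2, Allele 3"},
            new String[] {"Gene A", "Allele 1, Allele 5, Allele 6"},
            new String[] {"Gene C", "Allele 4, Allele 5, Allele 6"});
        // First line printed: Gene A =[Allele 1, Allele 2, Allele 3, Allele 5, Allele 6]
        group(rows).forEach((gene, alleles) -> System.out.println(gene + " =" + alleles));
    }
}
```

Note that `TreeSet` orders the names as strings, so "Allele 10" would sort before "Allele 2"; a custom comparator would be needed if numeric order matters.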

The problem starts here: for every distinct allele, I need to calculate:
- the total number of sequences in which the allele appears
- the number of redundant sequences
- the number of non-redundant sequences
- the number of sequences that contain a comment

To sum up, once I have finished reading the file, I need to be able to say, for every allele, how many sequences are associated with it and, among those sequences, how many are redundant, how many are not, and how many contain a comment.
All this while keeping the order defined first, that is, with the alleles sorted according to their corresponding genes.

For example, for allele 1 of gene A, I need to get output like this:

Code:

```
Gène A       Allèle 1: Total sequences : 3
                       Redundant sequences : 1
                       Non-redundant sequences : 1
                       Comments : 2
```
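One possible way to keep these four counters per allele is sketched below. It rests on an interpretation of the example that the post does not state explicitly: "total" counts lines, a "redundant" sequence is a sequence ID seen on more than one line, a "non-redundant" sequence is an ID seen exactly once, and "comments" counts distinct IDs with a comment. The names `AlleleStatsTable`, `Stats`, `record` and `get` are hypothetical.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class AlleleStatsTable {

    /** Counters for one (gene, allele) pair. */
    static class Stats {
        // Number of lines seen for each sequence ID.
        private final Map<String, Integer> linesPerSequence = new LinkedHashMap<>();
        // Sequence IDs that carried a comment on at least one line.
        private final Set<String> commentedSequences = new HashSet<>();

        void add(String sequenceId, boolean hasComment) {
            linesPerSequence.merge(sequenceId, 1, Integer::sum);
            if (hasComment) {
                commentedSequences.add(sequenceId);
            }
        }

        /** Total number of lines (assumed meaning of "total sequences"). */
        int total() {
            int sum = 0;
            for (int n : linesPerSequence.values()) {
                sum += n;
            }
            return sum;
        }

        /** Distinct sequence IDs that appear more than once. */
        int redundant() {
            int count = 0;
            for (int n : linesPerSequence.values()) {
                if (n > 1) {
                    count++;
                }
            }
            return count;
        }

        /** Distinct sequence IDs that appear exactly once. */
        int nonRedundant() {
            return linesPerSequence.size() - redundant();
        }

        /** Distinct sequence IDs with a comment. */
        int withComment() {
            return commentedSequences.size();
        }
    }

    // gene -> allele -> counters; TreeMap keeps both levels sorted.
    private final Map<String, Map<String, Stats>> byGene = new TreeMap<>();

    void record(String gene, String allele, String sequenceId, boolean hasComment) {
        byGene.computeIfAbsent(gene, g -> new TreeMap<>())
              .computeIfAbsent(allele, a -> new Stats())
              .add(sequenceId, hasComment);
    }

    /** Returns the counters for the pair, or null if it was never recorded. */
    Stats get(String gene, String allele) {
        Map<String, Stats> alleles = byGene.get(gene);
        return alleles == null ? null : alleles.get(allele);
    }
}
```

With the simplified file above (sequence 1 listed twice, sequence 5 once, all with comments), this interpretation gives 3, 1, 1 and 2 for allele 1 of gene A, which matches the expected output.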

What solutions would you suggest for my problem?

Any help would be really appreciated.

PS: Sorry for my English, I'm French :o
• 07-15-2010, 03:19 PM
Norm
This sounds like a program design problem, not a Java programming problem.
Your program is to read in some data, search through it and organize it.
Do you have a design that you are having trouble writing/implementing in Java?
If so, please explain what the coding problem is.
• 07-15-2010, 03:47 PM
Posthume82

I actually think you are right. I tried to create a superclass holding all the methods I will need to parse the text file, to make the processing easier, but as I am new, it's a bit hard for me to do.

Here is my file (only a part, because it's really long):

Code:

```Sequence number        Sequence ID        Functionality        V-GENE and allele        V-REGION score        V-REGION identity %        V-REGION identity nt        J-GENE and allele        J-REGION score        J-REGION identity %        J-REGION identity nt        D-GENE and allele        D-REGION reading frame        CDR1-IMGT length        CDR2-IMGT length        CDR3-IMGT length        CDR-IMGT lengths        FR-IMGT lengths        AA JUNCTION        JUNCTION frame        Orientation        Functionality comment        V-REGION potential ins/del        J-GENE and allele comment        Sequence        1        imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u        productive        IGHV4-39*07, or IGHV4-4*07 or IGHV4-59*04 or IGHV4-59*05 or IGHV4-b*02 (see comment)        619        68.77        196/285 nt        IGHJ4*03        195        89.58        43/48 nt        IGHD3-22*01        2        8        8        13        8.8.13        [25.17.38.11]        CARYDYYGSSYFDYW        in-frame        +                The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ; 10 AA in  IGHV4-39*07), different CDR2-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ;  7 AA in  IGHV4-39*07), and low V-REGION identity (68.77% )                
atgcaaatcctctgaatctacatggtaaatataggtttgtctataccacaaacagaaaaacatgagatcacagttctctctacagttactgagcacacaggacctcaccatgggatggagctgtatcatcctcttcttggtagcaacagctacaggtaaggggctcacagtagcaggcttgaggtctggacatatatatgggtgacaatgacatccactttgcctttctctccacaggtgtccactcccaggtccaactgcaggagagcggtccaggtcttgtgagacctagccagaccctgagcctgacctgcaccgtgtctggcagcaccttcagcagctactggatgcactgggtgagacagccacctggacgaggtcttgagtggattggaaggattgatcctaatagtggtggtactaagtacaatgagaagttcaagagcagagtgacaatgctggtagacaccagcaagaaccagttcagcctgagactcagcagcgtgacagccgccgacaccgcggtctattattgtgcaagatacgattactacggtagtagctactttgactactggggtcaaggcagcctcgtcacagtctcctcaggt        2        imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var        productive        IGHV2-5*08, or IGHV2-70*01 (see comment)        628        69.12        197/285 nt        IGHJ4*01        136        76.6        36/47 nt        IGHD3-10*01        2        8        7        10        8.7.10        [25.17.38.11]        CARERDYRLDYW        in-frame        +                The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var ; 10 AA in  IGHV2-5*08), and low V-REGION identity (69.12% )                tcagagcatggctgtcctggcattactcttctgcctggtaacattcccaagctgtatcctttcccaggtgcagctgaaggagtcaggacctggcctggtggcgccctcacagagcctgtccatcacatgcaccgtctcagggttctcattaaccggctatggtgtaaactgggttcgccagcctccaggaaagggtctggagtggctgggaatgatttggggtgatggaaacacagactataattcagctctcaaatccagactgagcatcagcaaggacaactccaagagccaagttttcttaaaaatgaacagtctgcacactgatgacacagccaggtactactgtgccagagagagagattataggcttgactactggggccaaggcaccactctcacagtctcctca        3        imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne        productive        IGHV2-5*08 (see comment)        524        65.14        185/284 nt        IGHJ4*01        159        81.25        39/48 nt        IGHD2-15*01        2        8        7        11        8.7.11        [24.17.38.11]        CARNYWGTSMDYW        in-frame        + 
               The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne ; 10 AA in  IGHV2-5*08), and low V-REGION identity (65.14% )                ctgcaggaatgaagcagtcaggacctggcctagtgcagccctcacagagcctgtccatcacctgcacagtctctggtttctcattaactacctatggtgtacactggattcgccagtctccaggaaagggtctggagtggctgggagtgatatggagtggtggaagcacagactataatgcagctttcatatccagactgagcatcaacaaggacaattccaagagccaagttttctttaaaatgaacagtctgcaagctaatgacacagccatatattactgtgccagaaattattggggaacctctatggactactggggtcaaggaacctcagtcaccgtctcctcagccaaaacgacacccccatctgtctatccactggaattcgatatcaagctt        4        imgtligm_A25486_A25486_H.sapiens_mRNA_for_T-cell_r        No results        5        imgtligm_A25487_A25487_H.sapiens_mRNA_for_T-cell_r        No results        6        imgtligm_A25488_A25488_H.sapiens_mRNA_for_T-cell_r        No results        7        imgtligm_A25489_A25489_H.sapiens_mRNA_for_T-cell_r        No results        8        imgtligm_A25490_A25490_H.sapiens_mRNA_for_T-cell_r        No results```
Here is my class that parses the whole text file and cleans the data (a part):

Code:

```
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.sql.*;

public class Parser {

    public static void main(String[] args) throws IOException {
        FileReader fr = new FileReader("/home/nadia/Bureau/Summary/test.txt");
        BufferedReader br = new BufferedReader(fr);
        String ligne;
        while ((ligne = br.readLine()) != null) {
            // Only data rows: a sequence number, a tab, then an "imgtligm_" ID
            Pattern pattern = Pattern.compile("[0-9]+\timgtligm_(.*)");
            Matcher matcher = pattern.matcher(ligne);
            while (matcher.find()) {
                String s = ligne.trim();
                String[] data = s.split("\\t");

                // Id Sequence
                String[] idSequence = data[1].split("_");
                System.out.println(idSequence[1]);

                if (data.length > 4) {
                    String[] nomV = data[3].split("or");

                    // Group V Gene and its alleles
                    String[] geneV = nomV[0].split("\\*");
                    if (nomV[0].indexOf("(see comment)") > 0) {
                        System.out.println(geneV[0] + "    " + nomV[0].substring(0, 11));
                    } else {
                        if (nomV[0].indexOf(",") > 0) {
                            String[] nomV1 = nomV[0].split(",");
                            System.out.println(geneV[0] + "    " + nomV1[0]);
                        } else {
                            System.out.println(geneV[0] + "    " + nomV[0]);
                        }
                    }

                    // Alleles between the first and the last one
                    for (int i = 1; i < nomV.length - 1; i++) {
                        String[] geneVi = nomV[i].split("\\*");
                        System.out.println(geneVi[0] + "    " + nomV[i]);
                    }

                    // Last allele, which may carry the "(see comment)" suffix
                    String lastOneV = nomV[nomV.length - 1];
                    String[] geneVlast = lastOneV.split("\\*");
                    if (lastOneV.indexOf("(see comment)") > 0 && nomV.length > 1) {
                        System.out.println(geneVlast[0] + "    " + lastOneV.substring(0, 12));
                    } else if (nomV.length > 1) {
                        System.out.println(geneVlast[0] + "    " + lastOneV);
                    }
                }

                // Comments (guard raised from 20 to 22 so that data[22] always exists)
                if (data.length > 22) {
                    String comments = data[22];
                    if (comments.indexOf("show different CDR") > 0) {
                        // "Sequence with CDR1 and CDR2 of different lengths"
                        String seqID = "Séquence avec des CDR1 et CDR2 de longueur différente";
                        System.out.println(seqID);
                    }
                    // Note: == 0 only matches when the comment starts with this text
                    if (comments.indexOf("low V-REGION identity") == 0) {
                        // "Sequence with an identity percentage > 85 %"
                        String seqCDR = "Séquence ayant un pourcentage d'identité > 85 %";
                        System.out.println(seqCDR);
                    }
                }
            }
        }
        br.close();
    }
}
```
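Splitting the V-GENE field on the text "or" and cutting fixed-length substrings depends on the exact length of each name. As an alternative, a regular expression can pull the allele names out directly. The sketch below assumes that every allele looks like the sample rows (letters, digits, `-` or `/`, then `*` and a number); the class `VGeneField` and its methods are hypothetical names, and real IMGT names may need a wider pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VGeneField {

    // A gene name, a literal '*', then the allele number, e.g. "IGHV4-39*07".
    // Assumption: names only contain letters, digits, '-' and '/'.
    private static final Pattern ALLELE = Pattern.compile("([A-Za-z0-9/-]+)\\*(\\d+)");

    /** Returns every allele name found in the field, in order of appearance. */
    static List<String> alleles(String field) {
        List<String> result = new ArrayList<>();
        Matcher m = ALLELE.matcher(field);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    /** Returns the gene part of an allele name (everything before '*'). */
    static String geneOf(String allele) {
        int star = allele.indexOf('*');
        return star < 0 ? allele : allele.substring(0, star);
    }

    public static void main(String[] args) {
        String field = "IGHV2-5*08, or IGHV2-70*01 (see comment)";
        for (String allele : alleles(field)) {
            System.out.println(geneOf(allele) + "    " + allele);
        }
    }
}
```

Separators such as ", or", " or " and the "(see comment)" suffix are skipped because none of them contains a `*` followed by digits.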
Now I am stuck trying to redesign the architecture of my code so that I can easily store my data in an array where I will be able to count what I listed above.

Thank you for any indication of where I could start.
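One possible starting point for such a redesign is to separate parsing from printing: turn each data line into a small object first, keep the objects in a list, and do the counting afterwards. The sketch below is only an illustration of that idea. The class `SummaryRow` and its members are hypothetical, and the column positions (1, 3 and 22) are simply the ones used by the parser above.

```java
import java.util.ArrayList;
import java.util.List;

class SummaryRow {

    // Column positions taken from the parser above (assumed to hold for the whole file).
    static final int ID = 1;
    static final int V_GENE = 3;
    static final int COMMENT = 22;

    final String sequenceId;
    final String vGeneField;
    final String comment; // empty when the row has no comment

    SummaryRow(String sequenceId, String vGeneField, String comment) {
        this.sequenceId = sequenceId;
        this.vGeneField = vGeneField;
        this.comment = comment;
    }

    boolean hasComment() {
        return !comment.isEmpty();
    }

    /** Parses one tab-separated line; returns null for headers and "No results" rows. */
    static SummaryRow parse(String line) {
        String[] data = line.trim().split("\t");
        if (data.length <= V_GENE || !data[ID].startsWith("imgtligm_")) {
            return null;
        }
        String comment = data.length > COMMENT ? data[COMMENT].trim() : "";
        return new SummaryRow(data[ID], data[V_GENE], comment);
    }

    /** Parses all lines and keeps only the usable rows. */
    static List<SummaryRow> parseAll(List<String> lines) {
        List<SummaryRow> rows = new ArrayList<>();
        for (String line : lines) {
            SummaryRow row = parse(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

Once the rows are in a list, the statistics become a loop over objects instead of logic mixed into the file-reading code, which should also make each part easier to test separately.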