  1. #1
    Posthume82 is offline Member
    Join Date
    Jul 2010
    Posts
    2
    Rep Power
    0

    Statistical calculations in a MultiMap

    Hi everyone,

    I'm new to Java and I'm having some problems with the MultiMap tools.
    I have a text file that I parse to collect some data. From every line, I collect the sequence ID, the gene name and its corresponding alleles, and optionally the comment about the sequence (if there is one).

    The first aim of my work is to group the alleles by their corresponding genes (which is easy with a MultiMap, taking as a parameter an ArrayList that contains the list of all the alleles).

    Here is an example of my text file (I simplified it so it's easier to understand):

    Java Code:
    Sequence 1 Gene A    Allele 1, Allele 2, Allele 3    Comments
    Sequence 1 Gene A    Allele 1, Allele 2, Allele 3    Comments
    Sequence 2 Gene B    Allele 1, Allele 2, Allele 3    None
    Sequence 3 Gene C    Allele 1, Allele 2, Allele 3    Comments
    Sequence 4 Gene D    Allele 1, Allele 2, Allele 3    Comments
    Sequence 5 Gene A    Allele 1, Allele 5, Allele 6    Comments
    Sequence 6 Gene E    Allele 1, Allele 2, Allele 3    None
    Sequence 7 Gene C    Allele 4, Allele 5, Allele 6    Comments
    So I need to get something like this:

    Java Code:
    Gene A =[Allele 1, Allele 2, Allele 3, Allele 5, Allele 6]
    Gene C =[Allele 1, Allele 2, Allele 3, Allele 4, Allele 5, Allele 6]
    etc.
    which I was able to do.
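
    A minimal sketch of this grouping step, using only java.util collections in place of a MultiMap library. The class name AlleleGrouping, the method group and the {gene, allele list} row shape are assumptions made for illustration, not part of the original program:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: groups alleles under their gene, without duplicates.
final class AlleleGrouping {

    // Each row is {gene, "allele, allele, ..."}: an assumed, simplified input shape.
    static Map<String, Set<String>> group(List<String[]> rows) {
        // TreeMap and TreeSet keep genes and alleles in sorted order.
        Map<String, Set<String>> allelesByGene = new TreeMap<>();
        for (String[] row : rows) {
            Set<String> alleles = allelesByGene.computeIfAbsent(row[0], g -> new TreeSet<>());
            for (String allele : row[1].split(",")) {
                alleles.add(allele.trim());
            }
        }
        return allelesByGene;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] {"Gene A", "Allele 1, Allele 2, Allele 3"},
            new String[] {"Gene C", "Allele 1, Allele 2, Allele 3"},
            new String[] {"Gene A", "Allele 1, Allele 5, Allele 6"},
            new String[] {"Gene C", "Allele 4, Allele 5, Allele 6"});
        // Prints one "gene =[alleles]" line per gene.
        group(rows).forEach((gene, alleles) -> System.out.println(gene + " =" + alleles));
    }
}
```

    The sets sort names as plain strings, so "Allele 10" would come before "Allele 2"; a custom Comparator would be needed if numeric order matters.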

    The problem starts here: for every distinct allele, I need to calculate:
    - the total number of sequences in which the allele appears
    - the number of redundant sequences
    - the number of non-redundant sequences
    - the number of sequences that contain a comment

    To sum up, when I have finished reading the file, I need to be able to say, for every allele, how many sequences are associated with it and, among those sequences, how many are redundant, how many are not, and how many contain a comment.
    All this while keeping the order defined earlier, meaning the alleles grouped by their corresponding genes.

    For example, for allele 1 of Gene A, I need to get output like this:

    Java Code:
    Gene A
          Allele 1: Total sequences : 3
                      Redundant sequences : 1
                      Non-redundant sequences : 1
                      Comments : 2
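
    One way these four counters could be kept is sketched below. It rests on assumptions the post does not spell out: "total" counts every line, "redundant" counts distinct sequence IDs seen on more than one line, "non-redundant" counts IDs seen on exactly one line, and "comments" counts distinct IDs that carry a comment. Under that reading the simplified file gives the figures shown for allele 1 of Gene A. The class AlleleStats and its method names are hypothetical:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical per-allele counter; every name here is illustrative.
final class AlleleStats {
    // Sequence ID -> number of file lines that pair it with this allele.
    private final Map<String, Integer> linesPerSequence = new HashMap<>();
    // Sequence IDs that carried a comment on at least one line.
    private final Set<String> commented = new HashSet<>();

    void add(String sequenceId, boolean hasComment) {
        linesPerSequence.merge(sequenceId, 1, Integer::sum);
        if (hasComment) {
            commented.add(sequenceId);
        }
    }

    // Every line counts, duplicates included.
    int total() {
        int sum = 0;
        for (int n : linesPerSequence.values()) sum += n;
        return sum;
    }

    // Assumption: "redundant" = distinct IDs seen on more than one line.
    int redundant() {
        int count = 0;
        for (int n : linesPerSequence.values()) if (n > 1) count++;
        return count;
    }

    // Assumption: "non-redundant" = distinct IDs seen on exactly one line.
    int nonRedundant() {
        return linesPerSequence.size() - redundant();
    }

    int withComment() {
        return commented.size();
    }

    // gene -> allele -> counters; TreeMap keeps genes and alleles sorted.
    static void record(Map<String, Map<String, AlleleStats>> byGene,
                       String gene, String allele, String sequenceId, boolean hasComment) {
        byGene.computeIfAbsent(gene, g -> new TreeMap<>())
              .computeIfAbsent(allele, a -> new AlleleStats())
              .add(sequenceId, hasComment);
    }

    public static void main(String[] args) {
        Map<String, Map<String, AlleleStats>> byGene = new TreeMap<>();
        record(byGene, "Gene A", "Allele 1", "Sequence 1", true);
        record(byGene, "Gene A", "Allele 1", "Sequence 1", true); // duplicated line
        record(byGene, "Gene A", "Allele 1", "Sequence 5", true);
        AlleleStats s = byGene.get("Gene A").get("Allele 1");
        System.out.println("Total sequences : " + s.total());                // 3
        System.out.println("Redundant sequences : " + s.redundant());        // 1
        System.out.println("Non-redundant sequences : " + s.nonRedundant()); // 1
        System.out.println("Comments : " + s.withComment());                 // 2
    }
}
```

    Keeping the per-sequence line counts, rather than four plain integers, means the redundant and non-redundant figures can be derived at the end without reading the file twice.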

    What would you suggest as a solution to my problem?

    Any help would be really appreciated.

    PS: sorry for my English, I'm French :o

  2. #2
    Norm is offline Moderator
    Join Date
    Jun 2008
    Location
    SW Missouri
    Posts
    17,331
    Rep Power
    25


    This sounds like a program design problem, not a Java programming problem.
    Your program is to read in some data, search through it and organize it.
    Do you have a design that you are having trouble writing/implementing in Java?
    If so, please explain what the coding problem is.

  3. #3
    Posthume82 is offline Member
    Join Date
    Jul 2010
    Posts
    2
    Rep Power
    0


    Hi Norm, and thank you very much for your reply.

    I do think you are right. I tried to create a superclass holding all the methods I will need to parse the text file, to make the data easier to process, but as I am new, it's a bit hard for me to do.

    Here is my file (only a part, because it's really long):

    Java Code:
    Sequence number	Sequence ID	Functionality	V-GENE and allele	V-REGION score	V-REGION identity %	V-REGION identity nt	J-GENE and allele	J-REGION score	J-REGION identity %	J-REGION identity nt	D-GENE and allele	D-REGION reading frame	CDR1-IMGT length	CDR2-IMGT length	CDR3-IMGT length	CDR-IMGT lengths	FR-IMGT lengths	AA JUNCTION	JUNCTION frame	Orientation	Functionality comment	V-REGION potential ins/del	J-GENE and allele comment	Sequence	
    1	imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u	productive	IGHV4-39*07, or IGHV4-4*07 or IGHV4-59*04 or IGHV4-59*05 or IGHV4-b*02 (see comment)	619	68.77	196/285 nt	IGHJ4*03	195	89.58	43/48 nt	IGHD3-22*01	2	8	8	13	8.8.13	[25.17.38.11]	CARYDYYGSSYFDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ; 10 AA in  IGHV4-39*07), different CDR2-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ;  7 AA in  IGHV4-39*07), and low V-REGION identity (68.77% )		atgcaaatcctctgaatctacatggtaaatataggtttgtctataccacaaacagaaaaacatgagatcacagttctctctacagttactgagcacacaggacctcaccatgggatggagctgtatcatcctcttcttggtagcaacagctacaggtaaggggctcacagtagcaggcttgaggtctggacatatatatgggtgacaatgacatccactttgcctttctctccacaggtgtccactcccaggtccaactgcaggagagcggtccaggtcttgtgagacctagccagaccctgagcctgacctgcaccgtgtctggcagcaccttcagcagctactggatgcactgggtgagacagccacctggacgaggtcttgagtggattggaaggattgatcctaatagtggtggtactaagtacaatgagaagttcaagagcagagtgacaatgctggtagacaccagcaagaaccagttcagcctgagactcagcagcgtgacagccgccgacaccgcggtctattattgtgcaagatacgattactacggtagtagctactttgactactggggtcaaggcagcctcgtcacagtctcctcaggt	
    2	imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var	productive	IGHV2-5*08, or IGHV2-70*01 (see comment)	628	69.12	197/285 nt	IGHJ4*01	136	76.6	36/47 nt	IGHD3-10*01	2	8	7	10	8.7.10	[25.17.38.11]	CARERDYRLDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var ; 10 AA in  IGHV2-5*08), and low V-REGION identity (69.12% )		tcagagcatggctgtcctggcattactcttctgcctggtaacattcccaagctgtatcctttcccaggtgcagctgaaggagtcaggacctggcctggtggcgccctcacagagcctgtccatcacatgcaccgtctcagggttctcattaaccggctatggtgtaaactgggttcgccagcctccaggaaagggtctggagtggctgggaatgatttggggtgatggaaacacagactataattcagctctcaaatccagactgagcatcagcaaggacaactccaagagccaagttttcttaaaaatgaacagtctgcacactgatgacacagccaggtactactgtgccagagagagagattataggcttgactactggggccaaggcaccactctcacagtctcctca	
    3	imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne	productive	IGHV2-5*08 (see comment)	524	65.14	185/284 nt	IGHJ4*01	159	81.25	39/48 nt	IGHD2-15*01	2	8	7	11	8.7.11	[24.17.38.11]	CARNYWGTSMDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne ; 10 AA in  IGHV2-5*08), and low V-REGION identity (65.14% )		ctgcaggaatgaagcagtcaggacctggcctagtgcagccctcacagagcctgtccatcacctgcacagtctctggtttctcattaactacctatggtgtacactggattcgccagtctccaggaaagggtctggagtggctgggagtgatatggagtggtggaagcacagactataatgcagctttcatatccagactgagcatcaacaaggacaattccaagagccaagttttctttaaaatgaacagtctgcaagctaatgacacagccatatattactgtgccagaaattattggggaacctctatggactactggggtcaaggaacctcagtcaccgtctcctcagccaaaacgacacccccatctgtctatccactggaattcgatatcaagctt	
    4	imgtligm_A25486_A25486_H.sapiens_mRNA_for_T-cell_r	No results	
    5	imgtligm_A25487_A25487_H.sapiens_mRNA_for_T-cell_r	No results	
    6	imgtligm_A25488_A25488_H.sapiens_mRNA_for_T-cell_r	No results	
    7	imgtligm_A25489_A25489_H.sapiens_mRNA_for_T-cell_r	No results	
    8	imgtligm_A25490_A25490_H.sapiens_mRNA_for_T-cell_r	No results
    Here is my class that parses the text file and cleans the data (only a part):

    Java Code:
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.regex.Pattern;

    public class Parser
    {
        // Data lines contain a sequence number, a tab, then an "imgtligm_" sequence ID.
        private static final Pattern DATA_LINE = Pattern.compile("[0-9]+\timgtligm_(.*)");

        public static void main(String[] args) throws IOException
        {
            // try-with-resources closes the reader even if parsing fails
            try (BufferedReader br = new BufferedReader(
                    new FileReader("/home/nadia/Bureau/Summary/test.txt")))
            {
                String ligne;
                while ((ligne = br.readLine()) != null)
                {
                    if (!DATA_LINE.matcher(ligne).find())
                    {
                        continue; // skip the header line
                    }
                    String[] data = ligne.trim().split("\t");

                    // Sequence ID: second "_"-separated token of column 1, e.g. A03900
                    String[] idSequence = data[1].split("_");
                    System.out.println(idSequence[1]);

                    // Column 3 holds the V genes and alleles,
                    // e.g. "IGHV2-5*08, or IGHV2-70*01 (see comment)"
                    if (data.length > 4)
                    {
                        String nomV = data[3].replace("(see comment)", "").replace(",", "").trim();
                        for (String allele : nomV.split("\\s+or\\s+"))
                        {
                            // The gene name is the part of the allele before '*'
                            String geneV = allele.split("\\*")[0];
                            System.out.println(geneV + "    " + allele);
                        }
                    }

                    // Column 22 holds the comment on the V-REGION
                    if (data.length > 22)
                    {
                        String comments = data[22];
                        if (comments.contains("show different CDR"))
                        {
                            System.out.println("Sequence with CDR1 and CDR2 of different lengths");
                        }
                        // Assumption: no "low V-REGION identity" warning means identity > 85 %
                        if (!comments.contains("low V-REGION identity"))
                        {
                            System.out.println("Sequence with an identity percentage > 85 %");
                        }
                    }
                }
            }
        }
    }
    Now I am stuck trying to redesign the architecture of my code so that I can easily store my data in an array where I will be able to count what I listed above.

    Thank you for any pointers on where I could start.
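
    One possible starting point for such a redesign, sketched under assumptions: move the parsing of a single line into a method that returns a small object, so that printing and counting can live elsewhere. The class SummaryLine and its members are hypothetical; the column positions (1 for the sequence ID, 3 for the V genes and alleles, 22 for the comment) are read off the header line of the file shown above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical value class for one parsed line; the names are illustrative.
final class SummaryLine {
    final String sequenceId;
    final List<String> vAlleles; // e.g. [IGHV2-5*08, IGHV2-70*01]
    final String comment;        // empty when the column is missing or blank

    private SummaryLine(String sequenceId, List<String> vAlleles, String comment) {
        this.sequenceId = sequenceId;
        this.vAlleles = vAlleles;
        this.comment = comment;
    }

    // The gene name is the part of an allele before '*'.
    static String geneOf(String allele) {
        return allele.split("\\*")[0];
    }

    // Returns empty for the header line and for "No results" lines.
    static Optional<SummaryLine> parse(String line) {
        String[] data = line.trim().split("\t");
        if (data.length < 4 || !data[0].matches("[0-9]+")) {
            return Optional.empty();
        }
        String[] idParts = data[1].split("_");
        String id = idParts.length > 1 ? idParts[1] : data[1];
        // Drop the "(see comment)" marker and commas, then split on "or".
        String field = data[3].replace("(see comment)", "").replace(",", "").trim();
        List<String> alleles = new ArrayList<>();
        for (String allele : field.split("\\s+or\\s+")) {
            if (!allele.isEmpty()) {
                alleles.add(allele);
            }
        }
        String comment = data.length > 22 ? data[22].trim() : "";
        return Optional.of(new SummaryLine(id, alleles, comment));
    }

    public static void main(String[] args) {
        // Build a 23-column sample line; "-" stands in for columns not used here.
        String[] cols = new String[23];
        Arrays.fill(cols, "-");
        cols[0] = "2";
        cols[1] = "imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var";
        cols[3] = "IGHV2-5*08, or IGHV2-70*01 (see comment)";
        cols[22] = "low V-REGION identity (69.12% )";
        SummaryLine parsed = parse(String.join("\t", cols)).get();
        System.out.println(parsed.sequenceId + " " + parsed.vAlleles);
    }
}
```

    With the parsing isolated like this, the reading loop only has to call parse on each line and hand the result to whatever structure does the counting.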
