Thread: Statistical calculations in a MultiMap
- 07-15-2010, 02:11 PM #1
Member
- Join Date
- Jul 2010
- Posts
- 2
- Rep Power
- 0
Hi everyone,
I'm new to Java and I'm having some problems with MultiMap tools.
I have a text file that I parse to collect some data. From every line, I collect the sequence ID, the gene name and its corresponding alleles, and optionally the comments about the sequence (if there are any).
The aim of my work is first to sort the alleles according to their corresponding genes (which is easy with a MultiMap that takes as a parameter an ArrayList containing the list of all the alleles).
Here is an example of my text file (I simplified it so it's easier to understand):
Java Code:
Sequence 1  Gene A  Allele 1, Allele 2, Allele 3  Comments
Sequence 1  Gene A  Allele 1, Allele 2, Allele 3  Comments
Sequence 2  Gene B  Allele 1, Allele 2, Allele 3  None
Sequence 3  Gene C  Allele 1, Allele 2, Allele 3  Comments
Sequence 4  Gene D  Allele 1, Allele 2, Allele 3  Comments
Sequence 5  Gene A  Allele 1, Allele 5, Allele 6  Comments
Sequence 6  Gene E  Allele 1, Allele 2, Allele 3  None
Sequence 7  Gene C  Allele 4, Allele 5, Allele 6  Comments
So I need to get something like this:
Java Code:
Gene A = [Allele 1, Allele 2, Allele 3, Allele 5, Allele 6]
Gene C = [Allele 1, Allele 2, Allele 3, Allele 4, Allele 5, Allele 6]
etc.
which I was able to do.
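For reference, a minimal sketch of that grouping step, assuming plain JDK collections (a `TreeMap` of `TreeSet`s) stand in for the MultiMap; the class name `AlleleGrouping` and the method `addLine` are hypothetical, and the rows are the simplified sample data from this post:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class AlleleGrouping {

    /** Adds every allele of one parsed line to the set kept for its gene. */
    static void addLine(Map<String, Set<String>> allelesByGene, String gene, List<String> alleles) {
        // TreeSet drops duplicates and keeps the alleles sorted by name.
        allelesByGene.computeIfAbsent(gene, g -> new TreeSet<>()).addAll(alleles);
    }

    public static void main(String[] args) {
        // TreeMap keeps the genes sorted by name as well.
        Map<String, Set<String>> allelesByGene = new TreeMap<>();
        addLine(allelesByGene, "Gene A", Arrays.asList("Allele 1", "Allele 2", "Allele 3"));
        addLine(allelesByGene, "Gene C", Arrays.asList("Allele 1", "Allele 2", "Allele 3"));
        addLine(allelesByGene, "Gene A", Arrays.asList("Allele 1", "Allele 5", "Allele 6"));
        addLine(allelesByGene, "Gene C", Arrays.asList("Allele 4", "Allele 5", "Allele 6"));
        allelesByGene.forEach((gene, alleles) -> System.out.println(gene + " = " + alleles));
    }
}
```

The sets sort names as plain strings, so "Allele 10" would come before "Allele 2"; a custom comparator would be needed if numeric order matters.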
The problem starts here. For every distinct allele, I need to calculate:
- the total number of sequences in which the allele appears
- the number of redundant sequences
- the number of non-redundant sequences
- the number of sequences that contain a comment
To sum up, once I have finished reading the file, I need to be able to say, for every allele, how many sequences are associated with it and, among those sequences, how many are redundant, how many are not, and how many contain a comment.
All this must keep the order defined earlier, that is, the alleles sorted according to their corresponding genes.
For example, for allele 1 of Gene A, I need to get output like this:
Java Code:
Gene A Allele 1:
Total sequences: 3
Redundant sequences: 1
Non-redundant sequences: 1
Comments: 2
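One possible way to keep these counters is sketched below. The reading of the four numbers is an assumption inferred from the example figures: "total" counts every line listing the allele, "redundant" counts distinct sequence IDs that occur on more than one line, "non-redundant" counts IDs that occur exactly once, and "comments" counts distinct IDs that carry a comment. The names `AlleleStats`, `Stats`, and `addLine` are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class AlleleStats {

    /** Counters for one (gene, allele) pair. */
    static final class Stats {
        int total;                                                       // every line that lists the allele
        final Map<String, Integer> linesPerSequence = new HashMap<>();   // sequence ID -> number of lines
        final Set<String> commented = new HashSet<>();                   // sequence IDs that have a comment

        void add(String sequenceId, boolean hasComment) {
            total++;
            linesPerSequence.merge(sequenceId, 1, Integer::sum);
            if (hasComment) {
                commented.add(sequenceId);
            }
        }

        /** Distinct sequence IDs seen on more than one line (assumed meaning of "redundant"). */
        long redundant() {
            return linesPerSequence.values().stream().filter(n -> n > 1).count();
        }

        /** Distinct sequence IDs seen on exactly one line. */
        long nonRedundant() {
            return linesPerSequence.values().stream().filter(n -> n == 1).count();
        }

        int withComment() {
            return commented.size();
        }
    }

    /** Updates gene -> allele -> counters for one parsed line; TreeMap keeps both levels sorted. */
    static void addLine(Map<String, Map<String, Stats>> statsByGene, String sequenceId,
                        String gene, List<String> alleles, boolean hasComment) {
        Map<String, Stats> perAllele = statsByGene.computeIfAbsent(gene, g -> new TreeMap<>());
        for (String allele : alleles) {
            perAllele.computeIfAbsent(allele, a -> new Stats()).add(sequenceId, hasComment);
        }
    }

    public static void main(String[] args) {
        Map<String, Map<String, Stats>> statsByGene = new TreeMap<>();
        List<String> a123 = Arrays.asList("Allele 1", "Allele 2", "Allele 3");
        // The three Gene A lines of the simplified sample file.
        addLine(statsByGene, "Sequence 1", "Gene A", a123, true);
        addLine(statsByGene, "Sequence 1", "Gene A", a123, true);
        addLine(statsByGene, "Sequence 5", "Gene A", Arrays.asList("Allele 1", "Allele 5", "Allele 6"), true);

        Stats s = statsByGene.get("Gene A").get("Allele 1");
        System.out.println("Gene A Allele 1:");
        System.out.println("Total sequences: " + s.total);                   // 3
        System.out.println("Redundant sequences: " + s.redundant());         // 1
        System.out.println("Non-redundant sequences: " + s.nonRedundant());  // 1
        System.out.println("Comments: " + s.withComment());                  // 2
    }
}
```

Nesting one map per gene keeps the alleles grouped under their gene, so the report can be printed in the same order as the grouping.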
What solutions would you suggest for my problem?
Any help will be really appreciated.
PS: Sorry for my English, I'm French :o
- 07-15-2010, 03:19 PM #2
This sounds like a program design problem, not a Java programming problem.
Your program is to read in some data, search through it and organize it.
Do you have a design that you are having trouble writing/implementing in Java?
If so, please explain what the coding problem is.
- 07-15-2010, 03:47 PM #3
Hi Norm, and thank you very much for your reply.
I actually think you are right. I tried to create a superclass to hold all the methods I will need for parsing the text file, so that the data is easier to process afterwards, but as I am new, it's a bit hard for me to do.
Here is my file (only a part, because it's really long):
Java Code:
Sequence number Sequence ID Functionality V-GENE and allele V-REGION score V-REGION identity % V-REGION identity nt J-GENE and allele J-REGION score J-REGION identity % J-REGION identity nt D-GENE and allele D-REGION reading frame CDR1-IMGT length CDR2-IMGT length CDR3-IMGT length CDR-IMGT lengths FR-IMGT lengths AA JUNCTION JUNCTION frame Orientation Functionality comment V-REGION potential ins/del J-GENE and allele comment Sequence
1 imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u productive IGHV4-39*07, or IGHV4-4*07 or IGHV4-59*04 or IGHV4-59*05 or IGHV4-b*02 (see comment) 619 68.77 196/285 nt IGHJ4*03 195 89.58 43/48 nt IGHD3-22*01 2 8 8 13 8.8.13 [25.17.38.11] CARYDYYGSSYFDYW in-frame + The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ; 10 AA in IGHV4-39*07), different CDR2-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ; 7 AA in IGHV4-39*07), and low V-REGION identity (68.77% ) atgcaaatcctctgaatctacatggtaaatataggtttgtctataccacaaacagaaaaacatgagatcacagttctctctacagttactgagcacacaggacctcaccatgggatggagctgtatcatcctcttcttggtagcaacagctacaggtaaggggctcacagtagcaggcttgaggtctggacatatatatgggtgacaatgacatccactttgcctttctctccacaggtgtccactcccaggtccaactgcaggagagcggtccaggtcttgtgagacctagccagaccctgagcctgacctgcaccgtgtctggcagcaccttcagcagctactggatgcactgggtgagacagccacctggacgaggtcttgagtggattggaaggattgatcctaatagtggtggtactaagtacaatgagaagttcaagagcagagtgacaatgctggtagacaccagcaagaaccagttcagcctgagactcagcagcgtgacagccgccgacaccgcggtctattattgtgcaagatacgattactacggtagtagctactttgactactggggtcaaggcagcctcgtcacagtctcctcaggt
2 imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var productive IGHV2-5*08, or IGHV2-70*01 (see comment) 628 69.12 197/285 nt IGHJ4*01 136 76.6 36/47 nt IGHD3-10*01 2 8 7 10 8.7.10 [25.17.38.11] CARERDYRLDYW in-frame + The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var ; 10 AA in IGHV2-5*08), and low V-REGION identity (69.12% ) tcagagcatggctgtcctggcattactcttctgcctggtaacattcccaagctgtatcctttcccaggtgcagctgaaggagtcaggacctggcctggtggcgccctcacagagcctgtccatcacatgcaccgtctcagggttctcattaaccggctatggtgtaaactgggttcgccagcctccaggaaagggtctggagtggctgggaatgatttggggtgatggaaacacagactataattcagctctcaaatccagactgagcatcagcaaggacaactccaagagccaagttttcttaaaaatgaacagtctgcacactgatgacacagccaggtactactgtgccagagagagagattataggcttgactactggggccaaggcaccactctcacagtctcctca
3 imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne productive IGHV2-5*08 (see comment) 524 65.14 185/284 nt IGHJ4*01 159 81.25 39/48 nt IGHD2-15*01 2 8 7 11 8.7.11 [24.17.38.11] CARNYWGTSMDYW in-frame + The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne ; 10 AA in IGHV2-5*08), and low V-REGION identity (65.14% ) ctgcaggaatgaagcagtcaggacctggcctagtgcagccctcacagagcctgtccatcacctgcacagtctctggtttctcattaactacctatggtgtacactggattcgccagtctccaggaaagggtctggagtggctgggagtgatatggagtggtggaagcacagactataatgcagctttcatatccagactgagcatcaacaaggacaattccaagagccaagttttctttaaaatgaacagtctgcaagctaatgacacagccatatattactgtgccagaaattattggggaacctctatggactactggggtcaaggaacctcagtcaccgtctcctcagccaaaacgacacccccatctgtctatccactggaattcgatatcaagctt
4 imgtligm_A25486_A25486_H.sapiens_mRNA_for_T-cell_r No results
5 imgtligm_A25487_A25487_H.sapiens_mRNA_for_T-cell_r No results
6 imgtligm_A25488_A25488_H.sapiens_mRNA_for_T-cell_r No results
7 imgtligm_A25489_A25489_H.sapiens_mRNA_for_T-cell_r No results
8 imgtligm_A25490_A25490_H.sapiens_mRNA_for_T-cell_r No results
Here is my class that parses the text file and cleans the data (only a part):
Java Code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class Parser {
    // Data lines start with a sequence number, a tab, and an "imgtligm_" ID.
    private static final Pattern DATA_LINE = Pattern.compile("[0-9]+\timgtligm_(.*)");

    public static void main(String[] args) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new FileReader("/home/nadia/Bureau/Summary/test.txt"))) {
            String ligne;
            while ((ligne = br.readLine()) != null) {
                if (!DATA_LINE.matcher(ligne).find()) {
                    continue; // skip the header and anything that is not a data line
                }
                String[] data = ligne.trim().split("\\t");
                // Sequence ID
                String[] idSequence = data[1].split("_");
                System.out.println(idSequence[1]);
                if (data.length > 4) {
                    String[] nomV = data[3].split("or");
                    // Group the V gene and its alleles
                    String[] geneV = nomV[0].split("\\*");
                    if (nomV[0].indexOf("(see comment)") > 0) {
                        System.out.println(geneV[0] + " " + nomV[0].substring(0, 11));
                    } else if (nomV[0].indexOf(",") > 0) {
                        String[] nomV1 = nomV[0].split(",");
                        System.out.println(geneV[0] + " " + nomV1[0]);
                    } else {
                        System.out.println(geneV[0] + " " + nomV[0]);
                    }
                    for (int i = 1; i < nomV.length - 1; i++) {
                        String[] geneVi = nomV[i].split("\\*");
                        System.out.println(geneVi[0] + " " + nomV[i]);
                    }
                    String lastOneV = nomV[nomV.length - 1];
                    String[] geneVlast = lastOneV.split("\\*");
                    if (lastOneV.indexOf("(see comment)") > 0 && nomV.length > 1) {
                        System.out.println(geneVlast[0] + " " + lastOneV.substring(0, 12));
                    } else if (nomV.length > 1) {
                        System.out.println(geneVlast[0] + " " + lastOneV);
                    }
                }
                // Comments: index 22 is read below, so the array needs at least 23 fields
                if (data.length > 22) {
                    String comments = data[22];
                    if (comments.contains("show different CDR")) {
                        System.out.println("Sequence with CDR1 and CDR2 of different lengths");
                    }
                    if (comments.contains("low V-REGION identity")) {
                        System.out.println("Sequence with an identity percentage > 85 %");
                    }
                }
            }
        }
    }
}
Now I am stuck trying to redesign the architecture of my code so that I can easily store my data in an array where I will be able to count what I listed above.
Thank you for any pointers on where I can start.
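As one possible starting point, the "V-GENE and allele" field could be split into gene/allele pairs with a regular expression instead of fixed `substring` offsets. This is only a sketch: the class name `VGeneField` and the method `parse` are hypothetical, and the pattern assumes every entry looks like `GENE*NN` (for example `IGHV4-39*07`), which is inferred from the sample lines above rather than from any format specification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class VGeneField {

    // Assumed entry shape: a gene name, a '*', then the allele number (e.g. IGHV4-39*07).
    private static final Pattern GENE_ALLELE = Pattern.compile("([A-Za-z0-9/-]+)\\*(\\d+)");

    /** Returns {gene, full allele name} pairs found in one "V-GENE and allele" field. */
    static List<String[]> parse(String field) {
        List<String[]> result = new ArrayList<>();
        Matcher m = GENE_ALLELE.matcher(field);
        while (m.find()) {
            // group(1) is the gene name, group() is the whole match, e.g. "IGHV2-5*08"
            result.add(new String[] { m.group(1), m.group() });
        }
        return result;
    }

    public static void main(String[] args) {
        String field = "IGHV2-5*08, or IGHV2-70*01 (see comment)";
        for (String[] pair : parse(field)) {
            System.out.println(pair[0] + " " + pair[1]);
        }
        // prints:
        // IGHV2-5 IGHV2-5*08
        // IGHV2-70 IGHV2-70*01
    }
}
```

Words such as "or" and "(see comment)" are skipped because they are never directly followed by `*`, so the separators do not need special handling.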