# Thread: Statistical calculations in a MultiMap

1. Member
Join Date
Jul 2010
Posts
2
Rep Power
0

## Statistical calculations in a MultiMap

Hi everyone,

I'm new to Java and I'm having some problems with a MultiMap.
I've got a text file that I parse to collect some data. From every line, I collect the sequence ID, the gene name with its corresponding alleles and, optionally, the comments about the sequence (if there are any).

The aim of my work is first to group the alleles by their corresponding genes (which is easy with the MultiMap, by passing it an ArrayList that contains the list of all the alleles).

Here is an example of my text file (I simplified it so it's easier to understand):

Java Code:
Séquence 1 Gène A    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 1 Gène A    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 2 Gène B    Allèle 1, Allèle 2, Allèle 3    None
Séquence 3 Gène C    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 4 Gène D    Allèle 1, Allèle 2, Allèle 3    Comments
Séquence 5 Gène A    Allèle 1, Allèle 5, Allèle 6    Comments
Séquence 6 Gène E    Allèle 1, Allèle 2, Allèle 3    None
Séquence 7 Gène C    Allèle 4, Allèle 5, Allèle 6    Comments
So I need to get something like this:

Java Code:
Gène A =[Allèle 1, Allèle 2, Allèle 3, Allèle 5, Allèle 6]
Gène C =[Allèle 1, Allèle 2, Allèle 3, Allèle 4, Allèle 5, Allèle 6]
etc
which I was able to do.
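
A minimal sketch of this grouping step, assuming plain `java.util` collections instead of a MultiMap library. The class name `AlleleGrouping` and its methods are hypothetical, and the keys are written without accents:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class AlleleGrouping {
    // gene -> sorted, duplicate-free set of its alleles
    private final Map<String, Set<String>> allelesByGene = new TreeMap<>();

    /** Adds the alleles found on one line to the set kept for the gene. */
    void add(String gene, String... alleles) {
        Set<String> set = allelesByGene.computeIfAbsent(gene, g -> new TreeSet<>());
        for (String allele : alleles) {
            set.add(allele.trim());
        }
    }

    Map<String, Set<String>> view() {
        return allelesByGene;
    }

    public static void main(String[] args) {
        AlleleGrouping grouping = new AlleleGrouping();
        grouping.add("Gene A", "Allele 1", "Allele 2", "Allele 3");
        grouping.add("Gene A", "Allele 1", "Allele 5", "Allele 6");
        grouping.add("Gene C", "Allele 4", "Allele 5", "Allele 6");
        // Prints one "gene =[alleles]" line per gene.
        grouping.view().forEach((gene, alleles) -> System.out.println(gene + " =" + alleles));
    }
}
```

The `TreeSet` sorts names as strings, so "Allele 10" would come before "Allele 2"; a custom comparator would be needed for numeric order.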

The problem starts here. For every distinct allele, I need to calculate:
- the total number of sequences in which the allele appears
- the number of redundant sequences
- the number of non-redundant sequences
- the number of sequences that contain a comment

To sum up, when I have finished reading the file, I need to be able to say, for every allele, how many sequences are associated with it and, among those sequences, how many are redundant, how many are not, and how many contain a comment.
All this while keeping the order defined first, that is, with the alleles grouped by their corresponding genes.

For example, for allele 1 of gene A, I need to get output like this:

Java Code:
Gène A
Allèle 1: Total sequences : 3
Redundant sequences : 1
Non-redundant sequences : 1
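
One way to keep these counters is sketched below. It assumes a reading of the example above that the post does not state explicitly: "total" counts rows, "redundant" counts sequence IDs that occur more than once for the allele, and "non-redundant" counts sequence IDs that occur exactly once. The names `AlleleStats` and `Counts` are hypothetical, and the keys are written without accents.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class AlleleStats {

    /** Counters for one allele of one gene. */
    static final class Counts {
        int total;        // rows in which the allele appears
        int withComment;  // rows that carry a comment
        // sequence ID -> number of rows with that ID
        private final Map<String, Integer> rowsPerSequence = new HashMap<>();

        void record(String sequenceId, boolean hasComment) {
            total++;
            if (hasComment) {
                withComment++;
            }
            rowsPerSequence.merge(sequenceId, 1, Integer::sum);
        }

        /** Assumption: distinct sequence IDs seen more than once. */
        long redundant() {
            return rowsPerSequence.values().stream().filter(n -> n > 1).count();
        }

        /** Assumption: distinct sequence IDs seen exactly once. */
        long nonRedundant() {
            return rowsPerSequence.values().stream().filter(n -> n == 1).count();
        }
    }

    // gene -> allele -> counters; TreeMap keeps genes and alleles sorted
    private final Map<String, Map<String, Counts>> byGene = new TreeMap<>();

    void record(String gene, String allele, String sequenceId, boolean hasComment) {
        byGene.computeIfAbsent(gene, g -> new TreeMap<>())
              .computeIfAbsent(allele, a -> new Counts())
              .record(sequenceId, hasComment);
    }

    /** Returns the counters for the pair, or null if it was never recorded. */
    Counts counts(String gene, String allele) {
        Map<String, Counts> alleles = byGene.get(gene);
        return alleles == null ? null : alleles.get(allele);
    }

    public static void main(String[] args) {
        AlleleStats stats = new AlleleStats();
        // Gene A rows of the simplified file: Sequence 1 appears twice, Sequence 5 once.
        stats.record("Gene A", "Allele 1", "Sequence 1", true);
        stats.record("Gene A", "Allele 1", "Sequence 1", true);
        stats.record("Gene A", "Allele 1", "Sequence 5", true);
        Counts c = stats.counts("Gene A", "Allele 1");
        System.out.println("Gene A / Allele 1: total=" + c.total
                + ", redundant=" + c.redundant()
                + ", non-redundant=" + c.nonRedundant()
                + ", with comment=" + c.withComment);
    }
}
```

The nested `TreeMap` keeps the alleles grouped under their gene, so printing the outer map gene by gene preserves the ordering asked for. If "redundant" should instead mean the number of extra duplicate rows, only the two counting methods would change.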

What solutions would you suggest for my problem?

Any help would be really appreciated.

PS: sorry for my English, I'm French :o

2. This sounds like a program design problem, not a Java programming problem.
Your program has to read in some data, search through it and organize it.
Do you have a design that you are having trouble writing/implementing in Java?
If so, please explain what the coding problem is.

3. Member

I do think you are right. I tried to create a superclass holding all the methods I will need for parsing the text file, to make it easier to process, but as I am new, it's a bit hard for me to do.

Here is my file (only a part, because it's really long):

Java Code:
Sequence number	Sequence ID	Functionality	V-GENE and allele	V-REGION score	V-REGION identity %	V-REGION identity nt	J-GENE and allele	J-REGION score	J-REGION identity %	J-REGION identity nt	D-GENE and allele	D-REGION reading frame	CDR1-IMGT length	CDR2-IMGT length	CDR3-IMGT length	CDR-IMGT lengths	FR-IMGT lengths	AA JUNCTION	JUNCTION frame	Orientation	Functionality comment	V-REGION potential ins/del	J-GENE and allele comment	Sequence
1	imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u	productive	IGHV4-39*07, or IGHV4-4*07 or IGHV4-59*04 or IGHV4-59*05 or IGHV4-b*02 (see comment)	619	68.77	196/285 nt	IGHJ4*03	195	89.58	43/48 nt	IGHD3-22*01	2	8	8	13	8.8.13	[25.17.38.11]	CARYDYYGSSYFDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ; 10 AA in  IGHV4-39*07), different CDR2-IMGT amino acid lengths (8 AA in imgtligm_A03900_A03900_H.sapiens_HuV(NP)_gene____u ;  7 AA in  IGHV4-39*07), and low V-REGION identity (68.77% )		atgcaaatcctctgaatctacatggtaaatataggtttgtctataccacaaacagaaaaacatgagatcacagttctctctacagttactgagcacacaggacctcaccatgggatggagctgtatcatcctcttcttggtagcaacagctacaggtaaggggctcacagtagcaggcttgaggtctggacatatatatgggtgacaatgacatccactttgcctttctctccacaggtgtccactcccaggtccaactgcaggagagcggtccaggtcttgtgagacctagccagaccctgagcctgacctgcaccgtgtctggcagcaccttcagcagctactggatgcactgggtgagacagccacctggacgaggtcttgagtggattggaaggattgatcctaatagtggtggtactaagtacaatgagaagttcaagagcagagtgacaatgctggtagacaccagcaagaaccagttcagcctgagactcagcagcgtgacagccgccgacaccgcggtctattattgtgcaagatacgattactacggtagtagctactttgactactggggtcaaggcagcctcgtcacagtctcctcaggt
2	imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var	productive	IGHV2-5*08, or IGHV2-70*01 (see comment)	628	69.12	197/285 nt	IGHJ4*01	136	76.6	36/47 nt	IGHD3-10*01	2	8	7	10	8.7.10	[25.17.38.11]	CARERDYRLDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A03907_A03907_H.sapiens_antibody_D1.3_var ; 10 AA in  IGHV2-5*08), and low V-REGION identity (69.12% )		tcagagcatggctgtcctggcattactcttctgcctggtaacattcccaagctgtatcctttcccaggtgcagctgaaggagtcaggacctggcctggtggcgccctcacagagcctgtccatcacatgcaccgtctcagggttctcattaaccggctatggtgtaaactgggttcgccagcctccaggaaagggtctggagtggctgggaatgatttggggtgatggaaacacagactataattcagctctcaaatccagactgagcatcagcaaggacaactccaagagccaagttttcttaaaaatgaacagtctgcacactgatgacacagccaggtactactgtgccagagagagagattataggcttgactactggggccaaggcaccactctcacagtctcctca
3	imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne	productive	IGHV2-5*08 (see comment)	524	65.14	185/284 nt	IGHJ4*01	159	81.25	39/48 nt	IGHD2-15*01	2	8	7	11	8.7.11	[24.17.38.11]	CARNYWGTSMDYW	in-frame	+		The submitted sequence and the closest germline V-GENE allele show different CDR1-IMGT amino acid lengths (8 AA in imgtligm_A18395_A18395_Human_uPA_cDNA____unassigne ; 10 AA in  IGHV2-5*08), and low V-REGION identity (65.14% )		ctgcaggaatgaagcagtcaggacctggcctagtgcagccctcacagagcctgtccatcacctgcacagtctctggtttctcattaactacctatggtgtacactggattcgccagtctccaggaaagggtctggagtggctgggagtgatatggagtggtggaagcacagactataatgcagctttcatatccagactgagcatcaacaaggacaattccaagagccaagttttctttaaaatgaacagtctgcaagctaatgacacagccatatattactgtgccagaaattattggggaacctctatggactactggggtcaaggaacctcagtcaccgtctcctcagccaaaacgacacccccatctgtctatccactggaattcgatatcaagctt
Here is my class that parses the whole text file and cleans the data (only a part):

Java Code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

public class Parser
{
    // A data row starts with a sequence number, a tab, then an IMGT sequence ID.
    private static final Pattern DATA_ROW = Pattern.compile("[0-9]+\timgtligm_(.*)");

    public static void main(String[] args) throws IOException
    {
        // Assumption: the path of the input file is the first command-line argument
        // (the excerpt does not show how br is opened).
        try (BufferedReader br = new BufferedReader(new FileReader(args[0])))
        {
            String ligne;
            while ((ligne = br.readLine()) != null)
            {
                // Skip the header line and anything that is not a data row.
                if (!DATA_ROW.matcher(ligne).find())
                {
                    continue;
                }

                String[] data = ligne.trim().split("\\t");

                // Sequence ID: second "_"-separated token of column 1
                String[] idSequence = data[1].split("_");
                System.out.println(idSequence[1]);

                if (data.length > 4)
                {
                    // Column 3 lists the candidate V alleles separated by "or".
                    for (String nomV : data[3].split("or"))
                    {
                        String allele = cleanAllele(nomV);
                        if (allele.isEmpty())
                        {
                            continue;
                        }
                        // Group the V gene with its allele: the gene is the part before '*'.
                        String geneV = allele.split("\\*")[0];
                        System.out.println(geneV + "    " + allele);
                    }
                }

                if (data.length > 20)
                {
                    // Assumption: the excerpt never defines "comments"; here the
                    // comment columns (indices 21 to 23 of the header) are joined.
                    StringBuilder sb = new StringBuilder();
                    for (int i = 21; i < data.length && i <= 23; i++)
                    {
                        sb.append(data[i]).append(' ');
                    }
                    String comments = sb.toString();

                    if (comments.contains("show different CDR"))
                    {
                        System.out.println("Sequence with CDR1 and CDR2 of different lengths");
                    }

                    // No low-identity warning: identity is presumably above the threshold.
                    if (!comments.contains("low V-REGION identity"))
                    {
                        System.out.println("Sequence with an identity percentage > 85 %");
                    }
                }
            }
        }
    }

    // Strips the "(see comment)" marker, a trailing comma and surrounding spaces.
    private static String cleanAllele(String raw)
    {
        String s = raw.replace("(see comment)", "").trim();
        return s.endsWith(",") ? s.substring(0, s.length() - 1).trim() : s;
    }
}
Now I am stuck trying to redesign the architecture of my code so that I can easily store my data in an array where I will be able to count what I listed above.

Thank you for any pointers on where I can start.
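
One possible starting point for such a redesign, sketched under assumptions, is to separate parsing from printing. The parser returns gene/allele pairs that can then be stored and counted. The class `VGeneField`, its nested `GeneAllele` type and the splitting rule are hypothetical and based only on the three sample rows shown above:

```java
import java.util.ArrayList;
import java.util.List;

class VGeneField {

    /** One gene/allele pair, e.g. gene "IGHV4-39" and allele "IGHV4-39*07". */
    static final class GeneAllele {
        final String gene;
        final String allele;

        GeneAllele(String gene, String allele) {
            this.gene = gene;
            this.allele = allele;
        }
    }

    /** Splits a "V-GENE and allele" column value into its gene/allele pairs. */
    static List<GeneAllele> parse(String field) {
        List<GeneAllele> result = new ArrayList<>();
        // Drop the trailing marker, then split on "or" with an optional leading comma.
        String cleaned = field.replace("(see comment)", "");
        for (String part : cleaned.split(",?\\s+or\\s+")) {
            String allele = part.trim();
            if (allele.isEmpty()) {
                continue;
            }
            // The gene name is everything before the '*' that introduces the allele number.
            int star = allele.indexOf('*');
            String gene = star >= 0 ? allele.substring(0, star) : allele;
            result.add(new GeneAllele(gene, allele));
        }
        return result;
    }

    public static void main(String[] args) {
        String field = "IGHV2-5*08, or IGHV2-70*01 (see comment)";
        for (GeneAllele ga : parse(field)) {
            System.out.println(ga.gene + "    " + ga.allele);
        }
    }
}
```

Each returned pair, together with the sequence ID and a flag saying whether the row has a comment, is all that the counting step needs, so the file-reading loop no longer has to print anything itself.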
