1. Member
Join Date
Feb 2011
Posts
78
Rep Power
0

## Longest common sequence

I have an assignment in my data structures class to find the longest common sequence of letters of 2 strings. It does not need to be done programatically, but rather with a table. For example with the 2 strings 'MZJAWXU' and XMJYAUZ here is the table. I am not following the logic. Any help appreciated. I looked at the wikipedia page and its still not clicking.

| 0 1 2 3 4 5 6 7
| M Z J A W X U
-----|-----------------
0 | 0 0 0 0 0 0 0 0
1 X | 0 0 0 0 0 0 1 1
2 M | 0 1 1 1 1 1 1 1
3 J | 0 1 1 2 2 2 2 2
4 Y | 0 1 1 2 2 2 2 2
5 A | 0 1 1 2 3 3 3 3
6 U | 0 1 1 2 3 3 3 4
7 Z | 0 1 2 2 3 3 3 4

2. Would you mind posting the assignment details verbatim?

3. Member
Join Date
Feb 2011
Posts
78
Rep Power
0
Find longest common sequence of letters from the 2 strings:
X = "skullandbones" Y = "lullabybabies"

Use the matrix as done in class (referring to a matrix similar to the one found here: Longest common subsequence problem - Wikipedia, the free encyclopedia)

The 'Worked Example' section on that page shows the way to do it, I just dont follow it.

4. i Think i get it...

difference between subsequence and substring,

the longest common substring would be 'ulla',

but the longest common subsequence would be 'ullabes'

the matrix is just a technique to find the longest subsequence progressively

and each step in building the matrix is just a break-down of comparisons of subsequences

1. If we have a sequence called S which contained AGCA, all the possible subsequences would be:
S1 = A
S2 = AG
S3 = AGC
S4 = AGCA

2. When finding the Longest Common Subsequence (LCS) of Two sequences,
to simplify our calculations, we first check the END of the sequence and remove any common last elements.

The example given is two sequences: BANANA and ATANA.
Why bother going through the whole sequence when you can see they both end with 'ANA'. The idea is to save your calculation and confusions, so remove ANA from both sequences and you can find the LCS of the shorter sequences (which would be BAN and AT).

the LCS of BAN and AT is just 'A', so append the last elements you removed and you get the final LCS of A+ANA = 'AANA'.

3. When you get two sequences that do not end in the same symbol,
then you'd never need the last element from one of them.

ATK and BTG don't have a common last element, so the LCS could never include those last elements (K or G). In that case, you're trying to find the LCS of two sequences: ATK and BT - or - the LCS of AT and BTG

if ATK is X and BTG is Y, then you're looking for LCS(Xn,Ym-1) or LCS(Xn-1,Ym) in this case.

4. Make the Matrix Table (one sequence as row header, the other sequence as column header), and for each co-ordinate make comparisons between the two sequences.

5. The two sequences you were given both end in 'ES', so you can remove those from your table, and append 'ES' to your LCS later
Last edited by ozzyman; 04-26-2011 at 05:46 PM.

5. Here's the table as we start off

6. The first comparison is S and L. Not the same, so in this case the longer sequence is taken,

but they're both empty, so 0 is given instead, with arrows pointing to both.

7. The same goes for the next two comparisons; L against K and L against U, they're not the same and neither direction contains a longer subsequence since they are empty

See picture

8. Finally we have a few matches, L and L. In the first case, since there is no longer subsequence, we have to use a diagonal arrow pointing to an empty cell. The resulting subsequence is (0L) = (L).

In the second case, the sequence to the left is longer (1 symbol) than the sequence to the top (empty), so we use an arrow pointing to the longer sequence (left).

See 1st pic.

Now we have a longest common subsequence of LL so far.

Continue this work for the rest of the row and notice that the remaining cells always point Left because the sequence there is longer (LL) against (0).

See 2nd pic.

By the end of the 1st row, we have the longest subsequence of (LL)+(ES) = (LLES)
Last edited by ozzyman; 04-26-2011 at 06:24 PM.

9. ## Why the table?

To be honest I found the table very confusing.
Maybe someone can explain to me why it would be needed, or is it to explain the process? It all works fine with recursion:

Java Code:
```package lcs;
public class LCS_maker{
private String DNA1 = "";
private String DNA2 = "";

public LCS_maker( String dna1, String dna2 ){
DNA1 = dna1;
DNA2 = dna2;
printLCS();
}
public static void main( String[] args ){
new LCS_maker( "skullandbones", "lullabybabies" );
//		new LCS_maker( "AGCAT", "GAC" );
}
private void printLCS(){
int len1 = DNA1.length();
int len2 = DNA2.length();
String lcs = LCS( len1, len2 );
System.out.println( "LCS_maker: \"" + lcs + "\"" );
}
//Don't use Strings, to save space
private String LCS( int len1, int len2 ){
char last1, last2;
String lcs1, lcs2;
int lenLcs1, lenLcs2;
if( len1 <= 0 || len2 <= 0 )return "";
last1 = DNA1.charAt( len1-1 );
last2 = DNA2.charAt( len2-1 );
if ( last1 == last2 ){
return LCS( len1-1, len2-1 ) + last1;
}else{
lcs1 = LCS( len1, len2-1 );
lcs2 = LCS( len1-1, len2 );
lenLcs1 = lcs1.length();
lenLcs2 = lcs2.length();
return lenLcs1 > lenLcs2 ? lcs1 : lcs2;//Could be throwing some solutions away here
}
}
}```
(By the way: thanks for the link. I first did DNA-alignment by an other recursive algorithm, repeatedly chopping it in half, searching and extending.
It worked fine, but this is much more elegant!)
Last edited by Jodokus; 04-29-2011 at 02:02 AM. Reason: spelling

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•