Wednesday, July 30, 2014

Longest Common Subsequence

Here is how the problem is defined. An excerpt from http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

The longest common subsequence (LCS) problem is to find the longest subsequence common to all sequences in a set of sequences (often just two). (Note that a subsequence is different from a substring, for the terms of the former need not be consecutive terms of the original sequence.) It is a classic computer science problem, the basis of file comparison programs such as diff, and has applications in bioinformatics.

Brute Force

The simple solution is to finding all the subsequences of each string, and comparing them to get to the longest common subsequence. Getting all the subsequences is nothing but getting all the subsets or combinations of characters in string - which is exponential. So, the runtime complexity is O(2^n)
 

Using Dynamic Programming

In-order to solve the solution using dynamic programming we need to be able to solve the problem by solving sub-problems and they need to be overlapped so that we don't end up solving exponential sub problems. Fortunately there is one such solution exists and here is how its defined:
 
Say, X and Y are given strings, and Xi is prefix of X (All characters in string from first 1 to i)
And LCS(Xi, Yj) is longest common subsequence length of Xi and Yj.
 
LCS(Xi, Yj) = 0 if i or j is zero (there is no common subsequence if one of the prefixes are empty)
                       LCS(Xi-1, Yj-1) + 1 if X[i] == Y[j]
                       Max(LCS(Xi-1, Yj), LCS(Xi, Yj-1) )

Here is corresponding code

 
 
 
Code Snippet
  1. public int[][] LCS_OfAllPrefixes_Length(string A, string B)
  2.         {
  3.             A.ThrowIfNullOrWhiteSpace("a");
  4.             B.ThrowIfNullOrWhiteSpace("b");
  5.             int[][] DP_LCS_AllPrefixes_Cache = new int[A.Length+1][];
  6.             for(int i = 0;i<DP_LCS_AllPrefixes_Cache.Length; i++)
  7.             {
  8.                 DP_LCS_AllPrefixes_Cache[i] = new int[B.Length + 1];
  9.             }
  10.             for (int rowIndexOfCache = 1; rowIndexOfCache <= A.Length; rowIndexOfCache++)
  11.             {
  12.                 for (int columnIndexOfCache = 1; columnIndexOfCache <= B.Length; columnIndexOfCache++)
  13.                 {
  14.                      //LCS(Ai, Bj) = 0 if i <=0, or j <= 0
  15.                      //              LCS(Ai, Bj) + 1 if Ai == Bj
  16.                      //              Max(LCS(Ai-1, Bj), LCS(Ai, Bj-1))
  17.                     if(A[rowIndexOfCache-1] == B[columnIndexOfCache-1])
  18.                     {
  19.                         DP_LCS_AllPrefixes_Cache[rowIndexOfCache][columnIndexOfCache] = DP_LCS_AllPrefixes_Cache[rowIndexOfCache - 1][columnIndexOfCache - 1] + 1;
  20.                     }
  21.                     else
  22.                     {
  23.                         DP_LCS_AllPrefixes_Cache[rowIndexOfCache][columnIndexOfCache] = Utilities.Max(DP_LCS_AllPrefixes_Cache[rowIndexOfCache - 1][columnIndexOfCache],
  24.                             DP_LCS_AllPrefixes_Cache[rowIndexOfCache][columnIndexOfCache - 1]);
  25.                     }
  26.                 }
  27.             }
  28.             return DP_LCS_AllPrefixes_Cache;
  29.         }

Find the longest common subsequences

Approach 1: essentially stores the LCSs of all the prefixes (i.e. if the strings are longer, it requires more space) You may refer to my answer@ http://stackoverflow.com/questions/7369025/all-longest-common-subsequences/25036245#25036245
 
Code Snippet
  1. class LCS_Prefix
  2.         {
  3.             public int Length = 0;
  4.             public string[] Subsequences = null;
  5.         }
  6.         LCS_Prefix[][] LCS_OfAllPrefixes_Subsequences(string A, string B)
  7.         {
  8.             A.ThrowIfNullOrWhiteSpace("a");
  9.             B.ThrowIfNullOrWhiteSpace("b");
  10.             LCS_Prefix[][] LCS_DP_OfAllPrefixes_Subsequences_Cache = new LCS_Prefix[A.Length + 1][];
  11.             for (int i = 0; i < LCS_DP_OfAllPrefixes_Subsequences_Cache.Length; i++)
  12.             {
  13.                 LCS_DP_OfAllPrefixes_Subsequences_Cache[i] = new LCS_Prefix[B.Length + 1];
  14.                 for(int j = 0; j< LCS_DP_OfAllPrefixes_Subsequences_Cache[i].Length; j++)
  15.                 {
  16.                     LCS_DP_OfAllPrefixes_Subsequences_Cache[i][j] = new LCS_Prefix();
  17.                 }
  18.             }
  19.             for (int rowIndexOfCache = 1; rowIndexOfCache <= A.Length; rowIndexOfCache++)
  20.             {
  21.                 for (int columnIndexOfCache = 1; columnIndexOfCache <= B.Length; columnIndexOfCache++)
  22.                 {
  23.                     //LCS(Ai, Bj) = 0 if i <=0, or j <= 0
  24.                     //              LCS(Ai, Bj) + 1 if Ai == Bj
  25.                     //              Max(LCS(Ai-1, Bj), LCS(Ai, Bj-1))
  26.                     LCS_Prefix lcsPrefix = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache][columnIndexOfCache];
  27.                     if (A[rowIndexOfCache - 1] == B[columnIndexOfCache - 1])
  28.                     {
  29.                         var lcs_Prefix_Diagnoal = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache - 1]
  30.                             [columnIndexOfCache - 1];
  31.                         lcsPrefix.Length = lcs_Prefix_Diagnoal.Length + 1;
  32.                         if (lcs_Prefix_Diagnoal.Subsequences == null)
  33.                         {
  34.                             lcsPrefix.Subsequences = new string[] { A[rowIndexOfCache - 1].ToString() };
  35.                         }
  36.                         else
  37.                         {
  38.                             lcsPrefix.Subsequences = lcs_Prefix_Diagnoal.Subsequences
  39.                                 .Select(s => s + A[rowIndexOfCache - 1]).ToArray();
  40.                         }
  41.                     }
  42.                     else
  43.                     {
  44.                         LCS_Prefix prefix1_Upward = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache - 1][columnIndexOfCache];
  45.                         var prefix2_Leftward = LCS_DP_OfAllPrefixes_Subsequences_Cache[rowIndexOfCache][columnIndexOfCache-1];
  46.                         if(prefix1_Upward.Length == prefix2_Leftward.Length)
  47.                         {
  48.                             Assert.IsTrue(prefix1_Upward.Length == prefix2_Leftward.Length);
  49.                             Assert.IsTrue((prefix1_Upward.Subsequences == null &&
  50.                                             prefix2_Leftward.Subsequences == null)
  51.                                         || (prefix1_Upward.Subsequences != null
  52.                                             && prefix2_Leftward.Subsequences != null));
  53.                             if (prefix1_Upward.Subsequences != null)
  54.                             {
  55.                                 Assert.IsTrue(prefix1_Upward.Subsequences.All(s1 => prefix2_Leftward.Subsequences.Any(s2 => (s2.Length == s1.Length))));
  56.                             }
  57.                             
  58.                             lcsPrefix.Length = prefix1_Upward.Length;
  59.                             if (prefix1_Upward.Subsequences != null)
  60.                             {
  61.                                 lcsPrefix.Subsequences = prefix1_Upward.Subsequences
  62.                                     .Union(prefix2_Leftward.Subsequences).ToArray();
  63.                             }
  64.                             else
  65.                             {
  66.                                 Assert.IsNull(prefix2_Leftward.Subsequences);
  67.                             }
  68.                         }
  69.                         else if(prefix1_Upward.Length > prefix2_Leftward.Length)
  70.                         {
  71.                             lcsPrefix.Length = prefix1_Upward.Length;
  72.                             lcsPrefix.Subsequences = prefix1_Upward.Subsequences;
  73.                         }
  74.                         else
  75.                         {
  76.                             lcsPrefix.Length = prefix2_Leftward.Length;
  77.                             lcsPrefix.Subsequences = prefix2_Leftward.Subsequences;
  78.                         }
  79.                     }
  80.                 }
  81.             }
  82.             return LCS_DP_OfAllPrefixes_Subsequences_Cache;
  83.         }
Approach2
Backtracking using LCS cache table of length.
 
Code Snippet
  1. string[] GetLongestCommonSubsequences(string A, string B, int[][] DP_LCS_AllPrefixes_Cache)
  2.         {
  3.             var r = this.GetLongestCommonSubsequences(A, B, A.Length, B.Length,
  4.                 DP_LCS_AllPrefixes_Cache);
  5.             return r;
  6.         }
  7.         string[] GetLongestCommonSubsequences(string A, string B, int aIndex, int bIndex,
  8.             int[][] DP_LCS_AllPrefixes_Cache)
  9.         {
  10.             if(DP_LCS_AllPrefixes_Cache[aIndex][bIndex] == 0)
  11.             {
  12.                 return null;
  13.             }
  14.             if(A[aIndex-1] == B[bIndex -1])
  15.             {
  16.                 var r = this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex - 1,
  17.                     DP_LCS_AllPrefixes_Cache);
  18.                 if(r == null)
  19.                 {
  20.                     return new string[] { A[aIndex - 1].ToString() };
  21.                 }
  22.                 return r.Select(s => s + A[aIndex - 1].ToString()).ToArray();
  23.             }
  24.             int lcs_up_direction = DP_LCS_AllPrefixes_Cache[aIndex - 1][bIndex];
  25.             int lcs_left_direction = DP_LCS_AllPrefixes_Cache[aIndex][bIndex-1];
  26.             if(lcs_up_direction == lcs_left_direction)
  27.             {
  28.                 string[] lcs_up = this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex,
  29.                     DP_LCS_AllPrefixes_Cache);
  30.                 string[] lcs_left = this.GetLongestCommonSubsequences(A, B, aIndex, bIndex-1,
  31.                     DP_LCS_AllPrefixes_Cache);
  32.                 return lcs_up.Union(lcs_left).ToArray();
  33.             }
  34.             if(lcs_up_direction > lcs_left_direction)
  35.             {
  36.                 return this.GetLongestCommonSubsequences(A, B, aIndex - 1, bIndex,
  37.                     DP_LCS_AllPrefixes_Cache);
  38.             }
  39.             return this.GetLongestCommonSubsequences(A, B, aIndex, bIndex - 1, DP_LCS_AllPrefixes_Cache);
  40.         }

Finding the diffs

Basically difference of two strings is nothing but remaining characters. So, one can find the diff by finding longest common subsequences and comparing each character, or using below backtracking algorithm. For ex, for GAC and AGCAT, it returns => { { "[G][A]C", "[G]A[C]", "G[A][C]" }, {"A[G]C[A]T", "A[G][C]AT", "[A]G[C]AT"} where GA, GC and AC are longest common subsequences...
 
Code Snippet
  1. string[][] GetDiffs(string A, string B, int aIndex, int bIndex,
  2.             int[][] DP_LCS_AllPrefixes_Cache)
  3.         {
  4.             if((aIndex == 0) && (bIndex ==0))
  5.             {
  6.                 return null;
  7.             }
  8.             if (DP_LCS_AllPrefixes_Cache[aIndex][bIndex] == 0)
  9.             {
  10.                 var r = new string[2][];
  11.                 r[0] = new string[] { A.Substring(0, aIndex) };
  12.                 r[1] = new string[] { B.Substring(0, bIndex) };
  13.                 return r;
  14.             }
  15.             if (A[aIndex - 1] == B[bIndex - 1])
  16.             {
  17.                 var r = this.GetDiffs(A, B, aIndex - 1, bIndex - 1,
  18.                     DP_LCS_AllPrefixes_Cache);
  19.                 string ch = string.Format("[{0}]", A[aIndex - 1]);
  20.                 if (r == null)
  21.                 {
  22.                     r = new string[2][];
  23.                     r[0] = new string[] { ch };
  24.                     r[1] = new string[] { ch };
  25.                 }
  26.                 else
  27.                 {
  28.                     r[0] = r[0].Select(s => s + ch).ToArray();
  29.                     r[1] = r[1].Select(s => s + ch).ToArray();
  30.                 }
  31.                 return r;
  32.             }
  33.             int lcs_up_direction = DP_LCS_AllPrefixes_Cache[aIndex - 1][bIndex];
  34.             int lcs_left_direction = DP_LCS_AllPrefixes_Cache[aIndex][bIndex - 1];
  35.             string[][] lcs_up = null, lcs_left = null;
  36.             if (lcs_up_direction == lcs_left_direction)
  37.             {
  38.                 lcs_up = this.GetDiffs(A, B, aIndex - 1, bIndex,
  39.                     DP_LCS_AllPrefixes_Cache);
  40.                 lcs_left = this.GetDiffs(A, B, aIndex, bIndex - 1,
  41.                     DP_LCS_AllPrefixes_Cache);
  42.             }
  43.             else if (lcs_up_direction > lcs_left_direction)
  44.             {
  45.                 lcs_up = this.GetDiffs(A, B, aIndex - 1, bIndex,
  46.                     DP_LCS_AllPrefixes_Cache);
  47.             }
  48.             else
  49.             {
  50.                 lcs_left = this.GetDiffs(A, B, aIndex, bIndex - 1, DP_LCS_AllPrefixes_Cache);
  51.             }
  52.             char a = A[aIndex - 1], b = B[bIndex - 1];
  53.             string[][] rl = new string[2][];
  54.             rl[0] = new string[0];
  55.             rl[1] = new string[0];
  56.             if(lcs_up != null)
  57.             {
  58.                 //we moved upward, that is we accepted that they differ with 'A' at aIndex-1 (a)
  59.                 rl[0] = lcs_up[0].Select(s => s + a.ToString()).ToArray();
  60.                 rl[1] = lcs_up[1];
  61.             }
  62.             if (lcs_left != null)
  63.             {
  64.                 //we moved left, that is we accepted that they differ with 'B' at bIndex-1 (b)
  65.                 rl[0] = rl[0].Union(lcs_left[0]).ToArray(); ;
  66.                 rl[1] = rl[1].Union(lcs_left[1].Select(s => s + b.ToString())).ToArray();
  67.             }
  68.             return rl.ToArray();
  69.         }
  70.         string[][] GetDiffs(string A, string B, int[][] DP_LCS_AllPrefixes_Cache)
  71.         {
  72.             var r = this.GetDiffs(A, B, A.Length, B.Length,
  73.                 DP_LCS_AllPrefixes_Cache);
  74.             return r;
  75.         }

Optimizations

Reducing Problem set  If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.
 
Reducing comparison time Instead of strings one can use hashes of strings (especially w.r.t source code diff.)

Unit Tests

 
Code Snippet
  1. [TestCategory(Constants.DynamicProgramming)]
  2.         [TestMethod]
  3.         public void LCS_Tests()
  4.         {
  5.             string A = "GAC", B = "AGCAT";
  6.             var DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
  7.             Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 2);
  8.             var lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
  9.             Assert.IsNotNull(lcs_sequences);
  10.             var diffs = this.GetDiffs(A, B, DP_LCS_AllPrefixes_Cache);
  11.             Assert.IsNotNull(diffs);
  12.             Assert.IsTrue(diffs.Length == 2);
  13.             Assert.IsTrue(diffs[0].Length == lcs_sequences.Length);
  14.             Assert.IsTrue(diffs[1].Length == lcs_sequences.Length);
  15.             Assert.IsTrue(lcs_sequences.Any(s => "AC".Equals(s)));
  16.             Assert.IsTrue(lcs_sequences.Any(s => "GC".Equals(s)));
  17.             Assert.IsTrue(lcs_sequences.Any(s => "GA".Equals(s)));
  18.             var DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
  19.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 2);
  20.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
  21.                .Any(s => "AC".Equals(s)));
  22.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
  23.                 .Any(s => "GC".Equals(s)));
  24.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
  25.                 .Any(s => "GA".Equals(s)));
  26.             A = "ABCDGH"; B = "AEDFHR";
  27.             DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
  28.             Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 3);
  29.             lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
  30.             Assert.IsNotNull(lcs_sequences);
  31.             diffs = this.GetDiffs(A, B, DP_LCS_AllPrefixes_Cache);
  32.             Assert.IsNotNull(diffs);
  33.             Assert.IsTrue(diffs.Length == 2);
  34.             Assert.IsTrue(diffs[0].Length == lcs_sequences.Length);
  35.             Assert.IsTrue(diffs[1].Length == lcs_sequences.Length);
  36.             Assert.IsTrue(lcs_sequences.Any(s => "ADH".Equals(s)));
  37.             DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
  38.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 3);
  39.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
  40.                 .Any(s => "ADH".Equals(s)));
  41.             A = "AGGTAB"; B = "GXTXAYB";
  42.             DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
  43.             Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 4);
  44.             lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
  45.             Assert.IsNotNull(lcs_sequences);
  46.             diffs = this.GetDiffs(A, B, DP_LCS_AllPrefixes_Cache);
  47.             Assert.IsNotNull(diffs);
  48.             Assert.IsTrue(diffs.Length == 2);
  49.             Assert.IsTrue(diffs[0].Length == 2);
  50.             Assert.IsTrue(diffs[1].Length == lcs_sequences.Length);
  51.             Assert.IsTrue(lcs_sequences.Any(s => "GTAB".Equals(s)));
  52.             DP_LCS_AllPrefixes_Subsequences_Cache = this.LCS_OfAllPrefixes_Subsequences(A, B);
  53.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Length == 4);
  54.             Assert.IsTrue(DP_LCS_AllPrefixes_Subsequences_Cache[A.Length][B.Length].Subsequences
  55.                 .Any(s => "GTAB".Equals(s)));
  56.             A = "ABCDEF"; B = "UVWXYZ";
  57.             DP_LCS_AllPrefixes_Cache = this.LCS_OfAllPrefixes_Length(A, B);
  58.             Assert.IsTrue(DP_LCS_AllPrefixes_Cache[A.Length][B.Length] == 0);
  59.             lcs_sequences = this.GetLongestCommonSubsequences(A, B, DP_LCS_AllPrefixes_Cache);
  60.             diffs = this.GetDiffs(A, B, DP_LCS_AllPrefixes_Cache);
  61.             Assert.IsNotNull(diffs);
  62.             Assert.IsTrue(diffs.Length == 2);
  63.             Assert.IsTrue(diffs[0].Length == 1);
  64.             Assert.IsTrue(diffs[1].Length == 1);
  65.         }

No comments:

Post a Comment