In next several blogs, I will write things about my Ph.D. work. The first topic today is highly related to our work Vulcan, which utlizes 2 different kinds of gap penalties for better long-read alignments.
Different regions in human genome may have different mutation rates, here are some examples:

We all know that long-read sequencers suffers from sequening errors, and those sequencing errors may have different effects on alignment accuracy when we are trying to align reads to these differnet regions. In this case, using the same gap penalty maybe not sufficient for all those different regions. For example, in this case, 2 different gap penalties made different choices in EEIG2 gene. image Gap penalties mainly alter the pairwise alignment behaviors in the extension stage during long-read alignment process. The state-of-the-art long-read aligners use either affine gap penalty (minimap2) or convex gap penalty (NGMLR).
What is the difference between these two penalties?

  • Affine gap penalty calculates alignment scores as \(G(i) = g_o + g_E * i\)
  • Convex gap penalty calculates alignment scores as \(G(i)=\left\{\begin{matrix}g_o,i=0 \\ G(i-1)+min\left\{\begin{matrix}g_M \\ g_E+g_D*(i-1) \end{matrix},i>0\right. \end{matrix}\right.\)

In here, \(g_o\) : gap opening penalty; \(g_E\) : gap extension penalty; \(g_M\) : gap matching score; \(i\) : current length of the gap; \(g_D\) : gap decay parameter between 0 and 1; \(m\) and \(n\) are the length of two sequences. We can clearly see that the time complexity of affine gap penalty is \(O(mn)\) while the one for convex gap penalty is \(O(mnlog(m+n))\) since it need to remember the scores from previous gaps.
Due to the gap decay parameter, convex gap penalty funcitons always favor larger gaps than smaller ones. In NGMLR paper they claims this kind gap penalty helps long-read aligner to become more robost against sequencing errors. However, is this entirely true? We think when we are aligning reads to the regions where the sequencing errors are dominant “variants” against the reference, affine gap penalty can actually work better, and in these regions the edit distances between reads and the reference are usually low since there are fewer variations. Based on that, we can filter the read alignment that have high edit distance from minimap2 and put those into NGMLR for re-alignment.
Based on this motivation, we developed Vulcan: image