• FAQ: Huge Gaps / Blank Areas in the UI Borders

    1
    3 Votes
    1 Posts
    2k Views
    No one has replied
  • FAQ: Search and Replace Across Files

    1
    3 Votes
    1 Posts
    3k Views
    No one has replied
  • FAQ: Function List Basics

    3
    4 Votes
    3 Posts
    3k Views
    PeterJonesP

    For the interested user, @MAPJe71 has a collection of Notepad++ Function List definitions (and other useful files) for a wide variety of programming languages:

    https://github.com/MAPJe71/Languages
  • FAQ: Regex "Backtracking Control Verbs"

    Moved
    6
    2 Votes
    6 Posts
    3k Views
    guy038G

    https://community.notepad-plus-plus.org/post/55641

    Hello, All,

    During my study of the backtracking control verbs ( see posts above ), I noticed this article which described ways to pre-define some subroutines, which can be used to build numerous modular patterns, at the same time :

    https://www.rexegg.com/regex-tricks.html#pseudo-define

    Note that the special conditional (?(DEFINE).....) structure was originally created for this purpose and is supported by modern regex engines, including our Boost regex engine ! I already discussed this syntax in the post below :

    https://community.notepad-plus-plus.org/post/52608

    So, for instance, with the free-spacing mode, the regex below :

    (?x-i)
    (?(DEFINE)            # START of the conditional DEFINE structure
      ([A-Z]{2}\d)        # TWO CAPITAL letters, followed with a DIGIT in Group 1
    )                     # END of the conditional DEFINE structure
    (?1)-(?1)-(?1)        # THREE "Group 1" triplets, separated with DASHES

    Against the string YH5-RC6-UY0-BD5-AZ3-KL9, it matches the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9

    Now, on the RexEgg site, we are told that, instead of the (?(DEFINE).....) conditional structure, we can use the (*FAIL) backtracking control verb to get a similar behavior, according to the syntax below, using an optional non-capturing group :
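    To check the expected result outside Notepad++, here is a minimal Python sketch. It is an approximation and not the regex above: Python's built-in re module has no (?(DEFINE)...) structure and no (?1) subroutine call, so the triplet pattern is simply repeated by string composition. The names TRIPLET and THREE_TRIPLETS are made up for this illustration.

```python
import re

# Assumption: no (?1) subroutine call in Python's re module, so the
# "Group 1" pattern is reused by plain string composition instead.
TRIPLET = r"[A-Z]{2}\d"                    # two capital letters, then a digit
THREE_TRIPLETS = "-".join([TRIPLET] * 3)   # triplet-triplet-triplet

subject = "YH5-RC6-UY0-BD5-AZ3-KL9"
matches = re.findall(THREE_TRIPLETS, subject)
print(matches)   # ['YH5-RC6-UY0', 'BD5-AZ3-KL9']
```

    The search is case-sensitive by default, which corresponds to the -i part of the (?x-i) modifiers.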

    (?x-i)
    (?:                   # START of an OPTIONAL NON-CAPTURING group
      ([A-Z]{2}\d)        # TWO CAPITAL letters, followed with a DIGIT in Group 1
      (*F)                # Backtracking Control verb (*FAIL)
    )?                    # END of the OPTIONAL NON-CAPTURING group
    (?1)-(?1)-(?1)        # THREE "Group 1" triplets, separated with DASHES

    Note that, instead of the (*F) verb, we may also use the empty negative look-ahead (?!) for identical results

    Now, I’ve found a simpler syntax :

    (?x-i)
    ([A-Z]{2}\d)          # TWO CAPITAL letters, followed with a DIGIT in Group 1
    (*F)                  # Backtracking Control verb (*FAIL)
    |                     # OR
    (?1)-(?1)-(?1)        # THREE "Group 1" triplets, separated with DASHES

    How does this regex syntax work against the string YH5-RC6-UY0-BD5-AZ3-KL9 ?

    First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring YH5.

    At the same time, it stores the pattern [A-Z]{2}\d in group 1

    Then the regex engine meets the backtracking control verb (*F), which forces it to backtrack

    So the current match YH5 is discarded and the starting position remains right before the first letter Y

    Then the regex engine tries the other alternative (?1)-(?1)-(?1) and matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9, as expected !
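    The steps above can be imitated in Python, as a rough sketch only. Python's built-in re module has neither (*F) nor (?1), so this example makes two substitutions: the empty negative look-ahead (?!) plays the role of (*F), and the triplet pattern is spliced in textually instead of being called as a subroutine. The variable names are hypothetical.

```python
import re

TRIPLET = r"[A-Z]{2}\d"
THREE = "-".join([TRIPLET] * 3)

# First branch: match one triplet, then hit (?!), an empty negative
# look-ahead that can never succeed, so this branch always fails.
# Second branch: the pattern we actually want to find.
with_failing_branch = re.compile(rf"(?:{TRIPLET})(?!)|{THREE}")
plain = re.compile(THREE)

subject = "YH5-RC6-UY0-BD5-AZ3-KL9"
spans_a = [m.span() for m in with_failing_branch.finditer(subject)]
spans_b = [m.span() for m in plain.finditer(subject)]
print(spans_a == spans_b, spans_a)
```

    Both patterns report the same spans: the failing first branch costs a little backtracking at each position but never contributes a match.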

    Note that the effect of the (*FAIL) verb is exactly the same as if the regex engine tried to match a specific pattern that does not exist in the subject string ! So, let's simply replace this verb with, for instance, the CURRENCY sign ¤, of Unicode code point \x{00A4}, which is generally not used in files !

    You can type it with the Windows input method : hold the Alt key and type the number 0164 on the numeric keypad

    So we get the regex :

    (?x-i)
    ([A-Z]{2}\d)          # TWO CAPITAL letters, followed with a DIGIT in Group 1
    ¤                     # The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! )
    |                     # OR
    (?1)-(?1)-(?1)        # THREE "Group 1" triplets, separated with DASHES

    Why my syntax also works :

    First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring YH5.

    At the same time, it stores the pattern [A-Z]{2}\d in group 1

    But it cannot match the following currency sign ¤, so the regex engine naturally backtracks

    The starting position remains right before the first letter Y

    And the regex engine tries the other alternative (?1)-(?1)-(?1), which matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9

    Note that the regex engine does remember the pattern contained in group 1 because, despite the backtracking phase, the search process still goes on !

    Of course, instead of the single character ¤, you can also choose a short string which does not exist in the current file. For instance, /// or @@ or _X_

    It’s also important to point out that you can use named group(s) with all these syntaxes. For instance, my last regex syntax can be rewritten :
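    Since the trick only works when the marker really is absent from the text, a small helper can pick one automatically. This is a sketch under assumptions: pick_absent_marker is a hypothetical name, the candidate list just reuses the examples above, and the triplet is again spliced in textually because Python's re module has no (?1) call.

```python
import re

def pick_absent_marker(text, candidates=("\u00a4", "///", "@@", "_X_")):
    """Return the first candidate string that does not occur in text."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("every candidate marker occurs in the text")

TRIPLET = r"[A-Z]{2}\d"
subject = "YH5-RC6-UY0-BD5-AZ3-KL9"
marker = pick_absent_marker(subject)   # the currency sign, absent here

# The first branch needs the marker right after a triplet, so it never matches.
pattern = re.compile(
    rf"(?:{TRIPLET}){re.escape(marker)}|" + "-".join([TRIPLET] * 3)
)
print(repr(marker), pattern.findall(subject))
```

    re.escape keeps the marker literal even if a candidate such as a dot or a plus sign is added to the list later.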

    (?x-i)
    (?<Seq>[A-Z]{2}\d)               # TWO CAPITAL letters, followed with a DIGIT in NAMED group 'Seq'
    ¤                                # The CURRENCY sign character ( In fact, any NON-EXISTENT character, in CURRENT file ! )
    |                                # OR
    (?&Seq)-(?&Seq)-(?&Seq)          # THREE 'Seq' triplets, separated with DASHES

    Of course, all the syntaxes above, which help us to create a library of pre-defined groups, are most valuable when you search for numerous patterns built up from these elementary patterns !

    To that end, let’s look at a more interesting and practical example, regarding the search for specific sequences within the genetic code !

    If necessary, refer to the very basic background on the genetic code at the end of this post !

    From now on, the genetic notions invoked are only the fruit of my imagination and are in no way based on scientific data !! They are just used to demonstrate the interest of the above regex syntaxes

    Let’s suppose that the 3 genetic sequences CGUUUA, GCCACUAAACAG and AAUCGACAU, named Seq_1, Seq_2 and Seq_3, are of main importance in order to build up greater genetic chains from a combination of these components

    Now, let’s assume that we want to search, at the same time, for the seven combinations :

    Seq_2 + Seq_3
    Seq_1 + Seq_2
    Seq_1 + Seq_3
    Seq_2 + AAA codon + Seq_3
    Seq_3 + CCC codon + Seq_1
    4 consecutive Seq_1 + Seq_2
    4 consecutive Seq_1 + Seq_3

    Then, the regex to search any of these sequences, delimited with a start and a stop codon, could be, in free-spacing mode :

    (?x-i)                               # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                     # Seq_1 definition ( 6 bases )
    (?<Seq_2>GCCACUAAACAG)               # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)                  # Seq_3 definition ( 9 bases )
    X |                                  # A NON-EXISTENT character in RNA sequence    OR
    (AUG|GUG|UUG)                        # Possible START codons
    (?<Codon>[ACGU]{3})*?                # Any number of CODONS, even ZERO, in the NAMED group Codon
    \K                                   # ANYTHING matched, so far, is DISCARDED
    (?:                                  # START of a NON-CAPTURING group
        (?&Seq_2)(?&Seq_3)        |      # Chain 1 : Seq_2 + Seq_3                     OR
        (?&Seq_1)(?&Seq_2)        |      # Chain 2 : Seq_1 + Seq_2                     OR
        (?&Seq_1)(?&Seq_3)        |      # Chain 3 : Seq_1 + Seq_3                     OR
        (?&Seq_2)AAA(?&Seq_3)     |      # Chain 4 : Seq_2 + AAA + Seq_3               OR
        (?&Seq_3)CCC(?&Seq_1)     |      # Chain 5 : Seq_3 + CCC + Seq_1               OR
        (?&Seq_1){4}(?&Seq_2)     |      # Chain 6 : FOUR consecutive Seq_1 + Seq_2    OR
        (?&Seq_1){4}(?&Seq_3)            # Chain 7 : FOUR consecutive Seq_1 + Seq_3
    )                                    # END of the NON-CAPTURING group
    (?=                                  # START of a LOOK-AHEAD
        (?&Codon)*                       # Any number of CODONS, even ZERO
        (UAA|UGA|UAG)                    # Possible STOP codons
    )                                    # END of the LOOK-AHEAD

    You can test this regex against the lines below, which all begin with a start codon and end with a stop codon :

    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA  # Chain 3
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA  # Chain 3
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG  # Chain 1
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG  # Chains 4 then 5
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA  # Chains 6 then 2
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA  # Chains 7 then 3

    However, if we replace the greedy quantifier of the Codon group with a lazy quantifier, we get other matches. This is expected, because some areas overlap with other areas or include some others !

    So, in order to correctly detect all these chains of nucleotides, we could look only for the zero-length location of the start of each sequence, with the appropriate regex below :

    (?x-i)                               # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                     # Seq_1 definition ( 6 bases )
    (?<Seq_2>GCCACUAAACAG)               # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)                  # Seq_3 definition ( 9 bases )
    X|                                   # A NON-EXISTENT character in RNA sequence    OR
    (?=                                  # START of a LOOK-AHEAD
        (?&Seq_2)(?&Seq_3)        |      # Case 1 : Seq_2 + Seq_3                      OR
        (?&Seq_1)(?&Seq_2)        |      # Case 2 : Seq_1 + Seq_2                      OR
        (?&Seq_1)(?&Seq_3)        |      # Case 3 : Seq_1 + Seq_3                      OR
        (?&Seq_2)AAA(?&Seq_3)     |      # Case 4 : Seq_2 + AAA + Seq_3                OR
        (?&Seq_3)CCC(?&Seq_1)     |      # Case 5 : Seq_3 + CCC + Seq_1                OR
        (?&Seq_1){4}(?&Seq_2)     |      # Case 6 : FOUR consecutive Seq_1 + Seq_2     OR
        (?&Seq_1){4}(?&Seq_3)            # Case 7 : FOUR consecutive Seq_1 + Seq_3
    )                                    # END of the LOOK-AHEAD

    See below, with the indication of the start of each sequence :

                v
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA  # Chain 3
                      v
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA  # Chain 3
                               v
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG  # Chain 1
                         v              v
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG  # Chains 4 then 5
       v                 v
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA  # Chains 6 then 2
                                     v                 v
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA  # Chains 7 then 3

    Now, if we want to completely match each composite sequence, we’ll use this last regex :

    (?x-i)                               # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                     # Seq_1 definition ( 6 bases )
    (?<Seq_2>GCCACUAAACAG)               # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)                  # Seq_3 definition ( 9 bases )
    X|                                   # A NON-EXISTENT character in RNA sequence    OR
    (?:                                  # START of a NON-CAPTURING group
        (?&Seq_2)(?&Seq_3)        |      # Case 1 : Seq_2 + Seq_3                      OR
        (?&Seq_1)(?&Seq_2)        |      # Case 2 : Seq_1 + Seq_2                      OR
        (?&Seq_1)(?&Seq_3)        |      # Case 3 : Seq_1 + Seq_3                      OR
        (?&Seq_2)AAA(?&Seq_3)     |      # Case 4 : Seq_2 + AAA + Seq_3                OR
        (?&Seq_3)CCC(?&Seq_1)     |      # Case 5 : Seq_3 + CCC + Seq_1                OR
        (?&Seq_1){4}(?&Seq_2)     |      # Case 6 : FOUR consecutive Seq_1 + Seq_2     OR
        (?&Seq_1){4}(?&Seq_3)            # Case 7 : FOUR consecutive Seq_1 + Seq_3
    )                                    # END of the NON-CAPTURING group

    See, below, the indication of each sequence with the v letter or the ^ symbol :

                vvvvvvvvvvvvvvv
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA  # Chain 3
                      vvvvvvvvvvvvvvv
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA  # Chain 3
                               vvvvvvvvvvvvvvvvvvvvv
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG  # Chain 1
                         vvvvvvvvvvvvvvvvvvvvvvvv  # Chain 4
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG
                                        ^^^^^^^^^^^^^^^^^^  # Chain 5
       vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv  # Chain 6
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA
                         ^^^^^^^^^^^^^^^^^^  # Chain 2
                                     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv  # Chain 7
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA
                                                       ^^^^^^^^^^^^^^^  # Chain 3

    Note that, in order to get, successively, all the occurrences, even when they overlap, hit the F3 shortcut to get a match. Then, hit the Left arrow, followed by the Right arrow, to advance by one position in the text !

    Finally, it’s obvious that I’m not competing, with my simple regexes, against all the powerful ORF finding tools used by biologists ;-)) Refer to :
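    Outside Notepad++, the manual F3 / Left / Right stepping can be replaced by a capturing group inside a look-ahead, which reports overlapping occurrences in one pass. The following Python sketch assumes the same three made-up sequences; find_chains and the constant names are hypothetical, and the chains are built by string composition because Python's re module has no (?&name) subroutine call.

```python
import re

# Elementary sequences from the post (imaginary, not real biology).
S1, S2, S3 = "CGUUUA", "GCCACUAAACAG", "AAUCGACAU"

# The seven chains, in the same order as in the regexes above.
CHAINS = [
    S2 + S3,
    S1 + S2,
    S1 + S3,
    S2 + "AAA" + S3,
    S3 + "CCC" + S1,
    f"(?:{S1}){{4}}{S2}",
    f"(?:{S1}){{4}}{S3}",
]

# A zero-length look-ahead with a capturing group: the match itself is
# empty, so the next attempt starts one character later and overlapping
# chains are not skipped.
OVERLAPPING = re.compile("(?=(" + "|".join(CHAINS) + "))")

def find_chains(rna):
    """Return (start offset, length) for every chain, overlaps included."""
    return [(m.start(1), len(m.group(1))) for m in OVERLAPPING.finditer(rna)]

line = "AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA"
for start, length in find_chains(line):
    print(start, length, line[start:start + length])
```

    Each offset is reported once, with the first alternative that matches there, which mirrors how the alternation behaves in the regexes above.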

    https://en.wikipedia.org/wiki/Open_reading_frame#ORF_finding_tools

    Best Regards,

    guy038

    Here is a very basic summary of the genetic code, for the sole purpose of the present discussion :

    DNA is an ordered double-helix chain made of four nucleotides, associated in pairs, adenine ( A ) with thymine ( T ) and guanine ( G ) with cytosine ( C ). It carries all the genetic instructions for the development, functioning, growth and reproduction of all known living organisms

    A gene is a sequence of nucleotides that encodes either the synthesis of an RNA molecule from the DNA molecule or the synthesis of a protein from an RNA molecule, and may contain more than 1,000 pairs of nucleotides. Humans have nearly 20,000 genes, of which about 2,000 are thought to be essential to our survival !

    Each triplet of nucleotides, called a codon, part of a double-stranded DNA or part of a single-stranded RNA molecule, corresponds to an amino acid, during the transcription process to RNA or during the translation process to a protein

    A single translated region of the genetic code, in the DNA molecule, is called an ORF ( Open Reading Frame ), containing, generally, a minimum of 100 to 150 codons

    After the transcription phase, from the DNA molecule into Pre-mRNA, and the suppression of introns, we get the mature RNA molecule, made of four nucleotides, associated in pairs, too : adenine ( A ) with uracil ( U ) and guanine ( G ) with cytosine ( C )

    In RNA, the ORF zone has become the Protein Coding Region, composed of :

    A start codon [ AUG , GUG , UUG ]
    A continuous stretch of codons
    A stop codon [ UAA , UGA , UAG ]

    If we consider, for instance, the genome of the Escherichia coli K-12 bacterium, the proportions of each start and stop codon are :

    START codon        STOP codon
    AUG    83%         UAA    63%
    GUG    14%         UGA    29%
    UUG     3%         UAG     8%
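    The three-part structure above can be checked with a few lines of Python. This is only a sketch of the simplified model used in this post: split_codons and looks_like_coding_region are hypothetical names, the codon sets are the ones listed above, and stop codons occurring inside the stretch are deliberately not checked.

```python
START_CODONS = {"AUG", "GUG", "UUG"}
STOP_CODONS = {"UAA", "UGA", "UAG"}

def split_codons(rna):
    """Cut an RNA string into consecutive triplets (the last may be short)."""
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]

def looks_like_coding_region(rna):
    """True if rna is whole codons over ACGU, from a start to a stop codon."""
    codons = split_codons(rna)
    return (
        len(rna) % 3 == 0
        and len(codons) >= 2
        and set(rna) <= set("ACGU")
        and codons[0] in START_CODONS
        and codons[-1] in STOP_CODONS
    )

print(split_codons("AUGGCCUAA"))              # ['AUG', 'GCC', 'UAA']
print(looks_like_coding_region("AUGGCCUAA"))  # True
```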

    Refer also to :

    https://en.wikipedia.org/wiki/Introduction_to_genetics

    https://en.wikipedia.org/wiki/DNA

    https://en.wikipedia.org/wiki/RNA

    https://en.wikipedia.org/wiki/Gene

    https://en.wikipedia.org/wiki/Open_reading_frame

    https://en.wikipedia.org/wiki/Start_codon

    https://en.wikipedia.org/wiki/Stop_codon

  • FAQ: Automatic File Extensions

    1
    5 Votes
    1 Posts
    3k Views
    No one has replied
  • FAQ: Make Notepad++ x64 "Open with..." menu work in Win7

    1
    2 Votes
    1 Posts
    857 Views
    No one has replied
  • FAQ: Where to find REGular EXpressions (RegEx) documentation ?

    2
    19 Votes
    2 Posts
    54k Views
    MAPJe71M

    Additional Regex Tester tool (offline): RegEx Tester

  • FAQ: Why Does My .docx File Look Like Junk In Notepad++

    Locked
    1
    7 Votes
    1 Posts
    20k Views
    No one has replied
  • FAQ: You've asked your question in the wrong place!

    Locked
    1
    6 Votes
    1 Posts
    11k Views
    No one has replied
  • FAQ: Crash caused my file to be all NULLs

    Locked
    1
    5 Votes
    1 Posts
    8k Views
    No one has replied
  • FAQ: Autosaving "unnamed" `new 1` files

    Locked
    1
    4 Votes
    1 Posts
    9k Views
    No one has replied
  • FAQ: What is %AppData%

    Locked
    1
    8 Votes
    1 Posts
    24k Views
    No one has replied