FAQ | Notepad++ Community

FAQ: How to install and run a script in PythonScript

Tags: faq pythonscript script
6 votes, 1 post, 14k views

FAQ: Generic Regular Expression (regex) Formulas

Tags: regex faq
3 votes, 1 post, 3k views

FAQ: Huge Gaps / Blank Areas in the UI Borders

Tags: faq user interface gui
3 votes, 1 post, 2k views

FAQ: Search and Replace Across Files

Tags: search replace regex multiple files faq
3 votes, 1 post, 3k views

FAQ: Function List Basics

Tags: faq functionlist
4 votes, 3 posts, 4k views

For the interested user, @MAPJe71 has a collection of Notepad++ Function List definitions (and other useful files) for a wide variety of programming languages:
https://github.com/MAPJe71/Languages

FAQ: Regex "Backtracking Control Verbs"

Tags: faq regex
2 votes, 6 posts, 4k views

https://community.notepad-plus-plus.org/post/55641

Hello, All,

While studying the backtracking control verbs (see the posts above), I came across this article, which describes ways to pre-define subroutines that can then be reused to build many modular patterns at once:

https://www.rexegg.com/regex-tricks.html#pseudo-define

Note that the special conditional (?(DEFINE)...) structure was originally created for this purpose and is supported by modern regex engines, including our Boost regex engine! I already discussed this syntax in the post below:

https://community.notepad-plus-plus.org/post/52608

So, for instance, in free-spacing mode, the regex below:

    (?x-i)
    (?(DEFINE)          # START of the conditional DEFINE structure
      ([A-Z]{2}\d)      # TWO CAPITAL letters, followed by a DIGIT, in Group 1
    )                   # END of the conditional DEFINE structure
    (?1)-(?1)-(?1)      # THREE "Group 1" triplets, separated by DASHES

applied to the string YH5-RC6-UY0-BD5-AZ3-KL9, matches the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9.
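For readers who want to try the idea outside Notepad++, here is a minimal Python sketch. It is not the Boost syntax shown above: as far as I know, Python's standard re module has no (?1) subroutine call and no (?(DEFINE)...) block, so the reusable piece is simply kept in a string and pasted into the pattern three times. The names TRIPLET, THREE_TRIPLETS and find_triplet_runs are my own, chosen for illustration only.

```python
import re

# Reusable building block, playing the role of "Group 1" in the regex above.
TRIPLET = r"[A-Z]{2}\d"

# No subroutine call is available in the stdlib `re` module (assumption stated
# in the lead-in), so the block is substituted textually, three times.
THREE_TRIPLETS = re.compile(rf"{TRIPLET}-{TRIPLET}-{TRIPLET}")


def find_triplet_runs(text: str) -> list[str]:
    """Return every non-overlapping run of three dash-separated triplets."""
    return THREE_TRIPLETS.findall(text)


if __name__ == "__main__":
    print(find_triplet_runs("YH5-RC6-UY0-BD5-AZ3-KL9"))
```

The search is case-sensitive by default, which mirrors the -i part of the (?x-i) modifier.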

Now, the Rexegg site tells us that, instead of the (?(DEFINE)...) conditional structure, we can use the (*FAIL) backtracking control verb to get similar behavior, with the syntax below, which uses an optional non-capturing group:

    (?x-i)
    (?:                 # START of an OPTIONAL NON-CAPTURING group
      ([A-Z]{2}\d)      # TWO CAPITAL letters, followed by a DIGIT, in Group 1
      (*F)              # Backtracking Control verb (*FAIL)
    )?                  # END of the OPTIONAL NON-CAPTURING group
    (?1)-(?1)-(?1)      # THREE "Group 1" triplets, separated by DASHES

Note that, instead of the (*F) verb, we may also use the empty negative look-ahead (?!) for identical results.

Now, I’ve found a simpler syntax:

    (?x-i)
    ([A-Z]{2}\d)        # TWO CAPITAL letters, followed by a DIGIT, in Group 1
    (*F)                # Backtracking Control verb (*FAIL)
    |                   # OR
    (?1)-(?1)-(?1)      # THREE "Group 1" triplets, separated by DASHES

How does this regex work against the string YH5-RC6-UY0-BD5-AZ3-KL9?

First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring YH5.

At the same time, it stores the pattern [A-Z]{2}\d in group 1.

Then the regex engine meets the backtracking control verb (*F), which forces it to backtrack.

So the current match YH5 is discarded and the starting position remains right before the first letter Y.

Then the regex engine tries the other alternative (?1)-(?1)-(?1) and matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9, as we expect!

Note that the effect of the (*FAIL) verb is exactly the same as if the regex engine tried to match a specific pattern that does not exist in the subject string! So, let's simply replace this verb with, for instance, the CURRENCY sign ¤, Unicode code point \x{00a4}, which is generally not used in files!

You can enter it with the Windows input method: hold the Alt key and type 0164 on the numeric keypad.

So we get the regex:

    (?x-i)
    ([A-Z]{2}\d)        # TWO CAPITAL letters, followed by a DIGIT, in Group 1
    ¤                   # The CURRENCY sign character ( in fact, any character that does NOT exist in the CURRENT file ! )
    |                   # OR
    (?1)-(?1)-(?1)      # THREE "Group 1" triplets, separated by DASHES

Why does my syntax also work?

First, the regex engine tries to match the part ([A-Z]{2}\d) and, indeed, matches the substring YH5.

At the same time, it stores the pattern [A-Z]{2}\d in group 1.

But it cannot match the following currency sign ¤, so the regex engine naturally backtracks.

The starting position remains right before the first letter Y.

And the regex engine tries the other alternative (?1)-(?1)-(?1), which matches, successively, the two substrings YH5-RC6-UY0 and BD5-AZ3-KL9.

Note that the regex engine still remembers the pattern contained in group 1 because, despite the backtracking phase, the search process goes on!

Of course, instead of the single character ¤, you can also choose a simple string that does not exist in the current file, for instance /// or @@ or _X_.
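As a rough illustration of why the three spellings behave alike, here is a small Python sketch. It relies on assumptions: the stdlib re module has neither (*F) nor (?1), so the triplet is pasted in textually and the failing branch serves no purpose other than showing that a branch which can never succeed leaves the other alternative untouched. The names VARIANTS and matches_per_variant are hypothetical.

```python
import re

TRIPLET = r"[A-Z]{2}\d"
CHAIN = rf"{TRIPLET}-{TRIPLET}-{TRIPLET}"

# Three ways to write a first branch that is meant to fail, then the real pattern.
# (?!) is an empty negative look-ahead, which can never succeed.
# "¤" and "@@" only fail as long as they are absent from the searched text.
VARIANTS = {
    "empty look-ahead": rf"{TRIPLET}(?!)|{CHAIN}",
    "currency sign": rf"{TRIPLET}¤|{CHAIN}",
    "dummy string": rf"{TRIPLET}@@|{CHAIN}",
}


def matches_per_variant(text: str) -> dict[str, list[str]]:
    """Run every variant against the text and collect its matches."""
    return {name: re.findall(pattern, text) for name, pattern in VARIANTS.items()}


if __name__ == "__main__":
    for name, found in matches_per_variant("YH5-RC6-UY0-BD5-AZ3-KL9").items():
        print(name, found)
```

The sketch also makes the caveat visible: if the text does contain the chosen character, as in AB1¤, the "currency sign" branch suddenly matches, whereas the look-ahead variant still fails.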

It’s also important to point out that you can perfectly well use named groups with all these syntaxes. For instance, my last regex can be rewritten:

    (?x-i)
    (?<Seq>[A-Z]{2}\d)              # TWO CAPITAL letters, followed by a DIGIT, in NAMED group 'Seq'
    ¤                               # The CURRENCY sign character ( in fact, any character that does NOT exist in the CURRENT file ! )
    |                               # OR
    (?&Seq)-(?&Seq)-(?&Seq)         # THREE 'Seq' triplets, separated by DASHES

Of course, all the syntaxes above, which help us create a library of pre-defined groups, are most valuable when you need to search for numerous patterns built up from these elementary patterns!

To that end, let’s look at a more interesting and practical example: searching for specific sequences within the genetic code!

If necessary, refer to the very basic background on the genetic code at the end of this post!

From now on, the genetic notions invoked are purely the fruit of my imagination and are in no way based on scientific data! They are only used to demonstrate the value of the regex syntaxes above.

Let’s suppose that the three genetic sequences CGUUUA, GCCACUAAACAG and AAUCGACAU, named Seq_1, Seq_2 and Seq_3, are the key components from which larger genetic chains are built.

Now, let’s assume that we want to search, at the same time, for these seven combinations:

    Seq_2 + Seq_3
    Seq_1 + Seq_2
    Seq_1 + Seq_3
    Seq_2 + AAA codon + Seq_3
    Seq_3 + CCC codon + Seq_1
    4 consecutive Seq_1 + Seq_2
    4 consecutive Seq_1 + Seq_3

Then, the regex to search for any of these chains, delimited by a start and a stop codon, could be, in free-spacing mode:

    (?x-i)                            # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                  # Seq_1 definition (  6 bases )
    (?<Seq_2>GCCACUAAACAG)            # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)               # Seq_3 definition (  9 bases )
    X |                               # A NONEXISTENT character in the RNA sequence, then OR
    (AUG|GUG|UUG)                     # Possible START codons
    (?<Codon>[ACGU]{3})*?             # Any number of CODONS, even ZERO, in the NAMED group Codon
    \K                                # ANYTHING matched, so far, is DISCARDED
    (?:                               # START of a NON-CAPTURING group
      (?&Seq_2)(?&Seq_3)       |      # Chain 1 : Seq_2 + Seq_3                   OR
      (?&Seq_1)(?&Seq_2)       |      # Chain 2 : Seq_1 + Seq_2                   OR
      (?&Seq_1)(?&Seq_3)       |      # Chain 3 : Seq_1 + Seq_3                   OR
      (?&Seq_2)AAA(?&Seq_3)    |      # Chain 4 : Seq_2 + AAA + Seq_3             OR
      (?&Seq_3)CCC(?&Seq_1)    |      # Chain 5 : Seq_3 + CCC + Seq_1             OR
      (?&Seq_1){4}(?&Seq_2)    |      # Chain 6 : FOUR consecutive Seq_1 + Seq_2  OR
      (?&Seq_1){4}(?&Seq_3)           # Chain 7 : FOUR consecutive Seq_1 + Seq_3
    )                                 # END of the NON-CAPTURING group
    (?=                               # START of a LOOK-AHEAD
      (?&Codon)*                      # Any number of CODONS, even ZERO
      (UAA|UGA|UAG)                   # Possible STOP codons
    )                                 # END of the LOOK-AHEAD

You can test this regex against the lines below, which all begin with a start codon and end with a stop codon:

    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA    # Chain 3
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA    # Chain 3
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG    # Chain 1
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG    # Chains 4 then 5
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA    # Chains 6 then 2
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA    # Chains 7 then 3
However, if we switch the quantifier of the Codon group between its lazy form *? and its greedy form *, we get other matches. This is expected, because some chains overlap with, or contain, others!

So, in order to correctly detect all these chains of nucleotides, we could look only for the zero-length location at the start of each chain, with the regex below:

    (?x-i)                            # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                  # Seq_1 definition (  6 bases )
    (?<Seq_2>GCCACUAAACAG)            # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)               # Seq_3 definition (  9 bases )
    X |                               # A NONEXISTENT character in the RNA sequence, then OR
    (?=                               # START of a LOOK-AHEAD
      (?&Seq_2)(?&Seq_3)       |      # Case 1 : Seq_2 + Seq_3                   OR
      (?&Seq_1)(?&Seq_2)       |      # Case 2 : Seq_1 + Seq_2                   OR
      (?&Seq_1)(?&Seq_3)       |      # Case 3 : Seq_1 + Seq_3                   OR
      (?&Seq_2)AAA(?&Seq_3)    |      # Case 4 : Seq_2 + AAA + Seq_3             OR
      (?&Seq_3)CCC(?&Seq_1)    |      # Case 5 : Seq_3 + CCC + Seq_1             OR
      (?&Seq_1){4}(?&Seq_2)    |      # Case 6 : FOUR consecutive Seq_1 + Seq_2  OR
      (?&Seq_1){4}(?&Seq_3)           # Case 7 : FOUR consecutive Seq_1 + Seq_3
    )                                 # END of the LOOK-AHEAD

See below, where a v marks the start of each chain:

                v
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA    # Chain 3
                      v
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA    # Chain 3
                               v
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG    # Chain 1
                         v              v
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG    # Chains 4 then 5
       v                 v
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA    # Chains 6 then 2
                                     v                 v
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA    # Chains 7 then 3
Now, if we want to match each composite chain completely, we’ll use this last regex:

    (?x-i)                            # FREE-SPACING mode and CASE-SENSITIVE search
    (?<Seq_1>CGUUUA)                  # Seq_1 definition (  6 bases )
    (?<Seq_2>GCCACUAAACAG)            # Seq_2 definition ( 12 bases )
    (?<Seq_3>AAUCGACAU)               # Seq_3 definition (  9 bases )
    X |                               # A NONEXISTENT character in the RNA sequence, then OR
    (?:                               # START of a NON-CAPTURING group
      (?&Seq_2)(?&Seq_3)       |      # Case 1 : Seq_2 + Seq_3                   OR
      (?&Seq_1)(?&Seq_2)       |      # Case 2 : Seq_1 + Seq_2                   OR
      (?&Seq_1)(?&Seq_3)       |      # Case 3 : Seq_1 + Seq_3                   OR
      (?&Seq_2)AAA(?&Seq_3)    |      # Case 4 : Seq_2 + AAA + Seq_3             OR
      (?&Seq_3)CCC(?&Seq_1)    |      # Case 5 : Seq_3 + CCC + Seq_1             OR
      (?&Seq_1){4}(?&Seq_2)    |      # Case 6 : FOUR consecutive Seq_1 + Seq_2  OR
      (?&Seq_1){4}(?&Seq_3)           # Case 7 : FOUR consecutive Seq_1 + Seq_3
    )                                 # END of the NON-CAPTURING group

See below, where each chain is marked with the v letter or the ^ symbol:

                vvvvvvvvvvvvvvv
    AUGUGCAACGAUCGUUUAAAUCGACAUGCCACUAAACAGUUACAUCAUACUGCCAACCAGGGCCAUGUUUAA    # Chain 3
                      vvvvvvvvvvvvvvv
    GUGAACCAGGGCCAUGUUCGUUUAAAUCGACAUAAUCGACAUAACCCCUUACUUGCUAAUUUCUGA    # Chain 3
                               vvvvvvvvvvvvvvvvvvvvv
    UUGCGUUUAAACCUAAAAGAUGGGGUCGCCACUAAACAGAAUCGACAUAAUCGACAUGGACCGUAG    # Chain 1
                         vvvvvvvvvvvvvvvvvvvvvvvv    # Chain 4
    UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG
                                        ^^^^^^^^^^^^^^^^^^    # Chain 5
       vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv    # Chain 6
    AUGCGUUUACGUUUACGUUUACGUUUAGCCACUAAACAGUCAACUGGAGGAUCCCGGCAUUUUUAA
                         ^^^^^^^^^^^^^^^^^^    # Chain 2
                                     vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv    # Chain 7
    GUGUCGAACUGGGGAGCGCCACCUCAUAGUUGUCGUUUACGUUUACGUUUACGUUUAAAUCGACAUUGA
                                                       ^^^^^^^^^^^^^^^    # Chain 3
Note that, in order to get all the occurrences one after another, even when they overlap, hit the F3 shortcut to get a match. Then hit the Left arrow, followed by the Right arrow, to advance by one position in the text, and hit F3 again!
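Outside Notepad++, the same "advance by one position and search again" routine can be approximated in a script. The Python sketch below is only an illustration under stated assumptions: the stdlib re module has no (?&name) subroutine call, so the three sequences are substituted textually into the seven chains, and the whole alternation is wrapped in a capturing look-ahead so that every start position is tested, overlaps included. The names SEQ, CHAINS and find_chains are hypothetical.

```python
import re

# The three elementary sequences defined in the post.
SEQ = {"Seq_1": "CGUUUA", "Seq_2": "GCCACUAAACAG", "Seq_3": "AAUCGACAU"}

# The seven chains, in the same order as in the regexes above.
# Doubled braces {{4}} produce a literal {4} quantifier after str.format().
CHAINS = [
    "{Seq_2}{Seq_3}",
    "{Seq_1}{Seq_2}",
    "{Seq_1}{Seq_3}",
    "{Seq_2}AAA{Seq_3}",
    "{Seq_3}CCC{Seq_1}",
    "(?:{Seq_1}){{4}}{Seq_2}",
    "(?:{Seq_1}){{4}}{Seq_3}",
]

ALTERNATION = "|".join(chain.format(**SEQ) for chain in CHAINS)

# A zero-length look-ahead with a capturing group: the match itself is empty,
# so the scan moves on one character at a time and overlapping chains are kept.
OVERLAPPING = re.compile(f"(?=({ALTERNATION}))")


def find_chains(rna: str) -> list[tuple[int, str]]:
    """Return (start offset, chain text) for every chain, overlaps included."""
    return [(m.start(), m.group(1)) for m in OVERLAPPING.finditer(rna)]


if __name__ == "__main__":
    sample = "UUGACUUUACAUCAUACUGCCGCCACUAAACAGAAAAAUCGACAUCCCCGUUUAAAACCUUGGAGCCCGUAG"
    for start, chain in find_chains(sample):
        print(start, chain)
```

On the fourth test line used as the sample, this should report two hits, at offsets 21 and 36, which correspond to chains 4 and 5 of the post. As in the original alternation, only the first alternative that succeeds at a given position is reported.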

Finally, it is obvious that my simple regexes are not competing with the powerful ORF-finding tools used by biologists ;-)) Refer to:

https://en.wikipedia.org/wiki/Open_reading_frame#ORF_finding_tools

Best Regards,

guy038

Here is a very basic summary of the genetic code, for the sole purpose of the present discussion:

DNA is a double ordered helix chain, made of four nucleotides, associated by pairs, adenine ( A ) with thymine ( T ) and guanine ( G ) with cytosine ( C ), which carries all the genetic instructions for development, functioning, growth and reproduction of all known living organisms

A gene is a sequence of nucleotides that encodes either the synthesis of an RNA molecule from the DNA molecule or the synthesis of a protein from an RNA molecule, and may contain more than 1,000 pairs of nucleotides. Humans have nearly 20,000 genes, of which about 2,000 are thought to be essential to our survival!

Each triplet of nucleotides, named codon, part of a double-stranded DNA or part of a single-stranded RNA molecule, corresponds to an amino-acid, during the transcription process to RNA or during the translation process to a protein

A single translated region of the genetic code, in the DNA molecule, is called an ORF ( Open Reading Frame ) and generally contains a minimum of 100 to 150 codons

After the transcription phase, from the DNA molecule into Pre-mRNA and the suppression of introns, we get the mature RNA molecule, made of four nucleotides, associated by pairs, too : adenine ( A ) with uracil ( U ) and guanine ( G ) with cytosine ( C )

In RNA, the ORF zone has become the Protein Coding Region, composed of:

    A start codon [ AUG , GUG , UUG ]
    A continuous stretch of codons
    A stop codon [ UAA , UGA , UAG ]
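The three-part layout just described can be written as a small check. This is a deliberately naive Python sketch, in the same spirit as the made-up examples of this post: it only verifies that a string starts with a start codon, ends with a stop codon and is made of whole codons in between. It does not reject a stop codon appearing earlier in the reading frame, which real tools would care about. The names CODING_REGION and looks_like_coding_region are hypothetical.

```python
import re

# start codon, any number of whole codons, stop codon
CODING_REGION = re.compile(r"(?:AUG|GUG|UUG)(?:[ACGU]{3})*(?:UAA|UGA|UAG)")


def looks_like_coding_region(rna: str) -> bool:
    """True if the whole string is start codon + whole codons + stop codon."""
    return CODING_REGION.fullmatch(rna) is not None


if __name__ == "__main__":
    for candidate in ("AUGCCCUGA", "AUGCCC", "AUGCUAA"):
        print(candidate, looks_like_coding_region(candidate))
```

Using fullmatch rather than search keeps the check anchored at both ends without needing ^ and $.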
If we consider, for instance, the genome of the bacterium Escherichia coli K-12, the proportions of each start and stop codon are:

    START codon      STOP codon
    AUG   83%        UAA   63%
    GUG   14%        UGA   29%
    UUG    3%        UAG    8%

Refer also to :

https://en.wikipedia.org/wiki/Introduction_to_genetics

https://en.wikipedia.org/wiki/DNA

https://en.wikipedia.org/wiki/RNA

https://en.wikipedia.org/wiki/Gene

https://en.wikipedia.org/wiki/Open_reading_frame

https://en.wikipedia.org/wiki/Start_codon

https://en.wikipedia.org/wiki/Stop_codon

FAQ: Automatic File Extensions

Tags: extensions txt faq
5 votes, 1 post, 3k views

FAQ: Make Notepad++ x64 "Open with..." menu work in Win7

Tags: faq
2 votes, 1 post, 1k views

FAQ: Where to find REGular EXpressions (RegEx) documentation ?

Tags: faq regex
19 votes, 2 posts, 56k views

Additional Regex Tester tool (offline): RegEx Tester

FAQ: Can't set Notepad++ as a default program for any extension (Windows 10)

Tags: faq
3 votes, 1 post, 10k views

FAQ: Why Does My .docx File Look Like Junk In Notepad++

Tags: binary docx xls doc xlsx faq
7 votes, 1 post, 21k views

FAQ: You've asked your question in the wrong place!

Tags: faq
6 votes, 1 post, 12k views

FAQ: Crash caused my file to be all NULLs

Tags: faq null crash
5 votes, 1 post, 8k views

FAQ: Autosaving "unnamed" `new 1` files

Tags: autosave faq unnamed new 1
4 votes, 1 post, 10k views

FAQ: What is %AppData%

Tags: appdata faq
8 votes, 1 post, 25k views

FAQ: Request for Help without sufficient information to help you

Tags: needmoredetails getting help faq
6 votes, 1 post, 14k views

FAQ Desk: Let's start making new FAQ Topics for the frequent questions/problems, and point people to them

Tags: faq helping others answers
7 votes, 1 post, 8k views