Hi, @비공개, @alan-kilborn, @Neil-schipper and All,
OK, @비공개, I’m going to give some pieces of information but, as always :
You have to know how to make cement before you can put two bricks together
You must know how to put two bricks together before building a wall
You must know how to build a wall before building a room
You must know how to build a room before building a house
and so on !
In other words, check this FAQ which gives you the main links to learn regular expressions, from top to bottom ;-))
Now, let’s go :
----------------------------------------------------------------------------------------------------------------------------------------------------------
Regarding MODIFIERS, generally met at BEGINNING of the regex, but which may occur at ANY location within the overall regex :
(?-i) From this point, the search CARES about letter CASE
(?i) From this point, the search does NOT care about letter CASE
(?-s) From this point, any regex DOT symbol represents a SINGLE STANDARD character. So the . is equivalent to the NEGATED character class
[^\r\n\f\x{0085}\x{2028}\x{2029}] for a Unicode encoded file and to [^\r\n\f] for an ANSI encoded file
(?s) From this point, any regex DOT symbol represents ABSOLUTELY ANY character, including all the LINE-ENDING chars
(?-x) From this point, any LITERAL SPACE character is SIGNIFICANT and is part of the overall regex ( IMPLICIT in a N++ regex )
(?x) From this point, any LITERAL SPACE character is IGNORED and just helps READABILITY of the overall regex.
This mode is called FREE-SPACING mode and the regex can be SPLIT over SEVERAL lines. In this mode :
- Any SPACE char must be written [ ] or \x20 or escaped with a \ character
- Any text, after a # symbol, will be considered as COMMENTS
- Any literal # symbol must be written [#] or \x23 or escaped as \#
(?-m) From this point :
- The regex symbol ^ represents only the VERY BEGINNING of the file, so equivalent to the regex \A
- The regex symbol $ represents only the VERY END of the file, so equivalent to the regex \z
(?m) From this point, the assertions ^ and $ have their USUAL meaning of START and END of line locations ( IMPLICIT in a N++ regex )
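If you want to play with these modifiers outside Notepad++, here is a small sketch using Python's built-in re module. Choosing Python is my assumption, and its engine is not the Boost engine used by N++, so the defaults differ: in Python the ^ assertion is NOT multi-line unless (?m) is given, the dot only excludes \n, and a global modifier must sit at the very beginning of the pattern.

```python
import re

text = "First LINE\nsecond line"

# (?i) : the search does NOT care about letter case
insensitive = re.findall(r"(?i)line", text)
# No modifier : Python's re is case-sensitive by default, like (?-i)
sensitive = re.findall(r"line", text)

# (?s) : the dot also matches line-ending chars, so the match spans two lines
dot_all = re.search(r"(?s)First.+line", text)
# No modifier : the dot stops at \n, so nothing matches here
dot_standard = re.search(r"First.+line", text)

# (?m) : ^ matches at the start of EVERY line
# (unlike N++, Python only does this when (?m) is given)
multi_line = re.findall(r"(?m)^\w+", text)
single_line = re.findall(r"^\w+", text)

# (?x) : free-spacing mode, with comments after the # symbol
free_spacing = re.search(
    r"""(?x)
    second   # the word 'second'
    [ ]      # a literal space, written as a class
    line     # the word 'line'
    """,
    text,
)

print(insensitive, sensitive)
print(dot_all is not None, dot_standard is not None)
print(multi_line, single_line)
print(free_spacing.group(0))
```

The same patterns can be pasted in the N++ 'Search what' zone, keeping in mind that N++ behaves as if (?m) were already set.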
----------------------------------------------------------------------------------------------------------------------------------------------------------
Regarding GROUPS :
(•••••) It defines a CAPTURING group which allows, both :
- The regex engine to STORE the regex ENCLOSED part for FURTHER use, either in the SEARCH and/or the REPLACE part
- The regex ENCLOSED part to be possibly REPEATED with a QUANTIFIER, located right after
(?:•••••) It defines a NON-CAPTURING group which only allows the regex ENCLOSED part to be REPEATED and which is **not** stored by the regex engine
Note that the MODIFIERS, described above, may be INCLUDED within the parentheses :
- In a CAPTURING group as, for instance, ((?i)•••••) so that the INSENSITIVE search is RESTRICTED to the contents of this group, only
- In a NON-CAPTURING group, TWO syntaxes are possible : for instance : (?:(?i)•••••) or the shorthand (?i:•••••)
CAPTURING groups can be RE-USED with the syntax :
- \1 to \9 in the SEARCH and/or REPLACE regexes for reference to group 1 to 9
- $1 to $99 in the REPLACE regex ONLY for reference to group 1 to 99
- ${1} to ${99} in the REPLACE regex ONLY for reference to group 1 to 99
For instance, the ${1}5 syntax means contents of GROUP 1, followed with digit 5, whereas the $15 syntax would have meant contents of GROUP 15
- $0 or ${0} in the REPLACE regex ONLY for reference to the OVERALL match of the SEARCH regex
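The group notions above can be tried with the following sketch, again with Python's re module (my assumption, not the N++ engine). Note the difference in the replacement syntax: Python does not know $1 or ${1} and uses \1 or \g<1> instead, with \g<0> for the overall match. The sample text is purely hypothetical.

```python
import re

text = "hello hello world 42"

# \1 in the SEARCH regex : a word followed by a space and the SAME word
doubled = re.search(r"(\w+) \1", text)

# (?:...) repeats its contents without storing them, so "world" is group 1
non_capturing = re.search(r"(?:hello )+(\w+)", text)

# (?i:...) restricts the insensitive search to the contents of the group
scoped = re.search(r"(?i:HELLO) world", text)

# \g<1>5 means group 1 followed with digit 5 ( like ${1}5 in N++ )
group_then_digit = re.sub(r"(\d)\d", r"\g<1>5", text)

# \g<0> is the overall match ( like $0 in N++ )
whole_match = re.sub(r"\d+", r"[\g<0>]", text)

# \15 is read as group 15, which does not exist in this regex
try:
    re.sub(r"(\d)\d", r"\15", text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True

print(doubled.group(1), non_capturing.group(1), scoped.group(0))
print(group_then_digit)
print(whole_match)
print(ambiguous_failed)
```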
----------------------------------------------------------------------------------------------------------------------------------------------------------
Regarding QUANTIFIERS, 6 syntaxes are possible {n} , {n,}, {n,m}, ?, + and *. Note that :
- {n} EXACTLY n times the character or group, PRECEDING the quantifier
- {n,} n or MORE times the character or group, PRECEDING the quantifier
- {n,m} BETWEEN n and m times the character or group, PRECEDING the quantifier
- ? is equivalent to {0,1}
- + is equivalent to {1,}
- * is equivalent to {0,}
They are considered as GREEDY quantifiers because they match as MANY characters as possible
If these 6 syntaxes are followed with a QUESTION MARK ?, they are called LAZY quantifiers because they match as FEW characters as possible
For instance, given the following sentence :
The licenses for most software are designed to take away your freedom to share and change it
- Regex (?-s)e.+?ar, with the LAZY quantifier +?, matches the 27 chars : e licenses for most softwar
- Regex (?-s)e.+ar , with the GREEDY quantifier +, matches the 75 chars : e licenses for most software are designed to take away your freedom to shar
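The two matches on the sentence above can be checked with this short sketch in Python (an assumption on my side, N++ itself is not involved). The leading (?-s) is dropped because Python's dot already stops at a line break by default.

```python
import re

sentence = ("The licenses for most software are designed to take away "
            "your freedom to share and change it")

# LAZY quantifier +? : stops at the FIRST "ar" found after the first "e"
lazy = re.search(r"e.+?ar", sentence).group(0)

# GREEDY quantifier + : goes on until the LAST "ar" of the sentence
greedy = re.search(r"e.+ar", sentence).group(0)

print(len(lazy), repr(lazy))
print(len(greedy), repr(greedy))
```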
If these 6 syntaxes are followed with an ADDITION sign +, they are called ATOMIC ( or POSSESSIVE ) quantifiers.
- They are quite similar to their GREEDY forms, except that, in case of failure, they don't backtrack to attempt further possible match(es)
- Note that this ADVANCED option should be studied once you are well ACQUAINTED with regexes !
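To get a feeling for what "they don't backtrack" means for the quantifiers followed with a + sign, here is a hedged sketch in Python. It assumes Python 3.11 or later, where the re module accepts the *+ syntax; on older versions the helper possessive_supported() ( a hypothetical name of mine ) returns False and that part is skipped.

```python
import re

def possessive_supported():
    """Return True if this Python version accepts the *+ syntax."""
    try:
        re.compile(r"a*+")
        return True
    except re.error:
        return False

subject = '"abc"'

# GREEDY : .* first grabs abc" and then GIVES BACK the last " to the regex
greedy = re.search(r'".*"', subject)

# ATOMIC / POSSESSIVE : .*+ grabs abc" and never gives it back,
# so the final " of the regex cannot be matched and the search fails
possessive = None
if possessive_supported():
    possessive = re.search(r'".*+"', subject)

print(greedy.group(0))
print(possessive)
```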
----------------------------------------------------------------------------------------------------------------------------------------------------------
BTW, a quick tip to SIMULATE a NORMAL search when the REGULAR EXPRESSION mode is selected : START the search zone with the \Q syntax :
For instance, the regex \Q/* This is a C-comment */ will find the LITERAL string /* This is a C-comment */
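Python's re module has no \Q syntax, so the sketch below uses re.escape() to reach the same goal of a LITERAL search while a regex engine is in use. This is a different mechanism from the N++ one, shown only for comparison, and the source string is hypothetical.

```python
import re

literal = "/* This is a C-comment */"
source = "int x; /* This is a C-comment */ int y;"

# Used as is, the * chars act as QUANTIFIERS and the comment is not found
as_regex = re.search(literal, source)

# re.escape() protects every special char, like \Q does in N++
as_literal = re.search(re.escape(literal), source)

print(as_regex)
print(as_literal.group(0))
```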
Now, I will rewrite my last regex, with your improvement, in the Free-Spacing mode :
----------------------------------------------------------------------------------------------------------------------------------------------------------
(?x-is) # FREE-SPACING mode, search SENSITIVE to CASE and DOT regex symbol represents a SINGLE STANDARD char
(?: # BEGINNING of a NON-CAPTURING group
~~choice # Matches the string ~~choice, with this EXACT case
| # OR ( ALTERNATION symbol )
(?!\A)\G # Matches from RIGHT AFTER the location of the LAST match, IF NOT at the VERY BEGINNING of the file
) # END of the NON-CAPTURING group
.+? # The SMALLEST NON-NULL range of STANDARD characters till...
\K # CURRENT match is DISCARDED and working location is RESET to this POINT
(?<!\() # ONLY if it's NOT PRECEDED with a STARTING parenthesis symbol
"[!'.?\w-]+" # ... a NON-NULL range of WORD chars or the characters !, ', ., ? and -
(?!\)) # ONLY if it's NOT FOLLOWED with an ENDING parenthesis symbol
NOTES :
- This syntax is totally FUNCTIONAL. To be convinced do a NORMAL selection from (?x-is) to ENDING parenthesis symbol and hit the Ctrl + F shortcut
=> This MULTI- lines regex is AUTOMATICALLY inserted in the 'Search what' zone
- The \G assertion means that the NEXT search must start, necessarily, RIGHT AFTER the LAST match !
- I rewrote your regex part [A-Za-z \-\.\!\?'] as [!'.?\w-] because most of the punctuation signs do NOT need to be ESCAPED, within a character CLASS.
- However note that the DASH - must be found at the VERY BEGINNING or the VERY END of the character class, when NON escaped
- I prefer the \w syntax to [A-Za-z] because \w also INCLUDES all the ACCENTED characters of foreign languages ( as well as the digits and the underscore _ )
- You must use ONLY the REPLACE ALL button ( Do NOT click on the REPLACE button for SUCCESSIVE replacements : it won't work due to the \K syntax ! )
- If you don't tick the WRAP AROUND option, move preferably the CARET at the VERY BEGINNING of current file
- From BEGINNING of file, as the regex engine must SKIP some LINE-ENDING characters to get a match, the \G assertion is NOT verified
and the regex engine must necessarily look, FIRST, for a string ~~choice
- Then, from RIGHT AFTER the word choice, it grasps the SMALLEST NON-NULL range of STANDARD chars .+? till a "•••••" structure, but ONLY IF NOT embedded
between PARENTHESES itself !
- And, due to the \K syntax, ONLY the part "•••••" is the FINAL match desired
- This FINAL part is changed with the REPLACE regex \($0\) which just rewrites the string "•••••" between PARENTHESES.
The parenthesis symbols must be ESCAPED as they have a SPECIAL meaning in REPLACEMENT
- Then, from RIGHT AFTER the closing " char, as the regex CANNOT find any other ~~choice string, the (?!\A)\G.+? part, again, selects the SMALLEST NON-NULL
range of STANDARD characters till ANOTHER block "•••••", executes the REPLACEMENT and so on...
In the example, below, in each second line ( Regex types ) :
The dot . represents any char, found by the regex dummy part .+?
The bullet • represents any char, found by the regex useful part [!'.?\w-]
The character " and the string ~~choice stand for themselves
Text processed          ~~choice(["Red", ("Blue"), ("Orange"), … ,"Purple"])
Regex types             ~~choice.."•••"..........................."••••••"
Match number BEFORE \K  1111111111     222222222222222222222222222
Match number AFTER \K             11111                           22222222

Text processed          ~~choice([("Red"), "Blue", "Orange", … ,("Purple")])
Regex types             ~~choice..........."••••".."••••••"
Match number BEFORE \K  1111111111111111111      22
Match number AFTER \K                      111111  22222222

Text processed          ~~choice(["Red", "Blue", "Orange", … ,"Purple"])
Regex types             ~~choice.."•••".."••••".."••••••"....."••••••"
Match number BEFORE \K  1111111111     22      33        44444
Match number AFTER \K             11111  222222  33333333     44444444
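For those who would like to reproduce the result of this Replace All outside N++, here is a rough emulation in Python. Python's re module supports neither \G nor \K, so this is NOT the regex above : it swaps in a two-step technique ( first grab the rest of each line after ~~choice, then wrap the quoted items found there ). The names wrap_choices, ITEM and LINE_TAIL are hypothetical, and some edge cases may differ from the N++ regex.

```python
import re

# A "•••••" block NOT already embedded between parentheses
ITEM = re.compile(r'''(?<!\()"[!'.?\w-]+"(?!\))''')

# The string ~~choice, then the rest of the line ( the dot stops at \n )
LINE_TAIL = re.compile(r"(~~choice)(.+)")

def wrap_choices(text):
    """Put parentheses around each bare "•••••" block found after ~~choice."""
    def fix_tail(match):
        # Same idea as the REPLACE regex \($0\) : rewrite the block between ( and )
        wrapped = ITEM.sub(lambda item: "(" + item.group(0) + ")", match.group(2))
        return match.group(1) + wrapped
    return LINE_TAIL.sub(fix_tail, text)

sample = (
    '~~choice(["Red", ("Blue"), ("Orange"), "Purple"])\n'
    'other(["Red", "Blue"])\n'
    '~~choice([("Red"), "Blue", "Orange", ("Purple")])'
)
print(wrap_choices(sample))
```

The second sample line has no ~~choice string, so it is left untouched, just as the \G chaining in the N++ regex never starts on such a line.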
I hope that you’ll find this article useful, in any way !
However, let me add that the \G and \K assertions, as well as atomic groups and recursive regexes or backtracking verbs ( not discussed ), are difficult notions and I can assure you that there are a LOT of regex things that you need to know before starting to use them !
Best Regards,
guy038