Hi, @비공개, @alan-kilborn, @Neil-schipper and All,
OK, @비공개, I’m going to give some pieces of information but, as always :
You have to know how to make cement before you can put two bricks together
You must know how to put two bricks together before building a wall
You must know how to build a wall before building a room
You must know how to build a room before building a house
and so on !
In other words, check this FAQ which gives you the main links to learn regular expressions, from top to bottom ;-))
Now, let’s go :
---------------------------------------------------------------------------------------------------------------------------------------------------------- Regarding MODIFIERS, generally met at BEGINNING of the regex, but which may occur at ANY location within the overall regex : (?-i) From this point, search care about letter's CASE (?i) From this point, search does NOT care about letter's CASE (?-s) From this point, any regex dot symbol represents a SINGLE STANDARD character. So the . is UNICODE equivalent to the NEGATIVE class character [^\r\n\f\x{0085}\x{2028}\x{2029}] for an Unicode encoded file and equivalent to [^\r\n\f] for an ANSI encoded file (?s) From this point, any regex DOT symbol represents ABSOLUTELY ANY character, included all the LINE-ENDING chars (?-x) From this point, any LITERAL SPACE character is SIGNIFICANT and is part of the overall regex ( IMPLICIT in a N++ regex ) (?x) From this point, any LITERAL SPACE character is IGNORED and just helps READABILITY of the overall regex. This mode is called FREE-SPACING mode and can SPLIT in SEVERAL lines. In this mode : - Any SPACE char must be written [ ] or \x20 or escaped with a \ character - Any text, after a # symbol, will be considered as COMMENTS - Any litteral # symbol must be written [#] or \x23 or escaped as \# (?-m) From this point : - The regex symbol ^ represents only the VERY BEGINNING of the file, so equivalent to the regex \A - The regex symbol $ represents only the VERY END of the file, so equivalent to the regex \z (?m) From this point, the assertions ^ and $ represent their USUAL signification of START and END of line locations ( IMPLICIT in a N++ regex ) ---------------------------------------------------------------------------------------------------------------------------------------------------------- Regarding GROUPS : (•••••) It defines a CAPTURING group which allows, both : - The regex engine to STORE the regex ENCLOSED part for FURTHER use, either in the SEARCH and/or the REPLACE part - The regex ENCLOSED part to be possibly REPEATED with a QUANTIFIER, located right after (?:•••••) It defines a NON-CAPTURING group which only allows the regex ENCLOSED part to be REPEATED and which is **not** stored by the regex engine Note that the MODIFIERS, described above, may be INCLUDED within the parentheses : - In a CAPTURING group as, for instance, ((?i)•••••) so that the INSENSITIVE search is RESTRICTED to the contents of this group, only - In a NON-CAPTURING group, TWO syntaxes are possible : for instance : (?:(?i)•••••) or the shorthand (?i:•••••) CAPTURING groups can be RE-USED with the syntax : - \1 to \9 in the SEARCH and/or REPLACE regexes for reference to group 1 to 9 - $1 to $99 in the REPLACE regex ONLY for reference to group 1 to 99 - ${1} to ${99} in the REPLACE regex ONLY for reference to group 1 to 99 For instance, the ${1}5 syntax means contents of GROUP 1 , followed with digit 5 where as the $15 syntax would have meant contents of GROUP 15 - $0 or ${0} in the REPLACE regex ONLY for reference to the OVERALL math of the SEARCH regex ---------------------------------------------------------------------------------------------------------------------------------------------------------- Regarding QUANTIFIERS, 6 syntaxes are possible {n} , {n,}, {n,m}, ?, + and *. Note that : - {n} EXACTLY n times the character or group, PRECEDING the quantifier - {n,} n or MORE times the character or group, PRECEDING the quantifier - {n,m} BETWEEN n and m times the character or group, PRECEDING the quantifier - ? is equivalent to {0,1} - + is equivalent to {1,} - * is equivalent to {0,} They are considered as GREEDY quantifiers because they match as MANY characters as possible If these 6 syntaxes are followed with a QUESTION MARK ?, they are called LAZY quantifiers because they match as FEW characters as possible For instance, given the following sentence : The licenses for most software are designed to take away your freedom to share and change it - Regex (?-s)e.+?ar, with the LAZY quantifier +?, matches --------------------------- - Regex (?-s)e.+ar , with the GREEDY quantifier +, matches --------------------------------------------------------------------------- If theses 6 syntaxes are followed with a ADDITION sign +, they are called ATOMIC quantifiers. - They are quite similar to their GREEDY forms, exceot that, in case of failure, they don't backtrack to attempt further possible match(es) - Note that this ADVANCED option should be studied when you'll be rather ACQUAINTED with regexes ! ---------------------------------------------------------------------------------------------------------------------------------------------------------- BTW, a quick tip to SIMULATE a NORMAL search when the REGULAR EXPRESSION mode is selected : START the search zone with the \Q syntax : For instance, the regex \Q/* This is a C-comment */ will find the LITERAL string /* This is a C-comment */Now, I will rewrite my last regex, with your improvement, in the Free-Spacing mode :
---------------------------------------------------------------------------------------------------------------------------------------------------------- (?x-is) # FREE-SPACING mode, search SENSITIVE to CASE and DOT regex symbol represents a SINGLE STANDARD char (?: # BEGINNING of a NON-CAPTURING group ~~choice # Matches the string ~~choice, with this EXACT case | # OR ( ALTERNATION symbol ) (?!\A)\G # Matches from RIGHT AFTER the location of the LAST match, IF NOT at the VERY BEGINNING of the file ) # END of the NON-CAPTURING group .+? # The SMALLEST NON-NULL range of STANDARD characters till... \K # CURRENT match is DISCARDED and working location is RESET to this POINT (?<!\() # ONLY if it's NOT PRECEDED with a STARTING parenthesis symbol "[!'.?\w-]+" # ... a NON-NULL range of WORD chars or the characters !, ', ., ? and - (?!\)) # ONLY if it's NOT FOLLOWED with an ENDING parenthesis symbol NOTES : - This syntax is totally FUNCTIONAL. To be convinced do a NORMAL selection from (?x-is) to ENDING parenthesis symbol and hit the Ctrl + F shortcut => This MULTI- lines regex is AUTOMATICALLY inserted in the 'Search what' zone - The \G assertion means that the NEXT search must start, necessarily, RIGHT AFTER the LAST match ! - I rewrote your regex part [A-Za-z \-\.\!\?'] as [!'.?\w-] because most of the punctuation signs do NOT need to be ESCAPED, within a CLASS character. - However note that the DASH - must be found at the VERY BEGINNING or the VERY END of the class character, when NON escaped - I prefer the \w syntax to [A-Za-z] because \w also INClUDES all the ACCENTUATED characters of foreign languages - You must use ONLY the REPLACE ALL button ( Do NOT click on the REPLACE button for SUCCESSIVE replacements : it won't work due to the \K syntax ! ) - If you don't tick the WRAP AROUND option, move preferably the CARET at the VERY BEGINNING of current file - From BEGINNING of file, as the regex engine must SKIP some LINE-ENDING characters to get a match, the \G assertion is NOT verified and the regex engine must necessarily look, FIRST, for a string ~~choice - Then, from RIGHT AFTER the word choice, it grasps the SMALLEST NON-NULL range of STANDARD chars .+? till a "•••••" structure, but ONLY IF NOT embedded between PARENTHESES itself ! - And, due to the \K syntax, ONLY the part "•••••" is the FINAL match desired - This FINAL part is changed with the REPLACE regex \($0\) which just rewrites the string "•••••" between PARENTHESES. The parenthesis symbols must be ESCAPED as they have a SPECIAL signification in REPLACEMENT - Then, from RIGHT AFTER the closing " char, as the regex CANNOT find any other ~~choice string, the (?!\A)\G.+? part, again, selects the SMALLEST NON-NULL range of STANDARD characters till an OTHER block "•••••", execute the REPLACEMENT and so on...In the example, below, in each second line ( Regex types ) :
The dot . represents any char, found by the regex dummy part .+?
The bullet • represents any char, found by the regex useful part [!'.?\w-]
The character " and the string ~~choice stand for themselves
Text processed ~~choice(["Red", ("Blue"), ("Orange"), … ,"Purple"]) Regex types ~~choice.."•••"..........................."••••••" Match number BEFORE \K 1111111111 222222222222222222222222222 Match number AFTER \K 11111 22222222 Text processed ~~choice([("Red"), "Blue", "Orange", … ,("Purple")]) Regex types ~~choice..........."••••".."••••••" Match number BEFORE \K 1111111111111111111 22 Match number AFTER \K 111111 22222222 Text processed ~~choice(["Red", "Blue", "Orange", … ,"Purple"]) Regex types ~~choice.."•••".."••••".."••••••"....."••••••" Match number BEFORE \K 1111111111 22 33 44444 Match number AFTER \K 11111 222222 33333333 44444444I hope that you’ll find this article useful, in any way !
However, let me add that the \G and \K assertions, as well as atomic groups and recursive regexes or backtracking verbs ( not discussed ), are difficult notions and I can assure you that there are a LOT of regex things that you need to know before starting to use them !
Best Regards,
guy038