Regex help: Find/Replace only on lines that include specific words
-
First, I’m not a native English speaker and the sentences I write might be wrong, so feel free to ask me if you don’t understand my question.
Ok…
I have text files with tons of sentences.
Like
.
.
~~list([“Apple”, “Banana”, “Orange”])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~screen(“fruit_image”, _choice[1], )
~~action Return([“category”, “fruits”])
And tens of thousands of other sentences…
.
.
And what I want to do is replace only the “words” in choice([]) with (“words”).
Ex) choice([“Red”, “Blue”, “Orange”, … ,“Purple”]) --> choice([(“Red”), (“Blue”), (“Orange”),…,(“Purple”)])
I need to change only the “words” inside choice([]) and not inside list([]), screen([]), etc.
So this is the question I want to ask.
1. What regular expression should I use to do that?
I’ve tried some regexes, but I couldn’t find just the parts I wanted… The biggest problem is that I still only understand a little bit about regular expressions. I’ve been studying recently, but it’s still so hard…😭😭
2. Is there a way to find/replace only within bookmarked lines?
If that is possible, I can solve the problem by bookmarking only the lines containing ‘choice’ and replacing (".*?") or (?=.*?\]\))(".*?") with ($1).
3. Or is there a way to select all search results, or to find/replace within the search results?
Then, I can solve the problem by using the ‘find in selection’ function…
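Regarding question 2 (restricting the replacement to lines that contain choice), the same idea can be sketched outside the editor as a simple line filter. The Python below is only an illustration and was not proposed in this thread: the function name wrap_on_choice_lines and the marker string are hypothetical, and it assumes plain ASCII double quotes and one statement per line.

```python
import re

# Any run of characters between two plain double quotes on one line.
QUOTED = re.compile(r'"[^"\n]*"')


def wrap_on_choice_lines(text, marker="choice(["):
    """Wrap quoted strings in parentheses, but only on lines containing marker."""
    result = []
    for line in text.split("\n"):
        if marker in line:
            # m.group(0) is the whole quoted string, quotes included.
            line = QUOTED.sub(lambda m: "(" + m.group(0) + ")", line)
        result.append(line)
    return "\n".join(result)


sample = "\n".join([
    '~~list(["Apple", "Banana", "Orange"])',
    '~~choice(["Red", "Blue", "Orange"])',
    '~~screen("fruit_image", _choice[1], )',
])
print(wrap_on_choice_lines(sample))
```

Note that this sketch is not safe to run twice on the same text: a second pass would add another pair of parentheses around words that are already wrapped.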
-
Thank you @Alan-Kilborn😀
I made a regex after looking at the linked post, and it ‘almost’ seems to work well…
Find:
(?i-s)(?:choice\(\[|\G).*?\K(?=.*?\]\))(".*?")
Replace with:
\($1\)
I think it found every “word” in choice([]), so my problem was solved.
But there was one exception: it also found the code below.
screen dropdown_menu(pos=(0, 0), name="", spacing=0, items_offset=(0, 0), background="#00000080", style="empty", iconset=["▾", "▴"]):
Can you tell me why that code was found as well?
I want to know what the problem with my regex is.
-
@비공개 I have an incomplete solution based on this strategy:
For each hit we want to match:
- "choice(" followed by..
- any stuff, until you bump into..
- a space or comma or left square bracket, followed immediately by..
- NOT "(" (and issue a match reset), followed by..
- a word inside fancy quotation marks (with this last text going into capture group 1)
The search string is:
(?<=choice\().*?[ ,\[](?<!\()\K(\“\w+\”)
(After the comma there should be exactly ONE backslash before the left square bracket; the forum may display it differently.)
I found this matches what (I think) you want – well, I think it matches because the editor highlights it.
(Also, the backslashes escaping the fancy quotation marks appear to be optional.)
The replace text I used:
\(\1\)
Here’s the big problem: the matched text doesn’t get replaced. I need a guru to explain to me why this is so.
My guess is that the problem is related to the fact that the fancy quotation marks are each three bytes long: E2809C and E2809D.
Another weakness of my solution is that (if replace worked) it only processes the first text meeting the criteria on a line, so you’d need to run “replace all” a bunch of times. (I think there are ways of overcoming this, resetting and backtracking, but I haven’t looked closely into that.)
-
@Neil-Schipper My suggestion that the failure to replace the captured text is related to the fancy quotation marks is probably wrong, because I could easily match-and-capture
(“Blue”)
and replace it with
\(\1\)
which gave the expected results.
So maybe the problem is related to my use of \K.
-
Hello, @비공개, @alan-kilborn, @Neil-schipper and All,
Thanks for trying to get the solution by yourself !
I’ve already found out a suitable regex S/R for your case ! Try this version and tell me if it avoids the mentioned side-effects !
SEARCH
(?-is)(?:~~choice|(?!\A)\G).+?\K"\w+"
REPLACE
\($0\)
If OK, I could give you some regex explanations next time !
Best Regards,
guy038
P.S. :
I supposed that your file contains only regular double quotes " and not the “ and ” characters, of Unicode value \x{201C} and \x{201D}, which are automatically displayed in our forum !
-
@guy038 It didn’t work for me. I ran it on this test text:
~~list([“Apple”, “Banana”, “Orange”])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~choice([“Red”, (“Blue”), (“Orange”),…,(“Purple”)])
~~choice([(“Red”), “Blue”, (“Orange”),…,(“Purple”)])
~~choice([(“Red”), (“Blue”), “Orange”,…,(“Purple”)])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~screen(“fruit_image”, _choice[1], )
~~action Return([“category”, “fruits”])
choice([(“Red”), (“Blue”), (“Orange”),…,(“Purple”)])
and after seeing your P.S. I converted the fancy qm’s to standard ascii:
~~list(["Apple", "Banana", "Orange"])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~choice(["Red", ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), "Blue", ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), "Orange",…,("Purple")])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
but I still get no matches.
I didn’t analyze your search string, and I have no doubt it’s based on sound principles.
I am amazed to learn about the quotation marks getting altered. That’s another “gotcha” that warrants documentation in an easily found location! (Maybe it’s a feature that can be disabled.)
Also the codes for the qm’s you state are different from mine. I got mine (lazily) by running a conversion using the Converter plug-in, which I have not vetted for byte-level correctness against standard character tables. Yet another trap for the unwary?
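The two sets of codes are most likely not in conflict: E2809C and E2809D are the UTF-8 byte sequences of the code points U+201C and U+201D, so the Converter plug-in would be reporting stored bytes while \x{201C} and \x{201D} name the characters themselves. A quick check, with Python used here only as a neutral calculator (the helper name utf8_hex is made up for this example):

```python
def utf8_hex(ch):
    """Return the UTF-8 encoding of one character as upper-case hex."""
    return ch.encode("utf-8").hex().upper()


for ch in ("\u201c", "\u201d"):
    # ord() gives the code point; encode() gives the bytes stored in a UTF-8 file.
    print(f"U+{ord(ch):04X} is stored as {utf8_hex(ch)}")
```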
-
Hello, @비공개, @alan-kilborn, @Neil-schipper and All,
Ah…OK, Neil. So I improved my regex S/R in order that it will not process anything if the double quotes are already preceded and followed with parentheses !
Here is the new version :
SEARCH
(?-is)(?:~~choice|(?!\A)\G).+?\K(?<!\()"\w+"(?!\))
REPLACE
\($0\)
Taking your INPUT text into account :
~~list(["Apple", "Banana", "Orange"])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~choice(["Red", ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), "Blue", ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), "Orange",…,("Purple")])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
It correctly changes it as below :
~~list(["Apple", "Banana", "Orange"])
~~choice([("Red"), ("Blue"), ("Orange"), … ,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"), … ,("Purple")])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
Notes :
- You must use only the Replace All button ( Do NOT click on the Replace button for successive replacements : it won’t work due to the \K syntax ! )
- If you don’t tick the Wrap around option, preferably move the caret to the very beginning of the current file
- This new version avoids the formation of forms such as ((((("text"))))) , if you’re trying to execute this regex S/R several times !
BR
guy038
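The effect of the two added look-arounds can be tried in isolation. The sketch below uses Python's standard re module, which is not the engine used by Notepad++ and does not support \K or \G, so it keeps only the (?<!\()"\w+"(?!\)) part and does not restrict itself to ~~choice lines. The names UNWRAPPED and wrap_unwrapped are invented for the example.

```python
import re

# A quoted word that is NOT already wrapped: no "(" right before, no ")" right after.
UNWRAPPED = re.compile(r'(?<!\()"\w+"(?!\))')


def wrap_unwrapped(line):
    """Add parentheses around quoted words that do not have them yet."""
    return UNWRAPPED.sub(lambda m: "(" + m.group(0) + ")", line)


line = '~~choice(["Red", ("Blue"), "Orange"])'
once = wrap_unwrapped(line)
twice = wrap_unwrapped(once)
print(once)
print(once == twice)
```

Because already wrapped words no longer match, running the replacement a second time leaves the text unchanged, which is the property described in the last note above.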
-
-
@guy038 Following your instructions, this works exactly as you say.
Furthermore, I see now that your earlier search string also works with Replace All (which I hadn’t tried) but adds the unwanted extra sets of () .
Furthermore, I also see now that my original search string (my first post in this thread) also works with Replace All (which I also hadn’t tried) but with the requirement for successive runs to get the whole job done as I had stated.
It appears there’s something about \K that I don’t understand, unless it’s something not fully described in the docs but that was discovered by trial and error.
I do consider it a weakness (of both of our search strings) that single replaces don’t work.
-
There were many answers while I was sleeping. Thank you so much @guy038, @Neil-Schipper, @Alan-Kilborn and all!
OK, so @guy038, as soon as I woke up, I tried your method and it solved my problem perfectly!😀😀😀
The “words” I’m looking for aren’t just made of “word characters”, so I just changed \w to [A-Za-z \-\.\!\?']. The example I gave was not appropriate. I’m sorry.
And please! I need your explanation.
In particular, I’m not sure what (?:choice|(?!\A)\G) and .+?\K" mean.
-
Hi, @비공개, @alan-kilborn, @Neil-schipper and All,
OK, @비공개, I’m going to give some pieces of information but, as always :
- You have to know how to make cement before you can put two bricks together
- You must know how to put two bricks together before building a wall
- You must know how to build a wall before building a room
- You must know how to build a room before building a house
and so on !
In other words, check this FAQ which gives you the main links to learn regular expressions, from top to bottom ;-))
Now, let’s go :
----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding MODIFIERS, generally met at BEGINNING of the regex, but which may occur at ANY location within the overall regex :

(?-i)    From this point, search cares about letter's CASE
(?i)     From this point, search does NOT care about letter's CASE

(?-s)    From this point, any regex dot symbol represents a SINGLE STANDARD character. So the . is equivalent to the NEGATIVE character class
         [^\r\n\f\x{0085}\x{2028}\x{2029}] for a Unicode encoded file and equivalent to [^\r\n\f] for an ANSI encoded file
(?s)     From this point, any regex DOT symbol represents ABSOLUTELY ANY character, including all the LINE-ENDING chars

(?-x)    From this point, any LITERAL SPACE character is SIGNIFICANT and is part of the overall regex ( IMPLICIT in a N++ regex )
(?x)     From this point, any LITERAL SPACE character is IGNORED and just helps READABILITY of the overall regex.
         This mode is called FREE-SPACING mode and the regex can be SPLIT over SEVERAL lines. In this mode :
         - Any SPACE char must be written [ ] or \x20 or escaped with a \ character
         - Any text, after a # symbol, will be considered as COMMENTS
         - Any literal # symbol must be written [#] or \x23 or escaped as \#

(?-m)    From this point :
         - The regex symbol ^ represents only the VERY BEGINNING of the file, so equivalent to the regex \A
         - The regex symbol $ represents only the VERY END of the file, so equivalent to the regex \z
(?m)     From this point, the assertions ^ and $ represent their USUAL signification of START and END of line locations ( IMPLICIT in a N++ regex )

----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding GROUPS :

(•••••)      It defines a CAPTURING group which allows, both :
             - The regex engine to STORE the regex ENCLOSED part for FURTHER use, either in the SEARCH and/or the REPLACE part
             - The regex ENCLOSED part to be possibly REPEATED with a QUANTIFIER, located right after

(?:•••••)    It defines a NON-CAPTURING group which only allows the regex ENCLOSED part to be REPEATED and which is NOT stored by the regex engine

Note that the MODIFIERS, described above, may be INCLUDED within the parentheses :
- In a CAPTURING group as, for instance, ((?i)•••••) so that the INSENSITIVE search is RESTRICTED to the contents of this group, only
- In a NON-CAPTURING group, TWO syntaxes are possible, for instance : (?:(?i)•••••) or the shorthand (?i:•••••)

CAPTURING groups can be RE-USED with the syntax :
- \1 to \9 in the SEARCH and/or REPLACE regexes for reference to group 1 to 9
- $1 to $99 in the REPLACE regex ONLY for reference to group 1 to 99
- ${1} to ${99} in the REPLACE regex ONLY for reference to group 1 to 99
  For instance, the ${1}5 syntax means contents of GROUP 1, followed with digit 5, whereas the $15 syntax would have meant contents of GROUP 15
- $0 or ${0} in the REPLACE regex ONLY for reference to the OVERALL match of the SEARCH regex

----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding QUANTIFIERS, 6 syntaxes are possible : {n} , {n,}, {n,m}, ?, + and *. Note that :
- {n}      EXACTLY n times the character or group, PRECEDING the quantifier
- {n,}     n or MORE times the character or group, PRECEDING the quantifier
- {n,m}    BETWEEN n and m times the character or group, PRECEDING the quantifier
- ?        is equivalent to {0,1}
- +        is equivalent to {1,}
- *        is equivalent to {0,}

They are considered as GREEDY quantifiers because they match as MANY characters as possible

If these 6 syntaxes are followed with a QUESTION MARK ?, they are called LAZY quantifiers because they match as FEW characters as possible

For instance, given the following sentence, the dashes underline what each regex matches :

    The licenses for most software are designed to take away your freedom to share and change it
      ---------------------------                                                                    Regex (?-s)e.+?ar, with the LAZY quantifier +?
      ---------------------------------------------------------------------------                    Regex (?-s)e.+ar , with the GREEDY quantifier +

If these 6 syntaxes are followed with an ADDITION sign +, they are called POSSESSIVE quantifiers ( they behave like ATOMIC groups ).
- They are quite similar to their GREEDY forms, except that, in case of failure, they don't backtrack to attempt further possible match(es)
- Note that this ADVANCED option should be studied when you'll be rather ACQUAINTED with regexes !

----------------------------------------------------------------------------------------------------------------------------------------------------------

BTW, a quick tip to SIMULATE a NORMAL search when the REGULAR EXPRESSION mode is selected : START the search zone with the \Q syntax

For instance, the regex \Q/* This is a C-comment */ will find the LITERAL string /* This is a C-comment */
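The lazy versus greedy behaviour described above is common to most regex engines, so it can be checked outside Notepad++ as well. Here is a minimal sketch with Python's standard re module (a different engine from the one in N++); by default the Python dot does not match a newline, which plays the role of (?-s) here.

```python
import re

sentence = ("The licenses for most software are designed to take away "
            "your freedom to share and change it")

# Lazy: stop at the FIRST "ar" that allows a match.
lazy = re.search(r"e.+?ar", sentence)
# Greedy: extend to the LAST "ar" that allows a match.
greedy = re.search(r"e.+ar", sentence)

print(lazy.group(0))
print(greedy.group(0))
print(len(lazy.group(0)), len(greedy.group(0)))
```

Both matches start at the first e of the sentence; only the end point differs.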
Now, I will rewrite my last regex, with your improvement, in the Free-Spacing mode :

----------------------------------------------------------------------------------------------------------------------------------------------------------

(?x-is)               #  FREE-SPACING mode, search SENSITIVE to CASE and DOT regex symbol represents a SINGLE STANDARD char
(?:                   #  BEGINNING of a NON-CAPTURING group
    ~~choice          #      Matches the string ~~choice, with this EXACT case
  |                   #    OR ( ALTERNATION symbol )
    (?!\A)\G          #      Matches from RIGHT AFTER the location of the LAST match, IF NOT at the VERY BEGINNING of the file
)                     #  END of the NON-CAPTURING group
.+?                   #  The SMALLEST NON-NULL range of STANDARD characters till...
\K                    #  CURRENT match is DISCARDED and working location is RESET to this POINT
(?<!\()               #  ONLY if it's NOT PRECEDED with a STARTING parenthesis symbol
"[!'.?\w-]+"          #  ... a NON-NULL range of WORD chars or the characters !, ', ., ? and -
(?!\))                #  ONLY if it's NOT FOLLOWED with an ENDING parenthesis symbol

NOTES :
- This syntax is totally FUNCTIONAL. To be convinced, do a NORMAL selection from (?x-is) to the ENDING parenthesis symbol and hit the Ctrl + F shortcut => This MULTI-line regex is AUTOMATICALLY inserted in the 'Search what' zone
- The \G assertion means that the NEXT search must start, necessarily, RIGHT AFTER the LAST match !
- I rewrote your regex part [A-Za-z \-\.\!\?'] as [!'.?\w-] because most of the punctuation signs do NOT need to be ESCAPED, within a character CLASS
- However, note that the DASH - must be found at the VERY BEGINNING or the VERY END of the character class, when NON escaped
- I prefer the \w syntax to [A-Za-z] because \w also INCLUDES all the ACCENTED characters of foreign languages
- You must use ONLY the REPLACE ALL button ( Do NOT click on the REPLACE button for SUCCESSIVE replacements : it won't work due to the \K syntax ! )
- If you don't tick the WRAP AROUND option, preferably move the CARET to the VERY BEGINNING of the current file
- From BEGINNING of file, as the regex engine must SKIP some LINE-ENDING characters to get a match, the \G assertion is NOT verified and the regex engine must necessarily look, FIRST, for a string ~~choice
- Then, from RIGHT AFTER the word choice, it grasps the SMALLEST NON-NULL range of STANDARD chars .+? till a "•••••" structure, but ONLY IF NOT embedded between PARENTHESES itself !
- And, due to the \K syntax, ONLY the part "•••••" is the FINAL match desired
- This FINAL part is changed with the REPLACE regex \($0\) which just rewrites the string "•••••" between PARENTHESES. The parenthesis symbols must be ESCAPED as they have a SPECIAL signification in REPLACEMENT
- Then, from RIGHT AFTER the closing " char, as the regex CANNOT find any other ~~choice string, the (?!\A)\G.+? part, again, selects the SMALLEST NON-NULL range of STANDARD characters till ANOTHER block "•••••", executes the REPLACEMENT and so on...
In the example below, in each second line ( Regex types ) :
- The dot . represents any char, found by the regex dummy part .+?
- The bullet • represents any char, found by the regex useful part [!'.?\w-]
- The character " and the string ~~choice stand for themselves

Text processed           ~~choice(["Red", ("Blue"), ("Orange"), … ,"Purple"])
Regex types              ~~choice.."•••"..........................."••••••"
Match number BEFORE \K   1111111111     222222222222222222222222222
Match number AFTER \K              11111                           22222222

Text processed           ~~choice([("Red"), "Blue", "Orange", … ,("Purple")])
Regex types              ~~choice..........."••••".."••••••"
Match number BEFORE \K   1111111111111111111      22
Match number AFTER \K                       111111  22222222

Text processed           ~~choice(["Red", "Blue", "Orange", … ,"Purple"])
Regex types              ~~choice.."•••".."••••".."••••••"....."••••••"
Match number BEFORE \K   1111111111     22      33        44444
Match number AFTER \K              11111  222222  33333333     44444444
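The chaining shown in these diagrams can also be imitated step by step in a language whose regex engine has neither \G nor \K. The following is a rough sketch in Python's standard re module, not the Notepad++ engine: pattern.match(text, pos) succeeds only at pos, which stands in for \G, and capture group 1 stands in for what \K leaves as the final match. The names START, STEP and wrap_choice_words are invented for this example.

```python
import re

START = re.compile(r"~~choice")

# One step of the chain, written in free-spacing (verbose) style.
STEP = re.compile(r"""
    .+?                 # smallest non-empty run of non-newline characters
    (?<!\()             # not preceded by an opening parenthesis
    ("[!'.?\w-]+")      # group 1: the quoted word, i.e. the final match
    (?!\))              # not followed by a closing parenthesis
""", re.VERBOSE)


def wrap_choice_words(text):
    """Return (new_text, matches) after wrapping unwrapped quoted words
    that follow ~~choice on the same line."""
    pieces, matches = [], []
    copied = 0          # everything before this index is already in pieces
    search_from = 0
    while True:
        start = START.search(text, search_from)
        if start is None:
            break
        cursor = start.end()
        while True:
            # Anchored at cursor: the step must begin where the last one ended.
            step = STEP.match(text, cursor)
            if step is None:
                break
            pieces.append(text[copied:step.start(1)])
            pieces.append("(" + step.group(1) + ")")
            matches.append(step.group(1))
            copied = cursor = step.end(1)
        search_from = cursor
    pieces.append(text[copied:])
    return "".join(pieces), matches


text = ('~~choice(["Red", ("Blue"), ("Orange"), … ,"Purple"])\n'
        '~~screen("fruit_image", _choice[1], )')
new_text, found = wrap_choice_words(text)
print(new_text)
print(found)
```

For the first diagram's line, the sketch reports two final matches, the quoted words Red and Purple, and leaves the ~~screen line alone because the chain cannot cross the line ending.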
I hope that you’ll find this article useful, in any way !
However, let me add that the \G and \K assertions, as well as atomic groups and recursive regexes or backtracking verbs ( not discussed ), are difficult notions and I can assure you that there are a LOT of regex things that you need to know before starting to use them !
Best Regards,
guy038
-