Regex help: Find/Replace only on lines that include specific words

비공개

First, I’m not a native English speaker, so the sentences I write might be wrong. Feel free to ask if you don’t understand my question.

Ok…
I have text files with tons of sentences.
Like
.
.
~~list([“Apple”, “Banana”, “Orange”])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~screen(“fruit_image”, _choice[1], )
~~action Return([“category”, “fruits”])
And tens of thousands of other sentences…
.
.

What I want to do is replace only the “words” inside choice([]) with (“words”).
Ex) choice([“Red”, “Blue”, “Orange”, … ,“Purple”]) --> choice([(“Red”), (“Blue”), (“Orange”),…,(“Purple”)])

I need to change only “words” inside the choice([]) and not inside the list([]), screen([]), etc.

So this is the question I want to ask.

1. What regular expression should I use to do that?
I’ve tried some regexes, but I couldn’t match just the parts I wanted… The biggest problem is that I still understand only a little about regular expressions. I’ve been studying them recently, but it’s still so hard…😭😭

2. Is there a way to find/replace only within bookmarked lines?
If that is possible, I can solve the problem by bookmarking only the lines containing ‘choice’ and replacing (".*?") or (?=.*?\]\))(".*?") with ($1).

3. Or is there a way to select all search results or to find/replace in the search results?
Then, I can solve the problem by using the ‘find in selection’ function…

Alan Kilborn

@비공개

The answers to your questions 2 and 3 are “no” and “no”.

For question number 1, there is a technique presented HERE for solving that. Have a read and see if you can apply that technique.

비공개

Thank you @Alan-Kilborn😀
I made a regex after looking at the linked post, and it ‘almost’ seems to work well…

Find: (?i-s)(?:choice\(\[|\G).*?\K(?=.*?\]\))(".*?")
Replaced by: \($1\)

I think it found every “words” in choice([]), so my problem was solved.
But there was one exception, it also found the code below.

screen dropdown_menu(pos=(0, 0), name="", spacing=0, items_offset=(0, 0), background="#00000080", style="empty", iconset=["▾", "▴"]):

Can you tell me why that code was found as well?
I want to know what is wrong with the regex I made.

Neil Schipper

@비공개 I have an incomplete solution based on this strategy:

For each hit we want to match:

"choice(" followed by..
any stuff, until you bump into..
space or comma or left square bracket, followed immediately by..
NOT "("
(and issue a match reset) 
followed by..
a word inside fancy quotation marks
(and this last text going into capture group 1)

The search string is: (?<=choice\().*?[ ,\[](?<!\()\K(\“\w+\”)

(I hope it appears that after the comma there’s ONE backslash before the left square bracket).

I found this matches what (I think) you want – well, I think it matches because the editor highlights it.

(Also, the backslashes escaping the fancy quotation marks appear to be optional.)

The replace text I used: \(\1\)

Here’s the big problem: the matched text doesn’t get replaced. I need a guru to explain to me why this is so.

My guess is that the problem is related to the fact that the fancy quotation marks are each three bytes long: E2809C and E2809D.

Another weakness of my solution is that (if replace worked) it only processes the first text meeting the criteria on a line, so you’d need to run “replace all” a bunch of times. (I think there are ways of overcoming this, resetting and backtracking, but I haven’t looked closely into that.)

Neil Schipper

@Neil-Schipper My suggestion that the failure to replace the captured text is related to the fancy quotation marks is probably wrong because I could easily match-and-capture (“Blue”) and replace it with \(\1\) which gave the expected results.

So maybe the problem is related to my use of \K.

guy038

Hello, @비공개, @alan-kilborn, @Neil-schipper and All,

Thanks for trying to get the solution by yourself !

I’ve already found out a suitable regex S/R for your case ! Try this version and tell me if it avoids the mentioned side-effects !

SEARCH (?-is)(?:~~choice|(?!\A)\G).+?\K"\w+"

REPLACE \($0\)

If OK, I could give you some regex explanations next time !

Best Regards,

guy038

P.S. :

I supposed that your file contains only regular double quotes " and not the “ and ” characters, of Unicode value \x{201C} and \x{201D}, which are automatically displayed in our forum !

Neil Schipper

@guy038 It didn’t work for me. I ran it on this test text:

~~list([“Apple”, “Banana”, “Orange”])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~choice([“Red”, (“Blue”), (“Orange”),…,(“Purple”)])
~~choice([(“Red”), “Blue”, (“Orange”),…,(“Purple”)])
~~choice([(“Red”), (“Blue”), “Orange”,…,(“Purple”)])
~~choice([“Red”, “Blue”, “Orange”, … ,“Purple”])
~~screen(“fruit_image”, _choice[1], )
~~action Return([“category”, “fruits”])
choice([(“Red”), (“Blue”), (“Orange”),…,(“Purple”)])

and after seeing your P.S. I converted the fancy qm’s to standard ascii:

~~list(["Apple", "Banana", "Orange"])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~choice(["Red", ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), "Blue", ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), "Orange",…,("Purple")])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])

but I still get no matches.

I didn’t analyze your search string, and I have no doubt it’s based on sound principles.

I am amazed to learn about the quotation marks getting altered. That’s another “gotcha” that warrants documentation in an easily found location! (Maybe it’s a feature that can be disabled.)

Also the codes for the qm’s you state are different from mine. I got mine (lazily) by running a conversion using the Converter plug-in, which I have not vetted for byte-level correctness against standard character tables. Yet another trap for the unwary?

guy038

Hello, @비공개, @alan-kilborn, @Neil-schipper and All,

Ah…OK, Neil. So I improved my regex S/R in order that it will not process anything if the double quotes are already preceded and followed with parentheses !

Here is the new version :

SEARCH (?-is)(?:~~choice|(?!\A)\G).+?\K(?<!\()"\w+"(?!\))

REPLACE \($0\)

Taking your INPUT text into account :

~~list(["Apple", "Banana", "Orange"])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~choice(["Red", ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), "Blue", ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), "Orange",…,("Purple")])
~~choice(["Red", "Blue", "Orange", … ,"Purple"])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])

It correctly changes it as below :

~~list(["Apple", "Banana", "Orange"])
~~choice([("Red"), ("Blue"), ("Orange"), … ,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])
~~choice([("Red"), ("Blue"), ("Orange"), … ,("Purple")])
~~screen("fruit_image", _choice[1], )
~~action Return(["category", "fruits"])
choice([("Red"), ("Blue"), ("Orange"),…,("Purple")])

Notes :

You must use only the Replace All button ( Do NOT click on the Replace button for successive replacements : it won’t work due to the \K syntax ! )
If you don’t tick the Wrap around option, preferably move the caret to the very beginning of the current file
This new version avoids the formation of forms such as ((((("text"))))), if you’re trying to execute this regex S/R several times !
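
For anyone who needs the same change outside Notepad++, here is a minimal Python sketch. It is not the regex above: Python's built-in re module supports neither \G nor \K, so the sketch swaps in a nested substitution (find each ~~choice( line, then wrap the bare words inside it). The names CHOICE_CALL, BARE_WORD and wrap_choice_words are made up for illustration, and the sample text is a shortened version of the test lines.

```python
import re

# Hypothetical names; this is a nested substitution, not the \G / \K technique.
CHOICE_CALL = re.compile(r'~~choice\(.*')       # from ~~choice( to the end of the line
BARE_WORD = re.compile(r'(?<!\()"\w+"(?!\))')   # "word" not already wrapped in ( )

def wrap_choice_words(text):
    """Wrap every bare "word" on a ~~choice( line in parentheses."""
    def fix_call(call):
        # Second pass: only runs on the text matched by CHOICE_CALL.
        return BARE_WORD.sub(lambda m: "(" + m.group(0) + ")", call.group(0))
    return CHOICE_CALL.sub(fix_call, text)

sample = "\n".join([
    '~~list(["Apple", "Banana", "Orange"])',
    '~~choice(["Red", ("Blue"), "Orange", … ,"Purple"])',
    '~~screen("fruit_image", _choice[1], )',
])
result = wrap_choice_words(sample)
print(result)
```

Only the ~~choice line changes, and running the function a second time leaves the text as it is, because the same two lookarounds skip words that are already wrapped.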

BR

guy038

Neil Schipper

@guy038 Following your instructions, this works exactly as you say.

Furthermore, I see now that your earlier search string also works with Replace All (which I hadn’t tried) but adds the unwanted extra sets of ().

Furthermore, I also see now that my original search string (my first post in this thread) also works with Replace All (which I also hadn’t tried) but with the requirement for successive runs to get the whole job done as I had stated.

It appears there’s something about \K that I don’t understand, unless it’s something not fully described in the docs but that was discovered by trial and error.

I do consider it a weakness (of both of our search strings) that single replaces don’t work.

비공개

There were many answers while I was sleeping. Thank you so much @guy038, @Neil-Schipper, @Alan-Kilborn and all!

OK, so @guy038, as soon as I woke up, I tried your method and it solved my problem perfectly!😀😀😀

The “words” I’m looking for aren’t made of just “word characters”, so I changed \w to [A-Za-z \-\.\!\?']. The example I gave was not appropriate. I’m sorry.

And please! I need your explanation.
Especially, I’m not sure what (?:choice|(?!\A)\G) and .+?\K" mean.

guy038

Hi, @비공개, @alan-kilborn, @Neil-schipper and All,

OK, @비공개, I’m going to give some pieces of information but, as always :

You have to know how to make cement before you can put two bricks together

You must know how to put two bricks together before building a wall

You must know how to build a wall before building a room

You must know how to build a room before building a house

and so on !

In other words, check this FAQ which gives you the main links to learn regular expressions, from top to bottom ;-))

Now, let’s go :

----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding MODIFIERS, generally met at BEGINNING of the regex, but which may occur at ANY location within the overall regex :

(?-i)   From this point, search     cares      about letter's CASE

(?i)    From this point, search  does NOT care about letter's CASE

	      
(?-s)   From this point, any regex dot symbol represents a SINGLE STANDARD character. So the . is equivalent to the NEGATED character class
              [^\r\n\f\x{0085}\x{2028}\x{2029}] for a Unicode encoded file and to [^\r\n\f] for an ANSI encoded file
	      
(?s)    From this point, any regex DOT symbol represents ABSOLUTELY ANY character, including all the LINE-ENDING chars
	      

(?-x)   From this point, any LITERAL SPACE character is SIGNIFICANT and is part of the overall regex ( IMPLICIT in a N++ regex )

(?x)    From this point, any LITERAL SPACE character is IGNORED  and just helps READABILITY of the overall regex.
             This mode is called FREE-SPACING mode and the regex can be SPLIT over SEVERAL lines. In this  mode :

             - Any SPACE char must be written [ ] or \x20  or escaped with a \ character
             - Any text, after a # symbol, will be considered as COMMENTS
             - Any literal # symbol must be written [#] or \x23 or escaped as \#

	      
(?-m)   From this point :
             - The regex symbol ^ represents only  the VERY BEGINNING of the file, so equivalent to the regex \A
             - The regex symbol $ represents only  the VERY END       of the file, so equivalent to the regex \z
	      
(?m)    From this point, the assertions ^ and $ represent their USUAL signification of START and END of line locations ( IMPLICIT in a N++ regex )
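
These modifiers are not specific to Notepad++. As a rough illustration, Python's re module accepts (?i), (?s), (?m) and (?x) at the start of a pattern. One difference to keep in mind is that Python leaves the multi-line mode OFF unless asked, whereas it is implicit in a N++ regex. The sample text and variable names below are invented for the demo.

```python
import re

text = "Red\nblue\nRED"

# (?i): the search ignores case; without it, case matters.
sensitive = re.findall(r"red", text)           # []
insensitive = re.findall(r"(?i)red", text)     # ['Red', 'RED']

# (?s): the dot also matches line endings; by default it does not.
dot_default = re.search(r"Red.blue", text)     # None
dot_all = re.search(r"(?s)Red.blue", text)     # matches 'Red\nblue'

# (?m): ^ and $ match at each line; without it, only at the ends of the text.
line_starts = re.findall(r"(?m)^\w+", text)    # ['Red', 'blue', 'RED']
text_start = re.findall(r"^\w+", text)         # ['Red']

# (?x): free-spacing mode; spaces and # comments in the pattern are ignored.
free_spacing = re.search(r"""(?x)
    b l u e     # the spaces between the letters do not count
""", text)
print(insensitive, line_starts, free_spacing.group(0))
```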

----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding GROUPS :

(•••••)    It defines a CAPTURING group which allows, both :

               - The regex engine to STORE the regex ENCLOSED part for FURTHER use, either in the SEARCH and/or the REPLACE part

               - The regex ENCLOSED part to be possibly REPEATED with a  QUANTIFIER, located right after

(?:•••••)  It defines a NON-CAPTURING group which only allows the regex ENCLOSED part to be REPEATED and which is **not** stored by the regex engine

Note that the MODIFIERS, described above, may be INCLUDED within the parentheses :

             - In a CAPTURING group as, for instance, ((?i)•••••) so that the INSENSITIVE search is RESTRICTED to the contents of this group, only

             - In a NON-CAPTURING group, TWO syntaxes are possible : for instance : (?:(?i)•••••) or the shorthand (?i:•••••)


CAPTURING groups can be RE-USED with the syntax :

    - \1   to \9     in the SEARCH and/or REPLACE regexes   for reference to group 1 to  9
    - $1   to $99    in the REPLACE regex ONLY              for reference to group 1 to 99
    - ${1} to ${99}  in the REPLACE regex ONLY              for reference to group 1 to 99

	    For instance, the ${1}5 syntax means contents of GROUP 1 , followed with digit 5 whereas the $15 syntax would have meant contents of GROUP 15
				  
    - $0 or ${0}     in the REPLACE regex ONLY              for reference to the OVERALL match of the SEARCH regex
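
The same ideas exist in other regex flavours with a slightly different spelling. As a hedged comparison, here is how they look in Python's re module, where the replacement side writes \g<1> instead of ${1} and \g<0> instead of $0. The sample strings are invented for the demo.

```python
import re

# \1 inside the SEARCH pattern: the same character twice in a row.
doubled = re.search(r"(\w)\1", "apple").group(0)            # 'pp'

# Python writes ${1}5 as \g<1>5: contents of group 1, then the digit 5.
renumbered = re.sub(r"(item)\d", r"\g<1>5", "item3")        # 'item5'

# Written as \15, Python looks for a group 15, which this pattern lacks.
try:
    re.sub(r"(item)\d", r"\15", "item3")
    group_15_found = True
except re.error:
    group_15_found = False

# Python writes $0 as \g<0>: the overall match, here wrapped in parentheses.
wrapped = re.sub(r'"\w+"', r"(\g<0>)", '["Red", "Blue"]')   # '[("Red"), ("Blue")]'
print(doubled, renumbered, group_15_found, wrapped)
```

Note that parentheses need no escaping in a Python replacement string, unlike in the N++ REPLACE zone.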

----------------------------------------------------------------------------------------------------------------------------------------------------------

Regarding QUANTIFIERS, 6 syntaxes are possible {n} , {n,}, {n,m}, ?, + and *. Note that :

    - {n}   EXACTLY n       times the character or group, PRECEDING the quantifier
    - {n,}  n or MORE       times the character or group, PRECEDING the quantifier
    - {n,m} BETWEEN n and m times the character or group, PRECEDING the quantifier

    - ? is equivalent to {0,1}
    - + is equivalent to {1,}
    - * is equivalent to {0,}

They are considered as GREEDY quantifiers because they match as MANY characters as possible

If these 6 syntaxes are followed with a QUESTION MARK ?, they are called LAZY quantifiers because they match as FEW characters as possible
    
For instance, given the following sentence :
                                                            The licenses for most software are designed to take away your freedom to share and change it
- Regex (?-s)e.+?ar, with the LAZY   quantifier +?, matches   ---------------------------
- Regex (?-s)e.+ar , with the GREEDY quantifier +,  matches   ---------------------------------------------------------------------------
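
The two matches underlined above can be reproduced with a short Python sketch. Python's dot does not match line endings by default, so the (?-s) modifier is simply left out here.

```python
import re

sentence = ("The licenses for most software are designed "
            "to take away your freedom to share and change it")

# Lazy: stop at the FIRST "ar" that can end the match (inside "software").
lazy = re.search(r"e.+?ar", sentence).group(0)

# Greedy: run to the LAST "ar" on the line (inside "share").
greedy = re.search(r"e.+ar", sentence).group(0)

print(lazy)    # e licenses for most softwar
print(greedy)  # e licenses for most software are designed to take away your freedom to shar
```

Both matches start at the e of "The"; only the end point differs.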


If these 6 syntaxes are followed with an ADDITION sign +, they are called POSSESSIVE quantifiers ( closely related to ATOMIC groups ).

   - They are quite similar to their GREEDY forms, except that, in case of failure, they don't backtrack to attempt further possible match(es)

   - Note that this ADVANCED option should be studied once you are well ACQUAINTED with regexes !

----------------------------------------------------------------------------------------------------------------------------------------------------------

BTW, a quick tip to SIMULATE a NORMAL search when the REGULAR EXPRESSION mode is selected : START the search zone with the \Q syntax :

    For instance, the regex \Q/* This is a C-comment */ will find the LITERAL string  /* This is a C-comment */
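
Not every regex flavour has \Q. In Python's re module, for example, the closest tool I know of is the re.escape function, which backslash-escapes the special characters of a literal string. A small sketch, with an invented source line:

```python
import re

literal = "/* This is a C-comment */"
source = "int x; /* This is a C-comment */ int y;"

# Unescaped, the * characters act as quantifiers and the search fails.
unescaped = re.search(literal, source)

# re.escape turns the literal into a pattern that matches it exactly.
escaped = re.search(re.escape(literal), source)

print(unescaped)          # None
print(escaped.group(0))   # /* This is a C-comment */
```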

Now, I will rewrite my last regex, with your improvement, in the Free-Spacing mode :

----------------------------------------------------------------------------------------------------------------------------------------------------------

(?x-is)                #  FREE-SPACING mode, search SENSITIVE to CASE and DOT regex symbol represents a SINGLE STANDARD char
(?:                    #  BEGINNING of a NON-CAPTURING group
~~choice               #       Matches the string ~~choice, with this EXACT case
|                      #    OR ( ALTERNATION symbol )
(?!\A)\G               #       Matches from RIGHT AFTER the location of the LAST match, IF NOT at the VERY BEGINNING of the file
)                      #  END of the NON-CAPTURING group
.+?                    #  The SMALLEST NON-NULL range of STANDARD characters till...
\K                     #  CURRENT match is DISCARDED and working location is RESET to this POINT
(?<!\()                #  ONLY if it's NOT PRECEDED with a STARTING parenthesis symbol
"[!'.?\w-]+"           #  ... a NON-NULL range of WORD chars or the characters !, ', ., ? and -
(?!\))                 #  ONLY if it's NOT FOLLOWED with an ENDING parenthesis symbol


NOTES :

- This syntax is totally FUNCTIONAL. To be convinced do a NORMAL selection from (?x-is) to ENDING parenthesis symbol and hit the Ctrl + F shortcut
     => This MULTI- lines regex is AUTOMATICALLY inserted in the 'Search what' zone 

- The \G assertion means that the NEXT search must start, necessarily, RIGHT AFTER the LAST match !

- I rewrote your regex part [A-Za-z \-\.\!\?'] as [!'.?\w-] because most of the punctuation signs do NOT need to be ESCAPED, within a character CLASS.

    - However note that the DASH - must be found at the VERY BEGINNING or the VERY END of the character class, when NON escaped
    - I prefer the \w syntax to [A-Za-z] because \w also INCLUDES all the ACCENTED characters of foreign languages

- You must use ONLY the REPLACE ALL button ( Do NOT click on the REPLACE button for SUCCESSIVE replacements : it won't work due to the \K syntax ! )

- If you don't tick the WRAP AROUND option, preferably move the CARET to the VERY BEGINNING of the current file

- From BEGINNING of file, as the regex engine must SKIP some LINE-ENDING characters to get a match, the \G assertion is NOT verified

    and the regex engine must necessarily look, FIRST, for a string ~~choice

- Then, from RIGHT AFTER the word choice, it grasps the SMALLEST NON-NULL range of STANDARD chars .+? till a "•••••" structure, but ONLY IF NOT embedded
      between PARENTHESES itself !

- And, due to the \K syntax, ONLY the part "•••••" is the FINAL match desired

- This FINAL part is changed with the REPLACE regex \($0\) which just rewrites the string "•••••" between PARENTHESES.
    The parenthesis symbols must be ESCAPED as they have a SPECIAL signification in REPLACEMENT

- Then, from RIGHT AFTER the closing " char, as the regex CANNOT find any other ~~choice string, the (?!\A)\G.+? part, again, selects the SMALLEST NON-NULL
    range of STANDARD characters till ANOTHER block "•••••", executes the REPLACEMENT and so on...
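
The two character classes mentioned in the notes above, [A-Za-z \-\.\!\?'] and [!'.?\w-], are not exactly interchangeable, and the difference is easy to check. Here is a small Python sketch; it assumes Python's Unicode-aware \w is close enough to the N++ behaviour for this purpose, and the test strings are invented.

```python
import re

original = re.compile(r"[A-Za-z \-\.\!\?']+")   # class from the question
rewritten = re.compile(r"[!'.?\w-]+")           # class from the free-spacing regex

# Accented letters: only \w accepts them.
accent_original = original.fullmatch("Café-au-lait!") is not None     # False
accent_rewritten = rewritten.fullmatch("Café-au-lait!") is not None   # True

# Spaces: only the original class lists the space character.
space_original = original.fullmatch("Dark Red") is not None           # True
space_rewritten = rewritten.fullmatch("Dark Red") is not None         # False

print(accent_original, accent_rewritten, space_original, space_rewritten)
```

So if the quoted words can contain spaces, it may be worth adding the space back to the class ( written \x20 in free-spacing mode ).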

In the examples below, in each second line ( Regex types ) :

The dot . represents any char, found by the regex dummy part .+?
The bullet • represents any char, found by the regex useful part [!'.?\w-]
The character " and the string ~~choice stand for themselves

Text  processed          ~~choice(["Red", ("Blue"), ("Orange"), … ,"Purple"])
Regex types              ~~choice.."•••"..........................."••••••"
Match number BEFORE \K   1111111111     222222222222222222222222222
Match number AFTER  \K             11111                           22222222


Text  processed          ~~choice([("Red"), "Blue", "Orange", … ,("Purple")])
Regex types              ~~choice..........."••••".."••••••"
Match number BEFORE \K   1111111111111111111      22        
Match number AFTER  \K                      111111  22222222


Text  processed          ~~choice(["Red", "Blue", "Orange", … ,"Purple"])
Regex types              ~~choice.."•••".."••••".."••••••"....."••••••"
Match number BEFORE \K   1111111111     22      33        44444        
Match number AFTER  \K             11111  222222  33333333     44444444

I hope that you’ll find this article useful, in any way !

However, let me add that the \G and \K assertions, as well as atomic groups and recursive regexes or backtracking verbs ( not discussed ), are difficult notions and I can assure you that there are a LOT of regex things that you need to know before starting to use them !

Best Regards,

guy038