Generic Regex: Replacing in a specific zone of text

guy038

This regex search/replace (S/R) restricts a replacement to a specific zone of text, possibly repeated, spanning one or several consecutive lines.

This is particularly useful with XML or HTML, when you need to modify text only within a specific start-tag and end-tag range.

Let FR (Find Regex) be the regex which defines the char, string or expression to be searched for
Let RR (Replacement Regex) be the regex which defines the char, string or expression that must replace the FR match
Let BSR (Begin Search-region Regex) be the regex which defines the beginning of the area where the search for FR must start
Let ESR (End Search-region Regex) be the regex which defines the end of the area where the search for FR must stop

Then, the generic regex can be expressed as :

SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)

REPLACE RR

When the BSR and all the matches of the FR regex are located on a single line, any line-ending char(s) will implicitly break the \G chain, because the dot cannot match them. The ESR part is then unnecessary and the generic regex can be simplified into :

SEARCH (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)

REPLACE RR
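
For readers who want to check the logic outside Notepad++, here is a minimal Python sketch of what the generic S/R achieves. It is not a port of the regex: Python's standard `re` module supports neither `\G` nor `\K`, so the sketch swaps in a two-pass approach (find each zone, then replace inside it only). The name `replace_in_zones` and its parameters are hypothetical, chosen for illustration. Note also that the replacement string follows Python's template rules, so a plain space is used instead of `\x20`.

```python
import re

def replace_in_zones(text, bsr, esr, fr, rr):
    """Replace every FR match with RR, but only between BSR and the next ESR.

    Hypothetical helper: a two-pass approximation of the \\G / \\K regex,
    not the Notepad++ (Boost) regex itself.
    """
    # A zone starts right after a BSR match and runs up to the next ESR match
    # (or to the end of the text if no ESR follows). The scoped (?s:...) flag
    # lets the dot cross line endings inside the zone only.
    zone = re.compile(
        r"(?P<bsr>%s)(?P<body>(?s:(?:(?!%s).)*))" % (bsr, esr)
    )

    def fix(m):
        # Keep BSR untouched, apply the FR -> RR replacement to the zone body
        return m.group("bsr") + re.sub(fr, rr, m.group("body"))

    return zone.sub(fix, text)

# Dashes are replaced inside <b>...</b> only
print(replace_in_zones("<a>x-y</a> <b>x-y</b>", "<b>", "</b>", "-+", " "))
```

The two-pass form is easier to read than the single regex, but it only approximates it: an FR match that would straddle the ESR boundary, for instance, is handled differently.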

IMPORTANT :

You must use at least the v7.9.1 N++ release, so that the \A assertion is correctly handled
You must move the caret to the very beginning of the current file ( Ctrl + Home )
If you perform a simple search, without any replacement, just click the Find Next button several times to see the different zones that the future replacement will affect
As soon as a replacement is needed, you must use the Replace All button exclusively. It performs a global replacement over the entire file

NOTES :

Each non-capturing group, relative to the BSR, ESR and FR regexes, may be prefixed with the s or -s modifiers :
- If the BSR and/or ESR and/or FR regexes may match EOL characters, use the s modifier in the appropriate non-capturing group(s)
- If the BSR and/or ESR and/or FR regexes do not match EOL characters, use the -s modifier in the appropriate non-capturing group(s)
Each non-capturing group, relative to the BSR, ESR and FR regexes, may be prefixed with the i or -i modifiers :
- If the BSR and/or ESR and/or FR regexes are sensitive to case, use the -i modifier in the appropriate non-capturing group(s)
- If the BSR and/or ESR and/or FR regexes are insensitive to case, use the i modifier in the appropriate non-capturing group(s)
Of course, these modifiers may not be necessary ( for instance in case of search of an exact string or search of non-letter characters )
Note that the generic regexes, above, show the case where :
- These two generic regexes are sensitive to case => The -i modifier is present everywhere in the definitions
- The ESR region of the first regex may span several lines => The s modifier is present in the ESR non-capturing group
The FR regex may define a group, between parentheses, which will be re-used in the RR regex with the \# or ${#} syntaxes, where # represents an integer
The RR regex may contain the $0 syntax, which refers to the whole match (here, the text matched by FR, since \K discards everything matched before it), or re-use a group previously defined in the FR regex

Below are two examples to illustrate how to build real regex S/R from these generic ones !

First, let’s imagine that you want to delete any part within parentheses, but only in ranges of text <descrip>............</descrip>, each located on a single line

Paste the XML text, below, in a new tab :

<iden>123456 (START)</iden>
<name>Case_1</name>
<descrip>This is a (short) text to (easily) see the results (of the modifications)</descrip>
<param>val (250)</param>

<iden>123456</iden>
<name>Case_2</name>
<descrip>And the (obvious) changes occur only in (the) "descrip" tag</descrip>
<param>val (500)</param>

<iden>123456 (END)</iden>
<name>Case_3</name>
<descrip>All (the) other tags are (just) untouched</descrip>
<param>val (999)</param>

As all the parts to delete are contained in a single line, we can use the simplified formulation :
- SEARCH (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)
- REPLACE RR
Obviously, as we want to delete text, the RR regex is empty. So, the Replace with field will be left empty
Now, the FR regex represents a space char followed by the shortest text between parentheses => FR = (?:\x20\(.+?\)) We do not need any case modifier as this regex does not refer to letters !
The BSR regex is simply the literal string <descrip>, with this exact case. So BSR = (?-i:<descrip>)

Finally, the functional regex S/R to use is :

SEARCH (?-s)(?-i:<descrip>|(?!\A)\G).*?\K(?:\x20\(.+?\))
REPLACE Leave EMPTY
Open the Replace dialog Ctrl + H
Untick all options
Select the Regular expression search mode
Move to the very beginning of current file ( Ctrl + Home )
Click the Find Next button several times to verify that the FR regex does match what you want ! In the present case, it matches a space followed by text between parentheses
Again, move to the very beginning of current file ( Ctrl + Home )
Click, once only, on the Replace All button

=> As expected, all text between parentheses, of the <descrip> tag only, has been deleted, but the other parentheses, present in other tags, are untouched !
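
As a cross-check of this first example outside Notepad++, here is a small Python sketch. It assumes the standard `re` module, which lacks `\G` and `\K`, so it does not run the regex above as is: it first isolates what follows `<descrip>` on the same line, then deletes the parenthesised parts there. The function name `strip_parens_in_descrip` is hypothetical, and the sample is a shortened copy of the text above.

```python
import re

def strip_parens_in_descrip(text):
    """Delete ' (....)' parts, but only after <descrip> on the same line."""
    def clean(m):
        # FR: a space followed by the shortest text between parentheses
        return m.group(1) + re.sub(r"\x20\(.+?\)", "", m.group(2))

    # Zone: everything after <descrip> up to the end of that line.
    # Without the DOTALL flag, '.' stops at the line ending, like (?-s).
    return re.sub(r"(<descrip>)(.*)", clean, text)

sample = (
    "<iden>123456 (START)</iden>\n"
    "<descrip>This is a (short) text to (easily) see the results (of the modifications)</descrip>\n"
    "<param>val (250)</param>\n"
)
print(strip_parens_in_descrip(sample))
```

Only the `<descrip>` line changes; the `<iden>` and `<param>` lines keep their parentheses.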

In the second example, we’ll try to replace any run of consecutive dash characters with a single space char in any range <text>..........</text>, possibly split over several lines

Paste the following XML text in a new tab

<val>37--001</val>
<text>This-is
-a</text>
<pos>4-1234</pos>

<val>37--002</val>
<text>-small---example</text>
<pos>9-0012</pos>


<val>37--003</val>
<text>-of-text-
which-</text>
<pos>1-9999</pos>


<val>37--004</val>

<text>need
-to-be-
modi
fied</text>

<pos>0-0000</pos>

As, this time, the <text>..........</text> may be spread over several lines, we’ll use the first generic regex :
- SEARCH (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR)
- REPLACE RR
Obviously, the RR regex is simply \x20
Now, the FR regex represents one or more consecutive dashes => FR is just -+, as the non-capturing group is not needed here
The BSR regex is simply the literal string <text>, with this exact case => BSR = (?-si:<text>)
The ESR regex is the literal string </text>, with this exact case. So the ESR part, within its non-capturing group, is (?s-i:(?!</text>).)

Then, the real regex S/R to use is :

SEARCH (?-si:<text>|(?!\A)\G)(?s-i:(?!</text>).)*?\K-+
REPLACE \x20
Open the Replace dialog Ctrl + H
Untick all options
Select the Regular expression search mode
Move to the very beginning of current file ( Ctrl + Home )
Click the Find Next button several times to verify that the FR regex does match what you want ! In the present case, it matches any consecutive range of dash chars
Again, move to the very beginning of current file ( Ctrl + Home )
Click, once only, on the Replace All button

=> As expected, all ranges of consecutive dashes, in the <text> tags only, have been replaced with a single space char and the other dash characters, present in other tags, are kept !

guy038

Two other examples regarding this generic regex ! In these ones, we’ll even restrict the replacements to the part of each zone which precedes a # character !

Paste the text below in a new tab :

<iden>123456 (START)</iden>
<name>Case_1</name>
<descrip>This is a (short) text to (easily) see the results (of the modifications)# (12345) test (67890)</descrip>
<param>val (250)</param>

<iden>123456</iden>
<name>Case_2</name>
<descrip>And the (obvious) changes occur only in (the) "descrip" tag # Parentheses (Yeaah) OK</descrip>
<param>val (500)</param>

<iden>123456 (END)</iden>
<name>Case_3</name>
<descrip>All (the) other tags are (just) untouched #(This is) the end (of the test)</descrip>
<param>val (999)</param>

In this first example, with single-line <descrip> tags, two solutions are possible :

Use the complete generic regex (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR) where ESR = # which leads to the functional S/R :
- SEARCH (?-s)(?-i:<descrip>|(?!\A)\G)((?!#).)*?\K(?:\x20\(.+?\))
- REPLACE Leave EMPTY

=> This time, in addition to replacing only within each <descrip>..........</descrip> zone, NO replacement will occur after the # character of each <descrip> tag !

Use the simplified solution and add an ESR condition at the end of the regex, giving this generic variant (?-s)(?-i:BSR|(?!\A)\G).*?\K(?-i:FR)(?=ESR)
- SEARCH (?-s)(?-i:<descrip>|(?!\A)\G).*?\K(?:\x20\(.+?\))(?=.*#)
- REPLACE Leave EMPTY

However, this other solution requires every <descrip> tag to contain a comment zone with a # char

Now, paste this other text below in a new tab :

<val>37--001</val>
<text>This-is
-a--very--- # Dashes - - - OK</text>
<pos>4-1234</pos>

<val>37--002</val>
<text>-small----#---example</text>
<pos>9-0012</pos>


<val>37--003</val>
<text>-of-a-text-
which-</text>
<pos>1-9999</pos>


<val>37--004</val>

<text>need
-to-be-
modi
fied # but - not - there</text>

<pos>0-0000</pos>

This second example is a multi-line replacement, in each <text>.............</text> zone only, and also limited to the part before a # char, which may or may not be present

Of course, we’ll have to use the complete generic regex (?-si:BSR|(?!\A)\G)(?s-i:(?!ESR).)*?\K(?-si:FR) but, instead of a single (?!ESR), we’ll have to use this variant :

(?-si:BSR|(?!\A)\G)(?s-i:(?!ESR_1)(?!ESR_2).)*?\K(?-si:FR)

So, the functional regex S/R becomes :

SEARCH (?-si:<text>|(?!\A)\G)(?s-i:(?!</text>)(?!#).)*?\K-+
REPLACE \x20

=> ONLY IF a sequence of dashes is located in a <text>..........</text> zone AND, moreover, before a possible # char, it will be replaced with a single space character

As you can verify, the third multi-line <text>.............</text> zone does not contain any # char. Thus, all dash characters of that <text> tag are replaced with a single space char !
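
The effect of the two negative look-aheads can also be checked with a short Python sketch. Again, this is an approximation under stated assumptions rather than the Notepad++ regex: the standard `re` module has no `\G` or `\K`, so the zone is captured first and the dashes are replaced inside it afterwards. The names `ZONE` and `dashes_to_space` are hypothetical.

```python
import re

# A zone starts after <text> and stops before the first </text> or '#',
# whichever comes first. DOTALL lets the zone span several lines.
ZONE = re.compile(r"(<text>)((?:(?!</text>)(?!#).)*)", re.DOTALL)

def dashes_to_space(text):
    """Replace each run of dashes with one space, inside the zone only."""
    return ZONE.sub(
        lambda m: m.group(1) + re.sub(r"-+", " ", m.group(2)),
        text,
    )

sample = (
    "<val>37--002</val>\n"
    "<text>-small----#---example</text>\n"
    "<pos>9-0012</pos>\n"
)
print(dashes_to_space(sample))
```

In this sample, the dashes after the # char and those in the <val> and <pos> tags are left alone, which matches the behaviour described above.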

Reminder :

You must use, at least, the v7.9.1 N++ release, so that the \A assertion is correctly handled
Move to the very beginning of file, before any Find Next sequence or Replace All operation
Do not click on the step-by-step Replace button