Find-in-FIles: Can’t Replace Multiple Instances of Word

Coises

You’ve found another glitch:

If there are no matches following lyrics-text, the expression we’ve suggested will match from the beginning of the file.

All three matches are in the head section of the document. There are no matches after lyrics-text, because the word is hyphenated in the lyrics text.

Sylvester Bullitt

@Coises Is the glitch in the regular expression itself, or in the regex engine?

PeterJones

@Coises said in Find-in-FIles: Can’t Replace Multiple Instances of Word:

No regular expression thread is finished until @guy038 drops in to tell us that there’s a better way to do it.

No kidding. But with the most recent failure, I think @guy038’s FAQ has already given us the solution that we should have used, if the OP had not stated it as an XY-Problem.

Looking at the example data, I think the better problem statement would be "please replace all instances of WORD_TO_FIND between the start <section class="lyrics" and end </section>". With that set of rules, just use the Generic Regex Formula > Replace in a specific zone of text, with FR = \bWORD_TO_FIND\b, and my “start” and “end” a sentence ago are the BSR and ESR. (Though you might have to add your (?<!^)(?<!<p>)(?<!<p class="chorus">) restrictions to the FR.)
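For anyone who would rather script the same idea outside Notepad++, here is a minimal Python sketch of "replace only inside a zone". It is not the Generic Regex Formula itself: Python's standard re module has no \G or \K, so the sketch swaps in a two-step approach (find each zone first, then replace inside it). The function name replace_in_lyrics and the sample string are made up for illustration, it assumes the opening tag is exactly <section class="lyrics">, and it leaves out the start-of-line and <p> restrictions.

```python
import re

# Zone delimiters taken from the problem statement above (assumed exact spelling).
ZONE = re.compile(r'(<section class="lyrics">)(.*?)(</section>)', re.DOTALL)

def replace_in_lyrics(html, word, replacement):
    """Replace whole-word `word` only between the zone delimiters (hypothetical helper)."""
    target = re.compile(r"\b" + re.escape(word) + r"\b")

    def fix_zone(m):
        # Groups 1 and 3 are the delimiters and are copied through unchanged;
        # only group 2, the text between them, is edited.
        body = target.sub(lambda _: replacement, m.group(2))
        return m.group(1) + body + m.group(3)

    return ZONE.sub(fix_zone, html)

sample = ('<head><title>savior</title></head>'
          '<section class="lyrics"><p>My savior lives</p></section>')
result = replace_in_lyrics(sample, "savior", "Savior")
print(result)
```

Only the occurrence inside the lyrics section changes; the one in the head is left alone because it never falls inside a matched zone.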

Is the glitch in the regular expression itself, or in the regex engine?

The “glitch” is the expectation that one can safely edit HTML with regex (see FAQ). Since I’m sure you’ll insist on it anyway, then dealing with glitches is something you must expect, and that you must start putting effort into.

We’ve gone above-and-beyond in getting it working this well for you. At this point, it’s really time for you to start reading the same documentation that we’re reading, and try to figure it out on your own.

Coises

@Sylvester-Bullitt said in Find-in-FIles: Can’t Replace Multiple Instances of Word:

@Coises Is the glitch in the regular expression itself, or in the regex engine?

The expression. After testing, it appears that my variant, which avoids matching files that don’t contain lyrics-text, also fixes this problem. Applying a small simplification to the previous version (the \G was unnecessary) this:

(?s)(\A.*?(lyrics-text|\Z(*COMMIT)(*FAIL)).*?|)(?<!^)(?<!<p>)(?<!<p class="chorus">)\Ksavior(?=(.+?</div>))

matches nothing, as it should; this:

(?s)(\A.*?(lyrics-text|\Z(*COMMIT)(*FAIL)).*?|)(?<!^)(?<!<p>)(?<!<p class="chorus">)\Ksav\xADior(?=(.+?</div>))

matches the single occurrence of the word (which is hyphenated using a “soft hyphen”) in the lyrics, on line 63.
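To see why the plain word finds nothing in the lyrics, here is a small Python sketch. The soft hyphen (U+00AD, written \xAD in the regex above) is invisible in most views but is still a character in the text, so a literal search for the word skips over it. The helper soft_hyphen_pattern is a hypothetical name, and allowing an optional soft hyphen between every pair of letters is my assumption about how one might generalize the search, not something taken from the expressions above.

```python
import re

SOFT_HYPHEN = "\u00ad"  # same character as \xAD in the regex above

def soft_hyphen_pattern(word):
    """Match `word` with an optional soft hyphen between any two letters (hypothetical helper)."""
    letters = [re.escape(ch) for ch in word]
    return re.compile((SOFT_HYPHEN + "?").join(letters))

line = "my sav\u00adior lives"  # made-up lyrics line containing a soft hyphen

plain_hit = "savior" in line                                    # literal search
tolerant_hit = bool(soft_hyphen_pattern("savior").search(line)) # soft-hyphen-aware search

print(plain_hit)     # False
print(tolerant_hit)  # True
```

The same pattern still matches the unhyphenated spelling, since every soft hyphen in it is optional.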

Sylvester Bullitt

@Coises As Peter suggested a few minutes ago, we’re now trying the approach shown at https://community.notepad-plus-plus.org/topic/22690/generic-regex-replacing-in-a-specific-zone-of-text

Based on the example there, we’re testing this regex:

(?-si:<section class="lyrics">|(?!\A)\G)(?s-i:(?!</section>).)*?(?<!^)(?<!<p>)(?<!<p class="chorus">)\K(?-si:\bWORD_TO_FIND\b)

We also discovered a blemish in our previous version: The quote marks around the word “chorus” were typographical, and should have been standard typewriter-style quotes (").

We’ve been testing the regex above on live Web site files, and so far things are going well (fingers crossed as tightly as ever!).

guy038

Hello, @coises, @sylvester-bullitt, @peterjones and All,

@coises, you said in a previous post :

No regular expression thread is finished until @guy038 drops in to tell us that there’s a better way to do it.

Well, many thanks, @coises, for your kind words, but I, definitively, do not deserve this honor, because you and some other people could easily be included in this list !

I noticed that, given the large number of regex solutions, that most of us have been proposing for some time now, we’re getting fewer questions on this subject. To my mind, this means that the general level of N++ users, regarding the regex world, is increasing which is, globally, a good thing for a better N++ use, along with the other script solutions and their workflow !

I suppose, that the regex section, described in the @peterjones’s official documentation, did help some of us, too, from time to time !

BTW, I did not drop in this discussion, but the generic regex, suggested by @sylvester-bullitt, in his last post, seems to be the right solution.

Best Regards,

guy038

Sylvester Bullitt

@guy038 First, let me thank you for the work you’ve done on helping develop generic regex solutions. And you’re right, the solution I mentioned yesterday, which I was testing, seemed to be very promising.

However, I woke up in the middle of last night, and realized that we may have overlooked a potential pitfall. As you may remember, my ultimate objective was to modify texts in song lyrics, and the generic regex on the Notepad++ site seems to be an ideal fit for my use case.

Though it seems to be working well so far, I’m wondering if we overlooked one thing: the search term might be part of a hyperlink URL, and thus should not be changed. I’m running a hyperlink report on the Web site now to see if any of the links have been broken. I don’t know the answer yet, but I should know within the next hour.

If the regex did indeed match/modify/break some URLs, I plan to add a negative lookahead to exclude matches which precede .htm or an underscore. Hopefully that will be enough to prevent us from inadvertently changing links.

Have you run into the issue of breaking HTML links with a regex search-and-replace before?

Sylvester Bullitt

@Sylvester-Bullitt Got done generating broken link report.

THE GOOD NEWS: My regex didn’t break any links

THE BAD NEWS: I just got lucky. Some further testing revealed that my regex would have broken links, if I had had the bad luck to use a search term that also appears in a hyperlink URL.

So, I added lookaheads to ignore matches of underscores and .htm, and it seems to work. In case anyone’s interested, here’s the new-and-improved regex, with some comments added for clarity:

(?-si:<section class="lyrics">|(?!\A)\G)(?s-i:(?!</section>).)*?(?#Not at start of line or para)(?<!^)(?<!<p>)(?<!<p class="chorus">)\K(?-si:\bWORD_TO_FIND\b(?#Not in hyperlink)(?!(\.htm))(?!_))
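If anyone wants to sanity-check the hyperlink guards away from live files, here is a rough Python sketch. It tests only the tail of the regex above (the whole-word match plus the two negative lookaheads), because Python's standard re module does not support \G or \K. The helper name tail_matches and the sample strings are made up for illustration.

```python
import re

def tail_matches(word, text):
    """True if the word-plus-lookaheads tail matches anywhere in `text` (hypothetical helper)."""
    tail = re.compile(r"\b" + re.escape(word) + r"\b(?!\.htm)(?!_)")
    return tail.search(text) is not None

in_lyrics = tail_matches("savior", "Our savior lives")                  # ordinary text
in_htm_link = tail_matches("savior", '<a href="savior.htm">')           # word directly before .htm
in_underscore = tail_matches("savior", '<a href="savior_king.htm">')    # word directly before _
in_dashed = tail_matches("savior", '<a href="my-savior-song.htm">')     # word between hyphens

print(in_lyrics, in_htm_link, in_underscore, in_dashed)
```

In this sketch the first case matches and the next two do not. The underscore case is already rejected by \b, since an underscore counts as a word character. The last case still matches, because the word is followed by a hyphen rather than .htm or an underscore, so a link report after each run remains worthwhile if file names like that exist.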