Search accented and non-accented characters alike with one simple setting?
-
Hi there,
For whatever reason, Windows has decided in Notepad’s recent update to disable editing .m3u8 files with it, and so for the first time in my life, and with great delay, I’ve installed Notepad++.
How can I conveniently search for a non-accented character and have Notepad++ show results for the accented characters as well? (For example, if I search for ‘premier’, it would also locate ‘Premiére’ within the file, and vice versa, as it does in Windows Notepad.)
I have looked it up and found this: https://community.notepad-plus-plus.org/topic/22938/search-for-accented-words/6 However, I humbly failed to understand it, as I am not very techy; moreover, I find it cumbersome to enter that long line noted in the reply to the above thread every time I want to search a file for all phrases, accented or not.
Thank you all
-
@B said in Search accented and non-accented characters alike with one simple setting?:
I humbly failed to understand as I am not very techy
Anyplace in your FIND string that you wanted to match a normal or accented character, put the normal character inside the weird sequence, like searching for `o` or any accented `o` using `[[=o=]]`… It’s cumbersome, but unless someone figures out how to implement the official feature request (which no one has taken a stab at in 4 years, so not very likely), it’s your only choice for now. You will also need to choose Search Mode = `Regular Expression` for those to work.

For example, to search for all variants of `premier`, you would use `pr[[=e=]]m[[=i=]][[=e=]]r`. If your m3u8 files have other characters with accents or equivalents (like `ñ`), then it might just be easiest to use `[[=p=]][[=r=]][[=e=]][[=m=]][[=i=]][[=e=]][[=r=]]` – just copy `=]][[=` and paste it between every letter in your search, then do the starting and ending half-sequences around it. Cumbersome, yes. But it works.

(The library that Notepad++ uses for the searching doesn’t have an “accent-insensitive search” option, so Notepad++ would have to code its own wrapper, which isn’t likely to work intuitively, nor be efficient; as such, I am doubtful that feature would ever be implemented if the library (a completely separate project, which Notepad++ does not control) doesn’t; and Notepad++ cannot force the library to add that feature.)
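The copy-and-paste construction above can also be generated mechanically. As a rough illustration (a hypothetical helper, nothing built into Notepad++), here is a tiny Python sketch that wraps every character of a plain search string in the equivalence-class syntax; the resulting pattern is what you would paste into the Find field with Search Mode = Regular Expression:

```python
def to_equivalence_regex(text):
    # Wrap each character c of the search string in [[=c=]], the
    # POSIX equivalence-class syntax, so the regex search also
    # matches accented variants of each letter.
    return ''.join('[[=%s=]]' % ch for ch in text)

print(to_equivalence_regex('premier'))
# [[=p=]][[=r=]][[=e=]][[=m=]][[=i=]][[=e=]][[=r=]]
```

Note that Python is only used here to build the pattern string; the matching itself would still be done by the Boost engine inside Notepad++.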
That said, @Coises has a Columns++ plugin for Notepad++, where he’s working diligently on making his search all-powerful (trying to get better / more-complete Unicode support, etc). And he’s very open to feature requests. I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode option in his plugin’s search (this recommends KA/KD mode, though I don’t know enough about such things, so I’d leave it in his capable hands). If he does, he’ll chime in here once it’s ready for at least testing. (Or maybe, if you use his Unicode Normalize plugin, it would make the search easier to do with N++ existing regex; again, I don’t know, and he could chime in.)
edit: I fixed `==` to `=` throughout
-
@PeterJones said in Search accented and non-accented characters alike with one simple setting?:
I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode
[…]
(Or maybe, if you use his Unicode Normalize plugin, it would make the search easier to do with N++ existing regex; again, I don’t know, and he could chime in.)

The problematic part of this is that Unicode characters aren’t always a single code point (“decomposed” vs “pre-composed” characters). Making something that would usually work probably wouldn’t be too hard, but making it reliable in less common cases would be tricky. If you normalize the contents of the file, you’ve changed it; and if you keep the normalization in a separate string, there’s no straightforward way to know which characters in the original string correspond to the ones you’ve found in the normalized string.
The following will probably only make sense to people who know C++:
The practical solution, I think, would be to create another set of iterators, like Notepad++ already has for ANSI and UTF-8. (Columns++ has three, one for single byte character sets, one for double byte character sets, and one for UTF-8.) Instead of dereferencing to the actual code points represented in the file, these iterators would dereference to compatibility compositions or decompositions. Then the regex engine would be comparing to the compatibility code point(s) instead of the ones in the file, but by using an iterator it would still “know where it is” in the file and wouldn’t be changing the file itself.
It looks as if ICU contains iterators for normalization. It might be possible to use them or to learn from them.
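For readers who don’t follow the C++, the position-mapping idea can be illustrated in a few lines of Python (a toy sketch of the general approach, not how Columns++ or ICU actually do it): strip accents into a normalized copy while remembering, for each normalized character, which original index it came from, then search the copy and translate the hit back into original coordinates:

```python
import unicodedata

def accent_insensitive_find(haystack, needle):
    """Find `needle` in `haystack`, ignoring accents and case.

    Returns (start, end) offsets into the ORIGINAL haystack,
    or None if there is no match.
    """
    norm_chars = []   # accent-stripped, lowercased characters
    index_map = []    # position in norm -> position in haystack
    for i, ch in enumerate(haystack):
        for d in unicodedata.normalize('NFD', ch):
            if not unicodedata.combining(d):   # drop accent marks
                norm_chars.append(d.lower())
                index_map.append(i)
    norm = ''.join(norm_chars)
    target = ''.join(d.lower()
                     for d in unicodedata.normalize('NFD', needle)
                     if not unicodedata.combining(d))
    pos = norm.find(target)
    if pos < 0:
        return None
    # Map the match back into original-string coordinates.
    start = index_map[pos]
    end = index_map[pos + len(target) - 1] + 1
    return (start, end)
```

Because the index map is kept alongside the normalized copy, the file itself is never modified, which is exactly what the iterator approach achieves inside the regex engine.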
-
Hello, @b, @peterjones, @coises and All,
Peter, in your last post you said about @coises and his plugin :
I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode option in his plugin’s search.

I agree with you and, to my mind, it would not be very difficult to implement!
The main rule would be to change any single character `C` of the search field, different from `\r` and `\n`, into the `[[=C=]]` syntax, whatever the status of the current char: letter, digit, symbol or blank. Refer to my post https://community.notepad-plus-plus.org/post/104378

Perhaps it would be wise to define an upper limit which would allow this new option: ONLY IF the search field contains just a single line, shorter than, let’s say, `100` or `200` characters?!

Indeed, if we insert an `LF` character in the middle of the `[[==]]` syntax, the `Columns++` plugin does detect an `LF` character or the combination `CRLF`. However, the insertion of the `CR` character is invalid within the `Columns++` plugin.

For example, the INPUT texts:

`first word    cömparÅciõn` and
`2th word Compâraciøn`

would produce the OUTPUT texts:

`[[=f=]][[=i=]][[=r=]][[=s=]][[=t=]][[= =]][[=w=]][[=o=]][[=r=]][[=d=]][[= =]][[= =]][[= =]][[= =]][[=c=]][[=ö=]][[=m=]][[=p=]][[=a=]][[=r=]][[=Å=]][[=c=]][[=i=]][[=õ=]][[=n=]]` and
`[[=2=]][[=t=]][[=h=]][[= =]][[=w=]][[=o=]][[=r=]][[=d=]][[= =]][[=C=]][[=o=]][[=m=]][[=p=]][[=â=]][[=r=]][[=a=]][[=c=]][[=i=]][[=ø=]][[=n=]]`

You can verify that, with either the `Boost` N++ search or the `Columns++` plugin, the two OUTPUT regexes above do match the two INPUT texts!
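The same transformation can be expressed as a throwaway Python helper (hypothetical, just to show the rule; the generated pattern is meant for the Boost / Columns++ Regular Expression search, not for Python’s own re engine):

```python
def wrap_in_equivalence_classes(search):
    # The rule sketched above: every character except \r and \n --
    # letter, digit, symbol or blank alike -- becomes [[=c=]];
    # line-break characters pass through unchanged.
    return ''.join(ch if ch in '\r\n' else '[[=%s=]]' % ch
                   for ch in search)

print(wrap_in_equivalence_classes('2th word Comp\u00e2raci\u00f8n'))
```

Running it on the second INPUT text reproduces the second OUTPUT regex exactly.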
Remark:

Some equivalence classes don’t return the right characters when using our `Boost` regex engine! For instance:

- The `Columns++` plugin with the regex `[[=Œ=]]` or the regex `[[=œ=]]` correctly detects the two characters `Œ` and `œ`

Whereas:

- The N++ `Boost` search with the regex `[[=Œ=]]` matches any letter `r` or `R`!?
- The N++ `Boost` search with the regex `[[=œ=]]` matches any letter `s` or `S`!?
Best Regards,
guy038
P.S.
Ah…, I’ve just seen @coises’s reply. Thus, my solution seems a bit too simple…
P.P.S. :
I’ve just verified that the `Microsoft Edge` search does implement an `accent/diacritic-insensitive` mode, by default, as well!
-
@guy038 said in Search accented and non-accented characters alike with one simple setting?:
the main rule would be to change any single character `C` of the search field

Modifying the search, rather than the text it’s searching, does seem the “simpler” choice.
The restrictions that I think would have to be necessary to make it “simpler” rather than the near-impossible task that @Coises hinted at:
- it would have to be similar to the “regular” search mode, in that it wouldn’t be able to have any other regex characters (otherwise, he’d have to do so much parsing for escape sequences, et al, that it would not be worth the effort)
- given what @Coises said, it might not want to mess around with normalization (despite what I originally said)
thus, in my current imagining:
- the FIND input would be plain text
- internally, it would change it to be a lot of `[[=☐=]]`-style terms, passed to the regex engine for doing the search
- because the document text is unchanged, the result positions would be consistent with the document, making it easy to highlight results (or do replacements)
- it might not handle all cases, but it’d definitely handle the single-codepoint accented characters
- even if it doesn’t handle all the cases with combining-accent characters – my test with `[[=a=]]` shows that it just matches the `a` when my doc is `a` followed by U+0301 Combining Acute (`á`), but it obviously matches the single-character U+00E1 `á` – my guess is that most of the people who have been asking for accent-insensitive searching are just using simple single-character accented characters, rather than the combining versions. But that is just a guess. And once @coises added the simple version, I am sure he would be inundated with requests to make it handle the combining ones, and might not like that.
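The two forms discussed above can be seen directly with Python’s unicodedata module (just illustrating the Unicode facts, nothing Notepad++-specific): the decomposed and pre-composed spellings of `á` are different strings, and normalization converts between them:

```python
import unicodedata

decomposed = 'a\u0301'    # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = '\u00e1'    # U+00E1 LATIN SMALL LETTER A WITH ACUTE

print(decomposed == precomposed)                                 # False
print(unicodedata.normalize('NFC', decomposed) == precomposed)   # True
print(unicodedata.normalize('NFD', precomposed) == decomposed)   # True
```

This is why a search that only knows about `[[=a=]]` stops after the bare `a` in the decomposed case: the accent is a separate code point.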
I’ve just verified that the Microsoft Edge search does implement an accent/diacritic-insensitive mode, by default, as well !
So does Chrome. (And it seems to be matching the full combo `á`, not just the `a` before the combining accent.)
-
@PeterJones said in Search accented and non-accented characters alike with one simple setting?:
the near-impossible task that @Coises hinted at
In a C++ plugin, not near-impossible, just tedious. In anything other than C++, maybe near-impossible.
- it might not handle all cases, but it’d definitely handle the single-codepoint accented characters
- even if it doesn’t handle all the cases with combining-accent characters – my test with `[[=a=]]` shows that it just matches the `a` when my doc is `a` followed by U+0301 Combining Acute (`á`), but it obviously matches the single-character U+00E1 `á` – my guess is that most of the people who have been asking for accent-insensitive searching are just using simple single-character accented characters, rather than the combining versions. But that is just a guess.

It could be that `(?=[[=a=]])\X` would catch most if not all of the combining cases and not add false positives. Matching the full character is important because you’d want to string characters together, and the intervening combining marks would make the match fail.

And once @coises added the simple version, I am sure he would be inundated with requests to make it handle the combining, and might not like that.
If I get into this, I will almost certainly go the iterator route. The modify-the-search-string route is plausible, though, for someone who might want to tackle this in Python Script, or probably anything other than a C++ plugin calling Boost::regex directly.
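As a rough stdlib-only illustration of what that lookahead is after (a hypothetical emulation; Python’s re supports neither `\X` nor `[[=a=]]`, so this approximates a grapheme as a character plus any trailing combining marks from the U+0300 block):

```python
import re
import unicodedata

def find_a_graphemes(text):
    # Approximate (?=[[=a=]])\X: take each character together with
    # any combining marks that follow it, and keep the ones whose
    # accent-stripped base letter is 'a' -- so the whole grapheme
    # is consumed, not just the base character.
    hits = []
    for m in re.finditer(r'.[\u0300-\u036f]*', text, re.DOTALL):
        base = unicodedata.normalize('NFD', m.group())[0]
        if base.lower() == 'a':
            hits.append(m.group())
    return hits

print(find_a_graphemes('a\u0301 \u00e1 \u00e0 b x'))
```

Consuming the accent along with the base letter is what lets consecutive wrapped characters be strung together without the combining marks breaking the match.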