Search accented and non-accented characters alike with one simple setting?
-
Hi there,
For whatever reason, Windows has decided in Notepad’s recent update to disable editing .m3u8 files with it, and so for the first time in my life, and with great delay, I’ve installed Notepad++.
How can I conveniently search for a non-accented character and have Notepad++ show results for the accented characters as well? (For example, if I search for ‘premier’, it would also locate ‘Premiére’ within the file, and vice versa, as it does in Windows Notepad.)
I have looked it up and found this: https://community.notepad-plus-plus.org/topic/22938/search-for-accented-words/6 However, I humbly failed to understand it, as I am not very techy; moreover, I find it cumbersome to enter that long line noted in the reply to the above thread every time I want to search a file for all phrases, accented or not.
Thank you all
-
@B said in Search accented and non-accented characters alike with one simple setting?:
I humbly failed to understand as I am not very techy
Anyplace in your FIND string that you wanted to match a normal or accented character, put the normal character inside the weird sequence, like searching for `o` or any accented `o` using `[[=o=]]`… It’s cumbersome, but unless someone figures out how to implement the official feature request (which no one has taken a stab at in 4 years, so not very likely), it’s your only choice for now. You will also need to choose Search Mode = `Regular Expression` for those to work.

For example, to search for all variants of `premier`, you would use `pr[[=e=]]m[[=i=]][[=e=]]r`. If your m3u8 files have other characters with accents or equivalents (like `ñ`), then it might just be easiest to use `[[=p=]][[=r=]][[=e=]][[=m=]][[=i=]][[=e=]][[=r=]]` – just copy `=]][[=` and paste it between every letter in your search, then do the starting and ending half-sequences around it. Cumbersome, yes. But it works.

(The library that Notepad++ uses for the searching doesn’t have an “accent-insensitive search” option, so Notepad++ would have to code its own wrapper, which isn’t likely to work intuitively, nor be efficient; as such, I am doubtful that feature would ever be implemented if the library (a completely separate project, which Notepad++ does not control) doesn’t; and Notepad++ cannot force the library to add that feature.)
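The copy-and-paste construction above can also be generated mechanically. As a rough illustration (a hypothetical helper, nothing built into Notepad++), here is a tiny Python sketch that wraps every character of a plain search string in the equivalence-class syntax; the resulting pattern is what you would paste into the Find field with Search Mode = Regular Expression:

```python
def to_equivalence_regex(text):
    # Wrap each character c of the search string in [[=c=]], the
    # POSIX equivalence-class syntax, so the regex search also
    # matches accented variants of each letter.
    return ''.join('[[=%s=]]' % ch for ch in text)

print(to_equivalence_regex('premier'))
# [[=p=]][[=r=]][[=e=]][[=m=]][[=i=]][[=e=]][[=r=]]
```

Note that Python is only used here to build the pattern string; the matching itself would still be done by the Boost engine inside Notepad++.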
That said, @Coises has a Columns++ plugin for Notepad++, where he’s working diligently on making his search all-powerful (trying to get better / more-complete Unicode support, etc). And he’s very open to feature requests. I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode option in his plugin’s search (this recommends KA/KD mode, though I don’t know enough about such things, so I’d leave it in his capable hands). If he does, he’ll chime in here once it’s ready for at least testing. (Or maybe, if you use his Unicode Normalize plugin, it would make the search easier to do with N++ existing regex; again, I don’t know, and he could chime in.)
edit: I fixed `==` to `=` throughout
-
@PeterJones said in Search accented and non-accented characters alike with one simple setting?:
I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode
[…]
(Or maybe, if you use his Unicode Normalize plugin, it would make the search easier to do with N++ existing regex; again, I don’t know, and he could chime in.)

The problematic part of this is that Unicode characters aren’t always a single code point (“decomposed” vs “pre-composed” characters). Making something that would usually work probably wouldn’t be too hard, but making it reliable in less common cases would be tricky. If you normalize the contents of the file, you’ve changed it; and if you keep the normalization in a separate string, there’s no straightforward way to know which characters in the original string correspond to the ones you’ve found in the normalized string.
The following will probably only make sense to people who know C++:
The practical solution, I think, would be to create another set of iterators, like Notepad++ already has for ANSI and UTF-8. (Columns++ has three, one for single byte character sets, one for double byte character sets, and one for UTF-8.) Instead of dereferencing to the actual code points represented in the file, these iterators would dereference to compatibility compositions or decompositions. Then the regex engine would be comparing to the compatibility code point(s) instead of the ones in the file, but by using an iterator it would still “know where it is” in the file and wouldn’t be changing the file itself.
It looks as if ICU contains iterators for normalization. It might be possible to use them or to learn from them.
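For readers who don’t follow the C++, the position-mapping idea can be illustrated in a few lines of Python (a toy sketch of the general approach, not how Columns++ or ICU actually do it): strip accents into a normalized copy while remembering, for each normalized character, which original index it came from, then search the copy and translate the hit back into original coordinates:

```python
import unicodedata

def accent_insensitive_find(haystack, needle):
    """Find `needle` in `haystack`, ignoring accents and case.

    Returns (start, end) offsets into the ORIGINAL haystack,
    or None if there is no match.
    """
    norm_chars = []   # accent-stripped, lowercased characters
    index_map = []    # position in norm -> position in haystack
    for i, ch in enumerate(haystack):
        for d in unicodedata.normalize('NFD', ch):
            if not unicodedata.combining(d):   # drop accent marks
                norm_chars.append(d.lower())
                index_map.append(i)
    norm = ''.join(norm_chars)
    target = ''.join(d.lower()
                     for d in unicodedata.normalize('NFD', needle)
                     if not unicodedata.combining(d))
    pos = norm.find(target)
    if pos < 0:
        return None
    # Map the match back into original-string coordinates.
    start = index_map[pos]
    end = index_map[pos + len(target) - 1] + 1
    return (start, end)
```

Because the index map is kept alongside the normalized copy, the file itself is never modified, which is exactly what the iterator approach achieves inside the regex engine.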
-
Hello, @b, @peterjones, @coises and All,
Peter, in your last post you said about @coises and his plugin :
I am betting that me posting this paragraph will get him thinking about how he might implement an accent/diacritic-insensitive mode option in his plugin’s search.

I agree with you and, to my mind, it would not be very difficult to implement!
The main rule would be to change any single character `C` of the search field, different from `\r` and `\n`, into the `[[=C=]]` syntax, whatever the status of the current char: letter, digit, symbol or blank. Refer to my post https://community.notepad-plus-plus.org/post/104378

Perhaps it would be wise to define an upper limit which would allow this new option: ONLY IF the search field contains just a single line, shorter than, let’s say, `100` or `200` characters?!

Indeed, if we insert an `LF` character in the middle of the `[[==]]` syntax, the `Columns++` plugin does detect an `LF` character or the combination `CRLF`. However, the insertion of the `CR` character is invalid within the `Columns++` plugin.

For example, the INPUT texts:

`first word    cömparÅciõn` and
`2th word Compâraciøn`

would produce the OUTPUT texts:

`[[=f=]][[=i=]][[=r=]][[=s=]][[=t=]][[= =]][[=w=]][[=o=]][[=r=]][[=d=]][[= =]][[= =]][[= =]][[= =]][[=c=]][[=ö=]][[=m=]][[=p=]][[=a=]][[=r=]][[=Å=]][[=c=]][[=i=]][[=õ=]][[=n=]]` and
`[[=2=]][[=t=]][[=h=]][[= =]][[=w=]][[=o=]][[=r=]][[=d=]][[= =]][[=C=]][[=o=]][[=m=]][[=p=]][[=â=]][[=r=]][[=a=]][[=c=]][[=i=]][[=ø=]][[=n=]]`

You can verify that, with either the `Boost` N++ search or the `Columns++` plugin, the two OUTPUT regexes above do match the two INPUT texts!
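The same transformation can be expressed as a throwaway Python helper (hypothetical, just to show the rule; the generated pattern is meant for the Boost / Columns++ Regular Expression search, not for Python’s own re engine):

```python
def wrap_in_equivalence_classes(search):
    # The rule sketched above: every character except \r and \n --
    # letter, digit, symbol or blank alike -- becomes [[=c=]];
    # line-break characters pass through unchanged.
    return ''.join(ch if ch in '\r\n' else '[[=%s=]]' % ch
                   for ch in search)

print(wrap_in_equivalence_classes('2th word Comp\u00e2raci\u00f8n'))
```

Running it on the second INPUT text reproduces the second OUTPUT regex exactly.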
Remark:

Some equivalence classes don’t return the right characters when using our `Boost` regex engine! For instance:

- The `Columns++` plugin with the regex `[[=Œ=]]` or the regex `[[=œ=]]` correctly detects the two characters `Œ` and `œ`

Whereas:

- The N++ `Boost` search with the regex `[[=Œ=]]` matches any letter `r` or `R`!?
- The N++ `Boost` search with the regex `[[=œ=]]` matches any letter `s` or `S`!?
Best Regards,
guy038
P.S.
Ah…, I’ve just seen @coises’s reply. Thus, my solution seems a bit too simple…
P.P.S. :
I’ve just verified that the `Microsoft Edge` search does implement an `accent/diacritic-insensitive` mode, by default, as well!
-
@guy038 said in Search accented and non-accented characters alike with one simple setting?:
the main rule would be to change any single character `C` of the search field

Modifying the search, rather than the text it’s searching, does seem the “simpler” choice.
The restrictions that I think would have to be necessary to make it “simpler” rather than the near-impossible task that @Coises hinted at:
- it would have to be similar to the “regular” search mode, in that it wouldn’t be able to have any other regex characters (otherwise, he’d have to do so much parsing for escape sequences, et al, that it would not be worth the effort)
- given what @Coises said, it might not want to mess around with normalization (despite what I originally said)
thus, in my current imagining:
- the FIND input would be plain text
- internally, it would change it to be a lot of `[[=☐=]]`-style terms, passed to the regex engine for doing the search
- because the document text is unchanged, the result positions would be consistent with the document, making it easy to highlight results (or do replacements)
- it might not handle all cases, but it’d definitely handle the single-codepoint accented characters
- even if it doesn’t handle all the cases with combining-accent characters – my test with `[[=a=]]` shows that it just matches the `a` when my doc is `a` followed by U+0301 Combining Acute (`á`), but it obviously matches the single-character U+00E1 `á` – my guess is that most of the people who have been asking for accent-insensitive searching are just using simple single-character accented characters, rather than the combining versions. But that is just a guess. And once @coises added the simple version, I am sure he would be inundated with requests to make it handle the combining ones, and might not like that.
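The two forms discussed above can be seen directly with Python’s unicodedata module (just illustrating the Unicode facts, nothing Notepad++-specific): the decomposed and pre-composed spellings of `á` are different strings, and normalization converts between them:

```python
import unicodedata

decomposed = 'a\u0301'    # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = '\u00e1'    # U+00E1 LATIN SMALL LETTER A WITH ACUTE

print(decomposed == precomposed)                                 # False
print(unicodedata.normalize('NFC', decomposed) == precomposed)   # True
print(unicodedata.normalize('NFD', precomposed) == decomposed)   # True
```

This is why a search that only knows about `[[=a=]]` stops after the bare `a` in the decomposed case: the accent is a separate code point.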
I’ve just verified that the Microsoft Edge search does implement an accent/diacritic-insensitive mode, by default, as well !
So does Chrome. (And it seems to be matching the full combo `á`, not just the `a` before the combining accent.)
-
@PeterJones said in Search accented and non-accented characters alike with one simple setting?:
the near-impossible task that @Coises hinted at
In a C++ plugin, not near-impossible, just tedious. In anything other than C++, maybe near-impossible.
- it might not handle all cases, but it’d definitely handle the single-codepoint accented characters
- even if it doesn’t handle all the cases with combining-accent characters – my test with `[[=a=]]` shows that it just matches the `a` when my doc is `a` followed by U+0301 Combining Acute (`á`), but it obviously matches the single-character U+00E1 `á` – my guess is that most of the people who have been asking for accent-insensitive searching are just using simple single-character accented characters, rather than the combining versions. But that is just a guess.

It could be that `(?=[[=a=]])\X` would catch most if not all of the combining cases and not add false positives. Matching the full character is important because you’d want to string characters together, and the intervening combining marks would make the match fail.

And once @coises added the simple version, I am sure he would be inundated with requests to make it handle the combining, and might not like that.
If I get into this, I will almost certainly go the iterator route. The modify-the-search-string route is plausible, though, for someone who might want to tackle this in Python Script, or probably anything other than a C++ plugin calling Boost::regex directly.
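As a rough stdlib-only illustration of what that lookahead is after (a hypothetical emulation; Python’s re supports neither `\X` nor `[[=a=]]`, so this approximates a grapheme as a character plus any trailing combining marks from the U+0300 block):

```python
import re
import unicodedata

def find_a_graphemes(text):
    # Approximate (?=[[=a=]])\X: take each character together with
    # any combining marks that follow it, and keep the ones whose
    # accent-stripped base letter is 'a' -- so the whole grapheme
    # is consumed, not just the base character.
    hits = []
    for m in re.finditer(r'.[\u0300-\u036f]*', text, re.DOTALL):
        base = unicodedata.normalize('NFD', m.group())[0]
        if base.lower() == 'a':
            hits.append(m.group())
    return hits

print(find_a_graphemes('a\u0301 \u00e1 \u00e0 b x'))
```

Consuming the accent along with the base letter is what lets consecutive wrapped characters be strung together without the combining marks breaking the match.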