    Find - Replace

    Scheduled Pinned Locked Moved General Discussion
    13 Posts 4 Posters 11.7k Views
    • PeterJonesP
      PeterJones @Kendall DeMott
      last edited by PeterJones

      @kendall-demott

      First, to solve your problem:

      I can replicate your problem by using the same options:
      e4863f39-31fe-435e-b29c-e984bc0d8387-image.png

      But if I turn off “match whole words only”, then it finds it easily:
      fd4b91e5-117c-471d-baaf-6bf753bf8192-image.png

      This is because "1000" is not the “whole word”; price="1000"/> is the “whole word”.

      the error at the bottom is showing double quotes?

      Because that error bar takes whatever is in the FIND box and puts it between quotes to display the text. If you had said Find What: gobbeldygook, the error message would say Find: Can't find the text "gobbeldygook", as shown here:
      57ca9bc1-6a03-45bc-a44d-9db4356db3ef-image.png

      To reiterate the main solution: the reason your search did not work is because you told it to match whole words only, but then were trying to match against text that wasn’t a “whole word”.

      ------
      see https://npp-user-manual.org/docs/searching/#find-replace-tabs

      9f8d5089-0ad3-4a7b-9e2e-fcc03264d1fb-image.png

      • Alan KilbornA
        Alan Kilborn @PeterJones
        last edited by

        @peterjones said in Find - Replace:

        price="1000"/> is the “whole word”.

        Can you elaborate on why this is?
        Aside from “it works”? :-)

        Reading the fine manual HERE doesn’t really shed light on it, for me.

        Note that I know how to use the option, and would never have used it like OP did, but it never hurts to know deeper meanings in things, so that maybe I can use a function better.

        • PeterJonesP
          PeterJones @Alan Kilborn
          last edited by PeterJones

          @alan-kilborn ,

          I don’t have insight into how the non-regex “word” is defined in the code.

          However, at least in my brief experimentation, the “normal mode + match whole word only” seems to agree with “regex mode” and \b.*?\b.

          For example, because the spot between the = and the " will not match a word boundary \b, a “whole word only” match will not match if just the " is included, but it will if the match starts with =" or if it starts at the 1000.

          Maybe this will show it better: If you are searching the text price="1000"/>:

          | looking for text | regex version | normal+whole word matches | regex matches | notes |
          |---|---|---|---|---|
          | 1000 | \b1000\b | YES | YES | the zero-width between "1 is a word boundary, as is 0" |
          | "1000 | \b"1000\b | NO | NO | the zero-width between =" is not a word boundary, so fails |
          | ="1000 | \b="1000\b | YES | YES | the zero-width between e= is a boundary |
          | ="1000" | \b="1000"\b | YES | NO | ERROR: "/ is not a word boundary, so the regex fails, but the normal+whole somehow matches |
          | price="1000" | \bprice="1000"\b | NO | NO | including price before the = seems to change the normal+whole definition of “whole word”… weird. |
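
The "regex matches" column can be reproduced outside Notepad++. The following is a minimal sketch, assuming Python's `re` module as a stand-in for Notepad++'s regex engine (for this ASCII-only text, `\b` means the same thing in both). The helper name `matches_with_boundaries` is made up for illustration, and `re.escape` plays the role of `\Q...\E`.

```python
import re

TEXT = 'price="1000"/>'

def matches_with_boundaries(needle, text=TEXT):
    """Return True if needle occurs in text with a regex \\b on both sides."""
    # re.escape makes the needle literal, so only the two \b are "regex".
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, text) is not None

for needle in ['1000', '"1000', '="1000', '="1000"', 'price="1000"']:
    verdict = "YES" if matches_with_boundaries(needle) else "NO"
    print(f"{needle:<14} {verdict}")
```

This only checks the regex side of the table; it says nothing about how the normal-mode "whole word" option is implemented.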

          Unfortunately, with experimentation, my theory broke down. I don’t know enough about the underlying details to explain exactly how it matches – someone with more insight into the source code would need to comment.

          But I think a good general rule is, “if it doesn’t also match regex=\bXXX\b, then normal+word=XXX probably won’t work, though there are subtle exceptions”. For normal+word, I would stick to words that are obviously word units, like the 1000 or price (with no spaces or punctuation), rather than trying to get normal+word to go across words or word boundaries. If you want to search across multiple words, or want mixed words and punctuation, normal+word will not always work as you expect.

          • guy038G
            guy038
            last edited by guy038

            Hello, @kendall-demott, @peterjones, @alan-kilborn and All,

            Well, I would say :

            • For an ANSI file :

              • If a string of word chars is immediately surrounded both, before and after, with one of the characters [\x00 - \x2F] , [\x3A - \x40] , [\x5B - \x5E] , \x60 or [ \x7B - \x7F], that string will match when the Match whole word only option is ticked

              • In other words, if a string is immediately surrounded by, at least, one word char, in the strict range [0-9A-Z_a-z] or any char in range [\x80-\xFF], that string will not match when the Match whole word only option is ticked

            • For a NON-ANSI file ( so any encoding different from ANSI ) :

              • If a string of word chars is immediately surrounded both, before and after, with a Unicode non-word character, recognized by Notepad++, that string will match when the Match whole word only option is ticked

              • In other words, if a string is immediately surrounded by, at least, one Unicode word char, recognized by Notepad++, that string will not match when the Match whole word only option is ticked
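
The ANSI rule above can be written down as a small classifier. This is only a sketch of the rule as described in this post, not of Notepad++'s actual code: the function names are hypothetical, the text is assumed to be already decoded one byte per character (for example as Latin-1), and the start and end of the text are assumed to count as acceptable neighbours.

```python
def is_ansi_word_char(ch):
    """Word chars per the rule above: [0-9A-Z_a-z] plus \\x80-\\xFF."""
    code = ord(ch)
    return (
        0x30 <= code <= 0x39      # 0-9
        or 0x41 <= code <= 0x5A   # A-Z
        or code == 0x5F           # _
        or 0x61 <= code <= 0x7A   # a-z
        or 0x80 <= code <= 0xFF   # upper half of the ANSI code page
    )

def whole_word_hits(text, word):
    """Offsets where word occurs with no word char directly before or after."""
    hits = []
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        # Assumption: the very beginning/end of the text is a valid neighbour.
        left_ok = start == 0 or not is_ansi_word_char(text[start - 1])
        right_ok = end == len(text) or not is_ansi_word_char(text[end])
        if left_ok and right_ok:
            hits.append(start)
        start = text.find(word, start + 1)
    return hits

print(whole_word_hits('price="1000"/>', "1000"))   # surrounded by quotes
print(whole_word_hits("x1000 1000", "1000"))       # first hit is glued to x
print(whole_word_hits("\xbcabcd\xa4", "abcd"))     # \xBC and \xA4 count as word chars
```

Everything outside the listed word ranges is treated as a non-word char, which is exactly the complement of the ranges [\x00-\x2F], [\x3A-\x40], [\x5B-\x5E], \x60 and [\x7B-\x7F] given above.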


            Now, regarding the regex \b zero-width assertion, it represents, either :

            • The position between the very beginning of current file and a word character

            • The position between a non-word character and a word character

            • The position between a word character and a non-word character

            • The position between a word character and the very end of current file

            Note also that the \n and/or \r line-endings chars are always considered as non-word chars
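
To see those four kinds of positions concretely, here is a small sketch that lists every offset where `\b` matches. It assumes Python's `re` module rather than Notepad++'s own engine; for plain ASCII text the definition of `\b` is the same. The function name `word_boundaries` is just for illustration.

```python
import re
from typing import List

def word_boundaries(text: str) -> List[int]:
    """Return every offset in text where the zero-width \\b assertion holds."""
    return [m.start() for m in re.finditer(r"\b", text)]

# Start of text + word char, e|=, "|1 and 0|" are boundaries; the end is not,
# because the last character > is a non-word char.
print(word_boundaries('price="1000"/>'))   # [0, 5, 7, 11]

# A line ending behaves like any other non-word char.
print(word_boundaries("ab\ncd"))           # [0, 2, 3, 5]
```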

            Best Regards,

            guy038

            • Alan KilbornA
              Alan Kilborn
              last edited by Alan Kilborn

              More on the subject from @guy038 in this old post: https://community.notepad-plus-plus.org/post/20424

              Peter, could the user manual be better in this regard?

              • PeterJonesP
                PeterJones @guy038
                last edited by

                @guy038 said in Find - Replace:

                If the string to search for is, itself, surrounded with non-word characters, that string will match when the Match whole word only option is ticked ONLY IF surrounded with the \n or \r chars

                That’s not accurate.

                If the document is

                <a price="1000"/> x
                <a price="1000"/>x
                

                then FIND = ="1000"/> will match both those lines, even though it’s got an e to the left and either a space or an x to the right.

                —

                Also, I originally said that ="1000" matched normal+whole word in the document price="1000"/>, but it does not… so apparently my test yesterday was wrong. Now that the normal+whole word search for ="1000" and the regex \b="1000"\b actually agree that there is no match, I am back to thinking that a “normal+whole word” FIND=☒☒☒ is equivalent to a regex FIND=\b☒☒☒\b (or, I should say, \b\Q☒☒☒\E\b, because ☒ might be a regex special character, so it needs to be escaped in the regex equivalent). I haven’t been able to find an exception to this. If anyone can show me otherwise, let me know.

                • Kendall DeMottK
                  Kendall DeMott
                  last edited by

                  Peter, Thank You, unticking that box solved my issue.

                  • guy038G
                    guy038
                    last edited by guy038

                    Hi, @kendall-demott, @peterjones, @alan-kilborn and All,

                    I said, in my previous post ( now deleted ) :

                    • If the string to search for is, itself, surrounded with non-word characters, that string will match when the Match whole word only option is ticked ONLY IF surrounded with the \n or \r chars

                    Actually, I really misspoke ! I meant to say :

                    • Any string, containing word and/or non-word characters, at any location, will match, when the Match whole word only option is ticked, IF this string is surrounded with nothing, a \n char or a \r char

                    Now, Peter, you said in your last post :

                    I am back to thinking that for a “normal+whole word” FIND=☒☒☒, it is equivalent to a regex FIND=\b☒☒☒\b …

                    So I created a file, containing all Unicode characters of the BMP, only ( so 63,454 characters with code-point < U+FFFF ), in the form below :

                    NULabcd¤
                    SOHabcd¤
                    ...
                    ...
                    ...
                    abcd¤
                    �abcd¤
                    

                    And it happens that :

                    • The search of the string abcd, in Normal mode, with the Match whole word only option ticked, returns 12,561 matches

                    • The search of the regex string \babcd\b in Regular expression mode, returns 15,424 matches

                    So, obviously, these two kinds of searches are not equivalent at all !
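
For anyone who wants to repeat the experiment, a test file of this shape can be generated with a few lines of Python. This is a sketch under an assumption: the figure of 63,454 is consistent with taking all 65,536 BMP code points and dropping the 2,048 surrogates and the 34 noncharacters (U+FDD0 to U+FDEF, U+FFFE, U+FFFF), so that is what the code does. The function name is hypothetical.

```python
import re

def bmp_test_lines():
    """One '<char>abcd¤' string per BMP code point, minus surrogates/noncharacters."""
    lines = []
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:                    # surrogates (2,048)
            continue
        if 0xFDD0 <= cp <= 0xFDEF or cp >= 0xFFFE:    # noncharacters (34)
            continue
        lines.append(chr(cp) + "abcd\u00a4")
    return lines

lines = bmp_test_lines()
print(len(lines))   # 63454

# Count of \babcd\b hits using Python's own Unicode tables. This number will
# generally differ from Notepad++'s, because each engine classifies word
# characters with its own tables.
print(sum(1 for line in lines if re.search(r"\babcd\b", line)))
```

Writing the lines out with `"\n".join(lines)` in a UTF-8 file gives something close to the file described above; note that the entries for the line-ending characters themselves will introduce extra line breaks.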


                    For instance, let’s insert the string ¼abcd¤ in a new tab, whatever its encoding

                    First note that both the ¼ and the ¤ characters are non-word characters. To be convinced, just look for \w in Regular expression mode. Only the four letters are matched

                    • However, the search of abcd, in Normal search mode, with the Match whole word only ticked, gives : NO match

                    • Luckily, the search of \babcd\b, in Regular expression search mode, does give the correct answer : MATCH


                    Unfortunately, the general template \bString of Word chars\b is not exact either, in numerous cases :

                    Let’s consider, for instance :

                    • The Ԩ Unicode character. It’s the CYRILLIC CAPITAL LETTER EN WITH LEFT HOOK with code-point U+0528

                    • The ᏹ Unicode character. It’s the CHEROKEE SMALL LETTER YI, with code-point U+13F9

                    • The ⴭ Unicode character. It’s the GEORGIAN SMALL LETTER AEN, with code-point U+2D2D

                    Although all these chars are seen as true letters by the Unicode Consortium, they are not yet considered as word chars by our N++ regex engine :((. Thus, the search of \babcd\b, in Regular expression mode, will wrongly match the string abcd in the examples below :

                    Ԩabcd¤
                    ᏹabcd¤
                    ⴭabcd¤
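
As a point of comparison, the same three samples can be checked against a regex engine with more recent Unicode tables. The sketch below assumes Python 3 (whose `re` module treats any Unicode letter as a word char); it does not reproduce Notepad++'s behaviour, it only shows that the outcome depends on the engine's character tables. The names `LETTERS` and `regex_whole_word` are made up for illustration.

```python
import re
import unicodedata

# The three letters discussed above, written as escapes to avoid encoding issues.
LETTERS = ("\u0528", "\u13f9", "\u2d2d")

def regex_whole_word(sample):
    """True if \\babcd\\b matches somewhere in sample."""
    return re.search(r"\babcd\b", sample) is not None

for ch in LETTERS:
    sample = ch + "abcd\u00a4"
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),        # general category, e.g. Lu or Ll
        bool(re.match(r"\w", ch)),       # does this engine see a word char?
        regex_whole_word(sample),        # does \babcd\b match the sample?
    )
```

With an engine that recognises these characters as letters, there is no boundary between the letter and the a, so `\babcd\b` does not match.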
                    

                    Conclusion :

                    Although the search of a whole word with the regex \b....\b seems more accurate and will give correct results with usual chars, it may fail with a lot of non-usual Unicode chars !

                    Best Regards,

                    guy038

                    P.S. :

                    Note that the use of the regex assertion \b may give correct but rather surprising results ! For instance, the regex \b\Q^!:/@?$\E\b matches the part ^!:/@?$, of the string A^!:/@?$Z, because the \b assertion may be the location between a word char and a non-word char ! So, definitely, the use of the \b assertion, in regexes and the option Match whole word only, in Normal mode, are not equivalent !
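
The P.S. example is easy to verify with any engine that implements `\b` in the usual way. A minimal sketch, assuming Python's `re` module, where `re.escape` replaces `\Q...\E` (which Python does not support):

```python
import re

needle = "^!:/@?$"
pattern = r"\b" + re.escape(needle) + r"\b"

# Between A and ^, and between $ and Z, a word char meets a non-word char,
# so both \b assertions succeed.
glued = re.search(pattern, "A^!:/@?$Z")
print(glued.group(0), glued.span())

# Surrounded by spaces, non-word meets non-word on both sides: no boundary.
spaced = re.search(pattern, " ^!:/@?$ ")
print(spaced)
```

So a punctuation-only string satisfies `\b...\b` precisely when it is glued to word characters, which is the opposite of what most people mean by a "whole word".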

                    • PeterJonesP
                      PeterJones @guy038
                      last edited by

                      @guy038 ,

                      Thanks for the experiment. Basically, it boils down to “Unicode complicates things for whole word only”. ;-)

                      The phrasing I am considering for the user manual:

                      • For ASCII text

                        • if the left and right characters of your search string are both “word characters” (letters, numbers, underscore, and optionally additional characters set by your preferences), then “match whole word only” will only allow a match if the characters to the left and right of the match are non-word-characters or spaces or the beginning or ending of the line
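                        • if the left and right characters of your search string are both non-word characters (so not letters, numbers, underscore, and optionally additional characters set by your preferences), then the characters to the left and right of the match must be word characters, or spaces, or the beginning or ending of the line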
                        • if the left of your search string is a word character and the right is not (or vice versa), then the characters to the left and right must be of the opposite type, or spaces, or beginning/ending of line.
                      • For non-ASCII text, the general concepts are the same; however, some edge cases may behave differently than you expect, and with thousands of possible Unicode characters and millions of combinations of pairs of Unicode characters, this manual cannot contain a full description.

                      • Either way, if you want full control of what counts as a “word” or a “word boundary”, use Search Mode = Regular Expression instead of Normal with Match Whole Word Only, which allows you full and precise control of what is allowed before and after what you consider a “whole word”.
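
Read together, the ASCII rules amount to: each end of the search string must sit next to a character of the opposite type, a space, or the beginning/ending of the line. The sketch below is one way to express that reading in Python. It is an interpretation of the proposed wording, not Notepad++'s implementation; the function names and the `extra` parameter (standing in for the "part of a word" characters from the Delimiter preferences) are hypothetical.

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word(ch, extra=""):
    """ASCII word char, or one of the user-added word characters."""
    return ch in WORD_CHARS or ch in extra

def edge_ok(edge, neighbor, extra=""):
    """edge: first/last char of the needle; neighbor: adjacent char or None."""
    if neighbor is None or neighbor.isspace():   # line start/end, or a space
        return True
    return is_word(edge, extra) != is_word(neighbor, extra)   # opposite types

def whole_word_find(line, needle, extra=""):
    """Offsets in line where a non-empty needle passes the rule at both ends."""
    hits = []
    start = line.find(needle)
    while start != -1:
        end = start + len(needle)
        left = line[start - 1] if start > 0 else None
        right = line[end] if end < len(line) else None
        if edge_ok(needle[0], left, extra) and edge_ok(needle[-1], right, extra):
            hits.append(start)
        start = line.find(needle, start + 1)
    return hits

print(whole_word_find('<a price="1000"/> x', '="1000"/>'))   # space on the right
print(whole_word_find('<a price="1000"/>x', '="1000"/>'))    # word char on the right
print(whole_word_find('price="1000"/>', '"1000'))            # = next to " fails
print(whole_word_find("a-b c", "a", extra="-"))              # - added as a word char
```

Under this reading, the two-line document from earlier in the thread matches on both lines, while `"1000` and `="1000"` inside price="1000"/> do not, which agrees with the results reported above.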

                      And yes, I did verify that Settings > Preferences > Delimiter > add your character as part of a word does affect whether Match whole word only matches.

                      • PeterJonesP
                        PeterJones @PeterJones
                        last edited by

                        The phrasing I am considering for the user manual:

                        issue #349 => PR #350

                        It should be in the next release of the user manual
