    Using sets to find A-Za-z plus the # and - chars ..?

    23 Posts 6 Posters 1.4k Views
    • IanSunlunI
      IanSunlun
      last edited by

      I’m trying to find and replace some URLs.

      This is an example of what URL links look like:

      http://mysitename.net/index.php/pagename#bookmark
      http://mysitename.net/index.php/pagename-hypen

      I need to replace these with, for example:
      http://mysitename.net/index.php/pagename - mysitename.mhtml#bookmark
      (So I need to store pagename in ${1} and bookmark in ${2}.)

      You can see I can’t just search for (\w*) because of the - and # and probably % literal chars that may appear.

      I looked at sets and tried ([A-Za-z#-%]), but that didn’t seem to work. I also tried (\w*-*#*), and that didn’t work either. Any ideas on what would work for me?

      • PeterJonesP
        PeterJones @IanSunlun
        last edited by PeterJones

        @IanSunlun ,

        As is documented, - has special meaning in regex character sets: between two characters it defines a range. If you want it to be treated as a literal in a character set, it needs to be either the first or last character in the set, or be escaped as \-.

        Compare yours:
        [screenshot]
        to this [A-Za-z#%-]:
        [screenshot]

        or, going back to yours, with the $ in the text file:
        [screenshot]
        vs
        [screenshot]

        … the [#-%] portion of the character set says “characters # through %”, which includes the $ between those (# is 0x23, $ is 0x24, % is 0x25), so [#-%] will match # or $ or %, but not - (0x2D). Whereas [#%-] says “match # or % or the literal -”.
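For anyone who wants to check the range behaviour outside Notepad++, here is a minimal sketch in Python. It assumes Python's re module treats these simple classes the same way as the Boost engine in Notepad++, which holds for plain ranges as far as I know; the sample string is made up for illustration.

```python
import re

# '#-%' in the middle of a class is a range: '#' (0x23) through '%' (0x25).
range_class = re.compile(r"[#-%]")
# With '-' last, the class is just the three literals '#', '%', '-'.
literal_class = re.compile(r"[#%-]")

sample = "a#b$c%d-e"  # made-up test string

range_hits = range_class.findall(sample)
literal_hits = literal_class.findall(sample)

print(range_hits)    # ['#', '$', '%']
print(literal_hits)  # ['#', '%', '-']
```

The range version picks up the $ and misses the hyphen; the version with the trailing hyphen does the opposite.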

        • PeterJonesP
          PeterJones @PeterJones
          last edited by

          @PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:

          As is documented

          Actually, it’s not documented in our character classes section. I will remedy that.

          • IanSunlunI
            IanSunlun @PeterJones
            last edited by IanSunlun

            @PeterJones
            My search term is not finding the URL in my html page.
            [screenshot]

            HTML page (it’s not finding this, but it should):
            http://mysitename.net/index.php/New_Video#column-one"

            • PeterJonesP
              PeterJones @IanSunlun
              last edited by PeterJones

              @IanSunlun said in Using sets to find A-Za-z plus the # and - chars ..?:

              http://mysitename.net/index.php/New_Video#column-one"

              Um, no it shouldn’t. New_Video#column-one is more than one character. [A-Za-z%#_-] only matches one character.

              I think what you want is http://mysitename.net/index.php/[A-Za-z%#_-]+" , which matches one or more characters from that set.

              Also, I hope you don’t have a URL like http://mysitename.net/index.php/one1#column2

              Or http://school.edu/~username/o.n.e.#2 , which is something I might have had back in my university homepage days, lo those two-and-a-half decades ago.

              Maybe use http://mysitename.net/index.php/[\w%#.~-]+", since \w encompasses the [A-Za-z0-9_] portion, and it adds in the URL-safe characters of . and ~, as well as the # separator and %-encoding-start.
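The difference between a bare class, the same class with +, and the broader \w-based class can be sketched in Python as below. This is only an illustration under the assumption that Python's re behaves like Notepad++ for these patterns; the dots in the fixed prefix are escaped here (unlike the patterns above) so they only match real periods.

```python
import re

# Dots in the fixed part are escaped so they only match real periods.
PREFIX = r"http://mysitename\.net/index\.php/"

one_char = re.compile(PREFIX + r'[A-Za-z%#_-]"')      # exactly one character, then "
one_or_more = re.compile(PREFIX + r'[A-Za-z%#_-]+"')  # one or more characters, then "
broader = re.compile(PREFIX + r'[\w%#.~-]+"')         # also digits, '.', and '~'

url = 'http://mysitename.net/index.php/New_Video#column-one"'
with_digits = 'http://mysitename.net/index.php/one1#column2"'

results = {
    "one_char on url": one_char.search(url) is not None,
    "one_or_more on url": one_or_more.search(url) is not None,
    "one_or_more on with_digits": one_or_more.search(with_digits) is not None,
    "broader on with_digits": broader.search(with_digits) is not None,
}

for label, matched in results.items():
    print(f"{label}: {matched}")
```

Only the single-character pattern and the letters-only pattern on the URL with digits should report False.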

              • José Luis Montero CastellanosJ
                José Luis Montero Castellanos @IanSunlun
                last edited by José Luis Montero Castellanos

                @IanSunlun
                Hello :) Try this in Npp (just to verify easily that it matches):

                Find: [.#\-%]

                Inside a character class [set]:

                The character # is literal.
                The character % is literal.
                The character . is literal (remember that outside a class it matches any character).
                \- is the only one here that needs an escape sequence using \ .

                So:
                [A-Za-z#\-%.]
                The third hyphen is escaped (preceded by \ ), so it is a literal -; the first two form the ranges A-Z and a-z.

                Another character that may need escaping is ^, because of its negation meaning when it comes first within the brackets: [\^].
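A small Python sketch of the escaping rules described above, assuming Python's re handles \- and ^ inside a class the same way Notepad++ does. The test strings are made up.

```python
import re

escaped = re.compile(r"[A-Za-z#\-%.]")   # '\-' is a literal hyphen, no range
unescaped = re.compile(r"[A-Za-z#-%.]")  # '#-%' is a range: '#', '$', '%'

sample = "-$."
escaped_hits = escaped.findall(sample)
unescaped_hits = unescaped.findall(sample)
print(escaped_hits)    # ['-', '.']
print(unescaped_hits)  # ['$', '.']

# '^' only negates when it is the first character of the class.
caret_literal = re.findall(r"[a^]", "a^b")   # 'a' or a literal '^'
caret_negated = re.findall(r"[^a]", "a^b")   # anything except 'a'
caret_escaped = re.findall(r"[\^a]", "a^b")  # escaped, so literal even when first
print(caret_literal)  # ['a', '^']
print(caret_negated)  # ['^', 'b']
print(caret_escaped)  # ['a', '^']
```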

                • IanSunlunI
                  IanSunlun @PeterJones
                  last edited by

                  @PeterJones Ah, that seems to work, thanks.
                  Does [\w%#.~-]+ put whatever it matches into ${1} ?

                      • PeterJonesP
                        PeterJones @IanSunlun
                        last edited by PeterJones

                        @IanSunlun said in Using sets to find A-Za-z plus the # and - chars ..?:

                        Does [\w%#.~-]+ put whatever it matches into ${1} ?

                        Sorry, when I answered, I had forgotten that you previously said,

                        (So I need to store pagename in ${1} and bookmark in ${2}.)

                        Putting the # into either match is not what you want, either. You really need two groups, one before the # and one after.

                        FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
                        will only match if there is a bookmark, and the # will not be inside the ${2} group. If you want the # to be included in ${2}, use http://mysitename.net/index.php/([\w%.~-]+)(#[\w%.~-]+)"
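To see the two groups feeding a replacement, here is a hedged Python sketch. Python writes the group references as \1 and \2 where Notepad++ uses ${1} and ${2}; the replacement text follows the example in the first post, and the dots in the fixed part are escaped here as a precaution.

```python
import re

# Group 1 = page name (before '#'), group 2 = bookmark (after '#').
pattern = re.compile(
    r'http://mysitename\.net/index\.php/([\w%.~-]+)#([\w%.~-]+)"'
)

# The closing quote is part of the match, so the replacement puts it back.
replacement = r'http://mysitename.net/index.php/\1 - mysitename.mhtml#\2"'

with_bookmark = 'http://mysitename.net/index.php/New_Video#column-one"'
no_bookmark = 'http://mysitename.net/index.php/pagename-hypen"'

groups = pattern.search(with_bookmark).groups()
rewritten = pattern.sub(replacement, with_bookmark)
untouched = pattern.sub(replacement, no_bookmark)

print(groups)     # ('New_Video', 'column-one')
print(rewritten)  # http://mysitename.net/index.php/New_Video - mysitename.mhtml#column-one"
print(untouched)  # unchanged, because there is no '#bookmark' part to match
```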

                        • IanSunlunI
                          IanSunlun @PeterJones
                          last edited by IanSunlun

                          @PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:

                          FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"

                          With the period . in between the % and the ~ it did not find:
                          http://mysitename.net/index.php/New_Video#column-one"
                          But taking the period out, it did find it.
                          What’s the thinking behind the period in this context?

                          • PeterJonesP
                            PeterJones @IanSunlun
                            last edited by PeterJones

                            @IanSunlun ,

                            Except for -, order doesn’t matter inside the [] character class. The period is there because New.Video#column-one is also a valid end of a URL.

                            FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
                            does match http://mysitename.net/index.php/New_Video#column-one":

                            [screenshot]
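The order-independence claim is easy to check in Python, again assuming re agrees with Notepad++ here. The reordered classes below are arbitrary shuffles made up for the test, with the hyphen kept last in each.

```python
import re

PREFIX = r"http://mysitename\.net/index\.php/"

original = re.compile(PREFIX + r'([\w%.~-]+)#([\w%.~-]+)"')
shuffled = re.compile(PREFIX + r'([.~%\w-]+)#([~\w.%-]+)"')  # same members, new order

urls = [
    'http://mysitename.net/index.php/New_Video#column-one"',
    'http://mysitename.net/index.php/New.Video#column-one"',
]

original_groups = [original.fullmatch(u).groups() for u in urls]
shuffled_groups = [shuffled.fullmatch(u).groups() for u in urls]

print(original_groups == shuffled_groups)  # True
print(original_groups[0])                  # ('New_Video', 'column-one')
```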

                            • Alan KilbornA
                              Alan Kilborn @PeterJones
                              last edited by

                              @PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:

                              FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"

                              Is it worth pointing out that the first two periods here really aren’t periods but rather “match any char”, because they aren’t escaped? Sure, an unescaped . will match a literal period, but it will match other things as well (obviously).
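A quick Python illustration of that point, with a made-up lookalike string that contains no periods at all. re.escape is Python's helper for escaping literal text; Notepad++ has no direct equivalent that I know of, so there you would type the \. by hand.

```python
import re

unescaped = re.compile(r"http://mysitename.net/index.php/")
escaped = re.compile(r"http://mysitename\.net/index\.php/")

lookalike = "http://mysitenameXnet/indexYphp/"  # made-up string, no periods

unescaped_matches = unescaped.search(lookalike) is not None
escaped_matches = escaped.search(lookalike) is not None
auto_escaped = re.escape("mysitename.net")

print(unescaped_matches)  # True: each '.' matched 'X' or 'Y'
print(escaped_matches)    # False: '\.' only matches a real period
print(auto_escaped)       # mysitename\.net
```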

                              IMO, OP here needs to stop asking forum questions and go off and study regex.

                              • guy038G
                                guy038
                                last edited by guy038

                                Hello, @peterjones,

                                In the post below, Peter :

                                https://community.notepad-plus-plus.org/post/81643

                                You said :

                                Actually, it’s not documented in our character classes section. I will remedy that.

                                Then, regarding the Character Class feature, maybe this part could be added to the Official Notepad++ Documentation:

                                If we consider the following CHARACTER CLASS structure:

                                [.......]
                                123456789

                                The POSSIBLE location(s), in order to find the LITERAL character below, are:

                                LITERAL Character [    :     POSSIBLE at any position, BETWEEN 2 and 8
                                                             POSSIBLE at any position, BETWEEN 2 and 8, if PRECEDED by a BACKSLASH character

                                LITERAL Character ]    :     POSSIBLE at position 2 ONLY
                                                             POSSIBLE at any position, BETWEEN 2 and 8, if PRECEDED by a BACKSLASH character

                                LITERAL Character -    :     POSSIBLE at position 2
                                                             POSSIBLE at position 8
                                                             POSSIBLE at any position, BETWEEN 2 and 8, if PRECEDED by a BACKSLASH character

                                LITERAL Character \    :     POSSIBLE at any position, BETWEEN 2 and 8, if PRECEDED by a BACKSLASH character

                                Of course, change this layout as you like !

                                Best Regards,

                                guy038

                                • Alan KilbornA
                                  Alan Kilborn @guy038
                                  last edited by Alan Kilborn

                                  @guy038

                                  It is rather awkward to express, but I like your idea.

                                  My idea for expression:

                                  • To use a “literal [” in a character class: Use it directly like any other character, e.g. [ab[c]; “escaping” is not necessary (but is permissible), e.g. [ab\[c]

                                  • To use a “literal ]” in a character class: Directly right after the opening [ of the class notation, e.g. []abc], OR “escaped” at any position, e.g. [\]abc] or [a\]bc]

                                  • To use a “literal -” in a character class: Directly as the first or last character in the enclosing class notation, e.g. [-abc] or [abc-], OR “escaped” at any position, e.g. [\-abc] or [a\-bc]

                                  • To use a “literal \” in a character class: Must be doubled (i.e., \\) inside the enclosing class notation, e.g. [ab\\c]
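The four rules can be spot-checked with a short Python table. This is a sketch against Python's re, not the Boost engine in Notepad++, so it only shows that the rules hold in one common flavor; edge cases may differ between engines.

```python
import re

# (pattern, single character the class is expected to match)
cases = [
    (r"[ab[c]", "["),    # '[' used directly
    (r"[ab\[c]", "["),   # '[' escaped (optional)
    (r"[]abc]", "]"),    # ']' right after the opening '['
    (r"[\]abc]", "]"),   # ']' escaped at the start
    (r"[a\]bc]", "]"),   # ']' escaped in the middle
    (r"[-abc]", "-"),    # '-' first
    (r"[abc-]", "-"),    # '-' last
    (r"[a\-bc]", "-"),   # '-' escaped in the middle
    (r"[ab\\c]", "\\"),  # backslash doubled
]

outcomes = [re.fullmatch(pattern, char) is not None for pattern, char in cases]

for (pattern, char), ok in zip(cases, outcomes):
    print(f"{pattern:10} matches {char!r}: {ok}")

# Counter-example: an unescaped '-' between two characters is a range.
hyphen_in_range = re.fullmatch(r"[a-c]", "-") is not None
print(hyphen_in_range)  # False
```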

                                  • PeterJonesP
                                    PeterJones @Alan Kilborn
                                    last edited by

                                    @Alan-Kilborn & @guy038 ,

                                    I like those suggestions, especially the way Alan rephrased it: it works much better than my clunky first attempt in the manual, which only included - and was not very readable.

                                    Thanks.

                                    • Alan KilbornA
                                      Alan Kilborn @PeterJones
                                      last edited by Alan Kilborn

                                      @PeterJones

                                      Maybe my first-of-4 bullet points previously should be moved to be the last-of-4, and changed to:

                                      • To use any other literal character in a character class, just use it directly, i.e., no “escaping” needed

                                      Maybe it works well as a 2 column 4 row table, headers:

                                      • Character
                                      • To use it literally in a character class

                                      With those headers, the “cell contents” for column 2 could be appropriately shortened to remove redundant verbiage.

                                      • guy038G
                                        guy038
                                        last edited by

                                        Hi, @peterjones,

                                        BTW, Peter, do you intend to include, in some way, the end part of this post, regarding the Free-space mode, which is in the Notes section ?

                                        https://community.notepad-plus-plus.org/post/81368


                                        Also, did you correctly receive, by e-mail, my attached text file, regarding the TextFX features ?

                                        Please, I do not want to stress you, unnecessarily ! Just go at your own pace !

                                        Best Regards

                                        guy038

                                        • Alan KilbornA
                                          Alan Kilborn @guy038
                                          last edited by

                                          @guy038 said in Using sets to find A-Za-z plus the # and - chars ..?:

                                          do you intend to include, in some way, the end part of this post, regarding the Free-space mode

                                          He already did, see HERE.

                                          • Andrew McPA
                                            Andrew McP @Alan Kilborn
                                            last edited by

                                            @Alan-Kilborn I really admire you guys for figuring out Regular Expressions; I bet you never get lost in real life when you can keep track of the patterns/positions so well, aka good spatial awareness :)

                                            Oh and I like the trick of having - as last character before ]
