Using sets to find A-Za-z plus the # and - chars ..?
-
I’m trying to find and replace some URLs.
This is an example of what URL links look like:
http://mysitename.net/index.php/pagename#bookmark
http://mysitename.net/index.php/pagename-hypen
I need to replace these with, for example:
http://mysitename.net/index.php/pagename - mysitename.mhtml#bookmark
(So I need to store pagename in ${1} and bookmark in ${2}.)
You can see I can’t just search for (\w*) because of the -, # and probably % literal chars that may appear.
I looked at sets: ([A-Za-z#-%]) but that didn’t seem to work. And I tried (\w*-*#*) and that didn’t work either. Any ideas on what would work for me?
-
As is documented, - has special meaning in regex character sets. If you want it to be treated as a literal in a character set, it needs to be either the first or last character in the set.
Compare yours, [A-Za-z#-%], to this: [A-Za-z#%-]
Or, going back to yours, with the $ in the text file: the [#-%] portion of the character set says “characters # through %”, which includes the $ between those, so [#-%] will match # or $ or %. Whereas [#%-] says “match # or % or the literal -”.
-
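The [#-%] versus [#%-] difference described above can be checked outside Notepad++ with a short Python sketch. Python's re module is a different engine from the one Notepad++ uses, so treat this as an illustration only; the sample string is made up.

```python
import re

sample = "a#b$c%d-e"  # made-up test string

# "#-%" is a range from '#' (0x23) to '%' (0x25), so it also covers '$' (0x24).
range_class = re.compile(r"[#-%]")
# With the hyphen last, it is a literal: the class is '#', '%' or '-'.
literal_class = re.compile(r"[#%-]")

print(range_class.findall(sample))    # ['#', '$', '%']
print(literal_class.findall(sample))  # ['#', '%', '-']
```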
@PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:
As is documented
Actually, it’s not documented in our character classes section. I will remedy that.
-
@PeterJones
My search term is not finding the URL in my html page.
html page (it's not finding this, but it should):
http://mysitename.net/index.php/New_Video#column-one"
-
@IanSunlun said in Using sets to find A-Za-z plus the # and - chars ..?:
http://mysitename.net/index.php/New_Video#column-one"
Um, no it shouldn’t. New_Video#column-one is more than one character. [A-Za-z%#_-] only matches one character.
I think what you want is the following, which wants one or more characters from that set:
http://mysitename.net/index.php/[A-Za-z%#_-]+"
Also, I hope you don’t have a URL like
http://mysitename.net/index.php/one1#column2
or
http://school.edu/~username/o.n.e.#2
which is something I might have had back in my university homepage days, lo those two-and-a-half decades ago.
Maybe use the following, since \w encompasses the [A-Za-z0-9_] portion, and it adds in the URL-safe characters of . and ~, as well as the # separator and %-encoding-start:
http://mysitename.net/index.php/[\w%#.~-]+"
-
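The "one character versus one or more" point above can be sketched in Python. This is only an illustration with Python's re module, not the Notepad++ engine; the literal URL prefix is passed through re.escape so that only the character class is under test, which is my own simplification.

```python
import re

url = 'http://mysitename.net/index.php/New_Video#column-one"'
prefix = re.escape("http://mysitename.net/index.php/")  # literal part, escaped

one_char = re.compile(prefix + r'[A-Za-z%#_-]"')      # exactly one character, then the quote
one_or_more = re.compile(prefix + r'[A-Za-z%#_-]+"')  # one or more characters, then the quote
wider = re.compile(prefix + r'[\w%#.~-]+"')           # also allows digits, '.' and '~'

with_digits = 'http://mysitename.net/index.php/one1#column2"'

print(one_char.search(url) is not None)             # False
print(one_or_more.search(url) is not None)          # True
print(one_or_more.search(with_digits) is not None)  # False: digits are not in the class
print(wider.search(with_digits) is not None)        # True
```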
@IanSunlun
Hello :) Try this in Npp (just to easily verify that it matches):
Find: [.#\-%]
Inside a character class [set]:
The character # is literal.
The character % is literal.
The . is literal (remember that outside a class it matches any character).
\- is the only one that needs an escape sequence using \.
So: [A-Za-z#\-%.]
The last hyphen is inside an escape sequence (preceded by \).
Another character that may need escaping is ^, because of its negation meaning right after the opening bracket: [\^].
-
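The remarks above about ., \- and ^ inside a class can be tried with a small Python sketch. Python's re is used here only as a stand-in for the Notepad++ engine, and the test strings are made up.

```python
import re

# Inside a class, '.' is an ordinary character, not "any character".
print(re.findall(r"[.]", "a.b"))      # ['.']
print(re.findall(r".", "a.b"))        # ['a', '.', 'b']

# An escaped '-' in the middle of a class is a literal hyphen, not a range.
print(re.findall(r"[#\-%]", "#$%-"))  # ['#', '%', '-']

# '^' negates only as the first character; escaped or placed later it is literal.
print(re.findall(r"[^a]", "a^b"))     # ['^', 'b']
print(re.findall(r"[\^a]", "a^b"))    # ['a', '^']
print(re.findall(r"[a^]", "a^b"))     # ['a', '^']
```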
@PeterJones Ah, that seems to work, thanks.
Does [\w%#.~-]+ put whatever it matches into ${1}?
-
@IanSunlun said in Using sets to find A-Za-z plus the # and - chars ..?:
Does [\w%#.~-]+ put whatever it matches into ${1} ?
Sorry, when I answered, I had forgotten that you previously said,
(So I need to store pagename in ${1} and bookmark in ${2}.)
Putting the # into either match is not what you want, either. You really need two groups, one before the # and one after.
FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
will only match if there is a bookmark, and the # will not be inside the ${2} group. If you want the # to be included in ${2}, use
http://mysitename.net/index.php/([\w%.~-]+)(#[\w%.~-]+)"
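A rough Python sketch of the two-group idea, using the first FIND expression above unchanged. The HTML snippet is made up, and the replacement text follows the example from the original question; note that Python spells the group references \g<1> and \g<2>, whereas the thread uses ${1} and ${2} in Notepad++.

```python
import re

# Made-up HTML fragment containing one of the links from the thread.
html = '<a href="http://mysitename.net/index.php/New_Video#column-one">x</a>'

# Group 1 = page name, group 2 = bookmark (the '#' sits between the groups).
pattern = re.compile(r'http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"')

m = pattern.search(html)
print(m.group(1), m.group(2))  # New_Video column-one

# Rebuild the link in the form asked for in the first post.
replacement = r'http://mysitename.net/index.php/\g<1> - mysitename.mhtml#\g<2>"'
result = pattern.sub(replacement, html)
print(result)
```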
-
@PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:
FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
With the period . in between the % and the ~ it did not find:
http://mysitename.net/index.php/New_Video#column-one"
But taking the period out, it did find it.
What's the thinking behind the period in this context?
-
Except for -, order doesn’t matter inside the [] character class. The period is there because New.Video#column-one is also a valid URL-ending string.
FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
does match
http://mysitename.net/index.php/New_Video#column-one"
-
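The statement above that member order is irrelevant (apart from the hyphen) can be checked by brute force. This is a sketch in Python's re, which is not the Notepad++ engine; it tries every ordering of the other class members while keeping the hyphen last.

```python
import re
from itertools import permutations

target = 'http://mysitename.net/index.php/New_Video#column-one"'
members = [r"\w", "%", ".", "~"]  # class members other than the hyphen

orderings = list(permutations(members))
for order in orderings:
    cls = "[" + "".join(order) + "-]"  # hyphen kept last so it stays literal
    pattern = "http://mysitename.net/index.php/(" + cls + "+)#(" + cls + '+)"'
    # Every ordering must capture the same two groups.
    assert re.search(pattern, target).groups() == ("New_Video", "column-one")

print(len(orderings), "orderings, all match")  # 24 orderings, all match
```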
@PeterJones said in Using sets to find A-Za-z plus the # and - chars ..?:
FIND = http://mysitename.net/index.php/([\w%.~-]+)#([\w%.~-]+)"
Is it worth pointing out that the first two periods here really aren’t periods but rather “match any char”, because they aren’t escaped? Sure, an unescaped . will match a literal period, but it will match other things as well (obviously).
IMO, OP here needs to stop asking forum questions and go off and study regex.
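To illustrate the unescaped-period point, here is a small Python sketch (Python's re, not the Notepad++ engine). The look-alike string is made up and contains no periods at all.

```python
import re

unescaped = re.compile(r"http://mysitename.net/index.php/")
escaped = re.compile(r"http://mysitename\.net/index\.php/")

real = "http://mysitename.net/index.php/"
lookalike = "http://mysitenameXnet/index-php/"  # made-up string with no periods

print(unescaped.search(real) is not None)       # True
print(unescaped.search(lookalike) is not None)  # True: '.' matched 'X' and '-'
print(escaped.search(real) is not None)         # True
print(escaped.search(lookalike) is not None)    # False
```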
-
Hello, @peterjones,
In the post below, Peter :
https://community.notepad-plus-plus.org/post/81643
You said :
Actually, it’s not documented in our character classes section. I will remedy that.
Then, regarding the Character Class feature, maybe this part could be added to the Official Notepad++ Documentation :

If we consider the following CHARACTER CLASS structure :

[.......]
123456789

The POSSIBLE location(s), in order to find the LITERAL character below, are :

LITERAL Character [ :
    POSSIBLE at any position, BETWEEN 2 to 8
    POSSIBLE at any position, BETWEEN 2 to 8, if PRECEDED with an ANTI-SLASH character

LITERAL Character ] :
    POSSIBLE at position 2 ONLY
    POSSIBLE at any position, BETWEEN 2 to 8, if PRECEDED with an ANTI-SLASH character

LITERAL Character - :
    POSSIBLE at position 2
    POSSIBLE at position 8
    POSSIBLE at any position, BETWEEN 2 to 8, if PRECEDED with an ANTI-SLASH character

LITERAL Character \ :
    POSSIBLE at any position, BETWEEN 2 to 8, if PRECEDED with an ANTI-SLASH character
Of course, change this layout as you like !
Best Regards,
guy038
-
It is rather awkward to express, but I like your idea.
My idea for expression:
- To use a “literal [” in a character class: Use it directly like any other character, e.g. [ab[c]; “escaping” is not necessary (but is permissible), e.g. [ab\[c]
- To use a “literal ]” in a character class: Directly right after the opening [ of the class notation, e.g. []abc], OR “escaped” at any position, e.g. [\]abc] or [a\]bc]
- To use a “literal -” in a character class: Directly as the first or last character in the enclosing class notation, e.g. [-abc] or [abc-], OR “escaped” at any position, e.g. [\-abc] or [a\-bc]
- To use a “literal \” in a character class: Must be doubled (i.e., \\) inside the enclosing class notation, e.g. [ab\\c]
-
-
@Alan-Kilborn & @guy038 ,
I like those suggestions, especially the way Alan rephrased it: it works much better than my clunky first attempt in the manual, which only included - and was not very readable. Thanks.
-
Maybe my first-of-4 bullet points previously should be moved to be the last-of-4, and changed to:
- To use any other literal character in a character class, just use it directly, i.e., no “escaping” needed
Maybe it works well as a 2 column 4 row table, headers:
- Character
- To use it literally in a character class
With those headers, the “cell contents” for column 2 could be appropriately shortened to remove redundant verbiage.
-
Hi, @peterjones,
BTW, Peter, do you intend to include, in some way, the end part of this post, regarding the Free-space mode, which is in the Notes section ?
https://community.notepad-plus-plus.org/post/81368
Also, did you correctly receive, by e-mail, my attached text file, regarding the TextFX features ?
Please, I do not want to stress you, unnecessarily ! Just go at your own pace !
Best Regards
guy038
-
@guy038 said in Using sets to find A-Za-z plus the # and - chars ..?:
do you intend to include, in some way, the end part of this post, regarding the Free-space mode
He already did, see HERE.
-
@Alan-Kilborn I really admire you guys for figuring out Regular Expressions; I bet you never get lost in real life when you can keep track of the patterns/positions so well, aka good spatial awareness :)
Oh and I like the trick of having - as last character before ]