Regex - Find URLs with Embedded Spaces
-
I need to check HTML pages for URLs with embedded spaces. The regex I’ve come up with so far:
href=".*? +?.*?">
It finds this URL as expected:
href="url with spaces">
but then it gives a false positive on this subsequent URL:
href="../../../../../bio/d/y/k/e/dykes_jb.htm">John B. Dykes</a><span class="verbose">
Any suggestions would be greatly appreciated!
-
@Dick-Adams-0
What you are trying to do is difficult, but the forum’s resident regex guru (@guy038) has already made some posts here. I’m not going to try to give you the exact regex, but it will be worth your while to read those posts, in particular his 2nd post in that thread, which looks to be exactly what you are seeking (with obvious character replacements).
Terry
-
Your regex probably needs to be more restrictive. I know you tried to restrict it by making it non-greedy (?), but that’s not always enough. Try [^"]*? instead of .*? in at least the first, and maybe both, instances of .*? in your pattern. The first might even want to be [^"\s]*? so that it allows neither quotes nor whitespace characters.
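To see the difference outside Notepad++, here is a minimal sketch using Python's re module, which handles these particular constructs the same way for this sample. The html string is an assumption built from the two links quoted in the question, and the names loose and strict are just labels for this illustration.

```python
import re

# Sample HTML assembled from the two links quoted in the question.
html = (
    '<a href="url with spaces">first</a>\n'
    '<a href="../../../../../bio/d/y/k/e/dykes_jb.htm">John B. Dykes</a>'
    '<span class="verbose">'
)

# Original pattern: .*? is lazy, but it may still run past the closing
# quote of the attribute to reach a space later on the same line.
loose = re.compile(r'href=".*? +?.*?">')

# Restricted pattern: [^"] cannot step over a double quote, so the whole
# match has to stay inside a single quoted attribute value.
strict = re.compile(r'href="[^"]*? +?[^"]*?">')

loose_matches = [m.group(0) for m in loose.finditer(html)]
strict_matches = [m.group(0) for m in strict.finditer(html)]

print(len(loose_matches))   # 2: the real hit plus the false positive
print(strict_matches)       # ['href="url with spaces">']
```

The second loose match starts at the dykes_jb.htm link and only ends at class="verbose">, because the first space it can find is the one after "John".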
-
@Terry-R said in Regex - Find URLs with Embedded Spaces:
has already made some posts here.
And as slightly more of a hint: the search zone would begin with href=" and end with ". (If you tried the simpler begin with ", you would find that it would sometimes match between the end of one URL and the beginning of the next, and it wouldn’t work, which might frustrate you; the zone-matching works best when the start and end markers can be distinguished.)
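The zone idea can be sketched in two steps with Python's re module. This is not the single-regex approach from the posts referenced above (those are not reproduced here), only an illustration of why distinct start and end markers help; the html string and the names zone and urls_with_spaces are assumptions for this example.

```python
import re

# Sample HTML assembled from the two links quoted in the question.
html = (
    '<a href="url with spaces">first</a>\n'
    '<a href="../../../../../bio/d/y/k/e/dykes_jb.htm">John B. Dykes</a>'
    '<span class="verbose">'
)

# Step 1: a zone begins with href=" and ends at the very next ".
# Because [^"]* cannot contain a quote, a zone never spans two attributes.
zone = re.compile(r'href="([^"]*)"')

# Step 2: keep only the zones whose content contains a space.
urls_with_spaces = [
    m.group(1) for m in zone.finditer(html) if " " in m.group(1)
]

print(urls_with_spaces)  # ['url with spaces']
```

If the start marker were only ", the text between the closing quote of one attribute and the opening quote of the next would also qualify as a zone, which is the mismatch described above.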