regex: Parsing html tags in other tags / links and titles
-
@Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:
<a href="https://my-website.com/page-66.html" title="Page 66">
Hello. So, basically, this line is almost the same in both parts, having:
<a href
and
title=
even if the link and the title are different. For this, a parsing can be done:
<a href="https://my-website.com/page-1.html" title="Page 1">
with this
<a href="https://my-website.com/page-66.html" title="Page 66">
The last part is a little tricky: I have to import the values PAGE and NUMBER OF THE PAGE from one place to another:
Page 1 (34)</a>
must become
</i> Page 66 <span>
and
(34)</a></li>
should become
<span>27</span> </p> </a>
-
Hi, @robin-cruise,
Sorry, but I still do not see what you want to achieve ??!!
How, on earth, do you think a regular expression can make the link between
Page 1
and
Page 66
or between the numbers
(34)
and
27
??
BR
guy038
-
…and then some of the takers don’t even think through the logistics of their problem first, before asking for help, so the net result is that they waste the time of the helpers and any of the thread’s readers. :-(
-
simple, with the formula:
FIND:
(FROM-HERE(.*?)TO-HERE.*?)(FROM-HERE).*?(TO-HERE)
REPLACE BY
\1\3\2\4
for example, the first part:
<ul id="myNavigation">
<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
</ul>
and this:
<div class="categories-name">
   <a href="https://my-website.com/page-66.html" title="Page 66">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
   </a>
</div>
Please test this formula, and you will see that the link from the first text is copied into the second text.
FIND:
(<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")
REPLACE BY:
\1\3\2\4
Now, the problem is that I must replace each part differently.
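As a rough illustration of the \1\3\2\4 idea outside Notepad++, here is a minimal Python sketch. It assumes a shortened version of the two sample blocks, uses [ \t] instead of \h (Python's re module has no \h) and re.DOTALL in place of the ". matches newline" option. The variable names are made up for this example.

```python
import re

# Shortened sample: the <li> item first, then the <div> block to update.
text = (
    '<ul id="myNavigation">\n'
    '<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '</ul>\n'
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/page-66.html" title="Page 66">\n'
    '</div>\n'
)

# Group 2 captures the page name of the first link; groups 3 and 4 frame
# the page name of the second link, which is dropped in the replacement.
pattern = re.compile(
    r'(<li><a href="https://my-website\.com/(.*?)" title=".*?)'
    r'([ \t]+<a href="https://my-website\.com/).*?(" title=")',
    re.DOTALL,
)

# \1\3\2\4 : keep everything, but put group 2 between groups 3 and 4.
result = pattern.sub(r'\1\3\2\4', text, count=1)
print(result)
```

Only the href of the second block changes here; the title and the number still need their own S/R.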
-
also, the same basic formula, for another part:
FIND:
(title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)
REPLACE BY:
\1\3\2 \4
Check these 2 regex formulas I wrote in the last 2 posts, and see what has changed
-
Hi, @Robin-cruise and All,
When I said, in my previous post :
More generally, you would like that any line, as below, in a <ul .............</ul> section :
¤¤¤¤<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
is changed into :
<div class="categories-name">
¤¤¤¤<a href="https://my-website.com/••••.html" title="••••">
¤¤¤¤<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>
¤¤¤¤</a>
</div>
This can be achieved with the following regex S/R, without any group :
SEARCH
^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>
REPLACE
<div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>
So, from any initial line, inside a <ul .............</ul> section, as below :
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
You get these output lines :
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
Just try it out !
Now, the problem is HOW to perform this S/R, ONLY on lines inside these <ul> sections, below :
<ul id="myNavigation">
....
....
....
</ul>
To do that, we’ll use, again, the generic regex, already discussed, with :
BSR = <ul id="myNavigation">
ESR = (?!</ul>)
SR = ^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>
RR = <div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> •••• <span>••••</span> </p>\r\n </a>\r\n</div>
Leading to this general S/R :
SEARCH
(?s-i)(<ul id="myNavigation">|(?!\A)\G)((?!</ul>).)*?\K^\h*<li><a href="https://my\-website.com/.+?\.html" title=".+?">.+?</a></li>
REPLACE
<div class="categories-name">\r\n <a href="https://my-website.com/••••.html" title="••••">\r\n <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>\r\n </a>\r\n</div>
Notes :
- You must use N++ v7.9.1 or a later release, which correctly handles the \A assertion !
- In replacement, everything is literal, except the parts \r\n, standing for the Windows EOL
- Of course, AFTER the replacement, you’ll have to change any •••• zone with its exact contents !
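For readers who want to try the "only inside the <ul> sections" idea outside Notepad++: Python's re module has neither \G nor \K, so the sketch below takes a different route, which should give the same result as long as every section has its closing </ul>. It first isolates each section, then replaces the <li> lines inside it. The short REPLACEMENT marker and the function name are hypothetical stand-ins, not part of the regex above.

```python
import re

# A whole <ul id="myNavigation"> ... </ul> section (assumes a closing </ul>).
SECTION = re.compile(r'<ul id="myNavigation">.*?</ul>', re.DOTALL)

# One <li> line, same shape as the SR part of the generic regex.
ITEM = re.compile(
    r'^[ \t]*<li><a href="https://my-website\.com/.+?\.html" title=".+?">.+?</a></li>',
    re.MULTILINE,
)

# Stand-in for the full <div class="categories-name"> replacement block.
REPLACEMENT = '<div class="categories-name">...</div>'

def replace_inside_sections(text: str) -> str:
    """Replace <li> lines only when they sit inside a myNavigation section."""
    return SECTION.sub(
        lambda section: ITEM.sub(lambda _: REPLACEMENT, section.group(0)),
        text,
    )

sample = (
    'intro\n'
    '<ul id="myNavigation">\n'
    '<li><a href="https://my-website.com/a.html" title="A">A</a></li>\n'
    '<li><a href="https://my-website.com/b.html" title="B">B</a></li>\n'
    '</ul>\n'
    '<li><a href="https://my-website.com/c.html" title="C">C</a></li>\n'
)
converted = replace_inside_sections(sample)
print(converted)
```

The <li> line after </ul> is left alone, which is the behaviour the \G-based regex is meant to give.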
Let’s try this final regex S/R :
- Place this sample text, below, in a N++ new tab :
...
...
<ul id="myNavigation">
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
....
....
....
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
</ul>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<ul id="myNavigation">
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
....
....
....
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
</ul>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
- Perform the last regex S/R, above
- You’re left with the expected text :
...
...
<ul id="myNavigation">
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
....
....
....
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
</ul>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<ul id="myNavigation">
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
....
....
....
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
<div class="categories-name">
 <a href="https://my-website.com/••••.html" title="••••">
 <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>••••<span>••••</span> </p>
 </a>
</div>
</ul>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
You can verify that the remaining lines
<li><a href="https://my-website.com/••••.html" title="••••">••••</a></li>
are only lines which are outside a <ul id="myNavigation">............</ul> multi-lines section, so not replaced at all !
Best Regards,
guy038
Oups !! I forgot to delete, as well, the lines <ul id="myNavigation"> and </ul>. Quite easy with :
SEARCH
(?-i)<(ul id="myNavigation"|/ul)>\R
REPLACE
Leave EMPTY
-
@guy038 yes, sir. Thank you. You succeeded in transforming part 1 into part 2.
My problem, or maybe I didn’t understand well, is those
••••
How can I change those many
••••
with different words and numbers ? If I have 200 lines, I need 2 months to change those ••••
I return to my regex.
Please check ". matches newline" and run these 2 regexes:
FIND:
(<li><a href="https://my-website.com/(.*?)" title=".*?)(\h+<a href="https://my-website.com/).*?(" title=")
REPLACE BY:
\1\3\2\4
and
FIND:
(title="(.*?)">.*?)(mr-1"></i> ).*?(<span>)
REPLACE BY:
\1\3\2 \4
With these 2 regexes, I manage to do the parsing: I copy information from part 1 to part 2. Of course, this is just an example of how the formula should work.
-
Hello @Robin-cruise,
If you want me to test your two regex S/R ( Let’s say, respectively, S/R A and S/R B ) :
- I need the exact raw text which must be processed with the S/R A
- I need any additional action, if any, which is to be done after processing S/R A and before processing S/R B
- I need the exact raw text which must be processed with the S/R B ( Of course, it may be the result of the S/R A on the initial text )
BR
guy038
-
-
@guy038 @Alan-Kilborn This is the solution. Yes, it can be done with regex, but it must be integrated in PowerShell. Works great !
# Input files and output file
$file1  = 'c:\Folder1\file1.html'
$file2  = 'c:\Folder1\file2.html'
$result = 'c:\Folder1\result.html'

$link   = @()
$title  = @()
$number = @()

# Collect the link, title and number of each <li> item of file1
Get-Content -Path $file1 -Delimiter '</li>' | ForEach-Object {
    $_ | ForEach-Object {
        if ($_ -match '(?<=href=").+?(?=")')  { $link   += $Matches.Values }
        if ($_ -match '(?<=title=").+?(?=")') { $title  += $Matches.Values }
        if ($_ -match '(?<=\()\d+(?=\))')     { $number += $Matches.Values }
    }
}

# Read file2 one <div> block at a time
$content = Get-Content -Path $file2 -Delimiter '</div>'
for ($i = 0; $i -lt $content.Count; $i++) {
    # Find the link, title and number currently present in the block
    $content[$i] | ForEach-Object {
        if ($_ -match '(?<=href=").+?(?=")')          { $link2   = $Matches.Values }
        if ($_ -match '(?<=title=").+?(?=")')         { $title2  = $Matches.Values }
        if ($_ -match '(?<=<span>)\d+(?=</span>)')    { $number2 = $Matches.Values }
    }
    # Swap in the values of the matching <li> item and append the block to the result file
    $content[$i] -replace $link2, $link[$i] -replace $title2, $title[$i] -replace $number2, $number[$i] |
        Out-File -FilePath $result -Append
}
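For comparison, the same pairing logic can be sketched in Python. This is only an approximation of the script above, not a line-by-line port: the function name transplant and the inline sample strings are made up for the example, the old values are replaced literally with str.replace, and the number is only replaced inside <span>...</span>.

```python
import re

def transplant(file1_html: str, file2_html: str) -> str:
    """Fill the <div> blocks of file2 with the values of the <li> items of file1."""
    # (href, title, number) of every <li> item, in document order
    items = re.findall(
        r'<li><a href="(.+?)" title="(.+?)">.*?\((\d+)\)</a></li>', file1_html
    )
    # Every <div class="categories-name"> ... </div> block of the template
    blocks = re.findall(
        r'<div class="categories-name">.*?</div>', file2_html, re.DOTALL
    )
    output = []
    for (href, title, number), block in zip(items, blocks):
        old_href = re.search(r'(?<=href=").+?(?=")', block).group()
        old_title = re.search(r'(?<=title=").+?(?=")', block).group()
        # str.replace treats the old values literally (no regex metacharacters)
        block = block.replace(old_href, href).replace(old_title, title)
        block = re.sub(r'(?<=<span>)\d+(?=</span>)', lambda m: number, block)
        output.append(block)
    return "\n".join(output)

file1 = (
    '<ul id="myNavigation">\n'
    '<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '</ul>\n'
)
file2 = (
    '<div class="categories-name">\n'
    '   <a href="https://my-website.com/page-66.html" title="Page 66">\n'
    '   <p class="font-16 color-grey text-capitalize">'
    '<i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>\n'
    '   </a>\n'
    '</div>\n'
)
result = transplant(file1, file2)
print(result)
```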
-
@Robin-Cruise said in regex: Parsing html tags in other tags / links and titles:
This is the solution. Yes, it can be done with regex, but must integrated in powershell.
Ok… but this is really off-topic for a Notepad++ forum.
Whatever you’re doing could I’m sure be done in many different ways.
This isn’t a general “data manipulation” forum.
-
Hi, @Robin-cruise,
Nice ! So you, finally, got a working solution.
Out of curiosity, could you provide me a short sample of your initial files file1.html ( with several <ul......</ul> sections ) and file2.html, and the corresponding result.html file, created by your Powershell script ?
This could help me to understand what you really wanted to achieve and why I could not think of suitable and useful regexes !
Thanks for your cooperation !
Best Regards,
guy038
-
File1.html
<ul id="myNavigation">
<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
<li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
<li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
</ul>
File2.html
<div class="categories-name">
   <a href="https://my-website.com/page-66.html" title="Page 66">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-67.html" title="Page 67">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-68.html" title="Page 68">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
   </a>
</div>
@guy038 @Alan-Kilborn
Thank you. Please understand that finding this solution was not easy.
First, I knew that @guy038 is one of the best REGEX developers on earth. He already gave a solution, in another post, for PARSING html, so I thought he would succeed.
Second, in another post, someone else gave a PowerShell solution to another problem. I remembered that post as well.
So, I knew that this problem could be solved, but I had to try both options.
Now, maybe there will be a better way to handle regex and files, by integrating Notepad++ with another program such as PowerShell.
-
-
@Alan-Kilborn said in regex: Parsing html tags in other tags / links and titles:
but this is really off-topic for a Notepad++ forum
No, sir. It is an opportunity to understand that a problem has many solutions, even if people tell you that it isn’t possible.
By the way, the same problem can be solved on LINUX, much more easily, with a bash script, but that’s another story.
-
@guy038
result.html is just a blank file, put in the same folder as File1.html and File2.html
After you run the code in PowerShell, the output will appear in result.html
Please test the PowerShell code. You will see in result.html the html structure from File2.html, but with the values extracted from File1.html
-
Hi, @robin-cruise,
OK, but could you provide the result.html file AFTER running the Powershell script !
BR
guy038
Sorry, my old XP SP3 machine is a bit out of date !
-
@guy038 this is the result.html after running the PowerShell code
-
The only problem in PowerShell is that not every regex formula that works in Notepad++ works there. It seems that PowerShell doesn’t like \K or \G or replacements such as (?2\2)
-
Hello, @robin-cruise, @alan-kilborn and All,
OMG, now I understand the whole story ! In fact, you did not express your needs correctly, in your first post ! The File2.html is useless and we just can go from the result_1.html file to the result.html file !
Indeed, in your first post, you said :
<ul id="myNavigation">
<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
<li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
<li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
</ul>
MUST BECOME:
<div class="categories-name">
   <a href="https://my-website.com/page-66.html" title="Page 66">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 66 <span>27</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-67.html" title="Page 67">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 67 <span>24</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-68.html" title="Page 68">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 68 <span>07</span> </p>
   </a>
</div>
can something like this be done with regex?
But… you SHOULD have written :
<ul id="myNavigation">
<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
<li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
<li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
</ul>
MUST BECOME
<div class="categories-name">
   <a href="https://my-website.com/page-1.html" title="Page 1">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 1 <span>34</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-2.html" title="Page 2">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 2 <span>29</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-3.html" title="Page-3">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page-3 <span>11</span> </p>
   </a>
</div>
can something like this be done with regex?
And yes, this transformation can, indeed, be solved with ONE regex S/R only !!!
What misled me was the fact that your output file contained other values than the ones indicated in the initial file ( Page 66 <span>27</span>, Page 67 <span>24</span>, Page 68 <span>07</span>... ), which are completely out of scope :-(( I vainly looked for a link between, for example, the strings "Page 1">Page 1 (34) and Page 66 <span>27</span> !
So, Robin, now, it’s fairly easy to get the right regex S/R !
From this initial text
===========================================================================================================================================
<ul id="myNavigation">
<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>
<li><a href="https://my-website.com/page-2.html" title="Page 2">Page 2 (29)</a></li>
<li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>
</ul>
===========================================================================================================================================
with the following regex S/R :
SEARCH
(?-i)^\h*<li><a href="https://my\-website.com/(.+?)" title="(.+?)">.+?\((.+?)\)</a></li>|<(ul id="myNavigation"|/ul)>\R
REPLACE
?4:<div class="categories-name">\r\n\x20\x20\x20<a href="https://my-website.com/\1" title="\2">\r\n\x20\x20\x20<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>\x20\2\x20<span>\3</span> </p>\r\n\x20\x20\x20</a>\r\n</div>
we do get the expected text, totally identical to your result.html contents ! Moreover, you do not need the Powershell script any more !
===========================================================================================================================================
<div class="categories-name">
   <a href="https://my-website.com/page-1.html" title="Page 1">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 1 <span>34</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-2.html" title="Page 2">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page 2 <span>29</span> </p>
   </a>
</div>
<div class="categories-name">
   <a href="https://my-website.com/page-3.html" title="Page-3">
   <p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i> Page-3 <span>11</span> </p>
   </a>
</div>
===========================================================================================================================================
Et voilà !
Notes :
- The groups used are :
  - Group 1 : the part after website.com/ till the double quote, not included
  - Group 2 : the title part
  - Group 3 : the string between parentheses, before </a>
  - Group 4 : either the literal string <ul id="myNavigation"> or the literal string </ul>, which must be deleted, in replacement
- The regex string \x20\x20\x20 occurs three times, in the replacement regex, and corresponds to the needed leading spaces in each <div ......</div> section. Change it as desired !
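To experiment with this four-group S/R outside Notepad++, here is a minimal Python sketch. Python's re module has no \h, no \R and no conditional replacement such as ?4:, so the sketch assumes [ \t] and an explicit line-break alternation instead, and uses a callback that returns an empty string when group 4 matched. It also writes \n rather than \r\n. The names PATTERN, build_block and convert are invented for the example.

```python
import re

PATTERN = re.compile(
    # Alternative 1: an <li> line, with groups 1 (page), 2 (title), 3 (number)
    r'^[ \t]*<li><a href="https://my-website\.com/(.+?)" title="(.+?)">.+?\((.+?)\)</a></li>'
    # Alternative 2: the <ul ...> or </ul> line with its line break (group 4)
    r'|<(ul id="myNavigation"|/ul)>(?:\r\n|\n|\r)',
    re.MULTILINE,
)

def build_block(match: re.Match) -> str:
    """Plays the role of the conditional replacement ?4:..."""
    if match.group(4) is not None:
        return ""  # group 4 matched: delete the <ul ...> or </ul> line
    page, title, count = match.group(1, 2, 3)
    indent = " " * 3  # same role as \x20\x20\x20 in the replacement
    return (
        '<div class="categories-name">\n'
        f'{indent}<a href="https://my-website.com/{page}" title="{title}">\n'
        f'{indent}<p class="font-16 color-grey text-capitalize">'
        f'<i class="fa fa-angle-right font-14 color-blue mr-1"></i> {title} <span>{count}</span> </p>\n'
        f'{indent}</a>\n'
        '</div>'
    )

def convert(html: str) -> str:
    return PATTERN.sub(build_block, html)

sample = (
    '<ul id="myNavigation">\n'
    '<li><a href="https://my-website.com/page-1.html" title="Page 1">Page 1 (34)</a></li>\n'
    '<li><a href="https://my-website.com/page-3.html" title="Page-3">Page 3 (11)</a></li>\n'
    '</ul>\n'
)
converted = convert(sample)
print(converted)
```

As in the S/R above, the visible text comes from group 2 (the title), so the third item shows Page-3 rather than Page 3.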
Best Regards,
guy038
P.S. :
And, if you want to change the lines <li>.........</li>, ONLY inside the <ul...............</ul> sections, use this second regex S/R, derived from the generic regex S/R, discussed in other topics :
SEARCH
(?s-i)(?:<ul id="myNavigation">|(?!\A)\G)(?:(?!</ul>).)*?\K^\h*<li><a href="https://my\-website.com/(.+?)" title="(.+?)">.+?\((.+?)\)</a></li>
REPLACE
<div class="categories-name">\r\n\x20\x20\x20<a href="https://my-website.com/\1" title="\2">\r\n\x20\x20\x20<p class="font-16 color-grey text-capitalize"><i class="fa fa-angle-right font-14 color-blue mr-1"></i>\x20\2\x20<span>\3</span> </p>\r\n\x20\x20\x20</a>\r\n</div>
Now, end the job, getting rid of the literal string <ul id="myNavigation"> and the literal string </ul>, with, simply :
SEARCH
(?-i)<(ul id="myNavigation"|/ul)>\R
REPLACE
Leave EMPTY
We need consecutive regex S/R because the strings <ul id="myNavigation"> and </ul> are used as anchors, in the first generic S/R, and must not be deleted during the first regex process !
-