unable to extract two different numbers
-
Hi, I am trying to extract two different numbers from the source code of an HTML page. I have tried this:
.*data-test-pin-id="(.*?)class=".*
but it selects several divs' worth of code, and I want a cleaner result.
This is the part of the HTML code that contains the data I want to extract:
<div data-test-id="pin" data-test-pin-id="2251868552505328" rel="4867" class="pinbox pinterest--block"><div class="zI7 iyn Hsu"><div data-test-id="pinWrapper" class="XiG sLG zI7 iyn Hsu" style="border-radius: 16px; -webkit-mask-image: -webkit-radial-gradient(center, white, black);"><div class="XiG sLG zI7 iyn Hsu" style="border-radius: 16px; -webkit-mask-image: -webkit-radial-gradient(center, white, black);"><div class="zI7 iyn Hsu">
I want to extract the number that follows data-test-pin-id= and also the number that follows rel=, in that order, without mixing them up. The regex above selects several divs' worth of code. Can anyone help me with a regex that strips out all of this HTML code?
Thanks
-
The critical part of the search regex will be
data-test-pin-id="(\d+)" rel="(\d+)"
which puts the data-test-pin-id value into $1 and the rel value into $2. I am not sure how much of the rest you want to act on, so I cannot say what regex syntax should go before or after that critical part.
- Do you want to manipulate a whole line (until the newline sequence)?
- Or do you want to manipulate just inside a single <div ...> start tag?
- Or do you want to manipulate from the <div ...> start tag to the next </div> close?
- Or do you want to manipulate from the <div ...> start tag to its matching </div>? (Note that this is a complicated regex, since you show nested divs.)
The best way to help us to help you would be to show multiple selections, and show us what you have before vs what you want to extract.
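For readers who want to try the critical part outside Notepad++, here is a minimal sketch in Python (an assumption; the thread itself only uses Notepad++'s search dialogs). It applies the same pattern to a shortened copy of the start tag from the question. The names PAIR_RE, sample, and pairs are made up for illustration, and Python's re module hands back the two groups as a tuple rather than as $1 and $2.

```python
import re

# Same pattern as above: group 1 is the pin id, group 2 is the rel value.
PAIR_RE = re.compile(r'data-test-pin-id="(\d+)" rel="(\d+)"')

# Shortened copy of the start tag from the question.
sample = (
    '<div data-test-id="pin" data-test-pin-id="2251868552505328" '
    'rel="4867" class="pinbox pinterest--block"><div class="zI7 iyn Hsu">'
)

# findall returns one (pin_id, rel) tuple per match, in document order.
pairs = PAIR_RE.findall(sample)

for pin_id, rel in pairs:
    print(pin_id, rel)
```

Because the pattern requires rel to follow the pin id after exactly one space, it will miss tags where the attributes are ordered or spaced differently.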
----
Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as literal text using the </> toolbar button or manual Markdown syntax. To make regex in red (and so they keep their special characters like *), use backticks, like `^.*?blah.*?\z`. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get. Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.
-
Hello, @jessica-anderson, @peterjones and All,
Assuming the following hypotheses:
- The rel attributes always come right after the data-test-pin-id attributes
- The search is not restricted to a specific section of the HTML file
A solution could be:
SEARCH (data-test-pin-id|\G.+)="\K\d+
Of course, the Regular expression search mode must be selected!
Best Regards,
guy038
-
-
Thanks for the reply.
The snippet I posted contains a div and nested divs.
The two sets of numbers that I am looking for are inside one div that has the id data-test-id="pin"; the search should look for divs with this id to find the numbers.
In the example below I want to get 19844054598293583 and 11917. These two numbers are related, so they should be shown together on one line when found.
<div data-test-id="pin" data-test-pin-id="19844054598293583" rel="1191" class="pinbox pinterest--block">
Do you want to manipulate a whole line (until newline sequence)?
There are a number of divs with the same id in the whole web page, and I want to find those as well. There are certainly newlines before the end of the HTML page, but I am not sure how the page's code is laid out. This code is from a Pinterest page, so I want to look outside the nested divs as well.
Or do you want to manipulate just inside a single <div …> start tag?
The data I am looking for is inside the <div …> tag; the search should cover the whole page to find every div with the same id, if I understand the question.
Thank you guy038 for the code snippet
(data-test-pin-id|\G.+)="\K\d+
This works fine and highlights the first number I want, but it only works with a single "Find Next". Since there are many of these numbers, copying and pasting them one at a time would be time-consuming. When I use "Find All in Current Document", it does not go beyond the first couple of divs, and I could not figure out why. I would appreciate any help with it.
Thanks
-
@Jessica-Anderson said in unable to extract two different numbers:
as in the example I want to get: 19844054598293583
and 11917, these two numbers are related together so it should be shown on one line when found.
You could wait for someone to create a complicated regex that will solve your problem in one hit, or you could look at the problem as a series of steps, which makes it far easier to understand (for you) and is something you could easily re-use with changes in the future.
The steps would be:
- Use the Mark function to find the text you are looking for. My suggestion is
(?s)data-test-pin-id=.+?rel="\d+"
As this is a regex, the search mode must be Regular expression.
- The required text will now be highlighted and you can “copy marked text” (button on the Mark function) to another tab.
- Further edit the lines to get just what you want.
It would seem from your opening post that you are somewhat familiar with regex and I have no doubt you could easily create the regexes to support the above steps.
Terry
PS: edited to include (?s), as the required data may extend over the next line
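The same find-then-trim idea can be sketched outside Notepad++. The Python below is an illustration only, not something posted in the thread: step 1 collects the spans that the Mark regex would highlight, and step 2 trims each span down to its quoted numbers. The helper names mark_spans and trim_to_numbers, the second regex, and the sample text are all hypothetical.

```python
import re

# Step 1 regex, as suggested above; (?s) lets "." cross line breaks.
MARK_RE = re.compile(r'(?s)data-test-pin-id=.+?rel="\d+"')
# Step 2 regex (an assumption): a run of digits between double quotes.
NUMBER_RE = re.compile(r'"(\d+)"')


def mark_spans(text):
    """Return every span the Mark step would highlight."""
    return MARK_RE.findall(text)


def trim_to_numbers(span):
    """Reduce one marked span to its quoted numbers, in order."""
    return NUMBER_RE.findall(span)


# Made-up page fragment; the second tag is split across two lines.
page = (
    '<div data-test-id="pin" data-test-pin-id="111" rel="22" class="a">\n'
    '<div class="zI7"></div>\n'
    '<div data-test-id="pin" data-test-pin-id="333"\n'
    ' rel="44" class="b">\n'
)

for span in mark_spans(page):
    print(" , ".join(trim_to_numbers(span)))
```

Keeping the two steps separate means each regex stays simple. The trade-off is that the trimming step would also pick up any other quoted numeric attribute that happens to sit between the pin id and rel.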
-
@Jessica-Anderson said in unable to extract two different numbers:
this works fine and highlight the first number which I want, but it works only in single find " find next", since there are many of these numbers
- Copy everything to a “safe file”: everything except the critical numbers is going to be removed from this file
- Ctrl+Home (go to the beginning of the file)
- Replace Dialog:
  - FIND = (?s).*?data-test-id="pin"\s*data-test-pin-id="(\d+)"\s*rel="(\d+)"|.*\z
    => find all the stuff leading up to the next data-test-id="pin", and grab the id and rel numerical values into $1 and $2
    => if there are no further data-test-id followed by id and rel, grab everything else to the end of the file
  - REPLACE = (?1(?2$1 , $2\r\n))
    => if groups 1 and 2 both have contents, put those numbers comma-separated on a single line.
    => everything else in the document, other than the id,rel pairs, will be deleted. This is why I said make it a “safe file”, not your original!
  - MODE = regular expression
  - Find Next/Replace one at a time, or Replace All once you’re confident.
Good luck
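For comparison, the same extraction can be sketched in Python. This is an illustration under assumptions rather than a direct port: Python's re module has no conditional replacement like (?1(?2...)), so a replacement function stands in for it, and \Z is used in place of \z as the end-of-text anchor. The function extract_pairs and the sample text are hypothetical.

```python
import re

# FIND regex from the steps above, with \Z as the end-of-text anchor.
FIND_RE = re.compile(
    r'(?s).*?data-test-id="pin"\s*data-test-pin-id="(\d+)"\s*rel="(\d+)"|.*\Z'
)


def _replacement(match):
    # Stand-in for the conditional replacement: emit "id , rel" plus a
    # line break when both groups matched, otherwise emit nothing.
    if match.group(1) and match.group(2):
        return f"{match.group(1)} , {match.group(2)}\r\n"
    return ""


def extract_pairs(text):
    """Delete everything except the id/rel pairs, one pair per line."""
    return FIND_RE.sub(_replacement, text)


# Made-up page fragment with two pin divs and unrelated markup around them.
page = (
    '<html><div data-test-id="pin" data-test-pin-id="111" rel="22" class="a">'
    '<div class="zI7"></div></div>\n'
    '<div data-test-id="pin" data-test-pin-id="333" rel="44" class="b"></div>'
    '</html>\n'
)

print(extract_pairs(page))
```

The function returns a new string and leaves its input untouched, which plays the same role as working on a “safe file” instead of the original.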
-
@PeterJones Thanks for the code.
It works fine, and it is what I wanted.
I am grateful for all the help.
Thanks TerryR, the bookmark solution worked partially: when I copied the marked text, the correct numbers were highlighted, but when pasted to a text file, HTML code was included. Anyhow, I am happy with the solution of PeterJones. Thanks.
-
@Jessica-Anderson said in unable to extract two different numbers:
when pasted to a text file, html code was included
Very possibly. As I said in step #3:
further edit the lines to get what you want
This step was meant to suggest removing the extraneous, unwanted text (yes, the HTML code).
But as you are now happy, there is no need to consider my approach. In future, though, do consider splitting a complicated problem into smaller steps, which can often be much easier to solve.
Terry
-
anyhow I am happy with the solution of PeterJones,
While you are happy with the provided solution, what happens next time? The people here aren’t going to solve your problems time after time. You need to read about and understand how the solutions given to you work, so that next time perhaps you can solve your own problem.
A great starting point is here:
https://community.notepad-plus-plus.org/topic/15765/faq-desk-where-to-find-regex-documentation