Find/Replace & deleting

Dave Dyet

Hi
I am new to Notepad++. I tried using this URL https://techbrij.com/copy-extract-html-drop-down-list-options-text but Notepad++ returned “bad command”. Maybe it’s the wrong syntax?

I am looking to delete everything except the country names.

Here is part of the file… thanks for the help! dd

<li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1806” class=“ui-corner-all” tabindex=“-1”>Africa</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1807” class=“ui-corner-all” tabindex=“-1”>Angola</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1808” class=“ui-corner-all” tabindex=“-1”>Argentina</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1809” class=“ui-corner-all” tabindex=“-1”>Armenia</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1810” class=“ui-corner-all” tabindex=“-1”>Asia</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1811” class=“ui-corner-all” tabindex=“-1”>Australia</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1812” class=“ui-corner-all” tabindex=“-1”>Australia > New South Wales</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1813” class=“ui-corner-all” tabindex=“-1”>Australia > Northern Territory</a></li><li class=“ui-menu-item” role=“presentation”><a id=“ui-id-1814” class=“ui-corner-all” tabindex=“-1”>

PeterJones

@Dave-Dyet said:

I tried using this URL https://techbrij.com/copy-extract-html-drop-down-list-options-text but Notepad++ returned “bad command”

How are you using a URL as a command inside Notepad++?

If you mean you tried following the instructions listed at that URL, where specifically in the sequence does it say “bad command” for you?

Dave Dyet

@PeterJones said:

tried following the instructions listed at that URL

Hi
Yes, you’re correct, I mean I tried following the instructions listed at that URL for Notepad++.

I tried it again with the same syntax <option[^>]>([^<])</option> … etc… now it says “0 occurrences were replaced”

I am trying to figure out how to strip out all the other data, leaving only the country names, one per line and left-justified, i.e.:

Africa
Angola
Argentina
Armenia

Thanks for the help!

Terry R

@Dave-Dyet said:

leaving the country names behind and returning the names left justified.

Hi Dave, it would appear from your example that the country names are the only text/words that start with a capital letter. If this is indeed correct we can use that to our advantage. My suggestion for a regex is:
Find What: (?-i)[^A-Z]+(\z|[A-Z].+?(?=</a))
Replace With: \1\r\n
Search mode MUST be regular expression.

The (?-i) at the start makes the search case-sensitive. [^A-Z]+ first consumes every character that is NOT a capital letter. Once a capital letter is found, we capture from it up to, but not including, the next </a (the lookahead (?=</a) only checks for it), which marks the end of the country name. The \z alternative handles the tail of the file: after the last country name there is no further capital letter, so the match runs to the end of the file and all those characters are dropped.

A quick test on your example seemed to work. Please note I assumed you want the state within the country also captured, thus Northern Territory is also captured.
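For anyone who wants to check the same idea outside Notepad++, here is a minimal Python sketch of the find/replace pair above. It is an approximation only: Python's re module is case-sensitive by default, so the leading (?-i) is dropped, and it spells the end-of-text anchor \Z rather than \z. The shortened sample data, the names TAGS, NAMES, SAMPLE and extract_names, and the plain \n line ending are assumptions made for illustration.

```python
import re

# Shortened stand-in for the pasted data; like the excerpt, it ends mid-item.
TAGS = ('<li class="ui-menu-item" role="presentation">'
        '<a id="ui-id-{n}" class="ui-corner-all" tabindex="-1">')
NAMES = ["Africa", "Angola", "Australia > New South Wales"]
SAMPLE = "".join(TAGS.format(n=1806 + i) + name + "</a></li>"
                 for i, name in enumerate(NAMES))
SAMPLE += TAGS.format(n=1806 + len(NAMES))  # trailing, unfinished item


def extract_names(html):
    """Apply the find/replace pair and return the non-empty lines."""
    # Group 1 is either a name (capital letter up to "</a") or empty at end of text.
    replaced = re.sub(r"[^A-Z]+(\Z|[A-Z].+?(?=</a))", r"\1\n", html)
    return [line for line in replaced.splitlines() if line]


print(extract_names(SAMPLE))
```

The final match (the \Z branch) has an empty group 1, so it contributes only a line break; the sketch filters that empty line out.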

Let us know how it went. If a problem arises, we can probably alter the regex, provided you show us the situation where it did NOT work.

Terry

PeterJones

@Dave-Dyet said:

<option[^>]>([^<])</option> … etc… now it says “0 occurrences were replaced”

Even assuming you used the actual <option[^>]*>([^<]*)</option> regex that they suggested rather than what we see in the forum (which was likely mangled by the forum when you pasted it in because you didn’t use Markdown formatting [see boilerplate below]): how did you expect that regex to work given your data? Their example was for HTML using the <option> tags, and extracting the values from there. Your example text had everything in <li> tag pairs… why would you think the word “option” would magically match “li”? Also, your data has <a> tags nested in the <li> tags.
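To make the mismatch concrete, here is a small Python sketch (not Notepad++ itself, though both regexes behave the same way in Python's re module). The <option> markup and the variable names are hypothetical; the <li> line is a shortened piece of the data from this thread.

```python
import re

# Regex from the linked article, with the asterisks intact.
OPTION_RE = re.compile(r"<option[^>]*>([^<]*)</option>")
# The same regex as it appeared in the post, with the asterisks missing.
MANGLED_RE = re.compile(r"<option[^>]>([^<])</option>")

# Hypothetical <option> markup of the kind the article targets.
option_html = '<option value="1">India</option><option value="2">Japan</option>'
# Shortened piece of the <li>/<a> data posted in this thread.
li_html = ('<li class="ui-menu-item" role="presentation">'
           '<a id="ui-id-1806" class="ui-corner-all" tabindex="-1">Africa</a></li>')

option_hits = OPTION_RE.findall(option_html)    # the article's use case
li_hits = OPTION_RE.findall(li_html)            # no <option> tags, so nothing matches
mangled_hits = MANGLED_RE.findall(option_html)  # without *, only one character is allowed

print(option_hits, li_hits, mangled_hits)
```

An empty result for the <li> data corresponds to the “0 occurrences were replaced” message.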

Since you seem to just want to delete all tags in the example you provided, I’d probably do a two-step:

FIND = <[^>]*>, REPLACE = \n, MODE = regular expression – this will get rid of all the tags, but there are extra newlines
FIND = \R+, REPLACE = \n, MODE = regular expression – this will collapse all series of multiple newlines down into a single newline.

(Note: the original regex and mine both assume you want linux-style LF newlines \n rather than windows-style CRLF newlines \r\n)
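The two steps can be sketched in Python as follows. This is an illustration under assumptions, not what Notepad++ runs: Python's re module has no \R, so (?:\r\n|\r|\n)+ stands in for \R+, and the shortened sample data and the names TAGS, NAMES, SAMPLE and strip_tags are made up for the example.

```python
import re

# Shortened stand-in for the pasted data; like the excerpt, it ends mid-item.
TAGS = ('<li class="ui-menu-item" role="presentation">'
        '<a id="ui-id-{n}" class="ui-corner-all" tabindex="-1">')
NAMES = ["Africa", "Angola", "Australia > New South Wales"]
SAMPLE = "".join(TAGS.format(n=1806 + i) + name + "</a></li>"
                 for i, name in enumerate(NAMES))
SAMPLE += TAGS.format(n=1806 + len(NAMES))  # trailing, unfinished item


def strip_tags(html):
    """Two passes: replace each tag with a newline, then collapse newline runs."""
    # Step 1: every <...> tag becomes a newline.
    no_tags = re.sub(r"<[^>]*>", "\n", html)
    # Step 2: runs of line breaks become one "\n" (stand-in for \R+).
    return re.sub(r"(?:\r\n|\r|\n)+", "\n", no_tags)


print(repr(strip_tags(SAMPLE)))
```

Because the data starts and ends with tags, the result begins and ends with a single newline; calling .strip() on it removes those if they are unwanted.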

-----
Boilerplate to help you with formatting:

This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

PeterJones

@Terry-R said:

(?-i)[^A-Z]+(\z|[A-Z].+?(?=</a))

Ah, Terry beat me by a couple of minutes, and was able to find the alternation that would allow it in a single go rather than in two passes. Use that one instead.

Dave Dyet

Thanks so much for the answers and explanation, it was really helpful. I used Terry’s response and it worked great!

Cheers
Dave

guy038

Hello, @dave-dyet, @terry-r, @peterjones and All,

Here is a 3rd possible solution :

SEARCH (?s-i).+?(\u.+?)(?=<)|(?s).+

REPLACE ?1\1\r\n ( OR ?1\1\n for Unix files )

Notes :

The remainder of the text, at the very end of the file, is simply wiped out. Indeed, when the second alternative (?s).+ matches, group 1 does not exist, so the conditional replacement ?1... inserts nothing.
I used the \u syntax which, in a case-sensitive search, matches any uppercase letter of any Western Unicode script ( Latin, Greek, Cyrillic,… ). It’s probably unnecessary here because, in English, no country name begins with an accented character, anyway ! However, in this specific case, writing (?-i)\u is as easy as writing (?-i)[A-Z] ! Refer to the list of sovereign states, below :

https://en.wikipedia.org/wiki/List_of_sovereign_states

And we get the text, below :

Africa
Angola
Argentina
Armenia
Asia
Australia
Australia > New South Wales
Australia > Northern Territory
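The conditional replacement can be imitated in Python with a replacement function, as in the rough sketch below. Several things are assumptions rather than the Notepad++ behaviour: Python's re module has no \u class and no ?1... replacement syntax, so [A-Z] replaces \u, the DOTALL flag replaces the inline (?s), and the hypothetical helper keep_group1 plays the role of ?1\1\n. The sample data is shortened.

```python
import re

# Shortened stand-in for the pasted data; like the excerpt, it ends mid-item.
TAGS = ('<li class="ui-menu-item" role="presentation">'
        '<a id="ui-id-{n}" class="ui-corner-all" tabindex="-1">')
NAMES = ["Africa", "Angola", "Australia > New South Wales"]
SAMPLE = "".join(TAGS.format(n=1806 + i) + name + "</a></li>"
                 for i, name in enumerate(NAMES))
SAMPLE += TAGS.format(n=1806 + len(NAMES))  # trailing, unfinished item

# First alternative captures a name in group 1; the second swallows the tail.
PATTERN = re.compile(r".+?([A-Z].+?)(?=<)|.+", re.DOTALL)


def keep_group1(match):
    """Emulate ?1\\1\\n: emit group 1 plus a newline only if group 1 took part."""
    name = match.group(1)
    return name + "\n" if name is not None else ""


result = PATTERN.sub(keep_group1, SAMPLE)
print(result)
```

When only the second alternative matches, match.group(1) is None and the function returns an empty string, which is how the trailing tags disappear.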

Peter, from your solution, I built a new version which can do the whole job in one go ;-)) So, here is the 4th version :

SEARCH (?-s)<.+?>|^\h*\R?|(.+?)(?=<)

REPLACE ?1\1\r\n ( OR ?1\1\n for Unix files )

Notes :

This regex also allows the relevant items to begin with a lowercase letter !
If group 1 does not exist, then the <.....> blocks OR possible leading blank chars, followed by a possible line-break, are deleted
If group 1 does exist, then the different items of the drop-down list are listed, as usual, one per line
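The three cases above can be tried out with this Python sketch. It is an approximation under stated assumptions: Python's re module lacks \h and \R, so [ \t]* and (?:\r\n|\r|\n)? stand in for them, the MULTILINE flag makes ^ match at line starts, and a replacement function (the hypothetical keep_group1) emulates ?1\1\n. The entry "eSwatini" is a made-up lowercase item, added only to show that no capital letter is required.

```python
import re

# Shortened stand-in for the pasted data; like the excerpt, it ends mid-item.
TAGS = ('<li class="ui-menu-item" role="presentation">'
        '<a id="ui-id-{n}" class="ui-corner-all" tabindex="-1">')
NAMES = ["Africa", "Australia > New South Wales", "eSwatini"]  # last one is hypothetical
SAMPLE = "".join(TAGS.format(n=1806 + i) + name + "</a></li>"
                 for i, name in enumerate(NAMES))
SAMPLE += TAGS.format(n=1806 + len(NAMES))  # trailing, unfinished item

# Alternatives: a tag | leading blanks plus optional line break | an item (group 1).
PATTERN = re.compile(r"<.+?>|^[ \t]*(?:\r\n|\r|\n)?|(.+?)(?=<)", re.MULTILINE)


def keep_group1(match):
    """Items (group 1) are kept on their own line; everything else is deleted."""
    item = match.group(1)
    return item + "\n" if item is not None else ""


result = PATTERN.sub(keep_group1, SAMPLE)
print(result)
```

The order of the alternatives matters: tags are tried first, so an item is only captured where no tag starts at the current position.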

Best Regards,

guy038