How to find a specific string (in my case it's a link) that ends with a quotation mark.



  • I have a wall of text like this.
    https://imgur.com/ZyahUCT

    I want to mark all the Twitch links with the channel names and copy them; that's the basic idea.

    The only solution I have is to search for this:
    https://imgur.com/m4RdOIf

    After that (this is where I'm stuck) it should mark everything up to the quotation mark,
    like this:
    https://imgur.com/oNneMqd

    I am sure there is a way to achieve this using regex. Please help me, guys.



  • Hello @Fujosej-Fujo
    The following regex will capture (highlight) the remainder of the text you seek. I’m not sure how much it will help you, though, since you suggest you need to further manipulate the highlighted text.

    So on the Mark function use:
    Find What:https://www.twitch.tv/[^"]+

    As I was not able to determine the exact quote character you use, this may need adjusting. If my regex does NOT get the right text, copy the closing quote character from your file and replace mine in the regex. Quotes can be problematic as there isn’t just one kind (a straight " is not the same character as a curly “ or ”).

    Let me know how you get on and especially if you require further help.

    Just so you know how the regex works: it looks for the first bit (which you already had), followed by ANY character that isn’t a quote, for as many characters as can be found. It therefore stops just short of the quote character.
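
    As a rough illustration outside Notepad++, the same idea can be tried in Python. This is only a sketch: the sample text and the variable names are made up, and the dots are escaped here (unlike in the Find What field above) so that they match only a literal dot.

```python
import re

# Hypothetical sample text; the real file is much longer.
sample = (
    '<a href="https://www.twitch.tv/channel_one">one</a> '
    '<a href="https://www.twitch.tv/channel_two">two</a>'
)

# Literal prefix, then one or more characters that are not a straight quote.
link_pattern = re.compile(r'https://www\.twitch\.tv/[^"]+')

# Each match stops just before the closing quote.
links = link_pattern.findall(sample)
print(links)
```

    If the file uses a different quote character, the character inside the brackets would need to change in the same way as described above.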

    Terry



  • Works perfectly, thank you.

    Now I need to copy all this text (370 matches).
    https://imgur.com/zcKjxAS

    Is there a way to do this? I think that shouldn’t be too difficult.



  • That was what I was referring to. Marking just shows you the occurrences. It doesn’t help with additional processing.

    What I think you really want to do is work on a copy of the file, in which you selectively destroy some of the text, leaving only the references you want.

    Here’s what I’d do. Make a copy of the file (into another tab of Notepad++).
    Use the following regex to remove all unwanted text. This is only a rough job. You may still need to remove lines where the text does NOT occur.
    Find What:^.+?(https://www.twitch.tv/[^"]+).+
    Replace With:\1

    So ANY line with the text you want will ONLY have that text remaining on the line. All other lines without it will be unchanged. I’d then use a line sort function (Edit, Line Operations, Sort Lines Lexicographically Ascending). This will put all the lines you want to keep together. Remove all others. I could spend more time on a regex that would do ALL this but this is quick and easy to do.

    See how you go with this.

    Terry



  • @Fujosej-Fujo

    I probably should ask the question: does any occurrence of the text you want cross over lines? I see one of the highlighted instances was right at the end of a line. Currently I haven’t catered for that situation, so check (manually if need be) for any other twitch.tv occurrences that did NOT get marked.

    Terry



  • In my file I have only 10 lines, like this:
    https://imgur.com/s7JCMfP

    Maybe because of that I can’t achieve what I want.

    After Find and Replace
    https://imgur.com/ESaV0VU
    it deletes almost everything I need and leaves me with 8 Twitch matches.

    Like this
    https://imgur.com/SGHI8W2

    Here is the txt file that I am using; maybe this way it will be easier to figure out.
    http://rgho.st/6TlbcFmlG



  • @Fujosej-Fujo
    Your file was a great help. I definitely needed to see it, as it showed me that the lines were very long, with multiple occurrences on each line of the text you want.
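
    A small Python sketch of why the earlier line-based replace probably lost most of the links: when a line holds several links, the lazy prefix stops at the first one and the trailing .+ swallows everything after it. The sample line is hypothetical and the dots are escaped for strictness.

```python
import re

# Hypothetical long line holding two links.
line = 'a "https://www.twitch.tv/one" b "https://www.twitch.tv/two" c'

# Same shape as the earlier line-based Find What / Replace With pair.
result = re.sub(r'^.+?(https://www\.twitch\.tv/[^"]+).+', r'\1', line)

# Only the first link on the line survives; the second is consumed by '.+'.
print(result)
```

    With only 10 very long lines in the file, that would leave at most one link per line.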

    So here is my revised set of steps.

    1. Make a copy in another tab of Notepad++.
    2. Use the Replace function to remove all line endings (carriage returns and line feeds).
      Find What:\R
      Replace With:(nothing, leave this field empty)

    Now everything is on 1 line (it may not look that way if you have word wrap turned on).

    3. Use the Replace function to remove ALL unwanted text.
      Find What:.+?(https://www.twitch.tv/[^"]+)
      Replace With:\1

    This will remove all unwanted text except for whatever follows the last occurrence.

    4. Now put each occurrence of the text we want on its own line.
      Find What:(https://www.twitch.tv/)
      Replace With:\r\n\1

    Once this step is completed, go to the last line and remove the extra text behind the portion you want to keep.

    Again, this is a quick process; I haven’t spent much time on making it do everything. Sometimes quick and easy steps are better than trying to cover ALL bases with a long-winded approach.
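
    For anyone who prefers to see the Replace steps above as a script, here is a hedged Python sketch. The function name and sample text are invented for illustration; Python's re module has no \R, so the line endings are spelled out, and the final manual cleanup is approximated by cutting each line at the first quote.

```python
import re

def extract_links_stepwise(text):
    # Remove all line endings so everything sits on one line.
    one_line = re.sub(r'\r\n|\r|\n', '', text)
    # Delete the text in front of every link, keeping only group 1.
    stripped = re.sub(r'.+?(https://www\.twitch\.tv/[^"]+)', r'\1', one_line)
    # Start a new line before every link.
    split = re.sub(r'(https://www\.twitch\.tv/)', r'\r\n\1', stripped)
    # Manual cleanup from the post, automated: drop the empty first line
    # and cut at the first quote (only the last line still has one).
    return [line.split('"')[0] for line in split.splitlines() if line]

# Hypothetical two-line sample.
sample = 'x "https://www.twitch.tv/a" y\r\nz "https://www.twitch.tv/b" tail\r\n'
links = extract_links_stepwise(sample)
print(links)
```

    Note that, as written, a link sitting at the very first character of the text would be skipped, because .+? needs at least one character in front of it.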

    Have a go and let us know.

    Terry



  • @Fujosej-Fujo
    I have had a slightly longer look at the file you provided. I note that you mentioned quotes; however, your initial regex did NOT include them. In the file there appear to be some instances of twitch.tv without quotes. I’m not sure you actually intend to capture those as well.

    I’ve made a revised regex which doesn’t need so many steps; however, the final file will still need to be edited a bit. Once you try it you will see what I mean. Some of the lines stand out very clearly as not being correct.

    Find What:.+?"(https://www.twitch.tv/[^"]+)
    Replace With:\1\r\n

    So there is no need to remove carriage returns, but you will need to remove those lines that DON’T start with “https”. This can be done with Mark, with “Bookmark line” ticked, followed by Search > Bookmark > Remove Bookmarked Lines.
    Find What:^[^h]
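
    The one-pass replace plus the bookmark cleanup can be approximated in Python as follows. This is a sketch under assumptions: the function name and sample are hypothetical, and it assumes the dot does not match line breaks (Python's default, and Notepad++ with ". matches newline" unticked).

```python
import re

def extract_links_one_pass(text):
    # Replace everything up to and including each quoted link with the
    # link itself plus a line break.
    replaced = re.sub(r'.+?"(https://www\.twitch\.tv/[^"]+)', r'\1\r\n', text)
    # Equivalent of bookmarking lines that match ^[^h] and removing them.
    # Empty lines are dropped too, which the Mark step alone would not do.
    return [ln for ln in replaced.splitlines()
            if ln and not re.match(r'[^h]', ln)]

# Hypothetical sample: one long line with two links, one line with none.
sample = 'a "https://www.twitch.tv/x" b "https://www.twitch.tv/y" c\nno links here\n'
links = extract_links_one_pass(sample)
print(links)
```

    A leftover line that happens to start with the letter h would survive this filter, so the result is still worth a quick visual check.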

    Terry



  • This is amazing, thank you very much, works beautifully.



  • Hello @fujosej-fujo, @terry-r and All,

    Thanks, @fujosej-fujo, for your new 6.txt text file. It’s always better to work on “real” data ;-))

    I think, @terry-r, that all the work can be reduced to a single regex S/R ;-))


    So, @fujosej-fujo, basically, you’re searching for any area of text :

    • Beginning with "https://www.twitch.tv

    • Ending at the first next quote char "

    This regex, which searches for such an area, is :

    (?s-i)"https://www.twitch.tv.+?"

    Notes :

    • The (?s-i) modifiers, at the beginning, mean that :

      • Any dot meta-character ( . ) represents absolutely any single character, even EOL ones ( (?s) )

      • The regex engine will search in a case-sensitive way ( (?-i) )

    • Then, it searches the literal string "https://www.twitch.tv

    • Finally the part .+?" finds the shortest area of any character, till the first next quote char "


    Now that we have built this first regex to match the zones to extract, we create a second regex which contains the first one, surrounded with parentheses in order to store its value as group 1 for the replacement. The general syntax is :

    SEARCH (?s-i).+?(Your regex)|.+

    Thus, this leads to the correct regex S/R, below :

    SEARCH (?s-i).+?("https://www.twitch.tv.+?")|.+

    REPLACE \1\r\n

    => From your new 6.txt file, 366 replacements occur and you get a neat list of 365 links ;-))

    Notes :

    • After the modifiers, the part .+? matches the shortest part from, either, the beginning of file or the end of the previous match, until the expression "https://www.twitch.tv.+?"

    • In replacement, we rewrite, only, the expected group1, which must be extracted, followed with a line-break

    • Near the end of the file, when no more "https://www.twitch.tv can be found, the regex engine uses the second alternative .+, after the alternation symbol |, which will grab all text till the very end of the file, as the (?s) modifier is always active !

    • This time, as group1 is not defined, the replacement simply deletes this last unwanted part
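
    The behaviour described in these notes can be checked with a short Python sketch. It is an approximation, not the Notepad++ engine: the sample text is hypothetical, the dots are escaped, and only (?s) is kept because Python's re is case-sensitive by default and does not accept a global (?-i). It relies on Python 3.5 or later, where an unmatched group in the replacement becomes an empty string.

```python
import re

# First alternative: shortest run of text followed by a quoted link (group 1).
# Second alternative: whatever is left once no more links can be found.
pattern = re.compile(r'(?s).+?("https://www\.twitch\.tv.+?")|.+')

# Hypothetical sample spanning several lines.
sample = 'junk "https://www.twitch.tv/a" more\ntext "https://www.twitch.tv/b" tail\nend'

# subn returns the new text and the number of replacements made.
result, count = pattern.subn(r'\1\r\n', sample)
links = result.split()
print(count, links)
```

    Here three replacements yield two links, the same "one more replacement than links" pattern as the 366 and 365 reported above: the last replacement is the one that removes the trailing text.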


    Refer also to this more complete post, on that topic ( How to extract all the results matched ) :

    https://notepad-plus-plus.org/community/topic/12710/marked-text-manipulation/8

    Best Regards,

    guy038



  • @guy038
    You’re mostly correct, except that I believe the OP didn’t want the quote characters included, they were just to delimit the text he DID want. Thus your regex should be
    Find What:(?s-i).+?"(https://www.twitch.tv.+?)"|.+
    The Replace With field stays as stated.

    Terry



  • Once again, thank you guys! Works like a charm.



  • I have one more question, sorry)

    So let’s say I have this list:
    http://rgho.st/6k9G4rHSh
    I want to add ‘s’ to http and also a ‘www.’ before twitch.
    Is there a way to do this with regex?



  • @Fujosej-Fujo
    This is actually very easy. You only need to search for http:// and use https://www. (including the trailing dot) as the replacement.
    You could even do this with the Replace function set to “Normal” search mode, as there aren’t any special regex sequences like those used previously (.+? etc.).

    Find What:http://
    Replace With:https://www.

    As I said, this can be done in either Normal mode or Regular expression mode; it won’t matter.
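
    For completeness, the same plain replacement as a Python sketch. The list contents are hypothetical, guessed from the description in the question.

```python
# Hypothetical lines in the shape described in the question.
links = ['http://twitch.tv/channel_one', 'http://twitch.tv/channel_two']

# Plain substring replacement; no regex is needed.
fixed = [link.replace('http://', 'https://www.') for link in links]
print(fixed)
```

    One caveat: a line that already contains www. would end up with it twice, so this assumes none of the links has it yet.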

    Terry



  • Thank you very much. That was easier than I thought.

