Regex to find mixed text, numbers and special characters



  • https://doi.org/10.4414/smw.2020.20225

    Ex: Like the text above, I need to find various DOI numbers and wrap each one in an external link, as follows:

    <ext-link ext-link-type=“uri” xlink:href=“https://doi.org/10.4414/smw.2020.20225”>https://doi.org/10.4414/smw.2020.20225</ext-link>

    Can somebody tell me what regex to use for the find and replace, please?



  • @Sukanya__N ,

    Since I don’t know the rules for what’s valid after the doi.org domain, I just assumed that anything starting with https://doi.org/ would be turned into an <ext-link...>:

    • FIND = \Qhttps://doi.org/\E\S*
      ⇒ literal https://doi.org/, followed by 0 or more non-space characters
    • REPLACE = <ext-link ext-link-type="uri" xlink:href="$0">$0</ext-link>
      ⇒ the $0 in the replacement means “use whatever matched from the FIND”; the rest is just literal text
    • Search Mode = regular expression
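
For anyone scripting this outside Notepad++, the same substitution can be sketched in Python. This is only an illustration, not part of the answer above: the helper name `link_doi_urls` is made up, `re.escape` stands in for `\Q...\E`, and `\g<0>` is Python's spelling of `$0`.

```python
import re

# Literal "https://doi.org/" followed by zero or more non-space characters,
# mirroring the FIND expression above.
DOI_URL = re.compile(re.escape("https://doi.org/") + r"\S*")

def link_doi_urls(text: str) -> str:
    """Hypothetical helper: wrap every https://doi.org/... URL in <ext-link>."""
    # \g<0> inserts the whole match, like $0 in the Notepad++ replacement.
    return DOI_URL.sub(
        r'<ext-link ext-link-type="uri" xlink:href="\g<0>">\g<0></ext-link>',
        text,
    )

print(link_doi_urls("See https://doi.org/10.4414/smw.2020.20225 for details."))
```

Like the Notepad++ version, this treats anything up to the next whitespace as part of the URL, so it is only safe when the URL is followed by a space or a line end.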

    ----

    Do you want regex search/replace help? Then please be patient and polite, show some effort, and be willing to learn; answer questions and requests for clarification that are made of you. All example text should be marked as plain text using the </> toolbar button or manual Markdown syntax. Screenshots can be pasted from the clipboard to your post using Ctrl+V to show graphical items, but any text should be included as literal text in your post so we can easily copy/paste your data. Show the data you have and the text you want to get from that data; include examples of things that should match and be transformed, and things that don’t match and should be left alone; show edge cases and make sure your examples are as varied as your real data. Show the regex you already tried, and why you thought it should work; tell us what’s wrong with what you do get… Read the official NPP Searching / Regex docs and the forum’s Regular Expression FAQ. If you follow these guidelines, you’re much more likely to get helpful replies that solve your problem in the shortest number of tries.



  • @PeterJones,

    Thanks a lot for this expression, it saved me a lot of time. But I also need to match DOIs in text that doesn’t have the https://doi.org prefix.

    For ex.:

    <mixed-citation>1. Wesmiller SW, Sereika SM, Bender CM, Bovbjerg D, Ahrendt G, Bonaventura M, et al. Exploring the multifactorial nature of postoperative nausea and vomiting in women following surgery for breast cancer. Auton Neurosci. 2017;202:102-7. 10.1016/j.autneu.2016.09.017</mixed-citation>

    In the text above I need to add an external link for “10.1016/j.autneu.2016.09.017”, so that the output is as follows:

    <mixed-citation>1. Wesmiller SW, Sereika SM, Bender CM, Bovbjerg D, Ahrendt G, Bonaventura M, et al. Exploring the multifactorial nature of postoperative nausea and vomiting in women following surgery for breast cancer. Auton Neurosci. 2017;202:102-7. <ext-link ext-link-type=“uri” xlink:href=“https://doi.org/10.1016/j.autneu.2016.09.017”>https://doi.org/10.1016/j.autneu.2016.09.017</ext-link></mixed-citation>

    So I tried your expression in this way:

    FIND
    \Q10.\E\S* , any number followed with non-space characters

    REPLACE
    <ext-link ext-link-type=“uri” xlink:href=“https://doi.org/$0”>https://doi.org/$0</ext-link>

    and I received the output as:

    <mixed-citation>1. Wesmiller SW, Sereika SM, Bender CM, Bovbjerg D, Ahrendt G, Bonaventura M, et al. Exploring the multifactorial nature of postoperative nausea and vomiting in women following surgery for breast cancer. Auton Neurosci. 2017;202:102-7. <ext-link ext-link-type=“uri” xlink:href=“https://doi.org/10.1016/j.autneu.2016.09.017</mixed-citation>”>https://doi.org/10.1016/j.autneu.2016.09.017</mixed-citation></ext-link>

    Because the search matches any run of non-space characters, the closing </mixed-citation> tag also ends up inside <ext-link…>.

    Can you tell me whether my find and replace expressions are correct, and suggest another expression that keeps </mixed-citation> out of <ext-link>, please?



  • @Sukanya__N said in Regex to find mixed text, numbers and special characters:

    \Q10.\E\S* , any number followed with non-space characters

    It would have helped if you had followed the advice in my italic paragraphs, and used Markdown and the forum’s formatting toolbar to help make your post more readable. Example data should always be inside “literal text” (aka “code”) markers, which can be implemented using the </> on the toolbar when you’re inputting your post. And regexes are much more readable when inside the same kind of code block, or inside a simple pair of backticks: `\Q10.\E\S*` renders as \Q10.\E\S* , which is much more readable, and ensures that special characters don’t get modified by the forum.

    Back to your actual problem: the </mixed-citation> got captured because the \S* part of the regex grabs as many non-space characters as it can, and </mixed-citation> is 17 non-space characters sitting directly after the DOI, so all of it matched. When you give examples that don’t include your edge cases, you get regexes that don’t work on your data.
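
The greedy behaviour described above is easy to reproduce in Python. This is a small sketch on a shortened version of the sample line; the `[^\s<]*` variant is just one possible way to stop at the tag boundary and is an assumption of this example, not the fix proposed in this thread.

```python
import re

sample = "2017;202:102-7. 10.1016/j.autneu.2016.09.017</mixed-citation>"

# \S* keeps consuming until it reaches whitespace or the end of the text,
# so the closing tag is swallowed along with the DOI.
greedy = re.search(r"10\.\S*", sample).group(0)

# Also excluding "<" from the allowed characters stops the match at the tag.
bounded = re.search(r"10\.[^\s<]*", sample).group(0)

print(greedy)
print(bounded)
```

The first `print` shows the DOI with `</mixed-citation>` still attached; the second shows the DOI alone.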

    I did a bit of googling, and found that 10.\d{4,9}/[-._;()/:A-Z0-9]+ apparently matches many (most?) DOI references, so I’ll use that to help make the full regex more explicit. I will also allow it to be optionally prefixed by https://doi.org/ – since your original example had that, but your second example did not. This means that I will have to use advanced features of regex. If you want to find out what each term in the regex I will use means, you’ll have to take it to one of the regex-explaining sites that are linked in the forum’s regex FAQ, which was included in my italic paragraph above.

    • FIND = (https?://doi.org/)?(10.\d{4,9}/[-._;()/:A-Z0-9]+)
    • REPLACE = <ext-link ext-link-type="uri" xlink:href="(?1:https://doi.org/)$0">(?1:https://doi.org/)$0</ext-link>
    • MODE = regular expression

    will convert

    <mixed-citation>1. Wesmiller SW, Sereika SM, Bender CM, Bovbjerg D, Ahrendt G, Bonaventura M, et al. Exploring the multifactorial nature of postoperative nausea and vomiting in women following surgery for breast cancer. Auton Neurosci. 2017;202:102-7. 10.1016/j.autneu.2016.09.017</mixed-citation>
    
    10.1016/j.autneu.2016.09.017
    
    https://doi.org/10.4414/smw.2020.20225
    

    into

    <mixed-citation>1. Wesmiller SW, Sereika SM, Bender CM, Bovbjerg D, Ahrendt G, Bonaventura M, et al. Exploring the multifactorial nature of postoperative nausea and vomiting in women following surgery for breast cancer. Auton Neurosci. 2017;202:102-7. <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.autneu.2016.09.017">https://doi.org/10.1016/j.autneu.2016.09.017</ext-link></mixed-citation>
    
    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.1016/j.autneu.2016.09.017">https://doi.org/10.1016/j.autneu.2016.09.017</ext-link>
    
    <ext-link ext-link-type="uri" xlink:href="https://doi.org/10.4414/smw.2020.20225">https://doi.org/10.4414/smw.2020.20225</ext-link>
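
The same conversion can be sketched in Python for anyone doing this in a script. Python's `re.sub` has no `(?1:...)` conditional in the replacement string, so this sketch uses a replacement function instead. The names `add_ext_link` and `link_dois` are made up for illustration, and two details are assumptions of the sketch rather than part of the regex above: the dots are escaped so they match only a literal `.`, and `re.IGNORECASE` is used so that `A-Z` also covers lowercase DOIs such as `j.autneu`.

```python
import re

# Optional https://doi.org/ prefix (group 1) followed by the DOI itself (group 2).
DOI = re.compile(
    r"(https?://doi\.org/)?(10\.\d{4,9}/[-._;()/:A-Z0-9]+)",
    re.IGNORECASE,
)

def add_ext_link(match: re.Match) -> str:
    # Keep an existing prefix unchanged; add https://doi.org/ only if it is missing.
    prefix = match.group(1) or "https://doi.org/"
    url = prefix + match.group(2)
    return f'<ext-link ext-link-type="uri" xlink:href="{url}">{url}</ext-link>'

def link_dois(text: str) -> str:
    """Hypothetical helper: wrap bare or prefixed DOIs in <ext-link> elements."""
    return DOI.sub(add_ext_link, text)

print(link_dois("Auton Neurosci. 2017;202:102-7. 10.1016/j.autneu.2016.09.017</mixed-citation>"))
print(link_dois("https://doi.org/10.4414/smw.2020.20225"))
```

Because `<` is not in the allowed character class, the match stops before `</mixed-citation>`. A DOI followed directly by sentence punctuation such as a trailing `.` would still pick up that character, so real data may need an extra check.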
    

    If this is insufficient for you, you will need to read and understand my italic paragraph above, and prove in your reply that you are willing to follow that advice, otherwise I will not be able to help further, and likely other regulars will have difficulty helping you either.



  • @PeterJones

    Thank you so much!! This helped me a lot.



  • @Sukanya__N said in Regex to find mixed text, numbers and special characters:

    <ext-link ext-link-type=“uri” xlink:href=“https://doi.org/10.4414/smw.2020.20225”>https://doi.org/10.4414/smw.2020.20225</ext-link>

    you could try <ext-link ext-link-type.*/ext-link>
    but see that the outer brackets are not included in the find

