Regex: I want to check if the words between the tags <span class> </span> start with diacritics



  • Hello. I want to check whether the words between <span class> </span> start with diacritics.

    Example: <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>

    So, the regex should match all tags that contain words that don’t start with diacritics, such as: This immediacy is a product

    And if it does match, it should replace the first letter of each of those words with diacritics.



  • Please consider diacritics = capital letters = uppercase



  • @Vasile-Caraus said:

    diacritics = capital letters = uppercase

    Thank you for the clarification. (In my mind, “diacritics” are accented characters, like à, and I was confused.)

    Also, thank you for both the want-to-match and the don’t-want-to-match examples.

    I have a solution that works for me given your example text, though I needed to run it multiple times, because I don’t know how to “back up” the search point; @guy038 will probably come up with a one-shot regex.

    If I start with the data

    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This immediacy is a product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This immediacy is a product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    

    Some of these spans contain only words that start with capitals, and others contain words that start with lowercase. My thought process is “inside of the span tag, possibly after other words, look for a word boundary followed by a lowercase letter, and convert that lowercase letter to uppercase”.

    • FIND = (?-i)<span class=[^>]*>.*?\K\b[a-z](?=.*?</span>)
    • REPLACE = \u$0
    • MODE = regular expression

    Since the longest phrase inside the <span>...</span> had four words that didn’t start with a capital letter, I had to run Replace All 4 times to get all the words capitalized. But in the end, I had:

    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    <p class="text_obisnuit"><span class="text_obisnuit2">This Immediacy Is A Product</span> of the direct particularity of purpose-oriented information that the network caters to.</p>
    

    which is what I believe you want.
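
    If you ever need to do the same thing outside Notepad++, here is a rough Python sketch of the idea. It is not what I ran above, and the function names are just ones I made up for illustration. Python's re module has no \K, so the first function uses a capture group in its place and repeats the replacement until nothing changes, which mimics pressing Replace All several times. The second function is a different technique: it matches each whole span once and capitalizes every word inside it with a replacement callback, so one pass is enough.

```python
import re

# Python's re has no \K, so group 1 stands in for "everything before the letter".
# As in the Notepad++ search, each match starts at "<span class=", which is why
# only one word per span can be fixed in a single pass.
ONE_WORD = re.compile(r'(<span class=[^>]*>.*?)\b([a-z])(?=.*?</span>)')

# Whole span: opening tag, inner text, closing tag.
SPAN = re.compile(r'(<span class=[^>]*>)(.*?)(</span>)')


def capitalize_by_passes(text):
    """Repeat the one-word-per-span replacement until a pass changes nothing.

    Returns the new text and the number of passes that made a change.
    """
    passes = 0
    while True:
        text, count = ONE_WORD.subn(
            lambda m: m.group(1) + m.group(2).upper(), text
        )
        if count == 0:
            return text, passes
        passes += 1


def capitalize_single_pass(text):
    """Match each span once and uppercase every word-initial a-z inside it."""
    def fix_span(m):
        inner = re.sub(r'\b[a-z]', lambda w: w.group(0).upper(), m.group(2))
        return m.group(1) + inner + m.group(3)
    return SPAN.sub(fix_span, text)


sample = (
    '<p class="text_obisnuit"><span class="text_obisnuit2">'
    'This immediacy is a product</span> of the direct particularity.</p>'
)
fixed, passes = capitalize_by_passes(sample)
print(passes)                          # 4, one pass per lowercase word
print(fixed == capitalize_single_pass(sample))  # True
```

    Both versions share the limits of the original search: they only look at a-z, they work one line at a time, and \b also sits after an apostrophe, so a word like don't would come out as Don'T.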
    -----
    FYI: I often add this to my response in regex threads, unless I am sure the original poster has seen it before. Here is some helpful information for finding out more about regular expressions, and for formatting posts in this forum (especially quoting data) so that we can fully understand what you’re trying to ask:

    This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.

    If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to.

    ps: thanks again for the match and don’t-match; it allowed me to cut a paragraph-and-a-half out of my boilerplate for you. :-)



  • Great answer, @PeterJones, thanks a lot!

