    Regex: Select all HTML tags that contain no characters, and another regex that matches only tags containing symbols

    19 Posts 4 Posters 1.0k Views
      Hellena Crainicu @Lycan Thrope

      @Lycan-Thrope said:

      <p>\s*(?![a-zA-Z\u4e00-\u9fff]).*</p>

      That worked, except for the underscores…which I think is included with the [a-z][A-Z] aspect of regex, as acceptable characters

      Yes, strange. I tried these, but they still cannot find _ underscores

      <p>\s*(?![a-zA-Z\u4e00-\u9fff_]+\s*<\/p>)[^<>]*<\/p>
      or
      <p>\s*(?![a-zA-Z\u4e00-\u9fff_]+\s*<\/p>)[^\w_]*<\/p>

        Lycan Thrope @Hellena Crainicu

        @Hellena-Crainicu said:

        Yes, strange. I tried these, but they still cannot find _ underscores

        <p>\s*(?![a-zA-Z\u4e00-\u9fff_]+\s*</p>)[^<>]*</p>

        This regex recaptures the kanji

        this one works by not finding the kanji

        or
        <p>\s*(?![a-zA-Z\u4e00-\u9fff_]+\s*</p>)[^\w_]*</p>

          Lycan Thrope @Hellena Crainicu

          @Hellena-Crainicu ,
          Maybe you should wait for the gurus. :-)
          I can’t find a way to get _ (character #95, or hex #5F) to show up in the matches.

          Perhaps some new eyes might help. Good luck.

            guy038

            Hello, @hellena-crainicu, @lycan-thrope and All,

            Both of you made the same mistake : the \u#### syntax, to describe a specific character, DOES NOT exist in the Boost regex engine ! The RIGHT syntax, which works, is \x{####} ;-))


            Secondly, @hellena-crainicu, even if you use the correct syntax, as below :

            <p>\s*(?![a-zA-Z\x{4e00}-\x{9fff}]).*</p>

            This line, <p>---别条款适用</p>, which begins with a symbol after <p>, would also match, because the negative look-ahead is tested only once, at the first character after <p> !

            Maybe you would have preferred the following syntax, which matches the entire line ONLY IF all chars between <p> and </p> are different from a usual letter and different from a Chinese character :

            MARK (?s-i)^\s*<p>((?![a-zA-Z\x{4e00}-\x{9fff}]).)*</p>

            Thus, from this INPUT file :

            <p>下列特别条款适用</p>
            <p>         。</p>
            <p>---</p>  
            <p>---别条款适用</p>
             <p>______</p>  
             <p> </p> 
            <p>于本保险单的各个部分,若其</p>
            

            It would ONLY mark these sections :

            <p>         。</p>
            <p>---</p>  
             <p>______</p>  
             <p> </p> 
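
A minimal Python sketch of the difference between the two regexes above. It assumes Python's re module as a stand-in for the Boost engine, so \x{4e00} is written \u4e00 here, and the names LINES, once, each and matching are illustrative only, not part of the original posts:

```python
import re

# Sample lines taken from the INPUT above (outer blanks removed).
LINES = [
    "<p>下列特别条款适用</p>",
    "<p>         。</p>",
    "<p>---</p>",
    "<p>---别条款适用</p>",
    "<p>______</p>",
    "<p> </p>",
    "<p>于本保险单的各个部分,若其</p>",
]

# Python spells a code point \uXXXX; Boost (Notepad++) needs \x{XXXX}.
LETTER = r"[a-zA-Z\u4e00-\u9fff]"

# Look-ahead tested ONCE, right after <p> and the optional blanks.
once = re.compile(r"<p>\s*(?!" + LETTER + r").*</p>")

# Look-ahead tested before EVERY character between <p> and </p>.
each = re.compile(r"<p>(?:(?!" + LETTER + r").)*</p>")


def matching(pattern, lines):
    """Return the lines that the pattern matches from start to end."""
    return [line for line in lines if pattern.fullmatch(line)]


if __name__ == "__main__":
    print(matching(once, LINES))
    print(matching(each, LINES))
```

With this sample, the once pattern keeps <p>---别条款适用</p> (5 lines in all), while the each pattern returns only the 4 symbol-only lines listed above.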
            

            Now, I improved your regex into a multi-line regex, written in free-spacing mode, where any char considered as a word character ( \w ) is forbidden, except for the underscore character, thanks to the syntax [\W_] in a positive look-ahead tested at every position between <p> and </p>

            MARK (?xs-i) ^ \s* <p> ( (?= [\W_] ) . )* </p>

            So, for example, with this INPUT text :

            <p>下列特
            别条款适用</p>
                <p>         。</p>
            <p>--
            -</p>  
               <p>__
             ____</p>  
            <p>---下列特
            别条款适用 @@@</p>
             <p>   </p> 
            <p>于本保险单的
            各个部分,若其
            </p>
              <p>_____
            ----
            ####  		
               </p>
            <p> Not allowed</p>
               <p>~~~
               ~~</p>
            

            It would ONLY mark these lines, in 6 matches :

                <p>         。</p>
            <p>--
            -</p>  
               <p>__
             ____</p>  
             <p>   </p> 
              <p>_____
            ----
            ####  		
               </p>
               <p>~~~
               ~~</p>
            

            Best Regards,

            guy038

            Note that the syntax (?xs-i) ^ \s* <p> ( (?= [\W_] ) . )*? </p>, with a question mark near the end of the regex, is not mandatory : the </p> string CANNOT be found between the boundaries <p> and </p>, because each char must be a non-word char or an underscore, thus different from the letter p !
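
For readers who want to try the multi-line regex outside Notepad++, here is a hedged Python sketch. It assumes that re.VERBOSE, re.DOTALL and re.MULTILINE behave closely enough to the (?xs) modifiers and the line-based ^ used above, and that Python's Unicode-aware \W treats the Chinese characters as word characters. The sample text is the INPUT above with trailing blanks dropped, and the variable names are illustrative:

```python
import re

TEXT = """<p>下列特
别条款适用</p>
    <p>         。</p>
<p>--
-</p>
   <p>__
 ____</p>
<p>---下列特
别条款适用 @@@</p>
 <p>   </p>
<p>于本保险单的
各个部分,若其
</p>
  <p>_____
----
####
   </p>
<p> Not allowed</p>
   <p>~~~
   ~~</p>
"""

FLAGS = re.VERBOSE | re.DOTALL | re.MULTILINE

# Greedy form: every char between <p> and </p> must be a non-word char or "_".
greedy = re.compile(r"^ \s* <p> (?: (?= [\W_] ) . )* </p>", FLAGS)

# Lazy form, with the optional question mark discussed in the note.
lazy = re.compile(r"^ \s* <p> (?: (?= [\W_] ) . )*? </p>", FLAGS)

matches = [m.group(0) for m in greedy.finditer(TEXT)]
lazy_matches = [m.group(0) for m in lazy.finditer(TEXT)]

if __name__ == "__main__":
    for number, block in enumerate(matches, start=1):
        print(number, repr(block))
```

On this sample both forms give the same 6 matches, which agrees with the remark that the question mark is not mandatory.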

              Hellena Crainicu @guy038

              @guy038 super answer. Thanks a lot. Merry Christmas !

                Hellena Crainicu @guy038

                by the way, @guy038

                Is there a regex for the reverse case? I want to find the opposite: only tags that contain characters.

                In the example, I need to find only the first and the last line.

                <p>下列特别条款适用</p>
                <p>         。</p>
                <p>---</p>  
                 <p>______</p>  
                 <p> </p> 
                <p>于本保险单的各个部分,若其</p>
                

                What would be the regex formula in this case?

                  mkupper @Hellena Crainicu

                  @Hellena-Crainicu said:

                  by the way, @guy038

                  Is there a regex for the reverse case? I want to find the opposite: only tags that contain characters.

                  In the example, I need to find only the first and the last line.

                  <p>下列特别条款适用</p>
                  <p>         。</p>
                  <p>---</p>  
                   <p>______</p>  
                   <p> </p> 
                  <p>于本保险单的各个部分,若其</p>
                  

                  What would be the regex formula in this case?

                  The CJK Unified Ideographs run from \x{4E00} to \x{9FFF}.
                  You also use ， which is \x{FF0C}, the fullwidth comma.
                  You can use <p>[\x{4E00}-\x{9FFF}\x{FF0C}]+</p> to match any paragraph containing only CJK Unified Ideographs and/or the fullwidth comma.

                  Negation in regular expressions is tricky, as it means everything in the character set that is not something. For example, [^\x{4E00}-\x{9FFF}\x{FF0C}] matches everything in the character set that is not a CJK Unified Ideograph or the fullwidth comma. It matches end-of-line characters and everything else, including the characters <, /, p, and >.

                  Thus we use <p>[^\x{4E00}-\x{9FFF}\x{FF0C}<]*</p>. It matches the leading <p>, then [^\x{4E00}-\x{9FFF}\x{FF0C}<]* matches any run of characters that are not a CJK Unified Ideograph, the fullwidth comma, or the < left angle bracket, and finally the trailing </p>, which itself starts with a < left angle bracket. Excluding < keeps the match from running past the closing tag.

                  Note that this works well for Basic Multilingual Plane characters, which are \x{0000} to \x{ffff}. If your text files and/or regular expressions contain extended Unicode characters from U+10000 to U+10FFFF, then you can run into issues.
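
The two character classes above can be tried with a small Python sketch. This is only an approximation of the Notepad++ behaviour: it assumes Python's re module, where \x{4E00} is written \u4E00, and it assumes the last sample line uses the fullwidth comma U+FF0C. The names SAMPLE, only_cjk and no_cjk are made up for the example:

```python
import re

SAMPLE = [
    "<p>下列特别条款适用</p>",
    "<p>         。</p>",
    "<p>---</p>",
    "<p>______</p>",
    "<p> </p>",
    "<p>于本保险单的各个部分\uFF0C若其</p>",  # \uFF0C is the fullwidth comma
]

# Paragraphs made ONLY of CJK Unified Ideographs and/or the fullwidth comma.
only_cjk = re.compile(r"<p>[\u4E00-\u9FFF\uFF0C]+</p>")

# Paragraphs with NO CJK ideograph, no fullwidth comma and no "<" inside.
no_cjk = re.compile(r"<p>[^\u4E00-\u9FFF\uFF0C<]*</p>")

cjk_lines = [i for i, line in enumerate(SAMPLE) if only_cjk.search(line)]
other_lines = [i for i, line in enumerate(SAMPLE) if no_cjk.search(line)]

if __name__ == "__main__":
    print("only CJK :", cjk_lines)
    print("no CJK   :", other_lines)
```

Here only_cjk picks the first and the last line (indexes 0 and 5) and no_cjk picks the four lines in between.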

                    Hellena Crainicu @mkupper

                    @mkupper your regex works well for Chinese, thanks. But I forgot to change the last line: instead of Chinese, it should contain Latin words.

                    In this case, your regex would not find the last line. So both lines must be found.

                    <p>下列特别条款适用</p>
                    <p>         。</p>
                    <p>---</p>  
                     <p>______</p>  
                     <p> </p> 
                    <p>I love reading books</p>
                    
                      mkupper @Hellena Crainicu

                      @Hellena-Crainicu said:

                      In this case, your regex would not find the last line. So both lines must be found.

                      As I have posted before, I encourage you to experiment and think as that is what will lead to learning how to use something. You have already been provided with the nuts and bolts of how to create regular expressions including things such as (this|or|that) that could easily be used to handle (Chinese|Latino).

                        guy038

                        Hi, @hellena-crainicu,

                        I suppose that the best would be to know which characters you’re looking for ( both word and symbol/punctuation characters )

                        So, from these two tables below, could you tell us which characters you are supposed to search for, among all your files :


                        1) Regarding Word characters :

                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                            |  A | Basic LATIN ( ASCII )              | 0000 - 007F | x | (?s).*[\x{0000}-\x{007F}] | http://www.unicode.org/charts/PDF/U0000.pdf |
                            |  B | LATIN-1 Supplement                 | 0080 - 00FF | • | (?s).*[\x{0080}-\x{00FF}] | http://www.unicode.org/charts/PDF/U0080.pdf |
                            |  C | LATIN Extended-A                   | 0100 - 017F | • | (?s).*[\x{0100}-\x{017F}] | http://www.unicode.org/charts/PDF/U0100.pdf |
                            |  D | LATIN Extended-B                   | 0180 - 024F | • | (?s).*[\x{0180}-\x{024F}] | http://www.unicode.org/charts/PDF/U0180.pdf |
                            |  E | LATIN Extended Additional          | 1E00 - 1EFF | • | (?s).*[\x{1E00}-\x{1EFF}] | http://www.unicode.org/charts/PDF/U1E00.pdf |
                            |  F | LATIN Extended-C                   | 2C60 - 2C7F | • | (?s).*[\x{2C60}-\x{2C7F}] | http://www.unicode.org/charts/PDF/U2C60.pdf |
                            |  G | LATIN Extended-D                   | A720 - A7FF | • | (?s).*[\x{A720}-\x{A7FF}] | http://www.unicode.org/charts/PDF/UA720.pdf |
                            |  H | LATIN Extended-E                   | AB30 - AB6F | • | (?s).*[\x{AB30}-\x{AB6F}] | http://www.unicode.org/charts/PDF/UAB30.pdf |
                            |  I | HALFWIDTH/FULLWIDTH Forms          | FF00 - FFEF | • | (?s).*[\x{FF00}-\x{FFEF}] | http://www.unicode.org/charts/PDF/UFF00.pdf |
                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                            |  J | CJK Radicals Supplement            | 2E80 - 2EFF | • | (?s).*[\x{2E80}-\x{2EFF}] | http://www.unicode.org/charts/PDF/U2E80.pdf |
                            |  K | KANGXI Radicals                    | 2F00 - 2FDF | • | (?s).*[\x{2F00}-\x{2FDF}] | http://www.unicode.org/charts/PDF/U2F00.pdf |
                            |  L | CJK Unified Ideographs Extension A | 3400 - 4DBF | • | (?s).*[\x{3400}-\x{4DBF}] | http://www.unicode.org/charts/PDF/U3400.pdf |
                            |  M | CJK Unified Ideographs (Han )      | 4E00 - 9FFF | x | (?s).*[\x{4E00}-\x{9FFF}] | http://www.unicode.org/charts/PDF/U4E00.pdf |
                            |  N | CJK Compatibility Ideographs       | F900 - FAFF | • | (?s).*[\x{F900}-\x{FAFF}] | http://www.unicode.org/charts/PDF/UF900.pdf |
                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                        

                        2) Regarding Symbol/Punctuation characters :

                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                            |  1 | ASCII Punctuation                  | 0000 - 007F | x | (?s).*[\x{0000}-\x{007F}] | http://www.unicode.org/charts/PDF/U0000.pdf |
                            |  2 | LATIN-1 Punctuation                | 0080 - 00FF | • | (?s).*[\x{0080}-\x{00FF}] | http://www.unicode.org/charts/PDF/U0080.pdf |
                            |  3 | GENERAL Punctuation                | 2000 - 206F | • | (?s).*[\x{2000}-\x{206F}] | http://www.unicode.org/charts/PDF/U2000.pdf |
                            |  4 | SUPPLEMENTAL Punctuation           | 2E00 - 2E7F | • | (?s).*[\x{2E00}-\x{2E7F}] | http://www.unicode.org/charts/PDF/U2E00.pdf |
                            |  5 | VERTICAL Forms                     | FE10 - FE1F | • | (?s).*[\x{FE10}-\x{FE1F}] | http://www.unicode.org/charts/PDF/UFE10.pdf |
                            |  6 | SMALL form Variants                | FE50 - FE6F | • | (?s).*[\x{FE50}-\x{FE6F}] | http://www.unicode.org/charts/PDF/UFE50.pdf |
                            |  7 | HALFWIDTH/FULLWIDTH LATIN Forms    | FF00 - FFEF | x | (?s).*[\x{FF00}-\x{FFEF}] | http://www.unicode.org/charts/PDF/UFF00.pdf |
                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                            |  8 | Ideographic Description characters | 2FF0 - 2FFF | • | (?s).*[\x{2FF0}-\x{2FFF}] | http://www.unicode.org/charts/PDF/U2FF0.pdf |
                            |  9 | CJK Symbols and Punctuation        | 3000 - 303F | x | (?s).*[\x{3000}-\x{303F}] | http://www.unicode.org/charts/PDF/U3000.pdf |
                            | 10 | Enclosed CJK Letters and Months    | 3200 - 32FF | • | (?s).*[\x{3200}-\x{32FF}] | http://www.unicode.org/charts/PDF/U3200.pdf |
                            | 11 | CJK Compatibility                  | 3300 - 33FF | • | (?s).*[\x{3300}-\x{33FF}] | http://www.unicode.org/charts/PDF/U3300.pdf |
                            | 12 | CJK Strokes                        | 31C0 - 31EF | • | (?s).*[\x{31C0}-\x{31EF}] | http://www.unicode.org/charts/PDF/U31C0.pdf |
                            | 13 | CJK Compatibility Forms            | FE30 - FE4F | • | (?s).*[\x{FE30}-\x{FE4F}] | http://www.unicode.org/charts/PDF/UFE30.pdf |
                            | 14 | Halfwidth CJK Punctuation          | FF61 - FF64 | • | (?s).*[\x{FF61}-\x{FF64}] | http://www.unicode.org/charts/PDF/UFF00.pdf |
                            •----•------------------------------------•-------------•---•---------------------------•---------------------------------------------•
                        

                        Then, it should be easier to see what must be matched / unmatched ;-))


                        Not a big task, anyway ! We already know that :

                        • The word chars, in lines A and M, must be considered

                        • The Symbol/Punctuation chars, in lines 1, 7 ( because \x{FF0C} ) and 9 ( because \x{3002} ), must be taken into account, too


                        Thus, if all your files are located in a single folder :

                        • Open the Find in Files dialog ( Shift + Ctrl + F )

                        • Uncheck the Match whole word only and Match case options, if necessary

                        • Enter $0 in the Replace with : zone

                        • Enter your file extension in the Filters : zone

                        • Enter your searched folder in the Directory : zone

                        • Select the Regular expression search mode

                        • For each line which contains a • in the fourth column of the above tables

                          • Enter the corresponding regex in the Find what : zone

                          • Click on the Find All button ( Only 1 result per file )

                          • If one or several result(s) for a specific line occur(s), this means that its searched range must be taken into account

                        => So, just let us know about all the remaining lines which should be considered too !
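
As a rough alternative to repeating the Find in Files search by hand, the same check can be sketched in Python. This is only an illustration under assumptions: it uses Python's re module (with \uXXXX instead of \x{XXXX}), it lists just a few of the • ranges from the tables above, and RANGES and ranges_present are hypothetical names:

```python
import re

# A few of the "•" lines from the tables above: label -> (first, last) code point.
RANGES = {
    "B  LATIN-1 Supplement": (0x0080, 0x00FF),
    "C  LATIN Extended-A": (0x0100, 0x017F),
    "3  GENERAL Punctuation": (0x2000, 0x206F),
    "L  CJK Unified Ideographs Extension A": (0x3400, 0x4DBF),
    "13 CJK Compatibility Forms": (0xFE30, 0xFE4F),
}


def ranges_present(text, ranges=RANGES):
    """Return the labels of the ranges with at least one character in text."""
    found = []
    for label, (first, last) in ranges.items():
        # Builds e.g. [\u0080-\u00FF], the Python spelling of [\x{0080}-\x{00FF}].
        pattern = re.compile(f"[\\u{first:04X}-\\u{last:04X}]")
        if pattern.search(text):
            found.append(label)
    return found


if __name__ == "__main__":
    sample = "caf\u00e9 \u201cquoted\u201d"  # é (U+00E9) and curly quotes (U+201C, U+201D)
    print(ranges_present(sample))
```

For the sample string, the function reports the LATIN-1 Supplement and GENERAL Punctuation ranges. The text of each file would have to be read separately and passed to the function.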


                        Two remarks :

                        • I deliberately omitted all Unicode ranges beyond the BMP, i.e. with values over \x{FFFF}

                        • I also omitted any Greek, Cyrillic, Hebrew, Arabic, Hangul and Japanese Unicode ranges. If necessary, tell me about it !

                        Best Regards,

                        guy038

                          Hellena Crainicu

                          @guy038 @mkupper

                          I found the solution:

                          <p>下列特别条款适用</p>
                          <p>     。</p>
                          <p>---</p>  
                           <p>______</p>  
                           <p> </p> 
                          <p>I love reading books</p>
                          <p>:    </p>
                          

                          The two regexes below find the tags that contain Latin or Chinese characters:

                          FIND: <p>.*[\x{4e00}-\x{9fff}a-zA-Z]+.*<\/p>

                          or this:

                          FIND: <p>.*[\x{4E00}-\x{9FFF}\p{Latin}].*<\/p>
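
A quick way to check the first FIND regex against the sample is this Python sketch. It assumes Python's re module (so \x{4e00} becomes \u4e00 and the escaped \/ is written as a plain /), with one paragraph per line, and the names SAMPLE, with_letters and hits are illustrative:

```python
import re

SAMPLE = [
    "<p>下列特别条款适用</p>",
    "<p>     。</p>",
    "<p>---</p>",
    "<p>______</p>",
    "<p> </p>",
    "<p>I love reading books</p>",
    "<p>:    </p>",
]

# At least one Chinese or ASCII letter somewhere between <p> and </p>.
with_letters = re.compile(r"<p>.*[\u4e00-\u9fffa-zA-Z]+.*</p>")

hits = [line for line in SAMPLE if with_letters.search(line)]

if __name__ == "__main__":
    for line in hits:
        print(line)
```

On this sample only the Chinese line and the English line are reported. One caveat: the p of a closing </p> is itself a Latin letter, so a line holding two paragraphs, such as <p>---</p><p>###</p>, would also match.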

                          I made other tests, such as the following:

                          Find only tags with Chinese characters:

                          <p>(?=.*[\x{4e00}-\x{9fff}])(?=.*[a-zA-Z]).*<\/p>
                          <p>(?=.*[\x{4e00}-\x{9fff}])[a-zA-Z\x{4e00}-\x{9fff}].*<\/p>
                          <p>(?=[^\x00-\x7F]+)[^\x00-\x7F]+<\/p>
                          <p>(?:(?=[^\x00-\x7F]+)[^\x00-\x7F]+|[a-zA-Z]+)<\/p>
                          <p>(?:(?![a-zA-Z])[^\x00-\x7F]+|[a-zA-Z]+)<\/p>
                          <p>(?:[\x{4e00}-\x{9fff}]|[a-zA-Z])+<\/p>

                          Find only tags with Latin characters:

                          <p>.*([\p{Latin}\p{Han}]).*<\/p>

                            guy038

                            Hello, @hellena-crainicu and All,

                            I can assure you that your sub-regexes [\x{4E00}-\x{9FFF}\p{Latin}] and [\p{Latin}\p{Han}] should NOT have worked as expected, during your tests !!

                            Indeed, our present Boost regex engine does NOT support :

                            • The normal Unicode syntax \p{name} like, for example, in \p{Lu} or \p{IsCyrillic}

                            • The inverse Unicode syntax \P{name} like, for example, in \P{C} or \P{IsGreek}


                            Thus, more simply :

                            • Your sub-regex [\x{4E00}-\x{9FFF}\p{Latin}] would match any char from the CJK Unified Ideographs Unicode block or the latin letters p, L, a, t, i, n and the two symbols { and }

                            • Your sub-regex [\p{Latin}\p{Han}] would match the Latin letters H, L, a, i, n, p, t and the two symbols { and }


                            IF Boost had been compiled to support Unicode and ICU, the regex syntaxes, mentioned at the beginning, would have been possible.

                            For instance, the regex (?=\P{L})\p{IsLetterlikeSymbols} or the regex (?!\p{L})\p{IsLetterlikeSymbols} would match ANY single Letterlike Symbols character which does NOT belong to the General Category Letter

                            So, given my present Consolas font, installed on my Windows10 laptop, it would return the 34 characters, of the list below, out of the 80 characters of the Letterlike Symbols block !

                            Refer to http://www.unicode.org/charts/PDF/U2100.pdf

                            •--------•------------------------------------•-------•--------------------•---------•
                            |  Code  |           Character Name           |   GC  |  General Category  |   Char  |
                            •--------•------------------------------------•-------•--------------------•---------•
                            |  2118  |  SCRIPT CAPITAL P                  |   Sm  |    Symbol, math    |    ℘    |
                            |  2140  |  DOUBLE-STRUCK N-ARY SUMMATION     |   Sm  |    Symbol, math    |    ⅀    |
                            |  2141  |  TURNED SANS-SERIF CAPITAL G       |   Sm  |    Symbol, math    |    ⅁    |
                            |  2142  |  TURNED SANS-SERIF CAPITAL L       |   Sm  |    Symbol, math    |    ⅂    |
                            |  2143  |  REVERSED SANS-SERIF CAPITAL L     |   Sm  |    Symbol, math    |    ⅃    |
                            |  2144  |  TURNED SANS-SERIF CAPITAL Y       |   Sm  |    Symbol, math    |    ⅄    |
                            |  214B  |  TURNED AMPERSAND                  |   Sm  |    Symbol, math    |    ⅋    |
                            •--------•------------------------------------•-------•--------------------•---------•
                            |  2100  |  ACCOUNT OF                        |   So  |    Symbol, other   |    ℀    |
                            |  2101  |  ADDRESSED TO THE SUBJECT          |   So  |    Symbol, other   |    ℁    |
                            |  2103  |  DEGREE CELSIUS                    |   So  |    Symbol, other   |    ℃    |
                            |  2104  |  CENTRE LINE SYMBOL                |   So  |    Symbol, other   |    ℄    |
                            |  2105  |  CARE OF                           |   So  |    Symbol, other   |    ℅    |
                            |  2106  |  CADA UNA                          |   So  |    Symbol, other   |    ℆    |
                            |  2108  |  SCRUPLE                           |   So  |    Symbol, other   |    ℈    |
                            |  2109  |  DEGREE FAHRENHEIT                 |   So  |    Symbol, other   |    ℉    |
                            |  2114  |  L B BAR SYMBOL                    |   So  |    Symbol, other   |    ℔    |
                            |  2116  |  NUMERO SIGN                       |   So  |    Symbol, other   |    №    |
                            |  2117  |  SOUND RECORDING COPYRIGHT         |   So  |    Symbol, other   |    ℗    |
                            |  211E  |  PRESCRIPTION TAKE                 |   So  |    Symbol, other   |    ℞    |
                            |  211F  |  RESPONSE                          |   So  |    Symbol, other   |    ℟    |
                            |  2120  |  SERVICE MARK                      |   So  |    Symbol, other   |    ℠    |
                            |  2121  |  TELEPHONE SIGN                    |   So  |    Symbol, other   |    ℡    |
                            |  2122  |  TRADE MARK SIGN                   |   So  |    Symbol, other   |    ™    |
                            |  2123  |  VERSICLE                          |   So  |    Symbol, other   |    ℣    |
                            |  2125  |  OUNCE SIGN                        |   So  |    Symbol, other   |    ℥    |
                            |  2127  |  INVERTED OHM SIGN                 |   So  |    Symbol, other   |    ℧    |
                            |  2129  |  TURNED GREEK SMALL LETTER IOTA    |   So  |    Symbol, other   |    ℩    |
                            |  212E  |  ESTIMATED SYMBOL                  |   So  |    Symbol, other   |    ℮    |
                            |  213A  |  ROTATED CAPITAL Q                 |   So  |    Symbol, other   |    ℺    |
                            |  213B  |  FACSIMILE SIGN                    |   So  |    Symbol, other   |    ℻    |
                            |  214A  |  PROPERTY LINE                     |   So  |    Symbol, other   |    ⅊    |
                            |  214C  |  PER SIGN                          |   So  |    Symbol, other   |    ⅌    |
                            |  214D  |  AKTIESELSKAB                      |   So  |    Symbol, other   |    ⅍    |
                            |  214F  |  SYMBOL FOR SAMARITAN SOURCE       |   So  |    Symbol, other   |    ⅏    |
                            •--------•------------------------------------•-------•--------------------•---------•
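
The list above can be approximated outside Notepad++ with a short Python sketch. Python's standard re module does not support \p{...} either, so this example swaps the regex for a filter on the General Category given by the unicodedata module. The block bounds U+2100 to U+214F come from the chart referenced above, and the function name is illustrative:

```python
import unicodedata


def letterlike_non_letters():
    """Letterlike Symbols characters (U+2100..U+214F) that are not letters."""
    return [
        chr(code)
        for code in range(0x2100, 0x2150)
        # General Categories Lu, Ll, Lt, Lm and Lo all start with "L".
        if not unicodedata.category(chr(code)).startswith("L")
    ]


if __name__ == "__main__":
    for char in letterlike_non_letters():
        category = unicodedata.category(char)
        name = unicodedata.name(char, "?")
        print(f"{ord(char):04X}  {category}  {name}  {char}")
```

The exact result depends on the Unicode data shipped with the Python version in use.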
                            

                            Best Regards,

                            guy038

                              Hellena Crainicu @guy038

                              @guy038 thanks, do you have a better solution?

                                guy038

                                Hello, @hellena-crainicu,

                                Well, simply answer my previous post !

                                Best Regards

                                guy038

                                  Hellena Crainicu @guy038

                                  @guy038 happy new year, all notepad++ team !! The best editor on earth !
