    How to delete all text except proper names?

    8 Posts 5 Posters 627 Views
    • Petr  AndreevP
      Petr Andreev

      I have a text. It is necessary to leave only the list of proper names.

      Here it is:
      Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.

      I need to turn it into this:
      Swamp Orkhon
      Holokhoi Oyuun
      Symattai the Blacksmith
      Hara Tumen
      Uluu Horo
      Omogoy
      Ellei Bootur
      Great Ebe

      • Terry RT
        Terry R @Petr Andreev

        @Petr-Andreev said in How to delete all text except proper names?:

        It is necessary to leave only the list of proper names.

        I would suggest this will be an almost impossible task for a regular expression to get totally right.

        Matching words with a capitalized first character is not a problem in itself. But the first word in a sentence is always capitalized, and it may be a person’s name rather than, as in the example, the word “Sent”. Do we include it or not?

        Then there’s the case of “Symattai the blacksmith”, where, I might add, you altered it so that “blacksmith” became “Blacksmith”. So we would also need to match 2 capitalized words separated by one other word. But that would also select “Omogoy and Ellei” and leave out Bootur.

        I think this requires a level of heuristics and regex isn’t that. Regex requires strict rules that are followed to produce results and I don’t think the rules can be created well enough in this case to be certain of a good and trustworthy outcome.
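
The “Omogoy and Ellei” problem is easy to reproduce. The sketch below uses Python's `re` module rather than Notepad++'s Boost engine, and the pattern is only an illustration of the "capitalized word, one other word, capitalized word" rule described above, not a proposed solution. The sample string is a shortened version of the OP's sentence with "Blacksmith" capitalized.

```python
import re

text = ("Uluu Horo, Symattai the Blacksmith, "
        "the elder Omogoy and Ellei Bootur, proceeding from the banks")

# Capitalized word, any single word, capitalized word.
pattern = re.compile(r"\b[A-Z]\w+ \w+ [A-Z]\w+")

matches = pattern.findall(text)
print(matches)  # ['Symattai the Blacksmith', 'Omogoy and Ellei']
```

The rule that rescues "Symattai the Blacksmith" also glues two separate names together around "and" and strands "Bootur".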

        Terry

        • Terry RT
          Terry R

          @Terry-R said in How to delete all text except proper names?:

          I would suggest this will be an almost impossible task for a regular expression to get totally right.

          I decided to see how futile the effort would be to attempt a (series of) regex(es) to try and get close to the goal. I changed two words in the example (blacksmith to Blacksmith and added “The” between “Swamp” and “Orkhon”) in my quest to see how I might fare.

          1. We need to break lines apart at the punctuation marks. I figured no proper names would have these (hopefully).
            Using the Replace function I have
            Find What:[[:punct:]]
            Replace With:\r\n

          2. I capture sets of multiple capitalized words and send them to another line using the Replace function.
            Find What:(?-i)((\b[A-Z]\w+\h*){2,})
            Replace With:\r\n\1\r\n

          3. Look for 2 capitalised words separated by another word and send them to another line using the Replace function.
            Find What:(?-i)((\b[A-Z]\w+\h)\w+\h(\b[A-Z]\w+))
            Replace With:\r\n\1\r\n

          4. Remove any lower case words starting from the beginning of the line up until a capitalised word (or end of line) which will NOT be captured using the Replace function.
            Find What:(?-i)^(\h*[a-z]\w+\h*)+
            Replace With: empty field here

          5. Similar to step 4 but for the remaining portion of lines (behind the last capitalised word on a line) until the end, again using the Replace function.
            Find What:(?-i)(\h*\b[a-z]\w+\h*)+$
            Replace With: empty field here

          6. Use the Line Operation function to “Remove empty lines (containing blank characters)”.
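
For anyone who wants to replay the six steps outside Notepad++, here is a rough Python sketch. It is a port under stated assumptions, not the exact Boost behaviour: `string.punctuation` (ASCII only) stands in for `[[:punct:]]`, `[ \t]` stands in for `\h`, `\n` is used instead of `\r\n`, and step 6 also strips stray blanks around each line. The function name `extract_candidates` is made up for illustration.

```python
import re
import string

# ASCII stand-in for Boost's [[:punct:]]; curly quotes are not covered.
PUNCT = "[" + re.escape(string.punctuation) + "]"

def extract_candidates(text):
    # 1. Break lines apart at punctuation marks.
    text = re.sub(PUNCT, "\n", text)
    # 2. Move runs of two or more capitalized words onto their own line.
    text = re.sub(r"((\b[A-Z]\w+[ \t]*){2,})", r"\n\1\n", text)
    # 3. Same for two capitalized words separated by one other word.
    text = re.sub(r"((\b[A-Z]\w+[ \t])\w+[ \t](\b[A-Z]\w+))", r"\n\1\n", text)
    # 4. Drop lowercase words at the start of each line.
    text = re.sub(r"^([ \t]*[a-z]\w+[ \t]*)+", "", text, flags=re.M)
    # 5. Drop lowercase words at the end of each line.
    text = re.sub(r"([ \t]*\b[a-z]\w+[ \t]*)+$", "", text, flags=re.M)
    # 6. Remove empty lines (and surrounding blanks).
    return [line.strip() for line in text.splitlines() if line.strip()]

# The OP's sentence with the two changes mentioned above.
sample = (
    "Sent by them to the north in search of new lands, the daring daredevils "
    "of the Swamp The Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, "
    "Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, proceeding "
    "from the banks of the Great Ebe from top to bottom and from bottom to "
    "top, found three spacious valleys."
)

result = extract_candidates(sample)
print("\n".join(result))
```

Steps 4 and 5 use nested quantifiers, so on long lines with many lowercase words between capitalized ones the Python version may backtrack heavily; for a sentence of this size it is not noticeable.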

          I am left with the following. It does get close. Because I was aware of certain issues that might cause problems, I handled the “sets of capitalised words” earlier than the other steps to keep them together. It does, however, keep the first word in the sentence, as predicted.

          Resulting lines after the operations above (with slightly altered text as specified):

          Sent
          Swamp The Orkhon
          Holokhoi Oyuun
          Symattai the Blacksmith
          Hara Tumen
          Uluu Horo
          Omogoy
          Ellei Bootur
          Great Ebe
          

          Again I will reiterate my statement that this cannot be trusted to produce a good result, although, as my test shows, it might get reasonably close. Close inspection of the output should readily show where some additional work is warranted. This will, however, do the bulk of it and get you to within 95% of the goal. Regexes are not “sentient”; they will blindly follow whatever task is set before them, whether right or wrong! It requires a “more sentient” engine to complete this task correctly. Perhaps something like the Google Search engine.

          Terry

          • PeterJonesP
            PeterJones @Terry R

            @Terry-R said in How to delete all text except proper names?:

            see how futile

            @Petr-Andreev ,

            In case you weren’t yet convinced as to the futility of this request, let’s add a second paragraph:

            As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III.  They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun).
            

            After @Terry-R’s noble attempt at setting up an algorithm from your example paragraph, when this second paragraph is processed… :-(

            As a
            d
            Artagnan on his quest to find the bones of St
             
            Francis of Assisi
            
            Dr
             Watson
            Bill S
             Preston
             Esq
            
            George Wexford
            
            
            Smyth III
            
              They
            Dutch cyclist Mathieu
            Poel
            
            
            Dutch
            s a
            a
            

            As @Terry-R has indicated: this problem takes a level of intelligence (or enumeration of rules) that regex just doesn’t have.
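
Most of the damage above comes from the very first step, splitting at punctuation, because several of these names contain punctuation themselves. A small Python sketch of just that step, assuming ASCII `string.punctuation` as a stand-in for Boost's `[[:punct:]]`:

```python
import re
import string

# ASCII stand-in for [[:punct:]].
PUNCT = "[" + re.escape(string.punctuation) + "]"

names = ["d'Artagnan", "St. Francis of Assisi", "George Wexford-Smyth III"]

# Split each name exactly the way the "break at punctuation" step would.
pieces = {name: re.split(PUNCT, name) for name in names}

for name, parts in pieces.items():
    print(name, "->", parts)
# d'Artagnan -> ['d', 'Artagnan']
# St. Francis of Assisi -> ['St', ' Francis of Assisi']
# George Wexford-Smyth III -> ['George Wexford', 'Smyth III']
```

Once a name has been cut at its apostrophe, period or hyphen, none of the later steps can put it back together.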

            • Terry RT
              Terry R

              @PeterJones said in How to delete all text except proper names?:

              when this second paragraph is processed… :-(

              Thanks for that 2nd paragraph. I knew in my mind it was a futile effort but I (sadly) had to try (to prove my hypothesis). Of course my test actually appeared to show it was possible, something I’d hoped would not happen. But the reason it did was that my series of regexes was designed entirely with the first example in mind, to MAKE it work.

              @PeterJones’s example showed that there are so many different types of Proper Names lurking out there that it is impossible for regex to cover them all. My other statement, that my solution could get part (most) of the way there, was RUBBISH, as has been proven.

              About the ONLY thing I proved was that unless the OPs present us with examples which provide a good representation of the actual data, what they will receive back is next to useless. Oh and how many times have we seen that occur!

              Terry

              • astrosofistaA
                astrosofista @Petr Andreev

                @Petr-Andreev, @Terry-R, @PeterJones:

                I agree with both of you, Terry and Peter - regex is a resource of a syntactic order and not a semantic one, therefore it knows nothing about proper names. On the other hand, there is no need to use this tool alone, as it can be combined with others more suitable for the task, for example, a dictionary.

                Let me explain my point.

                As we all know, proper names usually begin with a capital letter, and in this particular case the rule applies. But many other common words can also be capitalized, so that criterion alone is not enough. However, there is another fact that we can take into account: the proper names quoted in OP’s text are completely non-English words, so they would be highlighted as misspelled words in a purely English dictionary, and this would allow us to distinguish them from English words that begin with a capital letter. This fact suggested to me an alternative approach to this particular text, an approach not suitable for other ones, as @PeterJones showed above.

                The approach, then, is to use the English dictionary provided by the DSpellCheck plugin to collect a list of proper names. First, the text needs to be modified with a couple of regexes to make the task doable. I am also going to consider Blacksmith instead of blacksmith.

                a) Replace spaces between first and last names with underscores:

                Search: (?-i)([A-Z][a-z]+)\K (?=[A-Z][a-z]+)
                Replace: _
                

                b) Replace the spaces around “the” in Symattai’s name and title (run it twice):

                Search: (?-i)(([A-Z][a-z]+)\K\x20(?=the))|((?<=_the)\x20(?=[A-Z][a-z]+))
                Replace: _
                

                to get the following text:

                Sent by them to the north in search of new lands, the daring daredevils of the Swamp_Orkhon, Holokhoi_Oyuun, Symattai_the_Blacksmith, Hara_Tumen, Uluu_Horo, the elder Omogoy and Ellei_Bootur, proceeding from the banks of the Great_Ebe from top to bottom and from bottom to top, found three spacious valleys.
                

                c) Select DSpellCheck from the Plugins menu and

                • Change Current Language -> English (United States) - Only this one, don’t mix languages!

                • In Settings...
                  Simple -> Library -> select Hunspell (it is the one I had installed; other dictionaries may work as well)
                  Advanced -> Ignore words -> Uncheck these three options: Starting with Capital Letter, Having not first Capital Letter, Having _.

                • Select Spell Check Document Automatically and all proper names should be highlighted. Then select Additional Actions -> Copy All Misspelled Words to Clipboard to get an ordered list of unique proper names, as follows:

                Ellei_Bootur
                Great_Ebe
                Hara_Tumen
                Holokhoi_Oyuun
                Omogoy
                Swamp_Orkhon
                Symattai_the_Blacksmith
                Uluu_Horo
                

                Afterwards you can turn the underscores in the collected names back into spaces.
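
The same idea can be sketched in Python for readers without the plugin. Everything dictionary-related here is an assumption: `ENGLISH_WORDS` is a tiny hand-made stand-in for the Hunspell en_US dictionary (it only lists the ordinary words of the sample sentence), and the "spell check" is a plain set lookup. Python's `re` has no `\K`, so the first word is captured and written back instead; regex b) is applied as two separate passes, which is what running it twice achieves.

```python
import re

# Hypothetical stand-in for a real English dictionary.
ENGLISH_WORDS = {
    "sent", "by", "them", "to", "the", "north", "in", "search", "of", "new",
    "lands", "daring", "daredevils", "elder", "and", "proceeding", "from",
    "banks", "top", "bottom", "found", "three", "spacious", "valleys",
}

def join_names(text):
    # a) Capitalized word + space + capitalized word -> underscore.
    text = re.sub(r"([A-Z][a-z]+) (?=[A-Z][a-z]+)", r"\1_", text)
    # b) first pass: "Name the" -> "Name_the".
    text = re.sub(r"([A-Z][a-z]+) (?=the)", r"\1_", text)
    # b) second pass: "_the Name" -> "_the_Name".
    text = re.sub(r"(?<=_the) (?=[A-Z][a-z]+)", "_", text)
    return text

def misspelled(text):
    # c) Sorted, unique tokens that the "dictionary" does not know.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return sorted({t for t in tokens if t.lower() not in ENGLISH_WORDS})

sample = (
    "Sent by them to the north in search of new lands, the daring daredevils "
    "of the Swamp Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, Hara Tumen, "
    "Uluu Horo, the elder Omogoy and Ellei Bootur, proceeding from the banks "
    "of the Great Ebe from top to bottom and from bottom to top, found three "
    "spacious valleys."
)

flagged = misspelled(join_names(sample))
names = [token.replace("_", " ") for token in flagged]  # underscores back to spaces
print("\n".join(names))
```

With a real dictionary the result depends on which words it contains, so the output of this sketch should be read as a best case for this one sentence.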

                I think it does a good job with OP’s text, and although it may fail —surely it would!— with other examples or in other cases, it can also be improved. On the other hand, my main point was to show another approach to deal with this kind of issues. Hope it can suggest people new ideas to solve some recurrent problems.

                Have fun!

                • Terry RT
                  Terry R

                  @astrosofista said in How to delete all text except proper names?:

                  On the other hand, my main point was to show another approach to deal with this kind of issues.

                  Dare I say it (or as Sherlock Holmes said):
                  “The game is afoot”!

                  It actually looks now like a possibility. But do we throw time and resources at it? And to what end (a proof of concept)? I dare say collectively we could create quite a passable solution. That’s the beauty of this “collective mind” on the forum: we can all feed off each other’s ideas.

                  However, I’m content that we’ve shown the OP how hard it is, especially when presented with another, different paragraph. I think it would require a lot more effort by someone to take this to the bitter (sweet) end. All I can say is good luck to anyone who wishes to give it a go.

                  Terry

                  • guy038G
                    guy038

                    Hello, @petr-andreev, @terry-R, @astrosofista and All,

                    @terry-r and @peterjones are right about it. For instance, the first word of your example, Sent is obviously not a proper name but this long sentence could have started with a proper name, too ! Regexes are really not a fair tool to solve semantic problems ;-))

                    @Astrosofista, I do like your approach, too ! As for me, I succeeded in building a regex which catches, “more or less”, all the proper names and compound proper names of the OP’s text !

                    I modified your example, adding some dummy text as well as the @peterjones text, for a deeper test. Of course, the first word of any sentence is considered a proper name. And, although your first sentence, slightly modified, seems to give good results, the results for the @peterjones text, coming next, are not so good !

                    Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, as well as the general MacArthur, O'Neil the Scotsman , A SpecialNicknameForTest; Cardinal Mazarin, la Marquise de Maintenon,  Louis "Le Grand Dauphin" and Louis XIV the Great or Sun King, his father, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.
                    
                    As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III.  They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun).
                    

                    The following regex S/R :

                    SEARCH (?x-is).*?(\u[\l\u'."]+(((?!and|or)[^,;.:?!\r\n]){1,5}?\u[\l\u'."]+){0,3})|.+(\R)?

                    REPLACE ?1\1(?4:\r\n)

                    Would output this text :

                    Sent
                    Swamp Orkhon
                    Holokhoi Oyuun
                    Symattai the Blacksmith
                    Hara Tumen
                    Uluu Horo
                    Omogoy
                    Ellei Bootur
                    MacArthur
                    O'Neil the Scotsman
                    SpecialNicknameForTest
                    Cardinal Mazarin
                    Marquise de Maintenon
                    Louis "Le Grand Dauphin"
                    Louis XIV the Great
                    Sun King
                    Great Ebe
                    
                    As
                    Artagnan
                    St. Francis of Assisi
                    Dr. Watson
                    Bill S. Preston
                    Esq
                    George Wexford-Smyth III.
                    They
                    Dutch
                    Mathieu
                    Poel
                    Dutch
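
For experimenting with this search outside Notepad++, here is a rough Python port. It is a sketch under assumptions: Python's `re` has no `\u` / `\l` classes, so ASCII `[A-Z]` and `[A-Za-z]` are used instead; there is no conditional replacement syntax, so a callback plays the role of `?1\1(?4:\r\n)`; and `\n` replaces `\R`. The names `NAME`, `GAP` and `extract_names` are made up for illustration.

```python
import re

TEXT = """Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, as well as the general MacArthur, O'Neil the Scotsman , A SpecialNicknameForTest; Cardinal Mazarin, la Marquise de Maintenon,  Louis "Le Grand Dauphin" and Louis XIV the Great or Sun King, his father, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.

As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III.  They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun)."""

# An uppercase letter followed by letters, apostrophes, periods or quotes.
NAME = r"""[A-Z][A-Za-z'."]+"""
# One to five separator characters that are not sentence punctuation
# and do not begin the words "and" / "or".
GAP = r"(?:(?!and|or)[^,;.:?!\r\n]){1,5}?"
# Either: skip ahead to a name made of up to four NAME parts (group 1),
# or: swallow the rest of the line and its line break (group 2).
PATTERN = re.compile(r".*?(" + NAME + "(?:" + GAP + NAME + r"){0,3})|.+(\n)?")

def keep_name(match):
    # Keep only the captured name, on a line of its own; drop everything else.
    if match.group(1):
        return match.group(1) + "\n"
    return ""

def extract_names(text):
    replaced = PATTERN.sub(keep_name, text)
    return [line for line in replaced.splitlines() if line]

names = extract_names(TEXT)
print("\n".join(names))
```

As in the Notepad++ run, the first word of each sentence ("Sent", "As", "They") is kept, "d'Artagnan" loses its "d'" and "Mathieu van der Poel" is split in two.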
                    

                    Remark : if you add the new rule that “Any sentence will never begin with a proper name”, it should even be possible to avoid the first word of any sentence ! However, the @peterjones and @Terry-R thoughts should convince you of the limits of regexes, in this matter !

                    Test on your real text to see if any other problem occurs ! I do not have time, presently, for explanations of this search regex, but I will, next time !

                    See you later,

                    Best Regards,

                    guy038

                    The Community of users of the Notepad++ text editor.