How to delete all text except proper names?



  • I have a text. It is necessary to leave only the list of proper names.

    Here is the text:
    Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.

    I need the result to look like this:
    Swamp Orkhon
    Holokhoi Oyuun
    Symattai the Blacksmith
    Hara Tumen
    Uluu Horo
    Omogoy
    Ellei Bootur
    Great Ebe



  • @Petr-Andreev said in How to delete all text except proper names?:

    It is necessary to leave only the list of proper names.

    I would suggest this will be an almost impossible task for a regular expression to get totally right.

    If we mean words with a capitalized first character, that part is not a problem. But the first word in a sentence is always capitalized, and since it may be a person’s name rather than, as in the example, the word “Sent”, do we include it or not?

    Then there’s the situation of the “Symattai the blacksmith”, where I might add you altered it so the “blacksmith” became “Blacksmith”. So we now consider 2 capitalized words separated by one other word. But now that would select “Omogoy and Ellei” and leave Bootur.

    I think this requires a level of heuristics and regex isn’t that. Regex requires strict rules that are followed to produce results and I don’t think the rules can be created well enough in this case to be certain of a good and trustworthy outcome.
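
    To make the two problems above concrete, here is a minimal Python sketch (Python's re module is my assumption here, it is not the Notepad++ regex engine, and the two patterns are only illustrations of the naive rules, not a proposed solution). It runs both rules on the original sentence:

```python
import re

# The OP's sentence, unchanged (lowercase "blacksmith").
text = ("Sent by them to the north in search of new lands, the daring "
        "daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the "
        "blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei "
        "Bootur, proceeding from the banks of the Great Ebe from top to "
        "bottom and from bottom to top, found three spacious valleys.")

# Naive rule 1: every word that starts with a capital letter.
capitalized = re.findall(r"\b[A-Z]\w+", text)

# Naive rule 2: two capitalized words separated by exactly one other word.
bridged = re.findall(r"\b[A-Z]\w+ \w+ [A-Z]\w+", text)

print(capitalized)  # includes "Sent", which is not a name
print(bridged)      # picks up "Omogoy and Ellei" and strands "Bootur"
```

    Rule 1 cannot tell “Sent” from a name, and rule 2 glues two different people together, which is exactly the kind of judgement a regex cannot make.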

    Terry



  • @Terry-R said in How to delete all text except proper names?:

    I would suggest this will be an almost impossible task for a regular expression to get totally right.

    I decided to see how futile the effort would be to attempt a (series of) regex(es) to try and get close to the goal. I changed two words in the example (blacksmith to Blacksmith and added “The” between “Swamp” and “Orkhon”) in my quest to see how I might fare.

    1. We need to break lines apart at the punctuation marks. I figured no proper names would have these (hopefully).
      Using the Replace function I have
      Find What:[[:punct:]]
      Replace With:\r\n

    2. I capture sets of multiple capitalized words and send them to another line using the Replace function.
      Find What:(?-i)((\b[A-Z]\w+\h*){2,})
      Replace With:\r\n\1\r\n

    3. Look for 2 capitalised words separated by another word and send them to another line using the Replace function.
      Find What:(?-i)((\b[A-Z]\w+\h)\w+\h(\b[A-Z]\w+))
      Replace With:\r\n\1\r\n

    4. Remove any lower case words starting from the beginning of the line up until a capitalised word (or end of line) which will NOT be captured using the Replace function.
      Find What:(?-i)^(\h*[a-z]\w+\h*)+
      Replace With: empty field here

    5. Similar to step 4 but for the remaining portion of lines (behind the last capitalised word on a line) until the end, again using the Replace function.
      Find What:(?-i)(\h*\b[a-z]\w+\h*)+$
      Replace With: empty field here

    6. Use the Line Operation function to “Remove empty lines (containing blank characters)”.
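
    For anyone who would rather script this than click through the Replace dialog, the six steps can be sketched in Python roughly as follows. This is an approximation under stated assumptions, not an exact port: Python's re module has no [[:punct:]] or \h, so I substitute string.punctuation (ASCII only) and [ \t], and the function name extract_candidates is made up for this example.

```python
import re
import string

def extract_candidates(text):
    """Approximate the six Notepad++ steps with Python's re module."""
    # Step 1: break the text into lines at every ASCII punctuation mark
    # (stand-in for [[:punct:]]).
    text = re.sub("[" + re.escape(string.punctuation) + "]", "\n", text)
    # Step 2: move runs of two or more capitalized words onto their own line.
    text = re.sub(r"((?:\b[A-Z]\w+[ \t]*){2,})", r"\n\1\n", text)
    # Step 3: same for two capitalized words separated by one other word.
    text = re.sub(r"(\b[A-Z]\w+[ \t]\w+[ \t]\b[A-Z]\w+)", r"\n\1\n", text)
    # Step 4: drop lowercase words at the start of each line.
    text = re.sub(r"^(?:[ \t]*[a-z]\w+[ \t]*)+", "", text, flags=re.MULTILINE)
    # Step 5: drop lowercase words at the end of each line.
    text = re.sub(r"(?:[ \t]*\b[a-z]\w+[ \t]*)+$", "", text, flags=re.MULTILINE)
    # Step 6: remove blank lines (also trims stray spaces around each name).
    return [line.strip() for line in text.splitlines() if line.strip()]

# The example sentence with the two changes described above.
sample = ("Sent by them to the north in search of new lands, the daring "
          "daredevils of the Swamp The Orkhon, Holokhoi Oyuun, Symattai the "
          "Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei "
          "Bootur, proceeding from the banks of the Great Ebe from top to "
          "bottom and from bottom to top, found three spacious valleys.")

result = extract_candidates(sample)
print("\n".join(result))
```

    The order of the steps matters in the script just as it does in the editor: steps 2 and 3 must isolate the multi-word names before steps 4 and 5 strip the lowercase words around them.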

    I am left with the following. It does get close. Because I was aware of certain issues that might cause problems, I handled the “sets of capitalised words” step before the others to keep those names together. It does, however, keep the first word in the sentence, as predicted.

    Resulting lines after the operations above (with slightly altered text as specified):

    Sent
    Swamp The Orkhon
    Holokhoi Oyuun
    Symattai the Blacksmith
    Hara Tumen
    Uluu Horo
    Omogoy
    Ellei Bootur
    Great Ebe
    

    Again, I will reiterate my statement that this cannot be trusted to produce a good result, although, as my test shows, it might get reasonably close. Close inspection of the output might easily show where some additional work is warranted. This will, however, do the bulk of it and get you to within 95% of the goal. Regexes are not “sentient”; they will blindly carry out whatever task is put before them, whether right or wrong! It requires a “more sentient” engine to correctly complete this task. Perhaps something like the Google Search engine.

    Terry



  • @Terry-R said in How to delete all text except proper names?:

    see how futile

    @Petr-Andreev ,

    In case you weren’t yet convinced as to the futility of this request, let’s add a second paragraph:

    As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III.  They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun).
    

    After @Terry-R’s noble attempt at setting up an algorithm from your example paragraph, when this second paragraph is processed… :-(

    As a
    d
    Artagnan on his quest to find the bones of St
     
    Francis of Assisi
    
    Dr
     Watson
    Bill S
     Preston
     Esq
    
    George Wexford
    
    
    Smyth III
    
      They
    Dutch cyclist Mathieu
    Poel
    
    
    Dutch
    s a
    a
    

    As @Terry-R has indicated: this problem takes a level of intelligence (or enumeration of rules) that regex just doesn’t have.



  • @PeterJones said in How to delete all text except proper names?:

    when this second paragraph is processed… :-(

    Thanks for that 2nd paragraph. I knew in my mind it was a futile effort but I (sadly) had to try (to prove my hypothesis). Of course my test actually appeared to show it was possible, something I’d hoped would not happen. But the reason it did was because my series of regexes was designed entirely with the first example in mind, to MAKE it work.

    @PeterJones’s example showed that there are so many different types of proper names lurking out there that it is impossible for regex to cover them all. My other statement, that my solution could get part (most) of the way, was RUBBISH, as has been proven.

    About the ONLY thing I proved was that unless the OPs present us with examples which provide a good representation of the actual data, what they will receive back is next to useless. Oh and how many times have we seen that occur!

    Terry



  • @Petr-Andreev, @Terry-R, @PeterJones:

    I agree with both of you, Terry and Peter - regex is a resource of a syntactic order and not a semantic one, therefore it knows nothing about proper names. On the other hand, there is no need to use this tool alone, as it can be combined with others more suitable for the task, for example, a dictionary.

    Let me explain my point.

    As we all know, proper names usually begin with a capital letter, and in this particular case the rule applies. But many common words can also be capitalized, so that criterion alone is not enough. However, there is another fact we can take into account: the proper names quoted in OP’s text are completely non-English words, so they would be flagged as misspelled by a purely English dictionary, which would let us distinguish them from English words that begin with a capital letter. This suggested to me an alternative approach for this particular text, one not suitable for others, as @PeterJones showed above.

    The approach, then, is to use the English dictionary provided by the DSpellCheck plugin to collect a list of proper names. First, the text needs to be modified with a couple of regexes to make the task doable. I am also going to use Blacksmith instead of blacksmith.

    a) Replace spaces between first and last names with underscores:

    Search: (?-i)([A-Z][a-z]+)\K (?=[A-Z][a-z]+)
    Replace: _
    

    b) Replace the spaces around “the” in Symattai’s title (run it twice):

    Search: (?-i)(([A-Z][a-z]+)\K\x20(?=the))|((?<=_the)\x20(?=[A-Z][a-z]+))
    Replace: _
    

    to get the following text:

    Sent by them to the north in search of new lands, the daring daredevils of the Swamp_Orkhon, Holokhoi_Oyuun, Symattai_the_Blacksmith, Hara_Tumen, Uluu_Horo, the elder Omogoy and Ellei_Bootur, proceeding from the banks of the Great_Ebe from top to bottom and from bottom to top, found three spacious valleys.
    

    c) Select DSpellCheck from the Plugins menu and

    • Change Current Language -> English (United States) - Only this one, don’t mix languages!

    • In Settings...
      Simple -> Library -> select Hunspell (it is the one I had installed; other dictionaries may work as well)
      Advanced -> Ignore words -> Uncheck these three options: Starting with Capital Letter, Having not first Capital Letter, Having _.

    • Select Spell Check Document Automatically and all proper names should be highlighted. Then select Additional Actions -> Copy All Misspelled Words to Clipboard to get an ordered list of unique proper names, as follows:

    Ellei_Bootur
    Great_Ebe
    Hara_Tumen
    Holokhoi_Oyuun
    Omogoy
    Swamp_Orkhon
    Symattai_the_Blacksmith
    Uluu_Horo
    

    Afterwards you can turn the underscores in the collected names back into spaces.
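
    The same idea can be sketched outside Notepad++. The Python below is only an illustration under loud assumptions: ENGLISH_WORDS is a tiny hand-made stand-in for a real Hunspell dictionary (it holds just the common words of this one sentence), the function names are invented for the example, and because Python's re module has no \K, the joining regexes use a capture group instead.

```python
import re

# ASSUMPTION: a tiny stand-in for a real English dictionary. It only
# contains the common words of this one sentence.
ENGLISH_WORDS = set(
    "sent by them to the north in search of new lands daring daredevils "
    "elder and proceeding from banks top bottom found three spacious "
    "valleys".split()
)

def join_names(text):
    # a) Join first and last names with an underscore.
    text = re.sub(r"([A-Z][a-z]+) (?=[A-Z][a-z]+)", r"\1_", text)
    # b) Join "Name the Title": first the space before "the" ...
    text = re.sub(r"([A-Z][a-z]+) (?=the)", r"\1_", text)
    # ... then the space after "_the" (the second pass).
    text = re.sub(r"(?<=_the) (?=[A-Z][a-z]+)", "_", text)
    return text

def misspelled(text):
    # Every token the "dictionary" does not know, unique and sorted.
    tokens = re.findall(r"[A-Za-z_]+", text)
    return sorted({t for t in tokens if t.lower() not in ENGLISH_WORDS})

sample = ("Sent by them to the north in search of new lands, the daring "
          "daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the "
          "Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei "
          "Bootur, proceeding from the banks of the Great Ebe from top to "
          "bottom and from bottom to top, found three spacious valleys.")

joined_names = misspelled(join_names(sample))
# Restore the spaces in the collected names.
names = [name.replace("_", " ") for name in joined_names]
print("\n".join(names))
```

    Note that “Swamp” and “Great” are ordinary English words; they only survive because the underscore glues them to a neighbour, so the joined token is unknown to the dictionary.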

    I think it does a good job with OP’s text, and although it may fail (surely it would!) with other examples or in other cases, it can also be improved. On the other hand, my main point was to show another approach to deal with this kind of issue. I hope it gives people new ideas for solving some recurring problems.

    Have fun!



  • @astrosofista said in How to delete all text except proper names?:

    On the other hand, my main point was to show another approach to deal with this kind of issue.

    Dare I say it (or as Sherlock Holmes said):
    “The game is afoot”!

    It actually looks now like a possibility. But do we throw time and resources to it? And to what end (a proof of concept)? I dare say collectively we could create quite a passable solution. That’s the beauty of this “collective mind” on the forum, we can all feed off each other’s ideas.

    However, I’m content that we’ve shown the OP how hard it is, especially when presented with a different paragraph. I think it would require a lot more effort by someone to take this to the bitter (sweet) end. All I can say is good luck to anyone who wishes to give it a go.

    Terry



  • Hello, @petr-andreev, @terry-R, @astrosofista and All,

    @terry-r and @peterjones are right about it. For instance, the first word of your example, Sent is obviously not a proper name but this long sentence could have started with a proper name, too ! Regexes are really not a fair tool to solve semantic problems ;-))

    @Astrosofista, I do like your approach, too ! As for me, I managed to build a regex which catches, “more or less”, all the proper names and compound proper names of the OP’s text !

    I modified your example, adding some dummy text as well as the @peterjones text, for a more thorough test. Of course, the first word of any sentence is treated as a proper name. And although your first sentence, slightly modified, seems to give good results, the results for the @peterjones text, which comes next, are not so convincing !

    Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, as well as the general MacArthur, O'Neil the Scotsman , A SpecialNicknameForTest; Cardinal Mazarin, la Marquise de Maintenon,  Louis "Le Grand Dauphin" and Louis XIV the Great or Sun King, his father, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.
    
    As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III.  They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun).
    

    The following regex S/R :

    SEARCH (?x-is).*?(\u[\l\u'."]+(((?!and|or)[^,;.:?!\r\n]){1,5}?\u[\l\u'."]+){0,3})|.+(\R)?

    REPLACE ?1\1(?4:\r\n)

    Would output this text :

    Sent
    Swamp Orkhon
    Holokhoi Oyuun
    Symattai the Blacksmith
    Hara Tumen
    Uluu Horo
    Omogoy
    Ellei Bootur
    MacArthur
    O'Neil the Scotsman
    SpecialNicknameForTest
    Cardinal Mazarin
    Marquise de Maintenon
    Louis "Le Grand Dauphin"
    Louis XIV the Great
    Sun King
    Great Ebe
    
    As
    Artagnan
    St. Francis of Assisi
    Dr. Watson
    Bill S. Preston
    Esq
    George Wexford-Smyth III.
    They
    Dutch
    Mathieu
    Poel
    Dutch
    

    Remark : if you add the new rule that “No sentence ever begins with a proper name”, it should even be possible to skip the first word of every sentence ! However, the @peterjones and @Terry-R thoughts should convince you of the limits of regexes in this matter !

    Test on your real text to see if any other problem occurs ! I do not have time, presently, for explanations of this search regex, but I will, next time !
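
    In the meantime, here is one possible reading of the name part of that search regex (the first capturing group), written as a commented Python pattern. Treat it as a sketch under assumptions: Python has no \u and \l classes, so [A-Z] and [A-Za-z] stand in for them (ASCII only), and re.findall replaces the Notepad++ search/replace, so the conditional replacement is not reproduced.

```python
import re

# ASSUMPTION: ASCII approximation of \u (uppercase) and \l (lowercase).
NAME = re.compile(r"""
    [A-Z][A-Za-z'."]+          # a capitalized word; may also hold ' . "
    (?:                        # then, up to three times:
        (?: (?!and|or)         #   a short bridge of 1 to 5 characters, none of
            [^,;.:?!\r\n]      #   which is punctuation or begins "and" / "or"
        ){1,5}?                #   (lazy: keep the bridge as short as possible)
        [A-Z][A-Za-z'."]+      #   followed by another capitalized word
    ){0,3}
""", re.VERBOSE)

sample = ("Sent by them to the north in search of new lands, the daring "
          "daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the "
          "Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei "
          "Bootur, proceeding from the banks of the Great Ebe from top to "
          "bottom and from bottom to top, found three spacious valleys.")

found = NAME.findall(sample)
print("\n".join(found))
```

    The five-character bridge is what lets “ the ” join “Symattai” and “Blacksmith”, while the (?!and|or) lookahead is what keeps “Omogoy and Ellei Bootur” from being merged into a single name.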

    See you later,

    Best Regards,

    guy038

