Hello, @petr-andreev, @terry-R, @astrosofista and All,
@terry-r and @peterjones are right about it. For instance, the first word of your example, Sent, is obviously not a proper name, but this long sentence could just as well have started with one! Regexes are really not the right tool for solving semantic problems ;-))
@Astrosofista, I like your approach, too! For my part, I managed to build a regex which catches, “more or less”, all the proper names and compound proper names in the OP’s text!
I modified your example, adding some dummy text as well as the @peterjones text, for a more thorough test. Of course, the first word of any sentence is treated as a proper name. And, although your first sentence, slightly modified, seems to give good results, the results for the @peterjones text, which comes next, are not as relevant!
Sent by them to the north in search of new lands, the daring daredevils of the Swamp Orkhon, Holokhoi Oyuun, Symattai the Blacksmith, Hara Tumen, Uluu Horo, the elder Omogoy and Ellei Bootur, as well as the general MacArthur, O'Neil the Scotsman , A SpecialNicknameForTest; Cardinal Mazarin, la Marquise de Maintenon, Louis "Le Grand Dauphin" and Louis XIV the Great or Sun King, his father, proceeding from the banks of the Great Ebe from top to bottom and from bottom to top, found three spacious valleys.
As a second paragraph, we will follow d'Artagnan on his quest to find the bones of St. Francis of Assisi, along with fictional characters and Dr. Watson, and Bill S. Preston, Esq, and George Wexford-Smyth III. They were really curious whether Dutch cyclist Mathieu van der Poel would lose his "van der", and whether Dutch would be kept because it's a proper adjective (not a proper noun).
The following regex S/R:
SEARCH (?x-is).*?(\u[\l\u'."]+(((?!and|or)[^,;.:?!\r\n]){1,5}?\u[\l\u'."]+){0,3})|.+(\R)?
REPLACE ?1\1(?4:\r\n)
would output this text:
Sent
Swamp Orkhon
Holokhoi Oyuun
Symattai the Blacksmith
Hara Tumen
Uluu Horo
Omogoy
Ellei Bootur
MacArthur
O'Neil the Scotsman
SpecialNicknameForTest
Cardinal Mazarin
Marquise de Maintenon
Louis "Le Grand Dauphin"
Louis XIV the Great
Sun King
Great Ebe
As
Artagnan
St. Francis of Assisi
Dr. Watson
Bill S. Preston
Esq
George Wexford-Smyth III.
They
Dutch
Mathieu
Poel
Dutch
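For anyone who wants to experiment outside Notepad++, here is a minimal Python sketch of the same idea. It is only an approximate port, not the exact Boost regex: I assume that `\u` and `\l` can be replaced by the ASCII classes `[A-Z]` and `[a-z]` (so accented capitals are not handled), and the names `NAME` and `find_names` are made up for this illustration. It simply lists the matches instead of performing the replacement.

```python
import re

# Approximate port of the search regex to Python's re module.
# Assumption: \u and \l become the ASCII classes [A-Z] and [a-z].
NAME = re.compile(
    r"""
    [A-Z][a-zA-Z'."]+                      # capitalized word, at least 2 characters
    (?:
        (?:(?!and|or)[^,;.:?!\r\n]){1,5}?  # short gap: 1 to 5 non-punctuation characters,
                                           # none of them starting "and" or "or"
        [A-Z][a-zA-Z'."]+                  # then another capitalized word
    ){0,3}                                 # at most three additional words
    """,
    re.VERBOSE,
)


def find_names(text):
    """Return every candidate proper name found in text, in order."""
    return [m.group(0) for m in NAME.finditer(text)]


print(find_names("Symattai the Blacksmith, Hara Tumen"))
# ['Symattai the Blacksmith', 'Hara Tumen']
```

The gap of 1 to 5 characters is what lets " the " or " de " join two capitalized words, while something longer, like " van der ", splits the name in two.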
Remark: if you add the new rule that “A sentence never begins with a proper name”, it should even be possible to skip the first word of every sentence! However, the thoughts of @peterjones and @Terry-R should convince you of the limits of regexes in this matter!
Test it on your real text to see if any other problems occur! I do not have time right now to explain this search regex, but I will next time!
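As a rough illustration of that extra rule, here is a hedged Python sketch. It is my own approximation, not a tested Notepad++ regex: `\u` and `\l` are again assumed to be the ASCII classes `[A-Z]` and `[a-z]`, every line start is treated as a sentence start, and the `\b` is an addition of mine so that a skipped word such as MacArthur does not leak its inner capital. The names `NAME_NOT_AT_SENTENCE_START` and `find_names_skipping_sentence_starts` are hypothetical.

```python
import re

# Same idea as the search regex, but a match may not begin at the start
# of a line or right after sentence-ending punctuation plus whitespace.
# Assumption: \u and \l are approximated by [A-Z] and [a-z].
NAME_NOT_AT_SENTENCE_START = re.compile(
    r"""
    (?<!^)                # not at the start of a line
    (?<![.?!]\s)          # not right after ". ", "? " or "! "
    \b[A-Z][a-zA-Z'."]+   # capitalized word starting at a word boundary
    (?: (?:(?!and|or)[^,;.:?!\r\n]){1,5}? [A-Z][a-zA-Z'."]+ ){0,3}
    """,
    re.VERBOSE | re.MULTILINE,
)


def find_names_skipping_sentence_starts(text):
    """List candidate names, ignoring the first word of each sentence."""
    return [m.group(0) for m in NAME_NOT_AT_SENTENCE_START.finditer(text)]


sample = ("Sent by them to the north, the daredevils of the Swamp Orkhon "
          "found valleys. They met Hara Tumen, the elder.")
print(find_names_skipping_sentence_starts(sample))
# ['Swamp Orkhon', 'Hara Tumen']
```

Of course, this also drops a genuine proper name that opens a sentence, which is exactly the semantic limit mentioned above.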
See you later,
Best Regards,
guy038