regex for making an acronym from a complete name (European Community into EC)

Jos Maas

In Guy Thevenot’s " SYNTAXE des expressions RÉGULIÈRES PRCE, de NOTEPAD++ v6.0 et PLUS " I found a way to convert a string with words starting with a capital into an acronym e.g. “Brabants Historisch Informatie Centrum” into BHIC or “European Community” into EC (Chapter V RECHERCHES et/ou REMPLACEMENTS STANDARD, section 32 SUPPRESSION de TOUT texte, ENTRE DEUX occurrences SUCCESSIVES de Re dans CHAQUE ligne d’un FICHIER): (?<=(Re)).*?(?=\1)

For Re I chose ([A-Z]) (the capital), and for the characters between the two occurrences of ([A-Z]) I chose [a-z \x20]* instead of .*, giving:
(?<=([A-Z]))[a-z \x20]*?(?=\1)

I put the cursor at the start of the above-mentioned string and pressed the Find Next button with the Regular expression radio button selected. I got the message Find: Can’t find the text “(?<=([A-Z]))[a-z \x20]*?(?=\1)”. Pressing the Replace button gave the message that no occurrence was found. Regex Helper reported no error, but found no match either. I don’t understand what I did wrong. Any help would be much appreciated.
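A possible explanation, which the thread itself does not spell out: the lookahead (?=\1) only succeeds when the text is followed by the same capital that precedes it, so "E…C" in European Community can never match. The sketch below checks this with Python's `re` module, used here as a stand-in for the Notepad++ (Boost) engine; the test string "European Economy" is made up for illustration.

```python
import re

# The pattern from the post. Group 1 is the capital BEFORE the text,
# and (?=\1) demands that the very same capital FOLLOWS the text.
pattern = re.compile(r"(?<=([A-Z]))[a-z \x20]*?(?=\1)")

no_match = pattern.search("European Community")    # E ... C: capitals differ
same_capital = pattern.search("European Economy")  # E ... E: capitals are equal

print(no_match)               # None
print(same_capital.group(0))  # 'uropean ' (including the trailing space)
```

So the regex is syntactically valid, which would explain why no error was reported, but it only matches between two identical capitals.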

MAPJe71

(?-i)(?<=(?<OneCapital>\b[A-Z]))[a-z\x20]+?(?=(?&OneCapital)|\b)

Jos Maas

@MAPJe71 said:

(?-i)(?<=(?<OneCapital>\b[A-Z]))[a-z\x20]+?(?=(?&OneCapital)|\b)

Thanks for your reply! I tried your solution on a part of an index:
Vader van de bruid
Joannes Christoffel Struning
Moeder van de bruid
Joanna Lutkie
Gebeurtenis
Huwelijk
Datum
06-06-1851
Gebeurtenisplaats
's-Hertogenbosch
Documenttype
BS Huwelijk
Erfgoedinstelling
Brabants Historisch Informatie Centrum

And here is what I get with the searchstring in the searchfield and an empty replacefield and pushing the replace all button.

V van de bruid
J C S
M van de bruid
J L
G
H
D
06-06-1851
G
's-H
D
BS H
E
B H I C

I think your suggestion is useful, but not yet completely suited, because it leaves spaces in the acronym (B H I C instead of BHIC).
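The result above can be reproduced outside Notepad++ with a small Python sketch. Python's `re` module has no (?&Name) subroutine call, so the sub-pattern \b[A-Z] is simply written out a second time; that adaptation, and the helper name `strip_word_tails`, are assumptions made for this illustration.

```python
import re

# MAPJe71's regex with the subroutine call (?&OneCapital) written out,
# because Python's re module does not support subroutine calls.
pattern = re.compile(r"(?<=\b[A-Z])[a-z\x20]+?(?=\b[A-Z]|\b)")

def strip_word_tails(line: str) -> str:
    """Delete the lower-case tail of every word that starts with a capital."""
    return pattern.sub("", line)

for line in ["Vader van de bruid", "BS Huwelijk",
             "Brabants Historisch Informatie Centrum"]:
    print(strip_word_tails(line))
```

The spaces survive because the lazy quantifier stops at the first word boundary, which lies before the space. Deleting the spaces afterwards would give BHIC, but only for names in which every word is capitalised.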

And further I have to find out how to plug it in or around the named expression (<REPONM> …) in the searchstring of which a snippet is:
<snippet>
|
(Erfgoedinstelling\R+
(?<REPONM>[A-z ])\R
(
( ([A-z ]laats[A-z ])\R+
(?<REPOPLCNM>[A-z '-ë])\R+
)
|
( (Collectiegebied)\R+
(?<COLLGEBNM>[A-z \x20 - '])\R+
)
)
)
|
<\snippet>

And to have something like $+{REPOACRONYM} - which is made in the OneCapital section of the searchstring mentioned above - in the replacestring, of which a snippet follows:
<snippet>
0 @S00@ SOUR\r\n
1 TITL $+{BRONNM}\r\n
1 AUTH $+{CTPLC}\r\n
1 PUBL $+{REPOACRONYM}-$+{TOEGNR}-$+{INVNR}\r\n

The string Brabants Historisch Informatie Centrum itself, found in the subexpression (<REPONM> … ) must be kept, because it is used in the replace, see next snippet:
<snippet>
0 @R0000@ REPO\r\n
1 NAME $+{REPONM}, $+{REPOPLCNM}\r\n
<\snippet>

I think that my first question in this post is more complex than I realized. Nevertheless I hope you have the opportunity to have a look at the problem - thanks in advance.

Jos Maas

MAPJe71

Could you provide a description (or link to one) of the “index” format and the complete search and replace strings you’ve got?

Jos Maas

Neither the Notepad++ forum nor my blog accepts file uploads, so I have copied the searchstring as a post in my blog:
https://maashoeven.blogspot.nl/2017/07/searchstring-structured-layout.html.

The replacestring is in: https://maashoeven.blogspot.nl/2017/07/replacestring-for-ged-file.html

The index is https://maashoeven.blogspot.nl/2017/07/voorbeeld-van-een-index.html - note, this is a post about indexes and their use in general, that I keep in my blog, although it still needs a finishing touch. The index you asked for is in this post.

Because * < and > cause trouble in webpages I have changed * by =+=star=+=, < by =+=LT=+= and > by =+=GR=+=. In extended searchmode (accepting \r \t \ r etc.) replace all \r, \n and \t by an empty string and =+=*=+= by * etc. By the way, the * has disappeared in my post in the notepad++ forum as well.

If this is suitable for you I am happy. If not, let me know, perhaps an email suits better.

Greetz, Jos Maas

guy038

Hello, @Jos-maas,

I didn’t investigate about all your project, but, specifically, about creation of acronyms, you could use the regex S/R :

SEARCH ([\u\l])\l*|\h

REPLACE ?1\U\1

So, from your original list, below :

Vader van de bruid
Joannes Christoffel Struning
Moeder van de bruid
Joanna Lutkie
Gebeurtenis
Huwelijk
Datum
06-06-1851
Gebeurtenisplaats
's-Hertogenbosch
Documenttype
BS Huwelijk
Erfgoedinstelling
Brabants Historisch Informatie Centrum

I obtained the following text :

VVDB
JCS
MVDB
JL
G
H
D
06-06-1851
G
'S-H
D
BSH
E
BHIC

Are these acronyms OK, for you ?
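For readers who want to try the same idea in a script, here is a minimal Python sketch of this S/R. It assumes ASCII letters only ([A-Za-z] in place of [\u\l], [ \t] in place of \h), and the conditional replacement ?1\U\1 is expressed as a small function; the name `acronym` is chosen for this illustration.

```python
import re

# ([\u\l])\l*|\h rewritten for Python: a letter followed by lower-case
# letters (group 1 = first letter), OR a single blank character.
pattern = re.compile(r"([A-Za-z])[a-z]*|[ \t]")

def acronym(line: str) -> str:
    # Group 1 set   -> a word matched: keep its first letter, upper-cased.
    # Group 1 unset -> a blank matched: replace it with nothing.
    return pattern.sub(lambda m: m.group(1).upper() if m.group(1) else "", line)

for line in ["Vader van de bruid", "'s-Hertogenbosch", "BS Huwelijk",
             "Brabants Historisch Informatie Centrum"]:
    print(acronym(line))
```

Digits and punctuation are never matched, which is why the date line and the apostrophe and hyphen in 's-Hertogenbosch pass through unchanged.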

And concerning my tutorial, about N++ regular expressions syntax, from Christian CUVIER’s site, it is quite outdated, by now ( June 2013 ! )

I wanted to wait a bit for the implementation of the enhanced regex syntax, of François-R Boyer, which corrects some bugs of the present N++ regex engine. Unfortunately, for various reasons, this integration never happened :-((( So, since June 2013, if I consider that :

I’ve learned some more things about Boost regex syntax
I, of course, detected some errors or omissions about this tutorial
This tutorial should be re-written in American-English

I just have to add, in my To-Do list : Do a complete re-writing of my N++ regex tutorial !! However, be patient before getting this new version ! I probably will update, first, the French version, adding some examples from some of my posts, on the N++ Community and, secondly, try to get a decent translation !!

Anyway, even in that old version, some tables and methods could be of interest, to some people ! You can get it, on various forms ( txt, html or pdf ) , at the address :

http://oedoc.free.fr/Regex/TutorielRegex.zip

Best Regards,

guy038

Jos Maas

The regex searchstring makes a correct acronym and could be used … if there is a possibility also to keep the original string.

The string in this example - in the named expression < REPONM> - “Brabants Historisch Informatie Centrum” so, the acronym is - a named expression <REPOACRNM> - “BHIC”. In the replace string both named expressions are used:
1 PUBL $+{REPOACRNM}-$+{something else}\r\n and 1 NAME $+{REPONM}, $+{REPOPLCNM}\r\n.

In what I have tried myself I have the feeling that there is a conflict in the requirements that both strings, name and acronym, have to be kept for use in the replacestring.

(?<ACRNM>
(?<ACRNMLTR>
(?<REPONM
(?<REPONMWRD
(?<=(?<HFDLTR>\b[A-Z]))
[a-z\x20]+
)+
)
?(?=(?&HFDLTR)|\b)
)*

MAPJe71

FYI your last snippet is missing a closing parenthesis.

Jos Maas

I have seen that the solutions guy038 and MAPJe71 suggest for my questions use a searchstring, a replacestring and the Replace All button. I think I have to explain my philosophy for the mechanism that converts an index to a ged-file. I want to make a searchstring that encompasses all possibilities an index can have and that generates named expressions. Then I position the cursor at the head of the index and push the Find Next button, and the whole index should be selected. If it is not completely selected, I have made a mistake in the searchstring or I have missed a possibility an index can have. I have to correct the searchstring or add the missing possibility to it before I can continue. Pushing the Replace button once, I replace the completely selected index by the replacestring, in which I use the named expressions from the searchstring.

It is a pity I could not implement the solutions you both suggested in this method.

I have made a new version of a part of my searchstring, that describes the mechanism of selecting and naming the reposition plus generating an acronym of that reposition. I found out that the syntax is wrong, but I hope it makes clear what my aim is.

The acronym ACRNM consists of characters ACRNMCHR formed from a name REPONM. The name consists of words REPONMWRD; a word consists of characters, of which the first one is a capital CAPTL, followed by unnamed characters and spaces. The capitals CAPTL should be added to the acronym; $+{CAPTL} in the searchstring seems logical to me, but is incorrect syntax in the searchstring. Here is the searchstring:

(?<ACRNM>(?<ACRNMCHR>(?<REPONM>(?<REPONMWRD>((?<CAPTL>\b[A-Z]))[a-z\x20]+)+)$+{CAPTL})+)

I hope I made clear the philosophy of my method. Thanks in advance for further help!

Best regards, Jos Maas

MAPJe71

What’s the reason for adding the capitals to the search string?
Is it to be able to match e.g. Brabants Historisch Informatie Centrum BHIC?

guy038

Hi, @Jos-maas,

I began to study your last SEARCH regex :

(?<ACRNM>(?<ACRNMCHR>(?<REPONM>(?<REPONMWRD>((?<CAPTL>\b[A-Z]))[a-z\x20]+)+)\g<CAPTL>)+)

I noticed two errors :

You repeat grouping of the CAPTL group ! So, instead of the part ((?<CAPTL>\b[A-Z])), the right syntax is, only, (?<CAPTL>\b[A-Z])
Secondly, you CANNOT use the $+{Name} syntax, as a back-reference in the SEARCH regex. The $+{Name} syntax is reserved to the REPLACE regex !!

Instead, you can use one of the six syntaxes, below, for a back-reference to a named group, previously defined in current regex :

\g{Name} OR \g<Name> OR \g'Name'

\k{Name} OR \k<Name> OR \k'Name'

Personally, I prefer the syntax <Name> to the two others ! The name seems easier to identify ! I also prefer the \g form to the \k one, as the letter g make you think, surely, of the word group !

Then, little by little, I built up a sub-regex of your regex, to get this one :

SEARCH (?-i)(?<REPONM>(?<REPONMWRD>(?<CAPTL>\b[A-Z])[a-z\x20]+)+)

And I added a replace regex, below, in order to capture the values of each named group

REPLACE REPONM = $+{REPONM}\r\nREPONMWRD = $+{REPONMWRD}\r\nCAPTL = $+{CAPTL}

When you execute this regex S/R, against the simple text :

Brabants Historisch Informatie Centrum

The SEARCH regex matches the whole string Brabants Historisch Informatie Centrum and, after replacement, we get :

REPONM = Brabants Historisch Informatie Centrum
REPONMWRD = Centrum
CAPTL = C

Notes :

I preferred to begin the regex with the syntax (?-i), to force the search to be case-sensitive ( NOT insensitive ! )
You’ve, certainly, noticed that the capturing values are always the value of the last repetition, for each group !
Be aware that, UNLIKE script languages, as Python, or Lua, regexes CANNOT store all successive values of the groups !
Anyway, the good thing is that this SEARCH regex is correct and select all the text of any line, composed of successive words, beginning, each, with a single capital letter :-))
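The "last repetition wins" behaviour in the notes above can also be observed in Python, which additionally shows how a script can collect every repetition with a second pass. This is a sketch using Python's `re` module, where named groups are written (?P<Name>…); it is not the Notepad++ engine itself.

```python
import re

text = "Brabants Historisch Informatie Centrum"
pattern = re.compile(
    r"(?P<REPONM>(?P<REPONMWRD>(?P<CAPTL>\b[A-Z])[a-z\x20]+)+)"
)

m = pattern.search(text)
print(m.group("REPONM"))     # Brabants Historisch Informatie Centrum
print(m.group("REPONMWRD"))  # Centrum  (last repetition only)
print(m.group("CAPTL"))      # C        (last repetition only)

# A script can run a second, simpler regex over the captured name
# to collect the capital of EVERY word, not just the last one.
capitals = re.findall(r"\b[A-Z]", m.group("REPONM"))
print("".join(capitals))     # BHIC
```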

So, now, let’s try, the upper level SEARCH regex :

(?-i)(?<ACRNMCHR>(?<REPONM>(?<REPONMWRD>(?<CAPTL>\b[A-Z])[a-z\x20]+)+)\g<CAPTL>)

Remarks :

The part \g<CAPTL>, as said, above, is a back-reference, to the previously defined named group CAPTL
However, although this regex is correct, NO match can be found. Quite logical, indeed : You’re trying to find a complete line , as explained, above, immediately followed by the capital letter of the last word of the line !

Indeed, this regex would match any text, composed of words, beginning with a single capital letter, and ending by the LAST capital letter of current line

Brabants Historisch Informatie CentrumC
Joannes Christoffel StruningS
Moeder van de bruidM
Joanna LutkieL
BS HuwelijkH
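The same behaviour can be checked in Python, where a back-reference to a named group is spelled (?P=CAPTL) rather than \g<CAPTL>. The sketch below is a translation into Python's `re` syntax, not the Notepad++ regex itself.

```python
import re

# Python spelling of the regex above: (?P<Name>...) defines a named group,
# (?P=Name) refers back to what that group captured last.
pattern = re.compile(
    r"(?P<ACRNMCHR>(?P<REPONM>(?P<REPONMWRD>(?P<CAPTL>\b[A-Z])[a-z\x20]+)+)"
    r"(?P=CAPTL))"
)

plain = pattern.search("Brabants Historisch Informatie Centrum")
with_capital = pattern.search("Brabants Historisch Informatie CentrumC")

print(plain)                  # None: nothing follows the last word
print(with_capital.group(0))  # the whole line, including the final C
```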

So what ??

Moreover, your inner syntax (?<CAPTL>\b[A-Z])[a-z\x20]+ matches each individual word, followed by a space character of the string Brabants Historisch Informatie Centrum. But, it would, also, match the string Abcd efgh ijkl mnop qrst, in one go ! Is that what you expect ?

Finally, it seems that from the denomination Brabants Historisch Informatie Centrum, you would like to obtain its acronym ( BHIC ), while keeping stored the values of all the named groups, previously defined ? To my mind, this goal cannot be achieved by regexes !

Cheers,

guy038

Jos Maas

Thanks both of you, guy038 and MAPJe71! A lot of stuff to be studied - I am really learning by doing!

Alas, the last paragraph “Finally, it seems that from the denomination Brabants Historisch Informatie Centrum, you would like to obtain its acronym ( BHIC ), while keeping stored the values of all the named groups, previously defined ? To my mind, this goal cannot be achieved by regexes !” indeed destroyed my hope of finding a solution for keeping the original string and making and saving an acronym for use of both in the replace string.

I realize that I have to do a second S/R action in which I replace on the right spot the string by the acronym. Because the spot for the complete name is in the string “1 NAME $+{REPONM}, $+{REPOPLCNM}\r\n” and the place for the acronym is in the string “1 PUBL $+{ACRONM}-something”, no mistake is possible. It is a pity that my aim to do the S/R once is impossible, but it is not the end of the world.

I think I can go further now. Thanks for your help!

Greetings, Jos Maas

guy038

Hello, @Jos-maas,

Don’t be so sorry about my last statement ! Maybe we can go further :-) When a problem seems complex, it must be split up into several pieces !

So, to begin with, given this unique item of your index, below :

Brabants Historisch Informatie Centrum

How must it look, after replacement ? I suppose that you want to repeat, at least, the string Brabants Historisch Informatie Centrum, as well as its acronym, BHIC, with other material, in one or several lines ?

Remark : In all your posts, you’re using named groups, in your regexes. Be aware that named groups are just a work-around for a better understanding of regexes. But they cannot be re-used, outside the current regex, unlike in script languages !

BTW, some names of your groups seem to be duplicated ! Could you produce a unique list of all these named groups and mention, for each group, if it should be re-used or not, in the replacement part ?

Cheers,

guy038

Jos Maas

Hello, guy038,

You must be a real optimist, and maybe you can glue the pieces of this complex problem together!

Indeed, the string Brabants Historisch Informatie Centrum and its acronym BHIC are used in the replacestring. The string is used as a title and occurs on a single line together with the name of the place (“Brabants Historisch Informatie Centrum, 's-Hertogenbosch”), given by “1 NAME $+{REPONM}, $+{REPOPLCNM}”. The acronym is used in a code that uniquely identifies the source of the index, consisting of the acronym of the reposition, the archive ident and the inventory number, given as “1 PUBL $+{REPOACRONM}-$+{TOEGNR}-$+{INVNR}”.
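In a scripting language both values can be produced in one pass, because the acronym can be computed from the captured name after the match. The Python sketch below is only an illustration of that idea: the helper `make_acronym`, the tiny sample index and the values used for TOEGNR and INVNR are hypothetical, and the place name is filled in by hand rather than parsed.

```python
import re

def make_acronym(name: str) -> str:
    """Join the capital that starts each word, e.g. 'BHIC'."""
    return "".join(re.findall(r"\b[A-Z]", name))

# Minimal sample index: a label line followed by the repository name.
index = "Erfgoedinstelling\nBrabants Historisch Informatie Centrum\n"
m = re.search(r"Erfgoedinstelling\n(?P<REPONM>[^\n]+)\n", index)

repo_name = m.group("REPONM")
# Hypothetical sample values; a real script would capture these as well.
repo_place, toegnr, invnr = "'s-Hertogenbosch", "0000", "0000"

name_line = f"1 NAME {repo_name}, {repo_place}"
publ_line = f"1 PUBL {make_acronym(repo_name)}-{toegnr}-{invnr}"

print(name_line)
print(publ_line)
```

The full name stays available in `repo_name` while the acronym is derived from it, which is the combination that a single regex S/R could not deliver.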

Below is the table you asked for, with columns for the name of the group, whether it is used in the replacestring (yes or no) and, for better understanding, the meaning of the group and any remarks.

| named group | | used in replace | meaning | remarks |
|---|---|---|---|---|
| BRMNM | - | yes | name of groom | |
| BRMGVN | - | yes | given name of groom | subexpression in BRMNM |
| BRMSFX | - | yes | suffix of groom | subexpression in BRMNM |
| BRMSRN | - | yes | surname of groom | subexpression in BRMNM |
| BRMGEBDAT | - | no | date of birth groom | |
| BRMGEBDD | - | yes | day of birth groom | subexpression in BRMGEBDAT |
| BRMGEBMM | - | yes | month of birth groom | subexpression in BRMGEBDAT |
| BRMGEBYY | - | yes | year of birth groom | subexpression in BRMGEBDAT |
| BRMGEBPLACE | - | yes | place of birth groom | |

the same kind of named expressions above for the bride: instead of BRM read BRD

| named expression | | used in replace | meaning | remarks |
|---|---|---|---|---|
| VABGNM | - | no | name of grooms father | |
| VABGGVN | - | yes | given name of grooms father | subexpression in VABGNM |
| VABGSFX | - | yes | suffix in name of grooms father | subexpression in VABGNM |
| VABGSRN | - | yes | surname of grooms father | subexpression in VABGNM |

the same kind of named expressions above for the grooms mother: instead of VABG read MOBG
the same kind of named expressions above for the brides father: instead of VABG read VABD

the same kind of named expressions above for the brides mother: instead of VABG read MOBD

| named group | | used in replace | meaning | remarks |
|---|---|---|---|---|
| REPONM | - | yes | name of reposition (archive) | used in title of repo |
| REPOACRONM | - | yes | acronym for name of reposition | used in identification of act, derived from REPONM |
| REPOPLCNM | - | yes | name of settlement of repo | |
| COLLGEBNM | - | yes | part of the collection of a repo | |
| EV | | yes | event | |
| EVDAT | | no | date of event | day month and year due to convention: index dd-mm-yyyy >> dd/mm/yyyy |
| EVDD | | yes | day of event | |
| EVMM | | yes | month of event | |
| EVYY | | yes | year of event | |
| EVPLACE | | yes | name of place of event | |
| BRONNM | | yes | name of source | |
| BRONTYPE | | yes | type of source | civil or church registration, particular archive a.s.o. |
| BRONCATLETTER | | yes | one character, G for Birth, O for death, H for marriage, D for christening, B for burial | |
| ARCHNM | | no | | |
| TOEGNR | | yes | number of global entry in archivesystem | |
| INVNR | | yes | subnumber of entry in archivesystem | |
| CTNUMMER | | yes | number that specifies (within the entry) the act from which the information is cited | |
| CTDAT | | no | the date the act is registered | used for getting DD, MM and YYYY |
| CTDD | | yes | day of registration | see remarks on DATE before |
| CTMM | | yes | month of registration | |
| CTYY | | yes | year of registration | |
| CTPLC | | yes | name of place where act is registered | can be different from place of event |
| CTSRT | | no | | item can occur in index, so the searchstring has to find this text. |
| CTOPM | | yes | notation in act f.i. groom is widower | |
| WLNK | | yes | weblink to site of reposition | |
| WPAG | | yes | specific page on site where index is found | |

Bon courage! Jos Maas

MAPJe71

Questions / remarks:

1. Why name/catch a group when it’s not used in the replace string?
2. Is there a difference in -yes vs. yes and -no vs. no in the used in replace column?
3. “for instance” is abbreviated as “e.g.” ;)
4. You could simplify the search and replace expressions when you update/correct the date notation format in a separate search-replace action.

Jos Maas

Hello, @MAPJe71

ad 1) Just for myself, to understand what I am doing. I am planning to wipe out those names, because I have the impression that np++ limits the number of names. E.g. (thanks for 3.; I remembered exempli gratia from my secondary school) I got a find error that did not return after I wiped out some of the unused names.
ad 2) No, it has to do with the limited facilities for presenting a nicely formatted table in Markdown; so I used “-” in an extra column, but alas not consistently.
ad 4) Do you have a suggestion how?

Thanks for the help.

MAPJe71

Hmm, my reply is considered spam by Akismet.com.

MAPJe71

Do you have a suggestion how?

Convert date formats:
search for: (\d{2})-(\d{2})-(\d{4})
replace with: \1/\2/\3
Convert index to GED format after updating every “date” group in your search and replace expressions from e.g. (?'BRMGEBDAT'(?'BRMGEBDD'\d\d)-(?'BRMGEBMM'\d\d)-(?'BRMGEBYY'\d\d\d\d)) and DATE \k'BRMGEBDD'/\k'BRMGEBMM'/\k'BRMGEBYY' to (?'BRMGEBDAT'\d{2}/\d{2}/\d{4}) and DATE \k'BRMGEBDAT' respectively.
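The first step, converting the date notation, translates directly to a scripting language. A minimal Python sketch of the same search and replace (the function name `convert_dates` is chosen here for illustration):

```python
import re

def convert_dates(text: str) -> str:
    """Rewrite dd-mm-yyyy as dd/mm/yyyy, leaving all other text untouched."""
    return re.sub(r"(\d{2})-(\d{2})-(\d{4})", r"\1/\2/\3", text)

print(convert_dates("Datum\n06-06-1851"))
```

Once the dates are in their final notation, the main expression only needs one group per date instead of four.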

MAPJe71

Akismet.com apparently does not like the $<...> and $+{...} format.

Jos Maas

@guy038
Hello, Guy,
In a reply of about a month ago, you wrote “Be aware that, UNLIKE script languages, as Python, or Lua, regexes CANNOT store all successive values of the groups !”.
The good news is that I now have a set of working regexes for some sorts of indexes. I would like to go further, but it turned out that the number of characters a regex can handle is too small for my goal. So I think I have to use Python to do the trick. I know a bit of programming (I learned the basics of Algol and Fortran some 50 years ago), but I have not done that kind of work for years, so I fear that it will take some time before I am able to write working Python scripts. Therefore, I would like to ask you some questions, so that I don’t have to read lots of documentation and forum discussions that might be irrelevant for my limited goal.

Can named groups from a regex be used in write statements?
If yes, could you give a clarifying example?
Does Python have a limit on the number of characters in a regex?

Thanks in advance, best regards, Jos
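As a rough illustration of the first two questions: in Python, named groups are written (?P<Name>…) and their values can be read from the match object and passed to any write call. The sketch below uses a cut-down sample index and an in-memory buffer in place of a real output file; the output line layout is illustrative only. The `re.VERBOSE` flag lets a long pattern be spread over several commented lines. On the third question, I am not aware of a documented limit on pattern length in Python's `re` module, but treat that as an assumption to verify for the Python version in use.

```python
import io
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments,
# which keeps a very long searchstring readable.
pattern = re.compile(
    r"""
    Gebeurtenis\n (?P<EV>[^\n]+) \n                       # event, e.g. Huwelijk
    Datum\n
    (?P<EVDD>\d{2}) - (?P<EVMM>\d{2}) - (?P<EVYY>\d{4})   # dd-mm-yyyy
    """,
    re.VERBOSE,
)

index = "Gebeurtenis\nHuwelijk\nDatum\n06-06-1851\n"
m = pattern.search(index)

out = io.StringIO()  # stands in for a file opened with open(..., "w")
# groupdict() returns all named groups as a dictionary.
out.write("DATE {EVDD}/{EVMM}/{EVYY}\n".format(**m.groupdict()))

print(m.group("EV"))         # a single named group
print(out.getvalue(), end="")
```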