    regex for making an acronym from a complete name (European Community into EC)

    • guy038

      Hi, @Jos-maas,

      I began to study your last SEARCH regex :

      (?<ACRNM>(?<ACRNMCHR>(?<REPONM>(?<REPONMWRD>((?<CAPTL>\b[A-Z]))[a-z\x20]+)+)\g<CAPTL>)+)

      I noticed two errors :

      • You wrap the CAPTL group in a redundant second pair of parentheses ! So, instead of the part ((?<CAPTL>\b[A-Z])), the right syntax is, simply, (?<CAPTL>\b[A-Z])

      • Secondly, you CANNOT use the $+{Name} syntax, as a back-reference in the SEARCH regex. The $+{Name} syntax is reserved to the REPLACE regex !!

      Instead, you can use one of the six syntaxes, below, for a back-reference to a named group, previously defined in current regex :

      \g{Name} OR \g<Name> OR \g'Name'

      \k{Name} OR \k<Name> OR \k'Name'

      Personally, I prefer the <Name> syntax to the two others ! The name seems easier to identify ! I also prefer the \g form to the \k one, as the letter g surely makes you think of the word group !
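For readers who later move to Python (mentioned further down in this thread), here is a minimal sketch of the same search/replace distinction in Python's `re` module, which spells things differently from Notepad++: groups are named with `(?P<Name>...)`, a back-reference inside the pattern is `(?P=Name)`, and `\g<Name>` is only valid in the replacement string. The group name `WORD` and the sample text are made up for illustration.

```python
import re

# In Python's re module:
#   (?P<WORD>...)  defines a named group
#   (?P=WORD)      is a back-reference inside the SEARCH pattern
#   \g<WORD>       refers to the group inside the REPLACEMENT string
doubled = re.compile(r"\b(?P<WORD>[A-Za-z]+) (?P=WORD)\b")

text = "the the archive of of Brabant"  # made-up sample text
fixed = doubled.sub(r"\g<WORD>", text)

print(fixed)
```

The pattern only matches a word that is immediately followed by a copy of itself, so each doubled word is collapsed to a single occurrence.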


      Then, little by little, I built up a sub-regex of your regex, to get this one :

      SEARCH (?-i)(?<REPONM>(?<REPONMWRD>(?<CAPTL>\b[A-Z])[a-z\x20]+)+)

      And I added a replace regex, below, in order to capture the values of each named group

      REPLACE REPONM = $+{REPONM}\r\nREPONMWRD = $+{REPONMWRD}\r\nCAPTL = $+{CAPTL}

      When you execute this regex S/R, against the simple text :

      Brabants Historisch Informatie Centrum
      

      The SEARCH regex matches the whole string Brabants Historisch Informatie Centrum and, after replacement, we get :

      REPONM = Brabants Historisch Informatie Centrum
      REPONMWRD = Centrum
      CAPTL = C
      

      Notes :

      • I preferred to begin the regex with the syntax (?-i), to force the search to be case-sensitive

      • You’ve certainly noticed that the captured value of each group is always the value of its last repetition !

      • Be aware that, UNLIKE script languages, as Python, or Lua, regexes CANNOT store all successive values of the groups !

      • Anyway, the good thing is that this SEARCH regex is correct and selects all the text of any line composed of successive words, each beginning with a single capital letter :-))
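The "last repetition wins" behaviour in the notes above can be reproduced outside Notepad++. The following is a rough Python sketch, not the Notepad++ engine itself: Python's `re` module needs the `(?P<Name>...)` spelling, so the regex is transcribed accordingly. It also shows one way a script could collect every word, by scanning with a second, simpler pattern.

```python
import re

# The SEARCH regex from the post, transcribed to Python's (?P<Name>...) syntax.
pattern = re.compile(
    r"(?P<REPONM>(?P<REPONMWRD>(?P<CAPTL>\b[A-Z])[a-z\x20]+)+)"
)

m = pattern.search("Brabants Historisch Informatie Centrum")

# Each repeated group only keeps the value of its LAST repetition.
last_values = m.groupdict()
for name, value in last_values.items():
    print(f"{name} = {value}")

# A script can still collect every word by scanning the matched text again.
word = re.compile(r"(?P<CAPTL>\b[A-Z])[a-z]+")
all_words = [w.group(0) for w in word.finditer(m.group("REPONM"))]
print(all_words)
```

Note that a single match object in Python keeps only the last repetition too; it is the extra loop over `finditer` that recovers the earlier words.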


      So, now, let’s try, the upper level SEARCH regex :

      (?-i)(?<ACRNMCHR>(?<REPONM>(?<REPONMWRD>(?<CAPTL>\b[A-Z])[a-z\x20]+)+)\g<CAPTL>)

      Remarks :

      • The part \g<CAPTL>, as said, above, is a back-reference, to the previously defined named group CAPTL

      • However, although this regex is correct, NO match can be found. Quite logical, indeed : you’re trying to find a complete line, as explained above, immediately followed by the capital letter of the last word of the line !

      Indeed, this regex would only match a text composed of words, each beginning with a single capital letter, and immediately followed by a repeat of the capital letter that begins the LAST word of the current line :

      Brabants Historisch Informatie CentrumC
      Joannes Christoffel StruningS
      Moeder van de bruidM
      Joanna LutkieL
      BS HuwelijkH
      

      So what ??

      Moreover, your inner syntax (?<CAPTL>\b[A-Z])[a-z\x20]+ matches each individual word, followed by a space character, of the string Brabants Historisch Informatie Centrum. But it would also match the string Abcd efgh ijkl mnop qrst, in one go ! Is that what you expect ?
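Both observations can be checked with a short Python sketch. It assumes Python's `re` module rather than the Notepad++ engine, so the back-reference `\g<CAPTL>` is written `(?P=CAPTL)` and the groups use the `(?P<Name>...)` spelling; the test strings are the ones from the post.

```python
import re

# Upper-level regex from the post, with the back-reference in Python spelling.
with_backref = re.compile(
    r"(?P<ACRNMCHR>(?P<REPONM>(?P<REPONMWRD>(?P<CAPTL>\b[A-Z])[a-z\x20]+)+)"
    r"(?P=CAPTL))"
)

# No match on the plain name: nothing repeats the last capital letter.
no_match = with_backref.search("Brabants Historisch Informatie Centrum")

# It only matches when the last word's capital letter is appended.
odd_match = with_backref.fullmatch("Brabants Historisch Informatie CentrumC")

# The inner part alone swallows several lowercase words in one go,
# because the class [a-z\x20] also accepts the space character.
inner = re.compile(r"(?P<CAPTL>\b[A-Z])[a-z\x20]+")
one_go = inner.fullmatch("Abcd efgh ijkl mnop qrst")

print(no_match, bool(odd_match), bool(one_go))
```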


      Finally, it seems that from the denomination Brabants Historisch Informatie Centrum, you would like to obtain its acronym ( BHIC ), while keeping stored the values of all the named groups, previously defined ? To my mind, this goal cannot be achieved by regexes !
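Within a single search/replace this may well be out of reach, but in a scripting language the acronym is easy to derive while the full name stays available. A minimal sketch in Python, assuming that the acronym is simply every capital letter that starts a word; the function name `acronym` is made up for this example.

```python
import re


def acronym(name: str) -> str:
    """Join every capital letter that starts a word, e.g. BHIC."""
    return "".join(re.findall(r"\b[A-Z]", name))


full_name = "Brabants Historisch Informatie Centrum"
short_name = acronym(full_name)

# Both values remain available for later use.
print(full_name, "->", short_name)
```

Words that start with a lowercase letter are skipped under this assumption, so a string such as Moeder van de bruid would yield only M.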

      Cheers,

      guy038

      • Jos Maas

        Thanks both of you, guy038 and MAPJe71! A lot of stuff to be studied - I am really learning by doing!

        Alas, the last paragraph “Finally, it seems that from the denomination Brabants Historisch Informatie Centrum, you would like to obtain its acronym ( BHIC ), while keeping stored the values of all the named groups, previously defined ? To my mind, this goal cannot be achieved by regexes !” indeed destroyed my hope of finding a solution that keeps the original string and also builds and saves an acronym, so that both can be used in the replace string.

        I realize that I have to do a second S/R action in which I replace on the right spot the string by the acronym. Because the spot for the complete name is in the string “1 NAME $+{REPONM}, $+{REPOPLCNM}\r\n” and the place for the acronym is in the string “1 PUBL $+{ACRONM}-something”, no mistake is possible. It is a pity that my aim to do the S/R once is impossible, but it is not the end of the world.

        I think I can go further now. Thanks for your help!

        Greetings, Jos Maas

        • guy038

          Hello, @Jos-maas,

          Don’t be so sorry about my last statement ! Maybe we can go further :-) When a problem seems complex, it must be split up into several pieces !

          So, to begin with, given this unique item of your index, below :

          Brabants Historisch Informatie Centrum
          

          What should it look like, after replacement ? I suppose that you want to repeat, at least, the string Brabants Historisch Informatie Centrum, as well as its acronym, BHIC, with other material, in one or several lines ?

          Remark : In all your posts, you’re using named groups in your regexes. Be aware that named groups are just a convenience that makes regexes easier to read. They cannot be re-used outside the current regex, unlike in script languages !

          BTW, some of your group names seem to be duplicated ! Could you produce a unique list of all these named groups and mention, for each group, whether it should be re-used or not in the replacement part ?

          Cheers,

          guy038

          • Jos Maas

            Hello, guy038,

            You must be a real optimist, and maybe you can glue the pieces of this complex problem together!

            Indeed, both the string Brabants Historisch Informatie Centrum and its acronym BHIC are used in the replace string. The full name serves as a title and appears on a single line together with the place name (“Brabants Historisch Informatie Centrum, 's-Hertogenbosch”), produced by “1 NAME $+{REPONM}, $+{REPOPLCNM}”. The acronym is used in a code that uniquely identifies the source of the index, built from the acronym of the repository, the archive identifier and the inventory number: “1 PUBL $+{REPOACRONM}-$+{TOEGNR}-$+{INVNR}”.

            Hereafter is a table you asked for, with columns for the names of the groups, yes or no in the replacestring and for better understanding the meaning of the group and one or more remarks.

            | named group | used in replace | meaning | remarks |
            |---|---|---|---|
            | BRMNM | - yes | name of groom | |
            | BRMGVN | - yes | given name of groom | subexpression in BRMNM |
            | BRMSFX | - yes | suffix of groom | subexpression in BRMNM |
            | BRMSRN | - yes | surname of groom | subexpression in BRMNM |
            | BRMGEBDAT | - no | date of birth groom | |
            | BRMGEBDD | - yes | day of birth groom | subexpression in BRMGEBDAT |
            | BRMGEBMM | - yes | month of birth groom | subexpression in BRMGEBDAT |
            | BRMGEBYY | - yes | year of birth groom | subexpression in BRMGEBDAT |
            | BRMGEBPLACE | - yes | place of birth groom | |

            The same kind of named expressions as above for the bride: instead of BRM read BRD.

            | named expression | used in replace | meaning | remarks |
            |---|---|---|---|
            | VABGNM | - no | name of groom's father | |
            | VABGGVN | - yes | given name of groom's father | subexpression in VABGNM |
            | VABGSFX | - yes | suffix in name of groom's father | subexpression in VABGNM |
            | VABGSRN | - yes | surname of groom's father | subexpression in VABGNM |

            The same kind of named expressions as above for the groom's mother: instead of VABG read MOBG.
            The same kind of named expressions as above for the bride's father: instead of VABG read VABD.
            The same kind of named expressions as above for the bride's mother: instead of VABG read MOBD.

            | named group | used in replace | meaning | remarks |
            |---|---|---|---|
            | REPONM | - yes | name of repository (archive) | used in title of repo |
            | REPOACRONM | - yes | acronym for name of repository | used in identification of act, derived from REPONM |
            | REPOPLCNM | - yes | name of settlement of repo | |
            | COLLGEBNM | - yes | part of the collection of a repo | |
            | EV | yes | event | |
            | EVDAT | no | date of event | day, month and year; due to convention: index dd-mm-yyyy >> dd/mm/yyyy |
            | EVDD | yes | day of event | |
            | EVMM | yes | month of event | |
            | EVYY | yes | year of event | |
            | EVPLACE | yes | name of place of event | |
            | BRONNM | yes | name of source | |
            | BRONTYPE | yes | type of source | civil or church registration, particular archive and so on |
            | BRONCATLETTER | yes | one character | G for birth, O for death, H for marriage, D for christening, B for burial |
            | ARCHNM | no | | |
            | TOEGNR | yes | number of global entry in archive system | |
            | INVNR | yes | subnumber of entry in archive system | |
            | CTNUMMER | yes | number that specifies (within the entry) the act from which the information is cited | |
            | CTDAT | no | the date the act is registered | used for getting DD, MM and YYYY |
            | CTDD | yes | day of registration | see remarks on DATE before |
            | CTMM | yes | month of registration | |
            | CTYY | yes | year of registration | |
            | CTPLC | yes | name of place where act is registered | can be different from place of event |
            | CTSRT | no | | item can occur in index, so the search string has to find this text |
            | CTOPM | yes | notation in act | f.i. groom is widower |
            | WLNK | yes | weblink to site of repository | |
            | WPAG | yes | specific page on site where index is found | |

            Bon courage! Jos Maas

            • MAPJe71

              Questions / remarks:

              1. Why name/catch a group when it’s not used in the replace string?
              2. Is there a difference in -yes vs. yes and -no vs. no in the used in replace column?
              3. “for instance” is abbreviated as “e.g.” ;)
              4. You could simplify the search and replace expressions when you update/correct the date notation format in a separate search-replace action.
              • Jos Maas

                Hello, @MAPJe71

                ad 1) Just for myself, to understand what I am doing. I am planning to wipe out those names, because I have the impression that np++ is limited in the number of names. E.g. (thanks for 3.; I remembered exempli gratia from my secondary school) I got a find error that did not return after I wiped out some of the unused names.
                ad 2) No, it has to do with the limited facilities for presenting a nicely formatted table in Markdown; so I used “-” in an extra column, but alas not consistently.
                ad 4) Do you have a suggestion how?

                Thanks for the help.

                • MAPJe71

                  Hmm, my reply is considered spam by Akismet.com.

                  • MAPJe71

                    Do you have a suggestion how?

                    1. Convert date formats:
                      search for: (\d{2})-(\d{2})-(\d{4})
                      replace with: \1/\2/\3
                    2. Convert index to GED format after updating every “date” group in your search and replace expressions from e.g. (?'BRMGEBDAT'(?'BRMGEBDD'\d\d)-(?'BRMGEBMM'\d\d)-(?'BRMGEBYY'\d\d\d\d)) and DATE \k'BRMGEBDD'/\k'BRMGEBMM'/\k'BRMGEBYY' to (?'BRMGEBDAT'\d{2}/\d{2}/\d{4}) and DATE \k'BRMGEBDAT' respectively.
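The two steps above can be sketched in Python, should the conversion ever move out of the editor. This is only an illustration: the sample line is made up, and Python's `re` spells named groups as `(?P<Name>...)` instead of `(?'Name'...)`.

```python
import re

# Step 1: rewrite dd-mm-yyyy as dd/mm/yyyy (same pattern as in the post).
DATE_DASHED = re.compile(r"(\d{2})-(\d{2})-(\d{4})")


def convert_dates(text: str) -> str:
    return DATE_DASHED.sub(r"\1/\2/\3", text)


sample = "geboren 03-11-1852 te Vught, gehuwd 17-05-1880"  # made-up sample line
converted = convert_dates(sample)
print(converted)

# Step 2: once the dates are converted, one named group per date is enough;
# the separate day, month and year groups are no longer needed.
m = re.search(r"(?P<BRMGEBDAT>\d{2}/\d{2}/\d{4})", converted)
date_line = f"DATE {m.group('BRMGEBDAT')}"
print(date_line)
```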
                    • MAPJe71

                      Akismet.com apparently does not like the $<...> and $+{...} format.

                      • Jos Maas

                        @guy038
                        Hello, Guy,
                        In a reply of about a month ago, you wrote “Be aware that, UNLIKE script languages, as Python, or Lua, regexes CANNOT store all successive values of the groups !”.
                        The good news is that I now have a set of working regexes for some kinds of indexes. I would like to go further, but it turned out that the number of characters a regex can handle is too small for my goal. So I think I have to use Python to do the trick. I know a bit of programming (I learned the basics of Algol and Fortran some 50 years ago), but I have not done that kind of work for years, so I fear it will take some time before I am able to write working Python scripts. Therefore, I would like to ask you some questions, so that I don’t have to read lots of documentation and forum discussions that might be irrelevant for my limited goal.

                        1. Can named groups from a regex be used in write statements?
                        2. If yes, could you give a clarifying example?
                        3. Does Python have limitations on the number of characters in regexes?

                        Thanks in advance, best regards, Jos
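As a possible starting point for questions 1 and 2, here is a minimal Python sketch in which named groups feed the output lines. Everything about the input is an assumption: the record layout, the separators, the numbers 7634 and 1234, and the helper name `to_ged_lines` are invented for illustration, since the real index format is not shown in this thread. Only the two output templates follow the “1 NAME” and “1 PUBL” strings quoted earlier. Question 3 is not addressed by this sketch.

```python
import io
import re

# Hypothetical one-line index record; the real layout may differ.
record = "Brabants Historisch Informatie Centrum, 's-Hertogenbosch; 7634; 1234"

# Named groups use Python's (?P<Name>...) spelling.
RECORD = re.compile(
    r"(?P<REPONM>[^,]+), (?P<REPOPLCNM>[^;]+); (?P<TOEGNR>\d+); (?P<INVNR>\d+)"
)


def to_ged_lines(line: str) -> list[str]:
    """Turn one index record into output lines, or [] if it does not match."""
    m = RECORD.fullmatch(line)
    if m is None:
        return []
    groups = m.groupdict()  # named groups as an ordinary dictionary
    # Derived value: capital letters that start a word in the repository name.
    groups["REPOACRONM"] = "".join(re.findall(r"\b[A-Z]", groups["REPONM"]))
    return [
        "1 NAME {REPONM}, {REPOPLCNM}".format(**groups),
        "1 PUBL {REPOACRONM}-{TOEGNR}-{INVNR}".format(**groups),
    ]


out = io.StringIO()  # stands in for a file opened for writing
for ged_line in to_ged_lines(record):
    out.write(ged_line + "\n")

print(out.getvalue(), end="")
```

Because `groupdict()` returns a plain dictionary, derived values such as the acronym can be added to it before the lines are written, which is the step a single search/replace could not do.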

                        The Community of users of the Notepad++ text editor.