Regex -- how would I do this ...



  • Looking at this example from an old dictionary text file,

    [babafa]
    {n.}
    scent organ of possum

    Is there a regular expression that can handle a variable-length string (i.e. different words) inside the square brackets, recognize the presence of the left curly bracket on the next line, and place a # character to the left of the left square bracket, e.g.

    #[babafa]
    {n.}
    scent organ of possum

    What this is about: I have an unformatted text dump of a 1980s dictionary with thousands of these sorts of entries, each of which needs a unique field code in front of its headword. The # character will do for now as the unique character.

    Unfortunately, the square brackets in the source file are used in multiple contexts, such as speech examples, because they delimit [any vernacular language expression].

    So what I am relying on here, to find and tag the headwords that begin a dictionary entry, is that after every headword, the next line has a {part of speech} encased by curly brackets.

    That’s why I’m hoping there is a way using regex to tag the beginning of the headword string with a ‘#’ character or some such, based on the presence of a ‘{’ left curly bracket at the beginning of the next line.



  • The following code works, but is there a better way to write it?

    Using Regular expression search mode, to have Notepad++ replace all text strings of the type:
    [babafa]
    {n.}

    with

    #[babafa]
    {n.}

    by relying on the appearance of a left-curly-bracket on the line below the string enclosed in square brackets:

    Find what: ^[(.*)]\r\n+({)
    Replace with: #[$1]\r\n$2



  • @Ian-Alex ,

    I’m not sure how the find regex you specified worked for you; it did not work for me…I see some obvious problems with it. The big thing is that some of the symbols you are searching for (brackets and braces) have special meaning to the regex engine, and if you want to search for them literally, they have to be “escaped”, that is, preceded with a backslash.

    THIS MIGHT EXPLAIN IT: Perhaps when you posted here, you didn’t examine the preview window closely enough; sometimes posting on this website gobbles up your intended characters, so you have to use the escape/backslash technique here, too!

    Regardless, this simplified find and replace pair worked for me on your sample data:

    Find what: ^\[.+\]\R\{
    Replace with: #$0

    Some points to note:

    \R is a shorthand line-ending notation; it matches any line ending, including \r\n on Windows

    $0 in the replace is a shorthand notation for the entirety of what text matched in the find phase
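
For anyone who wants to try the same idea outside Notepad++, here is a rough Python sketch of the find and replace pair above. It is an approximation, not the Notepad++ behavior itself: Python's `re` module has no `\R`, so `\r?\n` stands in for it, and `\g<0>` plays the role of `$0`. The function name `tag_headwords` and the sample text are made up for illustration.

```python
import re

# Assumed stand-in for the Notepad++ search ^\[.+\]\R\{ :
# Python's re has no \R, so \r?\n covers Windows and Unix line endings.
# re.MULTILINE makes ^ match at the start of every line, not just the text.
HEADWORD = re.compile(r"^\[.+\]\r?\n\{", re.MULTILINE)


def tag_headwords(text: str) -> str:
    """Prefix '#' to each [headword] line that is directly followed by a '{' line."""
    # \g<0> is Python's spelling of $0: the entire matched text.
    return HEADWORD.sub(r"#\g<0>", text)


# Hypothetical sample: one real entry plus a bracketed phrase that is not a headword.
sample = (
    "[babafa]\r\n"
    "{n.}\r\n"
    "scent organ of possum\r\n"
    "see [some vernacular phrase] in speech\r\n"
)

print(tag_headwords(sample))
```

Only bracketed text that fills a whole line and is followed by a line starting with `{` gets the `#`; brackets used mid-sentence are left alone, which is the point of keying on the next line.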

