Line break before every UPPERCASE word

floyddebarber

Hey!

I have text files of scanned tables that were OCRed into a single line. The original table was essentially 3 columns: an UPPERCASE surname, a number (rating), and, after a dividing dash, a couple of sentences of text (I don’t even need that).
Something like this:

MÜLLER 6 - Blahblah. SMITH 5 - Asdds. Asdsd. DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.

to

MÜLLER 6 - Blahblah. 
SMITH 5 - Asdds. Asdsd. 
DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.

Can you help me out with an expression to break the lines before every completely UPPERCASE word, but not at every Sentence?
Also, is there an elegant way to replace the leading space between the name and the number without affecting the spaces in multipart names?

Thank you!

PeterJones

@floyddebarber said in Line break before every UPPERCASE word:

MÜLLER 6 - Blahblah. SMITH 5 - Asdds. Asdsd. DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.

FIND = (?-i)\h+(\b\u{2}[\u\x20]+)
REPLACE = \r\n$1
SEARCH MODE = regular expression

important concepts:

\h and \u and [...] = character classes: https://npp-user-manual.org/docs/searching/#character-classes
+ and {2} = multiplying operators: https://npp-user-manual.org/docs/searching/#multiplying-operators
\b = anchors: https://npp-user-manual.org/docs/searching/#anchors
(?-i) = search modifiers: https://npp-user-manual.org/docs/searching/#search-modifiers
(...) = capture groups: https://npp-user-manual.org/docs/searching/#capture-groups-and-backreferences
\r\n = control characters: https://npp-user-manual.org/docs/searching/#control-characters
$1 = substitution escape sequences: https://npp-user-manual.org/docs/searching/#substitution-escape-sequences

edit: the boundary \b isn’t necessary; I had that in there from an early version, but I had added the \h+ before to prevent MÜLLER from getting an extra CRLF before it, so the boundary was no longer needed.
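
For anyone who would rather script this than run it in Notepad++, here is a rough Python sketch of the same idea. It is only an approximation: Python's built-in re module has no \u or \h, so the character range UPPER (Latin-1 capitals) and the function name break_before_names are my own stand-ins, not part of the regex above. As with the Notepad++ version, any other run of two or more capitals after a blank (an acronym inside a sentence, say) would also trigger a break.

```python
import re

# Assumption: this Latin-1 range stands in for Notepad++'s \u (any uppercase
# letter), which Python's built-in re module does not offer.
UPPER = "A-ZÀ-ÖØ-Þ"

# [ \t]+ approximates \h+; the group mirrors \u{2}[\u\x20]+ from the FIND field.
# Matching is case-sensitive by default, which corresponds to (?-i).
NAME_START = re.compile(rf"[ \t]+([{UPPER}]{{2}}[{UPPER} ]+)")


def break_before_names(line: str) -> str:
    """Replace the blanks before each UPPERCASE name with a line break."""
    # "\n" is used here for simplicity; use "\r\n" for Windows line endings.
    return NAME_START.sub(lambda m: "\n" + m.group(1), line)


sample = ("MÜLLER 6 - Blahblah. SMITH 5 - Asdds. Asdsd. "
          "DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.")
print(break_before_names(sample))
```

The blanks in front of each name are consumed by the match, so no trailing space is left at the end of the previous row.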

floyddebarber

Wow, many thanks for the fast and detailed reply!

guy038

Hello @floyddebarber, @peterjones and All,

An alternative solution would be :

SEARCH (?-i)(?<=\.)\h*(?=\u\u)

REPLACE \r\n

So, for instance, from this INPUT text :

MÜLLER 6 - Blahblah.         SMITH 5 - Asdds. Asdsd.DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.

you would get the OUTPUT text :

MÜLLER 6 - Blahblah.
SMITH 5 - Asdds. Asdsd.
DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.

Notes :

This regex searches a range of horizontal blank chars ( such as \x20, \x09 or \xA0 ), possibly empty, but ONLY IF :
- It is preceded by a literal period ( full stop ), due to the positive look-behind (?<=\.)
- It is followed by two upper-case letters, accented or not, due to the positive look-ahead (?=\u\u)
And, in replacement, this range is just replaced by a Windows line-break ( \r\n ) ( Use \n only if working on Unix files )
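
For readers working outside Notepad++, the look-around logic described in these notes can be tried in Python roughly as follows. This is a sketch under assumptions, not a drop-in equivalent: Python's re module lacks \u and \h, so the UPPER range (Latin-1 capitals), the [ \t] class and the function name split_rows are stand-ins chosen for illustration.

```python
import re

# Assumption: Latin-1 capitals approximate Notepad++'s \u class.
UPPER = "A-ZÀ-ÖØ-Þ"

# (?<=\.)  : the blanks must come right after a literal period
# [ \t]*   : zero or more spaces/tabs (approximation of \h*)
# (?=..)   : two uppercase letters must follow, but are not consumed
ROW_GAP = re.compile(rf"(?<=\.)[ \t]*(?=[{UPPER}]{{2}})")


def split_rows(line: str) -> str:
    """Replace each gap between rows with a line break."""
    # "\n" is used for simplicity; use "\r\n" for Windows files.
    return ROW_GAP.sub("\n", line)


text = ("MÜLLER 6 - Blahblah.         SMITH 5 - Asdds. Asdsd."
        "DI CARLO 8,5 - And. Maybe even. Multiple. Sentences here.")
print(split_rows(text))
```

Because the blank range may be empty, the break is also inserted where the OCR left no space at all between a period and the next name, as in Asdsd.DI above.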

Best Regards,

guy038