Help with Trimming text-Remove before and after words

Terry R

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

Can anyone help me out with what can remove the rest so I end up with just the words/titles on each line and no spaces at the end?

How about
(\s*)?(-\s*)?[\(\[:\d\]\)]+(\s*)?(-\s*)?
to remove all those extraneous numbers and spaces and - and braces of different sorts.

Your example was probably reinterpreted by the posting engine: the leading 01 was turned into a list number.

See the FAQ post here on how to correctly show examples so they aren’t modified in posting.

Terry

Coises

@Saltshaker2112 Try:
\s[0-9\s\W]+$
— same principle, but anchored at the end instead of the beginning. I added the extra \s at the beginning so that if you have titles like:
We Always Knew (It Would End This Way) (5:15)
Jump! (4:30)
the expression won’t capture the non-word characters at the end of the title.
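
For anyone who wants to check the effect of that leading \s outside Notepad++, here is a minimal Python sketch. It assumes Python's re module treats \s, \W and $ like the Notepad++ engine does for these single-line samples; the helper name strip_tail and the second, unanchored pattern are only there for comparison.

```python
import re

# Coises's pattern: one whitespace character, then a run of digits,
# whitespace and non-word characters reaching the end of the line.
ANCHORED = re.compile(r"\s[0-9\s\W]+$")
# Same pattern without the leading \s, for comparison only.
UNANCHORED = re.compile(r"[0-9\s\W]+$")

def strip_tail(pattern, line):
    """Delete whatever the pattern matches at the end of the line."""
    return pattern.sub("", line)

for title in ["We Always Knew (It Would End This Way) (5:15)", "Jump! (4:30)"]:
    print(repr(strip_tail(ANCHORED, title)), repr(strip_tail(UNANCHORED, title)))
```

With the leading \s the closing parenthesis and the exclamation mark survive; without it they are swallowed along with the duration.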

Saltshaker2112

@Terry-R said in Help with Trimming text-Remove before and after words:

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

Can anyone help me out with what can remove the rest so I end up with just the words/titles on each line and no spaces at the end?

How about
(\s*)?(-\s*)?[\(\[:\d\]\)]+(\s*)?(-\s*)?
to remove all those extraneous numbers and spaces and - and braces of different sorts.

Your example was probably reinterpreted by the posting engine: the leading 01 was turned into a list number.

See the FAQ post here on how to correctly show examples so they aren’t modified in posting.

Terry

Thanks for the quick response. Yeah it seems like the first number was misinterpreted as a list. In any case, the expression basically puts everything into one line:
Song Title Second Song Title Another Song Title
But that is close.

Saltshaker2112

@Coises said in Help with Trimming text-Remove before and after words:

@Saltshaker2112 Try:
\s[0-9\s\W]+$
— same principle, but anchored at the end instead of the beginning. I added the extra \s at the beginning so that if you have titles like:
We Always Knew (It Would End This Way) (5:15)
Jump! (4:30)
the expression won’t capture the non-word characters at the end of the title.

This is close… thanks! By itself it removes everything at the end except for one space. How do I group it with the first expression?

Terry R

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

In any case, the expression basically puts everything into one line:
Song Title Second Song Title Another Song Title
But that is close.

Sorry about that. I had intended to have an actual space character wherever \s shows; written that way, it leaves each title on its own line. I didn’t do a final test with \s. You see, \s also matches newline characters (amongst others).
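
A small Python sketch of that difference, using a deliberately simplified duration pattern (a hypothetical stand-in, not the expression posted above), written once with \s and once with a literal space:

```python
import re

text = "Song Title  5:19\nSecond Song Title  4:41\n"

# \s also matches the line break, so the lines get joined together.
with_s = re.sub(r"\s*\d+:\d+\s*", "", text)

# A literal space never matches the line break, so each title
# stays on its own line.
with_space = re.sub(r" *\d+:\d+ *", "", text)

print(repr(with_s))      # 'Song TitleSecond Song Title'
print(repr(with_space))  # 'Song Title\nSecond Song Title\n'
```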

Terry

Terry R

@Saltshaker2112

I came up with a neater version. I realised that if the song title contained any numbers in the title it could possibly remove those as well. This latest version ties it to either the start of the line, or the end.

Try (^)?[\(\[:\d\]\)\- ]+(?(1)|$)

So it starts by looking for a start of line, then grabbing numbers and possible spaces, braces and dashes. If it didn’t find a start of line then it will grab the numbers, spaces, braces and dashes and expects to find an end of line. The (?(1)|$) is a special conditional subexpression. The first bit states that if capture group 1 matched then do nothing else (that’s the portion before the |, which is empty). If capture group 1 did not match, it forces the expression to find the end of line, since this is the portion behind the | (the else branch).

Please note there is a space character in the last position in the group.
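
The conditional can be tried in Python as well, since its re module accepts the same (?(1)yes|no) syntax. This is only a sketch: it assumes the character class is meant to hold parentheses, square brackets, colon, digits, dash and space, and the function name trim is made up for the example.

```python
import re

# Group 1 captures the (empty) start-of-line position when there is one.
# (?(1)|$) then means: if group 1 took part, require nothing more;
# otherwise the run of characters must reach the end of the line.
PATTERN = re.compile(r"(^)?[(\[:\d\])\- ]+(?(1)|$)", re.MULTILINE)

def trim(text):
    return PATTERN.sub("", text)

print(trim("01 - Bastille Day  5:19"))   # Bastille Day
print(repr(trim("03 - 2112  18:23")))    # '' (an all-digit title is removed too)
```

The second call shows the limitation of this version: a title made only of digits sits inside the leading run and disappears with it.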

Saltshaker2112

@Terry-R said in Help with Trimming text-Remove before and after words:

@Saltshaker2112

I came up with a neater version. I realised that if the song title contained any numbers in the title it could possibly remove those as well. This latest version ties it to either the start of the line, or the end.

Try (^)?[\(\[:\d\]\)\- ]+(?(1)|$)

So it starts by looking for a start of line, then grabbing numbers and possible spaces, braces and dashes. If it didn’t find a start of line then it will grab the numbers, spaces, braces and dashes and expects to find an end of line. The (?(1)|$) is a special conditional subexpression. The first bit states that if capture group 1 matched then do nothing else (that’s the portion before the |, which is empty). If capture group 1 did not match, it forces the expression to find the end of line, since this is the portion behind the | (the else branch).

Please note there is a space character in the last position in the group.

Oh that works pretty well. Wow, thank you! I do have one challenge that I did not take into account. Of course I am working on setlists for Rush, so one song is 2112. Is there a way to put in a variable that treats “2112” as a word so that it keeps that in the line?

Terry R

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

Is there a way to put in a variable that treats “2112” as a word so that it keeps that in the line?

That issue was in the back of my mind and I just hoped there wouldn’t be an instance of such a song title.

There should be a way, but it might take me a while to think it up. Currently all I can think of is that as soon as a number has been grabbed, no further numbers are allowed to be grabbed, but that wouldn’t work on the trailing number as the : gets in the middle of that. Creating a subexpression for all those ideas though may be tricky.

So EVERY line has a preceding number that needs removing? Because if it didn’t then it might just be impossible to cater for.

Terry

Terry R

@Saltshaker2112

Actually I just tested another regex and got it first go. Hopefully it will help.

Try ^\d+[- ]*|[- ]* [\(\[\)\]:\d ]+$

Terry

Saltshaker2112

@Terry-R said in Help with Trimming text-Remove before and after words:

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

Is there a way to put in a variable that treats “2112” as a word so that it keeps that in the line?

That issue was in the back of my mind and I just hoped there wouldn’t be an instance of such a song title.

There should be a way, but it might take me a while to think it up. Currently all I can think of is that as soon as a number has been grabbed, no further numbers are allowed to be grabbed, but that wouldn’t work on the trailing number as the : gets in the middle of that. Creating a subexpression for all those ideas though may be tricky.

So EVERY line has a preceding number that needs removing? Because if it didn’t then it might just be impossible to cater for.

Terry

In some cases the setlists do not have times, others do. But unfortunately I have the daunting task of editing over 1000 of them and each one is unique. But honestly this is not the end of the world. What you have provided is a big help and saves a lot of time editing each line. In this case, it will still leave the line empty so I know I can simply put that back in. So as much as I would love to have that variable, it's not a deal breaker for me and I really appreciate your time. It's been a big help. Thank you.

Terry R

@Saltshaker2112 said in Help with Trimming text-Remove before and after words:

So as much as I would love to have that variable, it's not a deal breaker for me and I really appreciate your time.

I’ve been trying to figure out a way to exclude any lines which have no text on them (thus the song is a number). It was getting messy, however I think I have a way. It involves splitting one of my previous regexes. So now there would be 3 steps:

Remove any leading number, ^\d+ *-*
Remove any trailing number with spaces, braces etc, (?!^)[- ]* [\(\[\)\]:\d ]+$
Remove leading spaces using Edit, Blank operations, Trim Leading Spaces (this is a built-in menu option).

However this will still require that there is always a preceding number at the start of a line, otherwise the “number” song title will still be removed.
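
The three steps can be sketched in Python like this. It assumes the bracket class in step 2 holds parentheses, square brackets, colon, digits and space; clean_line is a hypothetical helper name, and lstrip stands in for the Trim Leading Spaces menu command.

```python
import re

def clean_line(line):
    # Step 1: remove the leading number plus any spaces and dashes after it.
    line = re.sub(r"^\d+ *-*", "", line)
    # Step 2: remove a trailing run of spaces, brackets, colons and digits,
    # but never a run that starts at the very beginning of the line.
    line = re.sub(r"(?!^)[- ]* [(\[)\]:\d ]+$", "", line)
    # Step 3: stand-in for Edit, Blank operations, Trim Leading Spaces.
    return line.lstrip(" ")

print(clean_line("01 - Bastille Day  5:19"))  # Bastille Day
print(clean_line("03 - 2112  18:23"))         # 2112
print(clean_line("2112  18:23"))              # 18:23
```

The last call has no leading track number, so step 1 eats the title itself, which is the caveat described above.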

Anyway, it’s there for you to play with. Hopefully you have plenty of ideas on how you might handle that edge case. Often we find those edge cases are where it takes the most effort to figure out.

Good luck
Terry

Mark Olson

Figured it out.

Try this example text:

1. Keine Lust- 4:03
02 - Stairway to Heaven(2:33)
03 I drink alone [3:40]
4.3434 - 5:15
5. 10,000 fists -100:52
6. 11:11 - 4:53
7-A twist in the myth-1:43

Replace (?x-s)\d+\.? \h*(?:-\h*)? (.*?) \h* (?:-\h*\d+:\d\d | [\(\[]\d+:\d\d[\)\]]) with $1.

Relevant documentation: https://npp-user-manual.org/docs/searching/

Essentially four parts:

Flags: (?x-s) (verbose, . does not match newline)
Song number with optional . or - and some space: \d+\.? \h*(?:-\h*)?
(.*?): song name
\h* (?:-\h*\d+:\d\d | [\(\[]\d+:\d\d[\)\]]): optional whitespace, then a dash and a song duration or a song duration enclosed in brackets or parens.
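
The same four parts can be laid out as a Python verbose regex. This is a rough translation, not the exact Notepad++ expression: Python has no \h, so [ \t] is used in its place, the bracket classes are assumed to hold ( [ and ) ], and the names SONG and extract_title are invented for the sketch.

```python
import re

SONG = re.compile(
    r"""
    \d+\.? [ \t]* (?:-[ \t]*)?     # song number, optional dot or dash, spacing
    (.*?)                          # song name, as short as possible
    [ \t]*                         # optional spacing before the duration
    (?: -[ \t]*\d+:\d\d            # a dash followed by a duration
      | [(\[]\d+:\d\d[)\]] )       # or a duration in brackets or parens
    """,
    re.VERBOSE,  # DOTALL is not set, so . stops at line breaks like (?-s)
)

def extract_title(line):
    return SONG.sub(r"\1", line)

for line in ["1. Keine Lust- 4:03", "4.3434 - 5:15", "6. 11:11 - 4:53"]:
    print(extract_title(line))
```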

guy038

Hello, @saltshaker2112, @terry-r, @coises, @mark-olson and All,

And… here is my version !

First, I tried to find a fair and complete song list for testing and… guess what ? I found out a list of Beatles songs on GitHub !! Refer to :

https://github.com/inteligentni/Class-05-Feature-engineering/blob/master/The Beatles songs dataset%2C v1%2C no NAs.csv

From that list, I simply extracted a list of 27 songs, below, keeping only the 3 columns Rank, Title and Duration :

|  01  |  A Hard Day's Night                                         |  2:32  |
|  02  |  12-Bar Original                                            |  2:54  |
|  03  |  Baby, You're a Rich Man                                    |  3:03  |
|  04  |  Back in the U.S.S.R.                                       |  2:43  |
|  05  |  Being for the Benefit of Mr. Kite!                         |  2:37  |
|  06  |  Christmas Time (Is Here Again)                             |  3:03  |
|  07  |  Do You Want to Know a Secret?                              |  1:56  |
|  08  |  Everybody's Got Something to Hide Except Me and My Monkey  |  2:24  |
|  09  |  Hello, Goodbye                                             |  3:27  |
|  10  |  Help!                                                      |  2:18  |
|  11  |  Here, There and Everywhere                                 |  2:25  |
|  12  |  I Want You (She's So Heavy)                                |  7:47  |
|  13  |  I'll Follow the Sun                                        |  1:46  |
|  14  |  I'm Happy Just to Dance with You                           |  1:58  |
|  15  |  Long, Long, Long                                           |  3:04  |
|  16  |  Mean Mr. Mustard                                           |  1:06  |
|  17  |  Ob-La-Di, Ob-La-Da                                         |  3:07  |
|  18  |  Oh! Darling                                                |  3:26  |
|  19  |  One After 909                                              |  2:52  |
|  20  |  P.S. I Love You                                            |  2:06  |
|  21  |  Rain                                                       |  2:59  |
|  22  |  Sgt. Pepper's Lonely Hearts Club Band                      |  1:59  |
|  23  |  She's a Woman                                              |  3:03  |
|  24  |  There's a Place                                            |  1:49  |
|  25  |  When I'm Sixty-Four                                        |  2:37  |
|  26  |  Why Don't We Do It in the Road?                            |  1:42  |
|  27  |  You Can't Do That                                          |  2:37  |

Then, I changed these lines in order to simulate a bad formatting list, which will be our INPUT text :

01  |  a Hard Day's Night                     -  2:32  |
  02 12-bar Original |  [2:54]  |
-  03  |  Baby, You're a Rich Man -  3:03       
               
					
.. 04  |  Back in the U.s.s.R.        [2:43]    
being for the Benefit of Mr. Kite!    |
|  0.6  |  Christmas Time (is Here Again) -  3:03
     07  |  Do You Want to Know a Secret?       [1:56]
|  08  -  Everybody's Got Something to Hide except Me and my Monkey    /  2:24
|  09  )  Hello, Goodbye     |  (3:27)    
|  10 Help! |  (2:18)
|  11 Here, There and Everywhere | - 2:25  |
12 I Want You (She's So heavy) |  7:47  |


13 I'll Follow the Sun   | - 1:46      
   14  |  I'm Happy Just to Dance with You |  1:58  |
15   Long, Long, Long  | - 3:04
...16  |  Mean Mr. Mustard [ 1:06]  |
.. 17  |  Ob-La-Di, Ob-la-Da  [ 3:07]          
| (18) |  Oh! Darling [ 3:26]
(19) |  One After 909           ( 2:52) |
   (20) P.s. I Love You ( 2:06)         
#21 ---  Rain ................................ 2:59



  [22]  |  Sgt. Pepper's Lonely Hearts Club Band        ( 1:59)
[23]  |  She's a Woman   |  {3:03}  |
|  [24]  |  There's a Place |  {1:49}        
|  25  |  When I'm Sixty-four       |  {2:37}

-  26  -      Why Don't We Do It in the Road?  {1:42}  |
you Can't Do That {2:37}

With this first regex S/R below, we rewrite only the title of each song, one per line, ignoring the empty lines and the lines with blank chars only :

SEARCH (?x-i) ^ [0-9\s\W]+ \h+ | (?: \l \x20 \d+ )? \K \h+ [0-9\h\W]+ $
REPLACE Leave EMPTY

Due to the \K syntax, you must use the Replace All button (Do not use the Replace button )

=> 52 occurrences occurred and you should get this temporary text :

a Hard Day's Night
12-bar Original
Baby, You're a Rich Man
Back in the U.s.s.R.
being for the Benefit of Mr. Kite!
Christmas Time (is Here Again)
Do You Want to Know a Secret?
Everybody's Got Something to Hide except Me and my Monkey
Hello, Goodbye
Help!
Here, There and Everywhere
I Want You (She's So heavy)
I'll Follow the Sun
I'm Happy Just to Dance with You
Long, Long, Long
Mean Mr. Mustard
Ob-La-Di, Ob-la-Da
Oh! Darling
One After 909
P.s. I Love You
Rain
Sgt. Pepper's Lonely Hearts Club Band
She's a Woman
There's a Place
When I'm Sixty-four
Why Don't We Do It in the Road?
you Can't Do That

Now, with this second regex S/R, we rewrite any lowercase letter, following a space, a dot, an opening parenthesis or a dash character, by its uppercase equivalent :

SEARCH (?x-i) (?: ^ | (?<= [\x20.(-] ) ) \l
REPLACE \u$0

=> 31 occurrences occurred and here is your expected OUTPUT text :

A Hard Day's Night
12-Bar Original
Baby, You're A Rich Man
Back In The U.S.S.R.
Being For The Benefit Of Mr. Kite!
Christmas Time (Is Here Again)
Do You Want To Know A Secret?
Everybody's Got Something To Hide Except Me And My Monkey
Hello, Goodbye
Help!
Here, There And Everywhere
I Want You (She's So Heavy)
I'll Follow The Sun
I'm Happy Just To Dance With You
Long, Long, Long
Mean Mr. Mustard
Ob-La-Di, Ob-La-Da
Oh! Darling
One After 909
P.S. I Love You
Rain
Sgt. Pepper's Lonely Hearts Club Band
She's A Woman
There's A Place
When I'm Sixty-Four
Why Don't We Do It In The Road?
You Can't Do That
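
A comparable case-fixing pass can be sketched in Python. Python's re module has neither \l nor \u, so this version assumes plain ASCII titles, uses [a-z] for the lowercase letter and a replacement function for the uppercasing; the names are hypothetical.

```python
import re

# A lowercase letter at the start of a line, or right after a space,
# a dot, an opening parenthesis or a dash.
LOWER_AT_WORD_START = re.compile(r"(?:^|(?<=[ .(-]))[a-z]", re.MULTILINE)

def capitalize_words(text):
    return LOWER_AT_WORD_START.sub(lambda m: m.group(0).upper(), text)

print(capitalize_words("being for the Benefit of Mr. Kite!"))
print(capitalize_words("Back in the U.s.s.R."))
```

Letters after an apostrophe, as in She's, are left alone because the apostrophe is not in the lookbehind class.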

Best Regards,

guy038

Saltshaker2112

@guy038 said in Help with Trimming text-Remove before and after words:

(?x-i) ^ [0-9\s\W]+ \h+ | (?: \l \x20 \d+ )? \K \h+ [0-9\h\W]+ $

Thanks!! This works pretty well too. I don't think the other ones worked, but I still have the issue with “2112”.
So here's a real setlist:

01) - Bastille Day  5:19
02. - Lakeside Park  4:41
[03] - Bytor And The Snowdog  5:43
04 - Xanadu  12:06
05 - A Farewell To Kings  6:35
06 - Something For Nothing  4:13
07 - Cygnus X-1  10:22
01 - Anthem  4:15
02 - Closer To The Heart  3:35
03 - 2112  18:23
04 - Working Man / Fly By Night / In The Mood / Drum Solo  15:16
05 - Cinderella Man  5:14

Which results in this, with 2112 missing:
Bastille Day
Lakeside Park
Bytor And The Snowdog
Xanadu
A Farewell To Kings
Something For Nothing
Cygnus X-1
Anthem
Closer To The Heart
Working Man / Fly By Night / In The Mood / Drum Solo
Cinderella Man

Still trying some variables but no luck yet but thank you to all so far. This is awesome work.

Coises

@Saltshaker2112 Try this:
^[^\w\r\n]*\d+[^\w\r\n]*([^\r\n]*\w[^\w\h\r\n]*)\h+[^\w\r\n]*\d+:\d+[^\w\r\n]*$
using this:
\1
as the replacement string.

For me, it’s easier to match a whole line and use a capture expression (the parenthesized part, which is substituted for the \1 in the replacement) rather than try to figure out how to avoid matching troublesome bits like the 2112.

EDIT: Above is still wrong; for example, given:
20 (Your Love Has Lifted Me) Higher and Higher (2:30)
it loses the opening parenthesis.
Make it:
^[^\w\r\n]*\d+[^\w\r\n]*\h([^\r\n]*\w[^\w\h\r\n]*)\h+[^\w\r\n]*\d+:\d+[^\w\r\n]*$
with:
\1
as the replacement string.
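
For anyone scripting this instead of using the Replace dialog, here is a Python sketch of the same whole-line idea. It assumes [ \t] is an acceptable substitute for \h, which Python does not support; LINE and titles are names made up for the example.

```python
import re

LINE = re.compile(
    r"^[^\w\r\n]*\d+[^\w\r\n]*[ \t]"        # track number and punctuation
    r"([^\r\n]*\w[^\w \t\r\n]*)"             # captured title
    r"[ \t]+[^\w\r\n]*\d+:\d+[^\w\r\n]*$",   # duration and trailing punctuation
    re.MULTILINE,
)

def titles(text):
    # Replace each whole line with just the captured title.
    return LINE.sub(r"\1", text)

print(titles("03 - 2112  18:23"))
print(titles("20 (Your Love Has Lifted Me) Higher and Higher (2:30)"))
```

Because the title is captured rather than skipped around, the all-digit 2112 survives and the opening parenthesis of the second title is kept.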

Saltshaker2112

@Coises

Wow, that looks like it did the trick!!! Thank you, and thank you, everyone here. I gotta say, all of you guys are awesome and I appreciate this very much. It saves me a lot of time! Thanks again.

Mark Olson

OK, here’s my master regex that should deal with maximally pathological examples in all the formats you’ve shown me:
Replace (?-s)[\[\(]?\d+\.?[\)\]]?\h*(?:-\h*)?(\S.*?\S)\h*(?:-\h*)?[\[\(]?\d+:\d\d[\)\]]? with $1

Tested on your setlist, plus the maximally evil song title 11:11 by Rodrigo y Gabriela:

10 - 11:11 4:49
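
A Python approximation of that regex, for checking the tricky lines in a script. It is a sketch under two assumptions: the bracket classes are meant to hold ( [ and ) ], and [ \t] replaces \h. MASTER and song_title are hypothetical names.

```python
import re

MASTER = re.compile(
    r"[\[(]?\d+\.?[)\]]?[ \t]*(?:-[ \t]*)?"     # track number and separator
    r"(\S.*?\S)"                                # title: starts and ends on a non-space
    r"[ \t]*(?:-[ \t]*)?[\[(]?\d+:\d\d[)\]]?"   # separator and duration
)

def song_title(line):
    return MASTER.sub(r"\1", line)

print(song_title("10 - 11:11 4:49"))    # 11:11
print(song_title("03 - 2112  18:23"))   # 2112
```

Note that (\S.*?\S) needs a title of at least two characters.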

And thank you, @Saltshaker2112 , for providing us with interesting regex challenges. I have progressed substantially as a regex-er by hanging out in this forum and working on puzzles like this.

guy038

Hi, @saltshaker2112, @terry-r, @coises, @mark-olson and All,

Ah… OK. But, if we have to be less restrictive on the text to keep, we must be more restrictive regarding the text to get rid of ! Thus :

The part BEFORE the song’s title, which will be deleted, is :
- Any NON-word text followed with a number, followed by anything with a final dash AND, at least, ONE blank char
- Any number, up to three digits, possibly preceded with blank chars and followed with, at least, ONE blank char
The part AFTER the song’s title, which will be deleted, is :
- At least ONE blank char, followed by any char among ([{<_-, followed by possible space chars, followed with a duration ( \d{1,2}(:)\d{2} ), followed with possible space chars, followed with any char among )]}>_- and finally followed with a combination of blank and new-line chars
- This part, which manages possible line-breaks, is then replaced by a single line-break ONLY

So, starting with the INPUT text, below :



01) - Bastille Day  5:19


02. - Lakeside Park [   4:41      ]       
[03] - Bytor And The Snowdog  5:43
04 - Xanadu  12:06
05 - A Farewell To Kings ( 6:35 )
Something For Nothing   4:13
                

							
((07 - Cygnus X-1  10:22
01 - Anthem  4:15
02- Closer To The Heart 999 -  3:35    -
[03  ] - 2112  18:23

(  03) - (2112) This Is A Test     [2012 ]  18:23
03}} - [  2112  ] This Is An Other Test  2012 <18:23   >
04 Working Man / Fly By Night / In The Mood / Drum Solo  _15:16_
05 - Cinderella Man  5:14

Here is my new version of the first regex S/R, which gets a clean list of the song’s titles :

SEARCH (?x) ^ \h* (?: \W* \d+ \W* \h* - | \d{1,3} ) \h+ | \h+ [([{<_-]? \x20* \d{1,2} ( : ) \d{2} \x20* [)\]}>_-]? ( \h* \R )+
REPLACE ?1\r\n

And you get this OUTPUT text :

Bastille Day
Lakeside Park
Bytor And The Snowdog
Xanadu
A Farewell To Kings
Something For Nothing
Cygnus X-1
Anthem
Closer To The Heart 999
2112
(2112) This Is A Test     [2012 ]
[  2112  ] This Is An Other Test  2012
Working Man / Fly By Night / In The Mood / Drum Solo
Cinderella Man
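
The same two-branch idea can be approximated in Python, with some substitutions that are assumptions of this sketch: [ \t] for \h, an explicit alternation for \R, an escaped ] inside the closing bracket class, and a replacement function in place of the conditional replacement ?1\r\n (it emits \n rather than \r\n). CLEANUP and clean are hypothetical names.

```python
import re

CLEANUP = re.compile(
    r"""
      ^ [ \t]* (?: \W* \d+ \W* [ \t]* - | \d{1,3} ) [ \t]+
    | [ \t]+ [(\[{<_-]? \x20* \d{1,2} (:) \d{2} \x20* [)\]}>_-]?
      (?: [ \t]* (?:\r\n|\r|\n) )+
    """,
    re.VERBOSE | re.MULTILINE,
)

def clean(text):
    # Group 1 (the colon) only takes part in the trailing branch, so it
    # tells the two branches apart: the leading part is deleted, the
    # trailing part collapses to a single line break.
    return CLEANUP.sub(lambda m: "\n" if m.group(1) else "", text)

print(clean("[03  ] - 2112  18:23\n05 - Cinderella Man  5:14\n"))
```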

Hope that it’s the expected one !!

Of course, the second regex, regarding case changes, is the same as in my previous post !

BR

guy038

P.S. Note that the simple lines, below :

123 789 15:47
00 15:47

With a song’s title containing fewer than four digits ONLY, with or without a leading rank, would wrongly end up as :

15:47
03:19

I chose the limit of three digits, in order that lines with a leading rank up to three digits, immediately followed by the title, as below, are correctly handled ! Indeed :

456 The most beautiful song of all the times (12:53)

Would correctly result as :

The most beautiful song of all the times