Parallel searching for 2 Names

Ekopalypse

This is something that can normally be achieved with regular expressions (regex).
Something like name_one.{1000}name_two means: search for name_one followed by exactly 1000 characters, after which name_two must appear.
But regex can get really complicated.
You can find more information here.

Alan Kilborn

@erich-siebenhaar

Building on what @Ekopalypse said…

I’d think (?s)regex1.{0,1000}?regex2 meets the need (let us start with the “within 1000 characters” spec).

Can you show us how it doesn’t meet the need?
Sample data is certainly welcome, and probably helps.
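To make the difference between the two suggestions concrete, here is a minimal sketch in Python. Python's re module is not the engine Notepad++ uses, so treat it only as an illustration of the quantifier and of the (?s) flag; the sample text is made up.

```python
import re

text = "name_one\nfoo name_two"  # made-up sample with a line break between the names

no_dotall = re.search(r"name_one.{0,1000}?name_two", text)
with_dotall = re.search(r"(?s)name_one.{0,1000}?name_two", text)
exact_count = re.search(r"(?s)name_one.{1000}name_two", text)

print(no_dotall)            # None: without (?s) the dot stops at the line break
print(with_dotall.group())  # the whole sample, line break included
print(exact_count)          # None: .{1000} demands exactly 1000 characters
```

The lazy {0,1000}? form stops at the nearest name_two within range, which is usually what "within 1000 characters" is meant to say.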

Neil Schipper

@erich-siebenhaar:

search for 2 Names (or expressions) at the same time

These are all different things one could try to match (limited by the range):

either John or Mary
<John><arbitrary text><Mary>
either <John><arbitrary text><Mary> or <Mary><arbitrary text><John>
for <John><text><Mary>, if we encounter <John><text><John><text><Mary>, the whole match could start from the first or the last occurrence of <John>

Each would require its own expression.
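A rough sketch of what those four expressions could look like, written for Python's re module rather than for Notepad++ (the range limit MAX_GAP and the sample sentence are assumptions for illustration only):

```python
import re

MAX_GAP = 50  # assumed range limit, in characters
text = "John met John before Mary left; Mary called John."

# 1. either John or Mary
either_name = re.compile(r"John|Mary")
# 2. John, arbitrary text, Mary (match starts at the first John)
ordered = re.compile(rf"(?s)John.{{0,{MAX_GAP}}}?Mary")
# 3. the two names in either order
either_order = re.compile(
    rf"(?s)John.{{0,{MAX_GAP}}}?Mary|Mary.{{0,{MAX_GAP}}}?John"
)
# 4. like 2, but no further John may occur in between,
#    so the match starts at the last John before Mary
from_last_john = re.compile(rf"(?s)John(?:(?!John).){{0,{MAX_GAP}}}?Mary")

print(either_name.findall(text))
print(ordered.search(text).group())
print([m.group() for m in either_order.finditer(text)])
print(from_last_john.search(text).group())
```

The fourth pattern uses a negative lookahead before every character of the gap, which is one common way to express "start from the last occurrence"; other formulations are possible.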

Alan Kilborn

@neil-schipper

Don’t give the OP ideas about how to upscope his need! :-)

Erich Siebenhaar

@alan-kilborn
Thank you!!
This does in fact solve the problem.
At first I thought it did not, because in this text:

1 Botvinnik,Mikhail Moiseevich * ½ ½ ½ ½ 1 0 ½ ½ 1 1 1 1 1 1 1 11.0/15 71.75
2 Smyslov,Vassily Vasilievich ½ * ½ ½ ½ ½ ½ ½ ½ 1 1 1 1 1 1 1 11.0/15 71.50
3 Taimanov,Mark Evgenievich ½ ½ * ½ 1 1 ½ ½ ½ ½ ½ 1 ½ 1 1 1 10.5/15
4 Gligoric,Svetozar ½ ½ ½ * 0 ½ ½ ½ 1 ½ ½ 1 1 1 1 1 10.0/15
5 Bronstein,David Ionovich ½ ½ 0 1 * ½ ½ ½ ½ ½ 1 ½ 1 ½ 1 1 9.5/15
6 Najdorf,Miguel 0 ½ 0 ½ ½ * ½ ½ 1 ½ ½ ½ 1 1 1 1 9.0/15
7 Keres,Paul 1 ½ ½ ½ ½ ½ * 1 0 ½ 0 ½ ½ ½ 1 1 8.5/15 61.25
8 Pachman,Ludek ½ ½ ½ ½ ½ ½ 0 * ½ ½ ½ ½ ½ 1 1 1 8.5/15 56.00
9 Unzicker,Wolfgang ½ ½ ½ 0 ½ 0 1 ½ * 1 ½ ½ ½ 1 0 1 8.0/15 56.25
10 Stahlberg,Anders Gideon Tom 0 0 ½ ½ ½ ½ ½ ½ 0 * ½ ½ 1 1 1 1 8.0/15 48.25
11 Szabo,Laszlo 0 0 ½ ½ 0 ½ 1 ½ ½ ½ * ½ ½ ½ 0 ½ 6.0/15
12 Padevsky,Nikola Bochev 0 0 0 0 ½ ½ ½ ½ ½ ½ ½ * 0 ½ 1 ½ 5.5/15 34.75
13 Uhlmann,Wolfgang 0 0 ½ 0 0 0 ½ ½ ½ 0 ½ 1 * 1 ½ ½ 5.5/15 32.50
14 Ciocaltea,Victor 0 0 0 0 ½ 0 ½ 0 0 0 ½ ½ 0 * 1 ½ 3.5/15
15 Sliwa,Bogdan 0 0 0 0 0 0 0 0 1 0 1 0 ½ 0 * ½ 3.0/15
16 Golombek,Harry 0 0 0 0 0 0 0 0 0 0 ½ ½ ½ ½ ½ * 2.5/15

(?s)Botvinnik.{0,1000}?Golombek

does not find the two players, but

(?s)Botvinnik.{0,1000}?Uhlmann works.

Why do I need (?s)Botvinnik.{0,1800}?Golombek to find the expressions, even though the lines are less than 80 characters long?

Anyway, I will search for 2000 and find everything I need.

Alan Kilborn

@erich-siebenhaar said in Parallel searching for 2 Names:

Why do I need (?s)Botvinnik.{0,1800}?Golombek to find the expressions, even though the lines are less than 80 characters long?

In your example text, I see that the end of Botvinnik and the start of Golombek are 1092 positions apart. Thus using 1000 instead of 1800 isn’t going to find it.

Interesting, however, is your use of the UTF-8 multibyte character ½. This character is encoded into 2 bytes each time it occurs.

If I replace ½ with a single-byte character, e.g. 1, and repeat the search using 1000, it succeeds in finding the match, because now the position difference between the two words is less than 1000.

Thus it appears that the regex count qualifiers are unaware of multibyte character encoding. :-( I don’t like this… something like .{1000} should match 1000 characters, not 1000 bytes. @guy038 , do you have some comment on this?

guy038

Hello, @erich-siebenhaar, @ekopalypse, @alan-kilborn and All,

Alan, don’t worry ! the regex dot symbol ( . ) does count characters and not bytes ;-))

I don’t know which encoding was active when you tested, or perhaps your selection was wrong !

I will consider the text :

1 Botvinnik,Mikhail Moiseevich * ½ ½ ½ ½ 1 0 ½ ½ 1 1 1 1 1 1 1 11.0/15 71.75
2 Smyslov,Vassily Vasilievich ½ * ½ ½ ½ ½ ½ ½ ½ 1 1 1 1 1 1 1 11.0/15 71.50
3 Taimanov,Mark Evgenievich ½ ½ * ½ 1 1 ½ ½ ½ ½ ½ 1 ½ 1 1 1 10.5/15
4 Gligoric,Svetozar ½ ½ ½ * 0 ½ ½ ½ 1 ½ ½ 1 1 1 1 1 10.0/15
5 Bronstein,David Ionovich ½ ½ 0 1 * ½ ½ ½ ½ ½ 1 ½ 1 ½ 1 1 9.5/15
6 Najdorf,Miguel 0 ½ 0 ½ ½ * ½ ½ 1 ½ ½ ½ 1 1 1 1 9.0/15
7 Keres,Paul 1 ½ ½ ½ ½ ½ * 1 0 ½ 0 ½ ½ ½ 1 1 8.5/15 61.25
8 Pachman,Ludek ½ ½ ½ ½ ½ ½ 0 * ½ ½ ½ ½ ½ 1 1 1 8.5/15 56.00
9 Unzicker,Wolfgang ½ ½ ½ 0 ½ 0 1 ½ * 1 ½ ½ ½ 1 0 1 8.0/15 56.25
10 Stahlberg,Anders Gideon Tom 0 0 ½ ½ ½ ½ ½ ½ 0 * ½ ½ 1 1 1 1 8.0/15 48.25
11 Szabo,Laszlo 0 0 ½ ½ 0 ½ 1 ½ ½ ½ * ½ ½ ½ 0 ½ 6.0/15
12 Padevsky,Nikola Bochev 0 0 0 0 ½ ½ ½ ½ ½ ½ ½ * 0 ½ 1 ½ 5.5/15 34.75
13 Uhlmann,Wolfgang 0 0 ½ 0 0 0 ½ ½ ½ 0 ½ 1 * 1 ½ ½ 5.5/15 32.50
14 Ciocaltea,Victor 0 0 0 0 ½ 0 ½ 0 0 0 ½ ½ 0 * 1 ½ 3.5/15
15 Sliwa,Bogdan 0 0 0 0 0 0 0 0 1 0 1 0 ½ 0 * ½ 3.0/15
16 Golombek,Harry 0 0 0 0 0 0 0 0 0 0 ½ ½ ½ ½ ½ * 2.5/15

For me, the number of characters from right after the word Botvinnik to right before the word Golombek is exactly 975. So :

The regex (?s)Botvinnik.{975}Golombek does find the range of chars and both words
The regex (?s)Botvinnik.{974}Golombek does not find anything, and neither does the regex (?s)Botvinnik.{976}Golombek
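The characters-versus-bytes question can be illustrated outside Notepad++ with a small Python sketch. It says nothing about how the Notepad++ engine is implemented; it only shows how the same .{n} count behaves when a pattern is applied to decoded text and when it is applied to raw UTF-8 bytes. The miniature text and the gap helper are hypothetical.

```python
import re

# Hypothetical miniature of the crosstable: ten "½" between the two names.
text = "Botvinnik" + "½" * 10 + "Golombek"
raw = text.encode("utf-8")  # each "½" becomes 2 bytes here

def gap(haystack, first, second):
    """Distance from the end of `first` to the start of `second`."""
    start = haystack.index(first) + len(first)
    return haystack.index(second, start) - start

char_gap = gap(text, "Botvinnik", "Golombek")   # counted in characters
byte_gap = gap(raw, b"Botvinnik", b"Golombek")  # counted in UTF-8 bytes

# On decoded text the dot consumes one character per step.
chars_match = re.search(r"(?s)Botvinnik.{10}Golombek", text)
# On raw bytes the dot consumes one byte per step.
bytes_match_10 = re.search(rb"(?s)Botvinnik.{10}Golombek", raw)
bytes_match_20 = re.search(rb"(?s)Botvinnik.{20}Golombek", raw)

print(char_gap, byte_gap)
print(chars_match is not None, bytes_match_10 is not None, bytes_match_20 is not None)
```

A character-counting engine behaves like the first search, which is consistent with the exact-count test reported above.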

Like you, I would have been rather upset if the count had concerned bytes and not chars :-((

Now, Erich, here is an improved regex to find each word, with their exact case, whatever their order :

SEARCH (?s-i)(?:(Name_1)|(Name_2)).{0,2000}?(?(1)(?2)|(?1))

For instance, with your example :

SEARCH (?s-i)(?:(Botvinnik)|(Golombek)).{0,2000}?(?(1)(?2)|(?1))

SEARCH (?s-i)(?:(Padevsky)|(Gligoric)).{0,2000}?(?(1)(?2)|(?1))
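For readers who want to try the either-order idea outside Notepad++, here is a hedged Python sketch. Python's re module supports the conditional (?(1)...|...) but not the subroutine calls (?1) and (?2), so the names are spelled out again inside the conditional; the function name either_order and the two-line sample are assumptions, not part of the original regex.

```python
import re

def either_order(name_1, name_2, max_chars):
    """Compile a case-sensitive pattern: one name, up to max_chars
    characters (line breaks included), then the other name."""
    a, b = re.escape(name_1), re.escape(name_2)
    return re.compile(
        # group 1 set -> name_1 was found first, so name_2 must follow;
        # otherwise name_2 was found first, so name_1 must follow
        rf"(?:({a})|({b})).{{0,{max_chars}}}?(?(1){b}|{a})",
        re.DOTALL,
    )

sample = "1 Botvinnik * ½\n16 Golombek 0\n"  # made-up two-line excerpt
match = either_order("Golombek", "Botvinnik", 2000).search(sample)
print(match.group())
```

Swapping the two name arguments gives the same match, which is the point of the conditional.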

Here is a second regex to find each word, with their exact case, whatever the order too, but :

A first click, on the Find Next button, finds the first word
A second click, on the Find Next button, finds the second word

SEARCH (?s-i).*?\K(?:(Name_1)|(Name_2))|.{0,2000}?\K(?(1)(?2)|(?1))

Always with your example :

SEARCH (?s-i).*?\K(?:(Botvinnik)|(Golombek))|.{0,2000}?\K(?(1)(?2)|(?1))

SEARCH (?s-i).*?\K(?:(Padevsky)|(Gligoric))|.{0,2000}?\K(?(1)(?2)|(?1))
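Python's re module has no \K, so the two-click behaviour cannot be copied literally there. As a plainly different technique with a similar result, the following sketch locates every occurrence of either name and pairs neighbouring hits of different names that lie within the range. The function name hits_within and the sample line are hypothetical.

```python
import re

def hits_within(text, name_1, name_2, max_chars):
    """Return (first_span, second_span) pairs for neighbouring hits of
    different names separated by at most max_chars characters."""
    pattern = re.compile(f"{re.escape(name_1)}|{re.escape(name_2)}")
    hits = list(pattern.finditer(text))
    pairs = []
    for first, second in zip(hits, hits[1:]):
        different = first.group() != second.group()
        close_enough = second.start() - first.end() <= max_chars
        if different and close_enough:
            pairs.append((first.span(), second.span()))
    return pairs

sample = "Padevsky x Gligoric"  # made-up sample
print(hits_within(sample, "Padevsky", "Gligoric", 2000))
```

Each pair gives the two positions separately, much as two successive Find Next clicks would select them one after the other.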

Best Regards,

guy038

guy038

Hi, @erich-siebenhaar, @ekopalypse, @alan-kilborn and All,

Sorry, I forgot to discuss your other case : find two words separated by, let’s say, not more than 50 lines

In that case, the first regex, matching both words and the lines in between, is :

SEARCH (?-si)(?:(Name_1)|(Name_2)).*\R(.*\R){0,50}.*(?(1)(?2)|(?1))

Test these two regexes, below, against your example :

SEARCH (?-si)(?:(Botvinnik)|(Golombek)).*\R(.*\R){0,50}.*(?(1)(?2)|(?1))

SEARCH (?-si)(?:(Padevsky)|(Gligoric)).*\R(.*\R){0,50}.*(?(1)(?2)|(?1))
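The line-based variant can be sketched in Python as well, with two stated differences: \R does not exist in Python's re module, so \n is used on the assumption of LF line endings, and the names are repeated in the conditional because (?1) and (?2) are not available. The helper name within_lines and the three-line sample are made up.

```python
import re

def within_lines(name_1, name_2, max_lines):
    """Compile a pattern for the two names on different lines, in either
    order, with at most max_lines complete lines between their lines."""
    a, b = re.escape(name_1), re.escape(name_2)
    # No DOTALL here: each .* stays within a single line.
    return re.compile(
        rf"(?:({a})|({b})).*\n(?:.*\n){{0,{max_lines}}}.*(?(1){b}|{a})"
    )

sample = "1 Botvinnik\n2 Smyslov\n3 Golombek\n"
one_between = within_lines("Botvinnik", "Golombek", 1).search(sample)
none_between = within_lines("Botvinnik", "Golombek", 0).search(sample)
print(one_between.group())
print(none_between)
```

As in the original regex, the pattern requires at least one line break between the names, so two names on the same line are not reported.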

Unfortunately, when dealing with lines rather than characters, I was unable to work out the second regex version, the one that would find the first word on one click and then the second word on the next !

Note : Of course, if you do not mind about case, change any -i modifier to the i modifier, which leads to :

(?si)..., in my previous post
(?i-s)..., in this present post !

BR

guy038

Alan Kilborn

@guy038 said in Parallel searching for 2 Names:

Alan, don’t worry ! the regex dot symbol ( . ) does count characters and not bytes ;-))

Not sure what I originally did when I experimented with the data.
I’m sure that file encoding was UTF-8.
But trying it again now, (?s)Botvinnik.{0,1000}?Golombek definitely does work on the OP’s data, so…sorry for the noise.

Alan Kilborn

@guy038 said in Parallel searching for 2 Names:

SEARCH (?s-i)(?:(Name_1)|(Name_2)).{0,2000}?(?(1)(?2)|(?1))

I think if you are making this into a generic formula, it should be:

(?s-i)(?:(Name_1)|(Name_2)).{0,Max_chars}?(?(1)(?2)|(?1))
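If the formula is filled in often, the substitution itself can be scripted. This is only a string-templating sketch: build_search is a hypothetical helper, it assumes the names contain no regex metacharacters, and its output is meant to be pasted into the Find dialog rather than run in Python.

```python
def build_search(name_1: str, name_2: str, max_chars: int) -> str:
    """Fill Name_1, Name_2 and Max_chars into the generic formula.

    Assumes both names are plain text without regex metacharacters.
    """
    return f"(?s-i)(?:({name_1})|({name_2})).{{0,{max_chars}}}?(?(1)(?2)|(?1))"

print(build_search("Botvinnik", "Golombek", 2000))
```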