How to find two or more non-consecutive tabs in a line?

glossar

Hi Alan,

Thank you but sadly it won’t work. It finds only two tabs, each in every other line, at least in my file, whereas it should locate a line that contains 2 or more tabs (e.g.: blah [tab] blah blah more blah [tab] (blah blah [tab] blah)… ).

Alan Kilborn

This raises maybe an interesting discussion: when are characters inside a character class, that is, between [ and ], non-literal? On first crafting the above regex, I thought: this isn’t going to work, it is going to look for \ or t separately, not “tab” characters. But lo and behold, it does look for tabs. What are the rules for this?

I know that [\R] will match \ or R and not match \R but that may be a special case and invalid because it can match possibly 2 characters, not just one.

But there must be some general rules on what is special inside […] and [^…], besides the “specialness” of - when used for a range (example: [a-z]) and the special handling needed to get ] included in the set…

Alan Kilborn

@glossar said:

Thank you but sadly it won’t work.

Hmmm. Works for me with a Mark operation shown here:

Imgur

I copied your text from this thread, did a regex replace on it for \[tab\] with \t, and then applied the regex specified earlier to red-mark the text.

glossar

I can confirm that it finds a line that contains two tabs, but if a line doesn’t meet the criteria, it looks further (greedy, you say? :) ) and hence matches the following line together with it, which in the end looks like “every other line”. I’m pretty sure it skips over the \r\n of a line if this line contains only one tab. Can you limit the regex so that it searches within a single line only (by line, I mean anything between ^ and \r\n)?

Alan Kilborn

@glossar

Ah, yes, okay, that makes sense. The [^\t]+ will capture across line-boundaries. At this point I will bow out and let the regex master @guy038 step in… :)

And maybe he can comment on my “interesting discussion” post above as well.

Meta Chuh

maybe a screenshot helps:
Imgur

glossar

I can’t see the screenshots above - neither on this page nor when clicking on it. All I see is a broken-image-file-icon and “Imgur” next to it.

Alan Kilborn

Okay, one more try. It could be as simple(!) as changing it to this:

(?-s)^.*?\t(?!\t).+?\t.*?$

:)

glossar

Thanks, that now works like a charm! :)

While we are at it, how about building another regex that locates a line that contains no tab? :)

Alan Kilborn

@glossar said:

regex that locates a line that contains no tab?

There might be better ones, but this one seems to work:

^((?!\t).)*$
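For anyone who wants to check this outside Notepad++, here is a minimal sketch in Python. It assumes Python’s re module behaves like Notepad++’s Boost engine for this particular pattern, which is not guaranteed. re.MULTILINE is added so that ^ and $ anchor at every line, and the sample text is made up for illustration. Note that the empty line is reported too, since zero characters also satisfy the pattern.

```python
import re

# Assumption: Python's re stands in for Notepad++'s Boost regex engine.
# MULTILINE makes ^ and $ match at the start and end of every line.
NO_TAB_LINE = re.compile(r"^((?!\t).)*$", re.MULTILINE)

# Hypothetical sample: a plain line, a line with a tab, an empty line, a plain line.
text = "no tab here\none\ttab\n\nlast line"

# Collect the full text of every match; the empty line shows up as "".
matches = [m.group(0) for m in NO_TAB_LINE.finditer(text)]
print(matches)
```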

guy038

Hi, @glossar, @alan-kilborn, and All,

A second solution could be :

SEARCH (?-s)(?=.*\t.*\t).+

A third solution could be, using the Mark dialog, w/o checking the Bookmark line option :

MARK (?-s)\t.*\t

Note, @alan-kilborn, that your regex should be changed into :

SEARCH (?-s)^.*?\t[^\t\r\n]+\t.*?$

This avoids wrong multi-line matches. However, this solution still misses some possibilities!

You may test these 3 regexes, above, against the sample test, below :

---------------------------- 1 TEXT block without TAB -----> KO <----- ( because NO tabulation )
abcd
---------------------------- 1 TAB  without TEXT ----------> KO <----- ( because ONE tabulation ONLY )
	
---------------------------- 2 TABs without TEXT ----------- OK ------
		
---------------------------- 3 TABs without TEXT ----------- OK ------
			
---------------------------- 1 TAB  + 1 TEXT block --------> KO <----- ( because ONE tabulation ONLY )
abcd	
	abcd
---------------------------- 1 TAB  + 2 TEXT blocks -------> KO <----- ( because ONE tabulation ONLY )
abcd	efgh
---------------------------- 2 TABs + 1 TEXT block --------- OK ------
efgh		
	efgh	
		efgh
---------------------------- 2 TABs + 2 TEXT blocks -------- OK ------
abcd	efgh	
abcd		ijkm
	efgh	ijkl
---------------------------- 2 TABs + 3 TEXT blocks -------- OK ------
abcd	efgh	ijkl
---------------------------- 3 TABs + 1 Text block --------- OK ------
abcd			
	efgh		
		ijkl	
			mnop
---------------------------- 3 TABs + 2 Text blocks -------- OK ------
abcd	efgh		
abcd		ijkl	
abcd			monp
	efgh	ijkl	
	efgh		monp
		ijkl	monp
---------------------------- 3 TABs + 3 Text blocks -------- OK ------
abcd	efgh	ijkm	
	efgh	ijkl	mnop
---------------------------- 3 TABs + 4 Text blocks -------- OK ------
abcd	efgh	ijkl	mnop
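As a rough cross-check outside Notepad++, the three regexes can be tried against a few lines borrowed from the sample above. This is only a sketch under assumptions: it uses Python’s re module instead of the Boost engine, it drops the leading (?-s) because Python, as far as I know, has no global switch of that form (its . excludes \n by default anyway), and it tests each line separately, so multi-line effects are not exercised. The names PATTERNS, SAMPLE and lines_found are made up for this illustration.

```python
import re

# Assumption: Python's re stands in for Notepad++'s Boost regex engine.
# The (?-s) prefix is omitted; "." does not match "\n" by default in Python.
PATTERNS = {
    "lookahead": re.compile(r"(?=.*\t.*\t).+"),
    "mark": re.compile(r"\t.*\t"),
    "alan_fixed": re.compile(r"^.*?\t[^\t\r\n]+\t.*?$"),
}

SAMPLE = [
    "abcd",              # no tab
    "abcd\tefgh",        # one tab
    "\t\t",              # two tabs, no text
    "abcd\t\tijkm",      # two consecutive tabs
    "abcd\tefgh\tijkl",  # two tabs separated by text
]

def lines_found(pattern):
    """Return the sample lines in which the pattern finds a match."""
    return [line for line in SAMPLE if pattern.search(line)]

results = {name: lines_found(p) for name, p in PATTERNS.items()}
for name, found in results.items():
    print(name, [repr(line) for line in found])
```

In this sketch the corrected version of Alan’s regex only finds the last sample line, because it requires at least one non-tab character between the two tabs.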

Best Regards,

guy038

PeterJones

@glossar , @Alan-Kilborn , @Meta-Chuh , et alia,

Unfortunately, the (?-s) only changes the behavior of . with respect to newlines; it doesn’t change character classes, so [^\t]+ means “one or more characters that don’t match a TAB, even if those characters are newlines”. By changing the full regex to (?-s)^.*?\t[^\t\r\n]+\t.*?$, I was able to get it to skip lines like @Meta-Chuh 's example of x instead of the TAB. The expression [^\t\r\n]+ means “match one or more characters, none of which is a TAB, CR (carriage return), or LF (line feed)”.
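A small sketch of that difference, written in Python rather than in Notepad++. The assumptions: Python’s re treats negated classes the same way as Boost here, re.MULTILINE is needed for per-line ^ and $, and the leading (?-s) is left out because . does not match \n by default in Python. The “loose” pattern is simply the corrected regex with \r\n taken out of the class again, and the two-line sample text is made up.

```python
import re

# Hypothetical sample: line 1 has a single tab, line 2 has two tabs.
text = "one\ttab only\nsecond line\thas\ttwo tabs"

# [^\t] also accepts "\n", so a match may start on line 1 and end on line 2.
loose = re.search(r"^.*?\t[^\t]+\t.*?$", text, re.MULTILINE)

# Adding \r and \n to the negated class keeps the whole match on one line.
strict = re.search(r"^.*?\t[^\t\r\n]+\t.*?$", text, re.MULTILINE)

print(repr(loose.group(0)))
print(repr(strict.group(0)))
```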

I am not as much of a regex expert as @guy038, so I may be misinterpreting; however, the Boost docs say (emphasis mine):

Escaped Characters
All the escape sequences that match a single character, or a single character class are permitted within a character class definition. For example [\[\]] would match either of [ or ] while [\W\d] would match any character that is either a “digit”, or is not a “word” character.

Since \R doesn’t match a “single character” (it can match a single character or more than one character, see Boost’s “Matching Line Endings” section), it doesn’t fall within the escape sequences permitted in a character class.
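To illustrate the quoted rule, here is a short Python sketch. It is only an approximation of what Notepad++ does: Python’s re is a different engine and, as far as I know, has no \R at all, so that part cannot be reproduced here. What it does show is that single-character escapes such as \t and class escapes such as \W and \d keep their meaning inside [...].

```python
import re

# "\t" inside [...] still means the TAB character,
# not "a backslash or the letter t".
tab_class = re.compile(r"[\t]")

# The subject holds: a, TAB, b, a literal backslash, t.
hits = tab_class.findall("a\tb\\t")

# [\W\d]: any character that is a digit or is not a word character.
digit_or_nonword = re.compile(r"[\W\d]")

for ch in ["7", "!", "x", "_"]:
    print(repr(ch), bool(digit_or_nonword.fullmatch(ch)))
```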

edit: while typing this up, four more posts were made. Hopefully, I still added to the discussion.
edit 2: clarify the \R

Alan Kilborn

@PeterJones said:

Hopefully, I still added to the discussion.

You did, and you helped make it an “interesting discussion”. thanks.

glossar

Alan, the second one that finds no-tab :), works, thank you.

Guy and Peter - Thank you for stepping in! :) Much appreciated!

Have a nice day!

guy038

Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones, and All,

Here is another solution, which matches the entire contents of any line containing at least 2 tabulation chars ( can’t do shorter ! ) :

SEARCH (?-s).*\t.*\t.*

Just for information, another formulation of Alan’s regex, which searches for lines that do not contain any tabulation char, could be :

SEARCH (?!.*\t)^.+

Negative character classes are indeed often misunderstood ! When you use, for instance, the negative character class below :

[^<char1><char2><char3>-<char4>]

It will match ANY Unicode character which is DIFFERENT from <char1>, from <char2>, and from every character between <char3> and <char4> inclusive. So, most of the time, it also matches the \r and \n end-of-line characters. To avoid matching these line-break chars, just insert \r and \n inside the negative class, at any location after the ^, except inside a range :

[^<char1>\n<char2>\r<char3>-<char4>]

Cheers,

guy038

glossar

@Alan-Kilborn said:

@glossar said:

regex that locates a line that contains no tab?

There might be better ones, but this one seems to work:

^((?!\t).)*$

Hi @alan-kilborn,
Is it possible for you to modify this regex so that it skips blank lines, i.e. the ones containing no characters at all, just (if applicable, ^ and) \r\n? Currently the regex finds blank lines as well, since they, too, meet the “no-tab” criterion.

Thanks in advance!

guy038

Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones, and All,

I may be mistaken, but I think that the regex (?!.*\t)^.+, from my previous post, already meets your needs, doesn’t it ?

Cheers,

guy038

Alan Kilborn

@glossar said:

Is it possible for you to modify this regex so that it skips blank lines

So we should look at what the original means:

^((?!\t).)*$

It says (basically) to match zero or more occurrences (because of the use of *) of anything that is not TAB. Suppose we change it to match ONE or more occurrences of anything that is not TAB (by changing * to +). Because we then have to match at least ONE thing, empty/blank lines are no longer matched:

^((?!\t).)+$

Which is basically what @guy038 said, but I wanted to elaborate a bit!

guy038

Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones and All,

Fundamentally, Alan’s new solution and mine give the same correct results, i.e. they match any non-empty line which does not contain a tabulation character !

By the way, we both forgot to add the leading in-line modifier (?-s), which ensures that, even if you previously ticked the “. matches newline” option, the regex engine treats any . char as matching a single standard character only !

So, our two solutions should be :

Alan : (?-s)^((?!\t).)+$

Guy : (?-s)(?!.*\t)^.+

However, note that the logic underlying these 2 regular expressions is a bit different :

In Alan’s regex, starting from the beginning of the line ( ^ ), the regex engine matches one or more standard characters, up to the end of the line ( $ ), ONLY IF each standard character encountered is not a tabulation character, due to the negative look-ahead (?!\t) located right before the . regex character.
In Guy’s regex, the regex engine matches all the standard characters of a line ( ^.+ ) ONLY IF, at the beginning of the line, it cannot find a tabulation character further on, at any position of the current line, due to the negative look-ahead (?!.*\t).
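The two formulations can be compared side by side with a small Python sketch. The usual caveats apply: Python’s re is assumed to agree with the Boost engine for these patterns, re.MULTILINE replaces Notepad++’s per-line anchoring, the (?-s) prefix is left out because . excludes \n by default in Python, and the four-line sample is made up. It says nothing about speed in Notepad++.

```python
import re

# Hypothetical sample: plain line, empty line, line with a tab, plain line.
text = "plain line\n\nwith\ttab\nanother plain line"

# Alan's version: every character of the line is checked by (?!\t).
alan = re.compile(r"^((?!\t).)+$", re.MULTILINE)

# Guy's version: one look-ahead scans the rest of the line for a tab.
guy = re.compile(r"(?!.*\t)^.+", re.MULTILINE)

alan_lines = [m.group(0) for m in alan.finditer(text)]
guy_lines = [m.group(0) for m in guy.finditer(text)]

print(alan_lines)
print(guy_lines)
```

On this sample both patterns skip the empty line and the line containing a tab.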

I did a test with a file of 2,500,000 lines, half of which contained 1 tabulation character and, clearly, Alan’s version is faster ! ( 2 min 15 s for Alan’s version versus 5 min for mine )

BR

guy038