How to find two or more non-consecutive tabs in a line?
-
-
Hi Alan,
Thank you but sadly it won’t work. It finds only two tabs, each in every other line, at least in my file, whereas it should locate a line that contain 2 or more tabs in it. (e.g.: blah [tab] blah blah more blah [tab] (blah blah [tab] blah)… ).
-
This raises maybe an interesting discussion: When are characters inside a character class notation, which means inside
[
and]
non literal? On first crafting the above regex, I thought, this isn’t going to work, it is going to look for\
ort
separately, not “tab” characters. But lo and behold, it does look for tabs. What are the rules for this?I know that
[\R]
will match\
orR
and not match\R
but that may be a special case and invalid because it can match possibly 2 characters, not just one.But there must be some general rules on what is special inside
[
…]
and[^
…]
… besides the “specialness” of-
when used as a ranger, example[a-z]
and the special way needed to get]
to be included in the set… -
@glossar said:
Thank you but sadly it won’t work.
Hmmm. Works for me with a Mark operation shown here:
I copied your text from this thread, did a regex replace on it for
\[tab\]
with\t
…and then applied the regex specified earlier to redmark the text. -
I can confirm that it finds a line that contains two tabs but if a line doesn’t meet the criteria, it looks further (greedy, you say? :) )and hence finds the following line together, which in the end looks like “every other line”. But I’m pretty sure it skips the \r\n.of a line if this line contains only one tab. Can you limit the regex, so it should look for and within only one line (by line, I mean anything between ^ and \r\n).
-
-
maybe a screenshot helps:
-
I can’t see the screenshots above - neither on this page nor when clicking on it. All I see is a broken-image-file-icon and “Imgur” next to it.
-
Okay, one more try. It could be as simple(!) as changing it to this:
(?-s)^.*?\t(?!\t).+?\t.*?$
:)
-
Thanks, that now works like a charm! :)
While we are at it, how about building another regex that locates a line that contains no tab? :)
-
@glossar said:
regex that locates a line that contains no tab?
There might be better ones, but this one seems to work:
^((?!\t).)*$
-
Hi, @glossar, @alan-kilborn, and All,
A second solution could be :
SEARCH
(?-s)(?=.*\t.*\t).+
A third solution could be, using the Mark dialog, w/o checking the
Bookmark line
option :MARK
(?-s)\t.*\t
Note, @alan-kilborn, that your regex should be changed into :
SEARCH
(?-s)^.*?\t[^\t\r\n]+\t.*?$
To avoid wrong multi-lines match. However, this solution still misses some possibilities !
You may test these
3
regexes, above, against the sample test, below :---------------------------- 1 TEXT block without TAB -----> KO <----- ( because NO tabulation ) abcd ---------------------------- 1 TAB without TEXT ----------> KO <----- ( because ONE tabulation ONLY ) ---------------------------- 2 TABs without TEXT ----------- OK ------ ---------------------------- 3 TABs without TEXT ----------- OK ------ ---------------------------- 1 TAB + 1 TEXT block --------> KO <----- ( because ONE tabulation ONLY ) abcd abcd ---------------------------- 1 TAB + 2 TEXT blocks -------> KO <----- ( because ONE tabulation ONLY ) abcd efgh ---------------------------- 2 TABs + 1 TEXT block --------- OK ------ efgh efgh efgh ---------------------------- 2 TABs + 2 TEXT blocks -------- OK ------ abcd efgh abcd ijkm efgh ijkl ---------------------------- 2 TABs + 3 TEXT blocks -------- OK ------ abcd efgh ijkl ---------------------------- 3 TABs + 1 Text block --------- OK ------ abcd efgh ijkl mnop ---------------------------- 3 TABs + 2 Text blocks -------- OK ------ abcd efgh abcd ijkl abcd monp efgh ijkl efgh monp ijkl monp ---------------------------- 3 TABs + 3 Text blocks -------- OK ------ abcd efgh ijkm efgh ijkl mnop ---------------------------- 3 TABs + 4 Text blocks -------- OK ------ abcd efgh ijkl mnop
Best Regards,
guy038
-
@glossar , @Alan-Kilborn , @Meta-Chuh , et alia,
Unfortunately, the
(?-s)
only changes the behavior of.
with respect to newlines; it doesn’t change character classes, so[^\t]+
means “one or more characters that don’t match a TAB, even if those characters are newlines”. By changing the full regex to(?-s)^.*?\t[^\t\r\n]+\t.*?$
, I was able to get it to skip lines like @Meta-Chuh 's example ofx
instead of the TAB. The class[^\t\r\n]
means “match one or more characters that isn’t any ofTAB
,CR
(carriage return), orLF
(line-feed)”I am not as regex expert as @guy038, so I may be misinterpreting; however, the boost docs say (emphasis mine)
Escaped Characters
All the escape sequences that match a single character, or a single character class are permitted within a character class definition. For example [[]] would match either of [ or ] while [\W\d] would match any character that is either a “digit”, or is not a “word” character.Since
\R
doesn’t match a “single character” (it can match a single character ora pair of charactersmore than one character, see boost’s “Matching Line Endings” section), it doesn’t fall within the allowable escape sequences permitted in the character class.edit: while typing this up, four more posts were made. Hopefully, I still added to the discussion.
edit 2: clarify the\R
-
@PeterJones said:
Hopefully, I still added to the discussion.
You did, and you helped make it an “interesting discussion”. thanks.
-
Alan, the second one that finds no-tab :), works, thank you.
Guy and Peter - Thank you for stepping-in! :) Much appreciated!
Have a nice day!
-
Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones, and All,
Here is an other solution, which looks for all contents of lines containing, at least ,
2
tabulation chars ( can’t do shorter ! ) :SEARCH
(?-s).*\t.*\t.*
Just for information, an other formulation of the Alan’s regex, which searches lines which do not contain any tabulation char, could be :
SEARCH
(?!.*\t)^.+
Negative character classes are often misunderstood, Indeed ! When you’re using, for instance, the negative class character below :
[^<char1><char2><char3>-<char4>]
It will match ANY Unicode character which is DIFFERENT from, either
<char1>
,<char2>
and all characters between<char3>
and<char4>
included. So, most of the time, it probably matches the\r
and\n
END of Line characters. To avoid matching these line-break chars, just insert\r
and\n
, inside the negative class, at any location, after the^
, except in ranges :[^<char1>\n<char2>\t<char3>-<char4>]
Cheers,
guy038
-
@Alan-Kilborn said:
@glossar said:
regex that locates a line that contains no tab?
There might be better ones, but this one seems to work:
^((?!\t).)*$
Hi @alan-kilborn,
Is it possible for you to modify this regex so shat it should skip blank lines, i.e. the ones containing no characters at all, just (if applicable, ^ and) \r\n. Currently the regex finds blank lines as well since they , too, meet the criteria “no-tab”.Thanks in advance!
-
Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones, and All,
I may be mistaken but I think that the regex
(?!.*\t)^.+
, of my previous post, just meet your needs, doesn’t it ?Cheers,
guy038
-
@glossar said:
Is it possible for you to modify this regex so shat it should skip blank lines
So we should look at what the original means:
^((?!\t).)*$
It says (basically) to match zero or more occurrences (because of the use of
*
) of anything that is not TAB. If we change it to match ONE or more occurrences (we’re going to change*
to+
to do this) of anything that is not TAB). Because we have to match at least ONE thing, empty/blank lines are no longer matched:^((?!\t).)+$
Which is basically what @guy038 said, but I wanted to elaborate a bit!
-
Hi, @glossar, @alan-kilborn, @meta-chuh, @peterjones and All,
Fundamentally, the new Alan’s solution and mine give the same right results, i.e. to match any non-empty line which does not contain a tabulation character !
By the way, we, both, forget to add the leading in-line-modifier
(?-s)
to be sure that, even you previously ticked the. matches newline
option, the regex engine will suppose that any.
char does match a single standard character, only !So, our two solutions should be :
Alan :
(?-s)^((?!\t).)+$
Guy :
(?-s)(?!.*\t)^.+
However, note that the logic, underlying these
2
regular expressions, is a bit different :-
In the Alan’s regex, from beginning of line (
^
), the regex engine matches for one or more standard characters, till the end of line ($
), ONLY IF each standard character encountered is not a tabulation character, due to the negative look-ahead(?!\t)
, located right before the.
regex character -
In the Guy’s regex, the regex engine matches for all the standard characters of a line, (
^.+
), ONLY IF ( implicitly at beginning of line ) it cannot find a tabulation character further on, at any position of current line, due to the negative look-ahead(?!.*\t)
I did a test with a file of
2,500,000
lines, half of which contained 1 tabulation character and, clearly, the Alan’s version is faster ! (2 mn 15 s
for Alan instead of5mn
for my version )BR
guy038
-