Regex to find any lines that do NOT have a specific number of a character
-
ok, I hope I finally understood this sentence
Match pattern independently of surrounding patterns, and don’t backtrack into it. Failure to match will cause the whole subject not to match.
That means my first attempt, the one I was questioning, did backtrack,
which makes your regex the one that I, and hopefully @Mark-Yorkovich, were looking for.
@Alan-Kilborn, yes, I guess you are right.
@Mark-Yorkovich, does this work on your data with the procedure described by @dinkumoil?
-
@Ekopalypse said:
ok, I hope I finally understood this sentence
I got the following hint at https://regex101.com/ when trying your regex:
A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you’re not interested in the data.
Then I read https://www.regular-expressions.info/atomic.html
Together, these led me to give the non-capturing group a try.
-
@dinkumoil
I followed your instructions, but I’m not getting any matches.
-
Make sure your caret is on the first line if you have not checked “Wrap around”.
-
@Ekopalypse
Yup, sure is. No matches - double-checked my settings.
To reiterate: my file mostly has 9 pipes/10 cols per line, but some lines have fewer and a few have more than that, and I need to find those.
-
Using @Ekopalypse’s test data, I generated a file of 146545 lines and did what I suggested above - I got the expected result.
Be sure that the pipe character in your file is really a pipe character (code 124). There is another one (code 166 in Windows-1252 character encoding) which looks nearly identical:
Pipe character:
|
The other one:
¦
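For anyone who wants to check this outside the editor, here is a minimal Python sketch that counts both characters in a piece of text. The function name `count_bar_characters` and the sample string are made up for illustration; only the two character codes (124 and 166) come from the post above.

```python
def count_bar_characters(text: str) -> dict:
    """Count real pipes (code 124) and broken bars (code 166) in text."""
    return {
        "pipe": text.count(chr(124)),        # | VERTICAL LINE
        "broken_bar": text.count(chr(166)),  # ¦ BROKEN BAR
    }


# Hypothetical sample: two real pipes and one look-alike.
sample = "A|B\u00a6C|D"
counts = count_bar_characters(sample)
print(counts)
```

When checking a real file, it would have to be read with the matching encoding (for example `encoding="cp1252"`), otherwise code 166 may not decode to the broken bar.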
-
@dinkumoil said:
Using @Ekopalypse’s test data, I generated a file of 146545 lines and did what I suggested above - I got the expected result.
Be sure that the pipe character in your file is really a pipe character (code 124). There is another one (code 166 in Windows-1252 character encoding) which looks nearly identical:
Pipe character:
|
The other one:
¦
Yup - they’re pipes.
Here is a good sample of what I’m working with. Lines 1, 9, 10, 11, 16 thru 20 and 36, 37 are single-line records with 9 pipes and 10 columns. Lines 2 thru 8 are one record and together have 9 pipes/10 cols. Similarly, lines 12 through 15 are a single record, and lines 21 thru 35 are a single record.
LOREM120|8 |3 |1 |1 |0 |0 |||INST020
LOREM120|9 |1 |1 |0 |0 |0 ||Lorem Ipsum Dolor]
LOREM: BS/BPLOREM IPSUM:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.|
IPSUM16|1 |1 |1 |1 |0 |0 |||3001479
IPSUM16|1 |2 |1 |1 |0 |0 |||3003077
IPSUM16|11 |0 |1 |0 |0 |0 |||
IPSUM16|13 |0 |1 |0 |0 |0 ||Lorem ipsum dolor sit amet
consectetur adipiscing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
DOLOR53 1 1 1 2 0 0 3003084
DOLOR53 2 3 1 1 0 0 Lorem ipsum
DOLOR53 2 4 1 1 0 0 Lorem ipsum
LOREM56 8 1 1 1 0 0 Lorem ipsum
LOREM56 8 2 1 1 0 0 Lorem ipsum
LOREM56 9 1 1 0 0 0 Lorem ipsum dolor sit amet consectetur adipiscing elit
consectetur adipiscing elit
consectetur adipiscing elit
consectetur adipiscing elit
Lorem ipsum dolor sit amet
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Lorem ipsum dolor sit amet
consectetur adipiscing elit
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.|
DOLOR19|1 |2 |1 |1 |0 |0 |||3003124
LOREM01|1 |1 |1 |1 |1 |0 |||3003024
Your suggested regex ^(?>.+?\|){9}(?!.+?\|) isn’t finding any matches on that.
-
That’s because it was assumed that all columns contain data.
Find:
^(?>.*?\|){9}(?!.*?\|)
does not make that assumption.
-
@Ekopalypse said:
@Mark-Yorkovich
because it was assumed that all columns contain data
My bad. I didn’t give you all of the details of what I’m working with.
find:
^(?>.*?\|){9}(?!.*?\|)
does not make that assumption.
This works.
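To illustrate why empty columns matter, here is a small Python sketch. It is not the Notepad++ regex itself: Python’s `re` module only supports atomic groups from version 3.11, so the sketch uses a negated character class `[^|\n]` as a rough stand-in. The pattern names and the second sample line are invented; the other two lines are taken from the sample data above.

```python
import re

# Stand-in for the ".+?" variant: every column must hold at least one character.
NEEDS_DATA = re.compile(r"^(?:[^|\n]+\|){9}[^|\n]*$")
# Stand-in for the ".*?" variant: columns may be empty.
ALLOWS_EMPTY = re.compile(r"^(?:[^|\n]*\|){9}[^|\n]*$")

sample = [
    "LOREM120|8 |3 |1 |1 |0 |0 |||INST020",                    # 9 pipes, two empty columns
    "a|b|c|d|e|f|g|h|i|j",                                     # 9 pipes, no empty columns (made up)
    "IPSUM16|13 |0 |1 |0 |0 |0 ||Lorem ipsum dolor sit amet",  # only 8 pipes
]

# One (needs_data, allows_empty) pair per sample line.
results = [
    (bool(NEEDS_DATA.match(line)), bool(ALLOWS_EMPTY.match(line)))
    for line in sample
]
for line, pair in zip(sample, results):
    print(pair, line)
```

Only the second pattern accepts the first line, because the `|||` run contains columns with nothing in them.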
So at this point what I’d need to do, ideally, is to do a Find/Replace, finding all of the new line/line feed characters - only in those now-bookmarked lines - and replace them with some other character (spaces, dummy chars, whatever) to get each of those records to be on one line. Can I do a find/replace on just the bookmarked lines? Or perhaps, instead of the multi-step approach, is there a way to do this on the Replace tab, entering a regex in the Find what box that finds those lines and just replace the new line characters with dummy characters in one step?
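Outside Notepad++, one way to sketch the joining step is to accumulate physical lines until a record has its 9 pipes. This is only an illustration in Python, under the assumption that a record ends on the line containing its 9th pipe (as the multi-line records in the sample appear to do); `join_records`, `pipes_per_record` and `filler` are hypothetical names.

```python
def join_records(lines, pipes_per_record=9, filler=" "):
    """Merge physical lines into records of `pipes_per_record` pipes each.

    Assumption: a record is complete on the line where its last pipe appears.
    """
    records = []
    buffer = []
    pipe_count = 0
    for line in lines:
        buffer.append(line)
        pipe_count += line.count("|")
        if pipe_count >= pipes_per_record:
            # Record complete: replace the line breaks with the filler.
            records.append(filler.join(buffer))
            buffer = []
            pipe_count = 0
    if buffer:
        # Leftover lines that never reached the pipe count are kept as one record.
        records.append(filler.join(buffer))
    return records


# Made-up data: one single-line record, one record split over two lines.
lines = [
    "a|b|c|d|e|f|g|h|i|j",
    "k|l|m|n|o|p|q|r|first part",
    "second part|t",
]
for record in join_records(lines):
    print(record)
```

If the last column of a record can itself continue onto later lines, this simple rule would cut the record too early and a different end-of-record test would be needed.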
-
@Mark-Yorkovich said:
Alan’s exp doesn’t match anything in my file
Well, if I copy and paste your “lorem ipsum” data (above) into a new tab and then run my regex (above) on it, I get lines with exactly 9 pipes red-marked, which I thought was the goal (or the inverse of the goal).
So…I really don’t know where the disconnect is…
-
@Mark-Yorkovich said:
…finding all of the new line/line feed characters - only in those now-bookmarked lines - and replace them with some other character (spaces, dummy chars, whatever) to get each of those records to be on one line
Didn’t we do all this the other day?
-
(.|){9}.
how about this?
-
I assume you meant
(.\|){9}.
This matches lines with 9 or more pipe delimiters.
-
@Ekopalypse said:
I assume you meant
(.\|){9}.
This matches 9 and more pipe delimited lines.
In fact, I mean…
(。\|){9}。*
but it can’t show correctly, and I don’t know how to post a screenshot.
-
@Allen-Bai said:
it can’t show correctly,
To quote my boilerplate:
This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips (single backtick marks around small snippets, and code-quoting for pasting multiple lines from your example data files), because otherwise the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, and will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash).
For images: upload the image to imgur, then embed it with the syntax ![](http://i.imgur.com/QTHZysa.png). (Please use imgur’s “direct link” with i.imgur.com as the hostname and the appropriate .png or .gif extension, rather than the “image” link, which really links to the HTML wrapper and will not embed in the forum.)
-
@Allen-Bai said:
in fact, I mean
(。*\|){9}。*
Then why not put it in tick marks? Both the help I linked to and the “how to use markdown code” post explained how to do that, as did my boilerplate text itself.
`(.*\|){9}.*`
renders as
(.*\|){9}.*
-
ah…
understand now, thank you so much
-
Hi, @mark-yorkovich, and All,
See my very late regex solution below:
https://community.notepad-plus-plus.org/post/47905
Best Regards,
guy038