Using latest version 7.5 64bit. How to remove duplicate Lines
-
Hello, @meta-chuh, @terry-r and All,
No problem ;-))
So, assuming your data, WITHOUT the comments and their leading spaces
0040067
0040134
    0040134 // indented 4 SPACES
    0040134 // indented 4 SPACES
	0040134 // indented 1 TAB
		0040134 // indented 2 TABS
0040271
0040271
The S/R regex :
SEARCH
(?-s)^\h*(.+\R)(\h*\1)+
REPLACE
\1
does the job and you get :
0040067
0040134
0040271
If we start, for instance, with the text below ( again, WITHOUT any comment ) :
0040067
        0040134 // indented 8 SPACES
    0040134 // indented 4 SPACES
	0040134 // indented 1 TAB
		0040134 // indented 2 TABS
0040271
0040271
We get the same result :
0040067
0040134
0040271
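For anyone who prefers scripting, here is a minimal Python sketch of the same S/R. Python's `re` module has no `\h` or `\R`, so `[ \t]` and `\n` stand in for them; this assumes LF line endings and only spaces or tabs as blanks. The function name is hypothetical.

```python
import re

# Stand-in for (?-s)^\h*(.+\R)(\h*\1)+ ; in Python "." never matches "\n" by default.
PATTERN = re.compile(r"^[ \t]*(.+\n)(?:[ \t]*\1)+", re.MULTILINE)

def collapse_indented_duplicates(text):
    """Collapse runs of identical consecutive lines, ignoring leading blanks."""
    return PATTERN.sub(r"\1", text)

sample = (
    "0040067\n"
    "0040134\n"
    "    0040134\n"
    "    0040134\n"
    "\t0040134\n"
    "\t\t0040134\n"
    "0040271\n"
    "0040271\n"
)
print(collapse_indented_duplicates(sample))
```

As in the regex above, the first line of each run is kept without its leading blanks.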
However, we should remember that the generic S/R
(?-s)^(.+\R)\1+
is supposed to be run on pre-sorted data ! So, @meta-chuh, if we sort your data, we obtain :
		0040134 // indented 2 TABS
	0040134 // indented 1 TAB
    0040134 // indented 4 SPACES
    0040134 // indented 4 SPACES
0040067
0040134
0040271
0040271
Now, applying my regex S/R against this sorted text, ( still WITHOUT the comments ), would result in :
0040134
0040067
0040134
0040271
Unfortunately, we see that there still are two occurrences of the 0040134 number ! So, finally, the best would be :
-
First, get rid of all leading blank characters, with the regex : SEARCH
^\h+
and REPLACE EMPTY
-
Perform an ascending alphabetic sort
-
Run the generic S/R :
SEARCH
(?-s)^(.+\R)\1+
REPLACE
\1
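The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: LF line endings, blanks limited to spaces and tabs, and a plain code-point sort standing in for the Notepad++ ascending sort. The function name is hypothetical.

```python
import re

def strip_sort_dedupe(text):
    # Step 1: remove leading blanks (stand-in for ^\h+ replaced with nothing)
    stripped = re.sub(r"^[ \t]+", "", text, flags=re.MULTILINE)
    # Step 2: ascending sort of the lines
    lines = sorted(stripped.splitlines())
    joined = "\n".join(lines) + "\n"
    # Step 3: generic S/R (?-s)^(.+\R)\1+ -> \1
    return re.sub(r"^(.+\n)\1+", r"\1", joined, flags=re.MULTILINE)

sample = (
    "0040067\n0040134\n    0040134\n    0040134\n"
    "\t0040134\n\t\t0040134\n0040271\n0040271\n"
)
print(strip_sort_dedupe(sample))
```

Because the blanks are removed before sorting, all copies of a number end up adjacent, which is what the generic S/R needs.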
Cheers,
guy038
-
-
@Matt-Czugała TextFX 64 bit version is now available as a direct download from developer’s site.
-
It appears that an upcoming release of Notepad++ is gonna have a Remove Duplicate Lines menu command, but it appears crippled (or maybe just poorly named) as it will only remove duplicate lines that are on lines right next to each other. Okay, so that’s a nice functionality, but it isn’t gonna be (all of) what people want/expect in such a command… And I suspect people trying to help out with support will see questions about it, a lot.
-
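To make the difference concrete, here is a small Python sketch contrasting "remove duplicates that are right next to each other" with "remove every duplicate, wherever it is". It is not how Notepad++ implements its command; both function names are hypothetical.

```python
from itertools import groupby

def remove_adjacent_duplicates(lines):
    # groupby() groups consecutive equal items, so only neighbours collapse
    return [key for key, _group in groupby(lines)]

def remove_all_duplicates(lines):
    # dict keys keep insertion order (Python 3.7+), so the first copy survives
    return list(dict.fromkeys(lines))

lines = ["a", "b", "a", "a"]
print(remove_adjacent_duplicates(lines))
print(remove_all_duplicates(lines))
```

On unsorted data the two give different answers, which is probably where the support questions would come from.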
i hope that @Matt-Czugała will return to see this in 2020 or so 😉
reader’s note: it’s not the original textfx developer that compiled textfx to x64.
quote from @chcg in a recent post:
If you like experiments, you might want to test
https://github.com/HQJaTu/NPPTextFX/blob/VS2017-x64/bin/x64/NppTextFX.dll
for x64.
here’s HQJaTu’s github page for textfx x64 in case anyone wants to have a look around (entry point: x64 dll binary download page): https://github.com/HQJaTu/NPPTextFX/tree/VS2017-x64/bin/x64
-
@guy038 said:
0040067
0040134 // indented 8 SPACES
0040134 // indented 4 SPACES
0040134 // indented 1 TAB
0040134 // indented 2 TABS
Ok, let’s say that I have this scenario.
0040134 DRAWING TITLE 1
0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
0040134 DRAWING TITLE 1
0040135 DRAWING TITLE 2
0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
Trying to figure out a Regex that will only match the number and ignore the title on the same line, but delete the line below with the slightly modified title, and where it still retains the first line.
0040134 DRAWING TITLE 1
0040135 DRAWING TITLE 2
-
It would have been nice if you’d included what regex you tried. Showing your effort lets the helpers know that you’re willing to put in effort – plus it makes the helpers’ jobs easier, since they know what’s already been tried and failed, and can get a glimpse as to what knowledge you already have in the domain.
Assuming an expanded data set,
0040134 DRAWING TITLE 1
0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
0040134 DRAWING TITLE 1
0040999 DRAWING TITLE 3
0040999 SUPER MODIFIED TITLE
0040999 THIRD TITLE FOR SAME NUMBER
0040135 DRAWING TITLE 2
0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
… that you want to convert to
0040134 DRAWING TITLE 1
0040999 DRAWING TITLE 3
0040135 DRAWING TITLE 2
Note: Assumptions
- no whitespace before the number at the beginning of the line
- any “near duplicates” would be on line(s) immediately following, and you wanted to delete them even when there’s more than one duplicate, keeping only the first.
- “slightly modified” meant that anything could come on the line after the number and it would still match as long as the number was the same; “slightly modified” is too ambiguous for a regular expression. (If you really want to match anything that has a heuristic to determine the “distance” between two strings, like the Levenshtein distance, you will have to use a programming language).
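As an illustration of the "use a programming language" remark, here is a minimal Python sketch of the Levenshtein distance (classic dynamic-programming version). The helper `is_near_duplicate` and its threshold of 3 edits are purely hypothetical; what counts as "slightly modified" is still a decision for the person who owns the data.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_near_duplicate(title_a, title_b, max_distance=3):
    # max_distance is an arbitrary, hypothetical threshold
    return levenshtein(title_a, title_b) <= max_distance

print(levenshtein("DRAWING TITLE 1", "DRAWING TITLE 1A"))
```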
I used
- Search Mode = regular expression
- Find =
(?-s)^(\d+\b)(.*\R)(\1.*(?:\Z|\R))+
- Replace =
\1\2
Quick explanation:
(?-s) = turn off .-matches-newline
^ = the match starts at the beginning of the line
(\d+\b) = match one or more digits (ending in a “boundary”, which is a zero-width transition between word characters and non-word characters) and store in the first numbered group, which can be referenced later as \1 (or $1)
(.*\R) = match any characters (.*) coming after the number, through the EOL sequence (\R) for that line, and store in group \2
(\1.*(?:\Z|\R))+ = complicated.
- (...)+ = match one or more lines that meet the condition inside the parens; it will store it in a group, though we aren’t using this group later
- \1 = it will start with the same number as was matched on the first line
- .*(...) = followed by zero or more characters, followed by a sequence defined inside the parens
- (?:...) = this inner group won’t be saved to a numbered group
- ...|... = the left side or the right side will match
- \Z = match the end of the document (if the last line doesn’t have a newline sequence, we still want it to match)
- or \R = match the EOL newline sequence for that row
And replace with \1\2, which means the contents of the first two parenthesis-groups. In other words, the number plus the remaining contents of the first line with that number.
If this isn’t quite right, you will have to give a more-complete example, which shows instances which should be changed and which shouldn’t, including anything that I got wrong above. You will also need to try to fix the regex yourself, and show us the modified regex you tried, and why you tried it (what you thought it would do), and show us the results that it gets you, compared to the results you wanted. See FYI below for more info on how to format your example text so it isn’t lost in translation, and where to go for regex documentation.
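For comparison, the same find/replace can be tried outside Notepad++. The following is a rough Python equivalent, assuming LF line endings; `\R` becomes `\n`, and the function name is hypothetical.

```python
import re

# Stand-in for (?-s)^(\d+\b)(.*\R)(\1.*(?:\Z|\R))+ with replacement \1\2
PATTERN = re.compile(r"^(\d+\b)(.*\n)(?:\1.*(?:\n|\Z))+", re.MULTILINE)

def keep_first_per_number(text):
    """Keep only the first line of each run of lines starting with the same number."""
    return PATTERN.sub(r"\1\2", text)

sample = "\n".join([
    "0040134 DRAWING TITLE 1",
    "0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)",
    "0040134 DRAWING TITLE 1",
    "0040999 DRAWING TITLE 3",
    "0040999 SUPER MODIFIED TITLE",
    "0040999 THIRD TITLE FOR SAME NUMBER",
    "0040135 DRAWING TITLE 2",
    "0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)",  # no final newline
])
print(keep_first_per_number(sample))
```

The last line of the sample deliberately has no newline, to exercise the `\Z` branch.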
-----
FYI: This forum is formatted using Markdown, with a help link buried on the little grey ? in the COMPOSE window/pane when writing your post. For more about how to use Markdown in this forum, please see @Scott-Sumner’s post in the “how to markdown code on this forum” topic, and my updates near the end. It is very important that you use these formatting tips – using single backtick marks around small snippets, and using code-quoting for pasting multiple lines from your example data files – because otherwise, the forum will change normal quotes ("") to curly “smart” quotes (“”), will change hyphens to dashes, will sometimes hide asterisks (or if your text is c:\folder\*.txt, it will show up as c:\folder*.txt, missing the backslash). If you want to clearly communicate your text data to us, you need to properly format it.
If you have further search-and-replace (“matching”, “marking”, “bookmarking”, regular expression, “regex”) needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting, see the paragraph above.
Please note that for all regex and related queries, it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.
-
there i have a plugin submitted for plugin admin 32bit
Remove_dup_lines
-
@gurikbal-singh
it says “all checks have failed” if you go to your link https://github.com/notepad-plus-plus/nppPluginList/pull/59
what’s the difference to the built in Add “Remove Duplicate Lines” feature seen at following commit ?
https://github.com/notepad-plus-plus/notepad-plus-plus/commit/51f10bdba56a415d42eb829b27a08955cb7db0dd
-
-
Hello @scott-fredrick-smith, @peterjones and All,
Peter, instead of considering numbers starting lines, I just supposed that we need to delete all the consecutive lines which share the same N first characters !
So, this leads to the following regex S/R, where number 7 corresponds to the number of first characters ( digits ) which must occur on further consecutive lines :
SEARCH
(?-s)^((.{7}).*\R)(?:\2.*\R?)+
REPLACE
\1
=> I got the same results as yours, even if the very last line does not end with a line-break ;-))
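Here is a hedged Python sketch of that S/R with the prefix length as a parameter. It assumes LF line endings (`\n` in place of `\R`); the function name and the `n` parameter are hypothetical.

```python
import re

def keep_first_per_prefix(text, n=7):
    """Within each run of consecutive lines sharing their first n characters,
    keep only the first line."""
    # Stand-in for (?-s)^((.{7}).*\R)(?:\2.*\R?)+ with replacement \1
    pattern = re.compile(rf"^((.{{{n}}}).*\n)(?:\2.*\n?)+", re.MULTILINE)
    return pattern.sub(r"\1", text)

sample = "\n".join([
    "0040134 DRAWING TITLE 1",
    "0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)",
    "0040134 DRAWING TITLE 1",
    "0040999 DRAWING TITLE 3",
    "0040999 SUPER MODIFIED TITLE",
    "0040999 THIRD TITLE FOR SAME NUMBER",
    "0040135 DRAWING TITLE 2",
    "0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)",  # no final newline
])
print(keep_first_per_prefix(sample))
```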
Now, with our regex S/R, we may get some weird results, when applying against these sample texts, below :
A) A slightly modified title is the first line of block 0040134
0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
0040134 DRAWING TITLE 1
0040134 DRAWING TITLE 1
0040999 DRAWING TITLE 3
0040999 SUPER MODIFIED TITLE
0040999 THIRD TITLE FOR SAME NUMBER
0040135 DRAWING TITLE 2
0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
B) A very different title, after number, is the first line of block 0040999
0040134 DRAWING TITLE 1
0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
0040134 DRAWING TITLE 1
0040999 SUPER MODIFIED TITLE
0040999 DRAWING TITLE 3
0040999 THIRD TITLE FOR SAME NUMBER
0040135 DRAWING TITLE 2
0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
Luckily, if we do a “pre” sort operation, cases A and B, after replacement, do give us the expected results :
0040134 DRAWING TITLE 1
0040135 DRAWING TITLE 2
0040999 DRAWING TITLE 3
However the case below, with or without a preliminary sort, will fail !
C) A line, presently not the first of its block 0040999, would become the first one, after a sort
0040134 DRAWING TITLE 1
0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)
0040134 DRAWING TITLE 1
0040999 SUPER MODIFIED TITLE
0040999 DRAWING TITLE 3
0040999 3rd TITLE FOR SAME NUMBER
0040135 DRAWING TITLE 2
0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)
After the sort and the S/R, we would get :
0040134 DRAWING TITLE 1
0040135 DRAWING TITLE 2
0040999 3rd TITLE FOR SAME NUMBER
So, as usual, the correct answer depends on OP’s needs. As Peter said, we just need additional data !
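If the need turns out to be "keep the first line seen for each number, in the original order", a script can do it without any sort. This is a different technique from the regex S/R above (a set of already-seen keys), shown as a minimal Python sketch; the function name and the `key_length` parameter are hypothetical.

```python
def keep_first_seen(lines, key_length=7):
    """Keep the first line for each distinct prefix, preserving input order.
    Duplicates do not need to be adjacent."""
    seen = set()
    kept = []
    for line in lines:
        key = line[:key_length]
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

case_c = [
    "0040134 DRAWING TITLE 1",
    "0040134 DRAWING TITLE 1A - (SLIGHTLY MODIFIED TITLE)",
    "0040134 DRAWING TITLE 1",
    "0040999 SUPER MODIFIED TITLE",
    "0040999 DRAWING TITLE 3",
    "0040999 3rd TITLE FOR SAME NUMBER",
    "0040135 DRAWING TITLE 2",
    "0040135 DRAWING TITLE 2A - (SLIGHTLY MODIFIED TITLE)",
]
print("\n".join(keep_first_seen(case_c)))
```

Whether "first seen" is really the line to keep is, again, something only the OP can say.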
Best Regards,
guy038
-
hi guy038
can i ask you in which field did you graduate?
-
Hello, @rvw-mmrs,
OMG, it has been a long time since I finished my graduate education and got a degree in radio-electricity, electronics and computer science !
It is my turn to ask you why such a question and how it relates to the current discussion !?
Best Regards,
guy038
-
Maybe he wants to offer you a job…doing REGEX full time!! :)
-
I decided after your post yesterday to share what I have been working on. Thank you for your response and your detailed regex works quite nicely! I made a few tweaks and have shared them below.
I want to thank everyone who has contributed and responded to the questions I asked earlier in this forum, and I am sure this post will spur further discussion and collaboration.
For some time now, I have been trying to create an Indented Drawing Tree from the Multi-Level Bill of Material Export from PTC Windchill. PTC removed their drawing tree capability in their last major Windchill update, so it left me with no way to see a high level view of the product.
Drawings describe parts, so it is important from a program management perspective to be able to “see” the entire product structure in an indented view.
This is a work in progress, so I would like to accomplish all of the steps below with one hot key. I may try to do that with AutoHotKey, since it supports regular expressions and possibly run that script in Notepad++.
Any suggestions are welcome as to how to combine all of these steps into one Regex, or run it in one AutoHotKey script. The result of this will benefit many people who have the same or similar requirements.
I will do my best here to format this post with the correct Markdown. Please forgive me if it doesn’t meet all the Forum’s Markdown requirements, since I am new here.
---------------------------------------------------------------------------
Objective: Produce an Indented Drawing Tree that is extracted from a Multi-Level Bill of Material.
Multi-Level BOM is extracted from PTC Windchill and exported to .XLSX format.
The output from PTC Windchill for a Multi-Level BOM looks like this.
Watered down example, nothing proprietary here. Does not represent a real product
0 AGH111900-1 GENERATOR
1 VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL
2 VA111200P1 GENERATOR ASSEMBLY - HOUSING
2 100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN
2 200-000-111-112 CONNECTOR
2 200-000-112-004 CONNECTOR CONTACT
2 A50GB0013-1 TAPE, PRESSURE SENSITIVE ADHESIVE- POLYIMIDE
2 AS3236-06 BOLT, MACHINE - DOUBLE HEX HEAD
2 MIL-PRF-7808 PERF.SPEC,LUBRICATING OIL,GR3
2 MS16996-10 SCREW, SOCKET HEAD
2 VA112719P1 GEAR RETAINER
2 VA112719P2 GEAR RETAINER
2 VA112799P2 GROMMET - T2
2 VA112799P3 GROMMET
2 VA112817G1 PLATE - IDENTIFICATION
3 VA112817P1 PLATE - IDENTIFICATION
3 3-011-001 INSULATING CMPD,ELE
3 G11257P6 PLATE-BLANK
3 K34706P1 THINNER, PAINT PRODUCTS
4 S-8 THINNER, PAINT PRODUCTS
3 TD111234 PROCESS SPECIFICATION - SERIALIZATION
2 VA113269P1 COVER - DISCONNECT ASSEMBLY
2 VA113289P1 DISC RETAINER - GEAR
2 VA113448P4 BUSHING - HEATER HOUSING
2 VA113453P1 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
2 VA113453P2 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
---------------------------------------------------------------------------
I am only looking for the Top Level AGH111900-1 and any VA Drawings, so I filtered this down using the grouping and search capability in MS Excel. Removed the parts that are not represented by drawings. I pasted this into Notepad++ from Excel. So this is where I have been starting with Regex’s in Notepad++. Regex suggestions on removing everything other than VA drawings in Notepad++ vs. MS Excel are welcome.
AGH111900-1 GENERATOR
VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL
VA111200P1 GENERATOR ASSEMBLY - HOUSING
VA112719P1 GEAR RETAINER
VA112719P2 GEAR RETAINER
VA112799P2 GROMMET - T2
VA112799P3 GROMMET
VA112817G1 PLATE - IDENTIFICATION
VA112817P1 PLATE - IDENTIFICATION
VA113269P1 COVER - DISCONNECT ASSEMBLY
VA113289P1 DISC RETAINER - GEAR
VA113448P4 BUSHING - HEATER HOUSING
VA113453P1 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453P2 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
---------------------------------------------------------------------------
Starting with the first VA part number, I removed the P’s & G’s and the one to three numbers that follow the VA number with a Notepad++ recorded macro.
Regex suggestions on how to remove the P’s, G’s, and the numbers in between with a tab at the end welcome!
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112799 GROMMET
VA112817 PLATE - IDENTIFICATION
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
---------------------------------------------------------------------------
I then ran a replace with this regex per @guy038 help, and adding the capturing group around the repeated group to capture all iterations.
I used
- Find =
^(.*)(?:\r?\n\1)+$
- Replace =
\1
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ match at line breaks; Numbered capture; Allow zero-length matches
- ^ = Assert position at the beginning of a line (at beginning of the string or after a line break character) (carriage return and line feed, form feed)
- (.*) = Match the regex below and capture its match into backreference number 1
  - .* = Match any single character
    - * = Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
- (?:\r?\n\1)+ = Match the regular expression below
  - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \r? = Match the carriage return character
    - ? = Between zero and one times, as many times as possible, giving back as needed (greedy)
  - \n = Match the line feed character
  - \1 = Match the same text that was most recently matched by capturing group number 1 (case sensitive; fail if the group did not participate in the match so far)
- $ = Assert position at the end of a line (at the end of the string or before a line break character) (carriage return and line feed, form feed)
- \1 = Insert the text that was last matched by capturing group number 1
Result: Removed the duplicate VA112719
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112799 GROMMET
VA112817 PLATE - IDENTIFICATION
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
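This step can be reproduced in Python almost unchanged, since `^(.*)(?:\r?\n\1)+$` uses only syntax that Python's `re` module also supports. A minimal sketch, assuming the dot does not match line breaks (Python's default) and using a shortened, unindented sample; the function name is hypothetical.

```python
import re

PATTERN = re.compile(r"^(.*)(?:\r?\n\1)+$", re.MULTILINE)

def drop_exact_consecutive_duplicates(text):
    """Replace each run of identical consecutive lines with a single copy."""
    return PATTERN.sub(r"\1", text)

sample = (
    "VA111200 GENERATOR ASSEMBLY - TOP LEVEL\n"
    "VA112719 GEAR RETAINER\n"
    "VA112719 GEAR RETAINER\n"
    "VA112799 GROMMET - T2\n"
    "VA112799 GROMMET\n"
)
print(drop_exact_consecutive_duplicates(sample))
```

Only the exact duplicate (VA112719) goes away; the two VA112799 lines differ in their titles, so they both stay.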
---------------------------------------------------------------------------
I then used @Terry-R recommendation to remove the duplicate indented lines with:
- Find =
(?-s)^(.+\R)\h{4}\1+
- Replace =
\1
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ match at line breaks; Numbered capture; Allow zero-length matches
- (?-s) = Use these options for the whole regular expression
  - - = (hyphen inverts the meaning of the letters that follow)
  - s = Dot doesn’t match line breaks
- ^ = Assert position at the beginning of a line (at beginning of the string or after a line break character) (carriage return and line feed, form feed)
- (.+\R) = Match the regex below and capture its match into backreference number 1
  - .+ = Match any single character that is NOT a line break character (line feed, carriage return, form feed)
    - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \R = Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed)
- \h{4} = Match a single character that is a “horizontal whitespace character” (tab or any space in the active code page)
  - {4} = Exactly 4 times
- \1+ = Match the same text that was most recently matched by capturing group number 1 (case sensitive; fail if the group did not participate in the match so far)
  - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
- \1 = Insert the text that was last matched by capturing group number 1
With this result: Removed the indented duplicate VA112817
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112799 GROMMET
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
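A rough Python equivalent of this indented-duplicate step, for testing outside Notepad++. Assumptions: LF line endings, `[ \t]` in place of `\h`, and a made-up three-line sample in which the duplicate is indented by exactly four spaces. The function name is hypothetical.

```python
import re

# Stand-in for (?-s)^(.+\R)\h{4}\1+ with replacement \1
PATTERN = re.compile(r"^(.+\n)[ \t]{4}\1+", re.MULTILINE)

def drop_indented_duplicate(text):
    """Remove a copy of a line that follows it with 4 extra leading blanks."""
    return PATTERN.sub(r"\1", text)

sample = (
    "VA112817 PLATE - IDENTIFICATION\n"
    "    VA112817 PLATE - IDENTIFICATION\n"
    "VA113269 COVER - DISCONNECT ASSEMBLY\n"
)
print(drop_indented_duplicate(sample))
```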
---------------------------------------------------------------------------
I am then running @peterjones’ regex, adding a capturing group around the repeated group to capture all iterations:
- Find =
(?-s)^(\s+\w+\b)(.*\R)((?:\1.*(?:\Z|\R))+)
- Replace =
\1\2
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ match at line breaks; Numbered capture; Allow zero-length matches
- (?-s) = Use these options for the whole regular expression
  - - = (hyphen inverts the meaning of the letters that follow)
  - s = Dot doesn’t match line breaks
- ^ = Assert position at the beginning of a line (at beginning of the string or after a line break character) (carriage return and line feed, form feed)
- (\d+\b) = Match the regex below and capture its match into backreference number 1
  - \d+ = Match a single character that is a “digit” (any symbol with a decimal value in the active code page)
    - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \b = Assert position at a word boundary (position preceded or followed—but not both—by a letter, digit, or underscore in the active code page)
- (.*\R) = Match the regex below and capture its match into backreference number 2
  - .* = Match any single character that is NOT a line break character (line feed, carriage return, form feed)
    - * = Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - \R = Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed)
- (?:\1.*(?:\Z|\R))+ = Match the regular expression below
  - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \1 = Match the same text that was most recently matched by capturing group number 1 (case sensitive; fail if the group did not participate in the match so far)
  - .* = Match any single character that is NOT a line break character (line feed, carriage return, form feed)
    - * = Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - (?:\Z|\R) = Match the regular expression below
    - \Z = Match this alternative (attempting the next alternative only if this one fails)
      - \Z = Assert position at the end of the string, or before any number of line breaks at the end of the string (carriage return and line feed, form feed)
    - \R = Or match this alternative (the entire group fails if this one fails to match)
      - \R = Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed)
- \1 = Insert the text that was last matched by capturing group number 1
- \2 = Insert the text that was last matched by capturing group number 2
Result: Removed the duplicate VA112799 with a different title
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
---------------------------------------------------------------------------
And finally, I am running this modification to @peterjones regex to find the dissimilar indented drawing with different titles:
- Find =
(?-s)^(\s+\w+\b)(.*\R)\h{4}((?:\1.*(?:\Z|\R))+)
- Replace =
\1\2
Options: Case sensitive; Exact spacing; Dot matches line breaks; ^$ match at line breaks; Numbered capture; Allow zero-length matches
- (?-s) = Use these options for the whole regular expression
  - - = (hyphen inverts the meaning of the letters that follow)
  - s = Dot doesn’t match line breaks
- ^ = Assert position at the beginning of a line (at beginning of the string or after a line break character) (carriage return and line feed, form feed)
- (\s+\w+\b) = Match the regex below and capture its match into backreference number 1
  - \s+ = Match a single character that is a “whitespace character” (any space in the active code page, tab, line feed, carriage return, vertical tab, form feed)
    - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \w+ = Match a single character that is a “word character” (letter, digit, or underscore in the active code page)
    - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
  - \b = Assert position at a word boundary (position preceded or followed—but not both—by a letter, digit, or underscore in the active code page)
- (.*\R) = Match the regex below and capture its match into backreference number 2
  - .* = Match any single character that is NOT a line break character (line feed, carriage return, form feed)
    - * = Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  - \R = Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed)
- \h{4} = Match a single character that is a “horizontal whitespace character” (tab or any space in the active code page)
  - {4} = Exactly 4 times
- ((?:\1.*(?:\Z|\R))+) = Match the regex below and capture its match into backreference number 3
  - (?:\1.*(?:\Z|\R))+ = Match the regular expression below
    - + = Between one and unlimited times, as many times as possible, giving back as needed (greedy)
    - \1 = Match the same text that was most recently matched by capturing group number 1 (case sensitive; fail if the group did not participate in the match so far)
    - .* = Match any single character that is NOT a line break character (line feed, carriage return, form feed)
      - * = Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
    - (?:\Z|\R) = Match the regular expression below
      - \Z = Match this alternative (attempting the next alternative only if this one fails)
        - \Z = Assert position at the end of the string, or before any number of line breaks at the end of the string (carriage return and line feed, form feed)
      - \R = Or match this alternative (the entire group fails if this one fails to match)
        - \R = Match a line break (carriage return and line feed pair, sole line feed, sole carriage return, vertical tab, form feed)
- \1 = Insert the text that was last matched by capturing group number 1
- \2 = Insert the text that was last matched by capturing group number 2
Result: Removed the VA111200 indented duplicate with a different title
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
Voilà! An Indented Drawing Tree from a Multi-Level BOM export out of PTC Windchill!
-
you can try one of these regexes:
SEARCH:
(?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z))
REPLACE BY:
(LEAVE EMPTY)
OR
SEARCH:
(?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)
REPLACE BY:
(LEAVE EMPTY)
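Unlike the earlier S/R, these two look ahead through the whole rest of the file, so duplicates need not be adjacent, and it is the last copy of each line that survives. A minimal Python sketch of the first one, assuming LF line endings; Python wants the dot-matches-newline switch scoped as `(?s:...)` rather than turned on mid-pattern. The function name is hypothetical.

```python
import re

# Stand-in for (?-s)^(.*)\R(?s)(?=.*^\1(?:\R|\z)) replaced with nothing
PATTERN = re.compile(r"^(.*)\n(?=(?s:.*)^\1(?:\n|\Z))", re.MULTILINE)

def drop_lines_repeated_later(text):
    """Delete every line that occurs again, identically, further down."""
    return PATTERN.sub("", text)

print(drop_lines_repeated_later("a\nb\na\nc\n"))
```

Each candidate line triggers a scan of the remaining text, so on very large files this is likely to be slow.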
-
It’s about 20h30, in France, and tomorrow, I have to be awake, around 5h30, for a ski-day in Meribel ( The “3-vallées” domain ! ). So, please, just wait until Sunday to give me time to study your loooooong reply ;-))
Cheers,
guy038
-
-
Hello, @scott-fredrick-smith, and All,
From your original text, below :
0 AGH111900-1 GENERATOR
1 VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL
2 VA111200P1 GENERATOR ASSEMBLY - HOUSING
2 100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN
2 200-000-111-112 CONNECTOR
2 200-000-112-004 CONNECTOR CONTACT
2 A50GB0013-1 TAPE, PRESSURE SENSITIVE ADHESIVE- POLYIMIDE
2 AS3236-06 BOLT, MACHINE - DOUBLE HEX HEAD
2 MIL-PRF-7808 PERF.SPEC,LUBRICATING OIL,GR3
2 MS16996-10 SCREW, SOCKET HEAD
2 VA112719P1 GEAR RETAINER
2 VA112719P2 GEAR RETAINER
2 VA112799P2 GROMMET - T2
2 VA112799P3 GROMMET
2 VA112817G1 PLATE - IDENTIFICATION
3 VA112817P1 PLATE - IDENTIFICATION
3 3-011-001 INSULATING CMPD,ELE
3 G11257P6 PLATE-BLANK
3 K34706P1 THINNER, PAINT PRODUCTS
4 S-8 THINNER, PAINT PRODUCTS
3 TD111234 PROCESS SPECIFICATION - SERIALIZATION
2 VA113269P1 COVER - DISCONNECT ASSEMBLY
2 VA113289P1 DISC RETAINER - GEAR
2 VA113448P4 BUSHING - HEATER HOUSING
2 VA113453P1 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
2 VA113453P2 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
Here is a regex S/R which :
-
Delete any line which does not contain VA Drawings and is different from the level 0 line
-
Delete all the P’s and G’s, followed by digits
SEARCH
(?-is)(?!.*VA\d+|^0|^\u)^.+\R|^\d\x20{3}|[GP]\d+
REPLACE
Leave EMPTY
So, after clicking, once, on the Replace All button or several times on the Replace button, you should get :
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112799 GROMMET
VA112817 PLATE - IDENTIFICATION
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
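Here is a hedged Python sketch of this first S/R, for anyone who wants to script it. Assumptions: LF line endings, `[A-Z]` standing in for `\u`, and every line starting with one level digit followed by exactly three spaces, which is what `^\d\x20{3}` expects; the real export may be spaced differently. The function name and the short sample are hypothetical.

```python
import re

# Stand-in for (?-is)(?!.*VA\d+|^0|^\u)^.+\R|^\d\x20{3}|[GP]\d+ replaced with nothing
PATTERN = re.compile(
    r"(?!.*VA\d+|^0|^[A-Z])^.+\n"  # whole line: no VA drawing and not the level 0 line
    r"|^\d {3}"                    # leading level digit plus three spaces
    r"|[GP]\d+",                   # G/P suffix followed by digits
    re.MULTILINE,
)

def keep_va_drawings(text):
    return PATTERN.sub("", text)

sample = (
    "0   AGH111900-1 GENERATOR\n"
    "1   VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL\n"
    "2   100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN\n"
    "2   VA112719P1 GEAR RETAINER\n"
    "3   G11257P6 PLATE-BLANK\n"
)
print(keep_va_drawings(sample))
```

Note that `[GP]\d+` is applied anywhere on the kept lines, so a title containing something like `P1` would also be trimmed.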
Nice, isn’t it ?
Now, I built another regex, which keeps the level number, at beginning of lines, and aligns all the VA Drawings lines
This could be important regarding further deleting of [pseudo] duplicate lines !
SEARCH
(?-is)(?!.*VA\d+|^0)^.+\R|^\d+\x20\K\x20+|VA\d+\K[GP]\d+
REPLACE
Leave EMPTY
This time, due to the \K syntax, in some locations of the regex, use the Replace All button, exclusively !
So, from your initial text, you should obtain, this time, the text below :
0 AGH111900-1 GENERATOR
1 VA111200 GENERATOR ASSEMBLY - TOP LEVEL
2 VA111200 GENERATOR ASSEMBLY - HOUSING
2 VA112719 GEAR RETAINER
2 VA112719 GEAR RETAINER
2 VA112799 GROMMET - T2
2 VA112799 GROMMET
2 VA112817 PLATE - IDENTIFICATION
3 VA112817 PLATE - IDENTIFICATION
2 VA113269 COVER - DISCONNECT ASSEMBLY
2 VA113289 DISC RETAINER - GEAR
2 VA113448 BUSHING - HEATER HOUSING
2 VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
2 VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
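Python's `re` module has no `\K`, but the same effect can be had by capturing the part to keep and putting it back. A minimal sketch under assumptions: LF line endings and a hypothetical sample where the level digit is followed by several spaces. The function name is hypothetical.

```python
import re

# Stand-in for (?-is)(?!.*VA\d+|^0)^.+\R|^\d+\x20\K\x20+|VA\d+\K[GP]\d+
PATTERN = re.compile(
    r"(?!.*VA\d+|^0)^.+\n"   # whole line to delete
    r"|(^\d+ ) +"            # group 1 = level + one space; extra spaces dropped
    r"|(VA\d+)[GP]\d+",      # group 2 = VA number; G/P suffix dropped
    re.MULTILINE,
)

def keep_levels_and_va(text):
    # Put back whichever "kept" group matched; a whole-line match yields ""
    return PATTERN.sub(lambda m: m.group(1) or m.group(2) or "", text)

sample = (
    "0   AGH111900-1 GENERATOR\n"
    "1   VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL\n"
    "2   100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN\n"
    "2   VA112719P1 GEAR RETAINER\n"
)
print(keep_levels_and_va(sample))
```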
Not bad, too !
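The `\K` trick ("forget what was matched so far") does not exist in Python's `re` module, but the same effect can be sketched with capture groups: the part that `\K` would keep is captured and written back by the replacement. The snippet below is a rough equivalent under that assumption, not the Notepad++ regex itself; `keep_levels` and `row` are hypothetical names, and the sample spacing is assumed.

```python
import re

# Each alternative mirrors one branch of the S/R above; groups replace \K.
PATTERN = re.compile(
    r"(?!.*VA\d+|^0)^.+\n"   # a whole line without a VA drawing (level 0 excepted)
    r"|^(\d+ ) +"            # level number + one space (kept), extra spaces (dropped)
    r"|(VA\d+)[GP]\d+",      # VA drawing (kept), G/P suffix (dropped)
    re.MULTILINE,
)


def keep_levels(text: str) -> str:
    """Write back only the captured (kept) part of each match."""
    return PATTERN.sub(lambda m: (m.group(1) or "") + (m.group(2) or ""), text)


def row(level: int, text: str) -> str:
    """Build one hypothetical export line (the real spacing is assumed)."""
    return f"{level}   {' ' * (4 * level)}{text}\n"


sample = "".join([
    row(0, "AGH111900-1 GENERATOR"),
    row(1, "VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL"),
    row(2, "VA111200P1 GENERATOR ASSEMBLY - HOUSING"),
    row(2, "100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN"),
    row(2, "VA112719P1 GEAR RETAINER"),
    row(3, "G11257P6 PLATE-BLANK"),
])

result = keep_levels(sample)
print(result)
```

Every remaining line then starts with its level number followed by a single space, so lines for the same drawing become directly comparable whatever their original indentation.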
BTW, don’t worry about the loss of the indentation. We’ll be able to restore it at the end of the process !
Now, at this point of our discussion, the best would be for you to tell me which lines, among the 14 lines just above, you would like to keep ;-)) With this additional information, I’ll try to find a regex matching your needs and deleting all the other [duplicate] lines ;-))
See you later,
Best Regards
guy038
-
-
@guy038 ,
Yes, very nice! Both regexes work great!
Starting with:
AGH111900-1 GENERATOR
VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL
VA111200P1 GENERATOR ASSEMBLY - HOUSING
100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN
200-000-111-112 CONNECTOR
200-000-112-004 CONNECTOR CONTACT
A50GB0013-1 TAPE, PRESSURE SENSITIVE ADHESIVE- POLYIMIDE
AS3236-06 BOLT, MACHINE - DOUBLE HEX HEAD
MIL-PRF-7808 PERF.SPEC,LUBRICATING OIL,GR3
MS16996-10 SCREW, SOCKET HEAD
VA112719P1 GEAR RETAINER
VA112719P2 GEAR RETAINER
VA112799P2 GROMMET - T2
VA112799P3 GROMMET
VA112817G1 PLATE - IDENTIFICATION
VA112817P1 PLATE - IDENTIFICATION
3-011-001 INSULATING CMPD,ELE
G11257P6 PLATE-BLANK
K34706P1 THINNER, PAINT PRODUCTS
S-8 THINNER, PAINT PRODUCTS
TD111234 PROCESS SPECIFICATION - SERIALIZATION
VA113269P1 COVER - DISCONNECT ASSEMBLY
VA113289P1 DISC RETAINER - GEAR
VA113448P4 BUSHING - HEATER HOUSING
VA113453P1 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453P2 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
To get the results I want, I have summarized the steps, running the regexes in the order below:
-
In Windchill, select the End Item to be exported. Go to the Structure tab, select Viewing, select Display, select Expand All Levels.
Go to Reports, select Multi-Level BOM. Export the Multi-Level BOM from Windchill: select Actions, Export List to File, Export XLSX.
-
Copy the data from Excel (only the data below the Number and Name column headers) and paste it into Notepad++.
-
In Notepad++, open Search and Replace (Ctrl+H).
-
Find what:
(?-s)(?!.*VA1\d+|^0|^\u)^.+\R|^\d\x20{3}|[GP]\d+
Replace with: Leave Empty
Result: Removes anything that is not a “VA1” drawing, and removes the G (Assembly) & P (Part) conditions. Note: If you use “dash” numbers for parts and/or assemblies, you would look for a “-” instead.
-
Find what:
^(.*)(\r?\n\1)+$
Replace with: \1
*In the Replace Dialog, Regular expression, X . matches newline (checked)
Result: Finds not just duplicate lines, but also duplicated groups of lines, and removes the second copy of each group.
Keep searching with this one until it doesn’t find any more duplicates.
-
Find what:
(?-s)^(.+\R)\h{4}\1+
Replace with: \1
*In the Replace Dialog, Regular expression, . matches newline (checked)
Result: Removes the indented duplicates
-
Find what:
(?-s)^(\s+\w+\b)(.*\R)((?:\1.*(?:\Z|\R))+)
Replace with: \1\2
*In the Replace Dialog, Regular expression, . matches newline (checked)
Result: Removes duplicates that have the same drawing number, but dissimilar titles.
-
Find what:
(?-s)^(\s+\w+\b)(.*\R)\h{4}((?:\1.*(?:\Z|\R))+)
Replace with: \1\2
*In the Replace Dialog, Regular expression, . matches newline (checked)
Result: Removes duplicates that have the same drawing number and dissimilar titles, where the duplicate is indented 4 spaces.
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
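The "keep searching until it doesn’t find any more duplicates" step for the repeated-group regex `^(.*)(\r?\n\1)+$` can be pictured as a loop that re-runs the replacement until nothing changes. Here is a small Python sketch of that idea. It assumes the replacement is `\1` and that ". matches newline" corresponds to `re.DOTALL`; the function name `remove_repeated_groups` is made up for the example.

```python
import re

# DOTALL mimics the checked ". matches newline" box; MULTILINE makes ^ and $
# work per line, so a repeated block of several lines can be captured in \1.
DUP_GROUPS = re.compile(r"^(.*)(\r?\n\1)+$", re.MULTILINE | re.DOTALL)


def remove_repeated_groups(text: str) -> str:
    """Re-apply the replacement until a pass makes no substitution."""
    while True:
        text, count = DUP_GROUPS.subn(r"\1", text)
        if count == 0:
            return text


print(remove_repeated_groups("A\nB\nA\nB\nC\n"))
```

Each successful pass removes at least one line break, so the loop always ends. On large files this pattern may backtrack heavily, which is one reason the more targeted regexes can be preferable.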
-
-
Hi, @scott-fredrick-smith, and All,
OK ! Let’s start again from the first regex S/R of my previous post.
So, from your original text, below :
0 AGH111900-1 GENERATOR
1 VA111200G1 GENERATOR ASSEMBLY - TOP LEVEL
2 VA111200P1 GENERATOR ASSEMBLY - HOUSING
2 100629-042 ADHESIVE, ANAEROBIC, LIQUID RESIN
2 200-000-111-112 CONNECTOR
2 200-000-112-004 CONNECTOR CONTACT
2 A50GB0013-1 TAPE, PRESSURE SENSITIVE ADHESIVE- POLYIMIDE
2 AS3236-06 BOLT, MACHINE - DOUBLE HEX HEAD
2 MIL-PRF-7808 PERF.SPEC,LUBRICATING OIL,GR3
2 MS16996-10 SCREW, SOCKET HEAD
2 VA112719P1 GEAR RETAINER
2 VA112719P2 GEAR RETAINER
2 VA112799P2 GROMMET - T2
2 VA112799P3 GROMMET
2 VA112817G1 PLATE - IDENTIFICATION
3 VA112817P1 PLATE - IDENTIFICATION
3 3-011-001 INSULATING CMPD,ELE
3 G11257P6 PLATE-BLANK
3 K34706P1 THINNER, PAINT PRODUCTS
4 S-8 THINNER, PAINT PRODUCTS
3 TD111234 PROCESS SPECIFICATION - SERIALIZATION
2 VA113269P1 COVER - DISCONNECT ASSEMBLY
2 VA113289P1 DISC RETAINER - GEAR
2 VA113448P4 BUSHING - HEATER HOUSING
2 VA113453P1 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
2 VA113453P2 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
The following regex S/R, named A :
-
Deletes any line which does not contain a VA drawing and is not the level 0 line
-
Deletes all the P’s and G’s that are followed by digits
SEARCH
(?-is)(?!.*VA\d+|^0|^\u)^.+\R|^\d\x20{3}|[GP]\d+
REPLACE
Leave EMPTY
So, after clicking, once, on the Replace All button, or several times on the Replace button, you should get :
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA111200 GENERATOR ASSEMBLY - HOUSING
VA112719 GEAR RETAINER
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112799 GROMMET
VA112817 PLATE - IDENTIFICATION
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 Thick
Now, with this new second regex S/R, named B, below, you just wipe out all the duplicate lines !
SEARCH
(?-s)^(\h+(VA\d+)\x20.+\R)(\h+\2.+\R)+
REPLACE
\1
After clicking, once, on the Replace All button, or several times on the Replace button, here is what you get. Practically, your final text !
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
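To see how regex B behaves, here is a small Python approximation. Python's `re` has no `\h` or `\R`, so `[ \t]` and `\n` are used instead; the function name `drop_same_drawing` and the indentation of the sample lines are assumptions for the sake of the example.

```python
import re

# Group 1: the first line of a run, group 2: its VA drawing number.
# The repeated third group swallows the following lines that start
# (after their indentation) with the same drawing number.
PATTERN_B = re.compile(r"^([ \t]+(VA\d+) .+\n)([ \t]+\2.+\n)+", re.MULTILINE)


def drop_same_drawing(text: str) -> str:
    """Keep only the first line of each run of lines sharing a VA number."""
    return PATTERN_B.sub(r"\1", text)


# Hypothetical input: the output of regex A, with assumed indentation.
sample = (
    "AGH111900-1 GENERATOR\n"
    "    VA111200 GENERATOR ASSEMBLY - TOP LEVEL\n"
    "        VA111200 GENERATOR ASSEMBLY - HOUSING\n"
    "        VA112799 GROMMET - T2\n"
    "        VA112799 GROMMET\n"
    "        VA113269 COVER - DISCONNECT ASSEMBLY\n"
)

result = drop_same_drawing(sample)
print(result)
```

Note that only consecutive lines are merged: the first title of each run wins, and the differing indentation of the duplicates does not matter because the leading blanks are matched separately from the drawing number.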
Finally, we just have to normalize the run of spaces after each VA drawing number (VA\d+) to 4 space characters. Hence this last regex S/R, named C :
SEARCH
(VA\d+)\x20+
REPLACE
\1\x20\x20\x20\x20
Again, after clicking, once, on the Replace All button, or several times on the Replace button, you’ll obtain your expected text :
AGH111900-1 GENERATOR
VA111200 GENERATOR ASSEMBLY - TOP LEVEL
VA112719 GEAR RETAINER
VA112799 GROMMET - T2
VA112817 PLATE - IDENTIFICATION
VA113269 COVER - DISCONNECT ASSEMBLY
VA113289 DISC RETAINER - GEAR
VA113448 BUSHING - HEATER HOUSING
VA113453 SHIM - 0.630 OD, 0.200 ID, 0.005 THK
Et voilà :-))
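For completeness, regex C translates almost literally to Python. A minimal sketch, where `normalize_gap` is a hypothetical name and the sample lines are made up:

```python
import re


def normalize_gap(text: str) -> str:
    """Replace the run of spaces after each VA drawing number with 4 spaces."""
    return re.sub(r"(VA\d+) +", r"\1" + " " * 4, text)


sample = (
    "    VA111200 GENERATOR ASSEMBLY - TOP LEVEL\n"
    "        VA112719      GEAR RETAINER\n"
)
result = normalize_gap(sample)
print(result)
```

Only the gap between the drawing number and the title changes; the leading indentation is left alone.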
Best Regards,
guy038
P. S. : Next time, I could give you some explanations about these 3 regex S/R (A, B and C) !
-