First occurrence of series of words in file
-
I have a log file which has several server names in it. I have created a RegEx formula to match the series of servers that I am looking for. I would like to find the first occurrence of each of those servers in the entire log file. The RegEx formula for the servers is like:
qwert.?\d{3}.
This would match servers like:
qwertb012b
qwert123g
Etc.
I would like to find the first occurrence of each of the servers resolved by the RegEx formula in the entire log file so I have a list of servers that have connected.
I am using Notepad++ 7.3.3 with the RegEx Helper add-in to get the list.
-
There are a few ways to solve this one. Here’s how I might do it, not saying it is the best way, just A way to get the job done:
- Make the Find result panel visible if not already [Search (menu) -> Search Results Window]
- Clear the Find result panel (right-click the panel, choose Clear all)
- Do your Find All in Current Document search exactly as you described; results will be in the Find result panel
- Right-click Find result panel, choose Select all
- Right-click Find result panel, choose Copy
- Paste into a new (scratch) document
- Invoke the Mark feature of the Find dialog to mark this regex:
(?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)
- Set the following options (only) on the mark: Mark line, Purge for each search, Wrap around and of course Search mode: Regular expression
- Click Find All to do the mark; All but the LAST occurrence of each unique server name will be marked
- Delete the non-unique lines via Search (menu) -> Bookmark -> Remove Bookmarked Lines
You are left with one unique line per server name. It finds the LAST occurrence of each name, and you wanted the FIRST, but I don’t see how that makes a difference for your situation.
:-D
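For anyone who would rather script the de-duplication step than use Mark and Bookmark, here is a minimal Python sketch of the same idea. It is only an illustration and not part of the Notepad++ procedure; the function names and the sample list are made up. It shows both variants, keeping the LAST occurrence of each line (what the mark-and-delete steps do) and keeping the FIRST.

```python
def keep_last_occurrences(lines):
    """Drop duplicate lines, keeping each line's LAST occurrence."""
    seen = set()
    kept_reversed = []
    # Walking backwards, the first time a line is met is its last occurrence.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept_reversed.append(line)
    kept_reversed.reverse()  # restore the original relative order
    return kept_reversed


def keep_first_occurrences(lines):
    """Drop duplicate lines, keeping each line's FIRST occurrence."""
    # dict keys preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(lines))


# Hypothetical data standing in for the pasted Find result lines.
sample = ["qwert123b", "qwerty321f", "qwert123b", "qwertb012b", "qwerty321f"]
print(keep_last_occurrences(sample))
print(keep_first_occurrences(sample))
```

Either way the set of unique names is the same; only the order of the surviving lines differs.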
-
Unfortunately, it did not work. I am working with a log like this from Lotus Notes:
05/23/2017 03:40:39 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 04:00:01 AM Router: Beginning Out of office service daily maintenance.
05/23/2017 04:00:01 AM Router: Completed Out of office service daily maintenance.
05/23/2017 04:01:00 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 04:01:00 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 04:01:00 AM Router: Received maintenance request for mail4.box, processing all pending message updates
05/23/2017 04:01:00 AM Router: Completed compaction of mailbox file mail3.box
05/23/2017 04:01:00 AM Router: Beginning in-place compaction of mailbox file mail4.box
05/23/2017 04:01:08 AM Router: Completed compaction of mailbox file mail4.box
05/23/2017 04:13:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 04:13:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 04:45:44 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 04:45:44 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:00:56 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:00:56 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:02:21 AM Router: Transferring mail to domain xxx.xxx.xxx.xxx (host xxx.xxx.xxx.xxx [xxx.xxx.xxx.xxx]) via SMTP
05/23/2017 05:02:24 AM Router: Message 0031A730 transferred to xxx.xxx.xxx.xxx for another_user@domain.com via SMTP
05/23/2017 05:02:27 AM Router: Transferred 2 messages to xxx.xxx.xxx.xxx (host xxx.xxx.xxx.xxx) via SMTP
05/23/2017 05:02:27 AM Router: Message 0031A75A transferred to xxx.xxx.xxx.xxx for another_user@domain.com via SMTP
05/23/2017 05:02:40 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:02:40 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes
05/23/2017 05:04:10 AM SMTP Server: Message 0031D206 (…) received
05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received
05/23/2017 05:04:10 AM Router: Transferring mail to SERVER004/DOMAIN via Notes
05/23/2017 05:04:10 AM Router: Message 0031D206 transferred to SERVER004/DOMAIN for Yet_Another_User/FOO/DOMAIN@DOMAIN via Notes
05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:26:02 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:26:02 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
05/23/2017 05:41:36 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
I would like to find all the unique qwert servers (there are qwert123b and qwerty321f in this example). I would love it if there were a single regex formula that would find each server once and just show me those servers. Basically, the idea is that we are shutting down this server and I would like to see which servers are still connecting. I have one server which they cannot seem to move to our new routing server. There are several of these servers, which I have pseudonamed qwert, and I would just like to get a single list of them. I was trying to use the RegEx Helper add-in to see if I can get just the server names listed without the rest of the line.
-
So the general procedure works, it is just a matter of getting the regular expressions to match the real data. As you didn’t initially provide a sample of your data (a common problem in this forum), I made some (bad) assumptions.
Here’s how I would amend the procedure from earlier (the changes are in the search and mark regexes), with the additional assumption that in your sample data you have x’d out your real IP addresses; in other words, you aren’t really searching for xxx.xxx.xxx.xxx, but rather real numerical IP addresses that are constant for each individual server.
- Make the Find result panel visible if not already [Search (menu) -> Search Results Window]
- Clear the Find result panel (right-click the panel, choose Clear all)
- Do a Find All in Current Document searching for the regex below; results will be in the Find result panel:
qwert.?\d{3}.\.domain\.com \((?:\d{1,3}\.){3}\d{1,3}\) connected
- Right-click Find result panel, choose Select all
- Right-click Find result panel, choose Copy
- Paste into a new (scratch) document
- Invoke the Mark feature of the Find dialog to mark this regex:
(?-s)(qwert.?\d{3}.\.domain\.com \((?:\d{1,3}\.){3}\d{1,3}\) connected)(?s)(?=.*\1)
- Set the following options (only) on the mark: Mark line, Purge for each search, Wrap around and of course Search mode: Regular expression
- Click Find All to do the mark; All but the LAST occurrence of each unique server name will be marked
- Delete the non-unique lines via Search (menu) -> Bookmark -> Remove Bookmarked Lines
I think you understand the first regex. The second one is more complicated, but what it is doing is finding a connection match, then asserting (the ?= syntax) that the same connection data occurs later in the file. Only if the same connection occurs later is the match marked. It then goes on to find all such matches in the file. The LAST occurrence of a connection match is NOT marked, because by definition (of LAST) there is no match farther on. Thus, it leaves the final connection match (for each unique server) unmarked. Since all non-unique server connections get marked, deleting all the marked lines leaves you with what you desire, a simple and hopefully short list of all unique connecting servers.
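To see the lookahead logic in isolation, here is a small Python sketch, offered only as an illustration. The sample text and its IP addresses are invented. One difference from Notepad++: recent versions of Python's re module do not accept a mid-pattern (?s), so the sketch uses the scoped form (?s:.*) inside the lookahead instead, which has the same effect of letting the dot cross line breaks.

```python
import re

# Hypothetical sample; the IP addresses are made up for illustration.
log = (
    "qwert123b.domain.com (10.0.0.1) connected\n"
    "qwerty321f.domain.com (10.0.0.2) connected\n"
    "qwert123b.domain.com (10.0.0.1) connected\n"
)

# The connection regex from the Find All step.
CONNECTION = r"qwert.?\d{3}.\.domain\.com \((?:\d{1,3}\.){3}\d{1,3}\) connected"

# A connection followed, anywhere later in the text, by the same connection.
HAS_LATER_DUPLICATE = re.compile(r"(" + CONNECTION + r")(?=(?s:.*)\1)")

all_hits = list(re.finditer(CONNECTION, log))
# These are the matches the Mark step would bookmark.
marked_starts = {m.start() for m in HAS_LATER_DUPLICATE.finditer(log)}
# Whatever is not marked is the last occurrence of each unique connection.
survivors = [m.group(0) for m in all_hits if m.start() not in marked_starts]

for line in survivors:
    print(line)
```

Only the first qwert123b line is marked, because it is the only one with an identical connection later in the text; the other two lines survive.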
I think that you are ideally wishing for a simpler procedure, and indeed there probably is one, but I don’t think it is going to end up being as simple as you want it to be.
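If stepping outside Notepad++ is an option, one possibly simpler route is a short script. The following is a minimal Python sketch, assuming the log lines look like the sample posted above; the function name first_connections is made up, and the pattern accepts anything inside the parentheses so that it also works on the x’d-out addresses. It records the FIRST "connected" line seen for each qwert server.

```python
import re

# Capture the server name from "connected" lines only; "disconnected"
# lines do not match because of the space required before "connected".
SERVER = re.compile(
    r"SMTP Server: (qwert.?\d{3}.)\.domain\.com \([^)]*\) connected"
)


def first_connections(lines):
    """Map each matching server name to the first line where it connected."""
    first_seen = {}
    for line in lines:
        m = SERVER.search(line)
        if m and m.group(1) not in first_seen:
            first_seen[m.group(1)] = line.rstrip("\n")
    return first_seen


# A few lines taken from the sample log in this thread.
sample = [
    "05/23/2017 03:40:39 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received",
    "05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
]

result = first_connections(sample)
for name, line in result.items():
    print(name, "->", line)
```

Because the function only iterates over lines, an open file object for the real log can be passed in place of the sample list.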
:-D
-
Also, a good reference for the regex used above to mark duplicate lines is found here:
http://www.regular-expressions-cookbook.com/Regex Cookbook 2 Code Samples.html
then search that page for "Keep the last occurrence of each duplicate line in an unsorted file"
-
That. Is. So. Cool. Thanks! I can live with that… <smile> Worked like a charm (once I got the correct syntax).
-
Hi, @zeff-wheelock and @scott-sumner,
Scott, your regexes are exact, of course! In addition, searching for the last occurrence makes sense, as we want to know the final state of all these servers.
However, I must be missing something! To my mind, we should search for the last line, with the word connected, for each server, ONLY IF a previous line, with the word disconnected, for the corresponding server, has not been found!
For instance, if we consider this part of Zeff’s log, below:
05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected
05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes
05/23/2017 05:04:10 AM SMTP Server: Message 0031D206 (…) received
05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received
To my mind, it looks like, on 05/23/2017 05:04:10 AM, the server “qwerty321f” is disconnected. No? Of course, I suppose that the string xxx.xxx.xxx.xxx, in the first and last lines just above, represents the same IPv4 address!
Cheers,
guy038
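One way to read the question above is "what is the most recent event logged for each server?". Under that reading, which is an assumption and may not be exactly what was meant, a minimal Python sketch could track the last connected/disconnected event per qwert server. The function name last_state is made up, and the sample lines are taken from the log posted earlier in the thread.

```python
import re

# Capture the server name and whether the line logs a connect or a disconnect.
EVENT = re.compile(
    r"SMTP Server: (qwert.?\d{3}.)\.domain\.com \([^)]*\) (connected|disconnected)"
)


def last_state(lines):
    """Map each matching server name to its most recent logged event."""
    state = {}
    for line in lines:
        m = EVENT.search(line)
        if m:
            state[m.group(1)] = m.group(2)  # later lines overwrite earlier ones
    return state


sample = [
    "05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes",
    "05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received",
    "05/23/2017 05:41:36 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
]

states = last_state(sample)
for name, event in states.items():
    print(name, event)
```

With this sample, qwerty321f ends up as disconnected and qwert123b as connected. For the original goal of listing every server that still connects at all, the final state matters less than whether the server appears in the log.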