First occurrence of series of words in file



  • I have a log file that contains several server names. I have written a regex to match the servers I am looking for, and I would like to find the first occurrence of each of those servers in the entire log file. The regex for the servers is:

    qwert.?\d{3}.

    This would match servers like:
    qwertb012b
    qwert123g
    Etc

    I would like to find the first occurrence of each of the servers resolved by the RegEx formula in the entire log file so I have a list of servers that have connected.

    I am using Notepad++ 7.3.3 with the RegEx Helper add-in to get the list.



  • @Zeff-Wheelock

    There are a few ways to solve this one. Here’s how I might do it, not saying it is the best way, just A way to get the job done:

    • Make the Find result panel visible if not already [Search (menu) -> Search Results Window]
    • Clear the Find result panel (right-click the panel, choose Clear all)
    • Do your Find All in Current Document search exactly as you described; results will be in the Find result panel
    • Right-click Find result panel, choose Select all
    • Right-click Find result panel, choose Copy
    • Paste into a new (scratch) document
    • Invoke the Mark feature of the Find dialog to mark this regex: (?-s)^(.*)(?:\R)(?s)(?=.*^\1\R)
    • Set the following options (only) on the mark: Mark line, Purge for each search, Wrap around and of course Search mode: Regular expression
    • Click Find All to do the mark; All but the LAST occurrence of each unique server name will be marked
    • Delete the non-unique lines via Search (menu) -> Bookmark -> Remove Bookmarked Lines

    You are left with one unique line per server name. It finds the LAST occurrence of each name, and you wanted the FIRST, but I don’t see how that makes a difference for your situation.
    :-D
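
    For anyone who would rather script the de-duplication step than use Mark and bookmarks, here is a minimal Python sketch of the same idea. It is not part of the Notepad++ procedure above; the function names and the sample list are made up for illustration, and it assumes the matched server names have already been pasted out one per line.

```python
def keep_last_occurrence(lines):
    """Drop duplicates, keeping the LAST occurrence of each line (like the Mark regex)."""
    seen = set()
    kept = []
    # Walk backwards so the first time we meet a line it is its last occurrence.
    for line in reversed(lines):
        if line not in seen:
            seen.add(line)
            kept.append(line)
    kept.reverse()  # restore original relative order
    return kept


def keep_first_occurrence(lines):
    """Drop duplicates, keeping the FIRST occurrence of each line."""
    # dict preserves insertion order, so the first sighting of each line wins.
    return list(dict.fromkeys(lines))


# Hypothetical scratch-document contents (names taken from the question).
sample = ["qwert123b", "qwerty321f", "qwert123b", "qwertb012b", "qwerty321f"]
print(keep_last_occurrence(sample))
print(keep_first_occurrence(sample))
```

    Either function gives the same set of names; only the order of the surviving lines differs, which is why first versus last should not matter for a simple "who connected" list.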



  • Unfortunately, it did not work. I am trying to process a log like this one from Lotus Notes:

    05/23/2017 03:40:39 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 04:00:01 AM Router: Beginning Out of office service daily maintenance.
    05/23/2017 04:00:01 AM Router: Completed Out of office service daily maintenance.
    05/23/2017 04:01:00 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 04:01:00 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 04:01:00 AM Router: Received maintenance request for mail4.box, processing all pending message updates
    05/23/2017 04:01:00 AM Router: Completed compaction of mailbox file mail3.box
    05/23/2017 04:01:00 AM Router: Beginning in-place compaction of mailbox file mail4.box
    05/23/2017 04:01:08 AM Router: Completed compaction of mailbox file mail4.box
    05/23/2017 04:13:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 04:13:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 04:45:44 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 04:45:44 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:00:56 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:00:56 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:02:21 AM Router: Transferring mail to domain xxx.xxx.xxx.xxx (host xxx.xxx.xxx.xxx [xxx.xxx.xxx.xxx]) via SMTP
    05/23/2017 05:02:24 AM Router: Message 0031A730 transferred to xxx.xxx.xxx.xxx for another_user@domain.com via SMTP
    05/23/2017 05:02:27 AM Router: Transferred 2 messages to xxx.xxx.xxx.xxx (host xxx.xxx.xxx.xxx) via SMTP
    05/23/2017 05:02:27 AM Router: Message 0031A75A transferred to xxx.xxx.xxx.xxx for another_user@domain.com via SMTP
    05/23/2017 05:02:40 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:02:40 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes
    05/23/2017 05:04:10 AM SMTP Server: Message 0031D206 (…) received
    05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received
    05/23/2017 05:04:10 AM Router: Transferring mail to SERVER004/DOMAIN via Notes
    05/23/2017 05:04:10 AM Router: Message 0031D206 transferred to SERVER004/DOMAIN for Yet_Another_User/FOO/DOMAIN@DOMAIN via Notes
    05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:26:02 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:26:02 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received
    05/23/2017 05:41:36 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected

    I would like to find all the unique qwert servers (qwert123b and qwerty321f in this example). I would love it if there were a single regex that would find each server once and show me just those servers. Basically, we are shutting down this server, and I would like to see which servers are still connecting to it. There is one server that they cannot seem to move to our new routing server, and there are several of these servers (which I have given the pseudonym qwert) for which I would just like a single list. I was trying to use the RegEx Helper add-in to see if I could get the server names listed without the rest of the line.



  • @Zeff-Wheelock

    So the general procedure works, it is just a matter of getting the regular expressions to match the real data. As you didn’t initially provide a sample of your data (a common problem in this forum), I made some (bad) assumptions.

    Here’s how I would amend the procedure from earlier (only the two regexes change), with the additional assumption that in your sample data you have x’d out your real IP addresses; in other words, you aren’t really searching for xxx.xxx.xxx.xxx, but rather real numerical IP addresses that are constant for each individual server.

    • Make the Find result panel visible if not already [Search (menu) -> Search Results Window]
    • Clear the Find result panel (right-click the panel, choose Clear all)
    • Do a Find All in Current Document searching for qwert.?\d{3}.\.domain\.com \((?:\d{1,3}\.){3}\d{1,3}\) connected; results will be in the Find result panel
    • Right-click Find result panel, choose Select all
    • Right-click Find result panel, choose Copy
    • Paste into a new (scratch) document
    • Invoke the Mark feature of the Find dialog to mark this regex: (?-s)(qwert.?\d{3}.\.domain\.com \((?:\d{1,3}\.){3}\d{1,3}\) connected)(?s)(?=.*\1)
    • Set the following options (only) on the mark: Mark line, Purge for each search, Wrap around and of course Search mode: Regular expression
    • Click Find All to do the mark; All but the LAST occurrence of each unique server name will be marked
    • Delete the non-unique lines via Search (menu) -> Bookmark -> Remove Bookmarked Lines

    I think you understand the first regex. The second one is more complicated, but what it is doing is finding a connection match, then asserting (the ?= syntax) that the same connection data occurs later in the file. Only if the same connection occurs later is the match marked. It then goes on to find all such matches in the file. The LAST occurrence of a connection match is NOT marked, because by definition (of LAST) there is no match farther on. Thus, it leaves the final connection match (for each unique server) unmarked. Since all non-unique server connections get marked, deleting all the marked lines leaves you with what you desire, a simple and hopefully short list of all unique connecting servers.
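
    To see the lookahead idea in isolation, here is a small Python sketch. It is an adaptation, not the Notepad++ regex itself: Python's re module does not accept (?-s) and (?s) switches in the middle of a pattern, so the sketch uses re.MULTILINE plus a scoped (?s:...) group instead, and it compares whole lines. The function name and the IP addresses in the sample are invented placeholders.

```python
import re

# A line is "marked" when an identical line occurs later in the text.
# ^(.+)$      captures one whole line (re.MULTILINE makes ^ and $ per-line)
# (?=...)     lookahead: asserts without consuming
# (?s:.*)     skips ahead across newlines
# ^\1$        the same line text again, as a complete line
DUPLICATE_RE = re.compile(r"^(.+)$(?=(?s:.*)^\1$)", re.MULTILINE)


def drop_marked_lines(text):
    """Remove every line that has an identical line later on (keeps last occurrences)."""
    marked = {m.start() for m in DUPLICATE_RE.finditer(text)}
    kept = []
    offset = 0
    for line in text.splitlines(keepends=True):
        if offset not in marked:
            kept.append(line)
        offset += len(line)
    return "".join(kept)


# Hypothetical scratch document; the IP addresses are placeholders.
scratch = (
    "qwert123b.domain.com (10.0.0.1) connected\n"
    "qwerty321f.domain.com (10.0.0.2) connected\n"
    "qwert123b.domain.com (10.0.0.1) connected\n"
)
print(drop_marked_lines(scratch), end="")
```

    Only the first qwert123b line is matched, because it is the only line with an identical copy farther down; the final copy of each line survives, just as in the bookmark procedure.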

    I think that you are ideally wishing for a simpler procedure, and indeed there probably is one, but I don’t think it is going to end up being as simple as you want it to be.
    :-D
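
    If stepping outside Notepad++ is an option, a short script can go straight from the raw log to the list of unique servers. The following Python sketch is one possible way, under a few assumptions: the server pattern is the one from the question, the parenthesised address is matched loosely with \([^)]*\) so that the x’d-out sample lines also match, and the function name is hypothetical.

```python
import re

# Capture just the server name from lines that end in "... (address) connected".
# The leading space before "connected" keeps "disconnected" lines from matching.
CONNECT_RE = re.compile(r"(qwert.?\d{3}.)\.domain\.com \([^)]*\) connected")


def first_connections(lines):
    """Map each matching server name to the 1-based number of its first 'connected' line."""
    first = {}
    for number, line in enumerate(lines, start=1):
        m = CONNECT_RE.search(line)
        if m and m.group(1) not in first:
            first[m.group(1)] = number
    return first


# A few lines in the shape of the sample log from the thread.
log = [
    "05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 03:41:16 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) disconnected. 0 message[s] received",
    "05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:04:52 AM SMTP Server: foobar003.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:12:27 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
]
for server, line_number in first_connections(log).items():
    print(server, line_number)
```

    On a real file, the list could be replaced by an open file object, since iterating over a file yields its lines one at a time.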



  • Also, a good reference for the regex used above to mark duplicate lines is found here:

    http://www.regular-expressions-cookbook.com/Regex Cookbook 2 Code Samples.html
    then search that page for Keep the last occurrence of each duplicate line in an unsorted file



  • That. Is. So. Cool. Thanks! I can live with that… <smile> Worked like a charm (once I got the correct syntax).



  • Hi, @zeff-wheelock and @scott-sumner,

    Scott, your regexes are exact, of course! In addition, searching for the last occurrence makes sense, as we want to know the final state of all these servers.

    However, I must be missing something! To my mind, we should search for the last line, with the word connected, for each server, ONLY IF a previous line, with the word disconnected, for the corresponding server, has not been found!

    For instance, consider this part of Zeff’s log:

    05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected
    05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes
    05/23/2017 05:04:10 AM SMTP Server: Message 0031D206 (…) received
    05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received
    

    To my mind, it looks as though the server “qwerty321f” is disconnected as of 05/23/2017 05:04:10 AM, no? Of course, I am assuming that the string xxx.xxx.xxx.xxx, in the first and last lines just above, represents the same IPv4 address!
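
    The "final state" reading of the log can be sketched in a few lines of Python. This is only an illustration of the idea, not something from the procedure above: the function name is hypothetical, and the address in parentheses is matched loosely with \([^)]*\) so that the x’d-out sample lines are accepted.

```python
import re

# Capture the server name and whether the line reports a connect or a disconnect.
EVENT_RE = re.compile(
    r"(qwert.?\d{3}.)\.domain\.com \([^)]*\) (connected|disconnected)"
)


def last_state(lines):
    """Return the most recent 'connected'/'disconnected' event seen for each server."""
    state = {}
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            state[m.group(1)] = m.group(2)  # later events overwrite earlier ones
    return state


# Lines copied from the sample log in the thread.
log = [
    "05/23/2017 05:04:09 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) connected",
    "05/23/2017 05:04:10 AM Router: Transferred 1 messages to SERVER004/DOMAIN via Notes",
    "05/23/2017 05:04:10 AM SMTP Server: Message 0031D206 (…) received",
    "05/23/2017 05:04:10 AM SMTP Server: qwerty321f.domain.com (xxx.xxx.xxx.xxx) disconnected. 1 message[s] received",
    "05/23/2017 05:41:36 AM SMTP Server: qwert123b.domain.com (xxx.xxx.xxx.xxx) connected",
]
still_connected = [s for s, event in last_state(log).items() if event == "connected"]
print(still_connected)
```

    With these lines, qwerty321f ends up disconnected and only qwert123b is reported as still connected.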

    Cheers,

    guy038

