docs/wiki - Regular Expressions - Example 2 not correct



  • I am not sure whether the wiki will ever be updated again, but if so, here is something I found to be wrong. Example 2 also mentions an IP address that is not supposed to be changed, but it is; you just won’t notice unless you change the numbers. Some details:

    • ([^0-9])([0123][0-9]).([01][0-9]).([0-9][0-9])([^0-9])

    Replace with: \1\4-\3-\2\5
    Problem: doesn’t find dates at the very beginning of the text, doesn’t find 4-digit years, doesn’t find dates with only one digit in d/m/y, DOES find the IP mentioned (it just does not appear to change because \2 and \4 are the same), does accept 00.00.00, outer groups not needed.
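The problems listed above can be reproduced outside Notepad++. The following is a minimal sketch using Python's `re` module as a stand-in for the Notepad++ engine (an assumption; the two engines agree on this simple syntax). The sample lines and the name `wiki_swap` are hypothetical and not taken from the wiki page.

```python
import re

# The wiki's Example 2 pattern and replacement, copied as-is
# (note that the dots are unescaped, so they match any character).
WIKI_PATTERN = re.compile(r"([^0-9])([0123][0-9]).([01][0-9]).([0-9][0-9])([^0-9])")
WIKI_REPLACEMENT = r"\1\4-\3-\2\5"


def wiki_swap(text: str) -> str:
    """Apply the wiki's find/replace over the whole text."""
    return WIKI_PATTERN.sub(WIKI_REPLACEMENT, text)


# Hypothetical sample lines chosen to hit each listed problem.
samples = [
    "server 10.11.12.13 down",  # IP-like string: gets rewritten anyway
    "31.12.99 was the day",     # date at the very beginning: missed
    "on 31.12.1999 it ended",   # four-digit year: missed
    "on 31.12.99 it ended",     # the one case the pattern was written for
]
for line in samples:
    print(repr(line), "->", repr(wiki_swap(line)))
```

The first sample shows why an unchanged-looking IP is misleading: the match happens, and only symmetric numbers hide it.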
    First try:

    • ([^0-9.]|^)([0123]?[0-9]).([01]?[0-9]).(?:[0-9]{2})?([0-9]?[0-9])(?!.?[0-9])

    Replace with: \1\4-\3-\2
    Problem: won’t find dates that start right after a dot, like “…this .01.01.12 …”. That should only happen with a misplaced dot, but nevertheless. Also, the first group is still part of the match, so the replacement numbers aren’t that clear to read. It still accepts 00.00.00: either use 2x (?!00) or more alternatives like ([1-9]|[01][1-9]|10). Of course something like 37.18.0000 would be possible too, just … go to hell ^^ :P. Nah, more alternatives, of course (see below). Now ask for the days in each month: 30/31 and 28.2 (incl. leap year). Possible, but … rly? https://stackoverflow.com/questions/15491894/
    (STILL EASY) SOLUTION:

    • (?<![0-9].)(?<![0-9])(3[01]|[12][0-9]|0?[1-9])[- /.](1[012]|0?[1-9])[- /.](?:[0-9]{2})?([0-9]?[0-9])(?!.?[0-9])

    Replace with: \3-\2-\1
    Problem: two lookbehinds and one lookahead might have an impact on performance in very long documents, though no better solution is possible if you want to avoid the problems above. 1-2-3.4 You motherf…
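To see how the lookarounds behave, here is a rough Python sketch of the same find/replace. It assumes Python's `re` treats this pattern like Notepad++ does; the function name `to_ymd` and the sample strings are made up for illustration.

```python
import re

# Same pattern as above, split over several lines for readability only.
DATE = re.compile(
    r"(?<![0-9].)(?<![0-9])"            # no digit (or digit + any char) right before
    r"(3[01]|[12][0-9]|0?[1-9])[- /.]"  # day 1..31, then a separator
    r"(1[012]|0?[1-9])[- /.]"           # month 1..12, then a separator
    r"(?:[0-9]{2})?([0-9]?[0-9])"       # optional century, then the captured year part
    r"(?!.?[0-9])"                      # no further digit directly or one char later
)


def to_ymd(text: str) -> str:
    """Rewrite d.m.y dates as y-m-d, keeping only the last part of the year."""
    return DATE.sub(r"\3-\2-\1", text)


for line in ["31.12.99 was the day", "Due 1.2.2018.",
             "server 10.11.12.13 down", "0.0.0"]:
    print(repr(line), "->", repr(to_ymd(line)))
```

Note that the century is matched by a non-capturing group, so a four-digit year is shortened to its last two digits in the output.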

    Find new date: (?<![0-9].)(?<![0-9])([0-9]?[0-9])-(1[012]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])(?!-?[0-9.])

    Though if you want every part of the date to have two digits, you have to run three more find/replaces:
    Find: (?<![0-9].)(?<![0-9])([0-9])-(1[012]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])
    Replace with: 0\1-\2-\3
    Find: (?<![0-9].)(?<![0-9])([0-9][0-9])-([1-9])-(3[01]|[12][0-9]|0?[1-9])(?!-?[0-9.])
    Replace with: \1-0\2-\3
    Find: (?<![0-9].)(?<![0-9])([0-9][0-9])-([0-9][0-9])-([1-9])(?!-?[0-9.])
    Replace with: \1-\2-0\3
    Depending on your data, you might not need the lookarounds, or you might also want to check for a third “-”.

    Find new date: (?<![0-9].)(?<![0-9])([0-9][0-9])-([01][0-9])-([0123][0-9])(?!-?[0-9.])

    Though I do wonder about those last 3 find/replaces: would there even be a better option? I can’t seem to find one (for Notepad++ at least; scripts, on the other hand …)
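On the script side, the three padding passes can be folded into one by using a replacement function instead of a replacement string. This is only a sketch in plain Python, not something Notepad++'s Replace dialog can do by itself; `pad_date` and `pad_all` are hypothetical names.

```python
import re

# The "Find new date" pattern from above, unchanged.
NEW_DATE = re.compile(
    r"(?<![0-9].)(?<![0-9])"
    r"([0-9]?[0-9])-(1[012]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])"
    r"(?!-?[0-9.])"
)


def pad_date(match: re.Match) -> str:
    """Zero-pad year, month and day of one matched y-m-d date to two digits."""
    year, month, day = (int(part) for part in match.groups())
    return f"{year:02d}-{month:02d}-{day:02d}"


def pad_all(text: str) -> str:
    """Pad every y-m-d date found in the text in a single pass."""
    return NEW_DATE.sub(pad_date, text)


for line in ["18-2-1", "9-12-31 x", "18-02-01", "1-2-3.4"]:
    print(repr(line), "->", repr(pad_all(line)))
```

Because the padding is computed per match, dates that are already padded pass through unchanged.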



  • Ah - that’s when you first post on a new forum



  • Hello @kusalux, and All,

    The problem should be separated into two independent tasks :

    • Firstly, verify that NO invalid date exists, in the involved file(s)

    • Secondly, replace any valid European date, written as dd.mm.yy[yy], dd-mm-yy[yy] or dd/mm/yy[yy], with a sortable year-first format, such as yy[yy]-mm-dd, for instance


    Paradoxically, let’s begin with the easy second task !

    The simple regexes S/R, below, do the job, assuming that all dates :

    • Have, either, the dd.mm.yy[yy], dd-mm-yy[yy] or dd/mm/yy[yy] format

    • Are surrounded by blank characters or located at the very beginning or the very end of the current file

    • After replacement, the regex 1 keeps the year format. So, dd.mm.yy => yy-mm-dd and dd/mm/yyyy => yyyy-mm-dd

    • Whereas the regex 2 always writes the year as a two-digit number => yy-mm-dd

    So :

    SEARCH (?:\A|(?<=\s))(\d\d)[./-](\d\d)[./-]((\d\d)?\d\d)(?=\s|\z) Regex 1

    SEARCH (?:\A|(?<=\s))(\d\d)[./-](\d\d)[./-](?:\d\d)?(\d\d)(?=\s|\z) Regex 2

    REPLACE \3-\2-\1

    Notes :

    • First, the part (?:\A|(?<=\s)) forces the date to be preceded with a Blank character or to begin the file

    • Then (\d\d) looks for the two digits number of the day, stored as group 1, followed by a literal dot, slash or dash [./-]

    • Followed with the two digits of the month (\d\d), stored as group 2, followed, again, with a literal dot, slash or dash [./-]

      • Followed with the two or four digits of the year, stored as group 3 ((\d\d)?\d\d) ( case of regex 1 )

      • Followed with the last two digits of the year, stored as group 3 (?:\d\d)?(\d\d) ( case of regex 2 )

    • Finally, the positive look-ahead (?=\s|\z) ensures that the date is followed with a Blank character or ends the file

    • In replacement, the different groups are simply rewritten in the inverse order, separated with dashes
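The two S/R above can be tried out in Python as a quick check. This sketch assumes one change: Python's `re` historically spells the end-of-text anchor `\Z` rather than `\z`, so that is swapped in. The sample string is invented.

```python
import re

# Regex 1 keeps the year as written; regex 2 keeps only its last two digits.
REGEX_1 = re.compile(r"(?:\A|(?<=\s))(\d\d)[./-](\d\d)[./-]((\d\d)?\d\d)(?=\s|\Z)")
REGEX_2 = re.compile(r"(?:\A|(?<=\s))(\d\d)[./-](\d\d)[./-](?:\d\d)?(\d\d)(?=\s|\Z)")
REPLACEMENT = r"\3-\2-\1"

sample = "31.12.1999 and 01/02/18"
print(REGEX_1.sub(REPLACEMENT, sample))  # year format preserved
print(REGEX_2.sub(REPLACEMENT, sample))  # year always two digits

# Not surrounded by blanks or file boundaries, so left alone by both regexes.
print(REGEX_1.sub(REPLACEMENT, "10.11.12.13"))
```

The last line shows the effect of the look-ahead: an IP-like string is followed by a dot, not by a blank, so it is never touched.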


    Now, the task to verify if the date is valid, or not, is much more difficult to achieve ! At first sight :

    • The separator, between numbers, must be present twice and can be, exclusively, one of these three characters [./-]

    • The regex (19|20)?[0-9]{2} defines any correct two or four digits year, in the range [1900 - 2099 ]

    • The regex 0[1-9]|1[0-2] defines any correct two-digits month

    • And the regex 0[1-9]|[12][0-9]|3[01] defines any correct two-digits day… But not exactly… Indeed, we must take into account such dates as 31.04.18, 29-02-18 or even 31/02/18 , which are obviously invalid ones :-((

    I managed to build a correct but awful regex, using the free-spacing mode ! So, here is, below, the regex A, which detects any valid date, with format dd.mm.yy[yy], dd-mm-yy[yy] or dd/mm/yy[yy], surrounded by blank characters or by the very beginning or the very end of current file :

    (?x)
    (\A|(?<=\s))(
    (0[1-9]|[12][0-9]|3[01])  ([./-])   (0[13578]|1[02])   \g4    (19|20)?[0-9]{2}   |                #  Months =  01, 03, 05, 07, 08, 10, 12
    (0[1-9]|[12][0-9]|30)     ([./-])   (0[469]|11)        \g8    (19|20)?[0-9]{2}   |                #  Months =  04, 06, 09, 11
    (0[1-9]|1[0-9]|2[0-8])    ([./-])   02                 \g12   (19|20)?[0-9]{2}   |                #  Month  =  02 and day < 29
    29                        ([./-])   02                 \g14   (19|20)?([02468][048]|[13579][26])  #  Month  =  02 and day = 29
    )(?=\s|\z)
    

    But everyone understands that this regex is hardly usable in practice ! Note that we can build a less-restrictive regex, if we don’t care about possible wrong days, in dates !

    For instance, the regex B, below, would detect dates, with format dd.mm.yy[yy], dd-mm-yy[yy] or dd/mm/yy[yy], surrounded by blank characters or by the very beginning or the very end, of current file. However, some invalid dates, such as 29/02/18 or 31-04-18, would still be matched :-((

    (\A|(?<=\s))(0[1-9]|[12][0-9]|3[01])([./-])(0[1-9]|1[0-2])\3(19|20)?[0-9]{2}(?=\s|\z)
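As a rough illustration of what regex B accepts and rejects, here is a Python sketch. It assumes `\Z` in place of `\z` (Python's traditional spelling); the helper name `looks_like_date` is hypothetical.

```python
import re

# Regex B: same separator twice (backreference \3), plausible day and month,
# optional 19/20 century, but no check of the day against the month.
REGEX_B = re.compile(
    r"(\A|(?<=\s))(0[1-9]|[12][0-9]|3[01])([./-])(0[1-9]|1[0-2])\3"
    r"(19|20)?[0-9]{2}(?=\s|\Z)"
)


def looks_like_date(text: str) -> bool:
    """True if regex B finds a well-formed (not necessarily valid) date."""
    return REGEX_B.search(text) is not None


for text in ["22.02.2018", "29/02/18", "31-04-18",
             "22/02-18", "32/03/18", "test 31.100.1952"]:
    print(f"{text!r}: {looks_like_date(text)}")
```

The two impossible dates 29/02/18 and 31-04-18 pass, while mixed separators and out-of-range numbers are rejected.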


    So, we have to change our plans. The best would be :

    • Firstly, use a regex to detect most erroneous date formats, such as those, below :
    0,0,0
    test 31.100.1952
    32/03/18
    1.2.3
    123.2.12
    _22.02.18_
    22-02-18test
    22/02-18
    22//02/18
    22 02-18
    22:02:18
    

    Here is, below, the regex C, in free-spacing mode, which catches most of the invalid formats of date :

    (?x)
    (\A|(?<=\s))  (?!(0[1-9]|[12][0-9]|3[01])  ([./-])  (0[1-9]|1[0-2])  \3  (19|20)?[0-9]{2})  \d+[^\d\r\n]+\d+[^\d\r\n]+\d+(?=\s|\z) |
    (?<=[\l\u_])  (  (0[1-9]|[12][0-9]|3[01])  ([./-])  (0[1-9]|1[0-2])  \8  (19|20)?[0-9]{2}  )  |  (?6)  (?=[\l\u_])
    

    • Secondly, AFTER correction of the erroneous formats, in your file, use a regex to detect all possible erroneous days, in dates, such as those, below, relative to months 02, 04, 06, 09, 11 :

    30.02.00
    31.02.2016	
    29/02/12
    29.02.18
    31.06.2018
    31.11.18
    

    Here is, below, the regex D, which catches most of the invalid days, in well-built dates :

    (\A|(?<=\s))(31[./-](04|06|09|11)|(29|30|31)[./-]02)(?=[./-])
    

    Beware ! The regex D will, also, detect a few correct dates such as 29/02/12 ! So, it’s up to you to decide IF it’s a valid leap date or NOT :-))
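Regex D can be exercised the same way. A small Python sketch, with the hypothetical helper `suspicious_day`, shows both the intended hits and the leap-day false positive mentioned above:

```python
import re

# Regex D: day 31 in a 30-day month, or day 29/30/31 in February.
# It only looks at day and month, never at the year.
REGEX_D = re.compile(
    r"(\A|(?<=\s))(31[./-](04|06|09|11)|(29|30|31)[./-]02)(?=[./-])"
)


def suspicious_day(text: str) -> bool:
    """True if the text contains a day/month pair that needs a manual look."""
    return REGEX_D.search(text) is not None


for text in ["30.02.00", "31.06.2018", "29/02/12", "28.02.18", "31.12.18"]:
    print(f"{text!r}: {suspicious_day(text)}")
```

29/02/12 is flagged even though 2012 was a leap year, because the year never enters the match.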


    Now, after correction of these possible erroneous days, in dates, NO invalid date should exist anymore :-)) Indeed, just verify by counting the total matches, running regex A, then regex B => You should notice the same number of matches !

    Best Regards

    guy038

    P.S. :

    Oh, I’ve just seen that the regex A ( the huge one, you remember ! ) does produce one wrong match. Indeed, regex A considers 29/02/1900 a valid date. That’s totally wrong, because 1900 is not a leap year, at all :-))



  • Well, you know, my post was merely pointing out a wrong statement in the wiki. ;) And offering a short, easy-to-explain, better example that:

    • catches all possible dates given the format: (d)d*(m)m*(yy)(y)y (where * could be any of “.-/”)
    • does not allow 0 or 00 as day/month (assuming you won’t have a date like 0.0.0 that you actually need)
    • makes sure there are no other numbers around, to exclude e.g. IPs, while (nearly) making sure that just dates, and nothing else, are matched.
    • assumes the dates are correct, apart from some misplaced dots
    • probably most important, as it is what the wiki provides: an easy find/replace solution for those unfamiliar with regex. (I did want that \3-\2-\1 to make clear what gets replaced)

    (one can e.g. change [- \/.] -> [^\w\r\n:] too, and test it e.g. at https://regex101.com/r/1rp7Uc/1/)

    Nevertheless, since I’m referenced: on the first regex A, you’re right that 1900 is allowed - (19|20)?([02468][048]|[13579][26]) - one can clearly see why. You need to make sure that a year divisible by 100 only counts as a leap year if it is also divisible by 400. The regex in my first link does provide a correct one: ((1[6-9]|[2-9]\d)? (0[48]|[2468][048]|[13579][26])| ((16|[2468][048]|[3579][26]) 00)) (also showing regex’s weakness … :4 :100 :400 … a clear, one-time job for any given date using scripts, … but in regex you need new and comparatively complex code for every new milestone. That’s why I did not go into any detail, and provided a link with further information instead.)
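The :4 :100 :400 rule is indeed a one-liner in a script. A minimal Python sketch compares it with the leap-year fragment of regex A; the names `is_leap` and `regex_a_accepts` are illustrative only.

```python
import calendar
import re

# The year fragment regex A uses on its "29 February" line.
REGEX_A_LEAP_YEAR = re.compile(r"(19|20)?([02468][048]|[13579][26])")


def regex_a_accepts(year_text: str) -> bool:
    """True if regex A's fragment treats this year as a leap year."""
    return REGEX_A_LEAP_YEAR.fullmatch(year_text) is not None


def is_leap(year: int) -> bool:
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


for year in (1900, 1996, 2000, 2018):
    print(year, regex_a_accepts(str(year)), is_leap(year), calendar.isleap(year))
```

Only 1900 differs: the fragment only looks at the last two digits, so it cannot tell 1900 from 2000.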

    As for regex C/D: that is probably the better solution if your data is a total mess. But you have to look through the matches, as you probably don’t want to replace all of them. And there is still one case you won’t catch (the wiki talks about a document with dates in it, therefore end/start of sentence):

    … falls on the date of 10.11.12. That is a totally legal case! …
    … it.10.11.12 has some typo - there should be a whitespace but …



  • Hi, @kusalux, and All,

    Yeah ! It’s always the same story ! Everyone who codes, or even builds a regex, should think about all aspects of the program or the regex to create, before writing the first line ;-))

    Regarding your analysis of valid and invalid dates, another person could have a totally different view of what is right or not ! Theoretically, in your case, you should list all possible valid dates and all possible formats of date that you consider as right ones ! I do understand that this is, generally, not an easy task !

    For instance, you said that the syntax 10.11.12. should be taken as valid. But, now, the syntax 10.11.12.13 looks rather like an IP address ! And if your last example it.10.11.12 is supposed to be correct, too, then a date embedded in text, like, for instance, abc.27.06.12.def should be a valid format of date !

    Of course, when building the regexes, in my previous post, I just imagined them, according to my way of considering correct formats of date and how they should be located among all surrounding text ! Now, let’s imagine that you had previously defined the list of all valid formats of date ( for YOU ! ) then, half the work would be already done and the right regex should be easier to achieve :-))

    In addition, you may look at the date problem differently, later, when working on different files ! Indeed, regexes strongly depend on the current file contents, which can lead to very different regexes for similar kinds of search ! And, honestly, when a regex seems too specific or complicated, and can be used in very few cases, only, I rather think that it’s not a useful regex, anyway :-))

    If you do need information on regexes, I could help you, but I’m sure that you can cope with these strange things ( the regexes ) by yourself :-))

    Cheers,

    guy038

    P.S. :

    Above all, I forgot to point out, that a script or a programming language is a better way to verify dates than regexes, anyway !
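To make that last point concrete, here is a minimal Python sketch that lets the standard `datetime` module do the calendar checks. The regex only handles the shape dd.mm.yy[yy] with one consistent separator. Mapping two-digit years to 20yy is purely an assumption (exposed as the hypothetical `century` parameter), as is the function name `is_valid_date`.

```python
import re
from datetime import date

# Shape only: two digits, separator, two digits, same separator, 2 or 4 digits.
DATE_SHAPE = re.compile(r"([0-9]{2})([./-])([0-9]{2})\2([0-9]{2}(?:[0-9]{2})?)")


def is_valid_date(text: str, century: int = 2000) -> bool:
    """Return True if text is dd?mm?yy[yy] and names a real calendar day."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return False
    day, _separator, month, year = match.groups()
    year_number = int(year)
    if len(year) == 2:
        year_number += century  # assumption: two-digit years mean 20yy
    try:
        date(year_number, int(month), int(day))  # raises ValueError if impossible
    except ValueError:
        return False
    return True


for text in ["29.02.2016", "29/02/1900", "31-04-18", "22/02-18", "28.02.18"]:
    print(f"{text!r}: {is_valid_date(text)}")
```

Month lengths and leap years, including the 1900 case, are handled by `date` itself, so no per-month alternatives are needed.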

