Regex for Searching <HEAD> Section



  • I want to use Notepad++ to find soft hyphen characters (ISO 8859: 0xAD, Unicode U+00AD SOFT HYPHEN, HTML: &#173; or &shy;) in the <head> section of my HTML files. I tried the two regular expressions below, but both return zero hits.

    <head>.*?­.*?</head>
    <head>.*­.*</head>
    

    Curiously, the following regex does find soft hyphens in <figcaption> sections:

    <figcaption>.*?­.*?</figcaption>
    

    I suspect the issue is that the <head> section contains newlines. I tried the search with the “. matches newline” option both checked and unchecked, and still got zero hits either way.

    Is there a way to do this kind of search in Notepad++?



  • @aksarben

    I think the code blocks you used above are hiding your soft-hyphen character, at least visually. I find that if I copy and paste them into Notepad++, the soft-hyphen character reappears.

    Anyway, I would try searching for: (?s)<head>.*?\x{00AD}.*?</head>
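    For anyone who wants to check the idea outside Notepad++, here is a minimal Python sketch of the same pattern. It is only an illustration under stated assumptions: the sample HTML and the function name are made up, and the soft hyphen is written as \xAD because Python's re module does not accept the \x{00AD} form. It shows that the match succeeds only when "." is allowed to cross line breaks.

```python
import re

# Hypothetical sample page; "\u00ad" is the soft hyphen (U+00AD).
SAMPLE_HTML = (
    "<html>\n"
    "<head>\n"
    "<title>Hyphen\u00adation test</title>\n"
    "</head>\n"
    "<body><p>No soft hyphen here.</p></body>\n"
    "</html>\n"
)

# Same shape as the pattern above; (?s) lets "." match newlines.
HEAD_WITH_SHY = re.compile(r"(?s)<head>.*?\xAD.*?</head>")

# Identical pattern without (?s): "." stops at every line break.
HEAD_WITH_SHY_NO_DOTALL = re.compile(r"<head>.*?\xAD.*?</head>")


def head_contains_soft_hyphen(html):
    """Return True if a <head>...</head> span containing U+00AD is found."""
    return HEAD_WITH_SHY.search(html) is not None


if __name__ == "__main__":
    print(head_contains_soft_hyphen(SAMPLE_HTML))
    print(HEAD_WITH_SHY_NO_DOTALL.search(SAMPLE_HTML) is not None)
```

    Writing the character as an escape rather than pasting it also sidesteps the problem of an invisible character getting lost when the pattern is copied.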

    I think there have been some recent postings about Unicode characters used explicitly in the Find-what box of the Find dialog not working correctly…?



  • Hello, @aksarben, @alan-kilborn and All,

    Simply use this regex S/R:

    SEARCH (?s)(.*?<head>|\G)((?!</head>).)*?\K\xAD

    REPLACE Any SINGLE character or STRING

    Notes :

    • I assume, of course, that there is only one <head>........</head> section per file

    • The <head>........</head> section can be either on a single line or split across several lines

    • Any soft hyphen located before the opening <head> tag is ignored

    • Each soft hyphen between the opening and closing tags is matched individually

    • Any soft hyphen located after the closing </head> tag is ignored

    • When testing on a single file, it is best to tick the Wrap around option, which forces the S/R process to start from the very beginning of the file
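    The behaviour described in the notes above can be approximated in Python, for instance to batch-check files. This is a rough sketch, not a port of the regex: Python's standard re module has no \G or \K, so the code first isolates the <head> section and then looks at each soft hyphen inside it. The function names are hypothetical, and it assumes a single, properly closed <head>........</head> section.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Assumes at most one <head>...</head> section per file; (?s) lets "."
# cross line breaks so a multi-line section is captured as a whole.
HEAD_SECTION = re.compile(r"(?s)<head>(.*?)</head>")


def soft_hyphen_offsets_in_head(html):
    """Return the offset of every soft hyphen located inside <head>."""
    match = HEAD_SECTION.search(html)
    if match is None:
        return []
    start = match.start(1)
    return [start + i for i, ch in enumerate(match.group(1)) if ch == SOFT_HYPHEN]


def replace_soft_hyphens_in_head(html, replacement=""):
    """Replace soft hyphens inside <head> only; the rest is left untouched."""
    match = HEAD_SECTION.search(html)
    if match is None:
        return html
    cleaned = match.group(1).replace(SOFT_HYPHEN, replacement)
    return html[:match.start(1)] + cleaned + html[match.end(1):]


if __name__ == "__main__":
    page = (
        "<!-- pre\u00adamble -->\n"
        "<head>\n<title>Soft\u00adhyphen</title>\n</head>\n"
        "<body>co\u00adoperate</body>\n"
    )
    print(soft_hyphen_offsets_in_head(page))
    print(repr(replace_soft_hyphens_in_head(page, "-")))
```

    One difference worth keeping in mind: if the closing </head> tag is missing, this sketch finds nothing, whereas the regex above would keep matching soft hyphens up to the end of the file.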

    Best Regards,

    guy038

