    multi-word expressions across lines

    • Ivan Ghio

      Hello everybody,

      I hope you are well. I was wondering whether somebody could help me finalise a regular expression I’ve been bogged down with for a while.
      I am trying to capture those expressions where gay.* and homosexual.* appear with HIV and AIDS (the order of the two limits can be reversed).

      The regex I came up with is the following:
      \b(gay|homosex|hiv|aids|disease|virus|condition) ([A-Z0-9.,-;/]+ ){1,500}(gay|homosex|hiv|aids|disease|virus|condition)\b

      but unfortunately a few expressions where the virus and the sexuality appear together in the text are missed by the regex. In the small text attached below, only two expressions are captured: ‘AIDS and HIV’ in the title (I don’t understand why, as I wanted the sexuality and the virus to appear together) and AIDS…homosexual (in the penultimate sentence).

      Thank you so much for your help and contributions. Looking forward to them,

      Ivan

      Latest AIDS and HIV figures for Scotland

      During April to June 1996, 35 cases of HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual/bisexual males, six in injecting drug users and four in persons who were probably infected heterosexually; five cases are as yet undetermined.Thirteen of the cases were from Lothian and thirteen from Greater Glasgow; 29 cases were male.The cumulative total for HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual/bisexual males, 1,102 (44.9%) are injecting drug users and 391 (15.9%) are thought to have been infected heterosexually.The cumulative total of HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual/bisexual males but 5,060 are in the heterosexual non-injecting drug user category.During the second quarter of 1996, 20 cases of AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual/bisexual males, four injecting drug users and five in the heterosexual category. Ten of the 20 cases were from Lothian, six from Grampian and four from Greater Glasgow. The cumulative total for AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual/bisexual males, 279 in injecting drug users and 114 in heterosexuals. For the UK as a whole, 12,976 cases of AIDS have now been registered, of which 9,344 have been in homosexual/bisexual males.

      • guy038

        Hello Ivan,

        Probably, a nicer regex to achieve what you’re looking for, would be :

        (?i)(HIV|AIDS).+?(gay|homosexual)|(?2).+?(?1)

        Notes :

        • First of all, the modifier (?i) forces the search to be case insensitive. Then :

        • The part (HIV|AIDS) looks for one of the two possible names of that disease

        • The part (gay|homosexual) tries to match one of the names, for that sexuality

        • The part .+? represents the shortest non null amount of text, between the strings HIV or AIDS AND gay or homosexual !

        Else :

        • The subroutine call (?2) refers to the group 2 (gay|homosexual)

        • The subroutine call (?1) refers to the group 1 (HIV|AIDS)

        • Again, the part .+? represents the shortest non null amount of text between the strings gay or homosexual AND the strings HIV or AIDS !
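For anyone trying this outside Notepad++, here is a minimal Python sketch of the same idea. It is an adaptation rather than the exact regex above: Python's built-in `re` module has no `(?1)` / `(?2)` subroutine calls, so both alternatives are written out in full. The names `DISEASE`, `SEXUALITY` and the `sample` sentence are made up for illustration.

```python
import re

# Assumption: Python's `re` lacks subroutine calls such as (?1) and (?2),
# so each alternative repeats the two word lists explicitly.
DISEASE = r"(?:HIV|AIDS)"
SEXUALITY = r"(?:gay|homosexual)"

pattern = re.compile(
    rf"{DISEASE}.+?{SEXUALITY}|{SEXUALITY}.+?{DISEASE}",
    re.IGNORECASE,  # plays the role of the (?i) modifier
)

# Hypothetical two-sentence sample, one hit for each word order.
sample = (
    "20 cases of AIDS were reported; ten were homosexual/bisexual males. "
    "The total is 750 homosexual/bisexual males, infected by the HIV."
)

matches = [m.group(0) for m in pattern.finditer(sample)]
for text in matches:
    print(text)
```

As in Notepad++, `.` does not cross line breaks here unless the `re.DOTALL` flag is added.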

        With your given text, to which I added the simple sentence, below, in order to test the second case :

        The cumulative total is 750 homosexual/bisexual males, infected by the HIV, on June 30, 1996
        

        I obtained the 7 captures of text, below :

        1    HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual
        2    HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual
        3    HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual
        4    AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual
        5    AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual
        6    AIDS have now been registered, of which 9,344 have been in homosexual
        7    homosexual/bisexual males, infected by the HIV
        

        Hope that this slicing is what you’re looking for :-)

        Best regards,

        guy038

        P.S :

        • The quantifiers * , + , ? , {n,m} and {n,} are considered as greedy quantifiers

        • The quantifiers *? , +? , ?? , {n,m}? and {n,}? are considered as lazy quantifiers

        Some examples :

        Given the subject string : aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa

        • The regex a.+3 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333

        • The regex a.+?3 captures the string aaaaaaa3

        • The regex a.+333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333

        • The regex a.+?333 captures the string aaaaaaa333

        • The regex a.+33333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333

        • The regex a.+?33333 captures the string aaaaaaa333aaaaaaa33333
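The six greedy and lazy examples above can be reproduced with a short Python sketch. Python's `re` module treats these quantifiers the same way for this subject string; the `results` dictionary is just an illustrative name.

```python
import re

subject = "aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa"

regexes = (r"a.+3", r"a.+?3", r"a.+333", r"a.+?333", r"a.+33333", r"a.+?33333")

# For each regex, keep the first (leftmost) match, as a single search would.
results = {regex: re.search(regex, subject).group(0) for regex in regexes}

for regex, found in results.items():
    print(f"{regex:<10} {found}")
```

The greedy forms run to the last possible occurrence of the closing digits, while the lazy forms stop at the first one.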

        • Ivan Ghio

          Hello guy038,

          Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.

          I found your expression very useful, as there is no doubt yours is nicer and more powerful.
          I ran it across a sample of my corpus and I noticed that the regex captures all the text lying between HIV/AIDS and gay or homosexual. This means that if the text contains a first reference to HIV/AIDS and, halfway through, a second reference to gay/homosexual (or the reverse order, as your regex allows), the entire chunk of text in between is captured. This poses a problem, however, because the captures found may not identify a real connection between gay and HIV.

          That is why I was thinking that maybe I should put some parameters (number of words), but it definitely doesn’t capture everything.

          I’ve attached a text where the above occurs.

          Would you recommend using wild cards to capture also gays or homosexuals?

          Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!

          Best wishes,

          Ivan

          He was just three months early.’
          Eric ‘Eazy-E’ Wright died on March 26, 1995, from complications following Aids -
          a combination of a collapsed lung causing heart failure and pneumonia. He was
          just 31 and had checked himself into hospital with chronic breathing problems
          only a month before, completely unaware he was carrying a fatal virus. He left a
          wife, Tomica, whom he married in hospital. They were together for four years and
          had two children, Dominic, two, and Deijah, born six months after her father’s
          death. All have tested negative for HIV.

          Long before he passed away, Eric Wright had indeed put Compton on the map. And
          as the founding father of gangsta rap, he was arguably one of popular culture’s
          most influential figures of the last quarter of the century. As an entrepreneur,
          he was an inspiration to millions, and with his death he metamorphosed swiftly
          from mogul to martyr. Rarely can such a short life have been so symbolic.

          Wright failed to stay the distance at Dominguez High School in Compton, and by
          the early Eighties was a regular hustler, a dealer in crack and pot. It must
          have been the lure of money or excitement that enticed him, for the home
          provided for him and his younger brother and sister was a stable one. His mother
          was a Montessori teacher, his father, a retired post office worker and sometime
          musician who had a big hit with the 103rd Street Rhythm Band back in the early
          Seventies with Express Yourself, later covered by NWA. (Both are still alive,
          though his father recently suffered a stroke.) Some of the streets in Compton
          are strangely quaint: rows of polite bungalows fronted by porches and lawns
          enriched by sub-tropical weather. Some are unremittingly grim, and the main
          thoroughfare, Long Beach Boulevard, is a sorry strip of boarded-up businesses,
          dishevelled lots, soiled fast-food joints and two-bit stores. Wright came from a
          pleasanter part, and it may seem he was self-consciously ‘dropping down’ by his
          first choice of career. But with US ghettos it is wise to recall the phrase, ‘it
          takes a village to raise a child’. A kid is part of the street environment, like
          it or not. And Eric Wright obviously did.

          A childhood friend, Big Man (he declined to give his real name), who bears some
          resemblance to the splendidly sonorous and rotund singer Barry White - though
          his voice is a few notes higher and his girth a couple of feet narrower -
          remembers: ‘Unlike Eric, I went all the way through school.’ His smile broadens:
          ‘Right in the front, right out the back, I never stayed for a class, though I
          did have a dice game third period every day. I didn’t run into him on campus too
          much.’ As an adult, Wright was so small - a slender 5ft 4in - that everyone
          presumed he was a drop-out kid. Most didn’t know his real age until he died.
          But he had presence - the presence of money, even when he didn’t have that much.
          People noticed him strutting down the street. He wore the same trademark clothes
          as the other home boys, but less baggy, with more style. To rib him, close
          friends called him ‘casual’. Perhaps his first vehicle, a psychedelically-painted
          truck, wasn’t so hip, but Eric - he was always Eric or Little Man,
          never Eazy, to his friends - had respect.

          He started organising parties with a friend, Andre Young, aka Dr Dre, who had
          real talent, and was a member of the World Class Wreckin’ Cru with Antoine
          Carraby, aka DJ Yella. Dre began telling Wright what he knew of the music
          business. Smart enough to know drug dealing couldn’t last forever, Eric wanted
          out. Towards the end of 1985 he took, and passed, the test to join the post
          office. But he had also seen that music, like drugs, offered power, and decided
          to act.

          At the time west coast rap was lame, party stuff about the good times - where
          people wanted to be, where they would be soon. It came a poor second to the
          east coast, where hip-hop had begun in the late Seventies and burst across the
          world with the Sugar Hill Gang’s Rapper’s Delight.

          In the spring of 1986, aged 22, Wright assembled the best talent in the 'hood:
          Dre, Yella, MC Ren (Lorenzo Patterson) and Ice Cube (O’Shea Jackson). He paid
          for studio time and urged Cube, the best rapper and songwriter, to write
          something ‘real’, something about the gangs, about the life he’d been leading.
          The result was Boyz 'N The Hood (later the title of a film) and the birth of
          NWA. Wright paid $ 7,000 for 10,000 12-inch records, and he and Dre would drive
          around - by now in a burgundy Suzuki Samurai - selling the discs at swap-meets,
          a cross between a flea market and a car-boot sale. By word of mouth alone, the
          record sold 500,000 copies.

          The next year, Wright hooked up with Jerry Heller, a music business veteran, who
          recognised this hardcore stuff as the next big thing. He took Wright to Priority
          Records, where the boy from the 'hood talked his way into a unique deal:
          Priority would distribute records on Wright’s own label, Ruthless, and Wright
          would have, for a newcomer, an unheard of piece of the action.

          In 1988, Eazy-E’s first solo album, Eazy Duz It, and NWA’s Straight Outta
          Compton, sold more than 5.5 million copies between them - without a single play
          on radio or television. Suddenly they were stars - a lifestyle that’s easy to
          slip into in LA. Eazy bought a $ 1 million house in a cul-de-sac in Westlake, a
          suburb predominated by retired people. Dre and Yella had a house next door and
          they shared a party house over the road. By night, they were the people everyone
          wanted to know.

          As a rapper, Wright had a distinctive, high-pitched, brattish voice, but was not
          one of the very best. It could take him hours to get his part down right. Nor
          was he a great songwriter. But he cemented NWA and determined their direction.
          MC Ren recalls: ‘It just wouldn’t have happened without him. Even though he
          wasn’t writing - it was his voice that grabbed everybody. And he had the idea of
          putting that shit together, that all-star group. He saw something I didn’t see
          and that shit just clicked. Others were doing something like it but not to the
          full depth of what we were doing. It shocked a lot of motherfuckers.’ Even the
          group’s full name - Niggaz With Attitude - was a challenge to a liberal
          mainstream that had shunned the n-word in the post-Martin Luther King era. Now
          it was thrust back in their white faces, worn as a badge of honour, a viciously
          ironic statement that, dammit, young blacks still felt they were treated like
          niggers. Gangsta rap, like all black American music, was rooted in the Blues -
          with its lyrical expressiveness and rhythmic foundations - but added to the
          plaintiveness of old was an angry, cussing criticism of authority that
          represented a new departure.

          NWA, and the genre they spawned, represented a nightmare for the respectable
          world - and many feminists: a live and kicking validation of foul language,
          violence and misogyny. They were reprimanded by the FBI over their song Fuck Tha
          Police on Straight Outta Compton, and joined a line of demonised performers that
          began with Elvis and continued through Jimi Hendrix, Sly Stone, the Rolling
          Stones and the Sex Pistols. With their baseball caps, baggies and snow-white
          trainers, they influenced fashion across the world. Their tales of drugs, police
          intimidation, gang violence and alienation jump-started a new brand of American
          film-making and told the world of black urban angst long before the Rodney King
          beating and subsequent LA riots. America’s political leaders had to listen.
          Attitude had arrived.

          By the time NWA’s 1991 album Efil4Zaggin (Niggaz4Life spelt backwards) became
          the first rap album to top the US Billboard charts, middle-class white kids
          across America, and indeed Europe, were draping themselves in polyester cotton,
          greeting each other with high fives and ‘Yo!’, and were fully acquainted
          vicariously with the hip Hades of motherfuckers, bitches and hoes.

          Ruthless Records artists have sold more than 28 million records, with 21
          recordings reaching gold or platinum status. After NWA split, Ice Cube and Dr
          Dre soared. Eazy-E’s final album, Str8 Off Tha Streets Of Muthaphu**in Compton,
          which is released on Monday, is sure to hit the top back home. Priority Records
          (now separate from Ruthless) has meanwhile produced a greatest hits package,
          Eternal E, also available here.

          Wright once quoted his personal wealth at $ 60 million and Ruthless’s value at $
          20 million. For a time, it was one of the most successful independent labels of
          all time, and inspired many imitators, many rivals. For black Americans, Wright
          became a beacon of entrepreneurship, even more so than Russell Simmons of Def-
          Jam Records on the east coast, because no one, but no one, had ever come off the
          streets, turned their back on dealing dope and made it big.

          But within days of his death, a struggle for his legacy was underway. Michael
          Klein, the business manager at Ruthless, disputed Wright’s will which left the
          company to Tomica, claiming Wright handed half of it over to him in 1992.
          Tomica, a former assistant to the chairman of Motown, promised her dying husband
          she would keep the company alive. Klein is also questioning Wright’s state of
          mind when he married and the validity of making his wife a trustee of his
          estate.

          ‘Tomica has the ambition and the ability to run the company. She has worked in
          the business and knows the ins and outs and she is a bright woman, she’s no
          airhead,’ says Ernie Singleton, formerly head of MCA’s black music division who
          was brought in as acting president of Ruthless by the California Superior
          Courts.

          ‘It was a real love affair,’ says a family friend. ‘She certainly loved him. And
          Eric, he was obviously no goody-two shoes but he was extremely devoted. He
          trusted her judgment.’ Ruthless is now back up on its feet - the doors were
          padlocked for several weeks last spring after things started disappearing - but
          the legal grind could, according to music industry experts, take another three
          years. Ruthless may be whittled away in lawyers’ fees, taking Wright’s dream of
          a large-scale, multi-media black company with it.

          In America, there isn’t the pressure there is here on artists to produce fresh
          sounds. Fans stay loyal. And from Eazy they expected the familiar rough stuff,
          which largely explains why the subject matter on the first Eazy-E album and the
          last is so similar, though musically the latter is richer, more layered, bumping
          along in a smooth groove. Alongside the usual bravado and gang tales, there are
          tracks such as Nuts On Ya Chin and Lickin, Suckin And Fuckin. The heavy sex
          content of Eazy’s raps, the brazen assertion of sexual prowess, the arrogant
          expectation of female submission, lend a grim irony to his death.

          It almost goes without saying that he came from an environment where Aids is
          very largely viewed as a gay disease. ‘It does have that image,’ says Cassandra
          Ware, vice-president of Ruthless Records and a friend of Wright’s before she
          arrived at the company. 'Because of that heavy machismo, like, ‘never will I
          take it in the butt, so I’ll be all right’. And Eric was one of those machismos
          - ‘I’m a star, killer, handsome buck of a man. Any woman who comes my way I’ll
          ride her’ - and it happened.’
          • Ivan Ghio

            @guy038
            Hello guy038,

            Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.

            I found your expression very useful, as there is no doubt yours is nicer and more powerful.
            I ran it across a sample of my corpus and I noticed that the regex captures the text that is comprised within the HIV/AIDS and gay or homosexual. This means that if in the text there is a first reference to HIV/AIDS and halfway through the text a second reference to gay/homosexual (references can be inverted as suggested in your regex too), the entire chunk of text is captured. This however poses a problem because the captures found may not identify a real connection between gay and HIV.

            That is why I was thinking that maybe I should put some parameters (number of words), but it definitely doesn’t capture everything.

            I’ve attached a text where the above occurs.

            Would you recommend using wild cards to capture also gays or homosexuals?

            Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!

            Best wishes,

            Ivan

            (PS: the text is in my previous post. Apologies if this message is a repetition of the previous one. I don’t mean to harass you with lots of questions. I just thought that you might not have seen the previous reply because I hadn’t tagged you. Many thanks again for your generous help!)

            • guy038

              Hi Ivan,

              Sorry for my late reply. I tried to think deeply about your problem :-)

              I’m wondering… Looking for any significant word in your second text, with the simple regex gay|homosexual|hiv|aids|disease|virus|condition, I only found :

              • The three words Aids, virus and HIV, in that order, inside the first paragraph

              • The words Aids, gay and disease, in that order, inside the last paragraph

              So, I’m a bit confused. Which relationship between these words would you like the regex to detect ?

              To my mind, the true problem is :

              • How many couples of words do you consider valuable to search for ?
              • Which minimum/maximum distance must separate the two words, of a couple ?

              It’s quite a difficult question, which seems to be more a linguistic matter than a regex matter ! Even worse, you might want a maximum distance d1 between one pair of words and another maximum distance d2 between a different pair !
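To make the distance question concrete, here is one possible Python sketch of a proximity search that allows at most a chosen number of words between the two terms. Everything in it is an assumption for illustration: the helper name `proximity_pattern`, the `max_gap` parameter and the choice to count a "word" as a run of `\w` characters. The trailing `\w*` is one simple way to also accept plurals such as gays or homosexuals.

```python
import re

DISEASE = r"(?:HIV|AIDS)"
SEXUALITY = r"(?:gay|homosexual)\w*"  # \w* also accepts "gays", "homosexuals"

def proximity_pattern(max_gap: int) -> re.Pattern:
    """Match the two terms, in either order, with at most max_gap words between."""
    # Up to max_gap intervening words, each preceded by non-word characters.
    gap = rf"(?:\W+\w+){{0,{max_gap}}}?\W+"
    return re.compile(
        rf"\b{DISEASE}\b{gap}{SEXUALITY}\b|\b{SEXUALITY}\b{gap}{DISEASE}\b",
        re.IGNORECASE,
    )

sentence = "an environment where Aids is very largely viewed as a gay disease."

near = proximity_pattern(6).search(sentence)  # six words lie between the terms
far = proximity_pattern(3).search(sentence)   # too strict for this sentence

print(near.group(0) if near else None)
print(far.group(0) if far else None)
```

Choosing the right `max_gap`, and whether it should differ per pair of words, remains a linguistic decision rather than a regex one.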

              I’m quite optimistic about building regexes for specific purposes. That’s not the main problem. As usual, you just need to define, exactly, what you expect :-))


              To fully understand the complexity of the simple search of TWO words, inside any text ( although it could have seemed easy, at first sight ! ), let’s consider the example text below, in one line :

              1----A----1----B----1----C----9----D----9----E----1----F----9----G----1----H----1----I----9----J----
              

              Well. Suppose you’re looking for ranges of characters which either :

              • Begin with the digit 1 and end with the digit 9

              • Begin with the digit 9 and end with the digit 1

              Between these two limits 1 and 9, there are ten ranges of characters, from zone A to zone J

              Several interpretations are possible :

              • With the regex 1\K.*?(?=9)|9\K.*?(?=1), we obtain 5 ranges of characters, with their digit limits, below :

                • Zone 1----A----1----B----1----C----9
                • Zone 9----D----9----E----1
                • Zone 1----F----9
                • Zone 9----G----1
                • Zone 1----H----1----I----9
              • With the regex 1\K(?=.*?9)|9\K(?=.*?1), we obtain 9 ranges of characters, with their digit limits, below :

                • Zone 1----A----1----B----1----C----9
                • Zone 1----B----1----C----9
                • Zone 1----C----9
                • Zone 9----D----9----E----1
                • Zone 9----E----1
                • Zone 1----F----9
                • Zone 9----G----1
                • Zone 1----H----1----I----9
                • Zone 1----I----9
              • With the regex 1\K[^1\r\n]*?(?=9)|9\K[^9\r\n]*?(?=1), we obtain 5 ranges of characters, with their digit limits, below :

                • Zone 1----C----9
                • Zone 9----E----1
                • Zone 1----F----9
                • Zone 9----G----1
                • Zone 1----I----9

              Notes :

              • With the first regex, the regex engine matches, from cursor location, either :

                • The shortest next range 1…9
                • The shortest next range 9…1

              So, the regex engine alternately finds a 1…9 zone, then a 9…1 zone, then a 1…9 zone, and so on…

              • With the second regex, the regex engine matches, from cursor location, either :

                • The zero length location, just after the next limit 1, which begins a 1…9 zone
                • The zero length location, just after the next limit 9, which begins a 9…1 zone
              • With the third regex, the regex engine matches, from cursor location, either :

                • The shortest next range 1…9, which does NOT contain any limit 1, inside that range
                • The shortest next range 9…1, which does NOT contain any limit 9, inside that range
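The three interpretations can also be checked with a Python sketch. This is an approximation, not the Notepad++ regexes themselves: Python's `re` module has no `\K`, so each `1\K` or `9\K` is swapped for a lookbehind, which gives the same matches on this example text. The helper `zones` is a made-up name; it widens each match by one character on both sides so that the two limit digits are shown.

```python
import re

text = ("1----A----1----B----1----C----9----D----9----E----"
        "1----F----9----G----1----H----1----I----9----J----")

# Assumption: lookbehinds stand in for \K, which Python's `re` does not support.
FIRST = r"(?<=1).*?(?=9)|(?<=9).*?(?=1)"
SECOND = r"(?<=1)(?=.*?9)|(?<=9)(?=.*?1)"
THIRD = r"(?<=1)[^1\r\n]*?(?=9)|(?<=9)[^9\r\n]*?(?=1)"

def zones(pattern: str, subject: str) -> list[str]:
    """Return every match together with the limit digit on each side."""
    return [subject[m.start() - 1 : m.end() + 1]
            for m in re.finditer(pattern, subject)]

first_zones = zones(FIRST, text)
third_zones = zones(THIRD, text)
# The second regex only matches zero-length locations, so list their offsets.
second_positions = [m.start() for m in re.finditer(SECOND, text)]

print(first_zones)
print(second_positions)
print(third_zones)
```

The first and third searches each return five zones, and the second returns nine zero-length positions, one just after each limit that starts a zone.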

              So, in that example, which kind of search would be pertinent ( or a new one, different from the three above ! ), to your mind ?!

              See you later !

              Best regards,

              guy038

              P.S. :

              If the example text is split in several lines, as below :

              1----A----1---
              -B----1----C-
              ---9----D----9--
              --E----1--
              --F----9----G---
              -1----H--
              --1----I---
              -9----J----
              

              The 3 regexes, described above, must be rewritten, as below :

              1\K(?s).*?(?=9)|9\K(?s).*?(?=1)

              1\K(?s)(?=.*?9)|9\K(?s)(?=.*?1)

              1\K(?s)[^1]*?(?=9)|9\K(?s)[^9]*?(?=1)
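A small Python sketch of the multi-line case, under the same assumption as before (lookbehinds instead of `\K`). In Python the dot-matches-newline behaviour of `(?s)` is requested with the `re.DOTALL` flag, passed here through a hypothetical `zones` helper. Only the first regex is shown.

```python
import re

lines = [
    "1----A----1---", "-B----1----C-", "---9----D----9--", "--E----1--",
    "--F----9----G---", "-1----H--", "--1----I---", "-9----J----",
]
wrapped = "\n".join(lines)

# Assumption: lookbehind form of the first regex, since `re` has no \K.
FIRST = r"(?<=1).*?(?=9)|(?<=9).*?(?=1)"

def zones(pattern, subject, flags=0):
    """Return every match together with the limit digit on each side."""
    return [subject[m.start() - 1 : m.end() + 1]
            for m in re.finditer(pattern, subject, flags)]

without_dotall = zones(FIRST, wrapped)           # dot stops at each line break
with_dotall = zones(FIRST, wrapped, re.DOTALL)   # dot may cross line breaks

# Remove the line breaks to compare with the single-line zones.
flattened = [zone.replace("\n", "") for zone in with_dotall]
print(without_dotall)
print(flattened)
```

With this particular line splitting, no line contains both limits of a zone, so the search finds nothing until the dot is allowed to cross line breaks.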

              • Ivan Ghio

                Hi @guy038,

                Only now did I notice that my answer thanking you for your help didn’t go through for some reason. I very much appreciated your help and your advice on my issue.

                In the end, what I did was to apply a slightly tweaked version of the first regex. I manually disambiguated all the possible combinations and chunks of text that contained the parameters ‘gay, gays, homosexual, Aids, HIV…’ and assessed the relevance and correctness of the results, text by text.

                Thank you again for your help.

                Keep in touch and have a good day,

                Ivan

                The Community of users of the Notepad++ text editor.