multi-word expressions across lines



  • Hello everybody,

    I hope you are well. I was wondering whether somebody could help me finalise a regular expression I’ve been bogged down with for a while.
    I am trying to capture those expressions where gay.* and homosexual.* appear with HIV and AIDS (the order of the two limits can be reversed).

    The regex I came up with is the following:
    \b(gay|homosex|hiv|aids|disease|virus|condition) ([A-Z0-9.,-;/]+ ){1,500}(gay|homosex|hiv|aids|disease|virus|condition)\b

    Unfortunately, a few expressions in which the virus and the sexuality both appear in the text are missed by the regex. In the short text attached below, only two expressions are captured: ‘AIDS and HIV’ in the title (I don’t understand why, as I wanted the sexuality and the virus to appear together) and AIDS…homosexual (in the penultimate sentence).

    Thank you so much for your help and contributions. Looking forward to them,

    Ivan

    Latest AIDS and HIV figures for Scotland

    During April to June 1996, 35 cases of HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual/bisexual males, six in injecting drug users and four in persons who were probably infected heterosexually; five cases are as yet undetermined.Thirteen of the cases were from Lothian and thirteen from Greater Glasgow; 29 cases were male.The cumulative total for HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual/bisexual males, 1,102 (44.9%) are injecting drug users and 391 (15.9%) are thought to have been infected heterosexually.The cumulative total of HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual/bisexual males but 5,060 are in the heterosexual non-injecting drug user category.During the second quarter of 1996, 20 cases of AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual/bisexual males, four injecting drug users and five in the heterosexual category. Ten of the 20 cases were from Lothian, six from Grampian and four from Greater Glasgow. The cumulative total for AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual/bisexual males, 279 in injecting drug users and 114 in heterosexuals. For the UK as a whole, 12,976 cases of AIDS have now been registered, of which 9,344 have been in homosexual/bisexual males.



  • Hello Ivan,

    A nicer regex to achieve what you’re looking for would probably be :

    (?i)(HIV|AIDS).+?(gay|homosexual)|(?2).+?(?1)

    Notes :

    • First of all, the modifier (?i) forces the search to be case insensitive. Then :

    • The part (HIV|AIDS) looks for one of the two possible names of that disease

    • The part (gay|homosexual) tries to match one of the names, for that sexuality

    • The part .+? matches the shortest non-empty stretch of text between the strings HIV or AIDS and gay or homosexual

    Otherwise, for the reverse order :

    • The subroutine call (?2) re-runs the pattern of group 2, (gay|homosexual), rather than repeating the text it matched

    • The subroutine call (?1) re-runs the pattern of group 1, (HIV|AIDS), in the same way

    • Again, the part .+? matches the shortest non-empty stretch of text, this time between the strings gay or homosexual and the strings HIV or AIDS

    With your given text, to which I added the simple sentence, below, in order to test the second case :

    The cumulative total is 750 homosexual/bisexual males, infected by the HIV, on June 30, 1996
    

    I obtained the 7 captures of text, below :

    1    HIV infection were reported to the Scottish Centre for Infection and Environmental Health.Twenty of these were in homosexual
    2    HIV infected cases to June 30, 1996 is 2,452; 750 (30.6%) are homosexual
    3    HIV infection for the United Kingdom is now 27,033 of which 23,001 are male and 4,032 female. The majority, 16,542 are homosexual
    4    AIDS were reported to SCIEH by clinicians, of which 18 were males; ten were homosexual
    5    AIDS cases to June 30, 1996 is 772 of whom 571 have died.309 cases have been in homosexual
    6    AIDS have now been registered, of which 9,344 have been in homosexual
    7    homosexual/bisexual males, infected by the HIV
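
For readers who want to try the same idea outside Notepad++, here is a minimal sketch in Python. It is an adaptation, not the regex above: Python's `re` module has no `(?1)` / `(?2)` subroutine calls, so both groups are simply written out again in the second alternative. The names `DISEASE`, `SEXUALITY`, `PATTERN` and `SAMPLE` are illustrative, and the sample holds one sentence from the original text plus the added test sentence.

```python
import re

# Python's `re` has no (?1)/(?2) subroutine calls, so each group is
# spelled out again in the second alternative (an adaptation).
DISEASE = r"(?:HIV|AIDS)"
SEXUALITY = r"(?:gay|homosexual)"
PATTERN = re.compile(
    rf"{DISEASE}.+?{SEXUALITY}|{SEXUALITY}.+?{DISEASE}", re.IGNORECASE
)

# Two sentences taken from the thread, one per line.
SAMPLE = (
    "During the second quarter of 1996, 20 cases of AIDS were reported to "
    "SCIEH by clinicians, of which 18 were males; ten were homosexual/bisexual males\n"
    "The cumulative total is 750 homosexual/bisexual males, infected by the HIV, "
    "on June 30, 1996"
)

# The dot does not match a line break here, so each match stays on one line.
matches = [m.group() for m in PATTERN.finditer(SAMPLE)]
for text in matches:
    print(text)
```

The two strings printed correspond to captures 4 and 7 in the list above.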
    

    Hope that this slicing is what you’re looking for :-)

    Best regards,

    guy038

    P.S :

    • The quantifiers * , + , ? , {n,m} and {n,} are greedy quantifiers : they match as much text as possible

    • The quantifiers *? , +? , ?? , {n,m}? and {n,}? are lazy quantifiers : they match as little text as possible

    Some examples :

    Given the subject string : aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa

    • The regex a.+3 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333

    • The regex a.+?3 captures the string aaaaaaa3

    • The regex a.+333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333

    • The regex a.+?333 captures the string aaaaaaa333

    • The regex a.+33333 captures the string aaaaaaa333aaaaaaa33333aaaaaaa33333

    • The regex a.+?33333 captures the string aaaaaaa333aaaaaaa33333
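
The six examples above can be checked with a few lines of Python, whose `re` module treats greedy and lazy quantifiers the same way for these patterns. This is only a quick sketch; the names `SUBJECT`, `first_match` and `results` are illustrative.

```python
import re

SUBJECT = "aaaaaaa333aaaaaaa33333aaaaaaa33333aaaaaaaa333aaaaaaa"


def first_match(pattern):
    """Return the text of the first match of `pattern` in SUBJECT."""
    return re.search(pattern, SUBJECT).group()


# Greedy (.+) and lazy (.+?) variants of the same three searches.
results = {
    pattern: first_match(pattern)
    for pattern in (r"a.+3", r"a.+?3", r"a.+333", r"a.+?333", r"a.+33333", r"a.+?33333")
}

for pattern, text in results.items():
    print(f"{pattern:<12} {text}")
```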



  • Hello guy038,

    Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.

    I found your expression very useful, as there is no doubt yours is nicer and more powerful.
    I ran it across a sample of my corpus and I noticed that the regex captures the text that is comprised within the HIV/AIDS and gay or homosexual. This means that if in the text there is a first reference to HIV/AIDS and halfway through the text a second reference to gay/homosexual (references can be inverted as suggested in your regex too), the entire chunk of text is captured. This however poses a problem because the captures found may not identify a real connection between gay and HIV.

    That is why I was thinking that maybe I should put some parameters (number of words), but it definitely doesn’t capture everything.

    I’ve attached a text where the above occurs.

    Would you recommend using wild cards to capture also gays or homosexuals?

    Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!

    Best wishes,

    Ivan

    He was just three months early.’
    Eric ‘Eazy-E’ Wright died on March 26, 1995, from complications following Aids -
    a combination of a collapsed lung causing heart failure and pneumonia. He was
    just 31 and had checked himself into hospital with chronic breathing problems
    only a month before, completely unaware he was carrying a fatal virus. He left a
    wife, Tomica, whom he married in hospital. They were together for four years and
    had two children, Dominic, two, and Deijah, born six months after her father’s
    death. All have tested negative for HIV.

    Long before he passed away, Eric Wright had indeed put Compton on the map. And
    as the founding father of gangsta rap, he was arguably one of popular culture’s
    most influential figures of the last quarter of the century. As an entrepreneur,
    he was an inspiration to millions, and with his death he metamorphosed swiftly
    from mogul to martyr. Rarely can such a short life have been so symbolic.

    Wright failed to stay the distance at Dominguez High School in Compton, and by
    the early Eighties was a regular hustler, a dealer in crack and pot. It must
    have been the lure of money or excitement that enticed him, for the home
    provided for him and his younger brother and sister was a stable one. His mother
    was a Montessori teacher, his father, a retired post office worker and sometime
    musician who had a big hit with the 103rd Street Rhythm Band back in the early
    Seventies with Express Yourself, later covered by NWA. (Both are still alive,
    though his father recently suffered a stroke.) Some of the streets in Compton
    are strangely quaint: rows of polite bungalows fronted by porches and lawns
    enriched by sub-tropical weather. Some are unremittingly grim, and the main
    thoroughfare, Long Beach Boulevard, is a sorry strip of boarded-up businesses,
    dishevelled lots, soiled fast-food joints and two-bit stores. Wright came from a
    pleasanter part, and it may seem he was self-consciously ‘dropping down’ by his
    first choice of career. But with US ghettos it is wise to recall the phrase, ‘it
    takes a village to raise a child’. A kid is part of the street environment, like
    it or not. And Eric Wright obviously did.

    A childhood friend, Big Man (he declined to give his real name), who bears some
    resemblance to the splendidly sonorous and rotund singer Barry White - though
    his voice is a few notes higher and his girth a couple of feet narrower -
    remembers: ‘Unlike Eric, I went all the way through school.’ His smile broadens:
    ‘Right in the front, right out the back, I never stayed for a class, though I
    did have a dice game third period every day. I didn’t run into him on campus too
    much.’ As an adult, Wright was so small - a slender 5ft 4in - that everyone
    presumed he was a drop-out kid. Most didn’t know his real age until he died.
    But he had presence - the presence of money, even when he didn’t have that much.
    People noticed him strutting down the street. He wore the same trademark clothes
    as the other home boys, but less baggy, with more style. To rib him, close
    friends called him ‘casual’. Perhaps his first vehicle, a psychedelically-painted
    truck, wasn’t so hip, but Eric - he was always Eric or Little Man,
    never Eazy, to his friends - had respect.

    He started organising parties with a friend, Andre Young, aka Dr Dre, who had
    real talent, and was a member of the World Class Wreckin’ Cru with Antoine
    Carraby, aka DJ Yella. Dre began telling Wright what he knew of the music
    business. Smart enough to know drug dealing couldn’t last forever, Eric wanted
    out. Towards the end of 1985 he took, and passed, the test to join the post
    office. But he had also seen that music, like drugs, offered power, and decided
    to act.

    At the time west coast rap was lame, party stuff about the good times - where
    people wanted to be, where they would be soon. It came a poor second to the
    east coast, where hip-hop had begun in the late Seventies and burst across the
    world with the Sugar Hill Gang’s Rapper’s Delight.

    In the spring of 1986, aged 22, Wright assembled the best talent in the 'hood:
    Dre, Yella, MC Ren (Lorenzo Patterson) and Ice Cube (O’Shea Jackson). He paid
    for studio time and urged Cube, the best rapper and songwriter, to write
    something ‘real’, something about the gangs, about the life he’d been leading.
    The result was Boyz 'N The Hood (later the title of a film) and the birth of
    NWA. Wright paid $7,000 for 10,000 12-inch records, and he and Dre would drive
    around - by now in a burgundy Suzuki Samurai - selling the discs at swap-meets,
    a cross between a flea market and a car-boot sale. By word of mouth alone, the
    record sold 500,000 copies.

    The next year, Wright hooked up with Jerry Heller, a music business veteran, who
    recognised this hardcore stuff as the next big thing. He took Wright to Priority
    Records, where the boy from the 'hood talked his way into a unique deal:
    Priority would distribute records on Wright’s own label, Ruthless, and Wright
    would have, for a newcomer, an unheard of piece of the action.

    In 1988, Eazy-E’s first solo album, Eazy Duz It, and NWA’s Straight Outta
    Compton, sold more than 5.5 million copies between them - without a single play
    on radio or television. Suddenly they were stars - a lifestyle that’s easy to
    slip into in LA. Eazy bought a $1 million house in a cul-de-sac in Westlake, a
    suburb predominated by retired people. Dre and Yella had a house next door and
    they shared a party house over the road. By night, they were the people everyone
    wanted to know.

    As a rapper, Wright had a distinctive, high-pitched, brattish voice, but was not
    one of the very best. It could take him hours to get his part down right. Nor
    was he a great songwriter. But he cemented NWA and determined their direction.
    MC Ren recalls: ‘It just wouldn’t have happened without him. Even though he
    wasn’t writing - it was his voice that grabbed everybody. And he had the idea of
    putting that shit together, that all-star group. He saw something I didn’t see
    and that shit just clicked. Others were doing something like it but not to the
    full depth of what we were doing. It shocked a lot of motherfuckers.’ Even the
    group’s full name - Niggaz With Attitude - was a challenge to a liberal
    mainstream that had shunned the n-word in the post-Martin Luther King era. Now
    it was thrust back in their white faces, worn as a badge of honour, a viciously
    ironic statement that, dammit, young blacks still felt they were treated like
    niggers. Gangsta rap, like all black American music, was rooted in the Blues -
    with its lyrical expressiveness and rhythmic foundations - but added to the
    plaintiveness of old was an angry, cussing criticism of authority that
    represented a new departure.

    NWA, and the genre they spawned, represented a nightmare for the respectable
    world - and many feminists: a live and kicking validation of foul language,
    violence and misogyny. They were reprimanded by the FBI over their song Fuck Tha
    Police on Straight Outta Compton, and joined a line of demonised performers that
    began with Elvis and continued through Jimi Hendrix, Sly Stone, the Rolling
    Stones and the Sex Pistols. With their baseball caps, baggies and snow-white
    trainers, they influenced fashion across the world. Their tales of drugs, police
    intimidation, gang violence and alienation jump-started a new brand of American
    film-making and told the world of black urban angst long before the Rodney King
    beating and subsequent LA riots. America’s political leaders had to listen.
    Attitude had arrived.

    By the time NWA’s 1991 album Efil4Zaggin (Niggaz4Life spelt backwards) became
    the first rap album to top the US Billboard charts, middle-class white kids
    across America, and indeed Europe, were draping themselves in polyester cotton,
    greeting each other with high fives and ‘Yo!’, and were fully acquainted
    vicariously with the hip Hades of motherfuckers, bitches and hoes.

    Ruthless Records artists have sold more than 28 million records, with 21
    recordings reaching gold or platinum status. After NWA split, Ice Cube and Dr
    Dre soared. Eazy-E’s final album, Str8 Off Tha Streets Of Muthaphu**in Compton,
    which is released on Monday, is sure to hit the top back home. Priority Records
    (now separate from Ruthless) has meanwhile produced a greatest hits package,
    Eternal E, also available here.

    Wright once quoted his personal wealth at $60 million and Ruthless’s value at
    $20 million. For a time, it was one of the most successful independent labels of
    all time, and inspired many imitators, many rivals. For black Americans, Wright
    became a beacon of entrepreneurship, even more so than Russell Simmons of Def-
    Jam Records on the east coast, because no one, but no one, had ever come off the
    streets, turned their back on dealing dope and made it big.

    But within days of his death, a struggle for his legacy was underway. Michael
    Klein, the business manager at Ruthless, disputed Wright’s will which left the
    company to Tomica, claiming Wright handed half of it over to him in 1992.
    Tomica, a former assistant to the chairman of Motown, promised her dying husband
    she would keep the company alive. Klein is also questioning Wright’s state of
    mind when he married and the validity of making his wife a trustee of his
    estate.

    ‘Tomica has the ambition and the ability to run the company. She has worked in
    the business and knows the ins and outs and she is a bright woman, she’s no
    airhead,’ says Ernie Singleton, formerly head of MCA’s black music division who
    was brought in as acting president of Ruthless by the California Superior
    Courts.

    ‘It was a real love affair,’ says a family friend. ‘She certainly loved him. And
    Eric, he was obviously no goody-two shoes but he was extremely devoted. He
    trusted her judgment.’ Ruthless is now back up on its feet - the doors were
    padlocked for several weeks last spring after things started disappearing - but
    the legal grind could, according to music industry experts, take another three
    years. Ruthless may be whittled away in lawyers’ fees, taking Wright’s dream of
    a large-scale, multi-media black company with it.

    In America, there isn’t the pressure there is here on artists to produce fresh
    sounds. Fans stay loyal. And from Eazy they expected the familiar rough stuff,
    which largely explains why the subject matter on the first Eazy-E album and the
    last is so similar, though musically the latter is richer, more layered, bumping
    along in a smooth groove. Alongside the usual bravado and gang tales, there are
    tracks such as Nuts On Ya Chin and Lickin, Suckin And Fuckin. The heavy sex
    content of Eazy’s raps, the brazen assertion of sexual prowess, the arrogant
    expectation of female submission, lend a grim irony to his death.

    It almost goes without saying that he came from an environment where Aids is
    very largely viewed as a gay disease. ‘It does have that image,’ says Cassandra
    Ware, vice-president of Ruthless Records and a friend of Wright’s before she
    arrived at the company. 'Because of that heavy machismo, like, ‘never will I
    take it in the butt, so I’ll be all right’. And Eric was one of those machismos
    - ‘I’m a star, killer, handsome buck of a man. Any woman who comes my way I’ll
    ride her’ - and it happened.’


  • @guy038
    Hello guy038,

    Thank you so much for taking the trouble to reply to my query. This is very helpful and kind of you.

    I found your expression very useful, as there is no doubt yours is nicer and more powerful.
    I ran it across a sample of my corpus and I noticed that the regex captures the text that is comprised within the HIV/AIDS and gay or homosexual. This means that if in the text there is a first reference to HIV/AIDS and halfway through the text a second reference to gay/homosexual (references can be inverted as suggested in your regex too), the entire chunk of text is captured. This however poses a problem because the captures found may not identify a real connection between gay and HIV.

    That is why I was thinking that maybe I should put some parameters (number of words), but it definitely doesn’t capture everything.

    I’ve attached a text where the above occurs.

    Would you recommend using wild cards to capture also gays or homosexuals?

    Thank you so much for your help and attention. You are really helping me get out of this horrible cul-de-sac!

    Best wishes,

    Ivan

    (PS: the text is in my previous post. Apologies if this message repeats the previous one. I don’t mean to harass you with lots of questions. I just thought that you might not have seen the previous reply because I didn’t tag you. Many thanks again for your generous help!)



  • Hi Ivan,

    Sorry for my late reply. I tried to deeply think about your problem :-)

    I’m wondering… Looking for any significant word in your second text, with the simple regex gay|homosexual|hiv|aids|disease|virus|condition, I only found :

    • The three words Aids, virus and HIV, in that order, inside the first paragraph

    • The words Aids, gay and disease, in that order, inside the last paragraph

    So, I’m a bit confused. Which relation would you like the regexes to detect ?

    To my mind, the true problem is :

    • How many pairs of words do you consider worth searching for ?
    • Which minimum/maximum distance must separate the two words of a pair ?

    It’s quite a difficult question, which seems to be more a linguistic matter than a regex matter ! Even worse, you might want a maximum distance d1 between two words and another maximum distance d2 between two other words !

    I’m quite optimistic about building regexes for specific purposes. That’s not the main problem. As usual, you just need to define, exactly, what you expect :-))
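
One way to make the distance question concrete is to count the words allowed between the two terms. The sketch below is only an illustration, written for Python's `re` module rather than for Notepad++. The helper name `proximity_pattern`, the two word lists and the limit `max_words` are all assumptions, and a "word" is naively taken to be a run of `\w` characters. The trailing `\w*` also picks up plural forms such as gays or homosexuals.

```python
import re


def proximity_pattern(first, second, max_words):
    """Hypothetical helper: match a term from `first` and a term from
    `second`, in either order, with at most `max_words` words in between."""
    a = r"\b(?:%s)\w*" % "|".join(first)
    b = r"\b(?:%s)\w*" % "|".join(second)
    # Up to max_words intervening words, as few as possible (lazy).
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_words
    return re.compile(f"{a}{gap}{b}|{b}{gap}{a}", re.IGNORECASE)


DISEASE = ["hiv", "aids"]
SEXUALITY = ["gay", "homosexual"]

SENTENCE = ("It almost goes without saying that he came from an environment "
            "where Aids is very largely viewed as a gay disease.")

# Six words separate "Aids" from "gay" in this sentence.
wide = proximity_pattern(DISEASE, SEXUALITY, 10).search(SENTENCE)
narrow = proximity_pattern(DISEASE, SEXUALITY, 3).search(SENTENCE)
plural = proximity_pattern(DISEASE, SEXUALITY, 10).search("HIV among homosexuals")

print(wide.group())
print(narrow)
print(plural.group())
```

With a limit of 10 words the pair is found. With a limit of 3 it is not, which is exactly the trade-off that has to be decided before a regex can be written.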


    To fully understand the complexity of the simple search of TWO words, inside any text ( although it could have seemed easy, at first sight ! ), let’s consider the example text below, in one line :

    1----A----1----B----1----C----9----D----9----E----1----F----9----G----1----H----1----I----9----J----
    

    Well. Suppose you’re looking for ranges of characters which either :

    • Begin with the digit 1 and end with the digit 9

    • Begin with the digit 9 and end with the digit 1

    These limits 1 and 9 delimit ten zones of characters, from zone A to zone J

    Several interpretations are possible :

    • With the regex 1\K.*?(?=9)|9\K.*?(?=1), we obtain 5 ranges of characters, with their digit limits, below :

      • Zone 1----A----1----B----1----C----9
      • Zone 9----D----9----E----1
      • Zone 1----F----9
      • Zone 9----G----1
      • Zone 1----H----1----I----9
    • With the regex 1\K(?=.*?9)|9\K(?=.*?1), we obtain 9 zero-length matches, each located just after the first limit of one of the ranges below :

      • Zone 1----A----1----B----1----C----9
      • Zone 1----B----1----C----9
      • Zone 1----C----9
      • Zone 9----D----9----E----1
      • Zone 9----E----1
      • Zone 1----F----9
      • Zone 9----G----1
      • Zone 1----H----1----I----9
      • Zone 1----I----9
    • With the regex 1\K[^1\r\n]*?(?=9)|9\K[^9\r\n]*?(?=1), we obtain 5 ranges of characters, with their digit limits, below :

      • Zone 1----C----9
      • Zone 9----E----1
      • Zone 1----F----9
      • Zone 9----G----1
      • Zone 1----I----9

    Notes :

    • With the first regex, the regex engine matches, from cursor location, either :

      • The shortest next range 1…9
      • The shortest next range 9…1

    So, the regex engine alternately finds a 1…9 zone, then a 9…1 zone, then a 1…9 zone, and so on…

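    • With the second regex, the regex engine matches, from cursor location, either :

      • The zero length location, just after the next limit 1, which begins a 1…9 zone
      • The zero length location, just after the next limit 9, which begins a 9…1 zone

    So, as nothing is consumed beyond the limit itself, every limit which is followed, further on, by the opposite limit gives a match, and the zones overlap
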
    • With the third regex, the regex engine matches, from cursor location, either :

      • The shortest next range 1…9, which does NOT contain any limit 1, inside that range
      • The shortest next range 9…1, which does NOT contain any limit 9, inside that range
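
The three interpretations can be reproduced with a short Python sketch. It is an adaptation under one assumption: Python's `re` module does not support `\K`, so each limit is tested with a lookbehind, `(?<=1)` or `(?<=9)`, which gives the same ranges on this example text. The names `TEXT`, `letters` and `zone_from` are illustrative, and each range is reported by its zone letters only.

```python
import re

TEXT = ("1----A----1----B----1----C----9----D----9----E----"
        "1----F----9----G----1----H----1----I----9----J----")


def letters(chunk):
    """Keep only the zone letters of a chunk, e.g. '----A----1----B----' -> 'AB'."""
    return "".join(ch for ch in chunk if ch.isalpha())


def zone_from(pos):
    """Zone letters from position `pos` up to the next opposite limit."""
    opposite = "9" if TEXT[pos - 1] == "1" else "1"
    return letters(TEXT[pos:TEXT.index(opposite, pos)])


# First regex: shortest range from a limit to the next opposite limit.
first = [letters(m.group())
         for m in re.finditer(r"(?<=1).*?(?=9)|(?<=9).*?(?=1)", TEXT)]

# Second regex: zero-length match after every limit that has an opposite limit later on.
second = [zone_from(m.start())
          for m in re.finditer(r"(?<=1)(?=.*?9)|(?<=9)(?=.*?1)", TEXT)]

# Third regex: like the first, but the range may not contain its own starting limit.
third = [letters(m.group())
         for m in re.finditer(r"(?<=1)[^1\r\n]*?(?=9)|(?<=9)[^9\r\n]*?(?=1)", TEXT)]

print(first)
print(second)
print(third)
```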

    So, in that example, which kind of search would be pertinent ( or a new one, different from the three above ! ), to your mind ?!

    See you later !

    Best regards,

    guy038

    P.S. :

    If the example text is split in several lines, as below :

    1----A----1---
    -B----1----C-
    ---9----D----9--
    --E----1--
    --F----9----G---
    -1----H--
    --1----I---
    -9----J----
    

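    The 3 regexes described above must be rewritten with the (?s) modifier, which lets the dot match line-break characters too, as below ( in the third one, the negated class [^1] or [^9] already matches line breaks ) :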

    1\K(?s).*?(?=9)|9\K(?s).*?(?=1)

    1\K(?s)(?=.*?9)|9\K(?s)(?=.*?1)

    1\K(?s)[^1]*?(?=9)|9\K(?s)[^9]*?(?=1)
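
The effect of letting the dot cross line breaks can be seen in a small Python sketch. Two adaptations are assumed: lookbehinds replace `\K`, which Python's `re` lacks, and the `re.DOTALL` flag plays the role of `(?s)`. The names `LINES`, `FIRST`, `THIRD` and `zones` are illustrative.

```python
import re

# The example text, split over several lines as in the P.S.
LINES = (
    "1----A----1---\n"
    "-B----1----C-\n"
    "---9----D----9--\n"
    "--E----1--\n"
    "--F----9----G---\n"
    "-1----H--\n"
    "--1----I---\n"
    "-9----J----\n"
)

FIRST = r"(?<=1).*?(?=9)|(?<=9).*?(?=1)"
THIRD = r"(?<=1)[^1]*?(?=9)|(?<=9)[^9]*?(?=1)"


def zones(pattern, flags=0):
    """Zone letters of every match of `pattern` in LINES."""
    return ["".join(ch for ch in m.group() if ch.isalpha())
            for m in re.finditer(pattern, LINES, flags)]


single_line = zones(FIRST)         # the dot stops at each line break
dot_all = zones(FIRST, re.DOTALL)  # the dot also matches line breaks
negated = zones(THIRD)             # a negated class matches line breaks anyway

print(single_line)
print(dot_all)
print(negated)
```

Without the flag no line holds both limits, so nothing matches. With it, the same five ranges as in the one-line text come back.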



  • Hi @guy038,

    Only now did I notice that my answer thanking you for your help didn’t go through for some reason. I very much appreciated your help and your advice on my issue.

    In the end, what I did was to apply a slightly tweaked version of the first regex. I manually disambiguated all the possible combinations and chunks of text that contained the parameters ‘gay,gays, homosexual, Aids, HIV…’ and assessed the relevance and correctness of the results, text by text.

    Thank you again for your help.

    Keep in touch and have a good day,

    Ivan

