Help with regex



  • Guys,

    A little help?

    My database export is breaking some records across several lines.

    It’s exporting a CSV like this (first line is the fields):

    /--------------------/
    id, location, data, name, description
    ObjectId(“662626”), rio, 12/12/2018,
    this is a name, this is a description
    ObjectId(“2342342626”, saopaulo, 14/12/2018,
    this is also a name, another description
    /--------------------/

    I need it to be:

    /--------------------/
    id, location, data, name, description
    ObjectId(“662626”), rio, 12/12/2018, this is a name, this is a description
    ObjectId(“2342342626”), saopaulo, 14/12/2018, this is also a name, another description
    /--------------------/

    Another catch is that this doesn’t always happen. Sometimes a record is on a single line, sometimes it is spread over 3 lines.

    The good thing is that, except the header, all lines should begin with ObjectId("

    Any suggestion?

    Thanks!



  • I can get one that works when each record is only broken into 2 lines…

    • (?s)(ObjectID\(.*?)(\R+)(.*?)(?=\R+ObjectID|\R+$)
    • \1 \3
    • Search Mode = ☑ Regular Expression

    But that won’t work when the record is broken into 3 or more lines…

    This seems to work with any number of lines, when run repeatedly (until all are merged)

    • (?-s)(ObjectID\(.*?)\h*\R+(?!^ObjectID\()
    • \1\x20
    • Search Mode = ☑ Regular Expression

    I don’t like that you have to run it multiple times, but it’s better than nothing, and will work for all multiline (and won’t mess up the single-line versions)

    • The search says “dot doesn’t match newlines”, “match literal ObjectID( followed by 0 or more non-newline characters and capture into group \1”, “match zero or more horizontal spaces (space, tab), followed by one or more newlines, but don’t capture/keep those”, and “make sure that the match is not followed by another literal ObjectID(, but don’t consume whatever does follow”
    • The replace says “replace everything matched (except the not-consumed stuff) with the first group, followed by the space character (aka \x20)”
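
    For anyone scripting this outside Notepad++, the repeated-run idea above can be sketched in Python. This is only a rough translation under some assumptions: Python's re module has no \h or \R, so [ \t] and \n stand in for them; line endings are assumed to be normalized to \n; the text is assumed to have no trailing newline or blank lines; and the pattern spells ObjectId as in the data, because Python matches case-sensitively by default. The name merge_repeatedly is made up for the example.

```python
import re

# Rough equivalent of (?-s)(ObjectID\(.*?)\h*\R+(?!^ObjectID\()
# Assumption: "\n" line endings; [ \t] replaces \h, \n replaces \R.
STEP = re.compile(r"(ObjectId\(.*?)[ \t]*\n+(?!ObjectId\()")

def merge_repeatedly(text: str) -> tuple[str, int]:
    """Apply the substitution until nothing changes; return (text, passes)."""
    passes = 0
    while True:
        merged = STEP.sub(r"\1 ", text)
        if merged == text:
            return text, passes
        text = merged
        passes += 1

# First record broken into 2 lines, second record broken into 5 lines.
sample = (
    'ObjectId("1"), rio, 12/12/2018,\n'
    'this is a name, this is a description\n'
    'ObjectId("2"),\n'
    'saopaulo,\n'
    '14/12/2018,\n'
    'this is also a name,\n'
    'another description'
)

result, passes = merge_repeatedly(sample)
print(passes)
print(result)
```

    Each pass can only glue one continuation line onto a record, because the next match has to start at another ObjectId(. That is why a record split over five lines needs four passes.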

    I am sure @guy038 could get it in one run of the regex engine. But I don’t have such awesome regex skills

    The second one worked on this example text using just four runs of the regex:

    ObjectId(“662626”), rio, 12/12/2018,
    this is a name, this is a description
    ObjectId(“2342342626”, saopaulo, 14/12/2018,
    this is also a name, another description
    
    ObjectId(“2342342626”, 
    saopaulo, 
    14/12/2018,
    this is also a name, 
    another description
    

    --------
    FYI: if you have further regex needs, study this FAQ and the documentation it points to. Before asking a new regex question, understand that for future requests, many of us will expect you to show what data you have (exactly), what data you want (exactly), what regex you already tried (to show that you’re showing effort), why you thought that regex would work (to prove it wasn’t just something randomly typed), and what data you’re getting with an explanation of why that result is wrong. When you show that effort, you’ll see us bend over backward to get things working for you. If you need help formatting the data so that the forum doesn’t mangle it (so that it shows “exactly”, as I said earlier), see this help-with-markdown post, where @Scott-Sumner gives a great summary of how to use Markdown for this forum’s needs.
    Please note that for all “regex” queries – or queries where you want help “matching” or “marking” or “bookmarking” a certain pattern, which amounts to the same thing – it is best if you are explicit about what needs to match, and what shouldn’t match, and have multiple examples of both in your example dataset. Often, what shouldn’t match helps define the regular expression as much or more than what should match.



  • @Edgard etc

    I’m not at a PC currently so I cannot provide a tested solution, but if your example is accurate it would seem that the splitting occurs at a delimiter, namely the comma. If that is the case, then all we need to do is search for a comma immediately before an end-of-line marker, and replace both with just the comma.
    So the regex might look something like
    Find what: ,$
    Replace with: ,

    See if that works and let us know.

    Terry



  • Actually ,\r\n for the find what might be better, that’s what I normally use. I suspect the $ amounts to a zero width position, thus it won’t actually replace the carriage return line feed.

    Terry.
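
    The suspicion about $ being zero-width can be checked with a small Python sketch. Python's re module is not the engine Notepad++ uses, so treat this as an illustration of the general behaviour rather than proof of what Notepad++ does. The sample string is invented, and \r?\n is used so that both Windows and Unix line endings are covered.

```python
import re

broken = "id1, rio, 12/12/2018,\nname, description\n"

# `$` only asserts a position at the end of a line, so the newline
# itself is never part of the match and the text comes back unchanged.
unchanged = re.sub(r",$", ",", broken, flags=re.MULTILINE)

# Consuming the line ending explicitly is what actually joins the lines.
joined = re.sub(r",\r?\n", ",", broken)

print(unchanged == broken)
print(repr(joined))
```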



  • @Terry-R,

    I thought about limiting it to comma-EOL, but the OP wasn’t clear when it split to multiple lines, whether it would always split on a comma, or if it was possible to split in the middle of a field, like:

    ObjectId(“662626”), rio, 12/12/2018,
    this is a name, this is a description
    ObjectId(“2342342626”, saopaulo, 14/12/2018,
    this is also a name, another description
    
    ObjectId(“2342342626”, 
    saopaulo, 
    14/12/2018,
    this 
    is 
    also 
    a 
    name, 
    another 
    description
    

    If it could never be in the middle of a field, then ,\h*\R -> ,\x20 will work
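
    A quick Python sketch of the comma-only variant shows both the case where it is enough and the case where it is not. As assumptions: [ \t] stands in for \h and \r?\n for \R (Python's re has neither), and the sample strings and the name join_after_commas are invented for the illustration.

```python
import re

# Rough equivalent of ,\h*\R  ->  ,\x20
COMMA_EOL = re.compile(r",[ \t]*\r?\n")

def join_after_commas(text: str) -> str:
    """Join lines only where the break directly follows a comma."""
    return COMMA_EOL.sub(", ", text)

# Every break follows a comma: fully repaired.
split_on_commas = 'ObjectId("1"), rio, \n12/12/2018,\nname, description'
# One break falls in the middle of a field: that break survives.
split_mid_field = 'ObjectId("1"), rio,\nthis \nis a name, description'

fixed = join_after_commas(split_on_commas)
partial = join_after_commas(split_mid_field)

print(repr(fixed))
print(repr(partial))
```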



  • But that gave me another idea, in case it can break inside a field: look for one or more newlines, as long as they are not followed by literal ObjectID(, and replace them with a space:

    • \h*\R+(?!ObjectID\()
    • \x20
    • Search Mode = ☑ Regular Expression

    This does it in one go, even for my most-recent data example

    edit: change to \h*\R+(?!ObjectID\(), so that any whitespace before the newline is also collapsed into the single space (\x20) in the replace string.
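
    The same one-pass idea can be sketched in Python for anyone post-processing the export in a script. The assumptions here: [ \t] and \r?\n replace \h and \R, which Python's re does not support; the pattern uses the ObjectId spelling from the data because Python is case-sensitive by default; and the input has no trailing newline or blank lines (a trailing newline would also be turned into a space). The name join_records and the sample are invented.

```python
import re

# Rough equivalent of \h*\R+(?!ObjectID\()  ->  \x20
JOIN = re.compile(r"[ \t]*(?:\r?\n)+(?!ObjectId\()")

def join_records(text: str) -> str:
    """Replace every line break that does not precede a new record with a space."""
    return JOIN.sub(" ", text)

sample = (
    "id, location, data, name, description\n"
    'ObjectId("1"), rio, 12/12/2018,\n'
    "this is a name, this is a description\n"
    "description continues\n"
    'ObjectId("2"), \n'
    "saopaulo, \n"
    "14/12/2018,\n"
    "this \n"
    "is also a name, another description"
)

lines = join_records(sample).splitlines()
for line in lines:
    print(line)
```

    Because the lookahead only protects breaks that come right before ObjectId(, the header line stays separate and every other break is collapsed, whether or not it follows a comma.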



  • The problem, I noticed, is that the lines don’t always end with commas. Like:

    /--------------------/
    id, location, data, name, description
    ObjectId(“662626”), rio, 12/12/2018,
    this is a name, this is a description
    description continues
    more description
    ObjectId(“2342342626”, saopaulo, 14/12/2018,
    this is also a name, another description
    still description
    more
    and more
    /--------------------/

    That’s why it won’t work, Terry :-(



  • @PeterJones said:

    \h*\R+(?!ObjectID\()

    Looks like it works!!! Amazing, Peter!!! I’ll test if everything is OK!!!

