Editing a 600 MB XML file



  • What size of XML file can you load in N++?



  • I don’t think file type matters, but…
    If I have no plugins and no other file opened, I can open files with maximum size around 362MB, before it displays “File is too big to be opened” error.
    But with my plugins enabled, the limit lowers to about 150MB.



  • Hello, for people needing an x64 build of Notepad++, I made one here.

    Notes:

    1. This is an easy unofficial build – not tested by devs.
    2. You need 64-bit OS.
    3. It is not compatible with 32-bit plugins (i.e. all available plugins). You need to rename your /plugins folder or move it somewhere else.
    4. It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
    5. The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files (rename or move them).


  • I really think that anyone who needs to edit a 600MB XML file should be thinking of alternatives! I mean, those files contain structures, and you really need to use something that will respect that structure - which isn’t a text editor.

    I don’t know what this file contains, but it might be useful to arrange for it to be stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!

    I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

    David
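
A rough sketch of the structure-aware approach suggested above, in Python rather than Mathematica: the standard library's `xml.etree.ElementTree.iterparse` reads the XML as a stream of elements instead of loading it whole. The function name `count_tags` and the sample data are made up for illustration, and the sketch assumes the file is well-formed XML.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count element tags in an XML stream without keeping the full tree."""
    counts = Counter()
    # iterparse yields each element once its closing tag has been read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's text, attributes and children once counted.
        elem.clear()
    return counts


# Tiny in-memory stand-in for a large file (hypothetical content).
sample = io.BytesIO(b"<items><item id='1'/><item id='2'/><note>x</note></items>")
print(count_tags(sample))
```

For a real file, pass its path instead of the in-memory sample. Cleared elements still leave small empty shells attached to the root, so memory use is much lower than a full load but not strictly constant.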



  • hi david
    excuse me.
    be happy, that You never had to edit a 10GB-xml-file.
    For that, notepad++ is NOT your friend ;-)
    But I want to give Ricardo’s 64-bit edition a try! It’s new to me ;-)
    Yours klaus



  • Klaus,

    “be happy, that You never had to edit a 10GB-xml-file.”

    I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

    I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

    Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

    It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

    As a preliminary suggestion, you could write a C program to read the file and recognise the data you want to change. You could then modify the program to actually perform the change (but I would back up your file before you start :) ).

    David
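
The first step of that suggestion (read the file and recognise the data, without changing anything yet) might look like the following. It is a sketch in Python rather than C; the function name `find_matching_lines` and the `<price>` tag are hypothetical, and it assumes the file is UTF-8 text broken into lines of reasonable length.

```python
import re
import tempfile


def find_matching_lines(path, pattern, limit=10):
    """Return up to `limit` (line_number, text) pairs for lines matching `pattern`."""
    regex = re.compile(pattern)
    hits = []
    # Iterating over the file object reads one line at a time,
    # so memory use does not depend on the total file size.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if regex.search(line):
                hits.append((number, line.rstrip("\n")))
                if len(hits) >= limit:
                    break
    return hits


# Small demonstration file; the tag name is invented for the example.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("<root>\n<price>10</price>\n<name>a</name>\n<price>20</price>\n</root>\n")

print(find_matching_lines(tmp.name, r"<price>"))
```

Nothing is written back to the file here, so it is safe to run against the original while working out the right pattern.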



  • woot
    For editing a 10GB file, I guess the best would be an editor that doesn’t load the entire file into memory, but reads it directly from disk. Otherwise, you would need a lot of RAM.
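
Reading from disk in fixed-size pieces, as described above, can be illustrated with a small Python sketch. `count_lines_in_chunks` is a made-up name and the chunk size is arbitrary; the point is only that memory use is bounded by the chunk size, not by the file size.

```python
import io


def count_lines_in_chunks(stream, chunk_size=1024 * 1024):
    """Count newline bytes while holding at most `chunk_size` bytes in memory."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty bytes object means end of file
            break
        total += chunk.count(b"\n")
    return total


# In-memory stand-in; for a real file, pass open(path, "rb") instead.
demo = io.BytesIO(b"one\ntwo\nthree\n")
print(count_lines_in_chunks(demo, chunk_size=4))
```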



  • Hello Maor Bachar and All,

    Why not give the old but excellent scripting tool gawk.exe a try? It can be used on both Unix and Windows machines (with the appropriate executable, of course).

    AFAIK, the latest Windows version of gawk.exe is v4.1.0, which you can get at the address below:

    https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

    To get an overview of the main features of gawk and what’s new in the v4.1.0 version, follow the link :

    http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

    And here, below, is the link to download the latest reference manual for gawk v4.1.x, in various formats:

    http://www.gnu.org/software/gawk/manual/

    The PDF form is quite recent : April 2015 !


    To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m on holiday at the moment), I’ll be able to build bigger files!

    Briefly, I first created, in N++, a file of about 200 MB, named xxx.txt. Then I copied this file five times into All.txt, in a DOS window, with the commands:

    type xxx.txt>All.txt
    type xxx.txt>>All.txt
    type xxx.txt>>All.txt
    type xxx.txt>>All.txt
    type xxx.txt>>All.txt
    

    I ended the All.txt file with a last line:

    echo END of the FILE>>All.txt
    

    Line lengths range from 0 to 150 characters. In the end, the All.txt file contains about 18 498 000 lines, for 1026 MB!


    I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded:

    #----------------------------------------------------------------------------------------------------------------#
    #  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
    #            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
    #                                                                                                                #
    #  SYNTAX :                                                                                                      #
    #  ¯¯¯¯¯¯                                                                                                        #
    #                                                                                                                #
    #    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
    #----------------------------------------------------------------------------------------------------------------#
    
    {
      if ( pattern == "" )
        {
    
          #---------------------------------------------------------------------------#
          #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
          #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
          #                                                                           #
          #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
          #                                                                           #
          #  IF the PATTERN is NOT initialized  :                                     #
          #    SET the VARIABLE 'total' to the value 0                                #
          #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
          #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
          #    SKIP to the NEXT line to READ                                          #
          #---------------------------------------------------------------------------#
    
          total = 0 ; pattern = $0
    
          next
        }
    }
    
    #------------------------------------------------------------------------------#
    #    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
    #      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
    #    SKIP to the NEXT line to READ, in ALL cases                               #
    #------------------------------------------------------------------------------#
    
    {
      if ($0 ~ pattern) { ++total }
    
    #   OR     if ($0 ~ pattern) { ++total ; print }
    
      next
    }
    
    
    #-------------------------------------------------------------------------------------------------------------------#
    #  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
    #-------------------------------------------------------------------------------------------------------------------#
    
    END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }
    

    The command echo .*|gawk -f Script.txt - all.txt gives me the number 18 498 231, which is the number of lines matching the regex .*

    The command echo END of the FILE|gawk -f Script.txt - all.txt gives me the number 1, which shows that it correctly read the last line, the 18 498 231st, of the 1 GB file All.txt!

    On my old Win XP laptop, with only 1 GB of RAM (2 * 512 MB), I got the result of each command in about 1 min 30 s! Not too bad, is it?

    Of course, my script just counts the matches but, with gawk, you can manipulate files in many ways: do calculations, searches/replacements, and so on. It’s a very powerful tool, although tiny in size: 223 246 bytes for the v4.1 version.

    You’ll just have to skim the gawk documentation a bit!

    And, as Ricardo suggested, it seems to read files directly from disk :-) => Seemingly no limit on size

    Best Regards,

    guy038
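
For the search/replacement case mentioned above, gawk has its own `gsub` function. As a hedged illustration of the same line-by-line idea, here is a Python sketch instead; `replace_in_file`, the output file name and the UTF-8 encoding are assumptions, not part of the original script. It writes to a new file so the original stays untouched.

```python
import os
import re
import tempfile


def replace_in_file(src_path, dst_path, pattern, replacement):
    """Copy src to dst line by line, applying a regex replacement.

    Returns the number of lines that were changed.
    """
    regex = re.compile(pattern)
    changed = 0
    # newline="" keeps the original line endings (LF or CRLF) as they are.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            new_line, hits = regex.subn(replacement, line)
            if hits:
                changed += 1
            dst.write(new_line)
    return changed


# Small demonstration files in a temporary folder.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "All.txt")
dst = os.path.join(workdir, "All_new.txt")
with open(src, "w", encoding="utf-8") as handle:
    handle.write("alpha\nbeta\nEND of the FILE\n")

print(replace_in_file(src, dst, r"END of the FILE", "END"))
```

Only one line is held in memory at a time, so the same approach should scale to files far larger than the available RAM, provided the file is broken into lines.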





  • @Ricardo
    I think Crisp (64-bit) can do it!
    But it costs 100-250 per year!

    And for my biggest files (10GB files with rubbish-xml-gore), I have to use VEDIT Pro (64-bit).
    Warning: high price! $240 (no joke!)

    Yours klaus

