    Editing 600 mega XML file

    • Maor Bachar @Ricardo

      @Ricardo gonna test it now :)

      Edited:
      WOW, it loaded the file fast and responds fast (on a 32-bit, 4GB RAM Win7 PC).
      I’m impressed… checking what features it has.

      Seems like it doesn’t have an XML style?

      Thanks,
      Maor.

      • dail

        Keep in mind that handling very large files is difficult. There are editors made for editing large files that load only parts of the file at a time. If you have a 10GB file, N++ would have to load the entire thing into memory. On top of that, with something like XML there is a lot more overhead from parsing the file, storing the style information for each character, folding states, etc. For files that large you’re better off using some external program or scripting language, since the entire thing usually doesn’t have to be loaded into memory.
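To illustrate the point about not loading everything into memory, here is a minimal Python sketch (not from the thread; the function name `count_newlines` and the 1 MB default chunk size are assumptions chosen for illustration). It reads the file in fixed-size chunks, so memory use stays roughly constant whatever the file size:

```python
def count_newlines(path, chunk_size=1024 * 1024):
    """Count newline bytes by reading fixed-size chunks, never the whole file."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # at most chunk_size bytes are held at once
            if not chunk:               # an empty read means end of file
                break
            count += chunk.count(b"\n")
    return count
```

The same loop shape works for any scan that can be done chunk by chunk. Anything that can straddle a chunk boundary (a multi-byte search string, for example) needs extra care that this sketch does not attempt.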

        • Ricardo @Maor Bachar

          @Maor-Bachar said:

          Seems like it doesn’t have an XML style?

          I think they unfortunately restrict extra coloring schemes to their paid version: http://www.editpadpro.com/cscs.html

          • AJ Baxter

            What size XML can you load in N++?

            • Ricardo

              I don’t think file type matters, but…
              If I have no plugins and no other file opened, I can open files up to around 362MB before it displays the “File is too big to be opened” error.
              But with my plugins enabled, the limit drops to about 150MB.

              • Ricardo

                Hello, for people needing an x64 build of Notepad++, I made one here.

                Notes:

                1. This is an easy unofficial build – not tested by devs.
                2. You need 64-bit OS.
                3. It is not compatible with 32-bit plugins (i.e. all available plugins). You need to rename your /plugins folder or move it elsewhere.
                4. It can open very large files. Important: make sure “Word Wrap” is disabled before opening such files.
                5. The 7z package above contains only the binaries you need to replace. Please back up your current 32-bit files (rename or move them).

                • David Bailey

                  I really think that anyone who needs to edit a 600MB XML file should be thinking of alternatives! I mean, those files contain structures, and you really need to use something that will respect that structure - which isn’t a text editor.

                  I don’t know what this file contains, but it might be useful to arrange for it to be stored as a number of much smaller files. Alternatively, I guess you could load its structure into Mathematica, manipulate it, and write it out again!
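As one example of a tool that respects the XML structure without holding the whole document in memory, here is a hedged Python sketch using the standard library's `xml.etree.ElementTree.iterparse`. The function name `count_elements` and the idea of counting one tag are assumptions for illustration; nothing in the thread says what the file actually contains.

```python
import xml.etree.ElementTree as ET


def count_elements(path, tag):
    """Count <tag> elements while parsing the file incrementally."""
    count = 0
    context = ET.iterparse(path, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        # An "end" event fires once an element has been completely parsed.
        if event == "end" and elem.tag == tag:
            count += 1
            # Drop children already attached to the root so that
            # memory use stays bounded instead of growing with the file.
            root.clear()
    return count
```

Instead of counting, the loop body could just as well write each completed element out to its own smaller file, which is one way to approach the "number of much smaller files" idea.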

                  I don’t like to see a tool like NP++ being pushed to perform basically ridiculous tasks.

                  David

                  • Klaus Lehmann

                    hi david
                    excuse me.
                    be happy, that You never had to edit a 10GB-xml-file.
                    For that, Notepad++ is NOT your friend ;-)
                    But I want to give Ricardo’s 64-bit edition a try! It’s new to me ;-)
                    Yours klaus

                    • David Bailey

                      Klaus,

                      “be happy, that You never had to edit a 10GB-xml-file.”

                      I am indeed - but I am also happy that I have never tried to solder electronic components with a blow torch, or fry an egg on a smoothing iron! Even if I had, I wouldn’t try to make suggestions on a blow torch forum regarding improvements to blow torches that might make that more feasible!

                      I think a lot of software gradually bloats out - both in bytes and in terms of complexity. Ultimately valuable software can die that way! I think NP++ has remained consistently focussed on providing solutions for those who need to edit normally sized text files - particularly program source - and trying to add on outlandish capabilities would be a severe distraction.

                      Switching to 64-bits would probably only be the first step in a project to edit 10 GB data files, because I am sure there must be many processes inside NP++ that depend on being able to scan across an entire file in a sensible amount of time.

                      It might be more constructive if you described how you got into this mess, and someone might be able to offer some constructive suggestions!

                      As a preliminary suggestion, you could write a C program that reads the file and recognises the data you want to change. You could then modify the program to actually perform the change (but I would back up your file before you start :) ).
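The same read, recognise, then change approach can be sketched in a few lines; the version below is in Python rather than C, and the function name `replace_in_file` and its parameters are hypothetical. It assumes the text to change never spans a line break, and it writes to a separate output file so the original stays untouched as a backup:

```python
def replace_in_file(src_path, dst_path, old, new, encoding="utf-8"):
    """Copy src to dst line by line, replacing `old` with `new`.

    Returns the number of lines that were changed.
    """
    changed = 0
    # newline="" keeps the original line endings (LF or CRLF) untouched.
    with open(src_path, "r", encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:  # only one line is held in memory at a time
            if old in line:
                line = line.replace(old, new)
                changed += 1
            dst.write(line)
    return changed
```

Running it first with `new` equal to `old` gives a harmless dry run: the return value tells you how many lines would be affected before you commit to a real change.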

                      David

                      • Ricardo

                        woot
                        For editing a 10GB file, I guess the best option would be an editor that doesn’t load the entire file into memory, but reads it directly from disk. Otherwise, you would need a lot of RAM.
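A rough idea of what "reading directly from disk" means in practice: seek to the part you want and touch only those bytes. This is a minimal Python sketch, not how any particular editor is implemented; `read_window` and `overwrite_at` are hypothetical helper names.

```python
def read_window(path, offset, size):
    """Return up to `size` bytes starting at byte `offset`, reading nothing else."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)


def overwrite_at(path, offset, data):
    """Overwrite len(data) bytes at `offset` in place.

    The file size is unchanged as long as the range lies inside the file.
    """
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(data)
```

In-place overwriting only covers replacements of the same length. Inserting or deleting bytes shifts everything after the edit, which is why real large-file editors need more elaborate bookkeeping than this.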

                        • guy038

                          Hello Maor Bachar and All,

                          Why not try the old but excellent scripting program gawk, which can be used on both Unix and Windows machines (with the appropriate executable, of course)?

                          AFAIK, you can get the latest Windows version of gawk.exe (v4.1.0) at the address below:

                          https://code.google.com/p/gnu-on-windows/downloads/detail?name=gawk-4.1.0-bin.zip&can=2&q=

                          To get an overview of the main features of gawk and what’s new in v4.1.0, follow this link:

                          http://www.drdobbs.com/open-source/gnu-awk-this-is-not-your-fathers-awk/240158351

                          And here is the link to download the latest reference manual for gawk v4.1.x, in various formats:

                          http://www.gnu.org/software/gawk/manual/

                          The PDF version is quite recent: April 2015!


                          To test it, I created a 1 GB file. (I didn’t create a 10 GB file, as my old Win XP laptop, with only two 40 GB partitions, would not accept it!) But when I go back to work (I’m presently on holiday), I’ll be able to build bigger files!

                          Briefly, I first created, in N++, a file of about 200 MB named xxx.txt. Then I copied this file five times into All.txt, in a DOS window, with the commands:

                          type xxx.txt>All.txt
                          type xxx.txt>>All.txt
                          type xxx.txt>>All.txt
                          type xxx.txt>>All.txt
                          type xxx.txt>>All.txt
                          

                          I ended the All.txt file with a last line:

                          echo END of the FILE>>All.txt
                          

                          Line lengths range from 0 to 150 characters. In the end, All.txt contains about 18,498,000 lines, for 1026 MB!


                          I used the simple gawk script below, named Script.txt. It’s IMPORTANT to note that this file is ANSI encoded.

                          #----------------------------------------------------------------------------------------------------------------#
                          #  SYNTAX to RUN, in a CMD windows, from the FOLDER, where are the 3 files GAWK.exe, Script.txt and All.txt      #
                          #            ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯                        ¯¯¯¯¯¯¯¯  ¯¯¯¯¯¯¯¯¯¯     ¯¯¯¯¯¯¯      #
                          #                                                                                                                #
                          #  SYNTAX :                                                                                                      #
                          #  ¯¯¯¯¯¯                                                                                                        #
                          #                                                                                                                #
                          #    echo [Regex|String]|gawk -f Script.txt - All.txt[ File2.txt[ File3[ ...]]][ >[>]See.txt]                    #
                          #----------------------------------------------------------------------------------------------------------------#
                          
                          {
                            if ( pattern == "" )
                              {
                          
                                #---------------------------------------------------------------------------#
                                #  The STANDARD INPUT -  is read, BEFORE the USER FILE(S), so ALL the TEXT  #
                                #    of the DOS COMMAND 'echo' will be STORED in the VARIABLE 'pattern'     #
                                #                                                                           #
                                #   Note : No BLANK, at END of "echo" command, unless part of the REGEX     #
                                #                                                                           #
                                #  IF the PATTERN is NOT initialized  :                                     #
                                #    SET the VARIABLE 'total' to the value 0                                #
                                #    SET the VARIABLE 'pattern' to the value of the LINE field $0           #
                                #      = ALL the TEXT of the DOS COMMAND 'echo'                             #
                                #    SKIP to the NEXT line to READ                                          #
                                #---------------------------------------------------------------------------#
                          
                                total = 0 ; pattern = $0
                          
                                next
                              }
                          }
                          
                          #------------------------------------------------------------------------------#
                          #    IF the CURRENT line MATCHES the PATTERN of the VARIABLE 'pattern' :       #
                          #      INCREMENT, by ONE, the VARIABLE 'total' [ and PRINT the CURRENT line ]  #
                          #    SKIP to the NEXT line to READ, in ALL cases                               #
                          #------------------------------------------------------------------------------#
                          
                          {
                            if ($0 ~ pattern) { ++total }
                          
                          #   OR     if ($0 ~ pattern) { ++total ; print }
                          
                            next
                          }
                          
                          
                          #-------------------------------------------------------------------------------------------------------------------#
                          #  AFTER ALL the USER files are READ, DISPLAYS the NUMBER of lines, MATCHING the PATTERN of the VARIABLE 'pattern'  #
                          #-------------------------------------------------------------------------------------------------------------------#
                          
                          END { print "\n  Number of lines MATCHING \x22" pattern "\x22  : " , total }
                          

                          The command echo .*|gawk -f Script.txt - All.txt gives me 18,498,231, which is the number of lines matching the regex .* (i.e. every line).

                          The command echo END of the FILE|gawk -f Script.txt - All.txt gives me 1, showing that it correctly read the last line, the 18,498,231st, of the 1 GB file All.txt!

                          On my old Win XP laptop, with only 1 GB of RAM (2 × 512 MB), I got the result from each command in about 1 min 30 s! Not too bad, is it?

                          Of course, my script just counts the matches but, with gawk, you can manipulate files in many ways, do calculations, searches/replacements, and so on. It’s a very powerful tool, although tiny in size: 223,246 bytes for the v4.1 version.

                          You’ll just have to skim the gawk documentation a bit!

                          And, as Ricardo suggested, it seems to read files directly from disk :-) => seemingly no limit on file size.

                          Best Regards,

                          guy038

                          • Klaus Lehmann @Ricardo

                            @Ricardo
                            I think Crisp (64-bit) can do it!
                            But it costs 100-250 per year!

                            And for my biggest files (10GB files full of rubbish XML gore), I have to use VEDIT Pro (64-bit).
                            Attention: high price! $240 (no joke!)

                            Yours klaus
