Split and extract the first 1000 (N) rows from a text-csv-data file in Windows

One of the major annoyances for me working with very large datasets you get from someone else is that you often don't know the format the file is coming in, not a problem really just take a peak in Notepad++ or Excel and you are away. But quite often these files are millions of rows in length and it is just not possible to work it out.

So I've been trying to chip away at the problem for a while and after I finally got past the first stumbling block of having a computer so lock down I couldn't do a thing with it I was still coming up a little short.

Looking through the defacto bible for any such problem, Stackoverflow, it looks like powershell is what were after running it from a command prompt:

powershell -command "& {get-content input.txt|select-object -first 10}" >output.txt

However there are few failings here:

  • we can't be bothered with all the faff of getting the correct path
  • it actually outputs it through the cmd console before writing it to the file thereby wrapping it the lines at the default width of 80 lines
  • if you have a very large file it will read through the whole serveral Gbs when it doesnt have to

Drag 'n' Drop

We can solve the first problem by creating a drag and drop batch file with the the output file the same as the input but with a predefined suffix, just paste the following code into a file called first_1000_rows.bat:

powershell -command "& {get-content %1|select-object -first 1000}" >%~n1"_1000"%~x1

then all we need to do is to drag the file we want to split and drop it over the batch file and hey presto.

No-wrap

Next task is we have to solve the wrapping issue, a little tweak will see us outputting it as part of the powershell process not through the command console using the set-content command:

powershell -command "& {get-content %1|select-object -first 1000|set-content -path %~n1"_1000"%~x1}"

Optimisation

Finally we'll make a little tweak to stop after the first end rows, easy when you know how, just set the row limit on the get-content command:

powershell -command "& {get-content %1 -TotalCount 1000|set-content -path %~n1"_1000"%~x1}"

Job's a good-un as the saying goes and now we have an extract of the file we can easily have a butchers at.

Dealing with spaces in paths

Unfortunately the Powershell - DOS interaction doesn't play nicely with spaces so we have to deal with them somehow as well:

powershell -command "& {get-content '%~1' -TotalCount 1000|set-content -path '%~dp1%~n1_1000%~x1'}"

Finished

Job's a good-un as the saying goes and now we have an extract of the file we can easily have a butchers at.


Search Posts

Back to top