One of the major annoyances for me working with very large datasets you get from someone else is that you often don't know the format the file is coming in, not a problem really just take a peak in Notepad++ or Excel and you are away. But quite often these files are millions of rows in length and it is just not possible to work it out.
So I've been trying to chip away at the problem for a while and after I finally got past the first stumbling block of having a computer so lock down I couldn't do a thing with it I was still coming up a little short.
Looking through the defacto bible for any such problem, Stackoverflow, it looks like powershell is what were after running it from a command prompt:
powershell -command "& {get-content input.txt|select-object -first 10}" >output.txt
However there are few failings here:
We can solve the first problem by creating a drag and drop batch file with the the output file the same as the input but with a predefined suffix, just paste the following code into a file called first_1000_rows.bat
:
powershell -command "& {get-content %1|select-object -first 1000}" >%~n1"_1000"%~x1
then all we need to do is to drag the file we want to split and drop it over the batch file and hey presto.
Next task is we have to solve the wrapping issue, a little tweak will see us outputting it as part of the powershell process not through the command console using the set-content command:
powershell -command "& {get-content %1|select-object -first 1000|set-content -path %~n1"_1000"%~x1}"
Finally we'll make a little tweak to stop after the first end rows, easy when you know how, just set the row limit on the get-content command:
powershell -command "& {get-content %1 -TotalCount 1000|set-content -path %~n1"_1000"%~x1}"
Job's a good-un as the saying goes and now we have an extract of the file we can easily have a butchers at.
Unfortunately the Powershell - DOS interaction doesn't play nicely with spaces so we have to deal with them somehow as well:
powershell -command "& {get-content '%~1' -TotalCount 1000|set-content -path '%~dp1%~n1_1000%~x1'}"
Job's a good-un as the saying goes and now we have an extract of the file we can easily have a butchers at.