
I would like to divide multiple text files in a directory into many smaller text files by a given character count. For example, I want each file in the directory to be divided into smaller text files of 100 characters each. From what I understand, the split command in Linux only works by lines, not by character count, so I'm not sure if that would work.

Edit: I am also interested in finding out how to divide the text files by word count.

  • Sorry @John1024 I meant character count. I edited the post.
    – Myth
    Commented Jun 30, 2020 at 0:40
  • If you split text files by character count, they will not (strictly) be text files any more. Many lines will be split in the middle, so many files will end with an unterminated (and incomplete) line, and many will start with an incomplete line. Somewhere down the track, that is going to create issues for you.
    Commented Jun 30, 2020 at 9:40
  • @Paul_Pedant How might I do it by word count instead?
    – Myth
    Commented Jun 30, 2020 at 10:23

2 Answers


If the files are ASCII text, you can use split -b100. This means 100 bytes, which is always 100 ASCII characters.
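A minimal sketch of that, using an illustrative input file (the names input.txt and part. are just examples):

```shell
# Build a 250-character ASCII sample file, then split it into
# 100-byte pieces named part.aa, part.ab, part.ac.
head -c 250 /dev/zero | tr '\0' 'A' > input.txt
split -b100 input.txt part.
wc -c part.*   # the first two pieces are exactly 100 bytes; the last holds the remaining 50
```

The ASCII restriction matters: with a multi-byte encoding such as UTF-8, -b can cut a character in the middle, because split counts bytes, not characters.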

  • Thanks, it is in ASCII. How would I do this for each file in the directory and clean it up by removing the original files?
    – Myth
    Commented Jun 30, 2020 at 1:20
  • I would be careful with automatically removing files. I always prefer to clean them up later, after I've made sure that the new files are correct. That said, it's easy: for file in *; do split -b100 "$file" && rm "$file"; done
    Commented Jun 30, 2020 at 1:43
  • Would there also be a way to do the same thing with word count instead?
    – Myth
    Commented Jun 30, 2020 at 4:15
  • Not that I know of, sorry.
    Commented Jun 30, 2020 at 19:27

Not precisely what you asked for, but it may be adapted.

This processes all files with suffix .txt in the current directory. For each file (e.g. Cairo.txt):

  1. It uses tr to replace all whitespace with newlines, producing a plain one-word-per-line list.
  2. It uses fmt to pack a whole number of words into lines, up to a specified length.
  3. It uses split to turn those lines into a series of files named Cairo.seq.0000 and up.

For testability, I used width 60 and lines 30, and my input was three plain-text man pages generated with this:

for cmd in tr fmt split; do man $cmd | col -b > $cmd.txt; done 

This is the script:

#! /bin/bash
for fn in ./*.txt; do
    Base="${fn%.txt}"                        # e.g. ./Cairo for ./Cairo.txt
    tr -s '[:space:]' '\n' < "${fn}" |       # one word per line
        fmt -60 |                            # pack whole words into lines up to 60 chars
        split -a 4 -d -l 30 - "${Base}.seq." # 30 lines per numbered output file
done

The line width is the "60" in the fmt command. So you might want to make this 100.

The number of lines per output file is the "30" in the split command. You seemed to want this to be 1 line per file. However, you are going to get a lot of small files like that. A 100-byte file still takes a 4096-byte block.

You can see that the number of words is unchanged, but the whitespace is reduced, and the lines are fewer.

paul $ wc *
  29  214  1561 fmt.seq.0000
  61  214  1832 fmt.txt
  30  260  1665 split.seq.0000
  15  101   780 split.seq.0001
  94  361  2892 split.txt
  30  263  1724 tr.seq.0000
  18  126   929 tr.seq.0001
 124  389  3282 tr.txt
 410 1955 14821 total
paul $
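If an exact word count per output file is what matters, rather than line width, the same idea works without fmt: put one word per line, then let split count lines. A sketch, assuming 100 words per output file (the names file.txt and file.words. are illustrative):

```shell
# Split file.txt into pieces of exactly 100 words each:
# tr puts one word per line, then split takes 100 lines per piece.
tr -s '[:space:]' '\n' < file.txt | split -a 4 -d -l 100 - file.words.
```

Each piece then holds one word per line; fmt can rewrap the pieces afterwards if you want longer lines.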
