I've taken a plain-text book from Project Gutenberg (around 0.5 MB), which I want to concatenate to itself n times to generate a large text file that I can benchmark some algorithms on. Is there a Linux command I can use to achieve this? cat sounds ideal, but it doesn't seem to play nicely with concatenating a file onto itself, and it doesn't directly address the "n times" part of the question.
Use some kind of loop, and appending? So repeat foo.txt>>bar.txt and wrap that up in something that will run the command that many times? – Journeyman Geek Sep 22 '11 at 12:32
4 Answers
There are two parts to this. First, use cat to write the text file to standard output and append that to another file with >>; for example, cat foo.txt >> bar.txt appends foo.txt to bar.txt.
Then run it n times with:
for i in {1..n};do cat foo.txt >> bar.txt; done
replacing n in that command with your number.
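In bash, brace expansion happens before variable expansion, so {1..$n} does not produce a sequence when the count is stored in a variable. A minimal sketch for that case, using a counting loop and a single redirection around the whole loop; the function name repeat_file and its argument order are made up for illustration:

```shell
# repeat_file SRC N DST: write N copies of SRC to DST (hypothetical helper).
repeat_file() {
    local src="$1" n="$2" dst="$3" i=0
    # The output file is opened once for the whole loop, not once per copy.
    while [ "$i" -lt "$n" ]; do
        cat -- "$src"
        i=$((i + 1))
    done > "$dst"
}

# Demo on a throwaway two-line file.
demo_dir=$(mktemp -d)
printf 'foo\nbar\n' > "$demo_dir/foo.txt"
repeat_file "$demo_dir/foo.txt" 5 "$demo_dir/bar.txt"
wc -l < "$demo_dir/bar.txt"   # 2 lines x 5 copies
```

Note that this version truncates the destination first, whereas the >> form keeps whatever was already in it.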
If you use csh, there's the 'repeat' command.
The repeat-related parts of this answer are copied from here, and I tested it on an Ubuntu 11.04 system in the default bash shell.
Fun fact: this actually works without replacing 'n', in which case it'll execute the body once for each character between ASCII '1' and ASCII 'n' (so 62 times). But `{1..12}` will correctly run the body 12 times. – Arnout Engelen Mar 25 '16 at 20:25
You might want to just redirect the whole pipeline, rather than appending in each iteration: `for i in {1..n};do cat foo.txt; done > bar.txt` – Toby Speight Mar 02 '17 at 12:59
You certainly can use cat for this:
$ cat /tmp/f
foo
$ cat /tmp/f /tmp/f
foo
foo
To get $n copies, you could use yes piped into head -n $n:
$ yes /tmp/f | head -n 10
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
/tmp/f
Putting that together gives
yes /tmp/f | head -n $n | xargs cat >/tmp/output
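By default xargs splits its input on whitespace, so the pipeline above breaks if the path contains a space. A sketch of the same idea wrapped in a function, assuming an xargs that supports -0 (GNU and BSD do, POSIX does not require it); cat_n_times is a hypothetical name:

```shell
# cat_n_times SRC N DST: hypothetical wrapper around the yes | head | xargs pipeline.
cat_n_times() {
    # tr + xargs -0 keeps a path containing spaces as a single argument
    # (-0 is a GNU/BSD extension, not POSIX).
    yes "$1" | head -n "$2" | tr '\n' '\0' | xargs -0 cat -- > "$3"
}

demo_dir=$(mktemp -d)
printf 'foo\n' > "$demo_dir/my file"
cat_n_times "$demo_dir/my file" 10 "$demo_dir/output"
wc -l < "$demo_dir/output"   # 1 line x 10 copies
```

For a very large count xargs may run cat more than once to stay under the argument-length limit; the output still arrives in order, so the result is the same.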
I am bored so here are a few more methods on how to concatenate a file to itself, mostly with head as a crutch. Pardon me if I overexplain myself, I just like saying things :P
Assuming N is the number of self concatenations you want to do and that your file is named file.
Variables:
linecount=$(<file wc -l)
total_repeats=$(echo "2^$N - 1" | bc) # obtained through the power of MATH
total_lines=$((linecount*(total_repeats+1)))
tmp=$(mktemp --suffix .concat.self)
Given a copy of file called file2, total_repeats is the number of times file would need to be added to file2 to make it the same as if file was concatenated to itself N times.
Said MATH is here, more or less: MATH (gist)
It's first-semester computer science stuff, but it's been a while since I did an induction proof so I can't get over it... (also this class of recursion is pretty well known to be 2^Loops so there is that too....)
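The count follows from doubling: each self-concatenation doubles the number of copies, so after N rounds the file holds 2^N copies of the original, which is 2^N - 1 copies added to the first one. A small sketch of the same arithmetic with shell arithmetic instead of bc (a substitution, not what the variables above use); N=3 and linecount=3 are example values only:

```shell
N=3                # example value: number of self-concatenations
linecount=3        # example value: lines in the original file

copies=1
i=0
while [ "$i" -lt "$N" ]; do
    copies=$((copies * 2))   # each round doubles the number of copies
    i=$((i + 1))
done

total_repeats=$((copies - 1))                     # 2^N - 1
total_lines=$((linecount * (total_repeats + 1)))  # linecount * 2^N
echo "copies=$copies total_repeats=$total_repeats total_lines=$total_lines"
# prints: copies=8 total_repeats=7 total_lines=24
```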
POSIX
I use a few non-posix things but they are not essential. For my purposes:
yes() { while true; do echo "$1"; done; }
Oh, I only used that. Oh well, the section is already here...
Methods
head with linecount tracking.
ln=$linecount
for i in $(seq 1 $N); do
    <file head -n $ln >> file
    ln=$((ln*2))
done
No temp file, no cat, not even too much math yet, all joy.
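To see the doubling in action without touching a real file, here is a self-contained sketch of the same loop on a throwaway three-line file; demo_file and N=3 are example values, and the loop counter is written with test and arithmetic expansion rather than seq:

```shell
demo_file=$(mktemp)
printf '1\n2\n3\n' > "$demo_file"

N=3
linecount=$(( $(wc -l < "$demo_file") ))
ln=$linecount
i=0
while [ "$i" -lt "$N" ]; do
    # head stops after $ln lines, all of which exist before the append starts.
    head -n "$ln" < "$demo_file" >> "$demo_file"
    ln=$((ln * 2))
    i=$((i + 1))
done

wc -l < "$demo_file"   # 3 lines * 2^3 = 24
```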
tee with MATH
<file tee -a file | head -n $total_lines > $tmp
cat $tmp > file
Here tee is reading from file but perpetually appending to it, so it will keep reading the file on repeat until head stops it. And we know when to stop it because of MATH. The appending goes overboard though, so I used a temp file. You could trim the excess lines from file too.
eval, the lord of darkness!
eval "cat $(yes file | head -n $((total_repeats+1)) | tr '\n' ' ')" > $tmp
cat $tmp > file
This just expands to cat file file file ... and evals it.
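The same argument list can be built without eval. A sketch that accumulates the file name in the positional parameters instead, which also copes with spaces in the name; cat_copies is a hypothetical name, and this swaps eval for set -- rather than reproducing the method above:

```shell
# cat_copies FILE COUNT: print FILE to stdout COUNT times with a single cat.
cat_copies() {
    local file="$1" count="$2" i=0
    set --                       # clear the positional parameters
    while [ "$i" -lt "$count" ]; do
        set -- "$@" "$file"      # append one more copy of the name
        i=$((i + 1))
    done
    if [ "$#" -gt 0 ]; then
        cat -- "$@"
    fi
}

demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir/file"
cat_copies "$demo_dir/file" 4 > "$demo_dir/out"
wc -l < "$demo_dir/out"   # 2 lines x 4 copies
```

The output goes to a separate file here, so cat never reads from the file it is writing to.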
You can do it without the $tmp file, too:
eval "cat $(yes file | head -n $total_repeats | tr '\n' ' ')" |
head -n $((total_lines-linecount)) >> file
The second head "tricks" cat by putting a middle man between it and the write operation. You could trick cat with another cat as well but that has inconsistent behavior. Try this:
test_double_cat() {
    local Expected=0
    local Got=0
    local R=0
    local file="$(mktemp --suffix .double.cat)"
    for i in $(seq 1 100); do
        # Reset the temp file to three lines on every round.
        printf "" > "$file"
        echo "1" >> "$file"
        echo "2" >> "$file"
        echo "3" >> "$file"
        # Read the line counts from the temp file, not from a file named "file".
        Expected=$((3*$(<"$file" wc -l)))
        cat "$file" "$file" | cat >> "$file"
        Got=$(<"$file" wc -l)
        [ "$Expected" = "$Got" ] && R="$((R+1))"
    done
    echo "Got it right $R/100"
    rm "$file"
}
sed:
<file tr '\n' '\0' |
sed -e "s/.*/$(yes '\0' | head -n $total_repeats | tr -d '\n')/g" |
tr '\0' '\n' >> file
Forces sed into reading the entire file as a line, captures all of it, then pastes it $total_repeats number of times.
This will fail of course if you have any null characters in your file. Pick one that you know isn't there.
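find_missing_char() {
    local file="${1:-/dev/stdin}" used b
    # Every byte value present in the input, one per line, sorted and unique.
    used="$(od -An -v -tu1 "$file" | tr -s ' ' '\n' | grep -v '^$' | sort -un)"
    # Print the first absent byte value as an octal escape that tr understands.
    for b in $(seq 0 255); do
        if ! echo "$used" | grep -qx "$b"; then
            printf '\\%03o\n' "$b"
            return 0
        fi
    done
    return 1    # every byte value occurs in the input
}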
That's all for now lads, I hope this arbitrary answer didn't bother anyone. I tested all of them many times but I am only a two-year shell user so keep that in mind I guess. Now to sleep...
rm $tmp
You might be able to use tee for this. tee -a x x will append the same lines twice to file x.
Now we need to write x $N times. We can do that with yes x|head -n $N, giving
<file tee -a $(yes outfile|head -n $N)
Demo:
$ cat foo
foo
bar
$ tee -a $(yes x|head -5) <foo >/dev/null
$ cat x
foo
bar
foo
bar
foo
bar
foo
bar
foo
bar
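A self-contained version of the demo, for checking the result by line count. Two caveats, stated as my understanding rather than as documented guarantees: tee opens the output once per argument, so N is bounded by the open-file limit (ulimit -n), and tee copies its input chunk by chunk to each output in turn, so for an input larger than one read buffer the copies may come out interleaved chunk by chunk instead of back to back. It is worth verifying on the real file before benchmarking. The names demo_dir and N=5 are example values:

```shell
demo_dir=$(mktemp -d)
printf 'foo\nbar\n' > "$demo_dir/foo"
N=5

(
    cd "$demo_dir" || exit 1
    # Unquoted on purpose: the substitution must split into N separate
    # arguments, so the output name must not contain whitespace or globs.
    tee -a $(yes x | head -n "$N") < foo > /dev/null
)

wc -l < "$demo_dir/x"   # 2 lines x 5 copies
```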