I have the lyrics to a song. (.txt)

I also have lyrics to 50 other songs.

I'm looking for a way to analyse/search those 50 song lyrics with the lyrics to the first song, and find which one of the 50 is most similar to the first (based on shared words/vocabulary).

Sorry for the layman's terms; this isn't my area of knowledge(!)

Any help or pointers would be much appreciated

Oliver Salzburg
ioloiol
  • It would help to have some sample input and desired output. – fedorqui Oct 22 '14 at 10:55
  • Take the lyrics to the Billboard top 50. If you took the current Billboard number 1 and searched its lyrics against 50 other songs, the output you might get would be: SORTED BY MOST SIMILAR 1) "MAROON 5 - ANIMALS", shared words = trouble, hat, coat, me, someone, help, cake etc 2) "ARTIST - SONG NAME", shared words = pretty, never, only, help, etc 3) "ARTIST - SONG NAME", shared words = x, x, x, x. The ideal desired output would be the 50 song lyrics sorted by 'similarity', with the shared words highlighted. Hope this helps, I work in video - complex search is out of my comfort zone! Cheers – ioloiol Oct 22 '14 at 16:52
  • One problem is that you would need a list of words to ignore like 'The' 'And' 'la la la' etc. – Jack Oct 07 '15 at 08:21

1 Answer

Here's my solution. I've assumed that you only care how many distinct words match, not how many times each one occurs (e.g. 'Baby' appearing 5 times in both songs does not score 5x as many 'points').

First:

tr '[:upper:]' '[:lower:]' < songname.txt | tr -cd '[:alnum:][:space:]' | tr -s '[:space:]' '\n' | sort -u > songnamewords.txt

This lowercases everything (so 'Baby' and 'baby' count as the same word), removes every character that isn't a letter, digit or whitespace (commas, apostrophes and so on), puts every word on a separate line, then sorts the words and removes duplicate lines.
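
To make that step easier to follow, here is a minimal sketch that wraps the same idea in a shell function and runs it on two made-up lines. The function name `wordlist` and the sample text are my own inventions for illustration. This variant also lowercases the text, so that 'Baby' and 'baby' end up as one word.

```shell
# wordlist: read lyrics on stdin, print one unique lowercase word per line.
# (Hypothetical helper name, used only for this illustration.)
wordlist() {
    tr '[:upper:]' '[:lower:]' |          # fold case
        tr -cd '[:alnum:][:space:]' |     # drop punctuation
        tr -s '[:space:]' '\n' |          # one word per line
        sort -u                           # sort and remove duplicates
}

# Made-up sample input, not real lyrics.
printf 'Baby, baby!\nDon'\''t go\n' | wordlist
```

For this input it prints baby, dont and go, one per line. Note that "Don't" becomes "dont" because the apostrophe is removed rather than treated as a word break.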

You need to do this to all the songs you want to compare, then secondly:

cat songname1words.txt songname2words.txt | sort | uniq -d | wc -l

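This prints the number of words the two songs share. Each word list contains no duplicates of its own, so any word that shows up twice in the combined list must be in both songs.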
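
If you'd rather see which words matched than just how many, `comm -12` prints the lines common to two sorted files. A small sketch, assuming both word lists are sorted with no duplicates as produced by the first step; the function name `shared_words` and the tiny word lists are made up for the example.

```shell
# shared_words: print the words that appear in both word-list files.
# Both files must already be sorted (hypothetical helper name).
shared_words() {
    comm -12 "$1" "$2"
}

# Example with two tiny made-up word lists in a temporary directory.
tmp=$(mktemp -d)
printf 'baby\nhelp\nme\n' > "$tmp/a.txt"
printf 'cake\nhelp\nme\n' > "$tmp/b.txt"
shared_words "$tmp/a.txt" "$tmp/b.txt"          # lists the shared words
shared_words "$tmp/a.txt" "$tmp/b.txt" | wc -l  # counts them
rm -r "$tmp"
```

Here the shared words are help and me, so the count is 2.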

I tried a few examples:

Maroon 5's Animals and Justin Bieber's Baby share 29 words.

Maroon 5's Animals and Opeth's Grand Conjuration share 10 words.

These are the kind of results you'd expect.

Also, here's how you would compare it against all other lyrics files:

a="songname1words.txt"; for f in *words.txt; do if [[ "$f" != "$a" ]]; then echo "$(cat "$a" "$f" | sort | uniq -d | wc -l) - $f"; fi; done | sort -rn

Where 'songname1words.txt' is the filename you want to compare them all against.

This compares every other file in the directory against this one (it skips comparing the file to itself), then sorts the results by score so that the number 1 match is at the top.

It gives output like this:

29 - bieberwords.txt

10 - opethwords.txt
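
To get closer to the output asked for in the comments (the shared words listed next to each score, with filler words like 'the' and 'and' ignored), something along these lines might work. It is only a sketch: the function name `rank_songs` and the optional `STOPWORDS` variable are my own, and the stop-word file is assumed to be a sorted list with one lowercase word per line that you would have to write yourself.

```shell
# rank_songs REF FILE...: compare the word list REF against each FILE.
# Prints "COUNT - FILE: shared words", best match first.
# If STOPWORDS names a sorted word-list file, those words are ignored.
# (Function and variable names are hypothetical.)
rank_songs() (
    ref=$1
    shift
    for f in "$@"; do
        [ "$f" = "$ref" ] && continue            # skip the reference itself
        shared=$(comm -12 "$ref" "$f")           # words present in both files
        if [ -n "${STOPWORDS:-}" ]; then         # drop ignored words, if any
            shared=$(printf '%s\n' "$shared" | comm -23 - "$STOPWORDS")
        fi
        count=$(printf '%s\n' "$shared" | grep -c .)
        printf '%s - %s: %s\n' "$count" "$f" "$(printf '%s' "$shared" | tr '\n' ' ')"
    done | sort -rn
)

# Usage (file names as in the examples above; stopwords.txt is hypothetical):
#   rank_songs songname1words.txt *words.txt
#   STOPWORDS=stopwords.txt rank_songs songname1words.txt *words.txt
```

It relies on every word list being sorted, which the first step already guarantees.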

Jack