liblyric - Lyrics Search Library
What is liblyric?
Searching for a song's lyrics on a popular search engine usually returns
many mostly relevant results. However, the target pages are filled with
ads, videos and images, none of which interest someone who only wants
the lyric text.
liblyric is an attempt to automate the process of scanning these
individual result pages and extracting the textual content they have in
common, in the hope that the common parts will be just the song's
lyric text.
The techniques that liblyric employs have turned out to give accurate
results more than 90% of the time.
How does liblyric work?
- To get the list of candidate pages (sites) that contain the requested
song's lyrics, liblyric contacts one of these web search engines
(Note: This no longer works because I was screen scraping and these sites
keep changing the structure of their result pages. I intend to start using
the Bing search API [which is really good!!] and rewrite liblyric in Python):
- After this, all the links are extracted from the returned HTML page and checked for site duplicates. If several pages come from the same site, one of them is kept and the rest are discarded.
- Now, all the pages returned from the above operation are downloaded, and stored in a temporary directory /tmp/liblyric/p.PID, where PID is the Process ID of the running instance of liblyric.
- All scripts and comments are removed from the downloaded HTML pages. All malformed <br> tags such as <BR>, <BR/ >, and so on are replaced by a single <br> tag, and all <br> tags are then replaced by newlines. After this, all remaining HTML tags are stripped from the downloaded pages. The pages are processed in parallel, which cuts the wait time by a fair amount.
- After this, we perform an all-to-all, 2-way approximate intersection of these downloaded, tag-stripped pages. You can think of the pages as nodes (vertices) in a fully connected asymmetric graph, where the weight on each edge is the amount of intersection between the two pages it connects.
- The above operation produces an intermediate file called extents.txt. This file contains many rows, and each row stands for a single entry. The format of each row is as follows:
Extent Size    Extent Start    Extent End    File Name
File Name is the name of the file to which the extent belongs. An extent is a block of text that matches approximately between two pages. The intersection of any two pages returns the largest extent found, or nothing if no extent exceeds the internal threshold. The threshold prevents small rogue extents from popping up.
- Next, we sort the entries in this intermediate file in descending order by the Extent Size, and remove all entries where the Extent Start is less than 32. This operation produces another file called ordered_exts.txt.
- Now, if there are at least 2 entries in ordered_exts.txt, the second one is extracted, and Extent Size bytes of text starting from offset Extent Start in the file are displayed after passing them through some other filters. If there is just one entry, that one is displayed; otherwise an error saying that no lyrics were found is displayed. The choice of using the second entry in this file is a purely empirical one.
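The site-duplicate check described above could be sketched in Python roughly as follows. This is only an illustration, not liblyric's actual code: the function name dedupe_by_site is hypothetical, a "site" is assumed to mean the URL's host name with any leading "www." removed, and the first link seen for each site is the one kept (the text only says that one of them is kept).

```python
from urllib.parse import urlparse


def dedupe_by_site(links):
    """Keep one link per site, dropping later links from the same host.

    Assumption: "site" means the host name, ignoring a leading "www.".
    """
    seen_hosts = set()
    kept = []
    for link in links:
        host = urlparse(link).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # Skip strings with no host part and hosts we have already kept.
        if host and host not in seen_hosts:
            seen_hosts.add(host)
            kept.append(link)
    return kept
```

Keeping the first link is a convenient choice because search engines usually list the most relevant page from a site first, but any other tie-break would satisfy the description equally well.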
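The page-cleaning step (remove scripts and comments, turn <br> variants into newlines, strip the remaining tags) might look something like the sketch below. The regular expressions and the names clean_page and clean_pages are assumptions made for illustration; regex-based stripping is a simplification and will not handle every malformed page. The sketch also folds "normalise <br>" and "replace <br> by newline" into a single substitution.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Illustrative patterns, not liblyric's own.
SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
BR_RE = re.compile(r"<\s*br\s*/?\s*>", re.IGNORECASE)  # <br>, <BR>, <BR/ >, ...
TAG_RE = re.compile(r"<[^>]*>")


def clean_page(html):
    """Return the text of one HTML page with scripts, comments and tags removed."""
    text = SCRIPT_RE.sub("", html)
    text = COMMENT_RE.sub("", text)
    text = BR_RE.sub("\n", text)  # every <br> variant becomes a newline
    return TAG_RE.sub("", text)


def clean_pages(pages):
    """Clean several pages concurrently, preserving their order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(clean_page, pages))
```

The thread pool only illustrates the per-page independence of the work. In CPython, CPU-bound regex work in threads is limited by the global interpreter lock, so a process pool would be the choice if real parallel speed-up mattered.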
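To make the all-to-all intersection more concrete, here is a rough sketch that stands in for it. Note the substitution: liblyric's matching is approximate, whereas this sketch uses the exact longest common block found by Python's difflib.SequenceMatcher, which is simpler but less tolerant of small differences between pages. The threshold value, the function names, and the convention that Extent End is exclusive are all assumptions.

```python
from difflib import SequenceMatcher

MIN_EXTENT = 20  # hypothetical threshold; the real internal value is not documented here


def largest_extent(text_a, text_b, min_size=MIN_EXTENT):
    """Return (size, start, end) of the longest block of text_a also found in text_b.

    Offsets are relative to text_a; end is exclusive (an assumption).
    Returns None if the block does not exceed min_size.
    """
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    match = matcher.find_longest_match(0, len(text_a), 0, len(text_b))
    if match.size <= min_size:
        return None
    return (match.size, match.a, match.a + match.size)


def all_pairs_extents(pages):
    """pages maps file name -> cleaned text.

    Every ordered pair is compared, so each page gets rows whose offsets
    refer to its own text. Rows are (size, start, end, file_name).
    """
    rows = []
    for name_a, text_a in pages.items():
        for name_b, text_b in pages.items():
            if name_a == name_b:
                continue
            extent = largest_extent(text_a, text_b)
            if extent is not None:
                rows.append((*extent, name_a))
    return rows
```

Comparing ordered pairs (A against B and B against A) mirrors the "2-way" wording and the asymmetric graph: the same shared text generally sits at different offsets in each page, so each direction yields its own row.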
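The final sort, filter and selection steps translate almost directly into code. The sketch below assumes the rows of extents.txt have already been parsed into (size, start, end, file_name) tuples; the function name pick_extent is hypothetical, and the later "other filters" are not shown.

```python
def pick_extent(rows, min_start=32):
    """Choose which extent to display, following the steps described above.

    rows: iterable of (size, start, end, file_name) tuples.
    Returns the chosen row, or None if no lyrics were found.
    """
    # Drop entries whose Extent Start is less than 32, then sort by
    # Extent Size in descending order (the contents of ordered_exts.txt).
    ordered = sorted(
        (row for row in rows if row[1] >= min_start),
        key=lambda row: row[0],
        reverse=True,
    )
    if len(ordered) >= 2:
        return ordered[1]  # the second entry: an empirical choice
    if len(ordered) == 1:
        return ordered[0]
    return None
```

For example, given extents of sizes 300 (start 10), 200 (start 64), 150 (start 32) and 100 (start 40), the size-300 entry is dropped because it starts before offset 32, and the size-150 entry is chosen as the second largest of those remaining.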
Paper & Software