HTML::CruftText - Remove unuseful text from HTML
       Version 0.01
SYNOPSIS
       Removes junk from HTML page text.
       This module uses a regular-expression-based approach to remove cruft
       from HTML, i.e. content or text that is very unlikely to be useful or
       interesting.
           use HTML::CruftText;
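           # 'input.html' is a placeholder file name; substitute your own.
           open (my $MYINPUTFILE, '<', 'input.html') or die "Cannot open input: $!";
           my @lines = <$MYINPUTFILE>;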
           my $de_crufted_lines = HTML::CruftText::clearCruftText( \@lines);
           ...
DESCRIPTION
       This module was developed for the Media Cloud project
       (http://mediacloud.org) as the first step in differentiating article
       text from ads, navigation, and other boilerplate text. Its approach is
       very conservative and almost never removes legitimate article text.
       However, it still leaves in a lot of cruft, so many users will want to
       do additional processing.
       Typically, the clearCruftText function is called with an array
       reference containing the lines of an HTML file. Each line is then
       altered so that the cruft text is removed. After completion, some
       lines will be entirely blank, while others will have certain text
       removed. In a few rare cases, additional HTML tags are added. The
       result is NOT GUARANTEED to be valid, balanced HTML, though some HTML
       is retained because it is extremely useful for further processing.
       Thus some users will want to run an HTML stripper over the results.
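       The post-processing step mentioned above might look roughly like the
       sketch below. It is not part of HTML::CruftText: strip_tags is a
       hypothetical helper, the sample lines are made up to stand in for the
       array reference returned by clearCruftText, and the tag-removing
       regular expression is deliberately naive (it fails on a '>' inside an
       attribute value). A dedicated HTML stripper is likely the better
       choice for real pages.

```perl
use strict;
use warnings;

# Made-up sample standing in for the array reference returned by
# HTML::CruftText::clearCruftText(); some lines are blank, others keep tags.
my $de_crufted_lines = [
    '',
    '<p>First <b>paragraph</b> of the article.</p>',
    '   ',
    '<div>Second paragraph.</div>',
];

# strip_tags() is a hypothetical helper, not provided by HTML::CruftText.
# It removes anything that looks like a tag and drops lines left empty.
sub strip_tags {
    my ($lines) = @_;
    my @text;
    for my $line (@$lines) {
        (my $plain = $line) =~ s/<[^>]*>//g;    # naive tag removal
        $plain =~ s/^\s+|\s+$//g;               # trim surrounding whitespace
        push @text, $plain if length $plain;    # skip lines that are now empty
    }
    return \@text;
}

my $text = strip_tags($de_crufted_lines);
print "$_\n" for @$text;
```

       Working on a copy of each line keeps the de-crufted array intact, so
       the retained HTML is still available to any later processing step.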
       The following tactics are used to remove cruft text:
       * Nonbody text -- anything outside of the <body></body> tags -- is
       removed
       * Text within the following tags is removed: