Squeeze.pm - Shorten text to minimum syllables by using hash table and vowel deletion
$Id: Squeeze.pm,v 1.25 1998/12/04 10:00:08 jaalto Exp $
    use Squeeze;            # import only the function
    use Squeeze qw( :ALL ); # import all functions and variables
    use English;

    while (<>)
    {
        print SqueezeText $ARG;
    }
Squeeze English text to the most compact format possible, so that it is barely readable. You should convert all text to lowercase for maximum compression, because the optimizations have been designed mostly for uncapitalised letters.
Warning: Each line is processed multiple times, so prepare for slow
conversion time.
You can use this module e.g. to preprocess text before it is sent to
electronic media that has some maximum text size limit. For example pagers
have an arbitrary text size limit, typically 200 characters, which you want
to fill as much as possible. Alternatively you may have GSM cellular phone
which is capable of receiving Short Messages (SMS), whose message size
limit is 160 characters. For demonstration of this module's SqueezeText()
function, the description text of this paragraph has been converted below.
See yourself if it's readable (Yes, it takes some time to get used to). The
compress ratio is typically 30-40%.
u _n use thi mod e.g. to prprce txt bfre i_s snt to elrnic mda has som max txt siz lim. f_xmple pag hv abitry txt siz lim, tpcly 200 chr, W/ u wnt to fll as mch as psbleAlternatvly u may hv GSM cllar P8 w_s cpble of rcivng Short msg (SMS), WS/ msg siz lim is 160 chr. 4 demonstrton of thi mods SquezText fnc , dsc txt of thi prgra has ben cnvd_ blow See uself if i_s redble (Yes, it tak som T to get usdto compr rat is tpcly 30-40
And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:
u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo elrnicMdaHasSomMaxTxtSizLim.F_xmplePag hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8 w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz limIs160Chr.4DemonstrtonOfThiModsSquezText fnc,DscTxtOfThiPrgraHasBenCnvd_Blow SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto comprRatIsTpcly30-40
A comparison of the two shows:
    Original text : 627 characters
    Level 0       : 433 characters   reduction 31 %
    Level 1       : 345 characters   reduction 45 %  (+14 improvement)
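The reduction figures follow directly from the character counts. A minimal sketch of the arithmetic, where the helper name reduction_pct is made up for illustration and is not part of the module:

```perl
use strict;
use warnings;

# Character counts quoted in the comparison above.
my $original = 627;
my %squeezed = ( 0 => 433, 1 => 345 );

# Percentage of characters removed, rounded to the nearest integer.
# (Hypothetical helper, not provided by Squeeze.pm.)
sub reduction_pct {
    my ( $before, $after ) = @_;
    return sprintf '%.0f', 100 * ( $before - $after ) / $before;
}

for my $level ( sort keys %squeezed ) {
    printf "Level %d : %d characters  reduction %d %%\n",
        $level, $squeezed{$level},
        reduction_pct( $original, $squeezed{$level} );
}
```

The "+14 improvement" is the difference between the two rounded percentages (45 - 31).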
There are a few grammar rules which are used to shorten some English tokens considerably:
A word that has _ is usually a verb
A word that has / is usually a substantive, noun, pronoun or other non-verb
For example, the following tokens must be understood before the text can be read. This is not yet like Geek code, because you do not need an external parser to understand it, just some common sense and time to adapt to the text. For a complete, up-to-date list, you have to peek at the source code.
    automatically => 'acly_'

    for           => 4
    for him       => 4h
    for her       => 4h
    for them      => 4t
    for those     => 4t

    can           => _n
    does          => _s

    it is         => i_s
    that is       => t_s
    which is      => w_s
    that are      => t_r
    which are     => w_r

    less          => -/
    more          => +/
    most          => ++

    however       => h/ver
    think         => thk_

    useful        => usful

    you           => u
    your          => u/
    you'd         => u/d
    you'll        => u/l
    they          => t/
    their         => t/r

    will          => /w
    would         => /d
    with          => w/
    without       => w/o
    which         => W/
    whose         => WS/
Time is expressed with capital letters
    time          => T
    minute        => MIN
    second        => SEC
    hour          => HH
    day           => DD
    month         => MM
    year          => YY
Other capital-letter acronyms
    phone         => P8
To add new words, e.g. to the word conversion hash table, define your
custom set and merge it with the existing ones. Do the same for
%SQZ_WXLATE_MULTI_HASH and $SQZ_ZAP_REGEXP, and then start using the
conversion function.
    use English;
    use Squeeze qw( :ALL );

    my %myExtraWordHash =
    (
          'new-word1' => 'conversion1'
        , 'new-word2' => 'conversion2'
        , 'new-word3' => 'conversion3'
        , 'new-word4' => 'conversion4'
    );

    # First take the existing tables and merge them with my
    # translation table

    my %myCustomWordHash =
    (
          %SQZ_WXLATE_HASH
        , %SQZ_WXLATE_EXTRA_HASH
        , %myExtraWordHash
    );

    my $myXlat = 0;                             # state flag

    while (<>)
    {
        # $condition is a placeholder for your own test
        if ( $condition and not $myXlat )
        {
            SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
            $myXlat = 1;
        }
        elsif ( $myXlat and not $condition )
        {
            SqueezeHashSet "reset";             # Back to default table
            $myXlat = 0;
        }

        print SqueezeText $ARG;
    }
Similarly, you can redefine the multi-word translation table by supplying
another hash reference in the call to SqueezeHashSet(). To kill more text
immediately in addition to the default, just concatenate your regexps to
$SQZ_ZAP_REGEXP.
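One way to picture the concatenation is sketched below. The variable $zap_regexp only stands in for $SQZ_ZAP_REGEXP, and both patterns are invented for illustration; the module's real default pattern is not reproduced here. With the real module the concatenation would have to happen before the first call to SqueezeText(), since the regexp is compiled at that point.

```perl
use strict;
use warnings;

# Stand-in for the module's $SQZ_ZAP_REGEXP. This value is an assumption
# made for illustration, not the module's actual default.
my $zap_regexp = '\b(?:hm|hi|hello)\b[,.!]*\s*';

# Hypothetical extra filler words to delete as well.
my $my_zap = '\b(?:well|anyway)\b[,.!]*\s*';

# Concatenate with alternation so that either pattern matches.
$zap_regexp = "(?:$zap_regexp)|(?:$my_zap)";

# Delete every match of the combined pattern, ignoring case.
sub zap {
    my ($text) = @_;
    $text =~ s/$zap_regexp//gi;
    return $text;
}

print zap("Hello, well, the meeting is at noon"), "\n";
```

Wrapping each half in (?: ... ) keeps the alternation from leaking into either pattern.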
There may be a lot of false conversions. If you think that some word squeezing went too far, please 1) turn on debug, 2) send your example text and 3) send the debug log to the maintainer. To see how the conversion goes, e.g. for the word Messages:
    use English;
    use Lingua::EN::Squeeze qw( :ALL );

    # Activate debug when the case-insensitive word "Messages" is found
    # on the line.

    SqueezeDebug( 1, '(?i)Messages' );

    $ARG = "This line has some Messages in it";
    print SqueezeText $ARG;
The defaults may not cover all possible text, so you may wish to extend the hash tables and $SQZ_ZAP_REGEXP to cope with your typical text.
$SQZ_ZAP_REGEXP
Text to kill immediately, like "Hm, Hi, Hello...". You can only set this
once, because this regexp is compiled immediately when SqueezeText()
is called for the first time.
$SQZ_OPTIMIZE_LEVEL
This controls how optimized the text will be. Currently there are only level 0 (default) and level 1, which squeezes out spaces. This improves compression by an average of 10%, but the text is harder to read. If space is tight, use this extended compression optimization.
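Judging only from the sample output earlier in this document, level 1 appears to drop a space and capitalise the first letter of the following word so that word boundaries stay visible. The sketch below imitates that effect. It is not the module's code, the function name squeeze_spaces is made up, and the real output keeps a few spaces that this sketch removes.

```perl
use strict;
use warnings;

# Hypothetical helper: remove whitespace between words and upper-case the
# first character of the word that follows.
sub squeeze_spaces {
    my ($text) = @_;
    $text =~ s/\s+(\S)/\u$1/g;
    return $text;
}

print squeeze_spaces("u _n use thi mod"), "\n";
```

Characters without an upper-case form, such as the underscore in _n, pass through unchanged.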
%SQZ_WXLATE_MULTI_HASH
Multi-word conversion hash table: "for you" => "4u" ...
%SQZ_WXLATE_HASH
Single-word conversion hash table: word => conversion. This table is applied after %SQZ_WXLATE_MULTI_HASH has been used.
%SQZ_WXLATE_EXTRA_HASH
Aggressive single-word conversions like: without => w/o. Applied last.
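The order in which the three tables are applied can be sketched as follows. The table contents are small excerpts of the tokens listed in this document; the helper names apply_table and squeeze_sketch are hypothetical, and the vowel deletion that the real SqueezeText() also performs is left out.

```perl
use strict;
use warnings;

# Tiny excerpts of the three tables, using entries listed in this document.
my %multi  = ( 'it is' => 'i_s', 'for them' => '4t' );
my %single = ( 'for' => '4', 'you' => 'u', 'can' => '_n' );
my %extra  = ( 'without' => 'w/o', 'with' => 'w/' );

# Replace whole-word occurrences of each key with its conversion.
# Longer keys go first so that "without" is tried before "with".
sub apply_table {
    my ( $text, $table ) = @_;
    my @keys = sort { length($b) <=> length($a) or $a cmp $b } keys %$table;
    for my $from (@keys) {
        $text =~ s/\b\Q$from\E\b/$table->{$from}/g;
    }
    return $text;
}

# Multi-word table first, then single words, then the aggressive extras.
sub squeeze_sketch {
    my ($text) = @_;
    $text = apply_table( $text, $_ ) for \%multi, \%single, \%extra;
    return $text;
}

print squeeze_sketch("it is for them and you can go without"), "\n";
```

Running the multi-word table first matters: once "for them" has become "4t", the single-word entry for "for" no longer finds anything to replace.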
SqueezeText($)
- Description
Squeeze text by using vowel substitutions and deletions and hash tables that guide text substitutions. The line is parsed multiple times and this will take some time.
- arg1: $text
String. Line of Text.
- Return values
String, squeezed text.
new()
- Description
Return class object.
- Return values
object.
SqueezeHashSet($;$)
- Description
Set the hash tables to use for converting text. The multiple-word conversion is done first, and after that the single-word conversions.
- arg1: \%wordHashRef
Hash reference used to convert single words. If "reset", use the default hash table.
- arg2: \%multiHashRef [optional]
Hash reference used to convert multiple words. If "reset", use the default hash table.
- Return values
None.
SqueezeControl(;$)
- Description
Select level of text squeezing: noconv, conv, med, max.
- arg1: $state
String. If nothing, set maximum squeeze level (kinda: restore defaults).
    noconv   Turn off squeeze
    conv     Turn on squeeze
    med      Set squeezing level to medium
    max      Set squeezing level to maximum
- Return values
None.
SqueezeDebug(;$$)
- Description
Activate or deactivate debug.
- arg1: $state [optional]
If not given, turn debug off. If non-zero, turn debug on. You must also supply regexp if you turn on debug, unless you have given it previously.
- arg2: $regexp [optional]
If given, use regexp to trigger debug output when debug is on.
- Return values
None.
The author can be reached at jari.aalto@poboxes.com. The home page, via a forwarding service, is at http://www.netforward.com/poboxes/?jari.aalto or alternatively the absolute URL is ftp://cs.uta.fi/pub/ssjaaa/ but this may move without notice. Prefer keeping the forwarding service link in your bookmarks.
Latest version of this module can be found at $CPAN/modules/by-module/Lingua/
Copyright (C) 1998-1999 Jari Aalto. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself or under the terms of the GNU General Public Licence v2 or later.