Thursday, April 17, 2008

Searching through PDF and DOC files with TWiki, KinoSearch and IndigoPerl

Wikis are great, but it has been bothering me that I couldn't search the contents of Adobe PDF or Microsoft Word DOC files. IMHO, one of the biggest benefits of a wiki is the ability to search it. If lots of documents are unsearchable, there is a lot of information I can't find when I need it.

 

TWiki runs well on Windows. It's written entirely in Perl (in my case, running on IndigoPerl). There are several TWiki extensions and add-ons for searching PDF and DOC files, but I couldn't get them to work because they all depend on Perl modules that wouldn't install on IndigoPerl via the CPAN shell.

 

After some trial and error I've found that a module will sometimes fail to install from the CPAN shell (i.e., via the CPAN Perl module) but will install without problems by hand. In some cases you have to fetch an earlier version of the module and build that, because the most recent version won't pass its build tests. The CPAN Search Site is an absolute godsend when it comes to manually installing CPAN modules.

 

Anyway, these are the steps I took to get the SearchEngineKinoSearch TWiki plugin working with TWiki (4.0.5) on Windows (Server 2003 R2) using IndigoPerl (5.8.6):


  1. Install VC++ from Visual Studio .NET 2003 into c:\vs2003 (or any directory whose name has no spaces and is shorter than 9 characters). Don't forget to include the Windows SDK tools.
  2. Find and download a copy of pl2bat.pl. Save it to your IndigoPerl bin directory.
  3. When manually installing Perl modules, always use the Visual Studio 2003 command prompt (or run vcvars32.bat first) so that IndigoPerl can find the VC++ compiler and nmake. A manual install is usually as simple as extracting the module into a directory, changing into that directory and running the following commands:

    1. perl Build.PL (for older modules replace Build.PL with Makefile.PL)
    2. Build (for older modules replace this with nmake)
    3. Build test (for older modules replace this with nmake test)
    4. Build install (for older modules replace this with nmake install)

  4. The Perl CPAN shell (perl -MCPAN -e shell) expects to find several GNU tools, so install the Windows versions of the following: GnuPG, grep, gzip, tar and unzip.
  5. KinoSearch relies on external programs to extract the text from PDF and DOC files, so install xpdf and antiword.
  6. Manually install the following CPAN modules (the version is important): ExtUtils-CBuilder (version 0.21), Spreadsheet-ParseExcel (version 0.32), KinoSearch (version 0.161).  These can be downloaded from the CPAN Search Site.
  7. After extracting SearchEngineKinoSearchAddOn.zip into the twiki directory, edit two files in twiki/lib/TWiki/Contrib/SearchEngineKinoSearchAddOn/StringifierPlugins/ (DOC_antiword.pm and PDF.pm) as follows:

    1. In PDF.pm change system("pdftotext", $filename, $tmp_file, "-q") to system("c:\\bin\\pdftotext.exe", $filename, $tmp_file, "-q"); where c:\\bin\\pdftotext.exe is the fully qualified path of your pdftotext executable (it is part of xpdf for Windows). The doubled \\ is necessary because \ is the escape character in double-quoted Perl strings.
    2. In DOC_antiword.pm replace "antiword" with the fully qualified path for your antiword.exe (e.g., c:\\bin\\antiword.exe).

  8. [optional] On my system I also had to modify twiki/lib/TWiki/Sandbox.pm, changing normalizeFileName() to return @result; instead of return join '/', @result; just to get TWiki to work.
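
The path edits in step 7 hinge on how Perl treats backslashes in quoted strings, so here is a minimal sketch of that point. It assumes the example c:\bin location from above; the helper name pdftotext_command and the file names in.pdf and out.txt are made up for illustration and are not part of the add-on. The argument order mirrors the system() call quoted in step 7.

```perl
use strict;
use warnings;

# Example install location from step 7; substitute the real path on your system.
my $pdftotext = "c:\\bin\\pdftotext.exe";  # double quotes: each \\ yields one backslash
my $same_path = 'c:\bin\pdftotext.exe';    # single quotes: these backslashes stay literal

# Both spellings produce the identical string.
print "paths match\n" if $pdftotext eq $same_path;

# Hypothetical helper: builds the argument list in the same order
# as the system() call shown in step 7.
sub pdftotext_command {
    my ($exe, $pdf_file, $txt_file) = @_;
    return ($exe, $pdf_file, $txt_file, "-q");
}

my @cmd = pdftotext_command($pdftotext, "in.pdf", "out.txt");
print join(" ", @cmd), "\n";

# Only attempt the conversion if the executable really exists at that path.
if (-e $pdftotext) {
    system(@cmd) == 0 or warn "pdftotext exited with status $?\n";
}
```

Writing the path in single quotes is an alternative to doubling the backslashes. Either form hands the same string to system().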

 
