
PdfToText has bugs extracting text


  • PdfToText has bugs extracting text


    I'm using BC4 (which ships with PdfToText 3.04) on Windows 7 x64. With this PDF I'm getting wrong comparison results as the text extracted by PdfToText is not correct.

    I noticed SumatraPDF had a very similar issue, which I reported here, and it was fixed straight away.

    Maybe PdfToText's issue has the same root cause and can be fixed in a similar way, or BC4 could start shipping with another text extraction tool that does not have this issue.

  • #2

    We do not produce PdfToText and are unable to update it, but we do update the version that ships with BC4 as new releases become available. If we have not yet been able to incorporate the newest release, you can download any specific version of PdfToText and plug it into BC4's PDF file format.

    If you are familiar with any other command line utility which can convert from PDF to Text that supports your files, you can incorporate it for BC4's use. We have an example of how to create a custom file format here:

    It appears that SumatraPDF does not support command line conversion, only display. If they are able to add this support, you could use this tool with a custom File Format to perform the conversion.
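As a rough illustration of plugging in another command line converter, here is a minimal Python wrapper sketch. It assumes the file format's external conversion hands the script a source PDF path and a target text path as arguments; check that against the custom file format documentation. The `-layout` option and argument order follow Xpdf's `pdftotext [options] <PDF-file> [<text-file>]` usage. The names `build_command` and `convert` are hypothetical.

```python
import subprocess


def build_command(converter: str, source_pdf: str, target_txt: str) -> list[str]:
    """Build the argument list for a pdftotext-style converter.

    Assumes the converter accepts: <exe> [options] <PDF-file> <text-file>,
    as Xpdf's pdftotext does. Other tools may need a different order.
    """
    return [converter, "-layout", source_pdf, target_txt]


def convert(converter: str, source_pdf: str, target_txt: str) -> int:
    """Run the converter and return its exit code (0 normally means success)."""
    return subprocess.run(build_command(converter, source_pdf, target_txt)).returncode
```

A different tool would only need a different `build_command`; the rest of the wrapper stays the same.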
    Aaron P Scooter Software


    • #3
      I confirmed that the latest release of Xpdf (3.04, both 32-bit and 64-bit) shows the problem described. You can get an old version of PdfToText.exe from the "PDF to Doc" helper file for BC2, which looks like it works with your file:
      --> PDF to DOC 9-Nov-2005 v1.1 233kb

      I also have Xpdf 3.01, and it works. So I assume a change in 3.02 or later caused the problem. The latest Xpdf v3.04 "changes" file lists this under 3.02:
      Tweak the TrueType font encoding deciphering algorithm.

      You may want to contact the Xpdf author, and see if there is a setting in the latest version that would handle the font differently, or if this can be addressed in the next update.

      In case you or anyone else wants to keep an older version of Xpdf like this, I suggest putting it in a separate folder under the BC4 Helpers folder:
      e.g. Helpers\PdfToText_v300\PdfToText.exe
      and adding a separate PDF file format entry (like "PDF to Text v3.00") that points to this folder.
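
With two versions installed side by side, one way to confirm which one mis-extracts a given PDF is to diff their output outside BC4. The Python sketch below is only an illustration: `extract_text` and `diff_versions` are hypothetical names, the executable paths are whatever you chose for your Helpers subfolders, and the Latin-1 decoding assumes PdfToText's default output encoding.

```python
import difflib
import subprocess
import tempfile
from pathlib import Path


def extract_text(exe: str, pdf: str) -> str:
    """Run one PdfToText executable on a PDF and return the extracted text."""
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "out.txt"
        # Xpdf usage: pdftotext [options] <PDF-file> [<text-file>]
        subprocess.run([exe, "-layout", pdf, str(out)], check=True)
        # Assumption: default output encoding (Latin-1) is in effect.
        return out.read_text(encoding="latin-1")


def diff_versions(old_text: str, new_text: str) -> list[str]:
    """Return unified-diff lines between two extractions (empty if identical)."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )
```

For example, `diff_versions(extract_text(old_exe, pdf), extract_text(new_exe, pdf))` lists the lines that differ between the two versions, which could also be useful to attach when contacting the Xpdf author.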

      To quickly find the user folders, type this in the "Start" menu search box:
      then > Scooter Software > Beyond Compare > Helpers