Topic: pdf to html converter [SOLVED]

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
pdf to html converter [SOLVED]
« on: January 12, 2013, 12:37:05 PM »
menotu's post about Firefox 19 reminded me to ask about this.  

I have about half a dozen PDF files I need to convert to HTML to put on my web site.  I know I could just put up the PDFs, but I want my CSS to control the appearance.

These are rather lengthy tables, and I have some css customizations for the appearance of tables that I would want to apply to these.

Anyone have any suggestions?  I've tried importing them into a word processor (WordPerfect) and then exporting them to HTML.  One seemed to work; the rest never seemed to complete the HTML conversion - maybe too large?  LibreOffice doesn't seem to do much better - it makes many, many HTML files, each holding one page of the tables.

Any recommendations for a stand-alone program?
« Last Edit: January 17, 2013, 01:20:43 PM by The Chief »

Retired Senior Chief, Retired Software Engineer, Active GrandPa

Offline gseaman

  • PCLinuxOS Tester
  • Hero Member
  • *******
  • Posts: 3792
Re: pdf to html converter
« Reply #1 on: January 12, 2013, 01:24:35 PM »
You could combine the output of the pages created by libreoffice with cat pg1.html pg2.html pg3.html > output.html. You would probably have to remove the extra headers. That should be pretty easy with kwrite. (Edit -> Replace)
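
If removing the extra headers by hand gets tedious, the same idea can be sketched in Python: keep only what sits between <body> and </body> on each page and wrap the lot in one document. This is a rough sketch under assumptions, not something checked against LibreOffice's actual output. merge_pages and the "Merged tables" title are made-up names, and pg1.html, pg2.html, pg3.html are just the example filenames from the cat command.

```python
import re
from pathlib import Path

# Captures everything between <body ...> and </body>, in any letter case.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def merge_pages(pages, title="Merged tables"):
    """Join the <body> contents of several HTML strings into one document."""
    bodies = []
    for html in pages:
        match = BODY_RE.search(html)
        # Fall back to the whole text if a page has no <body> element.
        bodies.append(match.group(1).strip() if match else html.strip())
    return (
        "<html>\n<head><title>" + title + "</title></head>\n<body>\n"
        + "\n".join(bodies)
        + "\n</body>\n</html>\n"
    )


if __name__ == "__main__":
    # Example filenames only; files that do not exist are skipped.
    names = ["pg1.html", "pg2.html", "pg3.html"]
    pages = [
        Path(n).read_text(encoding="utf-8", errors="replace")
        for n in names
        if Path(n).exists()
    ]
    if pages:
        Path("output.html").write_text(merge_pages(pages), encoding="utf-8")
```

Any per-page <style> blocks in the original <head> sections are dropped by this approach, which may be what you want if your own CSS is going to control the appearance anyway.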

Galen

Offline kjpetrie

  • PCLinuxOS Tester
  • Hero Member
  • *******
  • Posts: 3989
Re: pdf to html converter
« Reply #2 on: January 12, 2013, 01:52:53 PM »
You probably need to edit the page characteristics in LO so each table appears on only one page. A page doesn't have to be printable on real paper, so you can make it 6 feet tall if you need!
-----------
KJP
-----------------------------------------------------------
PClos64 RC1 on Intel D945GCLF2 motherboard (Atom 330), 2GB DDR2 RAM, Maxtor STM325031, HL-DT-ST DVDRAM GSA-H42N, Amilo LSL 3220T monitor. Also Acer 5810TG (with custom kernel) and Asus eeePC 2G surf

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #3 on: January 13, 2013, 10:13:34 AM »
Quote from: gseaman on January 12, 2013, 01:24:35 PM
You could combine the output of the pages created by libreoffice with cat pg1.html pg2.html pg3.html > output.html. You would probably have to remove the extra headers. That should be pretty easy with kwrite. (Edit -> Replace)

Galen

I thought of that - and I'm saving it as a last resort.   :D :D

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #4 on: January 13, 2013, 10:14:25 AM »
Quote from: kjpetrie on January 12, 2013, 01:52:53 PM
You probably need to edit the page characteristics in LO so each table appears on only one page. A page doesn't have to be printable on real paper, so you can make it 6 feet tall if you need!

Thanks, I'll investigate that.

Offline ternor

  • Hero Member
  • *****
  • Posts: 1797
Re: pdf to html converter
« Reply #5 on: January 14, 2013, 03:39:25 AM »
According to this page Zamzar is able to convert pdf to html.

Offline Just17

  • PCLinuxOS Tester
  • Super Villain
  • *******
  • Posts: 10644
  • MLUs Forever!
Re: pdf to html converter
« Reply #6 on: January 14, 2013, 06:12:59 AM »
poppler is in the repository and provides the pdftohtml command, among others, including pdftotext.
« Last Edit: January 14, 2013, 06:40:17 AM by Just17 »
MLUs rule the roost!

Linux XPS 3.4.38-pclos1.bfs  64 bit
Intel Core2 Quad CPU Q9450 @ 2.66GHz
4 GB RAM
MCP51 High Def Audio
GeForce GTX 550 Ti
PHILIPS  ‎DVD+-RW DVD8701
‎Logitech ‎BT Mini-Receiver
Afatech DTT

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #7 on: January 14, 2013, 09:17:54 AM »
Quote from: ternor on January 14, 2013, 03:39:25 AM
According to this page Zamzar is able to convert pdf to html.

For $7 a month - I think I'll pass...  for now, anyway.

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #8 on: January 14, 2013, 09:19:10 AM »
Quote from: Just17 on January 14, 2013, 06:12:59 AM
poppler is in the repository and provides the pdftohtml command, among others, including pdftotext.

Thanks - sounds like just the thing I need.  Off to update and install....

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #9 on: January 14, 2013, 10:02:27 AM »
Well, it turns out poppler was already installed, and after I figured out how to invoke it, the results were not so great.  It still generates multiple files (one per page), and here are the first few lines of the HTML generated for the PDF table:

Code:
<DIV style="position:absolute;top:113;left:31"><nobr><span class="ft00"><b>LAST&nbsp;NAME</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:187"><nobr><span class="ft00"><b>FIRST&nbsp;NAME</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:340"><nobr><span class="ft00"><b>MIDDLE</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:458"><nobr><span class="ft00"><b>SPOUSE</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:768"><nobr><span class="ft00"><b>PROBATE&nbsp;DATE</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:931"><nobr><span class="ft00"><b>BOOK</b></span></nobr></DIV>
<DIV style="position:absolute;top:113;left:1009"><nobr><span class="ft00"><b>PAGE</b></span></nobr></DIV>

Not exactly what I was expecting.  I was hoping for something like this:

Code:
        <table>
          <tr align="center">
            <td>Bride</td>
            <td>Groom</td>
            <td>Date</td>
            <td>Page</td>
            <td>Official</td>
          </tr>
          <tr>
            <td>ABBOTT, Mollie</td>
            <td>Norton, James N.</td>
            <td>09 Aug 1877</td>
            <td>60, 188</td>
            <td>Thomas White JP</td>
          </tr>

but... with some judicious "find and replace" in kwrite, it might be usable.  Otherwise it looks like it's going to be a long and tedious process...  unless I can track down the original author and get the source files...
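
For what it's worth, the "find and replace" could probably be scripted instead. Below is a rough Python sketch, assuming every cell comes out as a DIV shaped like the lines posted above and that cells in the same table row share exactly the same top value. divs_to_table is a made-up name, and the regular expression is guessed from those few sample lines only, so treat it as a starting point rather than a finished converter.

```python
import re
from collections import defaultdict
from html import escape, unescape

# Matches the absolutely positioned DIVs shown in the pdftohtml sample above.
DIV_RE = re.compile(
    r'<DIV style="position:absolute;top:(\d+);left:(\d+)">(.*?)</DIV>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # any leftover inline tag (nobr, span, b)


def divs_to_table(html):
    """Group DIVs by their top value into rows, ordered left to right."""
    rows = defaultdict(list)
    for top, left, inner in DIV_RE.findall(html):
        # Drop inline tags, decode &nbsp; and friends, normalise the spaces.
        text = unescape(TAG_RE.sub("", inner)).replace("\xa0", " ").strip()
        rows[int(top)].append((int(left), text))

    lines = ["<table>"]
    for top in sorted(rows):
        cells = "".join(
            "<td>%s</td>" % escape(text) for _, text in sorted(rows[top])
        )
        lines.append("  <tr>" + cells + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

One known weakness: an empty cell produces no DIV at all, so a row with a blank SPOUSE entry would come out one column short and the remaining cells would shift left. Mapping each left value to a fixed column number would be the next refinement.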

Thanks everyone, still looking, before I begin any monumental manual conversions...


Offline Just17

  • PCLinuxOS Tester
  • Super Villain
  • *******
  • Posts: 10644
  • MLUs Forever!
Re: pdf to html converter
« Reply #12 on: January 14, 2013, 10:48:19 AM »
What command are you using for Poppler?

Do you need images on the pages?


pdftohtml -s    will give one main file and a bunch of images to be used, all separate as expected.

pdftohtml -s -i   will ignore all the images and just produce the one main file


.......  or so it appears here .......

pdftotext -layout    produces a (somewhat) formatted text file

EDIT

or maybe

pdftotext -layout -htmlmeta
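
If the pdftotext -layout output keeps the columns apart with runs of spaces, a few lines of Python might turn it into table rows. This is only a sketch under that assumption: it treats two or more spaces as a column break, so a cell that itself contains a double space, or two columns separated by a single space, will come out wrong. layout_text_to_table is a made-up name.

```python
import re
from html import escape


def layout_text_to_table(text):
    """Turn column-aligned plain text into HTML table rows."""
    lines = ["<table>"]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines and the form feeds between pages
        # Two or more whitespace characters are treated as a column break.
        cells = re.split(r"\s{2,}", line)
        row = "".join("<td>%s</td>" % escape(cell) for cell in cells)
        lines.append("  <tr>" + row + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)
```

The idea would be to run pdftotext -layout on the PDF first, then read the resulting text file and pass its contents to the function.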

« Last Edit: January 14, 2013, 10:53:03 AM by Just17 »
MLUs rule the roost!

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #13 on: January 14, 2013, 10:53:43 AM »
I didn't try the -i option, but even so, I expect it wouldn't really generate table code.  If I can get table code, multiple files are not a problem - they would be simple to concatenate and edit slightly.

Offline kjpetrie

  • PCLinuxOS Tester
  • Hero Member
  • *******
  • Posts: 3989
Re: pdf to html converter
« Reply #14 on: January 14, 2013, 02:11:46 PM »
The problem will be the PDF. It's very much a layout format, whereas HTML is a markup format, so the PDF simply doesn't contain the logical structure you want to generate from it. PDF breaks sentences up and puts things that belong together in different parts of the file, with positioning information to display them where they ought to go.

I would start by opening the PDF in a viewer and copying out what you want using its selection functions and pasting them into a text document. Then write some basic HTML code for the first couple of rows. Then use copy and paste to separate the rows and cells to make the table. It's tedious, but there just isn't an automatic way to do it because PDFs throw away what you want and generate lines and letters and other graphical devices using a mixture of vector and raster graphics. The table you want just isn't in the PDF any more - only something that looks like a table on the screen.
-----------
KJP