Author Topic: pdf to html converter [SOLVED]  (Read 544 times)

Offline ternor

  • Hero Member
  • *****
  • Posts: 1797
Re: pdf to html converter
« Reply #15 on: January 14, 2013, 05:10:12 PM »
According to this page Zamzar is able to convert pdf to html.

For $7 a month - I think I'll pass...  for now, anyway.


There is a free service and (if you want more than that provides) a paid service.  I have used it quite a few times without charge.

Offline horusfalcon

  • Hero Member
  • *****
  • Posts: 998
  • Wayfarer of The Western Wastes
Re: pdf to html converter
« Reply #16 on: January 15, 2013, 02:23:47 PM »
+1 for Zamzar.com, (I've had good results with their free service in converting AutoCAD drawings to PDF) but a lot depends on the PDFs.  If you have PDFs that originated with scanned images, any conversion to HTML is going to be problematic at best (you'll wind up with HTML that embeds an image).  PDFs that are watermarked or have security features enabled may present other problems.

If the PDFs are mostly text with a few images, you'll see better (more usable) results, but any conversion process you use will likely leave some "warts" on the finished product which you will have to remove by judicious hand-editing.

Is the information you need not available in a more usable format somewhere else?  Might be a case of, "If I was going there... I wouldn't a started from here!"

Later On,
D

"The Way is not a matter of knowing or not knowing.  One word to a wise man; one lash to a bright horse."

Dell Latitude D620, PCLinuxOS 2012.08 KDE4/LXDE, 3.2.18.pclos.bfs.

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #17 on: January 15, 2013, 04:09:56 PM »
+1 for Zamzar.com, (I've had good results with their free service in converting AutoCAD drawings to PDF) but a lot depends on the PDFs.  If you have PDFs that originated with scanned images, any conversion to HTML is going to be problematic at best (you'll wind up with HTML that embeds an image).  PDFs that are watermarked or have security features enabled may present other problems.

So I noticed.  The PDFs are tables of data.  Almost everything I try extracts the data but converts the table grid lines into background images.

Quote
Is the information you need not available in a more usable format somewhere else?  Might be a case of, "If I was going there... I wouldn't a started from here!"


I can't seem to locate it - no one will admit anything.  It is hand-extracted data from will record books at the courthouse.  Five books' worth.  I'm not about to start over, so I've resigned myself to a lot of concatenation coupled with extensive "find and replace."  It'll take a while, but I'll just keep plugging away until I get there.  I would like to strangle whoever decided PDFs were the way to go.

Of course, the guy running the web site before me used images for EVERYTHING!  Even most of the text.  And absolute positions, so any change mucked things up pretty badly.  When I took it over, a little over a year ago, most of the text appeared to be scanned images.  And the menu buttons were three slightly different graphic images used for the initial, mouse-over and visited states.  For a long time I was stuck adding more pages, as I couldn't manage to duplicate the buttons.  I finally got it all converted to pure HTML, using CSS and JavaScript to control the appearance.

These pdf files are the last things I need to convert.

By the way - if you're interested, here's the url:

http://www.douglascountygensoc.org/


Retired Senior Chief, Retired Software Engineer, Active GrandPa

Offline kjpetrie

  • PCLinuxOS Tester
  • Hero Member
  • *******
  • Posts: 3991
Re: pdf to html converter
« Reply #18 on: January 15, 2013, 05:41:45 PM »
As I said earlier, the table grid lines are not being replaced by background images by the HTML conversion process. They were replaced by the PDF program. That's how PDF works. Like an incompetent web designer, it only cares about how things look; not what they mean. If the same image is being repeated over and over again, kwrite's search and replace might help. It can do regular expression matches if the images are in tags with small differences.
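For anyone who would rather script that search-and-replace than do it in an editor, here is a minimal Python sketch of the same regular-expression idea. The sample markup and the `grid_NN.png` file-name pattern are assumptions made up for illustration; the real first-pass output will differ from converter to converter, so the pattern would need adjusting.

```python
import re

# Hypothetical sample of first-pass converter output: the same grid-line
# image repeated with slightly different attributes on each row.
sample_html = (
    '<div><img src="grid_01.png" width="600" height="2">Adams, John</div>\n'
    '<div><img src="grid_02.png" width="598" height="2">Baker, Mary</div>\n'
)

# Match any <img ...> tag whose src is a file named grid_<digits>.png.
# The file-name pattern is an assumption; adjust it to the real output.
GRID_IMAGE = re.compile(r'<img\b[^>]*\bsrc="grid_\d+\.png"[^>]*>', re.IGNORECASE)


def strip_grid_images(html: str) -> str:
    """Remove every grid-line image tag, leaving the surrounding text alone."""
    return GRID_IMAGE.sub("", html)


cleaned = strip_grid_images(sample_html)
print(cleaned)
```

This only removes the repeated images; it does not put the table structure back.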

Otherwise, it's the process I outlined before.

No program can do the impossible. The information has been corrupted by the PDF process and there is no automated way to get it back. Stop wasting your time looking for something which can't exist. Accept you will have to recreate it manually. The sooner you start, the sooner you'll finish.

Hope that doesn't sound harsh, but it's the harsh reality you're in.
« Last Edit: January 15, 2013, 05:44:28 PM by kjpetrie »
-----------
KJP
-----------------------------------------------------------
PClos64 RC1 on Intel D945GCLF2 motherboard (Atom 330), 2GB DDR2 RAM, Maxtor STM325031, HL-DT-ST DVDRAM GSA-H42N, Amilo LSL 3220T monitor. Also Acer 5810TG (with custom kernel) and Asus eeePC 2G surf

Offline ternor

  • Hero Member
  • *****
  • Posts: 1797
Re: pdf to html converter
« Reply #19 on: January 17, 2013, 03:45:47 AM »
If you are unfamiliar with html and css coding, there are applications for creating web pages.  I forget the names.

Offline horusfalcon

  • Hero Member
  • *****
  • Posts: 998
  • Wayfarer of The Western Wastes
Re: pdf to html converter
« Reply #20 on: January 17, 2013, 11:23:39 AM »

{snip:  my previous}

So I noticed.  The PDFs are tables of data.  Almost everything I try extracts the data but converts the table grid lines into background images.


So, are you attempting to render this data in HTML as a set of tables (or table-like structures in CSS)?  This is beginning to sound like a task someone somewhere has already written a Perl script to accomplish.  The basic approach here would be to take your first-pass conversion and run it through the Perl script to do the slice-n-dice and generate the output in the format required by your style sheet.

I wouldn't worry too much about that background image - that can be sliced out pretty easily once the first pass HTML is generated (it's really not needed, is it?) and the remaining data then processed.
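To make the second pass concrete, here is a rough sketch of the "slice-n-dice" step, written in Python rather than Perl. It assumes the data has already been pulled out as plain text with columns separated by two or more spaces; that layout, the sample records, and the `will-index` class name are all hypothetical, not taken from the actual files.

```python
import html
import re

# Hypothetical text extracted from a first-pass conversion: one record per
# line, columns separated by runs of two or more spaces (an assumption).
extracted = """\
Adams, John      Book A   12
Baker, Mary      Book A   47
"""


def rows_to_table(text: str, css_class: str = "will-index") -> str:
    """Turn column-aligned text into an HTML table styled by a CSS class."""
    lines = [f'<table class="{css_class}">']
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        # Split on 2+ whitespace so single spaces inside a field survive.
        cells = re.split(r"\s{2,}", line.strip())
        tds = "".join(f"<td>{html.escape(cell)}</td>" for cell in cells)
        lines.append(f"  <tr>{tds}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


print(rows_to_table(extracted))
```

Leaving the appearance to a class in the style sheet keeps the generated markup free of the per-cell formatting that converters tend to emit.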

I'm not a Perl monger, myself, (I've written a few very basic scripts) but know of a few places where scripts for useful purposes are available for download:

http://www.perlscriptsjavascripts.com/ - PerlScriptJavaScripts.com is a repository of a wide array of Perl scripts, some free and some for pay, and they also offer custom scripts for hire.

http://www.scripts.com/perl-scripts/ - Perl Scripts is another combination repository that's well organized by function.

http://savage.net.au/Perl.html - Ron Savage's Perl Scripts page - all of Ron's stuff is free, open source code.  Take particular note that he has already done some scripts for reading genealogy data.

http://www.bewley.net/perl/  - Dale Bewley's Perl Scripts page - Dale offers custom scripting services, too, and has several scripts up on his page for download (but nothing that looks like what you need)

These are just a few of the pages out there dedicated to scripting - you might conduct a search and see if it's possible someone has chewed some of the same ground you are chewing now...  a little searching on the front end might save you a lot of work on the back end.

Failing the discovery/creation of some kind of tool to help do this work, yeah, you're exactly where kjpetrie says... and the sooner out the sooner done.

{snip: more of my previous}

Quote
I can't seem to locate it - no one will admit anything.  It is hand-extracted data from will record books at the courthouse.  Five books' worth.  I'm not about to start over, so I've resigned myself to a lot of concatenation coupled with extensive "find and replace."  It'll take a while, but I'll just keep plugging away until I get there.  I would like to strangle whoever decided PDFs were the way to go.

Of course, the guy running the web site before me used images for EVERYTHING!  Even most of the text.  And absolute positions, so any change mucked things up pretty badly.  When I took it over, a little over a year ago, most of the text appeared to be scanned images.  And the menu buttons were three slightly different graphic images used for the initial, mouse-over and visited states.  For a long time I was stuck adding more pages, as I couldn't manage to duplicate the buttons.  I finally got it all converted to pure HTML, using CSS and JavaScript to control the appearance.


This is so typical.  Don't feel too harshly toward your predecessor - he was likely just muddling through until he could do better.  So many folks don't keep up with web technologies any more 'cause they're such a crazy quilt.

He may have done everything as images at the advice of an ill-informed attorney who believed it's somehow harder to alter image data than HTML or text. 

Quote
These pdf files are the last things I need to convert.

By the way - if you're interested, here's the url:

http://www.douglascountygensoc.org/


I'll give it a look.  I hope you can find a script out there that gets where you want to go, otherwise it'll be a lot of work.

Later On,
D

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter
« Reply #21 on: January 17, 2013, 01:20:01 PM »
You can all consider the problem solved.  I found an online converter (upload the PDF, download the HTML) that works perfectly.  It is at:

http://www.pdfonline.com/convert-pdf-to-html/

Creates actual tables in the html, although a separate table for each page, but that's easy to fix.  A bit of global find and replace for some unneeded text formatting (trying to make it look exactly like the pdf, I suppose) and it's good to go.
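For reference, the clean-up described above can also be scripted. The Python sketch below joins the per-page tables into one and strips inline `style` attributes. The sample markup is invented for illustration and is not the converter's actual output, so the patterns are assumptions that would need checking against a real file.

```python
import re

# Hypothetical converter output: one table per PDF page, with inline styling.
pages_html = (
    '<table style="font-family:Arial"><tr><td style="color:#000">Adams</td></tr></table>\n'
    '<table style="font-family:Arial"><tr><td style="color:#000">Baker</td></tr></table>\n'
)


def merge_tables(html: str) -> str:
    """Join consecutive tables into one and drop inline style attributes."""
    # A closing </table> followed directly by a new <table ...> marks a page
    # break; removing the pair joins the two tables into one.
    joined = re.sub(r"</table>\s*<table\b[^>]*>", "\n", html, flags=re.IGNORECASE)
    # Remove inline style="..." attributes so a style sheet controls the look.
    return re.sub(r'\s+style="[^"]*"', "", joined, flags=re.IGNORECASE)


merged = merge_tables(pages_html)
print(merged)
```

A regular expression is enough here only because the pattern is so repetitive; anything less regular would be safer with a real HTML parser.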

Thanks to all...


Offline horusfalcon

  • Hero Member
  • *****
  • Posts: 998
  • Wayfarer of The Western Wastes
Re: pdf to html converter [SOLVED]
« Reply #22 on: January 17, 2013, 02:18:26 PM »
Good deal, man!  A little searching on the front end saved work on the back end, that's for sure.

Later On,
D

Offline kjpetrie

  • PCLinuxOS Tester
  • Hero Member
  • *******
  • Posts: 3991
Re: pdf to html converter [SOLVED]
« Reply #23 on: January 17, 2013, 04:39:55 PM »
Glad you found something. I wasn't expecting that, but good.
-----------
KJP

Offline The Chief

  • Hero Member
  • *****
  • Posts: 2248
Re: pdf to html converter [SOLVED]
« Reply #24 on: January 25, 2013, 06:09:00 PM »
Just to let everyone see the results, here is the first file (Book A - they go through Book E):

http://www.douglascountygensoc.org/will_index_book_a.html
