I am doing a little data scraping. There are 3 types of file from which I am scraping data:
1- HTML
2- PDF
3- Excel (xls)
For HTML I am comfortable; I am using the HTML Agility Pack for that.

For PDF and Excel, I would welcome suggestions from anyone.

Concerning Excel: if you are in a Microsoft environment, you can either use Office Automation or OLEDB. In a Java environment, look at Apache POI.

EDIT: Concerning PDF, in Java try Apache PDFBox. It can also work in .NET using IKVM.


I can recommend Cogniview's PDF2XL, a reasonably inexpensive commercial product, to extract data from tables in PDF files into Excel. We have used it with great success.

HTML Agility is a library, and it's good to use. But then, why do you need separate tools for different data extraction purposes? Use Automation Anywhere to extract data from any source. As far as I know, it would work for all three sources you have specified. Google it.

You can use UiPath to achieve this. It can scrape PDF, Excel, HTML, Java, Windows, .NET, WPF, and legacy applications with 100% accuracy. It also works in virtualized environments, but only via OCR scraping.

It can be used from code (SDK), but you can also create visual automation (workflows) using UiPath Studio. Here's a tutorial on web data extraction.

Note: I work at UiPath, so I know it can do the job. You should also try other visual automation tools like Automation Anywhere, WinAutomation, and Jacada. Use them side by side and choose the one that suits you best.


I am running some test automation on a remote, networked computer. The remote computer running the test automation generates some output, which I can customize however I wish, probably as a text or Excel file.

I would like to create an Excel spreadsheet which, from my local machine, monitors this output and provides real-time analytics. Later I would make the networked computer visible to more people, so that they can use the same spreadsheet to monitor this output.

My problem is that this networked computer is located on the other side of the earth, so using any kind of polling in Excel VBA to PULL the data from the networked computer results in a very long wait with the pinwheel spinning, which makes the sheet clumsy and less useful. The same thing happens when I use Excel's built-in function for linking to 'external resources'.

Is there any way to PUSH data to the Excel spreadsheet from the networked computer? Something that is easy to set up would be ideal. The latency does not have to be low, so long as there is no awkward 'busy wait' while the sheet updates. If that is not possible, is there any way of using PULL from the Excel sheet that avoids the same busy wait?


2 Answers

You can write a Real-Time Data (RTD) server.

There are a lot of resources on this, but here is a good start.


Because the long delays block the Excel process, I can think of 2 possibilities, assuming you are sticking with Excel:

  • Pull the data to a local data source (Access, SQLite, SQL Server) and then query that
  • Run the update query asynchronously so that Excel does not have to wait for the data

Personally, I would go for option 1.
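
To make the first option concrete, here is a minimal sketch in Python, not a definitive implementation. It assumes the remote machine writes its output as CSV with the hypothetical columns `test_id`, `status` and `duration_s`, and that you supply your own `fetch_remote` callable that does the slow network read. The table name `results` and all function names are made up for illustration. A background thread does the slow pull and writes into a local SQLite file, so anything that reads the local file never waits on the network.

```python
import csv
import io
import sqlite3
import threading
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, str, float]


def init_cache(conn: sqlite3.Connection) -> None:
    """Create the local cache table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        "test_id TEXT PRIMARY KEY, status TEXT, duration_s REAL)"
    )
    conn.commit()


def parse_output(text: str) -> List[Row]:
    """Parse the remote CSV output (the column names are an assumption)."""
    reader = csv.DictReader(io.StringIO(text))
    return [(r["test_id"], r["status"], float(r["duration_s"])) for r in reader]


def sync_once(db_path: str, fetch_remote: Callable[[], str]) -> int:
    """Do one slow remote read, then upsert the rows into the local cache."""
    rows = parse_output(fetch_remote())  # the only call that touches the network
    conn = sqlite3.connect(db_path)
    try:
        init_cache(conn)
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT OR REPLACE INTO results VALUES (?, ?, ?)", rows
            )
    finally:
        conn.close()
    return len(rows)


def start_background_sync(
    db_path: str,
    fetch_remote: Callable[[], str],
    interval_s: float,
    stop: threading.Event,
) -> threading.Thread:
    """Refresh the cache every interval_s seconds until stop is set."""

    def loop() -> None:
        while not stop.is_set():
            try:
                sync_once(db_path, fetch_remote)
            except Exception as exc:  # keep the old cache if one pull fails
                print(f"sync failed: {exc}")
            stop.wait(interval_s)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread


def summary(db_path: str) -> Dict[str, int]:
    """Fast local query: number of tests per status."""
    conn = sqlite3.connect(db_path)
    try:
        init_cache(conn)
        return dict(
            conn.execute("SELECT status, COUNT(*) FROM results GROUP BY status")
        )
    finally:
        conn.close()
```

You would call `start_background_sync("cache.db", my_fetch, 60.0, threading.Event())` once and leave it running. The spreadsheet would then query the local file instead of the remote machine, for example through an ODBC driver for SQLite, which is an assumption about your setup. Because the slow read happens in its own thread, this also covers the spirit of the second option.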

