Parsing word documents in .Net 2 - Contractor UK Bulletin Board


mcquiggd replied

22 August 2006, 22:17
You are indeed correct, Mr P.

I have asked him, and he said he will get back to me, but that he felt it was an important issue, and it must not be judged on past failures. Except, the Tories made it more difficult, and he urges me to celebrate the differences between the bizarre output from Word, and any sane XML format. In fact, under New Labour, Word's XML format will be taught as part of the national curriculum, in tandem with XML that supports 160 different language enhancements and encompasses all religions and ethnic origins, including the new processing instruction 'explode near people'.
DimPrawn replied

22 August 2006, 21:58
Originally posted by mcquiggd

The last step I need to cover is extracting images embedded within the XML into the database (they are small images).

I believe John Reid is going to sort it out. After a period of consultation.
mcquiggd replied

22 August 2006, 21:28
Originally posted by vetran

Create in list using automatic doc template (new document), using Office 2003, automagically save in a list. Even our salesmen can manage it.

memory is cheap!

Well, I have found a very suitable solution, based on a CodeProject article, which includes a template file, a toolbar to insert styles, and an XSLT that is applied to the absolutely ridiculous Word 'XML' format, and manages to make sense of it by throwing 90+% of it away.

Now the Word document is supplied to an editor type person, who clicks a button added to their standard toolbar. This adds a new Word template and toolbar to the document, which allows formatting with embedded XML tags; that in turn allows server-based processing of documents, with a schema that validates the document as it is altered. Effectively the editor now takes any old Word document, selects and applies predefined XML tags to the content, presses a toolbar button, and an XML file is generated that can be uploaded to the server-based application where it is processed. It is quite neat - the original author's work is here: http://www.codeproject.com/soap/Word...leTemplate.asp

And all credit to him.

The last step I need to cover is extracting images embedded within the XML into the database (they are small images).
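For the image step, a minimal sketch rather than a definitive implementation. It assumes the file is Word 2003 "Save as XML" output, where each embedded picture is stored as base64 text inside a w:binData element whose w:name attribute (e.g. wordml://01000001.gif) is what the matching v:imagedata src points at. It is written in Python purely for illustration; the function name extract_images and the SAMPLE document are hypothetical, and the same walk could be done with the .NET XML classes.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordprocessingML ("Save as XML") documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_images(xml_text):
    """Return {name: raw_bytes} for every embedded w:binData element."""
    root = ET.fromstring(xml_text)
    images = {}
    for bin_el in root.iter("{%s}binData" % W):
        name = bin_el.get("{%s}name" % W)
        # Word wraps the base64 payload over many lines; b64decode
        # discards the embedded whitespace by default.
        images[name] = base64.b64decode(bin_el.text or "")
    return images


# Hypothetical, heavily trimmed sample; the payload is just a GIF signature.
PAYLOAD = base64.b64encode(b"GIF89a").decode("ascii")
SAMPLE = (
    '<w:wordDocument xmlns:w="%s" xmlns:v="urn:schemas-microsoft-com:vml">'
    "<w:body><w:p><w:r><w:pict>"
    '<w:binData w:name="wordml://01000001.gif">%s</w:binData>'
    '<v:shape><v:imagedata src="wordml://01000001.gif"/></v:shape>'
    "</w:pict></w:r></w:p></w:body></w:wordDocument>" % (W, PAYLOAD)
)

if __name__ == "__main__":
    for name, data in extract_images(SAMPLE).items():
        print(name, len(data), "bytes")
```

The returned bytes could then be written to the database, with the w:name value kept as the key for swapping each v:imagedata reference for the stored image ID.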

Last edited by mcquiggd; 22 August 2006, 22:31.
vetran replied

21 August 2006, 22:11
Originally posted by mcquiggd

SharePoint is not an option - I need to take a document that has been written in Word by people who refuse to use anything else, and magically turn it into website content that can, and will, be displayed in many different ways...

Create in list using automatic doc template (new document), using Office 2003, automagically save in a list. Even our salesmen can manage it.

memory is cheap!
mcquiggd replied

18 August 2006, 19:54
Originally posted by vetran

If you are talking about properties such as title etc. and custom properties, they will go straight into a SharePoint list and autofill the columns.

They can be added offline using Colligo Contributor or Digilink revelation.

Security is taken care of and you will be able to do full-text search if you use full SQL Server as the back end.

SharePoint, rotating not reinventing the wheel!

SharePoint is not an option - I need to take a document that has been written in Word by people who refuse to use anything else, and magically turn it into website content that can, and will, be displayed in many different ways...

My current plan is to combine a schema-based template to 'encourage' them to follow certain guidelines - such as using 'title' rather than simply selecting text and making it bold and 18pt - with a Word add-in that parses the document and outputs XML with tags that my import procedures can use to dissect the document into the relevant persistable objects. I am basically 50% there; I just need to extract images, store them and replace them with references to the correct imageID, which is then rendered as and when necessary by the website.
TheMonkey replied

18 August 2006, 12:13
Originally posted by vetran

SharePoint, rotating not reinventing the wheel!

Rotating it slowly with lots of memory...
vetran replied

18 August 2006, 11:46
Which parts?

If you are talking about properties such as title etc. and custom properties, they will go straight into a SharePoint list and autofill the columns.

They can be added offline using Colligo Contributor or Digilink revelation.

Security is taken care of and you will be able to do full-text search if you use full SQL Server as the back end.

SharePoint, rotating not reinventing the wheel!
mcquiggd replied

17 August 2006, 20:47
Monkey, that sounds interesting - I am purely reading the Word document and extracting its content into my object hierarchy for display via a reasonably complex website... I do not have to (nor want to!) create Word documents - I ignore them once I have retrieved the data I want.

The people submitting the content are rather well known on TV and submit articles for publication from laptops in Word 2003... and as always I like to keep it simple....

Last edited by mcquiggd; 17 August 2006, 20:55.
TheMonkey replied

17 August 2006, 19:34
You can use the Office Web Components on a server safely WITHOUT the overhead or problems with using Word through COM. They can only READ a document using the same object model as Excel and Word.

Don't ever bother thinking about creating Word documents on the server. Third party is the ONLY option, or using a Word macro on a dedicated box to batch process stuff. Yuck.
mcquiggd replied

17 August 2006, 17:32
Originally posted by hyperD

Off the top of my head, out of my depth and loosely associated... there are some functions in the MS Index Server and SharePoint object models that may extract some properties from Office documents, but for any more detail, I fear you need to use the Word object model, and probably on the client side as well to avoid any threading issues.

That's where IFilter comes in... I wrote a little utility using it today and it does allow you to get chunks of text, but it's more for simply grabbing x number of chars into a buffer than being able to say 'get next paragraph'... might be sufficient for you Alexei - I have seen an example on codesmith that uses DotLucene to search content extracted using IFilter - the relevant implementation is loaded according to the file type. It is more geared towards searching filesystems, but you can also stream content into the IFilter implementation, which is probably what you want (I know I do).

Last edited by mcquiggd; 17 August 2006, 18:36.
MrsGoof replied

17 August 2006, 13:15
Open the doc in OpenOffice, save it in OD... that XML standard, then it's a piece of err umm you know.

Alternatively scan the OO code to find out how it parses MS Word docs.
DimPrawn replied

17 August 2006, 13:13
Originally posted by mcquiggd

I am building an app that needs to dissect a Word (Office 2003) document into Titles, Paragraphs and Images, which are then ... anybody have experience of such tasks? Alexei, have you used anything similar for parsing content?

I have looked at creating a template that includes tags that I can then parse into my own format, and also IFilter, but that seems to only deal with text, not images...

The general consensus is that using Office Automation on a server is not a good idea... and the app is ASP.Net (C#) and I can't create a WinForms utility to do this...

Any pointers welcome...

Word automation on a server is a big NO NO. Don't even think about it. Word was never designed as anything other than a single user application on a desktop with a full user profile loaded. It is not thread safe or re-entrant and without an interactive user logged into the Windows Desktop, many of the features will bomb. I know a company that tried to make this work and it was unreliable and caused big support issues.

Now, is it possible to have the word documents saved as Word XML documents? If so you could parse the XML to obtain the information you require.

Or simply buy this http://www.syncfusion.com/Products/p...?p=26&tab_id=0
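To make the Word XML route above concrete, here is a minimal sketch, assuming the documents are saved from Word 2003 as XML (WordprocessingML). In that format paragraphs are w:p elements, the paragraph style name sits in w:pPr/w:pStyle, and the text sits in w:t runs. It is written in Python for illustration only; extract_paragraphs and SAMPLE are hypothetical names, and the same traversal should map onto the .NET XML classes.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 WordprocessingML ("Save as XML") documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}


def extract_paragraphs(xml_text):
    """Return a list of (style_name_or_None, text) for non-empty paragraphs."""
    root = ET.fromstring(xml_text)
    result = []
    # iter() also finds paragraphs nested inside section wrappers.
    for p in root.iter("{%s}p" % W):
        style_el = p.find("w:pPr/w:pStyle", NS)
        style = style_el.get("{%s}val" % W) if style_el is not None else None
        # A paragraph's text is split across runs; join the w:t pieces.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W))
        if text:
            result.append((style, text))
    return result


# Hypothetical, heavily trimmed sample document.
SAMPLE = (
    '<w:wordDocument xmlns:w="%s"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
    "<w:r><w:t>My Title</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>"
    "</w:body></w:wordDocument>" % W
)

if __name__ == "__main__":
    for style, text in extract_paragraphs(SAMPLE):
        print(style, "|", text)
```

Paragraphs carrying a heading style could be treated as titles and everything else as body text, which only works if authors apply styles instead of making text bold by hand.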
AtW replied

17 August 2006, 13:11
There should be .NET libraries online that would allow you to do the job - I will need something like this myself actually; I think I will convert the document to HTML and then parse that. If you find anything useful then post here please.

A heavier way is to use the Microsoft Office SDK or something like it - it has got to have some hooks into doc files, and surely they should work in Visual Studio.
hyperD replied

17 August 2006, 13:02
Originally posted by mcquiggd

I am building an app that needs to dissect a Word (Office 2003) document into Titles, Paragraphs and Images, which are then ... anybody have experience of such tasks? Alexei, have you used anything similar for parsing content?

I have looked at creating a template that includes tags that I can then parse into my own format, and also IFilter, but that seems to only deal with text, not images...

The general consensus is that using Office Automation on a server is not a good idea... and the app is ASP.Net (C#) and I can't create a WinForms utility to do this...

Any pointers welcome...

Off the top of my head, out of my depth and loosely associated... there are some functions in the MS Index Server and SharePoint object models that may extract some properties from Office documents, but for any more detail, I fear you need to use the Word object model, and probably on the client side as well to avoid any threading issues.
Cowboy Bob replied

17 August 2006, 12:31
Tell them to save it in RTF