Parsing word documents in .Net 2

    I am building an app that needs to dissect a Word (Office 2003) document into Titles, Paragraphs and Images, which are then stored in a SQL Server 2000 DB... anybody have experience of such tasks? Alexei, have you used anything similar for parsing content?

    I have looked at creating a template that includes tags that I can then parse into my own format, and also IFilter, but that seems to only deal with text, not images...

    The general consensus is that using Office Automation on a server is not a good idea... and the app is ASP.NET (C#), and I can't create a WinForms utility to do this...

    Any pointers welcome...
    Last edited by mcquiggd; 17 August 2006, 18:30.
    Vieze Oude Man

    #2
    Tell them to save it in RTF
    Listen to my last album on Spotify

      #3
      Originally posted by mcquiggd
      I am building an app that needs to dissect a Word (Office 2003) document into Titles, Paragraphs and Images, which are then ... anybody have experience of such tasks? Alexei, have you used anything similar for parsing content?

      I have looked at creating a template that includes tags that I can then parse into my own format, and also IFilter, but that seems to only deal with text, not images...

      The general consensus is that using Office Automation on a server is not a good idea... and the app is ASP.NET (C#), and I can't create a WinForms utility to do this...

      Any pointers welcome...
      Off the top of my head, out of my depth and only loosely associated... there are some functions in the MS Index Server and SharePoint object models that may extract some properties from Office documents, but for any more detail I fear you need to use the Word object model, and probably on the client side as well to avoid any threading issues.
      If you think my attitude stinks, you should smell my fingers.

        #4
        There should be .NET libraries online that would do the job. I will need something like this myself, actually; I think I will convert the document to HTML and then parse that. If you find anything useful, please post it here.

        A heavier way is to use the Microsoft Office SDK or something like it. It has got to have some hooks into .doc files, and surely they should work in Visual Studio.
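
For anyone trying the convert-to-HTML-then-parse route mentioned in this post, here is a minimal sketch of the parsing half. It is written in Python purely for illustration (the thread itself is about C#), it uses only the standard-library `html.parser`, and it assumes the converter emits ordinary `<h1>`..`<h6>`, `<p>` and `<img>` tags, which real Word-generated HTML may not do cleanly. The class name `ContentExtractor` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect headings, paragraphs and image sources from converted HTML."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.titles, self.paragraphs, self.images = [], [], []
        self._tag = None    # block element currently being read, if any
        self._buffer = []   # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag, self._buffer = tag, []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # inline tag such as </b>, or an unrelated element
        # Collapse runs of whitespace left over from the converter.
        text = " ".join("".join(self._buffer).split())
        if text:
            target = self.titles if tag in self.HEADINGS else self.paragraphs
            target.append(text)
        self._tag = None


# Hypothetical converter output, for illustration only.
html = "<h1>Title</h1><p>Some <b>bold</b> text.</p><p><img src='pic1.png'></p>"
extractor = ContentExtractor()
extractor.feed(html)
extractor.close()
```

After the run, `extractor.titles`, `extractor.paragraphs` and `extractor.images` hold the three kinds of content, ready to be written to the database. Nested block elements (a `<p>` inside a table cell inside another `<p>`, say) are not handled.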

          #5
          Originally posted by mcquiggd
          I am building an app that needs to dissect a Word (Office 2003) document into Titles, Paragraphs and Images, which are then ... anybody have experience of such tasks? Alexei, have you used anything similar for parsing content?

          I have looked at creating a template that includes tags that I can then parse into my own format, and also IFilter, but that seems to only deal with text, not images...

          The general consensus is that using Office Automation on a server is not a good idea... and the app is ASP.NET (C#), and I can't create a WinForms utility to do this...

          Any pointers welcome...
          Word automation on a server is a big NO NO. Don't even think about it. Word was never designed as anything other than a single user application on a desktop with a full user profile loaded. It is not thread safe or re-entrant and without an interactive user logged into the Windows Desktop, many of the features will bomb. I know a company that tried to make this work and it was unreliable and caused big support issues.

          Now, is it possible to have the Word documents saved as Word XML documents? If so, you could parse the XML to obtain the information you require.

          Or simply buy this http://www.syncfusion.com/Products/p...?p=26&tab_id=0
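
To make the Word XML suggestion in this post concrete, here is a rough sketch of dissecting a document saved from Word 2003 as XML (WordprocessingML). It is in Python for illustration only; the same walk could be done with the XML classes in .NET. Assumptions: titles use the built-in "Heading N" styles (style IDs starting with `Heading`), images are embedded as base64 in `w:binData`, and the function name `dissect_wordml` and the sample document are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# Namespace used by Word 2003 XML (WordprocessingML) documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"


def dissect_wordml(xml_text):
    """Split a Word 2003 XML document into titles, paragraphs and images."""
    root = ET.fromstring(xml_text)
    titles, paragraphs, images = [], [], []

    # iter() walks the whole tree, so paragraphs nested inside section
    # wrappers are found as well.
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_name = style.get(f"{{{W}}}val", "") if style is not None else ""

        # Embedded pictures are stored as base64 text in w:binData.
        for blob in p.iter(f"{{{W}}}binData"):
            name = blob.get(f"{{{W}}}name", "")
            images.append((name, base64.b64decode(blob.text or "")))

        if not text.strip():
            continue
        # Assumption: titles use the built-in "Heading N" styles.
        if style_name.startswith("Heading"):
            titles.append(text)
        else:
            paragraphs.append(text)
    return titles, paragraphs, images


# Tiny hand-written sample, not real Word output.
SAMPLE = """
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>My Title</w:t></w:r></w:p>
    <w:p><w:r><w:t>First </w:t></w:r><w:r><w:t>paragraph.</w:t></w:r></w:p>
    <w:p><w:r><w:pict><w:binData w:name="wordml://01000001.png">aGVsbG8=</w:binData></w:pict></w:r></w:p>
  </w:body>
</w:wordDocument>
"""

titles, paragraphs, images = dissect_wordml(SAMPLE)
```

Because every `w:p` in the tree is visited, paragraphs nested inside other paragraphs (text boxes, for instance) would be picked up twice; a production version would need to decide how to treat those and tables.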

            #6
            Open the doc in OpenOffice and save it in OD... that XML standard; then it's a piece of, err, umm, you know.

            Alternatively, scan the OpenOffice source code to find out how it parses MS Word docs.
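
As a sketch of why the OpenOffice route makes the parsing easy: an OpenDocument text file is a zip archive whose body lives in `content.xml`, with headings as `text:h`, paragraphs as `text:p`, and embedded pictures stored as separate files in the archive. The Python below is illustrative only and assumes the file has already been converted and saved from OpenOffice; the function name `dissect_odt` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml.
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DRAW = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
XLINK = "http://www.w3.org/1999/xlink"


def dissect_odt(odt_file):
    """Return (titles, paragraphs, images) from an OpenDocument text file.

    odt_file may be a path or a binary file-like object.
    """
    titles, paragraphs, images = [], [], {}
    with zipfile.ZipFile(odt_file) as archive:
        members = set(archive.namelist())
        root = ET.fromstring(archive.read("content.xml"))
        for node in root.iter():
            if node.tag == f"{{{TEXT}}}h":
                titles.append("".join(node.itertext()).strip())
            elif node.tag == f"{{{TEXT}}}p":
                text = "".join(node.itertext()).strip()
                if text:
                    paragraphs.append(text)
            elif node.tag == f"{{{DRAW}}}image":
                href = node.get(f"{{{XLINK}}}href", "")
                # Only embedded pictures live inside the archive;
                # linked ones point elsewhere and are skipped here.
                if href in members:
                    images[href] = archive.read(href)
    return titles, paragraphs, images
```

Calling `dissect_odt("article.odt")` (a hypothetical path) returns the headings, the non-empty paragraphs, and a dict mapping each embedded picture's archive path to its raw bytes.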
            Your parents ruin the first half of your life and your kids ruin the second half

              #7
              Originally posted by hyperD
              Off the top of my head, out of my depth and only loosely associated... there are some functions in the MS Index Server and SharePoint object models that may extract some properties from Office documents, but for any more detail I fear you need to use the Word object model, and probably on the client side as well to avoid any threading issues.

              That's where IFilter comes in... I wrote a little utility using it today and it does allow you to get chunks of text, but it's more for simply grabbing x number of chars into a buffer than being able to say 'get next paragraph'... It might be sufficient for you, Alexei. I have seen an example on codesmith that uses dotLucene to search content extracted using IFilter; the relevant implementation is loaded according to the file type. It is more geared towards searching filesystems, but you can also stream content into the IFilter implementation, which is probably what you want (I know I do).
              Last edited by mcquiggd; 17 August 2006, 18:36.
              Vieze Oude Man

                #8
                You can use the Office Web Components on a server safely WITHOUT the overhead or problems with using Word through COM. They can only READ a document using the same object model as Excel and Word.

                Don't ever bother thinking about creating Word documents on the server. A third-party library is the ONLY option, or using a Word macro on a dedicated box to batch process stuff. Yuck.
                Serving religion with the contempt it deserves...

                  #9
                  Monkey, that sounds interesting. I am purely reading the Word document and extracting its content into my object hierarchy for display via a reasonably complex website... I do not have to (nor want to!) create Word documents; I ignore them once I have retrieved the data I want.

                  The people submitting the content are rather well known on TV and submit articles for publication from laptops in Word 2003... and as always I like to keep it simple....
                  Last edited by mcquiggd; 17 August 2006, 20:55.
                  Vieze Oude Man

                    #10
                    Which parts?

                    If you are talking about properties such as title etc. and custom properties, they will go straight into a SharePoint list and autofill the columns.

                    They can be added offline using Colligo Contributor or Digilink revelation.

                    Security is taken care of, and you will be able to run full-text searches if you use full SQL Server as the back end.


                    SharePoint: rotating, not reinventing, the wheel!
