Deserializing XML from a TCP port


NickFitz replied

13 June 2010, 21:31
Originally posted by VectraMan

Are <'s and >'s allowed inside quotes, or is it just CDATA and attributes? I thought that all <'s and >'s were escaped to &lt; and &gt; in XML.

The only characters that have to be escaped (other than within a comment, a processing instruction, a CDATA section, or an internal entity declaration) are "<" and "&". ">" only has to be escaped when used in the string "]]>" other than for the purpose of marking the end of a CDATA section. Most tools escape ">" anyway though.

XML 1.0 section 2.4 Character Data and Markup
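
The escaping rules above are easy to check against a real parser. This is a minimal sketch in Python (not .NET, purely for illustration) using the standard library's ElementTree; the sample strings are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# A literal ">" in character data is well-formed, so the parser accepts it.
text_with_gt = ET.fromstring("<a>1 > 0</a>").text

# A literal "<" in character data is not well-formed, so the parser rejects it.
try:
    ET.fromstring("<a>1 < 2</a>")
    lt_accepted = True
except ET.ParseError:
    lt_accepted = False

# The escaped forms are always accepted, and CDATA allows "<" unescaped.
escaped = ET.fromstring("<a>1 &lt; 2 &amp; 3 &gt; 2</a>").text
cdata = ET.fromstring("<a><![CDATA[1 < 2]]></a>").text
```

Only the second case is rejected; the other three parse and yield the text with the special characters restored.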
VectraMan replied

13 June 2010, 20:00
Originally posted by NickFitz

Code:

<element attribute="hello>/>>">
    <![CDATA[ The expression "1 < 2" is true. ]]>
</element>

I didn't add "excluding anything within quotes" because I felt that was so obvious as to be insulting the intelligence of the reader.

Are <'s and >'s allowed inside quotes, or is it just CDATA and attributes? I thought that all <'s and >'s were escaped to &lt; and &gt; in XML.
NickFitz replied

13 June 2010, 16:03
Originally posted by VectraMan

But if you're determined to parse it: XML is pretty simple. Just looking for <'s, </'s and >'s ought to do it.

Code:

<element attribute="hello>/>>">
    <![CDATA[ The expression "1 < 2" is true. ]]>
</element>
Last edited by NickFitz; 13 June 2010, 16:08.
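
To make the point of that counter-example concrete, here is a rough Python sketch (not .NET, for illustration only) that runs a naive "find everything between < and >" scan over the same document and compares it with what a real parser reports. The regular expression is just one plausible reading of "looking for <'s and >'s", not anyone's actual code.

```python
import re
import xml.etree.ElementTree as ET

doc = '<element attribute="hello>/>>"> <![CDATA[ The expression "1 < 2" is true. ]]> </element>'

# Naive approach: treat everything from "<" to the next ">" as a tag.
naive_tags = re.findall(r"<[^>]*>", doc)

# A real parser sees one element, one attribute, and the CDATA text.
root = ET.fromstring(doc)
attribute_value = root.attrib["attribute"]
text = root.text.strip()
```

The naive scan cuts the opening tag short at the ">" inside the attribute value, while the parser recovers the full attribute and the CDATA text.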
VectraMan replied

13 June 2010, 14:01
I implemented something just like this (but not .NET) with a SAX parser. As Nick says, the parser just waits for the input. I can't remember the name of the open source C++ library I used, but there seems to be a sax .NET on SourceForge which presumably works much the same.

But if you're determined to parse it: XML is pretty simple. Just looking for <'s, </'s and >'s ought to do it.
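
As a rough illustration of the "parser just waits for the input" behaviour, here is a sketch in Python rather than .NET or C++, using the standard library's incremental SAX interface. The RootTracker class and the chunk boundaries are invented for the example. Note that some recent Expat builds may hold back small feeds internally, so the sketch only relies on the state after close().

```python
import xml.sax


class RootTracker(xml.sax.ContentHandler):
    """Tracks element depth so we can tell when the root element has closed."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.complete = False
        self.root_name = None

    def startElement(self, name, attrs):
        if self.depth == 0:
            self.root_name = name
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:
            self.complete = True


handler = RootTracker()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# One document arriving in three arbitrary pieces, as it might from a socket.
chunks = [
    b'<?xml version="1.0" encoding="utf-8"?><SomeObj',
    b"ect><AProperty>469",
    b"82</AProperty></SomeObject>",
]
states = []
for chunk in chunks:
    parser.feed(chunk)  # a partial document is not an error; the parser waits
    states.append(handler.complete)
parser.close()  # signals end of input; would raise if the document were incomplete
```

Feeding a fragment that stops mid-tag does not raise; the handler only sees the root element close once the rest has arrived.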
NickFitz replied

13 June 2010, 02:23
Originally posted by ASB

They can send it with a packet prefix and suffix. So the parsing code is simple (but does still rely on it not being in any of the data).

An algorithm for generating multipart boundaries for MIME ought to take care of that. This chap suggests using an MD5 hash of the timestamp of the message included in some boilerplate text, which sounds like a viable approach.
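
A minimal sketch of that idea, in Python for brevity: the boundary is an MD5 hash of the timestamp wrapped in boilerplate text, and the sender checks it against the payload before using it. The function names, the boundary text, and the wire layout (boundary line, payload, boundary followed by "--") are assumptions for illustration, loosely modelled on MIME, not something specified in this thread.

```python
import hashlib


def make_boundary(payload: bytes, timestamp: float) -> bytes:
    """Build a delimiter from an MD5 hash of the timestamp, retrying on collision."""
    counter = 0
    while True:
        digest = hashlib.md5(f"{timestamp}:{counter}".encode("ascii")).hexdigest()
        boundary = f"----=_Boundary_{digest}".encode("ascii")
        if boundary not in payload:  # guarantee the delimiter is not in the data
            return boundary
        counter += 1


def frame(payload: bytes, timestamp: float) -> bytes:
    """Sender side: announce the boundary on the first line, repeat it at the end."""
    boundary = make_boundary(payload, timestamp)
    return boundary + b"\n" + payload + b"\n" + boundary + b"--\n"


def unframe(data: bytes) -> bytes:
    """Receiver side: read the boundary from the first line, then cut at its repeat."""
    boundary, _, rest = data.partition(b"\n")
    end = rest.index(b"\n" + boundary + b"--\n")
    return rest[:end]
```

Because the sender verifies that the boundary does not occur in the payload, the receiver no longer has to rely on luck for the delimiter to be unique.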
ASB replied

12 June 2010, 13:49
And the answer is...

They can send it with a packet prefix and suffix. So the parsing code is simple (but does still rely on it not being in any of the data).
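
For what it's worth, the parsing code for that can be sketched roughly as follows (Python here, not .NET). The thread does not say what the prefix and suffix actually are, so the STX/ETX bytes below are hypothetical placeholders, and the sketch assumes a single-byte prefix.

```python
PREFIX = b"\x02"  # hypothetical: the real prefix is not given in the thread
SUFFIX = b"\x03"  # hypothetical: the real suffix is not given in the thread


def extract_messages(buffer: bytes):
    """Return (complete_messages, leftover) for the data received so far."""
    messages = []
    while True:
        start = buffer.find(PREFIX)
        if start == -1:
            # No message start at all: nothing worth keeping.
            return messages, b""
        end = buffer.find(SUFFIX, start + len(PREFIX))
        if end == -1:
            # Partial message: keep it until more data arrives.
            return messages, buffer[start:]
        messages.append(buffer[start + len(PREFIX):end])
        buffer = buffer[end + len(SUFFIX):]
```

The caller appends each new read to the leftover and calls the function again, so partial and multiple messages per read are both handled.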
ASB replied

11 June 2010, 08:04
Originally posted by NickFitz

Off the top of my head...

To identify boundaries between objects you could look for <?xml version="1.0" encoding="utf-8"?>: the position immediately before its first character marks both the end of one object and the start of a new one.

To identify incomplete object representations you can, after splitting things into chunks on the basis of the above and processing all chunks that aren't last, check to see if the last chunk's opening tag is closed: grab the first bit (<SomeObject xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">) and use that to construct a string representing the appropriate end tag (</SomeObject>) then check to see if that's present before the end of the input; if not, wait for further input, append it to that chunk, and go back to step 1.

Of course a simple string match or RegExp test for step 2 will fail if there are any circumstances in which root objects of some type can contain objects of the same type at some level of nesting (e.g. <foo><bar><baz><foo></foo>[EOF] would fail).

There could be some more elegant solution to step 2 that involves piping what you have to a SAX parser, which will happily wait until it gets the remaining chunk of the document once it arrives on your input stream. Trying to do it with RegExp definitely falls into the "now you have two problems" category, as you'll have signed up to write an XML parser using RegExp, which isn't possible in the first place (XML being a Chomsky Type 2 language (context-free), and regular expressions only being able to handle Chomsky Type 3 languages (regular)). That said, if your case is suitably constrained - and will remain so in the future - then you might be able to get away with RegExp.

In your place, I'd be looking at step 1 to kick things off and then hoping that a suitably well-behaved SAX parser would take care of step 2.

Thanks Nick,

I think I had pretty much come to the same conclusion. I am in the position where I can parse it using the sort of methods you describe. However, it is inherently a bit fragile and is constrained by the actual data. In any event it's probably the best I can do.

Cheers.
NickFitz replied

11 June 2010, 03:04
Off the top of my head...

To identify boundaries between objects you could look for <?xml version="1.0" encoding="utf-8"?>: the position immediately before its first character marks both the end of one object and the start of a new one.

To identify incomplete object representations you can, after splitting things into chunks on the basis of the above and processing all chunks that aren't last, check to see if the last chunk's opening tag is closed: grab the first bit (<SomeObject xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">) and use that to construct a string representing the appropriate end tag (</SomeObject>) then check to see if that's present before the end of the input; if not, wait for further input, append it to that chunk, and go back to step 1.

Of course a simple string match or RegExp test for step 2 will fail if there are any circumstances in which root objects of some type can contain objects of the same type at some level of nesting (e.g. <foo><bar><baz><foo></foo>[EOF] would fail).

There could be some more elegant solution to step 2 that involves piping what you have to a SAX parser, which will happily wait until it gets the remaining chunk of the document once it arrives on your input stream. Trying to do it with RegExp definitely falls into the "now you have two problems" category, as you'll have signed up to write an XML parser using RegExp, which isn't possible in the first place (XML being a Chomsky Type 2 language (context-free), and regular expressions only being able to handle Chomsky Type 3 languages (regular)). That said, if your case is suitably constrained - and will remain so in the future - then you might be able to get away with RegExp.

In your place, I'd be looking at step 1 to kick things off and then hoping that a suitably well-behaved SAX parser would take care of step 2.
Last edited by NickFitz; 11 June 2010, 03:21.
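
The two steps can be sketched roughly like this, in Python rather than .NET. It is a sketch under assumptions, not a definitive implementation: it assumes every message starts with an XML declaration and that "<?xml" never appears inside message data, and instead of a string match for the end tag it simply tries to parse the last chunk, which sidesteps the nested same-name problem. The MessageSplitter class is a made-up name.

```python
import xml.etree.ElementTree as ET

DECL = b"<?xml"  # assumption: every message begins with an XML declaration


class MessageSplitter:
    """Accumulates raw reads and hands back complete XML documents."""

    def __init__(self):
        self.buffer = b""

    def feed(self, data: bytes):
        """Add newly received bytes; return the documents completed so far."""
        self.buffer += data

        # Step 1: find every declaration and cut the buffer at those points.
        starts = []
        pos = self.buffer.find(DECL)
        while pos != -1:
            starts.append(pos)
            pos = self.buffer.find(DECL, pos + 1)
        if not starts:
            return []  # no declaration seen yet; keep waiting
        ends = starts[1:] + [len(self.buffer)]
        chunks = [self.buffer[a:b] for a, b in zip(starts, ends)]

        # Any chunk followed by another declaration is taken to be complete.
        complete, last = chunks[:-1], chunks[-1]

        # Step 2: the last chunk is complete only if it parses as a document.
        try:
            ET.fromstring(last)
            complete.append(last)
            self.buffer = b""
        except ET.ParseError:
            self.buffer = last  # wait for further input
        return complete
```

Re-parsing the last chunk on every read is wasteful for large messages; feeding an incremental parser would avoid that, at the cost of more bookkeeping.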
ASB started a topic Deserializing XML from a TCP port

10 June 2010, 22:36
Deserializing XML from a TCP port
Well, I have a TCP port (.NET 2) that is being read. It receives raw XML. So the nominal input stream looks something like this:

Code:

<?xml version="1.0" encoding="utf-8"?>
<SomeObject xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <AProperty>46982</AProperty>
</SomeObject>
<?xml version="1.0" encoding="utf-8"?>
<SomeOtherObject xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <AProperty>46982</AProperty>
</SomeOtherObject>

I can deserialise each of the objects which I may receive - I know what the potential set of them is.

My problem is that a read on the port will simply return whatever is there. This can of course be a partial message or potentially more than one message. Any suggestions as to how I might be able to extract the strings representing the serialised objects? Stuffed if I can figure out a way.

Whenever I've done it I've tended to use a length at the beginning and a sentinel at the end so it's easy to extract the packets. Sadly I don't have this option, as I'm not in any sort of control of the sender.
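
For reference, the length-plus-sentinel framing mentioned above looks roughly like this when both ends cooperate (a Python sketch, not .NET). The 4-byte big-endian length and the zero-byte sentinel are assumptions chosen for the example.

```python
import struct

SENTINEL = b"\x00"  # hypothetical end marker


def encode(payload: bytes) -> bytes:
    """Sender side: 4-byte big-endian length, payload, then the sentinel."""
    return struct.pack(">I", len(payload)) + payload + SENTINEL


def decode_all(buffer: bytes):
    """Receiver side: return (complete_messages, leftover) for the data so far."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", buffer[:4])
        total = 4 + length + len(SENTINEL)
        if len(buffer) < total:
            break  # partial message: wait for more data
        if buffer[4 + length:total] != SENTINEL:
            raise ValueError("framing error: sentinel missing")
        messages.append(buffer[4:4 + length])
        buffer = buffer[total:]
    return messages, buffer
```

The length says exactly how much to read and the sentinel acts as a sanity check, so nothing depends on the content of the data.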