
test please delete


    Originally posted by NickFitz View Post
    I found it last night when I was doing my binary search for the deleted post - I had the database open in one tab and TPD in another, and narrowed it down by going back and forth in ever-decreasing circles, seeing if the CUK ID of the first post on a page correlated with the spidered TPD ID
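    The ever-decreasing-circles approach Nick describes is a classic binary search; a minimal sketch, assuming a hypothetical `ids_match(page)` check that reports whether the CUK ID of a page's first post still lines up with the spidered TPD ID:

```python
def first_mismatch(num_pages, ids_match):
    """Binary-search for the first page whose CUK post ID no longer
    lines up with the spidered TPD ID (i.e. just past the deletion)."""
    lo, hi = 1, num_pages
    while lo < hi:
        mid = (lo + hi) // 2
        if ids_match(mid):   # pages up to mid still line up: look later
            lo = mid + 1
        else:                # mismatch at mid or earlier: look earlier
            hi = mid
    return lo

# Toy example: pretend the deletion shifts IDs from page 42 onwards.
print(first_mismatch(100, lambda page: page < 42))  # -> 42
```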

    Your first TPD post
    Did you write something custom to spider TPD, or did you use a pre-written tool?

    Have you extracted it to a MySQL database or something else?

    Does it auto-spider periodically, or do you need to run something to do it?

    When you re-spider the thread, does it do the whole thing, or just new bits that it hasn't seen before?

    The techie geek in me (which accounts for at least 75%) is mightily intrigued.
    Best Forum Advisor 2014
    Work in the public sector? You can read my FAQ here
    Click here to get 15% off your first year's IPSE membership

    Comment


      Originally posted by cailin maith View Post
      Kathy sweetheart, it's 3.10pm
      In a sign of how little obsessed with TPD posts we have become, I've only just realised that this was CM's 7000th TPD post.

      Have a [bananas] to celebrate
      Last edited by TheFaQQer; 27 March 2008, 18:55. Reason: Badly spaced bananananananas

      Comment


        Originally posted by TheFaQQer View Post
        Did you write something custom to spider TPD, or did you use a pre-written tool?

        Have you extracted it to a MySQL database or something else?

        Does it auto-spider periodically, or do you need to run something to do it?

        When you re-spider the thread, does it do the whole thing, or just new bits that it hasn't seen before?

        The techie geek in me (which accounts for at least 75%) is mightily intrigued.
        The initial version was written in PHP, and just grabbed a page, scraped through it to find the section with the posts (which was usually valid XHTML), then used SimpleXML to grab the posts. These were then stuck into a MySQL database. The problem there was that my el-cheapo hosting provider doesn't provide any facilities for cron jobs or the like.
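        A rough Python stand-in for that first version (the markup, table schema, and `extract_posts` helper are invented for illustration, and SQLite stands in for MySQL to keep the sketch self-contained):

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical posts section, assumed to be valid XHTML as described.
SAMPLE = """<div id="posts">
  <div class="post" id="post123"><div class="message">First!</div></div>
  <div class="post" id="post124"><div class="message">Second.</div></div>
</div>"""

def extract_posts(xhtml):
    """Yield (post_id, body) pairs from the posts section."""
    root = ET.fromstring(xhtml)
    for div in root.iter("div"):
        if div.get("class") == "post":
            msg = div.find("./div[@class='message']")
            yield div.get("id"), msg.text.strip()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO posts VALUES (?, ?)", extract_posts(SAMPLE))
print(conn.execute("SELECT body FROM posts ORDER BY id").fetchall())
```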

        So, cunningly, the template that returned the message that page xxx had been spidered contained a <meta http-equiv="refresh"> element in the head. I pointed a tab of my browser at it, and every so many seconds it reloaded, causing the next page to be crawled.
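        The refresh trick amounts to emitting something like this from the spider script (the `/spider.php?page=` URL is a made-up placeholder):

```python
def crawl_response(page, last_page, delay=10):
    """Build the 'page N spidered' status page. While pages remain, a
    meta refresh sends the open browser tab on to the next page, so
    the crawl drives itself without needing cron."""
    refresh = ""
    if page < last_page:
        refresh = ('<meta http-equiv="refresh" '
                   f'content="{delay};url=/spider.php?page={page + 1}">')
    return (f"<html><head>{refresh}</head>"
            f"<body>Spidered page {page} of {last_page}</body></html>")

print(crawl_response(3, 10))
```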

        This was why it took so long - I didn't bother leaving my laptop on all the time, so it was only when I was at home and conscious that it was working.

        This got most of the pages, except a few with malformed markup or illegal characters.

        For those, I used curl from the OS X command line, piping the output into John Cowan's TagSoup parser, and into an XSLT stylesheet that turned it into nice neat XML. (It could as easily turn it into SQL, or JSON, or whatever.)
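        TagSoup itself is a Java library, but the salvage step can be sketched in Python: strip the control characters XML forbids, then let the deliberately tolerant `html.parser` pull text out of whatever markup remains (an illustrative stand-in, not the actual pipeline described above):

```python
import re
from html.parser import HTMLParser

# Control characters that are illegal in XML 1.0.
ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

class TextScraper(HTMLParser):
    """Tolerant parser: collect bare text from malformed markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def salvage(tag_soup):
    parser = TextScraper()
    parser.feed(ILLEGAL.sub("", tag_soup))
    return "".join(parser.chunks).strip()

print(salvage("<p>Broken <b>markup\x07 everywhere"))  # -> Broken markup everywhere
```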

        The next version will use the Java approach, running on an Amazon EC2 machine instance, so it can run independently as a regular thing. This will do two things initially: send the results of spidering new material to the testpleasedelete.com server via HTTP, so it can update the database, and also re-crawl the thread, as I forgot to grab post titles.

        The results of re-crawling the thread, and the new data, will be stored in Amazon's SimpleDB, mainly because I want to play with it. Then testpleasedelete.com will become a front end for the Amazon EC2 machine, which will act as a query server to the SimpleDB database, and also carry out any long-winded analysis. Also, the front end can cache useful stuff that's expensive to generate in terms of resources, like graphs and so forth. The beauty of EC2 is that, if somebody comes up with an idea that would potentially take ages to process, I can just fire up additional instances and form a cluster to blitz through the task.

        The second spidering won't be complete for a while, as I don't want to be hitting CUK as much as I did at the weekend. There's enough fun to be had with the MySQL DB at the moment, and the other stuff is mainly for experimentation anyway - I might end up moving it all back to MySQL when I get bored.

        As of the coming weekend, I should have it keeping up to date, and be able to get some interesting analyses going.
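        Answering the "whole thing or just new bits" question: keeping up to date only needs posts past the highest ID already stored. A sketch, with a hypothetical `fetch_page(n)` returning (post_id, body) pairs:

```python
def crawl_new(fetch_page, last_seen_id, start_page, last_page):
    """Re-spider only posts newer than the highest ID already stored,
    walking forward from the page that ID was last seen on."""
    new_posts = []
    for page in range(start_page, last_page + 1):
        for post_id, body in fetch_page(page):
            if post_id > last_seen_id:
                new_posts.append((post_id, body))
    return new_posts

# Toy thread: two posts per page; IDs up to 4 are already in the DB.
pages = {1: [(1, "a"), (2, "b")], 2: [(3, "c"), (4, "d")], 3: [(5, "e"), (6, "f")]}
print(crawl_new(pages.get, last_seen_id=4, start_page=2, last_page=3))  # -> [(5, 'e'), (6, 'f')]
```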

        I posted the XSLT here at the weekend, if you dig back a little... hang on... here it is

        Comment


          Originally posted by NickFitz View Post
          The initial version was written in PHP, and just grabbed a page... [quote trimmed]
          And that, in many words, explains why I haven't considered doing anything like that!!

          I did wonder whether it would be dead easy and straightforward to pull the whole thing into a MySQL database on my thefaqqer.com website. You've answered me with a fairly detailed, complex "no"

          Comment


            And that concludes another carppy day at work.


            Comment


              Originally posted by TheFaQQer View Post
              And that concludes another carppy day at work.

              Confusion is a natural state of being

              Comment


                Evening Diver

                How's the hand?

                Comment


                  Evening, Gents...
                  "I can put any old tat in my sig, put quotes around it and attribute to someone of whom I've heard, to make it sound true."
                  - Voltaire/Benjamin Franklin/Anne Frank...

                  Comment




                      Originally posted by cojak View Post
                      Evening, Gents...
                      evening cojak

                      Comment
