1

Topic: Web parser

Good afternoon. I often need to parse sites for various markets on an ongoing basis. The typical scenario: fetch the list of offers on the market, walk through the whole list, and save to a DB each real-estate object and everything related to it (apartment parameters, photos, and so on), so that the DB always holds the full history of the market from the moment the parser was started. The data is then meant to be analyzed from different angles.

Earlier I wrote a parser that crawled the required site and extracted everything needed into a DB (XPath, CSS selectors), but supporting it takes a lot of time. I suspect there should be a simpler solution; however, many hours of searching did not lead to a ready-made one.

To specify the requirements:
1. It runs in a separate process that is started automatically.
2. It emulates a modern browser with JavaScript, including AJAX (since, naturally, login has to be emulated).
3. It is multi-threaded.
4. Content is extracted easily and fairly reliably (ideally just by pointing with the mouse at the divs of interest and the parts inside them).

Working with an API, even if there is one, is often not an option.

If the only option is to write my own, in which language and with which libraries is it easiest and most reliable?

Yours faithfully, Alexey.
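As an illustration of the scenario described above (not the poster's actual parser), here is a minimal Python sketch that uses only the standard library. The markup (`<div class="listing">` with `data-id`, `data-price` and `data-rooms` attributes), the table name `snapshots` and all function names are assumptions invented for the example. A new row is appended only when a listing has changed, so the table accumulates the history of the market.

```python
import hashlib
import json
import sqlite3
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects data-* attributes of <div class="listing"> tags (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.listings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "listing" in classes:
            self.listings.append({
                "id": attrs.get("data-id"),
                "price": attrs.get("data-price"),
                "rooms": attrs.get("data-rooms"),
            })


def parse_listings(html):
    """Return a list of dicts, one per listing found in the page."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


def open_db(path=":memory:"):
    """Open (or create) the history database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        " listing_id TEXT, seen_at TEXT, digest TEXT, payload TEXT)"
    )
    return db


def save_snapshot(db, listing, seen_at):
    """Append a row only if the listing differs from its latest stored snapshot."""
    payload = json.dumps(listing, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    last = db.execute(
        "SELECT digest FROM snapshots WHERE listing_id = ? "
        "ORDER BY rowid DESC LIMIT 1",
        (listing["id"],),
    ).fetchone()
    if last is not None and last[0] == digest:
        return False  # unchanged since the previous crawl
    db.execute(
        "INSERT INTO snapshots VALUES (?, ?, ?, ?)",
        (listing["id"], seen_at, digest, payload),
    )
    db.commit()
    return True
```

Each crawl would call `parse_listings` on the fetched page and `save_snapshot` for every result; a price change then shows up as a second row for the same `listing_id`.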

2

Re: Web parser

Hello, Keith, you wrote:

K> To specify the requirements:
K> 1. It runs in a separate process that is started automatically.
K> 2. It emulates a modern browser with JavaScript, including AJAX (since, naturally, login has to be emulated).
K> 3. It is multi-threaded.

Take a look at Selenium. It is mentioned more often in a testing context, but it suits your task too.

K> 4. Content is extracted easily and fairly reliably (ideally just by pointing with the mouse at the divs of interest and the parts inside them).

With a mouse, hardly.

A code sample in Java (other languages are supported as well): http://www.qaautomation.net/?p=263

3

Re: Web parser

B> Hello, Keith, you wrote:
K>> To specify the requirements:
K>> 1. It runs in a separate process that is started automatically.
K>> 2. It emulates a modern browser with JavaScript, including AJAX (since, naturally, login has to be emulated).
K>> 3. It is multi-threaded.
B> Take a look at Selenium. It is mentioned more often in a testing context, but it suits your task too.

Selenium is heavy: each thread of execution spawns a separate browser process. A "web browser" UI component inside an application without a UI would be lighter; there would be no such overhead. I suspect such components could even be put into a pool. Selenium Grid, as far as I understand, does not solve the problem either when there is one machine for $5 (two is already too many).

K>> 4. Content is extracted easily and fairly reliably (ideally just by pointing with the mouse at the divs of interest and the parts inside them).
B> With a mouse, hardly.

Why? Selenium IDE can record mouse and keyboard actions. It works well in standard cases. I have seen successful examples of generating RSS for news sites that have no RSS.
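The idea of keeping browser instances in a pool, mentioned above, could look roughly like the following Python sketch. `FakeBrowser` is a placeholder invented for the example; a real setup would put actual browser sessions into the pool, and their API will differ. The point is only that many worker threads share a small, fixed number of expensive instances.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeBrowser:
    """Placeholder for an expensive browser instance (hypothetical)."""

    created = 0
    _lock = threading.Lock()

    def __init__(self):
        with FakeBrowser._lock:
            FakeBrowser.created += 1

    def fetch(self, url):
        # A real browser would load the page and run its JavaScript here.
        return "<html>%s</html>" % url


class BrowserPool:
    """Fixed-size pool: browsers are created once and reused by all threads."""

    def __init__(self, size, factory=FakeBrowser):
        self._free = queue.Queue()
        for _ in range(size):
            self._free.put(factory())

    @contextmanager
    def borrow(self):
        browser = self._free.get()  # blocks until a browser is free
        try:
            yield browser
        finally:
            self._free.put(browser)  # always hand the browser back


def scrape_all(urls, pool, workers=8):
    """Fetch all URLs with a thread pool, sharing the browsers in `pool`."""

    def task(url):
        with pool.borrow() as browser:
            return browser.fetch(url)

    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(task, urls))
```

With a pool of two browsers and eight worker threads, at most two pages are loaded at any moment, which bounds memory use on a small machine.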

4

Re: Web parser

K> Selenium is heavy: each thread of execution spawns a separate browser process.

There is PhantomJS. It is also a process with a browser, but the browser is headless, so it is lighter.

5

Re: Web parser

Hello, hi_octane, you wrote:

K>> Selenium is heavy: each thread of execution spawns a separate browser process.
_> There is PhantomJS. It is also a process with a browser, but the browser is headless, so it is lighter.

Thanks. I also found a headless driver for Selenium: http://www.seleniumhq.org/docs/03_webdr … nit-driver

6

Re: Web parser

Hello, Keith, you wrote:

K> Why? Selenium IDE can record mouse and keyboard actions.

I did not know that, since I have hardly worked with it. Thanks!

7

Re: Web parser

Hello, Keith, you wrote:

K> Working with an API, even if there is one, is often not an option.

And working around the API often gets you blocked. You are not the only clever one.

8

Re: Web parser

K>> Working with an API, even if there is one, is often not an option.
Ops> And working around the API often gets you blocked. You are not the only clever one.

I have already dealt with that: I made requests through a proxy and limited the request rate. A browser with JavaScript cannot be distinguished from a real user in any way if you put in a little effort. And often there is no API at all.
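Limiting the request rate, as mentioned above, can be done with a very small helper. This Python sketch is an assumption about one simple way to do it (a fixed minimum interval between requests), not the poster's code; the class name `Throttle` and its parameters are made up for the example. The clock and sleep functions are injectable only so that the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would call `throttle.wait()` immediately before every request to the same site.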

9

Re: Web parser

10

Re: Web parser

Hello, Keith, you wrote:

K> Earlier I wrote a parser that crawled the required site and extracted everything needed into a DB (XPath, CSS selectors),
K> but supporting it takes a lot of time.

What exactly did the "support" consist of? Updating XPath expressions for a new design?

K> I suspect there should be a simpler solution

What were those suspicions based on?

K> however, many hours of searching did not lead to a ready-made one.

Because the suspicions are groundless; it is pure "I am too lazy to do it this way".

K> To specify the requirements:
K> 2. It emulates a modern browser with JavaScript, including AJAX (since, naturally, login has to be emulated).
K> 4. Content is extracted easily and fairly reliably (ideally just by pointing with the mouse at the divs of interest and the parts inside them).

Ha! You are aiming at Shakespeare himself... Even Google would not refuse such a thing, but alas, it cannot have it, because the task is unrealistic. Do you not feel yourself that the complexity of the task borders on the absurd? As if the rest were not enough, there are mouse clicks on top of it. If a parser is given at least static HTML that can be parsed, that is already happiness!

11

Re: Web parser

K> What exactly did the "support" consist of? Updating XPath expressions for a new design?

If it were a one-off procedure, I would agree. But with ten such parsers, support alone will eat up all the time. I want to spend the minimum of time on it.

K>> I suspect there should be a simpler solution,
K> What were those suspicions based on?

On the tools with similar functionality that I have found.

K>> however, many hours of searching did not lead to a ready-made one.
K> Because the suspicions are groundless; it is pure "I am too lazy to do it this way".

Of course I am too lazy to do work that can be avoided.

K>> To specify the requirements:
K>> 2. It emulates a modern browser with JavaScript, including AJAX (since, naturally, login has to be emulated).
K>> 4. Content is extracted easily and fairly reliably (ideally just by pointing with the mouse at the divs of interest and the parts inside them).
K> Ha! You are aiming at Shakespeare himself...

Really, do I rhyme that beautifully?

K> Even Google would not refuse such a thing, but alas, it cannot have it, because the task is unrealistic.

In fact, I have already found a point-and-click solution: the XPath is generated automatically when you select an element with the mouse. However, I have not yet examined everything I found; perhaps I will find a better implementation...

K> Do you not feel yourself that the complexity of the task borders on the absurd? As if the rest were not enough, there are mouse clicks on top of it.

Excuse me, are you an expert in this area? I would gladly buy an hour of consultation on this subject from you for $200.

K> If a parser is given at least static HTML that can be parsed, that is already happiness!

Apparently I was mistaken. Please look for more information on this subject, so that there is something to discuss.

12

Re: Web parser

Hello, Keith, you wrote:

K> If the only option is to write my own, in which language and with which libraries is it easiest and most reliable?

nodejs + jquery + jsdom

All modern sites are built around CSS and jQuery, so parsing them with that same jQuery is elementary. Plus the legendary reliability and asynchrony of node.js.

13

Re: Web parser

L> I have already had about 20 orders on this kind of subject. If the question is about .NET, this is what I used:
L> - WebBrowser: a working solution out of the box. It works well in WinForms and a little worse in WPF, but it is enough for parsing. The drawback is that it is a wrapper around IE7, with all that this implies. There are registry settings that make it identify as IE8, 9 or 10, but they made no difference in functionality.
L> - WatiN: in theory it belongs wherever web test automation is needed, but it can be adapted for parsing a site. I have not used it myself.
L> - HTML Agility Pack: simply a parser, without JS.

I have used these as well, but now I want to move towards a Linux-compatible solution.

L> If the question is about Python: it is not my area, but some code was written for me to order.

For Python there is BeautifulSoup plus a standard way of making HTTP requests. And Selenium as it is.

L> And there is also UBot, where things can be done with the mouse, but I would not recommend it as a general-purpose solution.

That is already closer to what I would like, only it is expensive. And why do you not recommend it? Apart from the fact that it is an external dependency.

L> And on Upwork such tasks cost around $50-200, depending on complexity.

Do such tasks come up often? Is the competition high?

14

Re: Web parser

P> nodejs + jquery + jsdom

jsdom looks like an interesting replacement for PhantomJS. Does node have any advantages in parsing that, say, Java/Python/Ruby lack?

P> Plus the legendary reliability and asynchrony of node.js.

If only it had synchronous calls as well, it would be priceless :) And in general, many people complain about leaks.

15

Re: Web parser

Hello, Keith, you wrote:

K> Does node have any advantages in parsing that, say, Java/Python/Ruby lack?

Native JS. For example, you can execute the JS that arrived with the site, catch some intermediate object and extract the data from it.

K> If only it had synchronous calls as well, it would be priceless :)

Discover async.js and forget about synchronous code like a bad dream.

K> And in general, many people complain about leaks.

There are no leaks there. We have scripts whose uptime is measured in years.

16

Re: Web parser

K>> Does node have any advantages in parsing that, say, Java/Python/Ruby lack?
P> Native JS. For example, you can execute the JS that arrived with the site, catch some intermediate object and extract the data from it.

And how does that differ from other platforms? There, too, you can make an HTTP request and deserialize JSON into objects of the language.

K>> If only it had synchronous calls as well, it would be priceless :)
P> Discover async.js and forget about synchronous code like a bad dream.

When I was looking for something like that I did not find it, thanks. But it is still inconvenient to turn calls into synchronous ones every time, when for a UI you mostly need to lock the interface between actions.

K>> And in general, many people complain about leaks.
P> There are no leaks there. We have scripts whose uptime is measured in years.

The scripts are simple, I suppose? But in general that is not important. I am ready to test it if it is worth it.

17

Re: Web parser

Hello, Keith, you wrote:

K> And how does that differ from other platforms?
K> There, too, you can make an HTTP request and deserialize JSON into objects of the language.

On other platforms you cannot run the code that has arrived from the server.

K> But it is still inconvenient to turn calls into synchronous ones every time, when for a UI you mostly need
K> to lock the interface between actions.

You do not need to. For a UI you show an hourglass, make the request, and attach a callback in which you hide the hourglass and show the result. In the asynchronous variant this is even simpler than in the synchronous one.

The advantage of the asynchronous implementation is that you can show the hourglass only in the part of the interface that is connected to the request, while everything else keeps working.

18

Re: Web parser

K>> And how does that differ from other platforms?
K>> There, too, you can make an HTTP request and deserialize JSON into objects of the language.
P> On other platforms you cannot run the code that has arrived from the server.

I agree, for a web application that is convenient. But in the context of a parser it has no value.

K>> But it is still inconvenient to turn calls into synchronous ones every time, when for a UI you mostly need
K>> to lock the interface between actions.
P> You do not need to. For a UI you show an hourglass, make the request, and attach a callback in which you hide the hourglass and show the result.
P> In the asynchronous variant this is even simpler than in the synchronous one.

I do not agree: it is harder to wrap the result handling in an additional function, plus the `this` context changes. It is simpler in the synchronous form:

progress.Start(); uicomponent.data = domainService.GetByIds(ids); progress.Stop();

P> The advantage of the asynchronous implementation is that you can show the hourglass only in the part of the interface that is connected to the request, while everything else keeps working.

With a synchronous call you can also lock only a part of the interface; I did not mean the whole of it.
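For comparison with the synchronous snippet above, here is how the same three steps might look with `async`/`await`. This is a Python sketch rather than the JavaScript or .NET code being discussed, and `Progress` and `get_by_ids` are hypothetical stand-ins for the progress indicator and the data service. It is not what either poster proposed; it only shows that asynchronous code can keep the linear start/load/stop shape without a separate callback function.

```python
import asyncio


class Progress:
    """Stand-in for a UI progress indicator; records events instead of drawing."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


async def get_by_ids(ids):
    """Hypothetical asynchronous data service."""
    await asyncio.sleep(0)  # yields control, as a real network call would
    return [{"id": i} for i in ids]


async def load(progress, ids):
    """Same order of steps as the synchronous version, but non-blocking."""
    progress.start()
    try:
        return await get_by_ids(ids)
    finally:
        progress.stop()  # runs even if the request fails
```

While `load` is awaiting the data, the event loop is free to run other tasks, which corresponds to the rest of the interface staying responsive.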

19

Re: Web parser

Try PhantomJS.
A full browser emulator that works on its own, in the console.

20

Re: Web parser

S> Try PhantomJS.
S> A full browser emulator that works on its own, in the console.

Is there any advantage of node.js + PhantomJS over Selenium or a browser engine on another platform (Java, Python, Ruby)? Do you have experience in developing parsers?

21

Re: Web parser

Hello, Keith, you wrote:

K> Is there any advantage of node.js + PhantomJS over Selenium or
K> a browser engine on another platform (Java, Python, Ruby)?
K> Do you have experience in developing parsers?

Yes, going back to the days when I played around with wget together with sed and their relatives. That is when I started accumulating experience. Later there was also a JSON parser/builder.

Comrade, Phantom can even render a site to an image. And correctly. That says a lot.

Only I cannot understand what Node.js has to do with it. PhantomJS is independent.

Q: Why is PhantomJS not written as Node.js module?
A: The short answer: "No one can serve two masters."
A longer explanation is as follows. As of now, it is technically very challenging to do so. Every Node.js module is essentially "a slave" to the core of Node.js, i.e. "the master". In its current state, PhantomJS (and its included WebKit) needs to have the full control (in a synchronous matter) over everything: event loop, network stack, and JavaScript execution. If the intention is just about using PhantomJS right from a script running within Node.js, such a "loose binding" can be achieved by launching a PhantomJS process and interact with it.

22

Re: Web parser

S> Comrade, Phantom can even render a site to an image. And correctly. That says a lot.

Selenium can do that too. So can browser engines.

S> If the intention is just about using PhantomJS right from a script running within Node.js, such a "loose binding" can be achieved by launching a PhantomJS process and interact with it.

I did not know that. Is it possible to run several "browsers" simultaneously in one PhantomJS process?

23

Re: Web parser

Hello, Keith, you wrote:

S>> Comrade, Phantom can even render a site to an image. And correctly. That says a lot.
K> Selenium can do that too.

It is some kind of monster. There is the web driver, the client and more, and on top of that it is Windows only... I am not saying it is bad, but in my opinion it is like shooting sparrows with a cannon...

K> So can browser engines.

I would be interested to read how to get the Chrome engine to render a site to an image...

S>> If the intention is just about using PhantomJS right from a script running within Node.js, such a "loose binding" can be achieved by launching a PhantomJS process and interact with it.
K> I did not know that.
K> Is it possible to run several "browsers" simultaneously in one PhantomJS process?

It works a little differently. It runs a script written in JavaScript, which you pass to phantom as a parameter.
http://phantomjs.org/quick-start.html
That is, everything useful that you need to do, you write in this script, and you feed it to phantom together with the command line options.

24

Re: Web parser

Hello, Keith, you wrote:

L>> And there is also UBot, where things can be done with the mouse, but I would not recommend it as a general-purpose solution.
K> That is already closer to what I would like, only it is expensive.
K> And why do you not recommend it?
K> Apart from the fact that it is an external dependency.

The point is that when you write any parser that crawls sites, there is a high probability that it will hang on one of them: frames do not finish loading, and there are a million other reasons. With your own code you can work around these nuances, for example by coding the wait timers you need and, in case of a long hang, restarting the program.

UBot is not flexible at all in this respect. If something goes wrong, working around the problem will be very hard, even using its built-in language. However, with a combination of an application plus UBot, it is probably possible to get the result by launching the application where UBot cannot cope.

L>> And on Upwork such tasks cost around $50-200, depending on complexity.
K> Do such tasks come up often? Is the competition high?

Only 4 times in the last year. And I was not even looking for them; they were offered to me. Such tasks appear constantly and the problem is a real one; my guess is that the UBot developers are already sitting on bags of gold. Look at this section, for example. I bet that on the first page there will be something like "Need a scrapper / Need a bot".
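The "wait timer plus restart on a long hang" approach described above can be sketched in Python by running each scraping job as a child process under a watchdog. This is an assumption about one way to do it, not the poster's implementation; `run_with_watchdog` and its parameters are invented for the example, and the demo command is a trivial stand-in for a real scraper (for instance, a headless browser started with a script).

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, attempts=3):
    """Run cmd; if it hangs longer than timeout_s, kill it and try again.

    Returns the captured stdout of the first successful run, or None
    if every attempt timed out or exited with a non-zero code.
    """
    for _ in range(attempts):
        try:
            done = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the hung child process.
            continue
        if done.returncode == 0:
            return done.stdout
    return None


if __name__ == "__main__":
    # Stand-in for a real scraper command.
    worker = [sys.executable, "-c", "print('scraped')"]
    print(run_with_watchdog(worker, timeout_s=30))
```

Because the hung job lives in its own process, killing it also releases whatever the page had tied up, which is harder to guarantee when the scraper runs as a thread inside the main program.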