CN104484382A - Method and device for generating time-based seed page set - Google Patents

Publication number
CN104484382A
Authority
CN
China
Prior art keywords
page
judge
subpage
new url
ageing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410758178.2A
Other languages
Chinese (zh)
Inventor
魏少俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201410758178.2A
Publication of CN104484382A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention provides a method and a device for generating a time-based seed page set. The method comprises the following steps: acquiring and analyzing the attribute information of a plurality of pages to be judged; screening pages of which the attribute information satisfies a time-based seed page attribute condition from the pages to be judged; aggregating the screened pages satisfying the time-based seed page attribute condition to generate the time-based seed page set. According to the method and the device, the pages to be judged can be screened based on the time-based seed page attribute condition, and the pages satisfying the time-based seed page attribute condition are screened and aggregated into the time-based seed page set, so that the recall rate and accuracy of time-based seed pages are ensured.

Description

Method and apparatus for generating a time-based seed page set
Technical field
The present invention relates to the field of information search, and in particular to a method and an apparatus for generating a time-based seed page set, and to a page crawling method and a search engine that use the generated time-based seed page set.
Background
A search engine needs to discover and index real-time hot content on the Internet as soon as it appears. A search engine crawler maintains a huge URL (Uniform Resource Locator) library, on the order of hundreds of billions or even trillions of entries. All of the crawler's fetches originate from this URL library: a batch of URLs is selected from the library and crawled, and the new links discovered in the crawled pages are added back to the library. Links to hot content are discovered in this way, and are crawled and indexed once they are selected in a later round.
However, repeatedly selecting from and re-crawling the entire set of links means processing a very large volume of data, which consumes a great deal of time, so it is difficult to ensure that all hot content is discovered and indexed promptly. How to improve crawling efficiency has therefore become a technical problem in urgent need of a solution.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method and an apparatus for generating a time-based seed page set that overcome the above problems or at least partially solve them, together with a page crawling method and a search engine that use the generated time-based seed page set.
According to one aspect of the present invention, a method for generating a time-based seed page set is provided, comprising: acquiring and analyzing attribute information of a plurality of pages to be judged; screening out, from the plurality of pages to be judged, pages whose attribute information satisfies a time-based seed page attribute condition; and aggregating the screened pages that satisfy the time-based seed page attribute condition to generate the time-based seed page set.
Optionally, before acquiring and analyzing the attribute information of the plurality of pages to be judged, the method further comprises: crawling the plurality of pages to be judged at a specified time period.
Optionally, screening out, from the plurality of pages to be judged, the pages whose attribute information satisfies the time-based seed page attribute condition comprises: for each page to be judged, comparing the links in the page with the links already indexed; counting, according to the comparison result, the number of new links in the page; and screening out, from the plurality of pages to be judged, the pages whose number of new links meets the new-link count required of a time-based seed page.
Optionally, screening out the pages whose number of new links meets the new-link count required of a time-based seed page comprises: for each page to be judged, judging whether the number of new links in the page is greater than a first preset threshold; and if so, determining that the page meets the new-link count required of a time-based seed page.
Optionally, the method further comprises: for each page to be judged, counting the number of new links in the page that have index value; judging whether that number is greater than a second preset threshold; and if so, determining that the page meets the count of new links with index value required of a time-based seed page.
Optionally, the method further comprises: classifying the plurality of pages to be judged by URL; for each class of URL, counting the number of new links, and the number of new links with index value, in the pages to be judged corresponding to that class; judging whether these two numbers meet the new-link count and the count of new links with index value required of a time-based seed page; and if so, determining that the pages to be judged corresponding to that class of URL are time-based seed pages.
According to another aspect of the present invention, a page crawling method is provided, comprising crawling pages using the time-based seed page set generated as described above.
According to another aspect of the present invention, an apparatus for generating a time-based seed page set is also provided, comprising:
an analyzer, adapted to acquire and analyze attribute information of a plurality of pages to be judged;
a page filter, adapted to screen out, from the plurality of pages to be judged, pages whose attribute information satisfies a time-based seed page attribute condition; and
a seed page generator, adapted to aggregate the screened pages that satisfy the time-based seed page attribute condition to generate the time-based seed page set.
Optionally, the apparatus further comprises:
a grabber, adapted to crawl the plurality of pages to be judged at a specified time period before the analyzer acquires and analyzes their attribute information.
Optionally, the page filter is further adapted to: for each page to be judged, compare the links in the page with the links already indexed; count, according to the comparison result, the number of new links in the page; and screen out, from the plurality of pages to be judged, the pages whose number of new links meets the new-link count required of a time-based seed page.
Optionally, the page filter is further adapted to: for each page to be judged, judge whether the number of new links in the page is greater than a first preset threshold; and if so, determine that the page meets the new-link count required of a time-based seed page.
Optionally, the page filter is further adapted to: for each page to be judged, count the number of new links in the page that have index value; judge whether that number is greater than a second preset threshold; and if so, determine that the page meets the count of new links with index value required of a time-based seed page.
Optionally, the page filter is further adapted to: classify the plurality of pages to be judged by URL; for each class of URL, count the number of new links, and the number of new links with index value, in the pages to be judged corresponding to that class; judge whether these two numbers meet the new-link count and the count of new links with index value required of a time-based seed page; and if so, determine that the pages to be judged corresponding to that class of URL are time-based seed pages.
According to yet another aspect of the present invention, a search engine is also provided, comprising the above apparatus for generating a time-based seed page set.
In the technical solution provided by the present invention, attribute information of a plurality of pages to be judged is acquired and analyzed, and the pages that satisfy the time-based seed page attribute condition are then screened out from them. Because the coverage is broad, a relatively comprehensive and complete time-based seed page set can be generated. Moreover, the present invention screens the pages to be judged on the basis of the time-based seed page attribute condition and aggregates the pages that satisfy it into the time-based seed page set, which ensures the recall and accuracy of time-based seed pages. Further, the time-based seed page set obtained by screening is only on the order of millions of pages while recall is maintained, which greatly reduces the crawler's workload. This solves the problem in the related art that the crawler must repeatedly select from and re-crawl the entire set of links (on the order of hundreds of billions or even trillions), which involves a very large volume of data and consumes a great deal of time. Crawling efficiency is thereby improved, and hot content can be discovered and indexed promptly.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features, and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
From the following detailed description of specific embodiments of the present invention, taken in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for generating a time-based seed page set according to an embodiment of the present invention;
Fig. 2 shows a schematic structural diagram of an apparatus for generating a time-based seed page set according to an embodiment of the present invention; and
Fig. 3 shows a schematic structural diagram of an apparatus for generating a time-based seed page set according to another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the above technical problem, an embodiment of the present invention provides a method for generating a time-based seed page set. Fig. 1 shows a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises at least the following steps S102 to S106.
Step S102: acquire and analyze attribute information of a plurality of pages to be judged.
Step S104: screen out, from the plurality of pages to be judged, the pages whose attribute information satisfies the time-based seed page attribute condition.
Step S106: aggregate the screened pages that satisfy the time-based seed page attribute condition to generate the time-based seed page set.
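Steps S102 to S106 can be pictured as a small filter-and-aggregate pipeline. The following is only a minimal sketch under stated assumptions: the `PageAttributes` type, the function name, the example URLs, and the threshold values in the condition are all hypothetical and do not come from the specification, which leaves the concrete attribute condition open.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PageAttributes:
    """Hypothetical result of step S102 for one page to be judged."""
    url: str
    new_link_count: int            # links on the page that are not yet indexed
    valuable_new_link_count: int   # new links judged to have index value


def generate_seed_page_set(
    candidates: Iterable[PageAttributes],
    condition: Callable[[PageAttributes], bool],
) -> List[str]:
    """Steps S104 and S106: keep the pages that satisfy the condition and aggregate their URLs."""
    return [page.url for page in candidates if condition(page)]


# Illustrative attribute values only (assumed, not taken from real crawl data).
candidates = [
    PageAttributes("http://example.com/a", 4, 4),
    PageAttributes("http://example.com/b", 4, 4),
    PageAttributes("http://example.com/c", 3, 1),
]

# Assumed condition: more than 3 new links and more than 2 new links with index value.
seed_set = generate_seed_page_set(
    candidates,
    condition=lambda p: p.new_link_count > 3 and p.valuable_new_link_count > 2,
)
```

Passing the condition in as a function keeps the sketch neutral about which attribute conditions are combined, since the specification describes several optional ones.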
In the technical solution provided by the present invention, attribute information of a plurality of pages to be judged is acquired and analyzed, and the pages that satisfy the time-based seed page attribute condition are then screened out from them. Because the coverage is broad, a relatively comprehensive and complete time-based seed page set can be generated. Moreover, the present invention screens the pages to be judged on the basis of the time-based seed page attribute condition and aggregates the pages that satisfy it into the time-based seed page set, which ensures the recall and accuracy of time-based seed pages. Further, the time-based seed page set obtained by screening is only on the order of millions of pages while recall is maintained, which greatly reduces the crawler's workload. This solves the problem in the related art that the crawler must repeatedly select from and re-crawl the entire set of links (on the order of hundreds of billions or even trillions), which involves a very large volume of data and consumes a great deal of time. Crawling efficiency is thereby improved, and hot content can be discovered and indexed promptly.
Before step S102 acquires and analyzes the attribute information of the plurality of pages to be judged, the present invention may also crawl the plurality of pages to be judged at a specified time period, for example 1 day or 1 hour. For example, 360 video pages are crawled daily, and the URLs of the crawled pages are http://video.so.com/11-01, http://video.so.com/11-02, and so on; the dates "11-01" and "11-02" distinguish the pages crawled on each day.
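As a small illustration of the daily crawl example, the dated URLs can be enumerated from a start date. This is a sketch only: the helper name is hypothetical, the year 2014 is an assumption (the example gives only month and day), and the `base/MM-DD` pattern is inferred from the two example URLs.

```python
from datetime import date, timedelta
from typing import List


def daily_page_urls(base: str, start: date, days: int) -> List[str]:
    """Build one URL per day, suffixed with the month and day as in the example."""
    return [
        f"{base}/{(start + timedelta(days=offset)).strftime('%m-%d')}"
        for offset in range(days)
    ]


# Assumed start date; only "11-01" and "11-02" appear in the text.
urls = daily_page_urls("http://video.so.com", date(2014, 11, 1), 2)
```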
After step S102 acquires and analyzes the attribute information of the plurality of pages to be judged, step S104 screens out, from the plurality of pages to be judged, the pages whose attribute information satisfies the time-based seed page attribute condition. In the embodiment of the present invention, the time-based seed page attribute condition may be that the seed page produces new links, that the new links it produces have index value, that the seed page keeps producing new links, and so on. Here, the new links produced by a seed page having index value means that they must not point to duplicate, spam, cheating, or similar pages. In addition, the seed page must keep producing new links because links produced only once have no value for repeated scheduling, so new links need to be produced repeatedly or continuously. (Scheduling here refers to crawl scheduling: the process by which the crawler decides which pages need to be crawled and selects them.) The screening scheme corresponding to each time-based seed page attribute condition is described in detail below.
First, regarding the seed page producing new links. In the scheme provided by the present invention, for each page to be judged, the links in the page are compared with the links already indexed, and the number of new links in the page is counted according to the comparison result; the pages whose number of new links meets the new-link count required of a time-based seed page are then screened out from the plurality of pages to be judged. For example, suppose the pages to be judged are pages A, B, and C, where page A contains links A1, A2, A3, A4, and A5, page B contains links B1, B2, B3, and B4, and page C contains links C1, C2, and C3. Comparing the links in each page with the links already indexed and counting the result gives 4, 4, and 3 new links for pages A, B, and C, respectively. Note that the pages to be judged and their counts enumerated here are only illustrative and do not limit the present invention. The pages whose number of new links meets the new-link count required of a time-based seed page are then screened out. For example, for each page to be judged, it is judged whether the number of new links in the page is greater than a first preset threshold; if so, the page is determined to meet the new-link count required of a time-based seed page, and otherwise it is determined not to meet it. Further, in practical applications a given new link may be found in several pages to be judged; to avoid inflating its contribution, in a preferred scheme of the present invention the new link is counted as a new link of only one of those pages.
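The new-link count and the first-threshold check can be sketched as follows, using pages A, B, and C from the example. Several details are assumptions made for illustration: the specification gives only the resulting counts (4, 4, 3), so the indexed set `{"A1"}` is chosen to reproduce them; the threshold value 3 is hypothetical; and crediting a shared new link to the first page that contains it is one possible reading of "only one of those pages".

```python
from typing import Dict, List, Set


def count_new_links(pages: Dict[str, List[str]], indexed: Set[str]) -> Dict[str, int]:
    """Count, per page, the links that are not yet indexed.

    A new link found on several pages is credited only to the first page
    that contains it (an assumed tie-breaking rule).
    """
    credited: Set[str] = set()
    counts: Dict[str, int] = {}
    for page, links in pages.items():
        count = 0
        for link in links:
            if link in indexed or link in credited:
                continue
            credited.add(link)
            count += 1
        counts[page] = count
    return counts


pages = {
    "A": ["A1", "A2", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3"],
}
indexed = {"A1"}  # assumed: chosen so the counts match the example (4, 4, 3)

counts = count_new_links(pages, indexed)

FIRST_THRESHOLD = 3  # hypothetical value of the first preset threshold
passing = [page for page, n in counts.items() if n > FIRST_THRESHOLD]
```

With these assumed values, pages A and B exceed the first threshold and page C does not.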
Second, regarding the new links produced by the seed page having index value, that is, the new links must not point to duplicate, spam, cheating, or similar pages. In an embodiment of the present invention, for each page to be judged, the number of new links in the page that have index value is counted, and it is judged whether that number is greater than a second preset threshold; if so, the page is determined to meet the count of new links with index value required of a time-based seed page, and otherwise it is determined not to meet it. Continuing the example above, in which pages A, B, and C produce 4, 4, and 3 new links respectively, counting the new links with index value in each page gives 4, 4, and 1 for pages A, B, and C, respectively. The pages whose number of new links with index value is greater than the second preset threshold are then screened out from pages A, B, and C. Note that the pages to be judged and their counts enumerated above are only illustrative and do not limit the present invention; in practical applications, the scale of the pages to be judged can reach hundreds of billions or even trillions of links.
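The index-value check can be sketched in the same style. How a link is judged to be a duplicate, spam, or cheating page is not specified, so the sketch takes that judgment as a caller-supplied predicate; the `low_value` set and the threshold value 2 are hypothetical and chosen only to reproduce the example counts (4, 4, 1).

```python
from typing import Callable, Dict, List


def count_valuable_links(
    new_links: Dict[str, List[str]],
    has_index_value: Callable[[str], bool],
) -> Dict[str, int]:
    """Count, per page, the new links that the predicate accepts as having index value."""
    return {
        page: sum(1 for link in links if has_index_value(link))
        for page, links in new_links.items()
    }


# New links per page, consistent with the example counts 4, 4, 3.
new_links = {
    "A": ["A2", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4"],
    "C": ["C1", "C2", "C3"],
}
low_value = {"C2", "C3"}  # hypothetical: links flagged as duplicate, spam, or cheating

valuable = count_valuable_links(new_links, lambda link: link not in low_value)

SECOND_THRESHOLD = 2  # hypothetical value of the second preset threshold
passing = [page for page, n in valuable.items() if n > SECOND_THRESHOLD]
```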
Third, regarding the seed page continuing to produce new links: links produced only once have no value for repeated scheduling, so new links need to be produced repeatedly or continuously. The present invention provides a preferred scheme in which the plurality of pages to be judged are classified by URL and then, for each class of URL, the number of new links and the number of new links with index value in the pages to be judged corresponding to that class are counted. It is then judged whether these two numbers meet the new-link count and the count of new links with index value required of a time-based seed page; if so, the pages to be judged corresponding to that class of URL are determined to be time-based seed pages, and otherwise they are determined not to be time-based seed pages.
For example, taking one day as the period: on day 1, the pages to be judged are pages A01, B01, and C01, where page A01 contains links A11, A12, A13, A14, and A15, page B01 contains links B11, B12, B13, and B14, and page C01 contains links C11, C12, and C13. For each page to be judged, the links in the page are compared with the links already indexed and the number of new links is counted according to the comparison result. This gives 4, 4, and 3 new links for pages A01, B01, and C01, respectively, of which 4, 4, and 3 have index value. On day 2, the pages to be judged are pages A02, B02, C02, and D02, where page A02 contains links A21, A22, and A23, page B02 contains links B21, B22, B23, B24, and B25, page C02 contains links C21 and C22, and page D02 contains links D21 and D22. The same comparison and counting gives 1, 5, 1, and 2 new links for pages A02, B02, C02, and D02, respectively, of which 0, 4, 1, and 2 have index value. Proceeding in the same way day by day, the number of new links and the number of new links with index value can be counted for each page to be judged; the details are shown in Table 1 below.
Table 1
Subsequently, the pages to be judged shown in Table 1 are classified by URL. Analysis of the page URLs shows that A01 and A02 can be grouped into one class, B01 and B02 into one class, and C01 and C02 into one class. For each class of URL, the number of new links and the number of new links with index value in the pages to be judged corresponding to that class are counted. It is then judged whether these two numbers meet the new-link count and the count of new links with index value required of a time-based seed page. If so, the pages to be judged corresponding to that class of URL are determined to be time-based seed pages; otherwise, they are determined not to be time-based seed pages. Note that the crawl period, the pages to be judged, and their counts enumerated above are only illustrative and do not limit the present invention. In practical applications, pages may also be crawled at periods such as every 1 hour or every 2 hours, and the scale of the pages to be judged can reach hundreds of billions or even trillions of links.
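The per-class judgment over several periods can be sketched with the day 1 and day 2 counts from the example. The specification does not state how the counts of a URL class are combined across periods, so this sketch makes explicit assumptions: a class qualifies only if it was observed in at least `min_periods` periods and exceeded both thresholds in every one of them. The classification rule (stripping the trailing digits of the page label) and the threshold values are likewise hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# (new links, new links with index value) per page, taken from the worked example.
observations: Dict[str, Tuple[int, int]] = {
    "A01": (4, 4), "B01": (4, 4), "C01": (3, 3),                 # day 1
    "A02": (1, 0), "B02": (5, 4), "C02": (1, 1), "D02": (2, 2),  # day 2
}


def url_class(page: str) -> str:
    """Hypothetical classification: drop the trailing period digits (A01 -> A)."""
    return re.sub(r"\d+$", "", page)


def sustained_classes(
    obs: Dict[str, Tuple[int, int]],
    first_threshold: int,
    second_threshold: int,
    min_periods: int,
) -> List[str]:
    """Return URL classes that exceed both thresholds in every observed period."""
    history: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for page, page_counts in obs.items():
        history[url_class(page)].append(page_counts)
    return sorted(
        cls
        for cls, periods in history.items()
        if len(periods) >= min_periods
        and all(n > first_threshold and v > second_threshold for n, v in periods)
    )


# Assumed thresholds: more than 3 new links and more than 2 with index value, over 2 days.
seed_classes = sustained_classes(observations, 3, 2, 2)
```

Under these assumptions only class B qualifies: class A drops to 1 new link on day 2, class C never exceeds the first threshold, and class D has been observed for only one day.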
Accordingly, the present invention also provides a page crawling method, comprising crawling pages using the time-based seed page set generated as described above. The time-based seed page set obtained by screening is only on the order of millions of pages while recall is maintained, which greatly reduces the crawler's workload, improves crawling efficiency, and ensures that hot content can be discovered and indexed promptly.
Based on the same inventive concept, an embodiment of the present invention also provides an apparatus for generating a time-based seed page set, to implement the above method for generating a time-based seed page set.
Fig. 2 shows a schematic structural diagram of an apparatus for generating a time-based seed page set according to an embodiment of the present invention. Referring to Fig. 2, the apparatus comprises at least an analyzer 210, a page filter 220, and a seed page generator 230.
The functions of the components of the apparatus of the embodiment of the present invention and the connections between them are now described:
The analyzer 210 is adapted to acquire and analyze attribute information of a plurality of pages to be judged;
The page filter 220 is coupled to the analyzer 210 and is adapted to screen out, from the plurality of pages to be judged, the pages whose attribute information satisfies the time-based seed page attribute condition;
The seed page generator 230 is coupled to the page filter 220 and is adapted to aggregate the screened pages that satisfy the time-based seed page attribute condition to generate the time-based seed page set.
In one embodiment of the present invention, Fig. 3 shows a schematic structural diagram of an apparatus for generating a time-based seed page set according to another embodiment of the present invention. The apparatus may further comprise a grabber 310, coupled to the analyzer 210 and adapted to crawl the plurality of pages to be judged at a specified time period before the analyzer 210 acquires and analyzes their attribute information.
In one embodiment of the present invention, the page filter 220 is further adapted to: for each page to be judged, compare the links in the page with the links already indexed; count, according to the comparison result, the number of new links in the page; and screen out, from the plurality of pages to be judged, the pages whose number of new links meets the new-link count required of a time-based seed page.
In one embodiment of the present invention, the page filter 220 is further adapted to: for each page to be judged, judge whether the number of new links in the page is greater than a first preset threshold; and if so, determine that the page meets the new-link count required of a time-based seed page.
In one embodiment of the present invention, the page filter 220 is further adapted to: for each page to be judged, count the number of new links in the page that have index value; judge whether that number is greater than a second preset threshold; and if so, determine that the page meets the count of new links with index value required of a time-based seed page.
In one embodiment of the present invention, the page filter 220 is further adapted to: classify the plurality of pages to be judged by URL; for each class of URL, count the number of new links, and the number of new links with index value, in the pages to be judged corresponding to that class; judge whether these two numbers meet the new-link count and the count of new links with index value required of a time-based seed page; and if so, determine that the pages to be judged corresponding to that class of URL are time-based seed pages.
Accordingly, the present invention also provides a search engine, comprising the above apparatus for generating a time-based seed page set.
According to any one of the above preferred embodiments or a combination of several of them, the embodiments of the present invention can achieve the following beneficial effects:
In the technical solution provided by the present invention, attribute information of a plurality of pages to be judged is acquired and analyzed, and the pages that satisfy the time-based seed page attribute condition are then screened out from them. Because the coverage is broad, a relatively comprehensive and complete time-based seed page set can be generated. Moreover, the present invention screens the pages to be judged on the basis of the time-based seed page attribute condition and aggregates the pages that satisfy it into the time-based seed page set, which ensures the recall and accuracy of time-based seed pages. Further, the time-based seed page set obtained by screening is only on the order of millions of pages while recall is maintained, which greatly reduces the crawler's workload. This solves the problem in the related art that the crawler must repeatedly select from and re-crawl the entire set of links (on the order of hundreds of billions or even trillions), which involves a very large volume of data and consumes a great deal of time. Crawling efficiency is thereby improved, and hot content can be discovered and indexed promptly.
The invention also discloses:
A1. A method for generating a time-sensitive seed page set, comprising:
obtaining and analyzing attribute information of multiple pages to be judged;
selecting, from the multiple pages to be judged, the pages whose attribute information meets time-sensitive seed page attribute conditions; and
aggregating the selected pages that meet the time-sensitive seed page attribute conditions to generate the time-sensitive seed page set.
A2. The method according to A1, wherein, before obtaining and analyzing the attribute information of the multiple pages to be judged, the method further comprises:
crawling the multiple pages to be judged according to a specified time period.
A3. The method according to any one of A1-A2, wherein selecting, from the multiple pages to be judged, the pages whose attribute information meets the time-sensitive seed page attribute conditions comprises:
for each page to be judged, comparing the links in that page with the links that have already been indexed;
counting, according to the comparison result, the number of new links in that page; and
selecting, from the multiple pages to be judged, the pages whose number of new links meets the number of new links required of a time-sensitive seed page.
A4. The method according to any one of A1-A3, wherein selecting, from the multiple pages to be judged, the pages whose number of new links meets the number of new links required of a time-sensitive seed page comprises:
for each page to be judged, judging whether the number of new links in that page is greater than a first preset threshold; and
if so, determining that the page meets the number of new links required of a time-sensitive seed page.
A5. The method according to any one of A1-A4, further comprising:
for each page to be judged, counting the number of new links in that page that have index value;
judging whether the number of new links having index value in that page is greater than a second preset threshold; and
if so, determining that the page meets the number of new links having index value required of a time-sensitive seed page.
A6. The method according to any one of A1-A5, further comprising:
classifying the multiple pages to be judged by URL;
for each URL class, counting the number of new links, and the number of new links having index value, in the pages to be judged that correspond to that class;
judging whether these two counts meet the number of new links and the number of new links having index value required of a time-sensitive seed page; and
if so, determining that the pages to be judged corresponding to that URL class are time-sensitive seed pages.
A7. A page crawling method, comprising crawling pages using the time-sensitive seed page set generated according to any one of A1-A6.
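The per-page screening of A3 to A5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the source does not define how index value is decided, so `has_index_value` is a hypothetical caller-supplied predicate; both thresholds are parameters because no values are given; and requiring both conditions at once is one possible reading of A5.

```python
def select_seed_pages(pages, indexed_links, has_index_value,
                      first_threshold, second_threshold):
    """Select time-sensitive seed pages from the pages to be judged.

    pages            maps a page URL to the list of links found in that page
    indexed_links    links the search engine has already indexed
    has_index_value  hypothetical predicate deciding whether a link is worth indexing
    """
    indexed = set(indexed_links)
    seeds = []
    for url, links in pages.items():
        # A3: compare the page's links with the indexed links to find new ones.
        new_links = set(links) - indexed
        # A5: among the new links, keep those having index value.
        valuable = {link for link in new_links if has_index_value(link)}
        # A4 and A5: both counts must exceed their preset thresholds
        # (combining them with "and" is an assumption of this sketch).
        if len(new_links) > first_threshold and len(valuable) > second_threshold:
            seeds.append(url)
    return seeds
```

Using sets for the comparison means that a link repeated several times within one page is counted once as a new link.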
B8. A device for generating a time-sensitive seed page set, comprising:
an analyzer, adapted to obtain and analyze attribute information of multiple pages to be judged;
a page filter, adapted to select, from the multiple pages to be judged, the pages whose attribute information meets time-sensitive seed page attribute conditions; and
a seed page generator, adapted to aggregate the selected pages that meet the time-sensitive seed page attribute conditions to generate the time-sensitive seed page set.
B9. The device according to B8, further comprising, for use before the analyzer obtains and analyzes the attribute information of the multiple pages to be judged:
a fetcher, adapted to crawl the multiple pages to be judged according to a specified time period.
B10. The device according to any one of B8-B9, wherein the page filter is further adapted to:
for each page to be judged, compare the links in that page with the links that have already been indexed;
count, according to the comparison result, the number of new links in that page; and
select, from the multiple pages to be judged, the pages whose number of new links meets the number of new links required of a time-sensitive seed page.
B11. The device according to any one of B8-B10, wherein the page filter is further adapted to:
for each page to be judged, judge whether the number of new links in that page is greater than a first preset threshold; and
if so, determine that the page meets the number of new links required of a time-sensitive seed page.
B12. The device according to any one of B8-B11, wherein the page filter is further adapted to:
for each page to be judged, count the number of new links in that page that have index value;
judge whether the number of new links having index value in that page is greater than a second preset threshold; and
if so, determine that the page meets the number of new links having index value required of a time-sensitive seed page.
B13. The device according to any one of B8-B12, wherein the page filter is further adapted to:
classify the multiple pages to be judged by URL;
for each URL class, count the number of new links, and the number of new links having index value, in the pages to be judged that correspond to that class;
judge whether these two counts meet the number of new links and the number of new links having index value required of a time-sensitive seed page; and
if so, determine that the pages to be judged corresponding to that URL class are time-sensitive seed pages.
B14. A search engine, comprising the device for generating a time-sensitive seed page set according to any one of B8-B13.
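How the components of B8 fit together can be shown with a small Python sketch. The class and function names are illustrative rather than taken from the source, the only attribute analyzed here is the new-link count of B10 and B11, and the `fetched_pages` dictionary stands in for the output of the fetcher of B9, which is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAttributes:
    """Attribute information for one page to be judged (illustrative fields)."""
    url: str
    new_link_count: int


class Analyzer:
    """Obtains and analyzes the attribute information of a page."""

    def __init__(self, indexed_links):
        self.indexed_links = set(indexed_links)

    def analyze(self, url, links):
        return PageAttributes(url, len(set(links) - self.indexed_links))


class PageFilter:
    """Accepts pages whose new-link count exceeds the first preset threshold."""

    def __init__(self, first_threshold):
        self.first_threshold = first_threshold

    def accept(self, attrs):
        return attrs.new_link_count > self.first_threshold


class SeedPageGenerator:
    """Aggregates the accepted pages into the seed page set."""

    def generate(self, accepted):
        return {attrs.url for attrs in accepted}


def build_seed_set(fetched_pages, analyzer, page_filter, generator):
    """Run analyzer, page filter, and generator in sequence over fetched pages."""
    attrs = (analyzer.analyze(url, links) for url, links in fetched_pages.items())
    return generator.generate(a for a in attrs if page_filter.accept(a))
```

Keeping the three components separate mirrors the statement structure: the further conditions of B12 and B13 would extend only the page filter.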
In the description provided here, numerous specific details are set forth. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single previously disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include certain features that are included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for generating a time-sensitive seed page set, and of the search engine that crawls pages using the generated time-sensitive seed page set, according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Thus, those skilled in the art will recognize that although multiple exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can still be directly determined or derived from the disclosure without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and deemed to cover all such other variations or modifications.

Claims (10)

1. A method for generating a time-sensitive seed page set, comprising:
obtaining and analyzing attribute information of multiple pages to be judged;
selecting, from the multiple pages to be judged, the pages whose attribute information meets time-sensitive seed page attribute conditions; and
aggregating the selected pages that meet the time-sensitive seed page attribute conditions to generate the time-sensitive seed page set.
2. The method according to claim 1, wherein, before obtaining and analyzing the attribute information of the multiple pages to be judged, the method further comprises:
crawling the multiple pages to be judged according to a specified time period.
3. The method according to any one of claims 1-2, wherein selecting, from the multiple pages to be judged, the pages whose attribute information meets the time-sensitive seed page attribute conditions comprises:
for each page to be judged, comparing the links in that page with the links that have already been indexed;
counting, according to the comparison result, the number of new links in that page; and
selecting, from the multiple pages to be judged, the pages whose number of new links meets the number of new links required of a time-sensitive seed page.
4. The method according to any one of claims 1-3, wherein selecting, from the multiple pages to be judged, the pages whose number of new links meets the number of new links required of a time-sensitive seed page comprises:
for each page to be judged, judging whether the number of new links in that page is greater than a first preset threshold; and
if so, determining that the page meets the number of new links required of a time-sensitive seed page.
5. The method according to any one of claims 1-4, further comprising:
for each page to be judged, counting the number of new links in that page that have index value;
judging whether the number of new links having index value in that page is greater than a second preset threshold; and
if so, determining that the page meets the number of new links having index value required of a time-sensitive seed page.
6. The method according to any one of claims 1-5, further comprising:
classifying the multiple pages to be judged by URL;
for each URL class, counting the number of new links, and the number of new links having index value, in the pages to be judged that correspond to that class;
judging whether these two counts meet the number of new links and the number of new links having index value required of a time-sensitive seed page; and
if so, determining that the pages to be judged corresponding to that URL class are time-sensitive seed pages.
7. A page crawling method, comprising crawling pages using the time-sensitive seed page set generated according to any one of claims 1-6.
8. A device for generating a time-sensitive seed page set, comprising:
an analyzer, adapted to obtain and analyze attribute information of multiple pages to be judged;
a page filter, adapted to select, from the multiple pages to be judged, the pages whose attribute information meets time-sensitive seed page attribute conditions; and
a seed page generator, adapted to aggregate the selected pages that meet the time-sensitive seed page attribute conditions to generate the time-sensitive seed page set.
9. The device according to claim 8, further comprising, for use before the analyzer obtains and analyzes the attribute information of the multiple pages to be judged:
a fetcher, adapted to crawl the multiple pages to be judged according to a specified time period.
10. A search engine, comprising the device for generating a time-sensitive seed page set according to any one of claims 8-9.
CN201410758178.2A 2014-12-10 2014-12-10 Method and device for generating time-based seed page set Pending CN104484382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410758178.2A CN104484382A (en) 2014-12-10 2014-12-10 Method and device for generating time-based seed page set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410758178.2A CN104484382A (en) 2014-12-10 2014-12-10 Method and device for generating time-based seed page set

Publications (1)

Publication Number Publication Date
CN104484382A true CN104484382A (en) 2015-04-01

Family

ID=52758923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410758178.2A Pending CN104484382A (en) 2014-12-10 2014-12-10 Method and device for generating time-based seed page set

Country Status (1)

Country Link
CN (1) CN104484382A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874282A (en) * 2015-12-11 2017-06-20 北京奇虎科技有限公司 The generation method and device of candidate page set

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6356899B1 (en) * 1998-08-29 2002-03-12 International Business Machines Corporation Method for interactively creating an information database including preferred information elements, such as preferred-authority, world wide web pages
CN103838865A (en) * 2014-03-20 2014-06-04 北京奇虎科技有限公司 Method and device for mining timeliness seed page
CN104182482A (en) * 2014-08-06 2014-12-03 中国科学院计算技术研究所 Method for judging news list page and method for screening news list page

Similar Documents

Publication Publication Date Title
CN104951512A (en) Public sentiment data collection method and system based on Internet
CN104077402B (en) Data processing method and data handling system
CN103383665B (en) Be suitable in url data crawl the method for data buffer storage and device
CN104408169A (en) Multi-dimensional expression language based dimension query method and device
CN103116638B (en) Webpage screening method and device thereof
CN104331419A (en) Method and device for measuring importance of news
CN105279276A (en) Database index optimization system
CN102833233B (en) Method and device for recognizing web pages
CN103077254B (en) Webpage acquisition methods and device
CN105786851A (en) Question and answer knowledge base construction method as well as search provision method and apparatus
CN103530336A (en) Equipment and method for identifying invalid parameters in URLs
CN105528454A (en) Log treatment method and distributed cluster computing device
CN106897280A (en) Data query method and device
CN104866556A (en) Database fault handling method and apparatus, and database system
CN105224661A (en) Conversational information search method and device
CN105302815A (en) Web page uniform resource locator URL filtering method and apparatus
CN105786874A (en) Method and device for constructing question-answer knowledge base data items based on encyclopedic entries
CN102811163A (en) Method and apparatus for streaming netflow data analysis
US20130226840A1 (en) Deriving a Nested Chain of Densest Subgraphs from a Graph
CN104484382A (en) Method and device for generating time-based seed page set
CN109145194A (en) The acquisition method and device of user behavior data
CN105426407A (en) Web data acquisition method based on content analysis
CN106293650A (en) A kind of folder attribute method to set up and device
CN105630983A (en) Resource obtaining and optimizing device and method
CN105608202B (en) Data packet analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150401
