WO2008145031A1 - Method and system for judging of the inportance of article, and sliding window - Google Patents
- Publication number
- WO2008145031A1 (PCT/CN2008/070600)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- article
- sliding
- sliding window
- words
- importance
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
Definitions
- The present invention relates to the field of network retrieval, and in particular to a method and system for determining the importance of an article, and to a sliding window.
Background of the invention
- Retrieving relevant articles by keyword is one of the important ways of sharing network resources. Because network resources are so abundant, a search keyword often matches a large number of articles. The network system therefore has to judge the importance of each article, so that relatively important articles are displayed at the front of the search results and relatively unimportant articles at the back. Users can then read the more important articles first, which saves them time.
- A common current practice is to judge the importance of an article by the richness of its vocabulary. If an article has a rich vocabulary, it has something to say and is an important article. Conversely, if an article merely repeats a few words, throughout or in part, its vocabulary is poor, it has nothing to say, and it is an unimportant article.
- The prior art judges the importance of an article by word-frequency statistics, as follows.
- Step S101: Separate adjacent words in each sentence of the article with spaces.
- Step S102: Count the word frequency of each word, that is, the number of times the word appears in the article.
- Step S103: Determine whether the word frequencies satisfy a preset condition. If not, the article is considered relatively important; if so, the article is considered relatively unimportant.
- The preset conditions can be:
- the frequency of the single most frequent word is greater than 30% of the total word frequency, or the most frequent 5% of words account for more than 50% of the total word frequency, or the most frequent 20% of words account for more than 80% of the total word frequency;
- the average word frequency is greater than 5.
- If a preset condition is satisfied, the article is considered relatively unimportant.
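- The word-frequency check described above can be sketched roughly as follows. This is only an illustrative sketch: the function name `is_relatively_unimportant` is hypothetical, and it assumes that satisfying any one of the listed conditions is enough, which the text does not state explicitly.

```python
from collections import Counter
import math


def is_relatively_unimportant(words):
    """Prior-art style check: True if any preset word-frequency condition holds."""
    counts = Counter(words)
    total = sum(counts.values())          # total word frequency
    if total == 0:
        return False                      # assumption: empty input is not flagged
    freqs = sorted(counts.values(), reverse=True)

    def top_share(fraction):
        # Share of the total taken by the most frequent `fraction` of distinct words.
        k = max(1, math.ceil(len(freqs) * fraction))
        return sum(freqs[:k]) / total

    return (
        freqs[0] / total > 0.30           # single most frequent word > 30%
        or top_share(0.05) > 0.50         # most frequent 5% of words > 50%
        or top_share(0.20) > 0.80         # most frequent 20% of words > 80%
        or total / len(freqs) > 5         # average word frequency > 5
    )
```

Because every quantity here is computed over the whole article, a short repetitive passage inside an otherwise varied article has little effect on the result.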
- The above method judges the importance of an article by counting the word frequency of each word in it. Word frequency, however, reflects the global characteristics of the article and not its local characteristics. For example, if an article's overall vocabulary is rich but its local vocabulary is poor, the existing method easily misjudges it as an important article. The existing method therefore cannot effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which inconveniences users.
Summary of the invention
- An embodiment of the invention provides a method for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
- An embodiment of the invention further provides a system for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
- An embodiment of the invention further provides a sliding window for traversing an article on the network, which can effectively obtain parameters related to the vocabulary richness of the article.
- An embodiment of the present invention relates to a method for determining the importance of an article, including:
- a sliding window starts to slide from the initial sliding starting point of the article, and collects the words it slides over without repetition;
- when the words collected by the sliding window reach a preset number, the total number of words slid over is recorded, the sliding starting point is reset, and the sliding process continues from the reset starting point;
- the maximum value among the numbers recorded by the sliding window is obtained, and the importance of the article is determined according to that maximum value.
- An embodiment of the invention further relates to a system for determining the importance of an article, comprising a sliding window, a maximum value obtaining unit, and an importance judging unit, wherein the sliding window comprises a word collecting unit, a word recording unit, and a starting unit:
- the starting unit is configured to control the sliding window to start sliding from an initial sliding starting point of the article and, after receiving startup information, to reset the sliding starting point and start the sliding window so that it continues the sliding process from the reset starting point;
- the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send startup information to the starting unit and the word recording unit;
- the word recording unit is configured to record, after receiving the startup information, the total number of words that the sliding window has slid over;
- the maximum value obtaining unit is configured to obtain the maximum value among the numbers recorded by the word recording unit;
- the importance judging unit is configured to judge the importance of the article according to the obtained maximum value.
- An embodiment of the invention further relates to a sliding window comprising a word collecting unit, a word recording unit, and a starting unit:
- the starting unit is configured to control the sliding window to start sliding from an initial sliding starting point of the article and, after receiving startup information, to reset the sliding starting point and start the sliding window so that it continues the sliding process from the reset starting point;
- the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send startup information to the starting unit and the word recording unit;
- the word recording unit is configured to record, after receiving the startup information, the number of words that the sliding window has slid over.
- FIG. 1 is a flow chart of a method for judging the importance of an article on a network
- FIG. 2 is a flowchart of a method for determining the importance of an article according to a first embodiment of the present invention
- FIG. 3 is a flowchart of a method for determining the importance of an article according to a second embodiment of the present invention
- FIG. 5 is a schematic diagram of a system for determining the importance of an article according to a fourth embodiment of the present invention
- FIG. 6 is a schematic diagram of a system for determining the importance of an article according to a fifth embodiment of the present invention
- a schematic structural diagram of a sliding window provided by a sixth embodiment of the present invention.
Mode for carrying out the invention
- In an embodiment of the present invention, a preset sliding window starts to slide from the initial sliding starting point of the article and collects the words it slides over without repetition. When the words collected by the sliding window reach a preset number, the number of words slid over is recorded, the starting point is reset, and sliding continues. The sliding window repeats this process until it has slid over the entire article. The largest of the numbers recorded by the sliding window is then obtained, and the importance of the article is determined according to its size.
- The initial sliding starting point can be the beginning of the article, the end of the article, or a sliding starting point preset within the article.
- The article can be either a web article stored on the network or a local article stored offline.
- The importance of the article can further be output to the network or locally.
- Referring to FIG. 2, a flow chart of a method for determining the importance of an article according to a first embodiment of the present invention is provided. The specific steps are as follows.
- Step S201: A preset sliding window starts to slide from the beginning of the article.
- The sliding window includes a left border, a right border, and a database; the database stores the words between the left and right borders without repetition.
- The database can hold at most a preset number of words. The preset number is preferably 6.
- Step S202: The sliding window collects the words it slides over without repetition.
- During sliding, the sliding window does not collect a word it has already collected, so the collected words are all distinct.
- The sliding window stores the collected words in the database.
- Step S203: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
- As the sliding window slides, the number of words it has collected keeps increasing. When that number reaches the preset number, the total number of words the sliding window has slid over since the starting point is recorded.
- Step S204: Reset the starting point and continue to slide until the entire article has been slid over.
- The reset starting point may be the last word collected by the sliding window, the word after or before it, or a word several positions before it.
- Each time the collected words again reach the preset number, the sliding window records the total number of words slid over, clears the collected words, resets the starting point, and continues to slide, until the entire article has been slid over.
- Step S205: Obtain the largest value among the numbers recorded by the sliding window.
- The number of words slid over in each record is obtained from the sliding window, and the largest of these numbers is extracted.
- Step S206: Determine the importance of the article according to the size of that value.
- The value obtained serves as the basis for judging the importance of the article. If the value is large, the importance of the article is relatively low; if the value is small, the article is relatively important.
- This embodiment uses a value that reflects the vocabulary-poorest part of the article as the basis for judging its importance. It can therefore effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which is convenient for the user.
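- Steps S201 to S205 can be sketched as below. This is a minimal illustration, not the patented implementation: the names `window_lengths` and `max_window_length` are hypothetical, the input is assumed to be a list of already separated words, the starting point is reset to the word after the last collected word (one of the options the text allows), and a trailing stretch that never reaches the preset number is assumed not to be recorded.

```python
def window_lengths(words, preset=6):
    """Return the number of words slid over each time `preset` distinct words are collected."""
    lengths = []
    collected = set()   # plays the role of the window's "database" of distinct words
    slid = 0            # words slid over since the current starting point
    for word in words:
        slid += 1
        collected.add(word)            # a repeated word is not collected again
        if len(collected) == preset:
            lengths.append(slid)       # record the total slid over (step S203)
            collected.clear()          # clear the collected words
            slid = 0                   # reset the starting point (step S204)
    return lengths


def max_window_length(words, preset=6):
    """Largest recorded value (step S205), or None if no window was completed."""
    lengths = window_lengths(words, preset)
    return max(lengths) if lengths else None


# Six distinct words followed by a repetitive stretch.
words = "a b c d e f g g g h h i i j j k l".split()
print(window_lengths(words))      # [6, 11]
print(max_window_length(words))   # 11
```

In this toy input the first window needs only 6 words to collect 6 distinct ones, while the repetitive stretch needs 11, so the maximum is driven by the locally poorest passage.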
- The sliding window can also start from the end of the article and slide toward the beginning.
- The embodiment of the present invention does not limit how the sliding starting point is set.
- Before the sliding window starts to slide, adjacent words in each sentence of the article are separated by spaces, so that the sliding window can recognize the words it slides over.
- The embodiment of the present invention can also judge the importance of the article against a predetermined value.
- Referring to FIG. 3, a flow chart of a method for determining the importance of an article according to a second embodiment of the present invention is provided. The specific steps are as follows.
- Step S301: Separate adjacent words in the article with spaces.
- For example, the article begins: "The flowers are also beautiful, the grass is flourishing, the scenery here is good". Separating adjacent words with spaces gives: "Flowers are also beautiful grass is also flourishing here is good scenery".
- Step S302: A preset sliding window starts to slide from the beginning of the article.
- The sliding window includes a left border "[", a right border "]", and a database; the database stores the words between the left and right borders without repetition.
- The database can hold at most a preset number of words. For example, the preset number can be 6.
- The position of the sliding window at this time is: "[] Flowers are also beautiful grass is also flourishing here is good scenery".
- The right border "]" of the sliding window starts to move right.
- Step S303: The sliding window collects the words it slides over without repetition.
- For example, when the sliding window is at "[Flowers] are also beautiful grass is also flourishing here is good scenery", the word "flowers" is judged not to duplicate any word collected in the database, so it is collected. When the sliding window later slides over the second "also", that word duplicates the "also" already in the database, so it is not collected again.
- Step S304: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
- Step S305: Set the last word collected by the sliding window as the starting point, clear the words collected in the sliding window, and continue to slide until the entire article has been slid over.
- The left border of the sliding window moves to the right border, and the right border continues to slide to the right.
- The position of the sliding window is then: "Flowers are also beautiful grass is also flourishing here [] is good scenery".
- When the words collected by the sliding window reach the preset number again, the sliding window again records the total number of words slid over, clears the collected words, and slides on, until it has slid over the entire article.
- Step S306: Obtain the largest value among the numbers recorded by the sliding window.
- The number of words slid over in each record is obtained from the sliding window, and the largest of these numbers is extracted.
- For example, the number recorded the first time is 7, the second time 8, the third time 12, and so on. If comparison shows that 12 is the largest, then 12 is the maximum number of words.
- Step S307: Compare the value with a predetermined value; if it is less than the predetermined value, determine that the article is an important article.
- In this example, the article is judged to be a relatively important article.
- Judging an article as important or unimportant against a predetermined value gives a direct and accurate judgment of its importance, which is convenient and practical.
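- The comparison in step S307 can be illustrated with the recorded values 7, 8, and 12 mentioned above. The function name `is_important` and the predetermined value of 20 are hypothetical, chosen only for illustration; the text does not give a concrete predetermined value.

```python
def is_important(recorded_lengths, predetermined_value):
    """Step S307 sketch: important if the largest recorded value is below the threshold."""
    largest = max(recorded_lengths)          # step S306: largest recorded value
    return largest < predetermined_value     # "if less than" means important


recorded = [7, 8, 12]                # values recorded by the sliding window in the example
print(is_important(recorded, 20))    # True: 12 is less than the assumed threshold of 20
```

With a stricter assumed threshold such as 10, the same article would not be judged important, since 12 is not less than 10.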
- In addition, the number of words slid over can be recorded by calculating the length of the sliding window, and the retrieved articles can be sorted by the maximum number obtained for each, so that the articles are arranged in order of importance.
- The length of the sliding window is the number of words contained between its left and right borders.
- Referring to FIG. 4, a flow chart of a method for determining the importance of an article according to a third embodiment of the present invention is provided. The specific steps are as follows.
- Step S401: Adjacent words in the article are separated by spaces. For example, there is a paragraph in the middle of the article: "I am so happy today, I am so happy, I am so happy, I am very happy." Separating adjacent words with spaces gives: "I am so happy today, I am so happy, I am very happy."
- Step S402: A preset sliding window starts to slide from the beginning of the article.
- The sliding window includes a left border "[", a right border "]", and a database; the database stores the words between the left and right borders without repetition.
- The database can hold at most a preset number of words. For example, the preset number can be 6.
- The position of the sliding window at this time is: "[] Good day, happy, ah, happy, really happy, very happy."
- The right border "]" of the sliding window starts to move to the right.
- Step S403: The sliding window collects the words it slides over without repetition, and records the number of words it has slid over.
- For example, when the sliding window is at "[Today] I am so happy, I am so happy, I am very happy", the word "Today" is judged not to duplicate any word collected in the database, so the word "Today" is collected. When the sliding window is at "[I am so happy today] I am happy, I am very happy", the word "good" is judged to duplicate the word "good" already in the database, so it is not collected again.
- At this point the length of the sliding window is 5 and the sliding window has collected 4 words.
- Step S404: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
- For example, when the sliding window is at "[Good day, happy, happy, really happy]", the sliding window has a length of 10 and has collected 6 words.
- Step S405: Set the last word collected by the sliding window as the starting point, clear the words collected in the sliding window, and continue to slide until the entire article has been slid over.
- The left border of the sliding window moves to the right border, and the right border continues to slide to the right.
- When the collected words reach the preset number again, the sliding window again records the total number of words slid over, clears the collected words, and slides on, until it has slid over the entire article.
- Step S406: Obtain the largest value among the numbers recorded by the sliding window.
- The number of words slid over in each record is obtained from the sliding window, and the largest of these numbers is extracted.
- For example, the length of the sliding window recorded the first time is 10, the second time 11, the third time 18, and so on.
- Step S407: The value is used as the weight of the article's importance, and the importance of the article is determined according to the weight.
- For example, 18 is used as the weight of the article's importance, compared with the weights of other articles, and the articles are sorted by weight.
- By sorting by weight, this embodiment places the most important articles at the front and arranges the rest in order of importance, which is very convenient for the user. Moreover, this embodiment does not need a preset value for judging whether an article is important, and so reflects the importance of the article more objectively.
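- Sorting by weight, as in step S407, might look like the sketch below. The names `article_weight` and `rank_by_importance` and the two sample articles are hypothetical. As an assumption not stated in the text, an article in which no window ever reaches the preset number falls back to its total word count as the weight.

```python
def article_weight(words, preset=6):
    """Largest number of words slid over before `preset` distinct words were collected."""
    collected, slid, lengths = set(), 0, []
    for word in words:
        slid += 1
        collected.add(word)
        if len(collected) == preset:
            lengths.append(slid)
            collected.clear()
            slid = 0
    # Assumption: if no window completes, use the article length as the weight.
    return max(lengths, default=len(words))


def rank_by_importance(articles, preset=6):
    """Sort (name, weight) pairs so that smaller weights (more important) come first."""
    weights = {name: article_weight(words, preset) for name, words in articles.items()}
    return sorted(weights.items(), key=lambda item: item[1])


articles = {
    "repetitive": "buy now buy now buy now cheap cheap cheap deal deal today only here".split(),
    "varied": "the quick brown fox jumps over a lazy dog near the quiet river bank".split(),
}
print(rank_by_importance(articles))   # [('varied', 6), ('repetitive', 13)]
```

Because a smaller weight means a richer local vocabulary, the ascending sort puts the more important article first without needing any predetermined value.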
- An embodiment of the present invention provides a system for judging the importance of an article, which can effectively identify an article whose overall vocabulary is rich but whose local vocabulary is poor, and is convenient for the user.
- The sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
- The starting unit 513 controls the sliding window 51 to start sliding from the initial sliding starting point of the article.
- The initial sliding starting point can be the beginning of the article, the end of the article, or a sliding starting point preset within the article.
- The word collecting unit 511 collects the words slid over without repetition, and sends startup information to the starting unit 513 and the word recording unit 512 when the collected words reach the preset number.
- After receiving the startup information, the starting unit 513 resets the starting point and starts the sliding window 51 again, until the entire article has been slid over.
- The preset number is preferably 6.
- The word recording unit 512 records the number of words the sliding window 51 has slid over.
- The word recording unit 512 can record the total number of words the sliding window has slid over after receiving the startup information.
- Alternatively, the word recording unit 512 can count the words slid over continuously during sliding, stop counting on receiving the startup information, and record the final total.
- The maximum value obtaining unit 52 obtains the largest value among the numbers recorded by the word recording unit 512 and sends it to the importance judging unit 53.
- The importance judging unit 53 judges the importance of the article based on that maximum value.
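- One possible way to mirror these units in code is sketched below. The class and function names are hypothetical, the "startup information" is modeled as a simple boolean return value, and judging against a predetermined value is only one of the judging options described in this document. Returning None when no window completes is an assumption.

```python
class SlidingWindow:
    """Sketch of sliding window 51 with its collecting, recording, and starting roles."""

    def __init__(self, preset=6):
        self.preset = preset
        self.collected = set()   # store used by the word collecting role (511)
        self.slid = 0            # words slid over since the current starting point
        self.records = []        # values kept by the word recording role (512)

    def _collect(self, word):
        # Word collecting: add without repetition; True stands for "startup information".
        self.slid += 1
        self.collected.add(word)
        return len(self.collected) == self.preset

    def _record(self):
        # Word recording: store the total number of words slid over.
        self.records.append(self.slid)

    def _restart(self):
        # Starting: reset the starting point and clear the collected words.
        self.collected.clear()
        self.slid = 0

    def slide(self, words):
        self.records = []
        self._restart()
        for word in words:
            if self._collect(word):
                self._record()
                self._restart()
        return self.records


def max_value_unit(records):
    """Maximum value obtaining role (52); None if nothing was recorded (assumption)."""
    return max(records) if records else None


def importance_judging_unit(max_value, predetermined_value):
    """Importance judging role (53); None means undetermined (assumption)."""
    if max_value is None:
        return None
    return max_value < predetermined_value
```

A caller would run `SlidingWindow(preset).slide(words)`, pass the result to `max_value_unit`, and hand that value to `importance_judging_unit`.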
- The embodiment of the present invention can separate the words in the article with a word breaking unit.
- The above describes in detail the case where the sliding window 51 starts to slide from the beginning of the article.
- The sliding window 51 can also start from the end of the article and slide toward the beginning.
- The embodiment of the present invention does not limit how the sliding starting point is set.
- The system for determining the importance of an article according to a fifth embodiment of the present invention includes a sliding window 51, a maximum value obtaining unit 52, an importance judging unit 53, and a word breaking unit 54; the sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
- The word breaking unit 54 separates adjacent words in the article with spaces.
- The functions of the sliding window 51, the maximum value obtaining unit 52, and the importance judging unit 53 in this embodiment are the same as in the embodiment shown in FIG. 5 and will not be described again.
- The system may further comprise an importance level output unit.
- The importance level output unit is connected to the importance judging unit 53.
- The importance level output unit is configured to output the importance of the article to the network or locally after the importance judging unit 53 has judged it.
- By traversing the entire article with a sliding window, parameters related to the vocabulary richness of the article can be effectively obtained.
- The sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
- The starting unit 513 controls the sliding window 51 to start sliding at the beginning of the article or at the end of the article.
- The word collecting unit 511 collects the words slid over without repetition and, when the collected words reach the preset number, sends startup information to the starting unit 513 and the word recording unit 512. The starting unit 513 takes the last word collected by the word collecting unit 511 as the starting point and restarts the sliding window 51, until the entire article has been slid over.
- The preset number is preferably 6.
- The word recording unit 512 records the number of words the sliding window 51 has slid over.
- The sliding window 51 further includes a left boundary and a right boundary.
- The right boundary moves rightward from the starting point.
- When the words collected by the word collecting unit 511 reach the preset number, the right boundary stops moving and the left boundary moves to the right, until only one word lies between the left and right boundaries.
- In the embodiments of the present invention, a preset sliding window starts to slide at the beginning or the end of the article and collects the words it slides over without repetition. When the words collected reach a preset number, the number of words slid over is recorded. The sliding window repeats this process until the entire article has been slid over; the largest of the recorded numbers is obtained, and the importance of the article is judged according to its size. If an article's overall vocabulary is rich but its local vocabulary is poor, the poor portion contains a great deal of word duplication. When the sliding window slides over that portion, it does not collect the repeated words again, so it has to slide over more words before the collected words reach the preset number, and the number recorded there is large.
- The embodiments thus use a value that reflects the vocabulary-poorest part of the article as the basis for judging its importance.
- The embodiments can therefore effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which is convenient for the user.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200710105297.8 | 2007-05-31 | ||
CNB2007101052978A CN100520767C (en) | 2007-05-31 | 2007-05-31 | Method and system for judging article importance in network, and sliding window |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008145031A1 (en) | 2008-12-04 |
Family
ID=38898646
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2008/070600 WO2008145031A1 (en) | 2007-05-31 | 2008-03-27 | Method and system for judging of the inportance of article, and sliding window |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN100520767C (en) |
WO (1) | WO2008145031A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN100520767C (en) * | 2007-05-31 | 2009-07-29 | 腾讯科技(深圳)有限公司 | Method and system for judging article importance in network, and sliding window |
CN100545847C (en) * | 2007-09-25 | 2009-09-30 | 腾讯科技(深圳)有限公司 | A kind of method and system that blog articles is sorted |
CN103336771B (en) * | 2013-04-02 | 2016-12-28 | 江苏大学 | Data similarity detection method based on sliding window |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5576954A (en) * | 1993-11-05 | 1996-11-19 | University Of Central Florida | Process for determination of text relevancy |
CN1133127C (en) * | 1996-05-29 | 2003-12-31 | 松下电器产业株式会社 | Document retrieval system |
CN1818908A (en) * | 2006-03-16 | 2006-08-16 | 董崇军 | Feedbakc information use of searcher in search engine |
CN101071419A (en) * | 2007-05-31 | 2007-11-14 | 腾讯科技(深圳)有限公司 | Method and system for judging article importance in network, and sliding window |
-
2007
- 2007-05-31 CN CNB2007101052978A patent/CN100520767C/en active Active
-
2008
- 2008-03-27 WO PCT/CN2008/070600 patent/WO2008145031A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN101071419A (en) | 2007-11-14 |
CN100520767C (en) | 2009-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105183897B (en) | A kind of method and system of video search sequence | |
CN103414943B (en) | television program comment information processing method and system | |
CN104636336B (en) | A kind of method and apparatus of video search | |
WO2014056369A1 (en) | Method and system for sorting online videos of search | |
EP2354981A2 (en) | Image management apparatus, method of controlling the same, and storage medium storing program therefor | |
WO2014146550A1 (en) | Search suggestion method and apparatus for map search, and computer storage medium and device | |
US20070265720A1 (en) | Content marking method, content playback apparatus, content playback method, and storage medium | |
CN102566928A (en) | System and method for automatically managing desktop application icons of mobile terminal | |
US8397263B2 (en) | Information processing apparatus, information processing method and information processing program | |
CN103053156B (en) | Present invention, interval manufacture method and interval production process | |
WO2008145031A1 (en) | Method and system for judging of the inportance of article, and sliding window | |
CN103955533B (en) | A kind of page tree data acquisition device based on buffer queue and method | |
JP2014506355A5 (en) | ||
CN106682012A (en) | Commodity object information searching method and device | |
CN105095251A (en) | Terminal automatic display method and device based on user habit | |
JP2006319442A5 (en) | ||
WO2014056370A1 (en) | Method and system for use in providing personalized search list | |
KR20190022761A (en) | Method and apparatus for updating search cache | |
CN102929954A (en) | Method and device for controlling content displaying of search frame | |
CN106815284A (en) | The recommendation method and recommendation apparatus of news video | |
KR20150004681A (en) | Server for providing media information, apparatus, method and computer readable recording medium for searching media information related to media contents | |
CN105812917B (en) | channel searching method and device | |
CN103294670A (en) | Searching method and system based on word list | |
CN103365986A (en) | Method for collecting short message in mobile terminal and mobile terminal | |
CN107239451A (en) | Database index creation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08715336; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 5543/CHENP/2009; Country of ref document: IN |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/02/2010) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 08715336; Country of ref document: EP; Kind code of ref document: A1 |