WO2008145031A1 - Method and system for judging of the importance of article, and sliding window - Google Patents

Info

Publication number
WO2008145031A1
Authority
WO
WIPO (PCT)
Prior art keywords
article
sliding
sliding window
words
importance
Application number
PCT/CN2008/070600
Other languages
French (fr)
Chinese (zh)
Inventor
Liang Dong
Rongfang Shao
Original Assignee
Tencent Technology (Shenzhen) Company Limited

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis

Definitions

  • The present invention relates to the field of network retrieval, and in particular to a method and system for judging the importance of an article, and to a sliding window.
  • Background of the invention
  • Retrieving relevant articles by keyword is one of the important ways of sharing network resources. Because network resources are so abundant, a search keyword often matches a large number of articles. The network system therefore needs to judge the importance of each article, so that relatively important articles are shown at the front of the search results and relatively unimportant ones at the back, letting users read the more important articles first and saving their time.
  • The current common practice is to judge an article's importance by the richness of its vocabulary. If an article has a rich vocabulary, it has something to say and is an important article; conversely, if only a few words are repeated throughout the article or in part of it, the vocabulary is poor, the article has nothing to say, and it is an unimportant article.
  • The prior art judges the importance of an article by word frequency statistics.
  • Step S101: Separate adjacent words in the article's sentences with spaces.
  • Step S102: Count the word frequency of each word, that is, the number of times the word appears in the article.
  • Step S103: Calculate and determine whether the word frequencies satisfy a preset condition. If not, the article is considered relatively important; if so, the article is considered relatively unimportant.
  • The preset conditions can be:
  • 1) the total number of words is less than 5;
  • 2) the frequency of the single most frequent word is greater than 30% of the total word frequency, or the combined frequency of the most frequent 5% of words is greater than 50% of the total word frequency, or the combined frequency of the most frequent 20% of words is greater than 80% of the total word frequency;
  • 3) the average word frequency is more than 5.
  • For example, if the average word frequency exceeds 5, the article is considered relatively unimportant.
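The prior-art check in steps S101 to S103 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the article is assumed to be already split into words, "total number of words" is read as the total token count, "average word frequency" is read as occurrences per distinct word, and the three conditions are assumed to be alternatives (any one of them marks the article as unimportant). The source does not spell out these details.

```python
from collections import Counter
import math


def is_unimportant_by_word_frequency(words):
    """Return True if any of the preset conditions holds (hypothetical helper)."""
    total = len(words)
    if total < 5:  # condition 1: too few words (assumed to mean total tokens)
        return True

    # Frequencies of the distinct words, most frequent first.
    freqs = sorted(Counter(words).values(), reverse=True)

    def top_share(fraction):
        # Share of all occurrences taken by the top `fraction` of distinct words.
        k = max(1, math.ceil(len(freqs) * fraction))
        return sum(freqs[:k]) / total

    # condition 2: a small set of words dominates the article
    if freqs[0] / total > 0.30 or top_share(0.05) > 0.50 or top_share(0.20) > 0.80:
        return True

    # condition 3: average word frequency (occurrences per distinct word) exceeds 5
    return total / len(freqs) > 5
```

Because every quantity here is computed over the whole article, a passage of heavy repetition inside an otherwise varied article barely moves these ratios, which is the weakness the text goes on to describe.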
  • The above method judges the importance of an article by counting the frequency of each word in it. However, word frequency reflects the global characteristics of the article, not its local characteristics. For example, if an article's overall vocabulary is rich but its local vocabulary is poor, the existing method easily misjudges it as an important article. The existing method therefore cannot effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which inconveniences users.
  • Summary of the invention
  • An embodiment of the invention provides a method for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
  • An embodiment of the invention further provides a system for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
  • An embodiment of the invention further provides a sliding window for traversing an article on the network, which can effectively obtain parameters related to the richness of the article's vocabulary.
  • Embodiments of the present invention relate to a method for judging the importance of an article, including: a sliding window starts sliding from the initial sliding start point of the article and collects the words it slides over without repetition; when the words collected by the sliding window reach a preset number, the total number of words slid over is recorded, the sliding start point is reset, and the sliding process continues from the reset sliding start point;
  • the maximum value among the numbers recorded by the sliding window is obtained, and the importance of the article is judged according to that maximum value.
  • An embodiment of the invention further relates to a system for judging the importance of an article, comprising a sliding window, a maximum value obtaining unit, and an importance judging unit, wherein the sliding window comprises a word collecting unit, a word recording unit, and a starting unit:
  • the starting unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
  • the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send start information to the starting unit and the word recording unit; after receiving the start information, the starting unit resets the sliding start point and starts the sliding window to continue the sliding process from the reset sliding start point;
  • the word recording unit is configured to record, after receiving the start information, the total number of words that the sliding window has slid over;
  • the maximum value obtaining unit is configured to obtain the maximum value among the numbers recorded by the word recording unit;
  • the importance judging unit is configured to judge the importance of the article according to the obtained maximum value.
  • An embodiment of the invention further relates to a sliding window, the sliding window comprising a word collecting unit, a word recording unit, and a starting unit:
  • the starting unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
  • the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send start information to the starting unit and the word recording unit; after receiving the start information, the starting unit resets the sliding start point and starts the sliding window to continue the sliding process from the reset sliding start point;
  • the word recording unit is configured to record, after receiving the start information, the number of words that the sliding window has slid over.
  • FIG. 1 is a flowchart of an existing method for judging the importance of an article on a network;
  • FIG. 2 is a flowchart of a method for judging the importance of an article according to a first embodiment of the present invention;
  • FIG. 3 is a flowchart of a method for judging the importance of an article according to a second embodiment of the present invention;
  • FIG. 4 is a flowchart of a method for judging the importance of an article according to a third embodiment of the present invention;
  • FIG. 5 is a schematic diagram of a system for judging the importance of an article according to a fourth embodiment of the present invention;
  • FIG. 6 is a schematic diagram of a system for judging the importance of an article according to a fifth embodiment of the present invention;
  • FIG. 7 is a schematic structural diagram of a sliding window according to a sixth embodiment of the present invention.
  • Mode for carrying out the invention
  • In the embodiments of the present invention, a preset sliding window starts sliding from the initial sliding start point of the article and collects the words it slides over without repetition. When the words collected by the sliding window reach a preset number, the number of words slid over is recorded, the start point is reset, and sliding continues. The sliding window repeats this process until it has slid over the entire article. The maximum value among the numbers recorded by the sliding window is then obtained, and the importance of the article is judged according to the size of that value.
  • The initial sliding start point can be the beginning of the article, the end of the article, or a preset sliding start point within the article.
  • The article can be either a web article stored on the network or a local article stored offline.
  • The importance of the article can further be output to the network or locally.
  • Referring to FIG. 2, which is a flowchart of a method for judging the importance of an article according to a first embodiment of the present invention, the specific steps are as follows.
  • Step S201: A preset sliding window starts sliding from the beginning of the article.
  • The preset sliding window includes a left border, a right border, and a database; the database stores the words between the left and right borders without repetition.
  • The database can hold at most a preset number of words. For example, the preset number is preferably 6.
  • Step S202: The sliding window collects the words it slides over without repetition.
  • While sliding, the sliding window collects each distinct word only once, so the collected words do not repeat one another.
  • The sliding window stores the collected words in the database.
  • Step S203: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
  • As the window slides, the number of collected words keeps increasing; when it reaches the preset number, the total number of words the sliding window has slid over since the start point is recorded.
  • Step S204: Reset the start point and continue sliding until the entire article has been slid over.
  • The reset start point may be the last word collected by the sliding window, the word after or before it, or a word several positions before it.
  • When the words collected by the sliding window again reach the preset number, the sliding window again records the total number of words slid over, clears the collected words, resets the start point, and continues sliding until the entire article has been slid over.
  • Step S205: Obtain the maximum value among the numbers recorded by the sliding window.
  • The number of words slid over at each recording is obtained from the sliding window, and the largest of these numbers is extracted.
  • Step S206: Judge the importance of the article according to the size of that value.
  • The obtained value serves as the basis for judging the importance of the article. If the value is large, the article is relatively unimportant; if the value is small, the article is relatively important.
  • This embodiment uses a value that reflects the part of the article with the poorest vocabulary as the basis for judging the article's importance. It can therefore effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which is convenient for the user.
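Steps S201 to S205 can be sketched as below. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, the article is assumed to be already split into words, the window slides forward from the beginning, and the reset start point is assumed to be the word after the last collected word (one of the options the text allows). A trailing stretch that never reaches the preset number is not recorded.

```python
def sliding_window_counts(words, preset=6):
    """Return the number of words slid over each time `preset` distinct words were collected."""
    records = []
    collected = set()   # the window's "database" of non-repeated words
    slid = 0            # words slid over since the current start point
    for word in words:
        slid += 1
        collected.add(word)          # a repeated word is not collected again
        if len(collected) == preset:
            records.append(slid)     # record the total number of words slid over
            collected.clear()        # clear the database
            slid = 0                 # assumed reset: restart at the next word
    return records


def poorest_local_vocabulary(words, preset=6):
    """Largest recorded count, or None if the preset number was never reached."""
    records = sliding_window_counts(words, preset)
    return max(records) if records else None
```

A stretch with heavy repetition forces the window to slide over many words before it gathers the preset number of distinct ones, so the largest recorded count points at the part of the article with the poorest vocabulary.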
  • In this embodiment, the sliding window can also start from the end of the article and slide toward the beginning.
  • The embodiment of the present invention does not limit how the sliding start point is chosen.
  • Before the sliding window starts to slide, adjacent words in the article's sentences are separated by spaces so that the sliding window can recognize the words it slides over.
  • The embodiment of the present invention can also judge the importance of the article by means of a predetermined value.
  • Referring to FIG. 3, which is a flowchart of a method for judging the importance of an article according to a second embodiment of the present invention, the specific steps are as follows.
  • Step S301: Separate adjacent words in the article with spaces.
  • For example, the beginning of the article is: “The flowers are also beautiful, the grass is flourishing, the scenery here is good”. Separating adjacent words with spaces gives: “Flowers are also beautiful grass is also flourishing here is good scenery.”
  • Step S302: A preset sliding window starts sliding from the beginning of the article.
  • The sliding window includes a left border "[", a right border "]", and a database; the database stores the words between the left and right borders without repetition.
  • The database can hold at most a preset number of words. For example, the preset number can be 6.
  • The position of the sliding window at this time is: "[] Flowers are also beautiful grass is also flourishing here is good scenery."
  • The right border "]" of the sliding window starts to move right.
  • Step S303: The sliding window collects the words it slides over without repetition.
  • When the position of the sliding window is "[Flowers] are also beautiful grass is also flourishing here is good scenery", the word “flowers” is judged not to repeat any word already collected in the database, so it is collected. When the sliding window later reaches a second occurrence of the word “also”, that word is judged to repeat the “also” already in the database, so it is not collected again.
  • Step S304: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
  • Step S305: Set the last word collected by the sliding window as the start point, clear the words collected in the sliding window, and continue sliding until the entire article has been slid over.
  • The left border of the sliding window moves up to the right border, and the right border continues to slide to the right.
  • The position of the sliding window is then: "Flowers are also beautiful grass is also flourishing here [] is good scenery."
  • When the words collected by the sliding window reach the preset number again, the sliding window again records the total number of words slid over, clears the collected words, and slides on, until it has slid over the entire article.
  • Step S306: Obtain the maximum value among the numbers recorded by the sliding window.
  • The number of words slid over at each recording is obtained from the sliding window, and the largest of these numbers is extracted.
  • For example, the first recorded number is 7, the second is 8, the third is 12, and so on; by comparison 12 is the largest, so 12 is the maximum number of words.
  • Step S307: Compare this value with a predetermined value; if it is smaller, the article is judged to be an important article.
  • In this example, the article is judged to be a relatively important article.
  • Judging whether an article is important or unimportant against a predetermined value gives a direct and accurate judgment of the article's importance, which is convenient and practical.
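The comparison in steps S306 and S307 amounts to a single threshold test. The sketch below uses the recorded counts 7, 8, and 12 from the example; the function name and the threshold of 20 are hypothetical, since the predetermined value used in the example is not given in the text.

```python
def is_important(recorded_counts, predetermined_value):
    """Judge an article important if its largest recorded count is below the predetermined value."""
    return max(recorded_counts) < predetermined_value


counts = [7, 8, 12]   # recorded numbers of words slid over, from the example
THRESHOLD = 20        # hypothetical predetermined value, not taken from the source
print(is_important(counts, THRESHOLD))
```

With a fixed threshold the result is a yes/no judgment per article, which is simple to apply but depends on choosing the predetermined value well.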
  • The number of words slid over can also be recorded by calculating the length of the sliding window, and the retrieved articles can be sorted by the maximum number obtained, so that the articles are arranged in order of importance.
  • The length of the sliding window is the number of words contained between its left and right borders.
  • Referring to FIG. 4, which is a flowchart of a method for judging the importance of an article according to a third embodiment of the present invention, the specific steps are as follows.
  • Step S401: Separate adjacent words in the article with spaces. For example, there is a passage in the middle of the article: “I am so happy today, I am so happy, I am so happy, I am very happy.” Separating adjacent words with spaces gives: “I am so happy today, I am so happy, I am very happy.”
  • Step S402: A preset sliding window starts sliding from the beginning of the article.
  • The sliding window includes a left border "[", a right border "]", and a database; the database stores the words between the left and right borders without repetition.
  • The database can hold at most a preset number of words. For example, the preset number can be 6.
  • The position of the sliding window at this time is: "[] Good day, happy, ah, happy, really happy, very happy."
  • The right border "]" of the sliding window starts to move to the right.
  • Step S403: The sliding window collects the words it slides over without repetition and records the number of words it has slid over.
  • When the position of the sliding window is "[Today] I am so happy, I am so happy, I am very happy.", the word “Today” is judged not to repeat any word already collected in the database, so it is collected.
  • When the position of the sliding window is “[I am so happy today] I am happy, I am very happy.”, the word “good” is judged to repeat the word “good” already in the database, so it is not collected again.
  • At this point the length of the sliding window is 5 and the sliding window has collected 4 words.
  • Step S404: When the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
  • When the position of the sliding window is "[Good day, happy, happy, really happy]", the sliding window has a length of 10 and has collected 6 words.
  • Step S405: Set the last word collected by the sliding window as the start point, clear the words collected in the sliding window, and continue sliding until the entire article has been slid over.
  • The left border of the sliding window moves up to the right border, and the right border continues to slide to the right.
  • When the collected words reach the preset number again, the sliding window again records the total number of words slid over, clears the collected words, and slides on, until it has slid over the entire article.
  • Step S406: Obtain the maximum value among the numbers recorded by the sliding window.
  • The number of words slid over at each recording is obtained from the sliding window, and the largest of these numbers is extracted.
  • For example, the length of the sliding window recorded the first time is 10, the second time 11, the third time 18, and so on.
  • Step S407: Use this value as the weight of the article's importance, and determine the degree of importance of the article according to the weight.
  • For example, 18 is used as the weight of the article's importance; it is compared with the weights of other articles, and the articles are sorted by weight.
  • By sorting by weight, this embodiment places the most important articles at the front and arranges the rest in order of importance, which is very convenient for the user. Moreover, this embodiment does not need a preset value for deciding whether an article is important, so it reflects the importance of articles more objectively.
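The weight-based ordering of step S407 can be sketched as a sort on each article's maximum recorded window length. The function name, the article identifiers, and the second and third sets of counts are made up for illustration; only the counts 10, 11, 18 come from the example. Following the first embodiment, a smaller maximum is taken to mean a more important article.

```python
def rank_by_weight(article_counts):
    """Return article ids ordered from most important (smallest weight) to least."""
    # The weight of an article is the largest count its sliding window recorded.
    weights = {name: max(counts) for name, counts in article_counts.items()}
    return sorted(weights, key=weights.get)


ranking = rank_by_weight({
    "article_a": [10, 11, 18],   # counts from the example; weight 18
    "article_b": [7, 8, 12],     # hypothetical article; weight 12
    "article_c": [25, 9],        # hypothetical article; weight 25
})
print(ranking)
```

Sorting avoids committing to a single cutoff: articles are only compared with one another, which matches the remark that no preset value is needed in this embodiment.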
  • The embodiment of the present invention provides a system for judging the importance of an article, which can effectively identify an article with a rich overall vocabulary but a poor local vocabulary and is convenient for the user.
  • The sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
  • The starting unit 513 controls the sliding window 51 to start sliding from the initial sliding start point of the article.
  • The initial sliding start point can be the beginning of the article, the end of the article, or a preset sliding start point within the article.
  • The word collecting unit 511 collects the words slid over without repetition and, when the collected words reach the preset number, sends start information to the starting unit 513 and the word recording unit 512.
  • After receiving the start information, the starting unit 513 resets the start point and restarts the sliding window 51, until the entire article has been slid over.
  • The preset number is preferably 6.
  • The word recording unit 512 records the number of words the sliding window 51 has slid over.
  • The word recording unit 512 can record the total number of words the sliding window has slid over after receiving the start information.
  • Alternatively, the word recording unit 512 can keep a running count of the words slid over during sliding, stop counting when it receives the start information, and record the final total.
  • The maximum value obtaining unit 52 obtains the largest of the numbers recorded by the word recording unit 512 and sends it to the importance judging unit 53.
  • The importance judging unit 53 judges the importance of the article based on that maximum value.
  • In the embodiment of the present invention, the words in the article can be separated by a word breaking unit.
  • The case in which the sliding window 51 starts sliding from the beginning of the article has been described in detail above.
  • The sliding window 51 can also start from the end of the article and slide toward the beginning.
  • The embodiment of the present invention does not limit how the sliding start point is chosen.
  • Referring to FIG. 6, which is a schematic diagram of a system for judging the importance of an article according to a fifth embodiment of the present invention, the system includes a sliding window 51, a maximum value obtaining unit 52, an importance judging unit 53, and a word breaking unit 54; the sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
  • The word breaking unit 54 separates adjacent words in the article with spaces.
  • The functions of the sliding window 51, the maximum value obtaining unit 52, and the importance judging unit 53 in this embodiment are the same as in the embodiment shown in FIG. 5 and are not described again.
  • The system may further comprise an importance output unit.
  • The importance output unit is connected to the importance judging unit 53.
  • The importance output unit is configured to output the importance of the article to the network or locally after the importance judging unit 53 has judged it.
  • The entire article is traversed by the sliding window, so parameters related to the richness of the article's vocabulary can be obtained effectively.
  • Referring to the schematic structural diagram of the sliding window, the sliding window 51 includes a word collecting unit 511, a word recording unit 512, and a starting unit 513.
  • The starting unit 513 controls the sliding window 51 to start sliding at the beginning of the article or at the end of the article.
  • The word collecting unit 511 collects the words slid over without repetition and, when the collected words reach the preset number, sends start information to the starting unit 513 and the word recording unit 512; the starting unit 513 takes the last word collected by the word collecting unit 511 as the start point and restarts the sliding window 51, until it has slid over the entire article.
  • The preset number is preferably 6.
  • The word recording unit 512 records the number of words the sliding window 51 has slid over.
  • The sliding window 51 further includes a left boundary and a right boundary.
  • The right boundary moves rightward from the start point.
  • When the words collected by the word collecting unit 511 reach the preset number, the right boundary stops moving and the left boundary moves to the right until only one word remains between the left and right boundaries.
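The boundary behaviour just described can be sketched with two indices. This is an assumption-laden illustration rather than the patent's implementation: the function name is hypothetical, the window is modelled as the half-open slice `words[left:right]`, the recorded quantity is the window length, and the one word left in the window after the left boundary closes in is assumed to be collected again as the first word of the next pass.

```python
def window_lengths(words, preset=6):
    """Record the window length each time `preset` distinct words have been collected."""
    lengths = []
    left = right = 0              # the window covers words[left:right]
    collected = set()             # words collected without repetition
    while right < len(words):
        collected.add(words[right])    # right boundary moves one word to the right
        right += 1
        if len(collected) == preset:
            lengths.append(right - left)   # number of words between the boundaries
            left = right - 1               # left boundary closes in until one word remains
            collected = {words[left]}      # assumption: that word starts the new collection
    return lengths
```

For `"a a a b c d e f".split()` the window must grow to 8 words before it holds 6 distinct ones, so the single recorded length is 8.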
  • In the embodiments of the present invention, a preset sliding window starts sliding at the beginning or the end of the article and collects the words it slides over without repetition. When the words collected by the sliding window reach a preset number, the number of words slid over is recorded and the start point is reset.
  • The sliding window repeats this process until it has slid over the entire article; the maximum value among the numbers recorded by the sliding window is obtained, and the importance of the article is judged according to the size of that value. If an article's overall vocabulary is rich but its local vocabulary is poor, the poor part contains a great deal of word repetition. When the sliding window slides over that part, it does not collect the repeated words again, so it has to slide over many words before the collected words reach the preset number, and the recorded number is large.
  • The embodiments of the present invention thus use a value that reflects the part of the article with the poorest vocabulary as the basis for judging the importance of the article.
  • The embodiments of the present invention can effectively judge articles with a rich overall vocabulary but a poor local vocabulary, which is convenient for the user.

Abstract

A method for judging the importance of an article, including: a sliding window starts sliding from an initial sliding start point of the article and collects the words it slides over without repetition; when the words collected by the sliding window reach a preset number, the total number of words slid over is recorded, the sliding start point is reset, and the sliding process continues from the reset sliding start point; the maximum value among the numbers recorded by the sliding window is obtained, and the importance of the article is judged based on that maximum value. The invention also relates to a system for judging the importance of an article, and to a sliding window.

Description

Method and system for judging the importance of an article, and sliding window
Technical field
The present invention relates to the field of network retrieval, and in particular to a method and system for judging the importance of an article, and to a sliding window.
Background of the invention
Retrieving relevant articles by keyword is one of the important ways of sharing network resources. Because network resources are so abundant, a search keyword often matches a large number of articles. The network system therefore needs to judge the importance of each article, so that relatively important articles are shown at the front of the search results and relatively unimportant ones at the back, letting users read the more important articles first and saving their time.
The current common practice is to judge an article's importance by the richness of its vocabulary. If an article has a rich vocabulary, it has something to say and is an important article; conversely, if only a few words are repeated throughout the article or in part of it, the vocabulary is poor, the article has nothing to say, and it is an unimportant article. The prior art judges the importance of an article by word frequency statistics.
Referring to FIG. 1, which is a flowchart of an existing method for judging the importance of an article on a network, the specific steps are as follows.
Step S101: Separate adjacent words in the article's sentences with spaces.
For example:
Article before word segmentation: 今天看了一部电视剧, 被剧中的一个男生给感动了, …… ("Today I watched a TV series and was moved by a boy in it, ...")
Article after word segmentation: 今天 看 了 一 部 电视剧 , 被剧中 的 一 个 男生 给感动 了 , ……
Step S102: Count the word frequency of each word, that is, the number of times the word appears in the article. For example:
今天 (today), 5 times; 看 (watch), 35 times; 了 (a particle), 100 times; 电视剧 (TV series), 10 times; ...
Step S103: Calculate and determine whether the word frequencies satisfy a preset condition. If not, the article is considered relatively important; if so, the article is considered relatively unimportant.
The preset conditions can be:
1) the total number of words is less than 5;
2) the frequency of the single most frequent word is greater than 30% of the total word frequency, or the combined frequency of the most frequent 5% of words is greater than 50% of the total word frequency, or the combined frequency of the most frequent 20% of words is greater than 80% of the total word frequency;
3) the average word frequency is more than 5.
For example, if the average word frequency of the above word "article" exceeds 5, the article is considered relatively unimportant.
The above method judges the importance of an article by counting the frequency of each word in it. However, word frequency reflects the global characteristics of the article, not its local characteristics. For example, if an article's overall vocabulary is rich but its local vocabulary is poor, the existing method easily misjudges it as an important article. The existing method therefore cannot effectively judge articles whose overall vocabulary is rich but whose local vocabulary is poor, which inconveniences users.
Summary of the invention
An embodiment of the invention provides a method for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
An embodiment of the invention further provides a system for judging the importance of an article, which identifies articles more effectively and is convenient for the user.
An embodiment of the invention further provides a sliding window for traversing an article on the network, which can effectively obtain parameters related to the richness of the article's vocabulary.
Embodiments of the present invention relate to a method for judging the importance of an article, including: a sliding window starts sliding from the initial sliding start point of the article and collects the words it slides over without repetition; when the words collected by the sliding window reach a preset number, the total number of words slid over is recorded, the sliding start point is reset, and the sliding process continues from the reset sliding start point;
the maximum value among the numbers recorded by the sliding window is obtained, and the importance of the article is judged according to that maximum value.
An embodiment of the present invention further relates to a system for judging the importance of an article, comprising a sliding window, a maximum value obtaining unit, and an importance judging unit, the sliding window comprising a word collecting unit, a word recording unit, and an activation unit:
the activation unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send activation information to the activation unit and the word recording unit; on receiving the activation information, the activation unit resets the sliding start point and starts the sliding window so that it continues the sliding process from the reset sliding start point;
the word recording unit is configured to record, on receiving the activation information, the total number of words the sliding window has slid over;
the maximum value obtaining unit is configured to obtain the maximum value from the counts recorded by the word recording unit;
the importance judging unit is configured to judge the importance of the article according to the obtained maximum value.
An embodiment of the present invention further relates to a sliding window comprising a word collecting unit, a word recording unit, and an activation unit:
the activation unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
the word collecting unit is configured to collect the words slid over without repetition and, when the collected words reach a preset number, to send activation information to the activation unit and the word recording unit; on receiving the activation information, the activation unit resets the sliding start point and starts the sliding window so that it continues the sliding process from the reset sliding start point;
the word recording unit is configured to record, on receiving the activation information, the number of words the sliding window has slid over.
Brief description of the drawings
Fig. 1 is a flowchart of an existing method for judging the importance of an article on a network;
Fig. 2 is a flowchart of the method for judging the importance of an article according to a first embodiment of the present invention;
Fig. 3 is a flowchart of the method for judging the importance of an article according to a second embodiment of the present invention;
Fig. 4 is a flowchart of the method for judging the importance of an article according to a third embodiment of the present invention;
Fig. 5 is a schematic diagram of the system for judging the importance of an article according to a fourth embodiment of the present invention;
Fig. 6 is a schematic diagram of the system for judging the importance of an article according to a fifth embodiment of the present invention;
Fig. 7 is a schematic structural diagram of the sliding window according to a sixth embodiment of the present invention.
Mode for carrying out the invention
To make the above objects, features, and advantages of the embodiments of the present invention easier to understand, the embodiments are described in further detail below with reference to the drawings and specific implementations.
In the embodiments of the present invention, a preset sliding window starts sliding from the initial sliding start point of the article and collects the words it slides over without repetition. When the collected words reach a preset number, the number of words slid over is recorded, the start point is reset, and sliding continues. The sliding window repeats this process until it has slid over the entire article. The largest of the recorded counts is then obtained, and the importance of the article is judged from its magnitude.
The initial sliding start point may be the beginning of the article, the end of the article, or a predetermined sliding point set in advance within the article. The article may be a network article stored on a network or a local article that is offline.
Preferably, after the importance of the article is judged, the degree of importance of the article is further output to the network or locally.
Fig. 2 is a flowchart of the method for judging the importance of an article according to the first embodiment of the present invention. The specific steps are as follows.
Step S201: a preset sliding window starts sliding with the beginning of the article as the start point.
A sliding window is set up that comprises a left boundary, a right boundary, and a database. The database stores, without repetition, the words between the left and right boundaries, and holds at most a preset number of words. For example, the preset number is preferably 6.
The sliding window starts sliding with the first word of the article as the start point. During sliding, the left boundary stays fixed and the right boundary slides to the right.
Step S202: the sliding window collects the words it slides over without repetition.
While sliding, the window collects the words it slides over without repetition, that is, no two collected words are the same. The sliding window stores the collected words in the database.
Step S203: when the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
As the window slides, the number of collected words keeps growing. When it reaches the preset number, the total number of words the window has slid over since the start point is recorded.
Step S204: the start point is reset and sliding continues until the entire article has been slid over.
The sliding window resets the start point, clears the collected words, and continues sliding. The reset start point may be the last word collected by the sliding window, the word after it, the word before it, or a word several positions before it.
When the words collected by the sliding window again reach the preset number, the window again records the total number of words slid over in that pass, clears the collected words, resets the start point, and continues sliding, until the entire article has been slid over.
When the words collected by the sliding window reach the preset number, the right boundary stops moving and the left boundary then moves right to the position of the word chosen as the new start point.
Step S205: the largest of the counts recorded by the sliding window is obtained.
After the sliding window has slid over the entire article, the count of words slid over that was recorded in each pass is retrieved, and the largest of these counts is extracted.
Step S206: the importance of the article is judged from the magnitude of this value.
The obtained value serves as the basis for judging the importance of the article: the larger the value, the lower the importance; the smaller the value, the higher the importance.
If an article has a rich vocabulary overall but a poor vocabulary locally, some passages contain many repeated words. Because the sliding window collects words without repetition and the number of words to collect is fixed, the window has to slide over more words in such a passage than elsewhere. The count recorded for that pass is therefore comparatively large and becomes the basis for judging the article's importance. In this way, the embodiment uses a value that reflects the most vocabulary-poor part of the article as the basis for the judgment, so it can effectively judge articles whose vocabulary is rich overall but poor locally, which is convenient for users.
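Steps S201 to S205 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the `restart_offset` parameter (0 restarts at the last collected word, 1 at the word after it), and the choice not to record a trailing pass that never reaches the preset number are all assumptions made for the example.

```python
def recorded_window_lengths(words, preset=6, restart_offset=0):
    """Return the number of words slid over in each pass that fills the window.

    words          -- the article as a list of already separated words
    preset         -- number of distinct words that ends a pass (assumed >= 2)
    restart_offset -- hypothetical knob: 0 restarts at the last collected word,
                      1 at the following word, -1 at the preceding word
    """
    lengths = []
    start = 0
    while start < len(words):
        collected = set()          # distinct words gathered in this pass
        end = start                # right boundary (exclusive)
        while end < len(words) and len(collected) < preset:
            collected.add(words[end])
            end += 1
        if len(collected) < preset:
            break                  # assumption: an unfinished last pass is not recorded
        lengths.append(end - start)
        last_collected = end - 1
        # Always move forward by at least one word so the loop terminates.
        start = max(last_collected + restart_offset, start + 1)
    return lengths


def largest_recorded_length(words, preset=6):
    """Largest recorded count, or None if no pass reached the preset number."""
    lengths = recorded_window_lengths(words, preset)
    return max(lengths) if lengths else None
```

With a preset of 3, the word list `a b c a a a b d` yields the counts 3 and 5: the second pass has to cross the repeated `a` words before it finds three distinct words, so the repetitive stretch produces the larger value.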
The embodiment above describes in detail the scheme in which sliding starts from the beginning of the article. The embodiment may equally take the end of the article as the start point and slide toward the beginning. The embodiments of the present invention place no restriction on how the sliding start point is chosen.
In an embodiment of the present invention, before the sliding window starts sliding, adjacent words in the sentences of the article are separated by spaces so that the sliding window can recognize words while sliding. The embodiment may also judge the importance of the article by means of a predetermined value.
Fig. 3 is a flowchart of the method for judging the importance of an article according to the second embodiment of the present invention. The specific steps are as follows.
Step S301: adjacent words in the article are separated by spaces.
For example, the article opens with the sentence "花也漂亮, 草也茂盛, 这里风光不错" ("The flowers are pretty, the grass is lush, and the scenery here is good"). After adjacent words are separated by spaces it becomes "花 也 漂亮 草 也 茂盛 这里 风光 不错".
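Once the words are separated by spaces, turning the text into a word list is straightforward. The sketch below assumes that word segmentation has already been done upstream (the patent does not name a segmenter) and only strips punctuation and splits on whitespace. The function name is hypothetical.

```python
import re


def split_spaced_words(text):
    """Split space-separated text into words, dropping punctuation.

    Assumes the words are already separated by spaces; this does not
    perform Chinese word segmentation itself.
    """
    # Replace every character that is neither a word character nor
    # whitespace (for example commas and full stops) with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    return cleaned.split()
```

In Python 3, `\w` is Unicode-aware for `str` patterns, so Chinese characters count as word characters and only the punctuation is removed.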
Step S302: a preset sliding window starts sliding with the beginning of the article as the start point.
The sliding window comprises a left boundary "【", a right boundary "】", and a database. The database stores, without repetition, the words between the left and right boundaries, and holds at most a preset number of words. For example, the preset number may be 6.
For example, the sliding window is initially positioned as "【】花 也 漂亮 草 也 茂盛 这里 风光 不错". When sliding starts, the right boundary "】" begins to move right.
Step S303: the sliding window collects the words it slides over without repetition.
For example, when the window has moved to "【花】 也 漂亮 草 也 茂盛 这里 风光 不错", the word "花" is found not to duplicate any word already in the database, so it is collected. When the window has moved to "【花 也 漂亮 草 也】 茂盛 这里 风光 不错", the second "也" is found to duplicate the "也" already in the database, so it is not collected again.
Step S304: when the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
For example, when the window has moved to "【花 也 漂亮 草 也 茂盛 这里】 风光 不错", the number of collected words is 6, which reaches the preset value. The number of words the window has slid over is then recorded: the 7 words "花, 也, 漂亮, 草, 也, 茂盛, 这里", so the value 7 is recorded.
Step S305: the last word collected by the sliding window is set as the start point, the collected words are cleared, and sliding continues until the entire article has been slid over.
For example, with the word "这里" as the start point, the left boundary moves to the right boundary and the right boundary continues sliding right. The window is then positioned as "花 也 漂亮 草 也 茂盛 这里 【】 风光 不错".
When the words collected by the sliding window again reach the preset number, the window again records the total number of words slid over in that pass, clears the collected words, and slides again, until the entire article has been slid over.
Step S306: the largest of the counts recorded by the sliding window is obtained.
After the sliding window has slid over the entire article, the count of words slid over that was recorded in each pass is retrieved, and the largest of these counts is extracted.
For example, if the first recorded count is 7, the second 8, the third 12, and so on, and comparison shows that 12 is the largest, then 12 is taken as the maximum word count.
Step S307: this value is compared with a predetermined value; if it is smaller, the article is determined to be an important article.
For example, with a predetermined value of 16, 12 is compared with 16; since 12 < 16, the article is a relatively important article.
This embodiment uses a predetermined value to classify an article as important or unimportant, so the importance of the article can be judged directly, which is convenient and practical.
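The comparison in step S307 can be sketched as a small Python function. The function name is hypothetical, and two details are assumptions not settled by the text: a maximum equal to the predetermined value is treated as not important (the text only says "smaller"), and an empty list of recorded counts raises an error.

```python
def is_important(recorded_counts, predetermined_value=16):
    """Return True if the largest recorded count is below the predetermined value."""
    if not recorded_counts:
        # Assumption: with no recorded pass there is nothing to compare.
        raise ValueError("no pass reached the preset number of words")
    return max(recorded_counts) < predetermined_value


# The worked example from the text: counts 7, 8, 12 against the value 16.
print(is_important([7, 8, 12]))  # 12 < 16, so the article counts as important
```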
An embodiment of the present invention may record the number of words slid over by computing the length of the sliding window, and may sort the retrieved articles by the maximum word count obtained for each, so that the articles are arranged in order of importance. The length of the sliding window is the number of words between its left and right boundaries.
Fig. 4 is a flowchart of the method for judging the importance of an article according to the third embodiment of the present invention. The specific steps are as follows.
Step S401: adjacent words in the article are separated by spaces.
For example, a passage in the middle of the article reads "今天好高兴啊, 好高兴, 真高兴, 特别高兴" ("So happy today, so happy, really happy, especially happy"). After adjacent words are separated by spaces it becomes "今天 好 高兴 啊 好 高兴 真 高兴 特别 高兴".
Step S402: a preset sliding window starts sliding with the beginning of the article as the start point.
The sliding window comprises a left boundary "【", a right boundary "】", and a database. The database stores, without repetition, the words between the left and right boundaries, and holds at most a preset number of words. For example, the preset number may be 6.
For example, the sliding window is initially positioned as "【】今天 好 高兴 啊 好 高兴 真 高兴 特别 高兴". When sliding starts, the right boundary "】" begins to move right.
At this point the length of the sliding window is 0 and the number of collected words is 0.
Step S403: the sliding window collects the words it slides over without repetition and at the same time records the number of words it has slid over.
For example, when the window has moved to "【今天】 好 高兴 啊 好 高兴 真 高兴 特别 高兴", the word "今天" is found not to duplicate any word already in the database, so it is collected. When the window has moved to "【今天 好 高兴 啊 好】 高兴 真 高兴 特别 高兴", the second "好" is found to duplicate the "好" already in the database, so it is not collected again.
At this point the length of the sliding window is 5 and the number of collected words is 4.
Step S404: when the words collected by the sliding window reach the preset number, the number of words slid over is recorded.
For example, when the window has moved to "【今天 好 高兴 啊 好 高兴 真 高兴 特别 高兴】", the length of the sliding window is 10 and the number of collected words is 6.
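The third embodiment tracks two separate counters: the window length (all words between the boundaries) and the number of distinct words collected. The sketch below traces both as the right boundary advances over the example passage, assuming it is segmented into the ten words shown. It deliberately leaves out the reset step and only illustrates how the two counters diverge when words repeat. The function name is hypothetical.

```python
def trace_counters(words):
    """Return (window_length, distinct_collected) after each word is slid over."""
    collected = set()
    trace = []
    for length, word in enumerate(words, start=1):
        collected.add(word)                # duplicates leave the set unchanged
        trace.append((length, len(collected)))
    return trace


words = "今天 好 高兴 啊 好 高兴 真 高兴 特别 高兴".split()
trace = trace_counters(words)
print(trace[4])  # after five words: (5, 4)
print(trace[9])  # after all ten words: (10, 6)
```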
Step S405: the last word collected by the sliding window is set as the start point, the collected words are cleared, and sliding continues until the entire article has been slid over.
For example, the left boundary of the sliding window moves to the right boundary and the right boundary continues sliding right. When the collected words again reach the preset number, the window again records the total number of words slid over in that pass, clears the collected words, and slides again, until the entire article has been slid over.
Step S406: the largest of the counts recorded by the sliding window is obtained.
After the sliding window has slid over the entire article, the count of words slid over that was recorded in each pass is retrieved, and the largest of these counts is extracted.
For example, if the first recorded window length is 10, the second 11, the third 18, and so on, and comparison shows that 18 is the largest, then 18 is taken as the maximum word count.
Step S407: this value is used as the weight of the article's importance, and the degree of importance of the article is determined by the size of the weight.
For example, 18 is taken as the importance weight of the article, compared with the weights of other articles, and the articles are sorted by weight.
By sorting on the weight, this embodiment places the relatively most important article first and arranges the rest in order of importance, which is very convenient for users. Moreover, this embodiment needs no threshold for deciding whether an article is important, so it reflects the importance of the article more objectively.
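The ranking in step S407 can be sketched as a sort on the weight. Because a smaller maximum count means a more important article, the sort is ascending. The function name and the article identifiers are hypothetical; only the weight 18 comes from the example above.

```python
def rank_by_weight(weights):
    """Order article identifiers from most to least important.

    weights maps an article identifier to its largest recorded count;
    a smaller weight means a more important article.
    """
    return sorted(weights, key=lambda article: weights[article])


# Illustrative weights; "article_a" uses the value 18 from the example.
weights = {"article_a": 18, "article_b": 12, "article_c": 25}
print(rank_by_weight(weights))  # ['article_b', 'article_a', 'article_c']
```

`sorted` is stable, so articles with equal weights keep their original relative order.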
Based on the above method for judging the importance of an article, an embodiment of the present invention provides a system for judging the importance of an article. The system can effectively identify articles whose vocabulary is rich overall but poor locally, which is convenient for users.
Fig. 5 is a schematic diagram of the system for judging the importance of an article according to the fourth embodiment of the present invention. The system comprises a sliding window 51, a maximum value obtaining unit 52, and an importance judging unit 53. The sliding window 51 comprises a word collecting unit 511, a word recording unit 512, and an activation unit 513.
The activation unit 513 controls the sliding window 51 to start sliding from the initial sliding start point of the article. The initial sliding start point may be the beginning of the article, the end of the article, or a predetermined sliding point set in advance within the article.
The word collecting unit 511 collects the words slid over without repetition and, when the collected words reach the preset number, sends activation information to the activation unit 513 and the word recording unit 512. On receiving the activation information, the activation unit 513 resets the start point and starts the sliding window 51 again, until the entire article has been slid over. The preset number may preferably be 6.
The word recording unit 512 records the number of words the sliding window 51 has slid over. It may record the total number of words slid over on receiving the activation information. Alternatively, it may keep a running count of the words slid over during sliding and, on receiving the activation information, end the record for that pass and store the final total.
The maximum value obtaining unit 52 obtains the largest of the counts recorded by the word recording unit 512 and sends it to the importance judging unit 53.
The importance judging unit 53 judges the importance of the article according to the maximum value.
To make it easier for the word collecting unit 511 and the word recording unit 512 to collect and record words, an embodiment of the present invention may use a word separating unit to split the article into separate words.
The embodiment above describes in detail the scheme in which the sliding window 51 starts sliding from the beginning of the article. The sliding window 51 may equally take the end of the article as the start point and slide toward the beginning. The embodiments of the present invention place no restriction on how the sliding start point is chosen.
Fig. 6 is a schematic diagram of the system for judging the importance of an article according to the fifth embodiment of the present invention. The system comprises a sliding window 51, a maximum value obtaining unit 52, an importance judging unit 53, and a word separating unit 54. The sliding window 51 comprises a word collecting unit 511, a word recording unit 512, and an activation unit 513.
The word separating unit 54 separates adjacent words in the article with spaces.
The roles and functions of the sliding window 51, the maximum value obtaining unit 52, and the importance judging unit 53 in this embodiment are the same as in the embodiment shown in Fig. 5 and are not described again.
Preferably, the system may further comprise an importance output unit connected to the importance judging unit 53. After the importance judging unit 53 has judged the importance of the article, the importance output unit outputs the degree of importance of the article to the network or locally.
The embodiments of the present invention traverse the entire article with a sliding window and can thereby effectively obtain parameters related to the richness of the article's vocabulary.
Fig. 7 is a schematic structural diagram of the sliding window 51 according to the sixth embodiment of the present invention. The sliding window 51 comprises a word collecting unit 511, a word recording unit 512, and an activation unit 513.
The activation unit 513 controls the sliding window 51 to start sliding with the beginning or the end of the article as the start point.
The word collecting unit 511 collects the words slid over without repetition and, when the collected words reach the preset number, sends activation information to the activation unit 513 and the word recording unit 512. The activation unit 513 takes the last word collected by the word collecting unit 511 as the start point and restarts the sliding window 51, until the entire article has been slid over. The preset number is preferably 6.
The word recording unit 512 records the number of words the sliding window 51 has slid over.
The sliding window 51 further comprises a left boundary and a right boundary. During sliding, the right boundary moves right from the start point. When the words collected by the word collecting unit 511 reach the preset number, the right boundary stops moving and the left boundary moves right until only one word remains between the two boundaries.
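The boundary behaviour of the sixth embodiment can be sketched as a small Python class. This is an illustrative reading, not the patent's implementation: the class and attribute names are hypothetical, the three units are represented by commented steps inside one method rather than separate components, and a trailing pass that never reaches the preset number is assumed not to be recorded.

```python
class SlidingWindow:
    """Sliding window with explicit left and right boundaries (illustrative)."""

    def __init__(self, preset=6):
        self.preset = preset       # distinct words that end a pass (assumed >= 2)
        self.left = 0              # left boundary (inclusive index)
        self.right = 0             # right boundary (exclusive index)
        self.collected = set()     # words collected without repetition
        self.records = []          # counts written by the recording step

    def slide(self, words):
        """Slide over the whole article and return the recorded counts."""
        self.left = self.right = 0
        self.collected = set()
        self.records = []
        while self.right < len(words):
            # Collecting step: the right boundary passes one more word.
            self.collected.add(words[self.right])
            self.right += 1
            if len(self.collected) == self.preset:
                # Recording step: number of words between the boundaries.
                self.records.append(self.right - self.left)
                # Activation step: the left boundary moves right until only
                # the last collected word remains inside the window.
                self.left = self.right - 1
                self.collected = {words[self.left]}
        return self.records


window = SlidingWindow(preset=6)
print(window.slide(["花", "也", "漂亮", "草", "也", "茂盛", "这里", "风光", "不错"]))  # [7]
```

Keeping the last collected word inside the window after the reset matches restarting at that word, so the next pass counts it again as its first word.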
In summary, in the embodiments of the present invention a preset sliding window starts sliding with the beginning or the end of the article as the start point and collects the words it slides over without repetition. When the collected words reach a preset number, the number of words slid over is recorded, the start point is reset, and sliding continues. The sliding window repeats this process until it has slid over the entire article. The largest of the recorded counts is obtained, and the importance of the article is judged from its magnitude. If an article has a rich vocabulary overall but a poor vocabulary locally, some passages contain many repeated words. Because the sliding window collects words without repetition and the number of words to collect is fixed, the window has to slide over more words in such a passage than elsewhere. The count recorded for that pass is therefore comparatively large and becomes the basis for judging the article's importance. In this way, the embodiments use a value that reflects the most vocabulary-poor part of the article as the basis for the judgment.
Compared with the prior-art method of simply counting the frequency of words in an article, the embodiments of the present invention can effectively judge articles whose vocabulary is rich overall but poor locally, which is convenient for users.
The method, system, and sliding window for judging the importance of an article provided by the embodiments of the present invention have been described in detail above. Specific examples were used to explain the principles and implementations of the embodiments, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of the embodiments, make changes to the specific implementations and the scope of application. Accordingly, the content of this specification should not be construed as limiting the embodiments of the present invention.

Claims

1. A method for judging the importance of an article, comprising:
sliding a sliding window from an initial sliding start point of the article, wherein the sliding window collects the words it slides over without duplication; when the number of words collected by the sliding window reaches a preset number, recording the total number of words slid over, resetting the sliding start point, and continuing the sliding process from the reset sliding start point; and
obtaining the maximum value among the numbers recorded by the sliding window, and judging the importance of the article according to the obtained maximum value.
2. The method of claim 1, further comprising: separating adjacent words in the article with spaces.
3. The method of claim 1, wherein the sliding window collecting the words it slides over without duplication comprises:
determining, by the sliding window, whether each word slid over duplicates a word already collected; if it does not, collecting the word; if it does, not collecting the word.
4. The method of claim 1, wherein resetting the sliding start point comprises:
setting the last word collected by the sliding window as the sliding start point, or setting the word following the last collected word as the sliding start point, or setting the word preceding the last collected word as the sliding start point; and
clearing the words collected by the sliding window.
5. The method of any one of claims 1 to 4, wherein judging the importance of the article according to the maximum value comprises:
comparing the obtained maximum value with a predetermined value, and if the maximum value is less than the predetermined value, determining that the article is an important article.
6. The method of any one of claims 1 to 4, wherein judging the importance of the article according to the maximum value comprises:
using the maximum value as a weight for the importance of the article, and determining the degree of importance of the article according to the magnitude of the weight.
7. The method of any one of claims 1 to 4, wherein the initial sliding start point is the start of the article, the end of the article, or a preset sliding point in the article.
8. The method of any one of claims 1 to 4, wherein the article is a network article or a local article that is offline from the network.
9. The method of claim 8, further comprising, after judging the importance of the article:
outputting the degree of importance of the article to the network or locally.
10. The method of any one of claims 1 to 4, wherein continuing the sliding process from the reset sliding start point comprises:
continuing the sliding process from the reset sliding start point until the entire article has been slid over.
11. A system for judging the importance of an article, comprising a sliding window, a maximum value obtaining unit, and an importance judging unit, the sliding window comprising a word collecting unit, a word recording unit, and a starting unit, wherein:
the starting unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
the word collecting unit is configured to collect the words slid over without duplication and, when the number of collected words reaches a preset number, to send start information to the starting unit and the word recording unit; after receiving the start information, the starting unit resets the sliding start point and starts the sliding window to continue the sliding process from the reset sliding start point;
the word recording unit is configured to record, after receiving the start information, the total number of words the sliding window has slid over;
the maximum value obtaining unit is configured to obtain the maximum value among the numbers recorded by the word recording unit; and
the importance judging unit is configured to judge the importance of the article according to the obtained maximum value.
12. The system of claim 11, further comprising a word splitting unit configured to separate adjacent words in the article with spaces.
13. The system of claim 11 or 12, further comprising an importance output unit,
wherein the importance output unit is configured to output the degree of importance of the article to the network or locally after the importance judging unit has judged the importance of the article.
14. A sliding window, comprising a word collecting unit, a word recording unit, and a starting unit, wherein:
the starting unit is configured to control the sliding window to start sliding from an initial sliding start point of the article;
the word collecting unit is configured to collect the words slid over without duplication and, when the number of collected words reaches a preset number, to send start information to the starting unit and the word recording unit; after receiving the start information, the starting unit resets the sliding start point and starts the sliding window to continue the sliding process from the reset sliding start point; and
the word recording unit is configured to record, after receiving the start information, the number of words the sliding window has slid over.
15. The sliding window of claim 14, further comprising a left boundary and a right boundary, wherein during sliding the right boundary moves rightward from the start point; when the number of words collected by the word collecting unit reaches the preset number, the right boundary stops moving and the left boundary moves rightward until one word lies between the left and right boundaries.
16. The sliding window of claim 14 or 15, wherein the word collecting unit is configured to continue the sliding process from the reset sliding start point until the entire article has been slid over.
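For illustration only, and not as part of the claims, the restart options of claim 4, the left and right boundaries of claim 15, and the threshold test of claim 5 can be sketched together in Python. All names here (`RESTART_OFFSETS`, `recorded_spans`, `is_important`, `restart`, `threshold`) are hypothetical. The check that rejects preset values too small for the window to advance, and the choice to treat an article with no completed slide as not important, are assumptions added to keep the sketch well defined.

```python
# Where the next slide starts, relative to the last collected word (claim 4).
RESTART_OFFSETS = {"previous": -1, "last": 0, "next": 1}


def recorded_spans(words, preset, restart="next"):
    """Slide a [left, right] window over `words` and record each span length.

    The right boundary advances until `preset` distinct words are collected;
    the left boundary then jumps to the restart position (claims 4 and 15).
    """
    offset = RESTART_OFFSETS[restart]
    if preset < 2 - offset:
        # Assumption: reject settings where the window could fail to advance.
        raise ValueError("preset is too small for this restart mode")
    spans = []
    left = 0
    while left < len(words):
        collected = set()
        right = left
        while right < len(words):
            collected.add(words[right])   # duplicates are ignored
            if len(collected) == preset:
                break                     # right now marks the last collected word
            right += 1
        if right == len(words):
            break                         # article ended before the preset was reached
        spans.append(right - left + 1)    # words slid over in this slide
        left = right + offset             # reset the sliding start point
    return spans


def is_important(spans, threshold):
    """Claim 5 style test: important if the largest span is below `threshold`.

    Assumption: an article with no completed slide is treated as not important.
    """
    return bool(spans) and max(spans) < threshold


if __name__ == "__main__":
    article = "a b c a a a a b d e f".split()
    for mode in RESTART_OFFSETS:
        print(mode, recorded_spans(article, 3, mode))
```

With the `"last"` mode, the left boundary lands on the last collected word, so the window shrinks to a single word before the next slide begins, which matches the boundary behavior described in claim 15.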
PCT/CN2008/070600 2007-05-31 2008-03-27 Method and system for judging of the inportance of article, and sliding window WO2008145031A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710105297.8 2007-05-31
CNB2007101052978A CN100520767C (en) 2007-05-31 2007-05-31 Method and system for judging article importance in network, and sliding window

Publications (1)

Publication Number Publication Date
WO2008145031A1 true WO2008145031A1 (en) 2008-12-04

Family

ID=38898646

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070600 WO2008145031A1 (en) 2007-05-31 2008-03-27 Method and system for judging of the inportance of article, and sliding window

Country Status (2)

Country Link
CN (1) CN100520767C (en)
WO (1) WO2008145031A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100520767C (en) * 2007-05-31 2009-07-29 腾讯科技(深圳)有限公司 Method and system for judging article importance in network, and sliding window
CN100545847C (en) * 2007-09-25 2009-09-30 腾讯科技(深圳)有限公司 A kind of method and system that blog articles is sorted
CN103336771B (en) * 2013-04-02 2016-12-28 江苏大学 Data similarity detection method based on sliding window

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5576954A (en) * 1993-11-05 1996-11-19 University Of Central Florida Process for determination of text relevancy
CN1133127C (en) * 1996-05-29 2003-12-31 松下电器产业株式会社 Document retrieval system
CN1818908A (en) * 2006-03-16 2006-08-16 董崇军 Feedbakc information use of searcher in search engine
CN101071419A (en) * 2007-05-31 2007-11-14 腾讯科技(深圳)有限公司 Method and system for judging article importance in network, and sliding window

Also Published As

Publication number Publication date
CN101071419A (en) 2007-11-14
CN100520767C (en) 2009-07-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08715336

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 5543/CHENP/2009

Country of ref document: IN

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/02/2010)

122 Ep: pct application non-entry in european phase

Ref document number: 08715336

Country of ref document: EP

Kind code of ref document: A1