WO2005024661A2 - Improved search engine optimisation - Google Patents

Improved search engine optimisation Download PDF

Info

Publication number
WO2005024661A2
Authority
WO
WIPO (PCT)
Prior art keywords
queries
query
measure
keywords
previous
Prior art date
Application number
PCT/GB2004/003780
Other languages
French (fr)
Other versions
WO2005024661A8 (en)
Inventor
Peter H. Mowforth
Original Assignee
Teleit Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Teleit Limited filed Critical Teleit Limited
Priority to GB0604247A priority Critical patent/GB2419993A/en
Publication of WO2005024661A2 publication Critical patent/WO2005024661A2/en
Publication of WO2005024661A8 publication Critical patent/WO2005024661A8/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9532Query formulation


Abstract

A method of optimising and rating web pages and sites uses search engine query results to provide a measure of similarity between web sites and search engine statistics to provide a measure of similarity between queries. The difference between the measures is obtained, and clustering then yields keywords that users are looking for, yet where relatively fewer search results are found. The keywords are used for optimising web pages and sites. The clusters are used for evaluating web pages and sites, by testing web pages against the clusters.

Description

Improved Search Engine Optimisation
The invention relates to the development and maintenance of a Web site or other information resource, where statistical data from the Web and from search engine usage is used to optimise the positioning of the Web site or information resource in search engine results.
Users of the Web often need to find specific information. An increasingly common tool for meeting such information requirements is the search engine. Search engines provide an automated index of the Web by systematically exploring the Web and recording and indexing the Web sites they visit.
To find information via a search engine a query is submitted, consisting of a set of search terms, and a ranked list of results is returned.
It is in the interests of a Web site to be as highly ranked as possible with respect to relevant queries. The process of tuning a web site in order to maximize its ranking is known as Search Engine Optimization.
Search Engine Optimization is traditionally a largely manual process, which can involve some or all of the following steps: 1. Identifying the market served by the web site (market segmentation). 2. Identifying competitors (competitive analysis). 3. Selecting an appropriate set of search keywords. 4. Designing or modifying the site to maximize its search engine "visibility".
It is an object of the present invention to generate a measure of query relevance with respect to search engines. It is a further object of the present invention to optimise web content with respect to search engines. It is a further object of the present invention to evaluate web content with respect to search engines.
According to a first aspect of the present invention, there is provided a method of generating a measure of query relevance with respect to search engines, the method comprising the steps of: • receiving first keywords; • retrieving search results from at least one search engine using said first keywords, said search results comprising web addresses; • retrieving web content associated with said web addresses; • generating an address similarity measure between said web addresses using their web content; • retrieving query statistics relating to at least one search engine using second keywords, said query statistics comprising queries and information associated with said queries; • generating a query similarity measure between said queries using said query statistics; and • generating a measure of query relevance based on the difference between said address similarity measure and said query similarity measure.
Preferably the method further comprises the step of expanding said first keywords.
Preferably the method further comprises the step of processing said retrieved web content.
Optionally said first keywords further comprise phrases.
Preferably said step of generating an address similarity measure comprises positioning web addresses in a first parameter space.
Preferably said step of positioning web addresses uses latent semantic analysis.
Preferably said step of positioning web addresses uses web address rank in search results.
Preferably said step of generating an address similarity measure further comprises associating a spread with said positioned web addresses.
Preferably said queries comprise second keywords.
Preferably said second keywords and said first keywords are identical. Alternatively said second keywords comprise keywords related to said first keywords.
Preferably said queries further comprise phrases.
Preferably said information associated with said queries comprises frequency information.
Optionally said information associated with said queries comprises clickthrough information.
Optionally said information associated with said queries comprises geographic information.
Optionally said information associated with said queries comprises demographic information.
Preferably said step of generating a query similarity measure comprises the step of positioning said queries in a parameter space.
Preferably said step of positioning said queries in a parameter space comprises the step of positioning said queries in said first parameter space.
Preferably said step of generating a measure of query relevance further comprises normalising said address and query similarity measures.
Preferably said step of generating a measure of query relevance further comprises operating on said address and query similarity measures to provide a difference measure. Preferably said step of operating on said address and query similarity measures comprises the steps of: • smoothing said query similarity measure in said parameter space; and • subtracting said address similarity measure from said query similarity measure in said parameter space.
According to a second aspect of the present invention there is provided a method of optimising web content comprising the steps: • generating a measure of query relevance according to the method of the first aspect; • determining at least one set of optimal queries using said measure of query relevance; • determining a set of optimised keywords from said at least one set of optimal queries; and • applying said set of optimised keywords to sample web content.
Preferably the method according to the second aspect is repeated so as to continuously optimise said sample web content .
Preferably said step of determining at least one set of optimal queries comprises clustering.
Preferably said step of determining at least one set of optimal queries comprises applying a threshold.
According to a third aspect of the present invention there is provided a method of evaluating web content comprising the steps: • generating a measure of query relevance according to the method of the first aspect; • determining a rating of sample web content using said sample web content and said measure of query relevance.
Preferably said step of determining a rating of sample web content further comprises the steps: • determining at least one set of optimal queries using said measure of query relevance; and • determining a rating of said sample web content using the overlap of said sample web content with said at least one set of optimal queries.
Preferably said step of determining at least one set of optimal queries comprises clustering.
Preferably said step of determining at least one set of optimal queries comprises applying a threshold.
According to a fourth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the first aspect of the present invention.
According to a fifth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the second aspect of the present invention.
According to a sixth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the third aspect of the present invention.
The computer programs may be embodied on a record medium, stored in a computer memory, embodied in read-only memory or carried on an electrical carrier signal.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows the overall control flow of the preferred embodiment of the present invention.
Figure 2 shows the overall control flow of an embodiment of the present invention.
Figure 3 shows the means for determining overall market segmentation, and a measure of similarity between Web pages in one embodiment of the present invention.
Embodiments of the present invention provide a methodology for continuous adjustment of a site in order to optimise it with respect to search engines. Accordingly, an embodiment of the present invention provides a means for providing a quantitative ranking for a Web site by evaluating the site with respect to a given market segment and competitive environment. The ranking mechanism, the competitive environment and the market segmentation are produced automatically by appropriate interaction with one or more search engines. The automatic quantitative ranking allows a search procedure to continually optimise the ranking by providing small changes which can be continuously evaluated.
The principal driver for the invention is the idea that optimisation should exploit the difference between the information and services which are currently available, as indicated by search engines, and what users/customers want, as indicated by searches and the choices users make as a result of those searches.
With reference to Figure 1, a preferred embodiment of the present invention is shown.
First, a measure of query relevance with respect to search engines is generated by:
• generating keywords and optionally phrases 2. The keywords may be expanded by thesaural expansion or by finding related keywords from an analysis of actual search engine query phrases;
• submitting the keywords/phrases to one or more search engines 4 and retrieving search results that contain URLs. The URLs may be pruned by removing duplicates;
• retrieving web content 6 from the web sites and pages pointed to by the URLs. Java programs are used for this spidering software. The web content is distilled by removing common words and stemming;
• generating a URL address similarity measure 8 between the URLs using the web content, by positioning the URLs in a parameter space using Latent Semantic Analysis. In this space each URL is a point, and a spread can be assigned to the point based on the similarity of the words in the web content associated with it. Thus the URL can be represented by a Gaussian function in the parameter space. The URL rank obtained from the search results may be used in the positioning;
• retrieving query statistics relating to at least one search engine 4 using keywords that are the same as, similar to or related to those submitted to the search engines in the above step. The query statistics are made up of phrases/keywords and frequency information. They may also contain geographic, demographic and/or clickthrough information associated with the queries;
• generating a query similarity measure 10 between the queries by positioning the queries in the same parameter space as the URLs (or one that can be mapped onto it); and
• generating a measure of query relevance 12 based on the difference between the URL similarity measure and the query similarity measure. This is done by normalising the two similarity data sets (for example by scaling one of them), smoothing the query similarity measure, then subtracting the URL similarity measure from the query similarity measure.
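The patent gives no reference implementation for this pipeline (it only notes that Java programs were used for the spidering). The following is a minimal Python sketch of the measure under stated assumptions: page_texts maps URLs to their distilled content, query_stats maps query phrases to observed frequencies, and the use of scikit-learn's TF-IDF/TruncatedSVD for the LSA positioning and a Gaussian kernel for the spread and smoothing are illustrative choices, not the patent's own.

# Minimal sketch of the query-relevance measure described above (illustrative
# only; not the patent's implementation). page_texts maps URL -> distilled text,
# query_stats maps query phrase -> observed frequency.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

def lsa_positions(page_texts, queries, n_components=50):
    """Position URLs and queries in a shared LSA parameter space."""
    urls = list(page_texts)
    vectorizer = TfidfVectorizer(stop_words="english")
    doc_matrix = vectorizer.fit_transform([page_texts[u] for u in urls])
    svd = TruncatedSVD(n_components=min(n_components, len(urls) - 1))
    url_points = svd.fit_transform(doc_matrix)                  # one point per URL
    query_points = svd.transform(vectorizer.transform(queries))
    return url_points, query_points

def gaussian_density(points, weights, grid, bandwidth=1.0):
    """Represent each point as a Gaussian and sum the contributions over a grid."""
    dist2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return (weights[None, :] * np.exp(-dist2 / (2 * bandwidth ** 2))).sum(axis=1)

def query_relevance(page_texts, query_stats, bandwidth=1.0):
    queries = list(query_stats)
    url_points, query_points = lsa_positions(page_texts, queries)
    grid = query_points                  # evaluate both measures at the query positions
    url_density = gaussian_density(url_points, np.ones(len(url_points)), grid, bandwidth)
    query_weights = np.array([query_stats[q] for q in queries], dtype=float)
    query_density = gaussian_density(query_points, query_weights, grid, bandwidth)
    # Normalise both measures, then subtract: high values flag regions that users
    # ask about but where relatively little content is returned by the engines.
    url_density /= max(url_density.max(), 1e-12)
    query_density /= max(query_density.max(), 1e-12)
    return dict(zip(queries, query_density - url_density))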
XML is used as the data structure for the storage of address and query similarity measure data. These technologies were used to enable the invention to be portable across numerous different platforms, thus allowing target market user profiles to be computed using grid computing techniques. It should be noted that this can be computed on one or many computers, and the software could have been written in numerous different programming languages. The XML data structure can also be represented within a SQL (or similar) database table.
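The patent does not specify an XML schema or SQL layout, so the element and attribute names below are invented purely for illustration; a minimal sketch of persisting the relevance scores as XML might look like this.

# Hypothetical XML layout for the relevance data (the patent specifies no schema;
# the element and attribute names here are illustrative assumptions).
import xml.etree.ElementTree as ET

def relevance_to_xml(relevance, path="query_relevance.xml"):
    root = ET.Element("queryRelevance")
    for query, score in sorted(relevance.items(), key=lambda item: -item[1]):
        ET.SubElement(root, "query", phrase=query, score=f"{score:.4f}")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

The same fields could equally be written as rows of a SQL table, as the paragraph above notes.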
After generating the measure of query relevance a web page may be optimised by:
• determining one or more sets of optimal queries and keywords 14 by clustering the measure of query relevance. The clustering can be achieved by applying a threshold. Alternatively the clustering may be done on the URL similarity measure and/or the query similarity measure before generating the measure of query relevance. Hence the clustering yields keywords that users are looking for, yet for which relatively few search results are found; and
• applying the set of optimised keywords 18 to the web page.
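As a rough illustration of this step (an assumption, not the patent's code), the sketch below keeps only queries whose relevance score clears a threshold and collapses them into a candidate keyword set; the threshold and top_n values are arbitrary.

# Sketch: threshold the relevance measure to obtain "optimal" queries, then
# derive an optimised keyword set from them (threshold and top_n are illustrative).
from collections import Counter

def optimised_keywords(relevance, threshold=0.3, top_n=20):
    optimal_queries = [q for q, score in relevance.items() if score >= threshold]
    counts = Counter(word for q in optimal_queries for word in q.lower().split())
    keywords = [word for word, _ in counts.most_common(top_n)]
    return keywords, optimal_queries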
In order to achieve continuous optimisation of web pages and sites the generation of query relevance and web page optimisation above is repeated. This provides a closed loop control system.
After generating a measure of query relevance a web page may also be evaluated 16, giving a rating (or a ranking compared to others), by:
• determining one or more clusters of optimal queries from the measure of query relevance; and
• quantifying the overlap of the web page with the optimal query cluster(s).
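A simple way to quantify that overlap, shown here only as a sketch (a fuller system would measure overlap in the shared LSA space rather than by raw term counts), is:

# Sketch: rate a page by its term overlap with each optimal query cluster and
# report the best match; clusters are lists of query strings.
def rate_page(page_text, optimal_query_clusters):
    page_terms = set(page_text.lower().split())
    ratings = []
    for cluster in optimal_query_clusters:
        cluster_terms = {word for query in cluster for word in query.lower().split()}
        ratings.append(len(page_terms & cluster_terms) / max(len(cluster_terms), 1))
    return max(ratings, default=0.0)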
The decision whether to modify a web page may be based on the evaluation.
A further embodiment of the present invention is presented below with reference to Figures 2 and 3. The objective of the competitive analysis of the market segmentation step 23 is to apply statistical and machine learning techniques to develop clusters of web sites, where each cluster represents a related set of competitors.
The starting point for analysis is a basic set of keywords 21 and phrases relevant to the domain. In addition, web sites which are known to fall within the set can be used, as described below.
Algorithm One
1. The keywords and phrases are used to generate a set of permutations 31 of subsets of the words and phrases. If there are n keywords, then there are C(n, k) = n!/(k!(n-k)!) subsets of length k, and k!·C(n, k) = n!/(n-k)! permutations of these subsets.
2. Each permutation is presented to a series of search engines 22.
3. Each web page, retrieved from the first N hits on the search engine, is saved 32, indexed by its associated permutation and ranked from 1 to N.
4. The web pages are processed to remove common words, and the words are stemmed 33 using the Porter algorithm.
5. The web pages retrieved in this way are filtered for duplicates and then Latent Semantic Analysis is used to create a similarity measure between sites 34. The similarity measure is the Euclidean distance between pages with respect to the first N latent components, where N is 50 in the preferred implementation.
6. The web pages are clustered 34 using k-means clustering. The clustering metric is the similarity measure determined by the Latent Semantic Analysis. Each cluster represents a sector of competition for the "product" defined via the keywords and phrases.
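Algorithm One can be sketched in a few dozen lines of Python. In the sketch below, search(phrase, n_hits) and fetch(url) are hypothetical stand-ins for the search-engine interface and the page spider (which the patent notes was written in Java); the Porter stemming, 50-component LSA and k-means steps follow the algorithm as stated, while the number of clusters is left as a free parameter.

# Sketch of Algorithm One. search() and fetch() are hypothetical stand-ins for
# the search-engine interface and the page spider.
from itertools import permutations
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def keyword_permutations(keywords, max_len=3):
    """Step 1: ordered subsets (permutations) of the keywords, up to max_len words."""
    for k in range(1, min(max_len, len(keywords)) + 1):
        for perm in permutations(keywords, k):
            yield " ".join(perm)

def market_segmentation(keywords, search, fetch, n_hits=10, n_components=50, n_clusters=8):
    stemmer = PorterStemmer()
    urls, pages = [], []
    for phrase in keyword_permutations(keywords):
        for url in search(phrase, n_hits):                      # steps 2-3: query engines, save hits
            if url in urls:                                      # step 5: filter duplicates
                continue
            urls.append(url)
            text = fetch(url)
            pages.append(" ".join(stemmer.stem(w) for w in text.lower().split()))  # step 4
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages)
    lsa = TruncatedSVD(n_components=min(n_components, len(pages) - 1)).fit_transform(tfidf)  # step 5
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(lsa)                        # step 6
    return dict(zip(urls, labels))  # each label identifies one sector of competition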
The use of Latent Semantic Analysis (LSA) on web pages retrieved by searches on permutations of the original keywords and phrases reduces the dependence of results on the initial terms and phrases, since all the information in the retrieved pages is used to determine a clustering metric and the effects of polysemy and synonymy are reduced.
The number of dimensions used from the Singular Value Decomposition in LSA is variable. In this embodiment of the invention, 50 dimensions are used.
The clusters can be decomposed in various ways, for instance geographically, orthogonal to the original construction, via semantic analysis of keywords.
The clustering analysis uses only Web addresses acquired from searches. More sophisticated clustering strategies, involving further search using web-bots to explore links not pursued by the standard search engines, can be used where the market is specialized.
The objective of customer profiling 24 is to determine a set of different customer information requirements, each requirement relating to a particular "customer profile". As with Market Segmentation the starting point is a set of keywords and phrases. In the case of Market Segmentation we are interested in the totality of available information within the range defined by the keywords and phrases. With customer profiling we are interested in clustering search queries, chosen from our set of keywords and phrases, in such a way that each cluster contains queries used by customers in search of a particular class of information resource.
The information available consists of triples (Q, U, R) where Q is a query, U is the URL (Uniform Resource Locator) which was selected in response to that query and R is the rank of the selected URL with respect to the query. This is termed "clickthrough" data and is available from search engines.
The value of this data for clustering queries is shown by the following related observations. If two different users search with the terms "fly" and "ant" but select the same URL, there is evidence that the search terms are related to a common information requirement. Similarly, if two distinct users search on the same term "ant" and visit different URLs, there is some evidence that these two URLs are related. Note that such evidence is statistical; the term "law", for example, might relate to either the legal system or physics.
There are three kinds of information available for clustering: 1. the similarity between queries; 2. the similarity between URLs; and 3. the link structure between queries and URLs.
Similarity between queries can be defined as the proportion of words or phrases which they have in common. Similarity between URLs is defined in terms of the distance measure described above under Market Segmentation. The link structure between queries and URLs can be described as a bipartite graph. The "white" nodes of the graph are the unique queries, and the "black" nodes are URLs. The similarity between two nodes of the same color is the proportion of links they share compared to their total number of links. It is also possible to assign a weight to links in this graph, the weight being a function of the ranking of the URL selected from a query presented to a search engine.
There are some natural variations of these similarity measures. For example, search queries which are permutations of each other can be considered equal, and queries can be clustered by containment in a natural manner.
A similarity measure between the queries (white vertices), based on a combination of the above similarities, and a complementary measure between the URLs can be generated by a weighted combination of the individual similarities.
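Under one reading of these definitions, the link-structure similarity and its weighted combination with word overlap could be sketched as follows; the Jaccard-style ratios and the equal weights are illustrative assumptions rather than the patent's prescription.

# Sketch of the clickthrough bipartite graph and same-colour node similarity:
# the proportion of links two nodes share relative to their total links.
from collections import defaultdict

def build_graph(clickthrough):
    """clickthrough: iterable of (query, url, rank) triples."""
    query_links, url_links = defaultdict(set), defaultdict(set)
    for query, url, _rank in clickthrough:
        query_links[query].add(url)
        url_links[url].add(query)
    return query_links, url_links

def link_similarity(links_a, links_b):
    total = len(links_a | links_b)
    return len(links_a & links_b) / total if total else 0.0

def combined_query_similarity(q1, q2, query_links, w_text=0.5, w_link=0.5):
    """Weighted combination of word overlap and shared-link proportion."""
    words1, words2 = set(q1.lower().split()), set(q2.lower().split())
    text_sim = len(words1 & words2) / max(len(words1 | words2), 1)
    return w_text * text_sim + w_link * link_similarity(query_links[q1], query_links[q2])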
The clustering algorithm, which clusters both URLs and queries, proceeds as follows. It is a version of Hierarchical Agglomerative Clustering (HAC). It proceeds on the assumption that the number of URLs is very considerably larger than the number of queries.
Algorithm Two
1. The two query nodes with the greatest similarity are merged. Record this merger.
2. The most similar URLs are merged. This is done a reasonably large number of times, since there are many more URLs than queries. Record these mergers.
3. Go to step 1 unless the number of queries has been reduced below a threshold.
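A minimal sketch of this alternating merge procedure follows, assuming query_sim(a, b) and url_sim(a, b) functions built from the combined similarities above; the batch size and stopping threshold are illustrative.

# Sketch of Algorithm Two: alternately merge the most similar pair of query
# clusters and a batch of URL cluster pairs, recording each merger, until the
# number of query clusters falls below a threshold.
def hac_clusters(queries, urls, query_sim, url_sim, query_threshold=10, url_batch=20):
    q_clusters = [{q} for q in queries]
    u_clusters = [{u} for u in urls]
    history = []
    while len(q_clusters) > max(query_threshold, 1):
        i, j = most_similar_pair(q_clusters, query_sim)          # step 1
        q_clusters[i] |= q_clusters.pop(j)
        history.append(("query", frozenset(q_clusters[i])))
        for _ in range(url_batch):                               # step 2: many URL merges
            if len(u_clusters) < 2:
                break
            i, j = most_similar_pair(u_clusters, url_sim)
            u_clusters[i] |= u_clusters.pop(j)
            history.append(("url", frozenset(u_clusters[i])))
    return q_clusters, u_clusters, history

def most_similar_pair(clusters, sim):
    """Average-linkage search over all cluster pairs (quadratic, fine for a sketch)."""
    best_score, best_pair = -1.0, (0, 1)
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            scores = [sim(a, b) for a in clusters[i] for b in clusters[j]]
            score = sum(scores) / len(scores)
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair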
The end result of this algorithm is a hierarchical clustering of both queries and URLs. For any query cluster there is an associated set of URL clusters.
We make the assumption for any query cluster that each URL cluster corresponds to a distinct market for the query cluster.
The goal of optimization is to position an information resource (a Web site, viewed as a set of web pages) so as to maximize its initial value, and to continually evaluate the situation on an ongoing basis so as to maximize its continuing value.
The goal is not to maximize the number of visitors to the site, but to maximize the number of visitors who pay to consume the site's resources. This may mean making a purchase, or just taking the time to read articles and information on the site.
It is difficult to determine the conversion rate of visitors who consume unless one can investigate in detail the behavior of visitors and perform experiments. We assume that it is possible to optimise conversion rate separately from visitor rate once the initial positioning of the site or page has been determined.
The goal of optimization is to generate a numerical measure of the "fitness" of a web page/site. The following assumptions are made:
• Pages which are semantically similar with respect to the distance measure described in the section titled "Market Segmentation", which lie in the same cluster with respect to the Customer Profile, and which have similar strengths, will attract a similar number of visitors. This is a base assumption, in that it justifies the use of a numerical measure of Web site utility.
• The ratio of the number of visitors to cluster size is a direct determinant of the value of a cluster, as larger values imply more visitors for any site in the cluster.
• A query cluster which relates to a URL cluster with low average rank is interesting, since users have consistently chosen low-ranking URLs from search queries.
These observations can be used to assign a numerical rank to web pages and web sites. The factors used to determine ranking are:
1. The cluster value (visitors/element).
2. The within-cluster ranking, determined by ranking score on queries within the cluster's associated queries. This ranking is comparative.
3. The relevance of the cluster to the product being sold.
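A toy combination of these three factors into a single score might look like the following; the weights, and the way the comparative within-cluster rank is turned into a 0-1 score, are illustrative assumptions and not specified by the patent.

# Sketch: combine the three ranking factors into a single numerical page score.
def page_score(visitors, cluster_size, within_cluster_rank, pages_in_cluster,
               product_relevance, weights=(0.4, 0.3, 0.3)):
    cluster_value = visitors / max(cluster_size, 1)                               # factor 1
    rank_score = 1.0 - (within_cluster_rank - 1) / max(pages_in_cluster - 1, 1)   # factor 2 (comparative)
    return (weights[0] * cluster_value
            + weights[1] * rank_score
            + weights[2] * product_relevance)                                     # factor 3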
We now describe the optimisation process with reference to Figure 2. Optimisation is typically performed for either a web site or a small group of interrelated pages, for example those describing a particular product.
1. Produce an initial set of keywords and phrases 21.
2. Produce a market segmentation 23 consisting of a similarity measure between web pages, and a clustering of sites/pages, as described in Algorithm One above.
3. Produce a Customer Profile 24, as described in Algorithm Two above.
4. Create the initial site and web pages 25.
5. Determine the degree of membership of pages in the query clusters produced in Step 3.
6. Assign pages to clusters 26 based on a determination of how their numerical ranking can be optimised using the difference between the similarity measure produced in Step 2 (the information and services which are currently available, as indicated by search engines) and the similarity measure between queries produced in Step 3 (what the users/customers want, as indicated by searches and the choices made by users as a result of searches). This process consists of the following steps: (a) estimate the clusters most relevant to the product being sold; (b) modify the page in terms of keywords and language to minimize its distance to the cluster centre 27.
7. Make the pages/site live.
8. Monitor the numerical ranking of pages 28. This is necessary to determine factor 2 above (the within-cluster ranking).
9. Monitor the ranking of the site on a continual basis by repeating steps 2, 3 and 8.
This embodiment provides a means for continuous adjustment of a Web site or Web pages in order to optimise it with respect to search engines, comprising a first means for automatic clustering of relevant pages and sites via enumerated permutations of keywords and phrases and the use of searches for relevant web sites, and a second means for customer profiling based on the clustering of the first means and the processing of aggregated queries to search engines, said first and second means being used by an evaluation means which provides an automatic numerical ranking of pages and sites, said evaluation means providing continuous adjustment and optimisation of web sites and web pages.
The present invention provides a computer program that functions in a unique manner to provide an analysis of web content and user search request information. The invention provides a novel tool for the analysis of web sites and search engines, and for the modification of web sites to allow a user to improve their search engine rankings. The ability of a computer operating a computer program in accordance with the present invention to retrieve and analyse web content and queries in the manner described in the application provides an extremely valuable new and inventive technical contribution to the art.

Claims

Claims
1. A method of generating a measure of query relevance with respect to search engines, the method comprising the steps of: • receiving first keywords; • retrieving search results from at least one search engine using said first keywords, said search results comprising web addresses; • retrieving web content associated with said web addresses; • generating an address similarity measure between said web addresses using their web content; • retrieving query statistics relating to at least one search engine using second keywords, said query statistics comprising queries and information associated with said queries; • generating a query similarity measure between said queries using said query statistics; and • generating a measure of query relevance based on the difference between said address similarity measure and said query similarity measure.
2. The method of claim 1 further comprising the step of expanding said first keywords.
3. The method of any previous claim wherein the method further comprises the step of processing said retrieved web content.
4. The method of any previous claim wherein said first keywords further comprise phrases.
5. The method of any previous claim wherein said step of generating an address similarity measure comprises positioning web addresses in a first parameter space.
6. The method of claim 5 wherein said step of positioning web addresses uses latent semantic analysis.
7. The method of any of claims 5 to 6 wherein said step of positioning web addresses uses web address rank in search results.
8. The method of any previous claim wherein said step of generating an address similarity measure further comprises associating a spread with said positioned web addresses.
9. The method of any previous claim wherein said queries comprise second keywords.
10. The method of claim 9 wherein said second keywords and said first keywords are identical .
11. The method of claim 9 wherein said second keywords comprise keywords related to said first keywords .
12. The method of any previous claim wherein said queries further comprise phrases.
13. The method of any previous claim wherein said information associated with said queries comprises frequency information.
14. The method of any previous claim wherein said information associated with said queries comprises clickthrough information.
15. The method of any previous claim wherein said information associated with said queries comprises geographic information.
16. The method of any previous claim wherein said information associated with said queries comprises demographic information.
17. The method of any previous claim wherein said step of generating a query similarity measure comprises the step of positioning said queries in a parameter space.
18. The method of claim 17 wherein said step of positioning said queries in a parameter space comprises the step of positioning said queries in said first parameter space.
19. The method of any previous claim wherein said step of generating a measure of query relevance further comprises normalising said address and query similarity measures .
20. The method of any previous claim wherein said step of generating a measure of query relevance further comprises operating on said address and query similarity measures to provide a difference measure.
21. The method of claim 20 wherein said step of operating on said address and query similarity measures comprises the steps of: • smoothing said query similarity measure in said parameter space ; and • subtracting said address similarity measure from said query similarity measure in said parameter space.
22. A method of optimising web content comprising the steps : • generating a measure of query relevance according to any previous claim; • determining at least one set of optimal queries using said measure of query relevance; • determining a set of optimised keywords from said at least one set of optimal queries; and • applying said set of optimised keywords to sample web content .
23. A method of continuously optimising web content comprising the step of repeating the method of claim 22.
24. The method of any of claims 22 to 23 wherein said step of determining at least one set of optimal queries comprises clustering.
25. The method of any of claims 22 to 24 wherein said step of determining at least one set of optimal queries comprises applying a threshold.
26. A method of evaluating web content comprising the steps: • generating a measure of query relevance according to any of claims 1 to 21; • determining a rating of sample web content using said sample web content and said measure of query relevance.
27. The method of claim 26 wherein said step of determining a rating of sample web content further comprises the steps: • determining at least one set of optimal queries using said measure of query relevance; and • determining a rating of said sample web content using the overlap of said sample web content with said at least one set of optimal queries.
28. The method of claim 27 wherein said step of determining at least one set of optimal queries comprises clustering.
29. The method of any of claims 27 to 28 wherein said step of determining at least one set of optimal queries comprises applying a threshold.
30. A computer program comprising program instructions for causing a computer to perform the method of any previous claim.
31. The computer program of claim 30, embodied on a record medium.
32. The computer program of claim 30, stored in a computer memory.
33. The computer program of claim 30, embodied in read-only memory.
34. The computer program of claim 30, carried on an electrical carrier signal.
PCT/GB2004/003780 2003-09-03 2004-09-03 Improved search engine optimisation WO2005024661A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0604247A GB2419993A (en) 2003-09-03 2004-09-03 Improved search engine optimisation

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0320583.8 2003-09-03
GB0320583A GB2405709A (en) 2003-09-03 2003-09-03 Search engine optimization using automated target market user profiles

Publications (2)

Publication Number Publication Date
WO2005024661A2 true WO2005024661A2 (en) 2005-03-17
WO2005024661A8 WO2005024661A8 (en) 2006-01-05

Family

ID=28686802

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2004/003780 WO2005024661A2 (en) 2003-09-03 2004-09-03 Improved search engine optimisation

Country Status (2)

Country Link
GB (2) GB2405709A (en)
WO (1) WO2005024661A2 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108170665B (en) * 2017-11-29 2021-06-04 有米科技股份有限公司 Keyword expansion method and device based on comprehensive similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6327590B1 (en) * 1999-05-05 2001-12-04 Xerox Corporation System and method for collaborative ranking of search results employing user and group profiles derived from document collection content analysis
US6564210B1 (en) * 2000-03-27 2003-05-13 Virtual Self Ltd. System and method for searching databases employing user profiles
US6687696B2 (en) * 2000-07-26 2004-02-03 Recommind Inc. System and method for personalized search, information filtering, and for generating recommendations utilizing statistical latent class models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No Search *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7487144B2 (en) 2006-05-24 2009-02-03 Microsoft Corporation Inline search results from user-created search verticals
WO2012027022A1 (en) * 2010-08-23 2012-03-01 Vistapprint Technologies Limited Search engine optmization assistant
US8990206B2 (en) 2010-08-23 2015-03-24 Vistaprint Schweiz Gmbh Search engine optimization assistant
US10127314B2 (en) 2012-03-21 2018-11-13 Apple Inc. Systems and methods for optimizing search engine performance
FR3032291A1 (en) * 2015-02-04 2016-08-05 Jalis TOOL AND METHOD FOR IMPROVING THE REFERENCING OF AN INTERNET SITE
EP3079076A1 (en) * 2015-04-10 2016-10-12 Pixalione Method, device and program for determining a semantic gap
FR3034893A1 (en) * 2015-04-10 2016-10-14 Pixalione METHOD FOR DETERMINING SEMANTIC GAP, DEVICE AND PROGRAM THEREOF
EP3155533A4 (en) * 2015-06-29 2018-04-11 Nowfloats Technologies Pvt. Ltd. System and method for optimizing and enhancing visibility of the website

Also Published As

Publication number Publication date
GB2405709A (en) 2005-03-09
GB0604247D0 (en) 2006-04-12
GB2419993A (en) 2006-05-10
GB0320583D0 (en) 2003-10-01
WO2005024661A8 (en) 2006-01-05

Similar Documents

Publication Publication Date Title
US6112203A (en) Method for ranking documents in a hyperlinked environment using connectivity and selective content analysis
Diligenti et al. Focused Crawling Using Context Graphs.
US6871202B2 (en) Method and apparatus for ranking web page search results
US6738678B1 (en) Method for ranking hyperlinked pages using content and connectivity analysis
US7636714B1 (en) Determining query term synonyms within query context
US8312035B2 (en) Search engine enhancement using mined implicit links
US7197497B2 (en) Method and apparatus for machine learning a document relevance function
JP4994243B2 (en) Search processing by automatic categorization of queries
US6418433B1 (en) System and method for focussed web crawling
US6795820B2 (en) Metasearch technique that ranks documents obtained from multiple collections
CN105045875B (en) Personalized search and device
US20030014501A1 (en) Predicting the popularity of a text-based object
US20050060290A1 (en) Automatic query routing and rank configuration for search queries in an information retrieval system
EP1669895A1 (en) Intent-based search refinement
WO2006007229A1 (en) Method and apparatus for retrieving and indexing hidden web pages
Nasraoui et al. A framework for mining evolving trends in web data streams using dynamic learning and retrospective validation
Singh et al. A comparative study of page ranking algorithms for information retrieval
Ding et al. User modeling for personalized Web search with self‐organizing map
WO2005024661A2 (en) Improved search engine optimisation
Hansen et al. Using navigation data to improve IR functions in the context of web search
Ustinovskiy et al. An optimization framework for weighting implicit relevance labels for personalized web search
Mishra et al. An effective algorithm for web mining based on topic sensitive link analysis
Maratea et al. An heuristic approach to page recommendation in web usage mining
Yuan et al. Automatic user goals identification based on anchor text and click-through data
Cheng Knowledgescapes: A probabilistic model for mining tacit knowledge for information retrieval

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
D17 Declaration under article 17(2)a
WWE Wipo information: entry into national phase

Ref document number: 0604247.7

Country of ref document: GB

Ref document number: 0604247

Country of ref document: GB

122 Ep: pct application non-entry in european phase