Improved Search Engine Optimisation
The invention relates to the development and maintenance of a Web site or other information resource, where statistical data from the Web and from search engine usage is used to optimise the positioning of the Web site or information resource in search engine results.
Users of the Web often need to find specific information. An increasingly common tool for meeting such specific information requirements is the Search Engine. Search Engines provide an automated index of the Web by systematically exploring it and recording and indexing the Web sites they visit.
To find information via a search engine, a query is submitted, consisting of a set of search terms, and a ranked list of results is returned.
It is in the interests of a Web site to be as highly ranked as possible with respect to relevant queries. The process of tuning a web site in order to maximize its ranking is known as Search Engine Optimization.
Search Engine Optimization is traditionally a largely manual process, which can involve some or all of the following steps:
1. Identifying the market served by the web site (market segmentation).
2. Identifying competitors (competitive analysis).
3. Selecting an appropriate set of search keywords.
4. Designing or modifying the site to maximize its search engine "visibility".
It is an object of the present invention to generate a measure of query relevance with respect to search engines. It is a further object of the present invention to optimise web content with respect to search engines. It is a further object of the present invention to evaluate web content with respect to search engines.
According to a first aspect of the present invention, there is provided a method of generating a measure of query relevance with respect to search engines, the method comprising the steps of:
• receiving first keywords;
• retrieving search results from at least one search engine using said first keywords, said search results comprising web addresses;
• retrieving web content associated with said web addresses;
• generating an address similarity measure between said web addresses using their web content;
• retrieving query statistics relating to at least one search engine using second keywords, said query statistics comprising queries and information associated with said queries;
• generating a query similarity measure between said queries using said query statistics; and
• generating a measure of query relevance based on the difference between said address similarity measure and said query similarity measure.
Preferably the method further comprises the step of expanding said first keywords.
Preferably the method further comprises the step of processing said retrieved web content.
Optionally said first keywords further comprise phrases.
Preferably said step of generating an address similarity measure comprises positioning web addresses in a first parameter space.
Preferably said step of positioning web addresses uses latent semantic analysis.
Preferably said step of positioning web addresses uses web address rank in search results.
Preferably said step of generating an address similarity measure further comprises associating a spread with said positioned web addresses.
Preferably said queries comprise second keywords.
Preferably said second keywords and said first keywords are identical.
Alternatively said second keywords comprise keywords related to said first keywords.
Preferably said queries further comprise phrases.
Preferably said information associated with said queries comprises frequency information.
Optionally said information associated with said queries comprises clickthrough information.
Optionally said information associated with said queries comprises geographic information.
Optionally said information associated with said queries comprises demographic information.
Preferably said step of generating a query similarity measure comprises the step of positioning said queries in a parameter space.
Preferably said step of positioning said queries in a parameter space comprises the step of positioning said queries in said first parameter space.
Preferably said step of generating a measure of query relevance further comprises normalising said address and query similarity measures.
Preferably said step of generating a measure of query relevance further comprises operating on said address and query similarity measures to provide a difference measure.
Preferably said step of operating on said address and query similarity measures comprises the steps of:
• smoothing said query similarity measure in said parameter space; and
• subtracting said address similarity measure from said query similarity measure in said parameter space.
According to a second aspect of the present invention there is provided a method of optimising web content comprising the steps of:
• generating a measure of query relevance according to the method of the first aspect;
• determining at least one set of optimal queries using said measure of query relevance;
• determining a set of optimised keywords from said at least one set of optimal queries; and
• applying said set of optimised keywords to sample web content.
Preferably the method according to the second aspect is repeated so as to continuously optimise said sample web content.
Preferably said step of determining at least one set of optimal queries comprises clustering.
Preferably said step of determining at least one set of optimal queries comprises applying a threshold.
According to a third aspect of the present invention there is provided a method of evaluating web content comprising the steps of:
• generating a measure of query relevance according to the method of the first aspect; and
• determining a rating of sample web content using said sample web content and said measure of query relevance.
Preferably said step of determining a rating of sample web content further comprises the steps of:
• determining at least one set of optimal queries using said measure of query relevance; and
• determining a rating of said sample web content using the overlap of said sample web content with said at least one set of optimal queries.
Preferably said step of determining at least one set of optimal queries comprises clustering.
Preferably said step of determining at least one set of optimal queries comprises applying a threshold.
According to a fourth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the first aspect of the present invention.
According to a fifth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the second aspect of the present invention.
According to a sixth aspect of the present invention, there is provided a computer program comprising program instructions for causing a computer to perform the method of the third aspect of the present invention.
The computer programs may be embodied on a record medium, stored in a computer memory, embodied in read-only memory or carried on an electrical carrier signal.
Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:
Figure 1 shows the overall control flow of the preferred embodiment of the present invention.
Figure 2 shows the overall control flow of an embodiment of the present invention.
Figure 3 shows the means for determining overall market segmentation, and a measure of similarity between Web pages in one embodiment of the present invention.
Embodiments of the present invention provide a methodology for continuous adjustment of a site in order to optimise it with respect to search engines. Accordingly, an embodiment of the present invention provides a means for providing a quantitative ranking for a Web site by evaluating the site with respect to a given market segment and competitive environment. The ranking mechanism, the competitive environment and the market segmentation are produced automatically by appropriate interaction with one or more search engines. The automatic quantitative ranking allows a search procedure to continually optimise the ranking by providing small changes which can be continuously evaluated.
The principal idea underlying the invention is that optimization should exploit the difference between the information and services which are currently available, as indicated by Search Engines, and what the users/customers want, as indicated by searches and the choices made by users as a result of those searches.
With reference to Figure 1, a preferred embodiment of the present invention is shown.
First, a measure of query relevance with respect to search engines is generated by:
• generating keywords and, optionally, phrases 2. The keywords may be expanded by thesaural expansion or by finding related keywords from an analysis of actual search engine query phrases;
• submitting the keywords/phrases to one or more search engines 4 and retrieving search results that contain URLs. The URLs may be pruned by removing duplicates;
• retrieving web content 6 from the web sites and pages pointed to by the URLs. Java programs are used for this spidering software. The web content is distilled by removing common words and stemming;
• generating a URL address similarity measure 8 between the URLs using the web content, by positioning the URLs in a parameter space using Latent Semantic Analysis. In this space each URL is a point, and a spread can be assigned to the point based on the similarity of the words in the web content associated with it. Thus the URL can be represented by a Gaussian function in the parameter space. The URL rank obtained from the search results may be used in the positioning;
• retrieving query statistics relating to at least one search engine 4 using keywords that are the same as, similar to or related to those submitted to the search engines in the above step. The query statistics are made up of phrases/keywords and frequency information. They may also contain geographic, demographic and/or clickthrough information associated with the queries;
• generating a query similarity measure 10 between the queries by positioning the queries in the same parameter space as the URLs (or one that can be mapped onto it); and
• generating a measure of query relevance 12 based on the difference between the URL similarity measure and the query similarity measure. This is done by normalising the two similarity measures (for example by scaling one of them), smoothing the query similarity measure, and then subtracting the URL similarity measure from the query similarity measure. A sketch of this computation is given below.
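By way of illustration only, the following Python sketch shows this difference computation, assuming URLs and queries have already been positioned in a shared two-dimensional parameter space. All positions, spreads and frequencies are invented for the example; the Gaussian kernel used to accumulate query frequency also performs the smoothing step.

```python
import numpy as np

# Illustrative positions in a shared 2-D LSA parameter space (assumed inputs).
url_positions = np.array([[0.2, 0.3], [0.8, 0.7], [0.25, 0.35]])
url_spreads = np.array([0.10, 0.15, 0.10])       # per-URL Gaussian spread
query_positions = np.array([[0.3, 0.3], [0.7, 0.2]])
query_freqs = np.array([120.0, 45.0])            # query frequency statistics

def gaussian_density(points, weights, spreads, grid):
    """Sum of isotropic Gaussians evaluated at each grid sample point."""
    density = np.zeros(len(grid))
    for p, w, s in zip(points, weights, spreads):
        d2 = np.sum((grid - p) ** 2, axis=1)
        density += w * np.exp(-d2 / (2 * s ** 2))
    return density

# Evaluate both similarity measures on a common grid over the space.
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.column_stack([xs.ravel(), ys.ravel()])

url_density = gaussian_density(url_positions, np.ones(len(url_positions)),
                               url_spreads, grid)
query_density = gaussian_density(query_positions, query_freqs,
                                 np.full(len(query_positions), 0.12), grid)

# Normalise (here by scaling each to a unit maximum), then subtract: regions
# where the result is positive are queried for but relatively under-served.
relevance = query_density / query_density.max() - url_density / url_density.max()
```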
XML is used as the data structure for the storage of address and query similarity measure data. These technologies were chosen to make the invention portable across numerous different platforms, thus allowing target market user profiles to be computed using grid computing techniques. It should be noted that the computation can be performed on one or many computers, and the software could have been written in numerous different programming languages. The XML data structure can alternatively be represented within a SQL (or similar) database table.
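The patent does not fix a schema, but by way of illustration a similarity-measure record might be serialised to XML as follows; the element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema: one <point> per positioned URL or query.
root = ET.Element("similarity-data")
for kind, label, (x, y), spread in [
    ("url", "http://www.example.com/", (0.2, 0.3), 0.10),
    ("query", "cheap flights", (0.3, 0.3), 0.12),
]:
    point = ET.SubElement(root, "point", type=kind, label=label)
    ET.SubElement(point, "position", x=str(x), y=str(y))
    ET.SubElement(point, "spread").text = str(spread)

# The same records could equally be rows in a SQL table, as noted above.
ET.ElementTree(root).write("similarity.xml", encoding="utf-8",
                           xml_declaration=True)
```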
After generating the measure of query relevance, a web page may be optimised by:
• determining one or more sets of optimal queries and keywords 14 by clustering the measure of query relevance. The clustering can be achieved by applying a threshold, as sketched below. Alternatively the clustering may be done on the URL similarity measure and/or the query similarity measure before generating the measure of query relevance. Hence the clustering yields keywords that users are looking for, yet for which relatively few search results are found; and
• applying the set of optimised keywords 18 to the web page.
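A minimal sketch of the thresholding form of this clustering, using an invented relevance surface as a stand-in for the query-minus-URL measure computed above; the threshold and query positions are illustrative.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
# Stand-in for the query-minus-URL relevance surface over the parameter space.
relevance_grid = ndimage.gaussian_filter(rng.normal(size=(50, 50)), sigma=4)

# Apply a threshold and label connected above-threshold regions: each labelled
# region is a candidate cluster of under-served demand.
mask = relevance_grid > relevance_grid.mean() + relevance_grid.std()
labels, n_clusters = ndimage.label(mask)
print(n_clusters, "candidate clusters of under-served queries")

# Queries positioned in the same space inherit the label of their grid cell.
query_positions = {"cheap flights": (0.3, 0.3), "last minute deals": (0.7, 0.2)}
for query, (x, y) in query_positions.items():
    cluster = labels[int(y * 49), int(x * 49)]
    if cluster:
        print(f"query '{query}' falls in cluster {cluster}")
```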
In order to achieve continuous optimisation of web pages and sites, the generation of query relevance and the web page optimisation above are repeated. This provides a closed loop control system.
After generating a measure of query relevance, a web page may be evaluated 16, giving a rating (or a ranking compared to others), by:
• determining one or more clusters of optimal queries from the measure of query relevance; and
• quantifying the overlap of the web page with the optimal query cluster(s), as sketched below.
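The patent does not specify how the overlap is quantified; as one illustrative possibility, a page could be rated by the Jaccard overlap between its distilled (stemmed, stop-word-free) terms and those of an optimal query cluster.

```python
def page_rating(page_terms: set[str], cluster_terms: set[str]) -> float:
    """Rate a page by its term overlap with an optimal query cluster
    (Jaccard coefficient; the rating scheme is illustrative only)."""
    if not page_terms or not cluster_terms:
        return 0.0
    return len(page_terms & cluster_terms) / len(page_terms | cluster_terms)

# Illustrative data: stemmed terms from a page and from a query cluster.
page = {"flight", "book", "cheap", "airlin", "destin"}
cluster = {"cheap", "flight", "last", "minut", "deal"}
print(f"rating: {page_rating(page, cluster):.2f}")   # 0.25
```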
The decision whether to modify a web page may be based on the evaluation.
A further embodiment of the present invention is presented below with reference to Figures 2 and 3. The objective of the competitive analysis performed in the market segmentation step 23 is to apply Statistical and Machine Learning techniques to develop clusters of web sites, where each cluster represents a related set of competitors.
The starting point for analysis is a basic set of keywords 21 and phrases relevant to the domain. In addition web sites which are known to fall in the set can be used, as described below.
Algorithm One
1. The keywords and phrases are used to generate a set of permutations 31 of subsets of the words and phrases. If there are n keywords, then there are n!/(k!(n-k)!) subsets of length k, and k! permutations of each such subset.
2. Each permutation is presented to a series of search engines 22.
3. Each web page retrieved from the first N hits on the search engine is saved 32, indexed by its associated permutation and ranked from 1 to N.
4. The web pages are processed to remove common words, and the words are stemmed 33 using the Porter algorithm.
5. The web pages retrieved in this way are filtered for duplicates and then Latent Semantic Analysis is used to create a similarity measure between sites 34. The similarity measure is the Euclidean distance between pages with respect to the first L latent components, where L is 50 in the preferred implementation.
6. The web pages are clustered 34 using k-means clustering. The clustering metric is the similarity measure determined by the Latent Semantic Analysis. Each cluster represents a sector of competition for the "product" defined via the keywords and phrases. A compact sketch of this procedure is given below.
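The patent describes its implementation in Java; purely for brevity of illustration, the following Python sketch of Algorithm One uses scikit-learn, stubs the search-engine retrieval of steps 2 and 3 with inline documents, and omits stemming. All keywords and page texts are invented.

```python
from itertools import permutations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

keywords = ["flights", "cheap", "holidays"]

# Step 1: permutations of subsets of the keywords, for all lengths 1..n.
queries = [" ".join(p) for k in range(1, len(keywords) + 1)
           for p in permutations(keywords, k)]
print(len(queries), "search queries generated")   # 3 + 6 + 6 = 15

# Steps 2-3 are stubbed: in the described system each query is submitted to
# search engines and the pages behind the first N hits are fetched and saved.
pages = [
    "cheap flights to europe book early for the best deals",
    "package holidays and cheap flights compare airlines",
    "luxury holidays five star resorts and spa breaks",
]

# Steps 4-5: stop-word removal happens inside the vectoriser; LSA is TF-IDF
# followed by a truncated SVD onto the leading latent components.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(pages)
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # patent prefers 50

# Step 6: k-means on the LSA coordinates; Euclidean distance in this space is
# the similarity measure, and each cluster is a sector of competition.
sectors = KMeans(n_clusters=2, n_init=10).fit_predict(lsa)
print(list(zip(sectors, pages)))
```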
The use of Latent Semantic Analysis (LSA) on web pages retrieved by searches on permutations of the original keywords and phrases reduces the dependence of results on the initial terms and phrases, since all the information in the retrieved pages is used to determine a clustering metric, and the effects of polysemy and synonymy are reduced.
The number of dimensions used from the Singular Value Decomposition in LSA is variable. In this embodiment of the invention, 50 dimensions are used.
The clusters can be decomposed in various ways, for instance geographically, orthogonally to the original construction via semantic analysis of keywords.
The clustering analysis uses only Web addresses acquired from searches. More sophisticated strategies for clustering, involving further search using web-bots to explore links not pursued by the standard search engines, can be used where the market is specialized.
The objective of customer profiling 24 is to determine a set of different customer information requirements, each requirement relating to a particular "customer profile". As with Market Segmentation, the starting point is a set of keywords and phrases. In the case of Market Segmentation we are interested in the totality of available information within the range defined by the keywords and phrases. With customer profiling we are interested in clustering search queries, chosen from our set of keywords and phrases, in such a way that each cluster contains queries used by customers in search of a particular class of information resource.
The information available consists of triples (Q, U, R) where Q is a query, U is the URL (Uniform Resource Locator) which was selected in response to that query and R is the rank of the selected URL with respect to the query. This is termed "clickthrough" data and is available from search engines.
The value of this data for clustering queries is shown by the following related observations. If two different users search with the terms "fly" and "ant" but select the same URL, there is evidence that the search terms are related to a common information requirement. Similarly, if two distinct users search on the same term "ant" and visit different URLs, there is some evidence that these two URLs are related. Note that such evidence is statistical; the term "law", for example, might relate to either the legal system or physics.
There are three kinds of information available for clustering:
1. The similarity between queries.
2. The similarity between URLs.
3. The link structure between queries and URLs.
Similarity between queries can be defined as the proportion of words or phrases which they have in common. Similarity between URLs is defined in terms of the distance measure described above in the context of Market Segmentation. The link structure between queries and URLs can be described as a bipartite graph. The "white" nodes of the graph are the unique queries, and the "black" nodes are URLs. The similarity between two nodes of the same colour is the proportion of links they share compared to their total number of links, as sketched below. It is also possible to assign a weight to links in this graph, the weight being a function of the ranking of the URL selected from a query presented to a search engine.
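By way of illustration, the shared-link similarity between two queries (white nodes) can be computed directly from clickthrough triples; the triples below are invented, echoing the "fly"/"ant" observation above.

```python
from collections import defaultdict

# Clickthrough triples (Q, U, R): query, selected URL, rank of that URL.
triples = [
    ("ant", "http://insects.example/ants", 1),
    ("fly", "http://insects.example/ants", 3),
    ("ant", "http://pests.example/control", 2),
    ("fly", "http://pests.example/control", 5),
    ("law", "http://legal.example/", 1),
]

# Bipartite link structure: each query maps to the set of URLs it led to.
links = defaultdict(set)
for query, url, _rank in triples:
    links[query].add(url)

def link_similarity(a: str, b: str) -> float:
    """Proportion of links two same-colour nodes share, relative to their
    total number of links (rank-based link weights are omitted here)."""
    total = links[a] | links[b]
    return len(links[a] & links[b]) / len(total) if total else 0.0

print(link_similarity("ant", "fly"))   # 1.0: identical URL sets
print(link_similarity("ant", "law"))   # 0.0: no shared URLs
```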
There are some natural variations of these similarity measures. For example, search queries which are permutations of each other can be considered equal, and queries can be clustered by containment in a natural manner.
A similarity measure between the queries (white vertices), based on a combination of the above similarities, and a complementary measure between the URLs, can be generated by a weighted combination of the individual similarities.
The clustering algorithm, which clusters both URLs and queries, is a version of Hierarchical Agglomerative Clustering (HAC). It proceeds on the assumption that the number of URLs is very considerably larger than the number of queries.
Algorithm Two
1. The two query nodes with the greatest similarity are merged. Record this merger.
2. The most similar URLs are merged. This is done a reasonably large number of times, since there are many more URLs than queries. Record these mergers.
3. Go to step 1, unless the number of queries has been reduced below a threshold.
The end result of this algorithm is a hierarchical clustering of both queries and URLs. For any query cluster there is an associated set of URL clusters.
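A simplified single-sided sketch of the merging loop of Algorithm Two, applied here to queries only and using an average-linkage update; the similarity matrix and threshold are invented, and the full algorithm alternates this step with URL merges as described above.

```python
import numpy as np

def hac_merge(similarity: np.ndarray, labels: list[str], threshold: int):
    """Repeatedly merge the two most similar nodes, recording each merger,
    until only `threshold` nodes remain."""
    sim = similarity.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)          # never merge a node with itself
    groups = [[label] for label in labels]
    mergers = []
    while len(groups) > threshold:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)  # row-major: i < j
        mergers.append((groups[i], groups[j]))
        groups[i] = groups[i] + groups[j]
        sim[i, :] = (sim[i, :] + sim[j, :]) / 2  # average-linkage update
        sim[:, i] = sim[i, :]
        sim[i, i] = -np.inf
        sim = np.delete(np.delete(sim, j, axis=0), j, axis=1)
        del groups[j]
    return groups, mergers

queries = ["cheap flights", "budget flights", "spa breaks"]
similarity = np.array([[1.0, 0.9, 0.1],
                       [0.9, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])
groups, mergers = hac_merge(similarity, queries, threshold=2)
print(groups)   # [['cheap flights', 'budget flights'], ['spa breaks']]
```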
We make the assumption that, for any query cluster, each associated URL cluster corresponds to a distinct market for that query cluster.
The goal of optimization is to position an information resource (a Web site, viewed as a set of web pages) so as to maximize its initial value, and to evaluate the situation on an ongoing basis so as to maximize its continuing value.
The goal is not to maximize the number of visitors to the site, but to maximize the number of visitors who pay to consume the site's resources. This may mean making a purchase, or just taking the time to read articles and information on the site.
It is difficult to determine the conversion rate of visitors who consume unless one can investigate in detail the behavior of visitors and perform experiments. We assume that it is possible to optimise conversion rate separately from visitor rate once the initial positioning of the site or page has been determined.
The goal of optimization is to generate a numerical measure of the "fitness" of a web page/site. The following assumptions are made:
• Pages which are semantically similar with respect to the distance measure described in the section titled "Market Segmentation", which lie in the same cluster with respect to the Customer Profile, and which have similar strengths, will attract a similar number of visitors. This is a base assumption, in that it justifies the use of a numerical measure of Web site utility.
• The ratio of the number of visitors to cluster size is a direct determinant of the value of a cluster, as larger values imply more visitors for any site in the cluster.
• A query cluster which relates to a URL cluster with low average rank is interesting, since users have consistently chosen low-ranking URLs from search queries.
These observations can be used to assign a numerical rank to web pages and web sites. The factors used to determine ranking are:
1. The cluster value (visitors/element).
2. The within-cluster ranking, determined by ranking score on queries within the cluster's associated queries. This ranking is comparative.
3. The relevance of the cluster to the product being sold.
A sketch of how these factors might be combined into a single score is given below.
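The patent does not give a formula for combining the three factors; as an illustrative possibility only, a weighted sum over pre-normalised factors might look as follows, with entirely hypothetical weights.

```python
def page_rank_score(cluster_value: float, within_cluster_rank: float,
                    product_relevance: float,
                    weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Combine the three ranking factors into one score. All inputs are
    assumed pre-normalised to [0, 1]; the weights are hypothetical."""
    w1, w2, w3 = weights
    return (w1 * cluster_value + w2 * within_cluster_rank
            + w3 * product_relevance)

# Example: high-value cluster, middling within-cluster rank, relevant product.
print(page_rank_score(0.8, 0.5, 0.9))   # 0.73
```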
We now describe the optimisation process with reference to Figure 2. Optimisation is typically performed for either a web site or a small group of interrelated pages, for example those describing a particular product.
1. Produce an initial set of keywords and phrases 21.
2. Produce a market segmentation 23, consisting of a similarity measure between web pages and a clustering of sites/pages, as described in Algorithm One above.
3. Produce a Customer Profile 24, as described in Algorithm Two above.
4. Create the initial site and web pages 25.
5. Determine the degree of membership of pages in the query clusters produced in Step 3.
6. Assign pages to clusters 26 based on a determination of how their numerical ranking can be optimised, using the difference between the similarity measure produced in Step 2 (the information and services which are currently available, as indicated by Search Engines) and the similarity measure between queries produced in Step 3 (what the users/customers want, as indicated by searches and the choices made by users as a result of searches). This process consists of the following steps:
(a) Estimate the clusters most relevant to the product being sold.
(b) Modify the page in terms of keywords and language to minimize its distance to the cluster centre 27.
7. Make the pages/site live.
8. Monitor the numerical ranking of pages 28. This is necessary to determine factor 2 above (the within-cluster ranking).
9. Monitor the ranking of the site on a continual basis by repeating steps 2, 3 and 8.
This embodiment provides a means for continuous adjustment of a Web site or Web pages in order to optimise it with respect to search engines, comprising: a first means for automatic clustering of relevant pages and sites via enumerated permutations of keywords and phrases and the use of searches for relevant web sites; a second means for customer profiling based on the clustering of the first means and the processing of aggregated queries to search engines; and an evaluation means, using said first and second means, which provides an automatic numerical ranking of pages and sites and thereby continuous adjustment and optimisation of web sites and web pages.
The present invention provides a computer program that functions in a unique manner to provide an analysis of web content and user search request information. The invention provides a novel tool for the analysis of web sites and search engines, and for the modification of web sites to allow a user to improve their search engine rankings. The ability of a computer operating a computer program in accordance with the present invention to retrieve and analyse web content and queries in the manner described in the application provides an extremely valuable new and inventive technical contribution to the art.