WO2007096659A1 - Phishing mitigation

Info

Publication number: WO2007096659A1
Application number: PCT/GB2007/000673
Authority: WO
Inventors: Jianxin Yan
Original assignee: University Of Newcastle Upon Tyne
Priority date: 2006-02-27
Filing date: 2007-02-27
Publication date: 2007-08-30
Also published as: GB0603888D0

Abstract

The invention provides methods and systems operable in a computer network to detect and deal with fraudulent resource pages in a linked web of resource pages hosted on the network and accessible to end users, the resource pages being organised in a directory structure that identifies the physical location of each page. In a first embodiment, new or modified pages in the directory structure are compared with a set of features characteristic of fraudulent resource pages and, based on the comparison, a determination as to whether the new or modified page is fraudulent or not can be made. Fraudulent pages can be, for example, deleted from the directory structure once detected. In another embodiment, pages are scanned on-the-fly when requested by users and the user is prevented from accessing fraudulent pages. In such on-the-fly scanning, pages from domains that are protected in accordance with the first embodiment can be assumed to be legitimate and made accessible to a user without additional scanning.

Description

PHISHING MITIGATION

Field of the Invention

The present invention relates to methods and systems for mitigating phishing attacks.

Background

"Phishing" is a relatively new but increasingly prevalent form of Internet fraud. Phishers typically masquerade as a trustworthy individual or business, sending an electronic communication such as an email or instant message to lure end users to a spoofed web page wh ere they are requested to enter sensitive information (e.g. user IDs, passwords, credit card and bank account numbers). The spoofed web pages are created to look like a page provided by the trustworthy individual or business that the phisher is masquerading as.

As the problem grows there is an increasing desire to provide ways to mitigate the effects of these so called phishing attacks and/or to catch the perpetrators. Known systems for mitigating phishing attacks generally involve one of three approaches.

The first approach is to use anti-spam software to filter out the fraudulent email in the first place. This approach is, however, unreliable because it is a relatively straightforward task for determined phishers to get their emails past anti-spam software. In some cases, such anti-spam software will also filter out legitimate emails and to avoid missing such messages it is not uncommon for users to switch off anti-spam software altogether.

The second approach to mitigating phishing attacks is what might be referred to as "client-side defence". This requires the use of a software application (normally a browser extension) executed on the user's client to check, in real time, each web page downloaded to the user's browser application. The check looks for tell-tale characteristics of phishing web pages and alerts the user (e.g. with a pop-up dialogue box) if a web page they have accessed appears to be a phishing page. The user then decides whether to take heed of the warning or to ignore it and continue to use the page in question. One example of a browser plug-in that provides this sort of functionality is "SpoofGuard" developed at Stanford University and described in the paper by N Chou et al, "Client Side Defence Against Web-Based Identity Theft", Proceedings of the 2004 Network and Distributed System Security Symposium, San Diego, CA, February 2004.

The third known approach is to provide a "trusted path" between client and server, sensitive information only being communicated between client and server once both parties have been adequately authenticated to one another. This requires both the client and server to be modified, which in practice is a significant barrier to widely deploying such a solution. The onus is also still on the end user to decide whether the server they are transacting with is genuine or not. One example of this type of solution is a system known as "Dynamic Skin" described in the paper by R. Dhamija et al, "The Battle Against Phishing: Dynamic Security Skins", Proceedings of the 2005 Symposium on Usable Privacy and Security (SOUPS), Pittsburgh, Pennsylvania, July 2005.

It has also been proposed by Liu Wenyin (City University of Hong Kong) et al in their paper "Detection of Phishing WebPages based on Visual Similarity", WWW 2005, May 10-14 2005, Chiba, Japan, that a legitimate web page owner can use the approach disclosed in the paper to search the Web for suspicious web pages which are visually similar to the legitimate owner's own pages. It is impractical, however, to do this with conventional web searching technology (e.g. web 'crawlers') as fraudulent pages are often very short lived and may well have been moved or removed before they are indexed by a conventional search engine.

Summary of the Invention

It is a general aim of the present invention to provide an approach to mitigating phishing attacks that removes from the end user the burden of deciding whether any particular web page is a fraudulent ("phishing" or "spoofed") page by automatically identifying the location of fraudulent web pages and either removing them or alerting a skilled user (e.g. network administrator) to their presence at the earliest opportunity.

The approach adopted by embodiments of the present invention is to monitor one or more web page directories and when a web page in the directory is modified or a new page added to analyse the page to determine the likelihood that, in the context of the specific web directory, it is a fraudulent page. Pages that are very likely to be fraudulent may be removed automatically (or access to them blocked). Pages that are not clearly fraudulent but might be are preferably notified to a user with the appropriate knowledge and skills to decide whether the page is fraudulent or not.

A key feature of preferred embodiments of the present invention is that it provides a proactive approach to the mitigation of phishing attacks that is specifically designed to be deployed by internet service providers (including web hosting service providers) and other network operators. The invention is implemented by way of a daemon program that runs, preferably continuously or at least frequently, in the background. It enables these parties to monitor their domains and remove fraudulent web pages at the earliest opportunity, i.e. to 'clean their own house', thus removing the potential of becoming embroiled in subsequent legal proceedings.

In a first aspect, the present invention provides a method operable in a computer network for detecting fraudulent resource pages in a linked web of resource pages hosted on the network and accessible to end users, the resource pages being organised in a directory structure that identifies the physical location of each page, the method comprising: scanning the directory of resource pages and identifying when a new page is added to the directory and/or a page in the directory is modified; comparing each new or modified page with a set of features characteristic of fraudulent resource pages; and based on the comparison, making a determination as to whether the new or modified page is fraudulent or not.

The result of the determination may, for example, be:

- that the page is or is likely to be a fraudulent page;

- that the page is not a fraudulent page;

- that it is uncertain whether the page is fraudulent or not.

The determination may be arrived at by assigning the page a numeric suspicion score (e.g. an integer value between 1 and 100), based on the comparison with the set of features, which can then be compared against one or more predetermined threshold values. For example there may be a first, higher threshold above which the page is determined to be a fraudulent page and a second, lower threshold below which the page is determined not to be a fraudulent page, pages with a suspicion score between the two thresholds being ones for which there is uncertainty as to whether the page is fraudulent or not. The thresholds may be configurable.

In the first case, the page is preferably immediately made inaccessible to end users without the need for human intervention, for example by removing it from the directory structure, moving it to a different directory within the directory structure that is inaccessible to end users or by deleting the page altogether.

In the second case, no action need be taken.

In the third case, it may be desirable for an administrator with the appropriate knowledge and skill to make a final judgement as to whether the page is fraudulent or not. It may be desirable, pending such a judgement being made, to temporarily make the page in question inaccessible to end users, for example by moving it to an alternative directory within the directory structure that is accessible to the administrator only. At the same time, the administrator is preferably notified that a possibly fraudulent page has been detected. If on reviewing the page the administrator reaches the conclusion that it is not fraudulent then it can be returned to its original location in the directory structure. If, on the other hand, the administrator's determination is that the page is fraudulent, it can be permanently removed from the directory structure and deleted.
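The three-way determination described above can be illustrated with a minimal sketch. The function name, the `Verdict` type and the default threshold values (30 and 70) are assumptions made purely for illustration; the specification only requires a higher and a lower threshold, both of which may be configurable. How scores exactly equal to a threshold are treated is likewise an assumption.

```python
from enum import Enum


class Verdict(Enum):
    FRAUDULENT = "fraudulent"   # first case: make the page inaccessible
    LEGITIMATE = "legitimate"   # second case: no action needed
    UNCERTAIN = "uncertain"     # third case: refer to an administrator


def classify(score: int, lower: int = 30, upper: int = 70) -> Verdict:
    """Map a suspicion score to one of the three outcomes.

    The default thresholds are hypothetical values chosen for illustration.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score > upper:
        return Verdict.FRAUDULENT
    if score < lower:
        return Verdict.LEGITIMATE
    # Scores between the thresholds (inclusive here, by assumption)
    # are left for human judgement.
    return Verdict.UNCERTAIN
```

With these illustrative defaults, a score of 90 would be treated as fraudulent, a score of 10 as legitimate, and a score of 50 as uncertain.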

The complete directory of resource pages is preferably scanned periodically, typically at regular intervals, or even continuously in some cases (the next scan starting as soon as the preceding scan has finished). In this way, fraudulent pages can be quickly located and removed before many (in a best case before any) end users have accessed the page and been tricked into disclosing sensitive information. Once the page has been made inaccessible, a phisher's attempts to lure end users to the page, e.g. with email and/or instant messages providing a link to the page, are in vain because there is no longer a page to link to.

If there are any judgement calls to be made by human intervention, they can be made by a skilled and knowledgeable individual. There is no need for an end user to make such judgements.

In preferred embodiments of the invention, the web of linked resource pages is part of the "World Wide Web" (www) on the Internet, the resource pages being so called "web pages". The directory structure is, in this case, referred to as the web page directory (WPD). The WPD is an organised, conceptual directory structure which includes all document roots which are hosted within a given subnetwork ("subnet"). The WPD may include as few as one document root on a given server or may include all document roots from multiple servers in multiple physical locations which are maintained within the same subnet.

The Internet is a collection of many subnets, each with its own WPD. A subnet may be run by an organisation such as an internet service provider (ISP) or a web hosting service provider, or by other businesses or academic institutions for example. Each subnet may include one or more Internet domains. A subnet and its associated directory may be physically located at a single location/device or it may be distributed across multiple networked devices. The directory structure will typically map directly to the physical location of each page.

Multiple, separate instances of the method of the invention are preferably executed within a plurality of such subnets. One instance of the method may be run in each of the plurality of subnets or, in some embodiments, two or more instances of the method may run within a subnet. Where multiple instances of the method are running within any one subnet they preferably operate on distinct portions of the subnet's WPD.

A single deployment of the invention on a single subnet will prove effective at identifying and removing fraudulent web pages hosted within that subnet's domain(s) and will assist the subnet operator to avoid any potential legal liability. However, the invention is most effective if instances of the method are executed in a majority of the subnets within the overall network (e.g. Internet). Preferably, the invention would be deployed by the majority of large internet service providers (including specialised web hosting service providers), e-commerce sites and networks operated by financial institutions, among others.

To facilitate the identification of new and/or modified pages within the directory structure, a representation of the directory structure and the pages within it is preferably initially captured. As the directory is subsequently scanned a comparison can then be made with this previously captured representation and any differences identified. The captured representation of the directory structure can be updated over time to reflect the changes that have been identified (e.g. new pages, modified pages, new directories, and deleted or moved pages or directories).

The captured representation of the directory structure may be a simple copy of the complete directory structure and the pages in it. More preferably, to enable a more rapid comparison between the pages of the current directory structure and the previously captured representation, the representation comprises a signature for each page in place of the page itself. The signature is an identifier for the page that will change if the content of the page itself changes but which is significantly smaller than the page. The signature may, for example, be a hashed value, generated for instance by any one of a number of known hash functions including cryptographic hash functions such as the "Secure Hash Algorithm" (SHA) family of hash functions, e.g. SHA-256.

In use, as the directory structure is scanned, the signature (e.g. hash value) for each page is calculated and compared with the previously stored hash value. If the hash value has changed, the page is identified as one that has changed and it can be flagged for subsequent analysis to determine whether it might be a fraudulent page or not.
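By way of illustration only, the signature-based comparison might be sketched as follows. The function names and the use of an in-memory dictionary as the stored representation are assumptions for this sketch; SHA-256 is one of the hash functions named above.

```python
import hashlib
import os


def file_signature(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def take_snapshot(root: str) -> dict[str, str]:
    """Walk a document root and map each file path to its signature."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            snapshot[path] = file_signature(path)
    return snapshot


def changed_pages(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Paths that are new, or whose signature differs from the stored one."""
    return sorted(p for p, sig in current.items() if previous.get(p) != sig)
```

A page whose path is absent from the previous snapshot is reported as new, and a page whose stored signature differs is reported as modified; both would then be flagged for analysis.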

To gain the most benefit from embodiments of the method of the present invention, the comparison between calculated and stored hash values for all of the pages should be as quick as possible. A fast matching algorithm is preferably used, one example of which is a Bloom filter (B. Bloom, "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM, 13(7): 422-426, 1970).
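A Bloom filter of the kind referred to above could be sketched as follows. The class, its default sizes and the choice of deriving bit positions from a SHA-256 digest by double hashing are assumptions made for illustration, not details taken from the specification.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter; sizes here are illustrative defaults."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive several bit positions from one digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

One possible usage is to add a key such as the file path joined with its signature for every page in the stored representation. A key that is absent from the filter is then certainly new or modified. A key that is present is only probably unchanged, because Bloom filters admit false positives, so a deployment that cannot tolerate a missed change might confirm such hits against the stored signatures.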

As already noted above, any new or modified page in the directory structure is analysed to determine whether it might be fraudulent by comparing it with a set of features characteristic of fraudulent pages.

Some features of a web page (or other resource page on a network) may be indicative of a fraudulent page irrespective of the specific domain or other context in which the page resides. For convenience, these will be referred to as "common features". Where there are multiple instances of the method running in multiple subnets, the set of common features can be the same for all instances.

As with other inexact detection mechanisms, e.g. virus detection and spam email filtering, the common features can be deduced by looking at the characteristics of fraudulent web pages used in previously detected phishing attacks. In their discussion of "SpoofGuard" (see reference to paper above), Neil Chou et al suggest several possible characteristics of phishing pages. Some of those characteristics can conveniently be used in a set of common features in embodiments of the present invention. Due to the location of deployment of the present invention, i.e. by an internet service provider on their own subnet server(s), more features which are characteristic of fraudulent web pages can be employed. In the present invention, a set of common features might include, for example, amongst other things, the presence of obfuscated HTML or JavaScript source code which could be either escape encoded or encrypted, or the presence of JavaScript or cascading stylesheets which have been plagiarised from spoofed websites.
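As a rough illustration of one such common feature, the sketch below estimates how much of a page's source consists of escape sequences. The pattern, the function names and the 0.3 threshold are all hypothetical choices for this example; a practical detector would need tuning against real pages and would not rely on this single heuristic.

```python
import re

# Percent escapes, JavaScript \xNN and \uNNNN escapes, and HTML numeric
# character references. This list is an assumption, not exhaustive.
ESCAPE_PATTERN = re.compile(
    r"%[0-9a-fA-F]{2}|\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}|&#x?[0-9a-fA-F]+;"
)


def escape_density(source: str) -> float:
    """Fraction of the characters in `source` that belong to escapes."""
    if not source:
        return 0.0
    escaped = sum(len(m.group(0)) for m in ESCAPE_PATTERN.finditer(source))
    return escaped / len(source)


def looks_obfuscated(source: str, threshold: float = 0.3) -> bool:
    """Flag source whose escape density meets a hypothetical threshold."""
    return escape_density(source) >= threshold
```

A page made up largely of sequences such as `%3C%73%63` would be flagged, whereas ordinary markup with an occasional character reference would not.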

In some cases, a fraudulent page might also be recognisable by identifying page features that of themselves are innocuous but which are out of place in the context (e.g. specific domain) of the page. For example, it would be unusual for the web pages of an academic institution to include eBay or PayPal logos, so the presence of such logos could raise a suspicion that the page was fraudulent. It may also be that pages within some domains, or at certain levels in a directory structure all have fixed, predetermined structures or standard content elements or other standard features and the absence of this fixed structure or standard elements could suggest the page is fraudulent.

Preferably, therefore, as an alternative to or more preferably in addition to a set of common features, the determination in embodiments of the present invention of whether a page is fraudulent or not is made with reference to a set of context-specific (e.g. subnet specific, and/or domain specific and/or site specific) features that are characteristic of fraudulent pages in the respective context.

The context specific features will typically vary from subnet to subnet and, where multiple instances of the method are operating within any one subnet, they may vary between instances. The context specific features are preferably configurable (i.e. the specific features and their parameters can be selected and modified) individually for each instance of the method (e.g. each instance of the daemon program), e.g. by a subnet or domain administrator for instance.

The common feature set, on the other hand, is preferably updated centrally. For example, as is now common with anti-virus software, a central copy (or multiple 'central' copies) of the common feature set can be updated and copies of the updated feature set then distributed (preferably automatically) to the subnets in which instances of the method are operating. They may be distributed via the Internet for example.

The method of the invention is preferably implemented by way of a daemon; that is a software application that runs, preferably continuously or at least frequently, in the background without the need for regular human intervention.

In a second aspect, the invention provides a system for detecting fraudulent resource pages in a linked web of resource pages accessible to end users, the resource pages being organised in a directory structure that identifies the physical location of each page, the system comprising: means for scanning the directory of resource pages and identifying when a new page is added to the directory and/or a page in the directory is modified; means for comparing each new or modified page with a set of features characteristic of fraudulent resource pages; and means for making a determination as to whether a page is fraudulent or not based on an output from the comparison with said set of features.

The components of the system are preferably implemented in software code executable on one or more networked devices, for example one or more daemon programs. The components are preferably operable in accordance with any one or more of the preferred and optional features of the method set out above.

Advantageously, knowledge that a web page (or other resource page) is being served from a subnet (e.g. domain or domains) in which an embodiment of the present invention is actively detecting and removing fraudulent pages can be used to streamline "on-the-fly" (i.e. real time) detection of fraudulent pages.

More specifically, pages served from subnets that are protected by an embodiment of the present invention are much less likely to be phishing pages than those served from unprotected domains. Approaches to on-the-fly traffic scanning can therefore be adapted to treat pages retrieved from protected domains ("trusted pages") differently from pages retrieved from unprotected domains. For instance, trusted pages might undergo a less thorough (and consequently less resource intensive) check than unprotected pages. In some cases, trusted pages may not be checked at all by a phishing filter.

This approach can significantly reduce the work load of gateway devices (including e.g. firewalls, routers, proxy servers, etc) that operate to filter Internet traffic (looking e.g. for spam, worms, viruses, phishing pages, etc). Similarly, the work load of client devices running Internet filtering applications can also be reduced by adopting this approach of distinguishing trusted pages and treating them differently from other pages.

Applications that operate to filter web pages (or other resource pages) can, for instance, use a "white list" of IP addresses or URLs of domains that are protected by an embodiment of the present invention to determine whether any particular page is to be treated as a trusted page or not; pages retrieved from domains on the white list being treated as trusted pages.

In a third aspect there is provided a method operable in a computer network (on-the-fly) for detecting fraudulent resource pages retrievable via the network by an end user for display on a client device associated with the user, the method comprising: in response to a request from the user for a resource page, retrieving the resource page; comparing the retrieved page with a set of features characteristic of fraudulent resource pages; based on the comparison, making a determination as to whether the retrieved page is fraudulent or not; and if the page is determined to be fraudulent, preventing the user from accessing the page.

Preferably the comparison is carried out by a network device other than the user's client device and a page that is determined to be fraudulent is blocked and not delivered to the client device.

When a page is identified as being fraudulent, a page identifier (e.g. URL) for the page is preferably added to a black list of page identifiers (e.g. URLs). The black list can be checked whenever a page is retrieved and if the page identifier of the retrieved page is the same as one that is already in the black list then the retrieved page can be determined to be a fraudulent one without any further checks being carried out.

The method of this aspect can also adopt the approach described above to identify trusted pages, reducing or avoiding altogether the need to scan such pages with the set of features characteristic of fraudulent resource pages.
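The interplay of the black list, the white list and the feature comparison might be sketched as follows. The function name, its parameters and the order in which the checks are made are assumptions for illustration; `fetch` and `looks_fraudulent` are hypothetical callables standing in for page retrieval and the feature comparison respectively.

```python
from typing import Callable, Optional, Set, Tuple
from urllib.parse import urlsplit


def handle_request(
    url: str,
    fetch: Callable[[str], str],
    looks_fraudulent: Callable[[str], bool],
    white_list: Set[str],
    black_list: Set[str],
) -> Tuple[str, Optional[str]]:
    """Decide whether to deliver or block a requested page."""
    # A URL already on the black list is blocked without further checks.
    if url in black_list:
        return ("block", None)
    page = fetch(url)
    # Pages from protected (white-listed) hosts are trusted and not scanned.
    host = urlsplit(url).hostname or ""
    if host in white_list:
        return ("deliver", page)
    # Everything else is compared against the feature set.
    if looks_fraudulent(page):
        black_list.add(url)
        return ("block", None)
    return ("deliver", page)
```

In this sketch trusted pages skip the scan entirely, which is the least resource-intensive of the options discussed; a less thorough scan for trusted pages would be an equally valid variation.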

In a fourth aspect there is provided a network device, for example a proxy server (typically a web proxy, a HTTP proxy or a web cache server), adapted to implement the method of the third aspect.

The invention also provides a computer program that when executed on a computer or computer network causes the computer or network to operate in accordance with any one or more of the methods according to aspects of the invention set forth above. The computer program may be distributed across several devices. The computer program may be stored on a computer readable medium. The program may be executed and maintained locally by, e.g. an internet service provider or network operator, or it could be executed and operated remotely as a managed service by another party.

Brief Description of the Drawings

Embodiments of the invention are described below, by way of example, with reference to the accompanying drawings, in which:

Fig. 1 is an illustration of part of a web page directory (WPD) with which an embodiment of the present invention can be used;

Fig. 2 is a schematic illustration of the architecture of an embodiment of the present invention;

Fig. 3 illustrates the process followed when the embodiment of Fig. 2 is run for the first time for a specific WPD;

Fig. 4 illustrates the process for subsequently and regularly scanning the WPD to detect possible fraudulent web pages within the WPD;

Fig. 5 is a schematic illustration of a network system in which another embodiment of the invention adapted for on-the-fly scanning for phishing web pages can be operated; and

Fig. 6 illustrates the process by which a web proxy in the system of Fig. 5 operates to detect and prevent access to phishing pages.

Detailed Description

The present invention is based in part on an observation that existing solutions to mitigate phishing attacks fail because they rely on the wrong people to execute them and address the problem in the wrong place. The present inventor has recognised that network operators, rather than end users or financial service providers for example, are in the best position to defend against phishing attacks. The present inventor has therefore sought to provide an effective solution for network operators to provide a defence against phishing attacks. Three important features of preferred embodiments of the proposed solution are that:

1) They require no change at all in the internet software infrastructure, such as web browsers and applications hosted on a web server;

2) End users, who are very often not sufficiently technically sophisticated or interested, do not need to get involved at all. The critical decision as to whether a web page is fraudulent (e.g. a phishing page) or not can be made either automatically by a software tool or, when necessary, by technically savvy system administrators; and

3) They are easy to deploy and to maintain.

Another observation underlying the present invention is that each phishing web page must be hosted somewhere on the Internet, in a certain subnet or domain. The problem is to identify the location of the page and remove it before it can do too much harm. This is not an easy task as the pages tend to be very short lived (e.g. days or even hours) and will often disappear before they can be indexed by any conventional search engine.

The solution provided by embodiments of the present invention is designed for each subnet operator such as commercial ISPs (internet service providers), web hosting service providers, financial institutions, universities, etc, and thus it makes it feasible to mitigate phishing by automatically and very rapidly identifying (physical) locations of phishing pages and then removing them from the Internet in the first place.

In its preferred embodiments, the key component of the present invention is a phishing page detection daemon ("P2D2", described in more detail below), one or more instances of which reside in each subnet, run around the clock, scan periodically (or continuously) the whole population of web pages hosted in the subnet (e.g. domain), and distinguish phishing web pages from normal ones by comparing the pages with a set of features that are characteristic of phishing pages generally and in the specific context of the subnet/domain.

Once a phishing page is detected, it is preferably automatically removed from the hosting web server. Pages that trigger a suspicion but cannot be confidently marked as phishing pages by P2D2 are preferably quarantined immediately upon identification, for example moved to a specific directory which is not available to the web server or otherwise made inaccessible to end users.

Meanwhile, a system administrator for the subnet/domain is automatically notified so that they can manually check the quarantined page to determine whether it is legitimate or not. If the page is legitimate it can be quickly restored to its original location. If not then the page can be removed.

By adopting this approach, since phishing pages can be removed at the earliest opportunity, people who are redirected to them by fraudulent emails will see nothing but a message showing, for example, that the page does not exist, and thus they will face no more risk of their personal information being stolen.

In more detail and with reference to the figures, preferred embodiments of the invention are implemented to operate on a Web Page Directory (WPD). P2D2 generates the WPD which is a conceptual tree structure representing all web pages hosted in a subnet. For example, Fig. 1 illustrates part of the conceptual tree structure of the WPD for an organisation. /org_web is the Document root directory, i.e., the root of the tree in which webpage files for all hosted websites are stored. The next level down in the directory tree structure includes an individual directory for each website and the files for individual web pages associated with the website are stored in a plurality of levels below that. Although there may be only one WPD for a subnet hosting an enormous number of web pages, individual sub trees in the WPD (and the web page files associated with them) can be physically distributed across multiple computers networked to one another.

Fig. 2 illustrates the architecture of the solution provided by a preferred embodiment of the invention. As noted above, a daemon program (P2D2) lies at the heart of the solution. The daemon includes a change detector (CD) that detects changes in the WPD by comparing the web page files of the WPD with a previously captured representation of the WPD stored in a signature database (SD). Any pages that have changed are passed to a feature detector (FD) that compares the page content with two sets of features, a site-specific feature set (FS1) and a common feature set (FS2). The FD can compare a page content against the site-specific feature set first followed by the common features set, or it may compare the page against the common feature set first followed by the site-specific feature set. The order of comparison is immaterial. Based on this comparison the FD calculates a suspicion score and if this score exceeds a predetermined threshold it calls an action module (AM), which can act to remove or move suspected phishing pages and, if necessary, alert a system administrator in the manner discussed above.

A site-specific feature set updater (FS1 Updater) is provided to enable the site-specific feature set FS1 to be configured locally by a system administrator. The site-specific feature sets may vary from one instance of P2D2 to another. The common feature set (FS2), on the other hand, is maintained centrally, a local common feature updater (FS2 Updater) receiving updates via the Internet from a remotely located server (FS2 Server) and updating the common feature set (FS2) accordingly.

Turning to Fig. 3, when the solution is first implemented it is necessary to capture a representation of the complete WPD and it is also prudent to check all of the web pages in the WPD to ensure that there are not already one or more possible phishing pages present.

Thus, all of the files in the WPD are scanned and for each file:

1. The CD calculates a signature for the file, which can be a hashed value generated by a cryptographic hash function such as SHA-256. The CD then stores <file path name, file signature> tuples in the Signature Database (SD).

2. Then the FD checks each file in the WPD against two feature sets:

- Site-specific features (FS1)

- Common features (FS2)

3. If the FD identifies a suspicious file using these features, it will call the Action Module (AM). If a phishing page is confidently identified by the FD (by a suspicion score above a high threshold), it will be removed from its current physical location by the AM, so that it becomes unavailable to its serving web server in this domain. Otherwise, when its suspicion level is not high enough to warrant removal of a page but is above a lower threshold, the AM will move this page to one dedicated directory which is not available to the web server, and inform system administrators for further actions. The system administrators can then manually check the page and decide whether to remove it or return it to its original location.
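The behaviour of the Action Module in step 3 could be sketched as follows. The function name, the `notify` callback, the default thresholds and the treatment of scores exactly equal to a threshold are assumptions made for illustration only.

```python
import os
import shutil
from typing import Callable


def act_on_page(
    path: str,
    score: int,
    quarantine_dir: str,
    notify: Callable[[str, str, int], None],
    lower: int = 30,
    upper: int = 70,
) -> str:
    """Remove, quarantine or keep a page according to its suspicion score.

    The thresholds are hypothetical defaults; `notify` stands in for
    whatever mechanism alerts the system administrators.
    """
    if score > upper:
        # Confidently identified phishing page: remove it outright.
        os.remove(path)
        return "removed"
    if score >= lower:
        # Suspicious but uncertain: move it out of the web server's reach.
        os.makedirs(quarantine_dir, exist_ok=True)
        dest = os.path.join(quarantine_dir, os.path.basename(path))
        shutil.move(path, dest)
        notify(path, dest, score)
        return "quarantined"
    return "kept"
```

This sketch stores quarantined files under their base name only, so two pages with the same file name would collide; a fuller implementation would also need to record the original location so that a page judged legitimate can be returned to it.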

After the initial configuration in the manner described above, the P2D2 daemon then periodically scans the WPD to look for new or modified webpage files and determines whether they might be phishing pages. More specifically, as illustrated in Fig. 4, the P2D2 daemon operates in the following manner:

1. The CD regularly scans the WPD and detects file updates such as any new file, sub directory and files modified since the previous 'snapshot' (i.e. the captured representation of the WPD in the SD). The frequency for such a scanning is configurable. A Bloom filter is used to quickly identify whether a given file is included in the previous snapshot.

2. Each modified file and new file identified by the CD is checked against features in both FS1 and FS2 by the FD.

3. The FD and AM then interact in the same manner as described in point 3. above for the initial run of the daemon.
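As a minimal sketch of the change detection performed by the CD, the following Python example computes SHA-256 signatures, captures a snapshot of path and signature pairs, and compares two snapshots. The function names are hypothetical, and an in-memory dictionary stands in for the Signature Database (SD); the Bloom filter mentioned above is not shown, its role being played here by the dictionary membership test.

```python
import hashlib
from pathlib import Path


def file_signature(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def take_snapshot(wpd: Path) -> dict[str, str]:
    """Map each file path (relative to the WPD) to its signature."""
    return {
        p.relative_to(wpd).as_posix(): file_signature(p)
        for p in sorted(wpd.rglob("*"))
        if p.is_file()
    }


def detect_changes(previous: dict[str, str],
                   current: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (new files, modified files) relative to the previous snapshot."""
    new = [path for path in current if path not in previous]
    modified = [
        path for path in current
        if path in previous and previous[path] != current[path]
    ]
    return new, modified
```

Only the files returned by `detect_changes` would need to be passed to the FD, after which the current snapshot would replace the previous one.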

The common feature set (FS2) can be based on features extracted (e.g. by forensic analysis) from known phishing pages. They are referred to as common features because they can trigger suspicion about a webpage regardless of where the page is hosted, and also because they can appear in the same form in each P2D2 installation.

Site-specific features (FS1), on the other hand, may be unique to each P2D2 installation. For example:

o If an organisation maintains a standard structure (including page layout, HTML/JavaScript style, etc.) for all its web pages, a page that does not comply with this structure will trigger suspicion.

o A webpage hosted at the domain of an academic institution might trigger suspicion if it contains many eBay or PayPal logos or links to them.
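One possible way of turning the two feature sets into a suspicion score is sketched below in Python. The features, regular expressions and weights are invented for the example and are not taken from this description, which does not prescribe how the score is calculated; a weighted sum capped at 1.0 is assumed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    pattern: str   # regular expression searched for in the page source
    weight: float  # contribution to the suspicion score when it matches


# Hypothetical common features (FS2), identical in every installation.
FS2 = [
    Feature("password_field", r"<input[^>]*type=[\"']?password", 0.25),
    Feature("urgent_wording", r"verify your account", 0.25),
]

# Hypothetical site-specific features (FS1) for an academic institution.
FS1 = [
    Feature("payment_brand", r"paypal|ebay", 0.5),
]


def suspicion_score(html: str, feature_sets: list[list[Feature]]) -> float:
    """Sum the weights of all matching features, capped at 1.0."""
    total = 0.0
    for feature_set in feature_sets:
        for feature in feature_set:
            if re.search(feature.pattern, html, re.IGNORECASE):
                total += feature.weight
    return min(total, 1.0)
```

Keeping FS1 and FS2 as separate lists mirrors the arrangement described above, in which FS1 is configured locally while FS2 is replaced wholesale when an update is received.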

Turning to Figs. 5 and 6, an adaptation of the P2D2 concept that is suitable for on-the-fly scanning of web pages to detect phishing pages that are hosted in the wild on the Internet will be described. Looking first at Fig. 5 a typical network setup by which a user accesses web pages is shown. The web pages are hosted on servers 10, 12 that are connected to the Internet 14. A user's local network (e.g. corporate network) is connected to the Internet 14 for web browsing via a web proxy server 16, to which one or more user client devices 18 are connected.

Using a browser application on a client device 18, the user sends (e.g. by typing a URL or clicking on a link) an HTTP request for a web page to the web proxy 16. The web proxy retrieves the requested page from the server 10, 12 hosting the page in question, via the Internet using standardised internet protocols.

In this embodiment, the web proxy then scans the retrieved web page before sending it on to the user to determine whether or not the page is a phishing page. If the page is a phishing page then it is not sent on to the user. Instead, the user is sent a message page (e.g. an HTML page displayed in their browser) to notify them that the page they have requested has been determined to be a phishing page.

Thus, the user is safeguarded against unknowingly accessing phishing pages that are hosted outside their local networks or domains.
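The behaviour of the web proxy described above might be sketched as follows in Python. The function name and the wording of the message page are assumptions; the retrieval and detection steps are passed in as callables so that the example remains self-contained and makes no claim about how they are implemented.

```python
from html import escape
from typing import Callable


def handle_request(url: str,
                   fetch: Callable[[str], str],
                   is_phishing: Callable[[str], bool]) -> str:
    """Retrieve a page on the user's behalf and return what the browser should show."""
    page = fetch(url)
    if is_phishing(page):
        # The retrieved page is withheld and a message page is served instead.
        return (
            "<html><body><h1>Page blocked</h1>"
            f"<p>The page you requested ({escape(url)}) has been determined "
            "to be a phishing page.</p></body></html>"
        )
    return page
```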

Turning to Fig. 6, the scanning operation of the web proxy will be explained in more detail.

First, the user's web page request is received. The first step in the scanning procedure carried out by a P2D2 application running on the proxy is then to determine whether the page is on a white list of pages (and/or domains) maintained by the web proxy or the P2D2 application. Domains are added to the white list if it is known that they are protected by a P2D2 daemon, as described above with reference to Figs. 1 to 4. Pages are added to the white list (or treated as being on the white list) if they are hosted by / retrieved from a domain that is known to be protected by a P2D2 daemon. The white list can also include trusted domains / URLs that are identified by other means.

If the requested page is (or is treated as being) on the white list, then it is deemed to be okay and the page is retrieved to the web proxy and forwarded straight to the user without further scanning. In an alternative embodiment (not illustrated) white listed pages are still scanned but using a less stringent (and more rapidly applied) set of criteria.

If the page is not on the white list, the next step in the scanning process is to check whether the page is on a black list of known phishing pages. As with the white list, the black list is maintained at the web proxy or the P2D2 application. If the requested page is on the black list then it is automatically deemed to be a phishing page and the user's access to it is blocked: the requested page is not retrieved and, instead, the web proxy serves the user with a message page to notify them that the page they have requested has been determined to be a phishing page. The black list can include phishing pages identified by the P2D2 application, phishing pages identified by other means (e.g. URLs of suspect pages provided by a third party), or both.

In some embodiments the black list may be checked before the white list. In other embodiments there may be no white list and/or no black list. The particular configuration can, for instance, be selected dependent on the context of the application.
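The list checks, including the configurable order in which they are made, might be sketched as follows in Python. It is assumed for the example that the white list holds domain names and the black list holds full URLs; the function and parameter names are hypothetical.

```python
from urllib.parse import urlsplit


def check_lists(url: str,
                white_domains: set[str],
                black_urls: set[str],
                black_first: bool = False) -> str:
    """Return 'allow', 'block' or 'scan' for a requested URL."""
    host = (urlsplit(url).hostname or "").lower()
    on_white = host in white_domains
    on_black = url in black_urls
    if black_first:
        checks = [("block", on_black), ("allow", on_white)]
    else:
        checks = [("allow", on_white), ("block", on_black)]
    for verdict, hit in checks:
        if hit:
            return verdict
    # On neither list: the page must be retrieved and passed to the FD.
    return "scan"
```

Passing an empty set for either list corresponds to the embodiments in which that list is absent.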

Assuming the requested page is neither on the white list nor the black list, it will be retrieved to the web proxy and then processing passes to a feature detector (FD) module of the P2D2 application that, as with the embodiment discussed above, compares the page content with a common feature set (FS2). Based on this comparison the FD calculates a suspicion score and if this score exceeds a predetermined threshold it calls an action module (AM). This AM differs from the one described further above in that (generally) it will not have access to the directory structure containing the phishing page so will not be able to move / remove the page. Instead, the AM of this embodiment can act to prevent user access to the page and, if necessary, alert a system administrator in the manner discussed above.

If a page is determined by the system administrator to be a phishing page it will be added to the black list maintained at the web proxy or the P2D2 application. If a page is determined by the system administrator to be a legitimate page then it will be made accessible to the user and, optionally, can be added to the white list maintained at the web proxy or the P2D2 application.
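The way in which an administrator's decision feeds back into the lists might be sketched as follows in Python. The class and method names are hypothetical, and pages awaiting review are assumed to be held in a simple in-memory set.

```python
from dataclasses import dataclass, field


@dataclass
class ProxyLists:
    white: set[str] = field(default_factory=set)
    black: set[str] = field(default_factory=set)
    pending: set[str] = field(default_factory=set)  # pages awaiting review

    def flag(self, url: str) -> None:
        """Hold back a suspicious page until an administrator has reviewed it."""
        self.pending.add(url)

    def record_verdict(self, url: str, is_phishing: bool,
                       add_to_white: bool = False) -> None:
        """Apply the administrator's decision to the lists."""
        self.pending.discard(url)
        if is_phishing:
            self.black.add(url)
        elif add_to_white:
            # Adding a legitimate page to the white list is optional.
            self.white.add(url)

    def is_accessible(self, url: str) -> bool:
        """A page is served unless it is black listed or awaiting review."""
        return url not in self.black and url not in self.pending
```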

In summary, typical phishing attacks aim to trick users into giving out their personal information such as passwords and credit card numbers by impersonating a legitimate institution's website. In order to achieve this goal, criminals take advantage of the limited technical knowledge of average online users, add social engineering tactics and launch never-ending attacks. On the other hand, security solutions aimed at mitigating such phishing attempts assume that users have a certain level of technical understanding and are able to successfully use these solutions. In reality, they do not. Hence, these solutions are deemed insufficient and fail to protect the users.

Preferred embodiments of the present invention are based on the premise that network operators are in the best position to provide end-users and victim organisations with protection from phishing attacks. They provide a defence in which phishing pages can be detected and stopped before ever reaching the client's machine.

It will be appreciated that the embodiments described above are given by way of example only and many modifications to that which has been specifically described are possible within the scope of the present invention.

Claims

1. A method operable in a computer network for detecting fraudulent resource pages in a linked web of resource pages hosted on the network and accessible to end users, the resource pages being organised in a directory structure that identifies the physical location of each page, the method comprising: scanning the directory of resource pages and identifying when a new page is added to the directory and/or a page in the directory is modified; comparing each new or modified page with a set of features characteristic of fraudulent resource pages; and based on the comparison, making a determination as to whether the new or modified page is fraudulent or not.

2. A method according to claim 1, wherein the result of the determination is:

- that the page is or is likely to be a fraudulent page;

- that the page is not a fraudulent page; or

- that it is uncertain whether the page is fraudulent or not.

3. A method according to claim 1 or claim 2, wherein said step of making a determination comprises assigning the page a numeric suspicion score based on the comparison with the set of features and comparing the suspicion score with one or more predetermined threshold values.

4. A method according to claim 3, wherein said comparison is with at least a first, higher threshold above which the page is determined to be a fraudulent page and a second, lower threshold below which the page is determined not to be a fraudulent page.

5. A method according to any one of the preceding claims, wherein following a determination that a page is fraudulent, the page is made inaccessible to end users without the need for human intervention.

6. A method according to claim 5, wherein the fraudulent page is made inaccessible by removing it from the directory structure.

7. A method according to claim 5, wherein the fraudulent page is made inaccessible by moving it to a different directory within or outside the directory structure that is inaccessible to end users.

8. A method according to claim 5, wherein the fraudulent page is made inaccessible by deleting the page.

9. A method according to any one of claims 1 to 4, wherein if it is indeterminate whether a page is fraudulent or not, the page is made temporarily inaccessible to end users but is accessible to an administrator in order that they can make a judgement as to whether the page is fraudulent or not.

10. A method according to claim 9, wherein the page is made temporarily inaccessible by moving it to an alternative directory within or outside the directory structure that is accessible to the administrator only.

11. A method according to claim 9 or claim 10, further comprising notifying an administrator that a possibly fraudulent page has been detected.

12. A method according to any one of the preceding claims, comprising periodically scanning the directory of resource pages to identify new and modified pages.

13. A method according to any one of the preceding claims wherein the web of linked resource pages is part of the "World Wide Web" (www) on the Internet, the resource pages being web pages.

14. A method according to any one of the preceding claims, comprising executing multiple instances of the method within multiple sub-networks of the overall network.

15. A method according to any one of the preceding claims, wherein the step of identifying new and/or modified pages within the directory structure comprises comparing scanned portions of the directory structure and its pages with a previously captured representation of the directory structure and the pages within it.

16. A method according to claim 15, wherein the previously captured representation of the directory structure and its pages is updated over time to reflect identified changes.

17. A method according to claim 15 or claim 16, wherein the captured representation of the directory structure and the pages in it comprises a signature for each page in place of the page itself, the signature comprising less data than the page it represents to enable a more rapid comparison.

18. A method according to any one of the preceding claims, wherein said set of features characteristic of fraudulent resource pages comprises a set of one or more context-specific features that are characteristic of fraudulent pages in the respective context.

19. A method according to any one of the preceding claims, wherein said set of features characteristic of fraudulent resource pages comprises a common set of one or more features that are indicative of a fraudulent page irrespective of the specific context in which the page resides.

20. A method according to claim 19, comprising updating the common feature set based on a remotely located copy of the common feature set.

21. A method according to any one of the preceding claims implemented by a daemon program.

22. A system for detecting fraudulent resource pages in a linked web of resource pages accessible to end users, the resource pages being organised in a directory structure that identifies the physical location of each page, the system comprising: means for scanning the directory of resource pages and identifying when a new page is added to the directory and/or a page in the directory is modified; means for comparing each new or modified page with a set of features characteristic of fraudulent resource pages; and means for making a determination as to whether a page is fraudulent or not based on an output from the comparison with said set of features.

23. A computer program that when executed on a computer or computer network causes the computer or network to operate in accordance with a method according to any one of claims 1 to 20.

24. A computer program according to claim 23 stored on a computer readable medium.

25. A method of classifying Internet domains with regards to the likelihood that a web page retrieved from a domain is a phishing page, the method comprising determining whether a domain is one in which fraudulent resource pages are detected in accordance with a method according to any one of claims 1 to 21 and if it is then classifying the domain as trusted and if it is not then classifying the domain as not trusted.

26. A method of classifying web pages with regards to the likelihood that a web page is a phishing page, the method comprising determining whether an Internet domain from which a web page is retrieved is one in which fraudulent resource pages are detected in accordance with a method according to any one of claims 1 to 21 and if it is then classifying the web page as trusted and if it is not then classifying the web page as not trusted.

27. A method according to claim 25 or claim 26, wherein the step of determining whether the Internet domain is one in which fraudulent resource pages are detected in accordance with a method according to any one of claims 1 to 21, comprises checking whether the domain is included in a white-list of trusted domains.

28. A method operable on a computer or computer network for filtering web pages in real time to detect phishing pages, the method comprising: receiving a web page; determining in accordance with a method according to any one of claims 25 to 27 whether or not the web page and/or the domain from which the web page has been retrieved is classified as a trusted web page and/or a trusted domain; if the web page or the domain from which it has been retrieved is classified as trusted then filtering the page using a first set of filter criteria; and if the web page or the domain from which it has been retrieved is not classified as trusted then filtering the page using a second set of filter criteria, the second set of filter criteria being more stringent than the first set of filter criteria.

29. A method according to claim 28, wherein if the web page or the domain from which it has been retrieved is classified as trusted then the page does not undergo any filtering aimed at detecting phishing pages.

30. A method operable in a computer network for detecting fraudulent resource pages retrievable via the network by an end user for display on a client device associated with the user, the method comprising: in response to a request from the user for a resource page, retrieving the resource page; comparing the retrieved page with a set of features characteristic of fraudulent resource pages; based on the comparison, making a determination as to whether the retrieved page is fraudulent or not; and if the page is determined to be fraudulent, preventing the user from accessing the page.

31. A method according to claim 30, wherein the comparing step is carried out by a network device other than the user's client device and a page that is determined to be fraudulent is blocked and not delivered to the client device.

32. A method according to claim 30 or claim 31, wherein: when a page is identified as a fraudulent page a page identifier for the page is added to a black list; the comparing step of the method further comprising comparing the page identifier of a retrieved page with the page identifiers in the black list; and if the page identifier of the retrieved page matches a page identifier already in the black list, the method determining that the retrieved page is fraudulent, thereby preventing the user from accessing it.

33. A method according to any one of claims 30 to 32, comprising, prior to the comparing step, a step to determine whether the retrieved page has been retrieved from a linked web of resource pages hosted on the network in which fraudulent resource pages are detected in accordance with a method according to any one of claims 1 to 21 and if it has been so retrieved then classifying the retrieved page as a trusted page.

34. A method according to claim 33, wherein the comparing step comprises comparing pages classified as trusted pages with a different, less stringent set of features characteristic of fraudulent resource pages than the set of features used for pages that are not classified as trusted.

35. A method according to claim 33, wherein retrieved pages classified as trusted are automatically determined not to be fraudulent and the comparing step is not executed for such pages.

36. A method according to any one of claims 30 to 35, wherein the result of the determination is:

- that the page is or is likely to be a fraudulent page;

- that the page is not a fraudulent page; or

- that it is uncertain whether the page is fraudulent or not.

37. A method according to any one of claims 30 to 36, wherein said step of making a determination comprises assigning the page a numeric suspicion score based on the comparison with the set of features and comparing the suspicion score with one or more predetermined threshold values.

38. A method according to claim 37, wherein said comparison is with at least a first, higher threshold above which the page is determined to be a fraudulent page and a second, lower threshold below which the page is determined not to be a fraudulent page.

39. A method according to any one of claims 30 to 38, wherein following a determination that a page is fraudulent, the user is prevented from accessing the page and a message is displayed to the user informing them that a fraudulent page has been detected.

40. A method according to any one of claims 30 to 38, wherein if it is indeterminate whether a page is fraudulent or not, the page is made temporarily inaccessible to the user but is accessible to an administrator in order that they can make a judgement as to whether the page is fraudulent or not.

41. A method according to claim 40, further comprising notifying an administrator that a possibly fraudulent page has been detected.

42. A method according to any one of the preceding claims wherein the web of linked resource pages is part of the "World Wide Web" (www) on the Internet, the resource pages being web pages.

43. A network device for detecting fraudulent resource pages retrievable via the network by an end user for display on a client device associated with the user, the system comprising: means for retrieving a resource page in response to a request from the user for the resource page; means for comparing a retrieved page with a set of features characteristic of fraudulent resource pages; means for making a determination as to whether a retrieved page is fraudulent or not based on an output from the comparison with said set of features; and means for preventing the user from accessing the page if the page is determined to be fraudulent.

44. A network device according to claim 43, comprising: means for maintaining a black list of fraudulent resource pages; and means for adding a page identifier for a fraudulent page to the black list; the means for comparing comprising means for comparing the page identifier of a retrieved page with the page identifiers in the black list; and the means for making a determination determining that a page is fraudulent if the page identifier of the retrieved page matches a page identifier already in the black list.

45. A network device according to claim 43 or claim 44, comprising: means for determining whether the retrieved page has been retrieved from a linked web of resource pages hosted on the network in which fraudulent resource pages are detected in accordance with a method according to any one of claims 1 to 21; and means for classifying the retrieved page as a trusted page if it has been so retrieved.

46. A network device according to claim 45, wherein the means for comparing comprises means for comparing pages classified as trusted pages with a different, less stringent set of features characteristic of fraudulent resource pages than the set of features used for pages that are not classified as trusted.

47. A network device according to claim 45, comprising means for automatically determining that pages classified as trusted are not fraudulent.

48. A network device according to any one of claims 43 to 47, wherein the network device is a gateway or a proxy (e.g. a web proxy).

49. A computer program that when executed on a computer or computer network causes the computer or network to operate in accordance with a method according to any one of claims 30 to 43.

50. A computer program according to claim 49 stored on a computer readable medium.