APPROACH FOR NEAR DUPLICATE IMAGE DETECTION
CROSS-REFERENCE TO RELATED APPLICATION
This application is related to and claims the benefit of priority from Indian Patent Application No. 3233/DEL/2005 filed in India on Dec. 1, 2005, entitled APPROACH FOR NEAR DUPLICATE IMAGE DETECTION, the entire content of which is incorporated by this reference for all purposes as if fully disclosed herein.
FIELD OF THE INVENTION
The present invention relates to detecting near duplicate images.
BACKGROUND
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Near duplicate images are images that are visually identical to the human eye but that do not have identical data representations. Various data processing techniques, such as scaling, down sampling, clipping, rotating and color processing can generate near duplicate images. For example, an original image may be copied and the copy modified by performing color processing on the copy. The modified copy of the image may appear visually identical to the original image, but have a different data representation because of the color processing applied to the copy.
Various issues have arisen relating to near duplicate images on the Internet. In the context of Internet searching, it is not uncommon for the results of a search to include near duplicate images. One reason for this is that most search engines identify matching images based upon keyword matching. That is, keywords contained in a search query are compared to keywords associated with images. An image having an associated keyword that matches a keyword contained in a query is determined to be a match for that query and is included in the search results. When an image is copied and the copy modified to create a near duplicate image, the near duplicate image may have the same associated keywords as the original image. In this situation, both the original image and the modified near duplicate image are included in the search results. From the perspective of both the search engine host and end users, it is desirable to not include near duplicate images in search results.
Although approaches exist and have been employed to detect duplicate images, using these approaches to detect near duplicate images has proven to be ineffective. One such approach involves comparing pixel information at fixed locations. While this approach may be useful in detecting exact duplicate images, it has significant limitations when used to detect near duplicate images. For example, the approach is effective when a copy of an image is cropped and the pixels being compared are not in the portion that has been cropped. In this situation, the comparison of pixel information would correctly identify the images as near duplicate images. On the other hand, this approach is not useful when the changes include slight changes in color, scaling or rotation. When a copy of an image is modified in this manner, a comparison of pixel information would indicate that the original image and the modified copy are not near duplicate images. In a search engine application, this would result in both the original and modified copy being included in search results as different images, even though the modified copy is a near duplicate of the original image because it appears visually identical to the original image.
Based upon the foregoing, an approach for detecting near duplicate images that does not suffer from limitations of prior approaches is highly desirable.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 is a flow diagram that depicts an approach for detecting near duplicate images according to one embodiment of the invention.
FIGS. 2A and 2B are images that are visually similar but that have different centers of mass.
FIGS. 3A and 3B are two other example images that are visually similar but that have different centers of mass.
FIG. 4 is a block diagram that depicts an image having a center of mass at a location.
FIG. 5 is a color histogram of the image in FIGS. 2A and 2B.
FIG. 6 is a color histogram of the image in FIGS. 3A and 3B.
FIG. 7 is a block diagram that illustrates a computer system upon which an embodiment of the invention may be implemented.
DETAILED DESCRIPTION
In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention described herein. It will be apparent, however, that the embodiments of the invention described herein may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the embodiments of the invention described herein. Various embodiments of the approach are described hereinafter in the following sections:
I. OVERVIEW
II. CENTER OF MASS AND IMAGE SEGMENTS
III. COLOR HISTOGRAM DESCRIPTOR DATA
IV. TEXTURE HISTOGRAM DESCRIPTOR DATA
V. IMAGE SIGNATURE DATA GENERATION
VI. DETERMINING WHETHER IMAGES ARE NEAR DUPLICATE IMAGES
VII. IMPLEMENTATION MECHANISMS
I. Overview
A content-based approach is provided for detecting near duplicate images. The approach generally involves analyzing the content of images to be compared and generating color and texture histogram descriptor data for each image. The images may then be compared based upon the color and texture histogram descriptor data to determine whether the images are near duplicate images. Content-based image signature data may also be generated for each of the images based upon the color and texture histogram descriptor data. The image signature data may then be compared to determine whether the corresponding images are near duplicate images.
FIG. 1 is a flow diagram 100 that depicts an approach for detecting near duplicate images according to one embodiment of the invention. In step 102, the center of mass of each image is determined and segments are defined around the center of mass as described in more detail hereinafter and as depicted in FIG. 4. In step 104, color histogram descriptor data is generated for each segment of each image. In step 106, texture histogram descriptor data is generated for each segment of each image. In step 108, image signature data is generated for each image based upon the color histogram descriptor data and the texture histogram descriptor data. In step 110, a determination is made whether the images are near duplicate images. As described in more detail hereinafter, this determination may be performed using either the color and texture histogram descriptor data or the image signature data. In implementations where the determination of whether images are near duplicate images is performed using the color and texture histogram descriptor data without the image signature data, step 108 may not be performed. Additional details of the approach are described hereinafter.
The approach has been found to be effective in identifying near duplicate images because of the use of color and texture analysis to generate the image signatures. The approach is applicable to a wide variety of contexts and implementations and is not limited to any particular context or implementation. For example, the approach may be used in the search engine context to reduce the number of near duplicate images contained in search results. As another example, the approach may be used to reduce the number of near duplicate images stored in databases. As a further example, the approach may be employed to identify unauthorized copies of images where the unauthorized copies were modified in an attempt to avoid detection. The approach is also not limited to images per se and may be applied to video data, for example, video frames. The image signatures described herein may be used in combination with other information used to identify duplicate images. Image signatures may also be used to support high level functionality, for example, determining whether an image is a photo or graphic in an image search engine.
II. Center of Mass and Image Segments
According to one embodiment of the invention, a center of mass on luminance is determined for an image for which color and texture histogram descriptor data is to be generated as follows:
For an image,
$$S = \sum_{i}\sum_{j} lum_{ij}, \qquad S_x = \sum_{i}\sum_{j} i \cdot lum_{ij}, \qquad S_y = \sum_{i}\sum_{j} j \cdot lum_{ij}$$
where $lum_{ij} = 0.299 \times R + 0.587 \times G + 0.114 \times B$ with (R, G, B) colors at pixel (i, j). Center of mass $(C_x, C_y)$ is then defined as $C_x = S_x / S$ and $C_y = S_y / S$.
FIGS. 2A and 2B are two images that are visually similar but that have different centers of mass. The image in FIG. 2A has a center of mass of (45, 60), while the image in FIG. 2B has a center of mass of (38, 51). As another example, FIGS. 3A and 3B are also images that are visually similar but that have different centers of mass. The image in FIG. 3A has a center of mass of (65, 43), while the image in FIG. 3B has a center of mass of (56, 46).
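The luminance-weighted center of mass described above can be sketched as follows. This is a minimal illustration rather than the implementation of any embodiment: it assumes the image is stored as rows of (R, G, B) tuples, that x is the column index and y is the row index, and that an all-black image falls back to its geometric center. The function names are hypothetical.

```python
def luminance(pixel):
    """Luminance of an (R, G, B) pixel using the weights given in the text."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def center_of_mass(image):
    """Return (Cx, Cy) for an image stored as rows of (R, G, B) tuples.

    Assumption: x is the column index and y is the row index.
    """
    total = sum_x = sum_y = 0.0
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            lum = luminance(pixel)
            total += lum        # S: total luminance
            sum_x += x * lum    # Sx: luminance-weighted x coordinates
            sum_y += y * lum    # Sy: luminance-weighted y coordinates
    if total == 0:
        # Assumption: an image with no luminance uses its geometric center.
        height = len(image)
        width = len(image[0]) if height else 0
        return ((width - 1) / 2.0, (height - 1) / 2.0)
    return (sum_x / total, sum_y / total)
```

Because every pixel is weighted by its luminance, brightening one side of an image pulls the center of mass toward that side, which is consistent with visually similar images having somewhat different centers of mass.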
Once the center of mass on luminance has been determined, the image is segmented based upon the location of the center of mass. FIG. 4 is a block diagram that depicts an image 400 having a center of mass at a location 402. A center segment S1 is defined by the center of mass of image 400. Four other segments are defined for image 400, including a top segment S2, a right segment S3, a bottom segment S4 and a left segment S5. The number, size and shape of segments may vary from image to image, depending upon a particular implementation, and the approach is not limited to segments of any particular number, size or shape.
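One possible way to assign pixels to five segments is sketched below. The text does not fix the segment geometry, so everything here is an assumption made for illustration: the center segment is taken to be a box around the center of mass whose half-size is a fraction of the image size, and the remaining pixels are assigned to the top, right, bottom or left segment by their dominant offset from the center of mass. The names `segment_of`, `segment_map` and `center_fraction` are hypothetical.

```python
def segment_of(x, y, cx, cy, width, height, center_fraction=0.25):
    """Return the segment label (1..5) for pixel (x, y).

    Hypothetical geometry: S1 is a box around the center of mass (cx, cy) whose
    half-size is `center_fraction` of the image size; the remaining pixels go to
    S2 (top), S3 (right), S4 (bottom) or S5 (left) by dominant offset direction.
    """
    dx = x - cx
    dy = y - cy
    if abs(dx) <= center_fraction * width and abs(dy) <= center_fraction * height:
        return 1  # center segment S1
    # Scale the offsets by the image size so non-square images are split
    # along lines parallel to their diagonals.
    if abs(dx) * height >= abs(dy) * width:
        return 3 if dx > 0 else 5  # right (S3) or left (S5)
    return 4 if dy > 0 else 2      # bottom (S4) or top (S2); y grows downward


def segment_map(width, height, cx, cy):
    """Label every pixel of a width x height image with its segment number."""
    return [[segment_of(x, y, cx, cy, width, height) for x in range(width)]
            for y in range(height)]
```

Anchoring the segments to the center of mass rather than to the image center means the segments follow the image content when it is shifted, for example by cropping.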
III. Color Histogram Descriptor Data
According to one embodiment of the invention, color histogram descriptor data is generated for each segment in an image. Various color spaces, for example hue, saturation, value (HSV) or red, green, blue (RGB), and quantization schemes may be employed, depending upon a particular implementation, and the approach described herein is not limited to any particular color space or quantization scheme. According to one embodiment of the invention, color histogram descriptor data is generated for each segment based upon a quantization scheme of twenty seven (27) colors in RGB space. It has been determined that a quantization scheme of 27 colors provides sufficient color representation while providing significant savings in computational cost over quantization schemes that use a larger number of colors. Thus, in this embodiment of the invention, the color histogram descriptor data for a particular segment indicates the number of occurrences within the particular segment of each of 27 colors. For an image divided into five segments, the color histogram descriptor data includes 27×5 or 135 values. The color histogram descriptor data may also be normalized. According to one embodiment of the invention, the color histogram descriptor data is normalized by dividing each of the values by the maximum value. The color histogram descriptor data for each segment may be separately normalized. FIG. 5 is a color histogram of the image in FIGS. 2A and 2B.
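A per-segment color histogram of this kind can be sketched as follows. The text states that 27 colors in RGB space are used but not how they are chosen; this sketch assumes three uniform levels per 8-bit channel (3 × 3 × 3 = 27) and normalizes by the largest bin, as described above. The function names are hypothetical.

```python
def quantize_color(pixel, levels=3):
    """Map an 8-bit (R, G, B) pixel to a bin index in [0, levels**3).

    Assumption: the 27 colors come from 3 uniform levels per channel.
    """
    r, g, b = (min(channel * levels // 256, levels - 1) for channel in pixel)
    return (r * levels + g) * levels + b


def color_histogram(pixels, levels=3):
    """Count quantized colors for one segment and normalize by the maximum bin."""
    counts = [0] * (levels ** 3)
    for pixel in pixels:
        counts[quantize_color(pixel, levels)] += 1
    peak = max(counts)
    if peak == 0:
        return [0.0] * len(counts)  # empty segment: nothing to normalize
    return [count / peak for count in counts]
```

Calling `color_histogram` once per segment and concatenating the five results would yield the 135 values mentioned above.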
IV. Texture Histogram Descriptor Data
According to one embodiment of the invention, texture histogram descriptor data is generated for each segment in an image. An image is first converted to a gray-scaled image. A gray-scaled image has data values of, for example, 0-255 for each pixel. A variety of well-known techniques may be used to generate the gray-scaled image data and the invention is not limited to any particular approach. For example, the gray-scaled image data may be generated from the color image data using a digital image manipulation software application, such as Adobe Photoshop, available from Adobe Systems, Inc. of San Jose, Calif. Texture analysis is then performed on each segment of the gray-scaled image to determine the edge orientation at each pixel within a segment. According to one embodiment of the invention, this is performed using oriented Gaussian filtering and each edge orientation is quantized to fourteen (14) directions. Although any quantization scheme may be employed, it has been determined that a quantization scheme of 14 directions provides sufficient texture representation while providing significant savings in computational cost over quantization schemes that use a larger number of edge orientations. Thus, in this embodiment of the invention, the texture histogram descriptor data for a particular segment indicates the number of occurrences within the particular segment of each of 14 edge orientations. For an image divided into five segments, the texture histogram descriptor data includes 14×5 or 70 values. The texture histogram descriptor data may also be normalized. According to one embodiment of the invention, the texture histogram descriptor data is normalized by dividing each of the values by the maximum value. The texture histogram descriptor data for each segment may be separately normalized. FIG. 6 is a color histogram of the image in FIGS. 3A and 3B.
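The idea of a 14-bin orientation histogram can be illustrated with the sketch below. Note the substitution: instead of the oriented Gaussian filtering named in the text, this sketch uses simple central-difference gradients, and it uses the gradient direction folded into [0, π) as a proxy for edge orientation. The `min_magnitude` cutoff for ignoring flat regions and the function name are assumptions introduced for illustration.

```python
import math


def edge_orientation_histogram(gray, bins=14, min_magnitude=1.0):
    """Histogram of quantized orientations for a 2-D list of gray levels.

    Swapped-in technique: central-difference gradients stand in for the
    oriented Gaussian filtering described in the text.
    """
    counts = [0] * bins
    height = len(gray)
    width = len(gray[0]) if height else 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            if math.hypot(gx, gy) < min_magnitude:
                continue  # flat region: no orientation to count
            # Orientation is undirected, so fold the angle into [0, pi).
            angle = math.atan2(gy, gx) % math.pi
            counts[min(int(angle / math.pi * bins), bins - 1)] += 1
    peak = max(counts)
    # Normalize by the maximum bin, as the text describes.
    return [count / peak for count in counts] if peak else [0.0] * bins
```

Applying the function to the gray levels of each of the five segments and concatenating the results would give the 70 values mentioned above.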
V. Image Signature Data Generation
Different types of image signature data may be generated, depending upon the requirements of a particular implementation. According to one embodiment of the invention, wavelet-based signature data is generated. According to this approach, the wavelet-based signature data includes X number of bits, with X/2 bits for color and X/2 bits for texture. Signature encoding of the color histogram descriptor data and texture histogram descriptor data is performed using a wavelet transform-based compression. According to one embodiment of the invention, a Haar transform with eight (8) average coefficients with uniform quantization is used to create either a 64 bit or 128 bit signature. Both the color and texture histograms are encoded to 32 or 64 bits for a total signature size of 64 or 128 bits.
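The sketch below shows one way eight Haar average coefficients could be quantized into a 32 bit half-signature (4 bits per coefficient) or a 64 bit half-signature (8 bits per coefficient). The text does not give the padding, normalization or bit layout, so these are assumptions: the histogram is zero-padded to 8·2^k entries, adjacent pairs are averaged repeatedly (an unnormalized Haar low-pass step) until eight values remain, and the color half occupies the high-order bits. All function names are hypothetical.

```python
def haar_averages(values, count=8):
    """Reduce `values` to `count` Haar average (approximation) coefficients.

    Assumption: the vector is zero-padded to count * 2**k entries, then adjacent
    pairs are averaged repeatedly until `count` values remain.
    """
    size = count
    while size < len(values):
        size *= 2
    data = list(values) + [0.0] * (size - len(values))
    while len(data) > count:
        data = [(data[i] + data[i + 1]) / 2.0 for i in range(0, len(data), 2)]
    return data


def quantize_uniform(coefficients, bits):
    """Uniformly quantize coefficients in [0, 1] and pack them into one integer."""
    levels = (1 << bits) - 1
    packed = 0
    for value in coefficients:
        clamped = min(max(value, 0.0), 1.0)
        packed = (packed << bits) | int(round(clamped * levels))
    return packed


def wavelet_signature(color_hist, texture_hist, bits_per_coefficient=4):
    """Concatenate the color half and the texture half into one signature integer."""
    half_bits = 8 * bits_per_coefficient  # 32 bits when 4 bits per coefficient
    color = quantize_uniform(haar_averages(color_hist), bits_per_coefficient)
    texture = quantize_uniform(haar_averages(texture_hist), bits_per_coefficient)
    return (color << half_bits) | texture
```

With `bits_per_coefficient=4` the result fits in 64 bits, and with `bits_per_coefficient=8` it fits in 128 bits, matching the two signature sizes mentioned above.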
According to another embodiment of the invention, vector quantization-based signature data is generated. According to this approach, a set of vector quantization code books is generated from random images. The code books contain representative vectors in color and texture space. The number of code books used may vary depending upon a particular implementation and the invention is not limited to any particular number of code books. According to one embodiment of the invention, three random 64 k images are processed to create three code books for color and three code books for texture.
To generate a vector quantization-based signature for a particular image, first the color and texture histogram descriptor data for the particular image are matched to color and texture histogram descriptor data contained in the code books. For example, the 135 color histogram values are matched to color histogram values contained in each of the color code books. Similarly, the 70 texture histogram values are matched to texture histogram values contained in each of the texture code books. A vector quantization-based signature for the particular image is then generated based upon index values for the matching color and texture histogram data in the code books. According to one embodiment of the invention, the vector quantization-based signature includes the index values for each color and texture code book and other data. The other data may indicate, for example, the type class of the particular image, i.e., graphic or picture, and the texture class of the particular image. For example, suppose that three images are used to create the code books, providing three color code books and three texture code books. In this situation, a 64 bit vector quantization-based signature may include 10 bits for each of the six matching code book index values (60 bits) and four bits for the other data.
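The 64 bit layout in this example (six 10 bit indices plus four bits of other data) can be sketched as follows. The text does not say how a histogram is matched to a code book entry or how the fields are ordered; this sketch assumes nearest-neighbor matching by squared Euclidean distance, code books of at most 1024 entries, and the other data in the four low-order bits. The function names are hypothetical.

```python
def nearest_index(vector, code_book):
    """Index of the code book entry closest to `vector`.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(code_book)), key=lambda i: distance(code_book[i]))


def vq_signature(color_hist, texture_hist, color_books, texture_books, other_data=0):
    """Pack six 10-bit code book indices and 4 bits of other data into 64 bits."""
    indices = [nearest_index(color_hist, book) for book in color_books]
    indices += [nearest_index(texture_hist, book) for book in texture_books]
    signature = 0
    for index in indices:
        if index >= 1 << 10:
            raise ValueError("code book has more than 1024 entries")
        signature = (signature << 10) | index
    # Assumed layout: the other data occupies the four low-order bits.
    return (signature << 4) | (other_data & 0xF)
```

With three color code books and three texture code books, the result uses 60 bits for indices and 4 bits for the other data, for 64 bits in total.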
VI. Determining Whether Images are Near Duplicate Images
Various approaches may be used to determine whether images are near duplicate images. For example, these approaches may use the color and texture histogram descriptor data for images or the image signature data. According to one embodiment of the invention, the color and texture histogram descriptor data is used to determine whether two images are near duplicate images. Two images may be considered to be near duplicate images if their corresponding color and texture histogram descriptor data satisfy a threshold level of similarity. Alternatively, locality-sensitive hashing (LSH)-based similarity may be used.
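A threshold test on histogram descriptor data might look like the sketch below. The text does not name a similarity measure or a threshold value, so both are assumptions: similarity is taken as one minus the mean absolute bin difference of histograms normalized to [0, 1], and the default threshold of 0.9 is purely illustrative. The function names are hypothetical.

```python
def histogram_similarity(hist_a, hist_b):
    """Similarity in [0, 1]: 1 minus the mean absolute difference of the bins.

    Assumption: both histograms are normalized to [0, 1] and have equal length.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    if not hist_a:
        return 1.0
    total = sum(abs(a - b) for a, b in zip(hist_a, hist_b))
    return 1.0 - total / len(hist_a)


def are_near_duplicates(color_a, texture_a, color_b, texture_b, threshold=0.9):
    """Require both the color and the texture similarity to meet the threshold."""
    return (histogram_similarity(color_a, color_b) >= threshold
            and histogram_similarity(texture_a, texture_b) >= threshold)
```

Requiring both descriptors to pass is one design choice; a weighted combination of the two similarities would be another.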
According to another embodiment of the invention, image signature data is used to determine whether two images are near duplicate images. For implementations where wavelet-based signature data is generated, the wavelet-based signature data may be compared using a Hamming distance. For implementations where vector quantization-based signature data is generated, the vector quantization-based signature data for images may be directly compared to determine whether there are exact matches on any parts of the vector quantization-based signature data.
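Both comparisons can be sketched on signatures held as Python integers. The Hamming distance is the number of differing bits. For the vector quantization-based signature, the sketch assumes a layout of six 10 bit index fields followed by four low-order bits of other data and counts the index fields that match exactly. The field layout and the function names are assumptions for illustration.

```python
def hamming_distance(sig_a, sig_b):
    """Number of bit positions in which two integer signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def vq_matching_fields(sig_a, sig_b, index_bits=10, index_count=6, other_bits=4):
    """Count code book index fields that are identical in two signatures.

    Assumed layout: `index_count` fields of `index_bits` bits each, followed by
    `other_bits` low-order bits of other data.
    """
    mask = (1 << index_bits) - 1
    a = sig_a >> other_bits
    b = sig_b >> other_bits
    matches = 0
    for _ in range(index_count):
        if (a & mask) == (b & mask):
            matches += 1
        a >>= index_bits
        b >>= index_bits
    return matches
```

A caller could, for example, treat two images as near duplicates when the Hamming distance is below a chosen bound, or when at least a chosen number of index fields match.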
VII. Implementation Mechanisms
The approach described herein for detecting near duplicate images may be used in a variety of contexts and applications. For example, the approach may be used to "de-duplicate" a database of images. In this example, a process may be used to generate color and texture histogram descriptor data and image signature data for each image in the database and then store this information in an index. The information in the index may then be used to identify near duplicate images. A decision can then be made on which of the near duplicate images should be deleted from the database.
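As an illustration of the de-duplication example, the sketch below makes a single greedy pass over an index of signatures and keeps an image only if no previously kept image is within a given Hamming distance. The index structure (a dictionary from a hypothetical image id to an integer signature), the greedy policy and the `max_distance` default are all assumptions; the text leaves the deletion decision open.

```python
def deduplicate(signatures, max_distance=4):
    """Greedy pass: keep an image only if no kept image is within max_distance bits.

    `signatures` maps a hypothetical image id to its integer signature.
    Returns (kept_ids, duplicates), where duplicates maps each dropped id to
    the kept id it was found to be a near duplicate of.
    """
    kept = []
    duplicates = {}
    for image_id, signature in signatures.items():
        for kept_id in kept:
            differing_bits = bin(signature ^ signatures[kept_id]).count("1")
            if differing_bits <= max_distance:
                duplicates[image_id] = kept_id
                break
        else:
            # No kept image was close enough, so this image is retained.
            kept.append(image_id)
    return kept, duplicates
```

This pass compares each image against every kept image, which is quadratic in the worst case; a large database would likely need an index structure suited to nearest-neighbor lookup.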
As another example, in the search context, the approach may be used to eliminate near duplicate images in search results. In this example, when a search query is processed and search results are generated, images contained in the search results may be evaluated as described herein to identify near duplicate images. A decision can then be made about which of the near duplicate images are to be removed from the search results. The approach may be applied to many other contexts and implementations and is not limited to these example contexts and implementations.
The approach described herein for detecting near duplicate images may be implemented on any type of computing architecture or platform and in any type of process. As described herein, the approach may be implemented as part of a database management process or in association with a search engine platform. For purposes of explanation, an example computing architecture is described hereinafter.
FIG. 7 is a block diagram that illustrates an example computer system 700 upon which an embodiment of the invention may be implemented. Computer system 700 includes a bus 702 or other communication mechanism for communicating information, and a processor 704 coupled with bus 702 for processing information. Computer system 700 also includes a main memory 706, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 702 for storing information and instructions to be executed by processor 704. Main memory 706 also may be used for storing temporary variables or other intermediate information during execution of instructions to be executed by processor 704. Computer system 700 further includes a read only memory (ROM) 708 or other static storage device coupled to bus 702 for storing static information and instructions for processor 704. A storage device 710, such as a magnetic disk or optical disk, is provided and coupled to bus 702 for storing information and instructions.
Computer system 700 may be coupled via bus 702 to a display 712, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 714, including alphanumeric and other keys, is coupled to bus 702 for communicating information and command selections to processor 704. Another type of user input device is cursor control 716, such as a mouse, a trackball, or cursor direction keys for communicating direction information and command selections to processor 704 and for controlling cursor movement on display 712. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allows the device to specify positions in a plane.
The invention is related to the use of computer system 700 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 700 in response to processor 704 executing one or more sequences of one or more instruc