Search engines judge which queries a web page is relevant to and then rank the page according to its importance for those queries. The purpose of this article is to provide a quick overview of search engines as a whole in the areas of “importance” and “relevance,” pointing out weaknesses in search engine algorithms today.
Relevance Overview
In the past year or so, search engines have made good progress in improving relevance of search results by filtering out obvious spam. They are able to identify artificial linking networks, gateway domains, duplicate content, and hidden text with keywords. However, there is still a lot of work to be done in the area of relevance by discovering invisible content on the web, some of which is currently unreadable by search engine spiders but most of which is readable.
Unreadable Content
A large part of a search engine optimizer’s job is to make sure a site is friendly for search engine spiders. A search engine spider is a program that views a site; the information it acquires is used to determine rankings. Spiders have trouble with a lot of content that is actually quite popular for sites to use. With this type of content it is important to either provide supplemental content or use different methods for displaying the same information. In general, a search engine spider ought to try to see and interpret a page roughly the same way a human visitor would, and this is not easy, or even close to possible in most cases. Video content and images are an obvious example.
A picture might be worth 1,000 words, but there really is no way for a spider to read a single one without a human telling it what he/she sees. Describing what a picture or video communicates requires alternate text that is often not visible to a human visitor. When determining rankings, search engines try to avoid using content that is not readily visible to, and regularly interpreted by, human visitors. A search engine spider’s inability to interpret the meaning of an image is a huge deficit. Images look better than plain text and are often used even to display text. Sites that are made completely of images are not uncommon. Search engines are missing a great deal of content by not having accurate ways to interpret important images. The same is true of video content.
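To make the gap concrete, here is a minimal sketch of what a very simple spider could read from a page built from images. It uses Python's standard `html.parser` module. The `SpiderView` class, the `spider_view` helper, and the sample page are hypothetical names invented for illustration, and real spiders are far more elaborate than this.

```python
from html.parser import HTMLParser


class SpiderView(HTMLParser):
    """Collects the text a very simple spider could read from a page."""

    def __init__(self):
        super().__init__()
        self.visible_text = []  # text found between tags
        self.alt_text = []      # alt attributes found on <img> tags

    def handle_starttag(self, tag, attrs):
        # The image itself is opaque; only its alt attribute is readable.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alt_text.append(alt.strip())

    def handle_data(self, data):
        if data.strip():
            self.visible_text.append(data.strip())


def spider_view(html):
    """Return (visible_text, alt_text) for an HTML string."""
    parser = SpiderView()
    parser.feed(html)
    parser.close()
    return parser.visible_text, parser.alt_text


# Hypothetical page made entirely of images, one of them without alt text.
page = '<body><img src="banner.png" alt="Acme Widgets"><img src="specials.png"></body>'
visible, alt = spider_view(page)
print(visible)  # []
print(alt)      # ['Acme Widgets']
```

Whatever the second image says to a human visitor, this sketch sees nothing of it, and the only text it recovers from the first image is text the visitor normally never sees.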
Search engines use a lot of “on the page” factors to determine relevance. For content that is not visible to a spider, however, “off the page” factors often have to do the job. In this case, it is all about links. How people link to a video (or a page with a video), for example, might tell a search engine what that video is about. This type of indirect evaluation is necessary; however, it isn’t advanced. For example, one site can talk about another site, the company that site represents, or the content in that site, and it makes no difference to how a spider sees that site unless there is a link. Content that is unreadable by search engines is popular. To improve relevance of search results, spiders that can read the content need to be developed, or algorithms need to develop advanced methods for determining the subject matter of that content through indirect means (or “off the page” factors).
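As a rough illustration of this kind of indirect evaluation, the sketch below guesses what an unreadable resource is about by counting the words other sites use when they link to it. The `anchor_terms` function, the stop-word list, and the sample links are all assumptions made up for this example; it is not a description of how any particular search engine works.

```python
import re
from collections import Counter, defaultdict

# Small, hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"a", "the", "this", "of", "off", "watch", "here", "click"}


def anchor_terms(links, top_n=3):
    """Guess a topic for each link target from the words used to link to it.

    links: iterable of (target_url, anchor_text) pairs.
    Returns a dict mapping each target to its most frequent anchor words.
    """
    counts = defaultdict(Counter)
    for target, anchor in links:
        words = re.findall(r"[a-z0-9]+", anchor.lower())
        counts[target].update(w for w in words if w not in STOPWORDS)
    result = {}
    for target, counter in counts.items():
        # Sort by frequency, then alphabetically, so ties are deterministic.
        ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
        result[target] = [word for word, _ in ranked[:top_n]]
    return result


# Hypothetical inbound links to a page that contains only a video.
inbound = [
    ("example.com/clip", "funny cat video"),
    ("example.com/clip", "cat falls off the table"),
    ("example.com/clip", "watch this cat"),
]
print(anchor_terms(inbound, top_n=1))  # {'example.com/clip': ['cat']}
```

Note what the sketch cannot see: a page that discusses the video at length without linking to it contributes nothing, which is exactly the limitation described above.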
Readable Content
Search engine databases have grown over the years, but a large portion of the web is still invisible. This is because search engines use factors that a lot of site owners are unaware of or don’t bother with. One such factor is the html title tag.
The html title tag is a special case of nearly invisible content. For a human visitor, this tag is all but completely useless. Visitors rarely see or notice it, unless they’re trying to bookmark a site. And if a visitor is bookmarking a site, he/she is likely to get some kind of ugly string of keyphrases that nobody would really want to use to quickly pick a link from a list of other sites. The reason he/she would get this string is that site owners know html titles are used for rankings, and they throw as many keyphrases as possible into the tag. The use of html titles is a leftover from the olden days that search engines haven’t gotten rid of yet. In fact, a lot of weight is still given to the title tag to determine rankings, even though it has little to do with what a human visitor sees when visiting a site.
If search engines want to provide the best results possible, they should at least be aware of all the possibilities out there. This means not relying on “on the page” elements that website developers may not naturally include. This also means making an effort to find every resource the web provides. Work has been done to make search engine spiders more efficient in the past year, which may be a sign of progress in finding more of what’s out there. Search engines are so similar, however, that the content one of them is aware of is usually a subset of what another search engine (usually Google) has indexed. To gain a good sampling of what the web provides, search engines need to continue expanding their databases and honing spiders to search faster.
Improving Search Relevance - Summary
To improve relevance, search engines need to find a way to read images or video content directly, and/or they need to become more sophisticated in their indirect measurements. The use of alternate text is not a good method. It is set by the site’s designer and usually is not visible to visitors.
Search engines should stop using html title tags. These too are set by the site’s designer and usually are not noticed by visitors. A lot of site designers don’t know to write title tags and effectively get excluded from search results because of it. Search engine results pages (SERPs) often use them as the linking text to the page being listed, but this could easily be changed.
Search engines should continue to expand their databases to include all pages that are live on the web.
Importance Overview
Judgements of importance are largely subjective, depending on the goals of the search engine’s creators. Although it is difficult to define what can be considered progress for search engines in this area, there are still some areas that can be improved upon, considering the interests of most search engines.
Search Engine Goals
When someone enters a short query into a search engine, it is very easy for the search engine to return results that are relevant to that query in some way, and short queries are what most users enter. Of the thousands of possibilities, it is a search engine’s job to list the most relevant (i.e. most important) sites at the top. If you believe that, then you’re already making an assumption that is wrong. For most queries, there usually isn’t a single most important resource. There will be a huge pool of equally useful (or not useful, as the case may be) possibilities. Most users don’t enter specific enough queries for there to be enough information to retrieve the “best” results; they rely on the search engines to tell them what the best results are. The main goals of a search engine are:
- To continue with the idea that there is an order of importance for search results, according to most user queries (this only deals with the bulk of searches).
- To convince users that the most important results are being displayed.
Importance
According to these search engine goals, difficulties for determining importance arise from the fact that:
- The importance of a resource is largely personalized. What someone is interested in can’t be completely surmised from a short query typed in an engine.
- Search engines are commercial entities that favor commercial results. It will always be in a search engine’s interest to be a commercial voice.
- Search engines can’t interpret information and understand it. A search engine could be looking at the cure for cancer and wouldn’t know that it is important unless a bunch of people saw it and told a search engine it was important (through indirect means).
There are probably more reasons than these, but these are the big ones.
Personalization
Search engines are experimenting with personalized search programs. In these programs, the results that are displayed are influenced by a user’s background and history of queries. These programs are still in their infancy and aren’t really that popular. The major problem with personalizing results is that users don’t want to provide personal information to a search engine. And for good reason: search engines will give this information to third parties, such as the government. The government has already acquired information about user queries from search engines, and in the case of AOL, query data was actually traced back to individual users. Personalized programs also require more effort from the user. It is easier to just go to a search engine and type in a query than to start an account and enter a bunch of information.
Commercial Voice
Currently search engines rely a great deal on linking to determine the order of rankings. They reinforce what has already been found and referenced by other sites to determine what is most important. They evaluate popular content to be the most important content. Google describes this as being “democratic.” So if a company starts a website and invests time and money promoting it by getting links from “established” and “reputable” sites, it is democracy at work according to Google. The linking network on the web that is probably closest to being democratic is social bookmarking, and these links don’t count toward rankings (or if they do, it’s not by much).
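The way link-based ranking reinforces what is already linked to can be seen in a small sketch. The code below is a simplified, PageRank-style power iteration written for illustration only; it is not Google's actual algorithm, and the `link_scores` function, the damping value of 0.85, and the four page names are assumptions made for this example.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Score pages by the links pointing at them (simplified PageRank-style).

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping every page to a score; the scores sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split among the pages it links to.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for p in pages:
                    new[p] += share
        score = new
    return score


# Hypothetical web: two well-known sites link to a promoted company site,
# while nobody links to a hobbyist's page.
web = {
    "established": ["promoted"],
    "reputable": ["promoted"],
    "promoted": ["established"],
    "hobbyist": [],
}
scores = link_scores(web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy web the promoted site outranks the hobbyist's page purely because of who links to it. The content of either page never enters the calculation.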
Understanding Information
How can a program understand the meaning of what a human is saying? Even if it could happen, how could it be implemented on a large scale?
Not exactly trivial, is it? The point is that search engines have different ways to fake intelligence, but people looking for valuable resources are not going to get a truly intelligent response unless they talk directly to a person.
Search engines have moved somewhat in this direction. They “listen” to what other sites say about a site to determine how important it is. The methodology is extremely simplistic, even if every factor that anyone thinks might be employed by a search engine actually is employed.
Improving Importance
According to search engine goals stated above:
- To continue with the idea that there is an order of importance for search results, according to most user queries (this only deals with the bulk of searches).
- To convince users that the most important results are being displayed.
improving importance is largely a marketing effort. It is important for search engines to convince users that their results are personalized; that they are unbiased, representing the sum of mass opinion; and that they are based on a broad understanding of what is on the web. Ever since Google came up with PageRank, it has run away with the last two factors. Now it is expanding into different niches and offering various options and tools, effectively remaining a search engine while gaining all the benefits of a portal. Google is winning the race for importance. Yahoo! and MSN Search need to follow suit to catch up. And it’s not going to be easy, since Google can probably duplicate any efforts that prove successful for Yahoo! and MSN.
Search engines can “improve importance” by using new ranking factors that convince users that they are really listening to their voice. This might mean analyzing text that discusses a site or refers to the content of a site in greater depth. It might mean giving weight to more “democratic” networks. Work could also be done to establish an unbiased reputation. This could mean automating more processes or taking more from resources that have a reputation for being unbiased. Efforts should also be made to make everything seem more personal – letting users see only what they want to see, or use only what they want to use. This might mean offering more options in the search interface – perhaps, let users choose where ads are displayed or how many are displayed – options of this nature.
Summary
Search engines have made progress in filtering obvious spam from search results. The relevance of search results could be improved by expanding databases to include more content, finding better ways to view unreadable content, and getting rid of html title tags as a ranking factor. The importance of search results could be improved by making greater marketing efforts to highlight new features that show personalization, a lack of bias, and a more complete understanding of information on the web.