Information Science: September 2004

Tuesday, September 21, 2004

How Google works

Sep 16th 2004, from The Economist print edition

Internet searching: With all the fuss over Google's IPO, it is easy to overlook its broader social significance. For many people, Google made the internet truly useful. How did it do it?

ONE thing that distinguishes the online world from the real one is that it is very easy to find things. To find a copy of The Economist in print, one has to go to a news-stand, which may or may not carry it. Finding it online, though, is a different proposition. Just go to Google, type in “economist” and you will be instantly directed to economist.com. Though it is difficult to remember now, this was not always the case. Indeed, until Google, now the world's most popular search engine, came on to the scene in September 1998, it was not the case at all. As in the physical world, searching online was a hit-or-miss affair.

Google was vastly better than anything that had come before: so much better, in fact, that it changed the way many people use the web. Almost overnight, it made the web far more useful, particularly for non-specialist users, many of whom now regard Google as the internet's front door. The recent fuss over Google's stockmarket flotation obscures its far wider social significance: few technologies, after all, are so influential that their names become used as verbs.

Google began in 1998 as an academic research project by Sergey Brin and Lawrence Page, who were then graduate students at Stanford University in Palo Alto, California. It was not the first search engine, of course. Existing search engines were able to scan or “crawl” a large portion of the web, build an index, and then find pages that matched particular words. But they were less good at presenting those pages, which might number in the hundreds of thousands, in a useful way.

Mr Brin's and Mr Page's accomplishment was to devise a way to sort the results by determining which pages were likely to be most relevant. They did so using a mathematical recipe, or algorithm, called PageRank. This algorithm is at the heart of Google's success, distinguishing it from all previous search engines and accounting for its apparently magical ability to find the most useful web pages.

Untangling the web

PageRank works by analysing the structure of the web itself. Each of its billions of pages can link to other pages, and can also, in turn, be linked to. Mr Brin and Mr Page reasoned that if a page was linked to by many other pages, it was likely to be important. Furthermore, if the pages that linked to a page were important, then that page was even more likely to be important. There is, of course, an inherent circularity to this formula: the importance of one page depends on the importance of pages that link to it, the importance of which depends in turn on the importance of pages that link to them. But using some mathematical tricks, this circularity can be resolved, and each page can be given a score that reflects its importance.

The simplest way to calculate the score for each page is to perform a repeating or “iterative” calculation. To start with, all pages are given the same score. Then each link from one page to another is counted as a “vote” for the destination page. Each page's score is recalculated by adding up the contribution from each incoming link, which is simply the score of the linking page divided by the number of outgoing links on that page. (Each page's score is thus shared out among the pages it links to.)

Once all the scores have been recalculated, the process is repeated using the new scores, until the scores settle down and stop changing (in mathematical jargon, the calculation “converges”). The final scores can then be used to rank search results: pages that match a particular set of search terms are displayed in order of descending score, so that the page deemed most important appears at the top of the list.
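The repeating calculation described above can be sketched in a few lines of Python. This is only an illustration of the simplified, undamped recipe in the text, not Google's actual code. The names `simple_pagerank` and `rank_results`, the tolerance and the three-page toy web are all assumptions made for the example, and the sketch assumes every page has at least one outgoing link and links only to pages in the same dictionary.

```python
def simple_pagerank(links, tol=1e-9, max_iter=1000):
    """Iterate the simplified (undamped) score calculation until it settles.

    links maps each page to the list of pages it links to.
    Assumes every page has at least one outgoing link.
    """
    pages = list(links)
    # To start with, all pages are given the same score.
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(max_iter):
        new_scores = {page: 0.0 for page in pages}
        for source, targets in links.items():
            # A page's score is shared out among the pages it links to.
            share = scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        # Stop once the scores have stopped changing (convergence).
        change = max(abs(new_scores[p] - scores[p]) for p in pages)
        scores = new_scores
        if change < tol:
            break
    return scores


def rank_results(matching_pages, scores):
    """Order the pages that match a query by descending score."""
    return sorted(matching_pages, key=lambda page: scores[page], reverse=True)


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
toy_scores = simple_pagerank(toy_web)
```

On this toy web the scores settle near 0.4 for A, 0.2 for B and 0.4 for C: page B has only half of A's score flowing into it, while C collects from both A and B. Without a damping factor the iteration is not guaranteed to settle on every possible link structure, which is one reason the toy web was chosen with care.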

While this is the simplest way to perform the PageRank calculation, it is not the fastest. Google actually uses sophisticated techniques from a branch of mathematics known as linear algebra to perform the calculation in a single step. (And the actual PageRank formula, still visible on a Stanford web page, includes an extra “damping factor” to prevent pages' scores increasing indefinitely.)
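For reference, the damped formula can be written out as follows. This is the form given in Mr Brin's and Mr Page's published description of their prototype; the symbol names are theirs, and whatever Google runs today may well differ.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```

Here T_1 to T_n are the pages that link to page A, C(T) is the number of outgoing links on page T, and d is the damping factor, a number between 0 and 1 (the published description suggests 0.85 as a typical value). Setting d to 1 gives back the simple shared-out sum of incoming contributions described earlier.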

Furthermore, the PageRank algorithm has been repeatedly modified from its original form to prevent people from gaming the system. Since Google's debut in 1998, the importance of a page's Google ranking, particularly for businesses that rely on search engines to send customers their way, has increased dramatically: Google is now responsible for one in three searches on the web. For this reason, an entire industry of “search-engine optimisers” has sprung up. For a fee, they will try to manipulate your page's ranking on Google and other search engines.

The original PageRank algorithm could be manipulated in a fairly straightforward fashion, by creating a “link farm” of web pages that link to one another and to a target page, and thus give an inflated impression of its importance. So Google's original ranking algorithm has grown considerably more complicated, and is now able to identify and blacklist pages that try to exploit such tricks.

Mr Page and Mr Brin made another important innovation early on. This was to consider the “anchor text”—the bit of text that is traditionally blue and underlined and forms a link from one page to another—as a part of the web page it referred to, as well as part of the page it was actually on. They reasoned that the anchor text served as an extremely succinct, if imprecise, summary of the page it referred to. This further helps to ensure that when searching for the name of a person or company, the appropriate website appears at the top of the list of results.
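The anchor-text idea can be illustrated with a toy index builder. This is a sketch under stated assumptions rather than a description of Google's indexer: the function name `build_index`, the page layout (a `text` string plus a list of `(anchor_text, target_url)` pairs) and the example addresses are all invented for the illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Build a word -> set-of-URLs index that credits anchor text twice.

    pages maps a URL to {"text": str, "links": [(anchor_text, target_url), ...]}.
    """
    index = defaultdict(set)
    for url, page in pages.items():
        # Ordinary body text counts only for the page it appears on.
        for word in page["text"].lower().split():
            index[word].add(url)
        for anchor_text, target_url in page["links"]:
            for word in anchor_text.lower().split():
                index[word].add(url)         # the page the link is on
                index[word].add(target_url)  # the page the link points to
    return index


# Hypothetical two-page web.
toy_pages = {
    "blog.example/post": {
        "text": "a weekly newspaper worth reading",
        "links": [("The Economist", "economist.example")],
    },
    "economist.example": {"text": "news and analysis", "links": []},
}
toy_index = build_index(toy_pages)
```

In this toy web the word "economist" never appears in the body text of `economist.example`, yet a search for it still finds that page, because the link text on the blog post is credited to the page it points to.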

Ranking the order in which results are returned was the area in which Google made the most improvement, but it is only one element of search—and it is useless unless the rest of the search engine works efficiently. In practice, that means compiling a comprehensive and up-to-date index of the web's ever-changing pages. PageRank sits on top of Google's extremely powerful and efficient search infrastructure—one that draws on the lessons learned from previous, and now mostly forgotten, search engines.

As the web grew in the early 1990s, a number of search engines, most of them academic research projects, started crawling and indexing its pages. The first of these, the World Wide Web Wanderer and the World Wide Web Worm, used very simple techniques, and did not even index entire web pages, but only their titles, addresses and headers. A number of commercial engines followed, springing out of academic projects (as Google later did). WebCrawler, the first to index entire pages, emerged in 1994 at the University of Washington and was later bought by America Online. It was followed by Lycos and InfoSeek. But the first really capable search engine was AltaVista, unveiled by Louis Monier of Digital Equipment Corporation in December of 1995.

The day before the site opened for business, on December 15th, it already had 200,000 visitors trying to use it. That was because AltaVista successfully met two of the three requirements that later led to Google's success. First, it indexed a much larger portion of the web than anything that had come before. This, says Dr Monier, was because AltaVista used several hundred “spiders” in parallel to index the web, where earlier search engines had used only one. Second, AltaVista was fast, delivering results from its huge index almost instantly. According to Dr Monier, all earlier search engines had been overwhelmed as soon as they became popular. But the AltaVista team had used a modular design right from the start, which enabled them to add computing power as the site's popularity increased. Among some geeks, at least, AltaVista came into use as a verb.

Seek, and Google shall find

Even so, AltaVista still lacked Google's uncanny ability to separate the wheat from the chaff. Experienced users could use its various query options (borrowed from the world of database programming) to find what they were looking for, but most users could not. Although AltaVista's unprecedented reach and speed made it an important step forward, Google's combination of reach, speed and PageRank added up to a giant leap.

When you perform a Google search, you are not actually searching the web, but rather an index of the copy of the web stored on Google's servers. (Google is thought to have several complete copies of the web distributed across servers in California and Virginia.) The index is compiled from all the pages that have been returned by a multitude of spiders that crawl the web, gathering pages, extracting all the links from each page, putting them in a list, sorting the links in the list in order of priority (thus balancing breadth and depth) and then gathering the next page from the list.
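The crawl loop described above (gather a page, extract its links, add them to a list kept in priority order, then take the next page) can be sketched with a priority queue. Everything here is hypothetical: the name `crawl`, the in-memory `toy_web` that stands in for real fetching, and the priority rule that simply prefers addresses with fewer slashes. How Google's spiders actually prioritise links is not stated in the article.

```python
import heapq


def crawl(fetch_links, seeds, priority, limit):
    """Visit pages in priority order, starting from the seed URLs.

    fetch_links(url) returns the links found on that page.
    priority(url) returns a number; lower numbers are crawled first.
    """
    frontier = [(priority(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)  # next page from the sorted list
        visited.append(url)               # "gather" the page
        for link in fetch_links(url):     # extract all of its links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (priority(link), link))
    return visited


# Hypothetical in-memory web standing in for real HTTP requests.
toy_web = {"a": ["b", "c/deep/page"], "b": ["d"], "c/deep/page": [], "d": []}
crawl_order = crawl(
    fetch_links=lambda url: toy_web.get(url, []),
    seeds=["a"],
    priority=lambda url: url.count("/"),  # assumed rule: shallow pages first
    limit=10,
)
```

With this priority rule the crawler visits `a`, `b` and `d` before `c/deep/page`, even though the deep page was discovered earlier. Changing the priority function is what shifts the balance between breadth and depth.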

When a user types in a query, the search terms are looked up in the index (using a variety of techniques to distribute the work across tens of thousands of computers) and the results are then returned from a separate set of document servers (which provide preview “snippets” of matching pages from Google's copies of the web), along with advertisements, which are returned from yet another set of servers. All of these bits are assembled, with the help of PageRank, into the page of search results. Google manages to do this cheaply, in less than a second, using computers built from cheap, off-the-shelf components and linked together in a reliable and speedy way using Google's own clever software. Together, its thousands of machines form an enormous supercomputer, optimised to do one thing—find, sort and extract web-based information—extremely well.
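A much-reduced sketch of the query path might look like this: look the terms up in an index, keep the pages that match every term, order them by score, and attach a snippet to each. The function name `search` and the three dictionaries are assumptions for the example. The real system spreads this work across many machines and adds advertisements, neither of which is modelled here.

```python
def search(query, index, scores, snippets):
    """Return (url, snippet) pairs for pages matching every query term.

    index maps a word to the set of URLs containing it,
    scores maps a URL to its importance score,
    snippets maps a URL to a short preview of the page.
    """
    terms = query.lower().split()
    if not terms:
        return []
    # Keep only the pages that contain every search term.
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    # Most important page first.
    ranked = sorted(matches, key=lambda url: scores.get(url, 0.0), reverse=True)
    return [(url, snippets.get(url, "")) for url in ranked]


# Hypothetical data for three pages.
toy_index = {"digital": {"p1", "p2"}, "cameras": {"p1", "p2", "p3"}}
toy_scores = {"p1": 0.2, "p2": 0.5, "p3": 0.9}
toy_snippets = {"p1": "Cheap digital cameras", "p2": "Digital cameras reviewed"}
results = search("digital cameras", toy_index, toy_scores, toy_snippets)
```

Page `p3` has the highest score but is left out because it does not match "digital"; of the two pages that match both terms, `p2` comes first because its score is higher.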

Mr Page and Mr Brin created the prototype of Google on Stanford's computer systems. However, as visionaries do, they thought ahead clearly, and from the beginning had sound ideas both for searching and for creating the system of servers capable of handling the millions of queries a day that now pass through Google. It was the clarity of their ideas for scaling the server architecture, and their ability to think big, that made it so easy for them to turn their research project into a business. Andy Bechtolsheim, one of the founders of Sun Microsystems and an early investor in Google, did not even wait to hear all the details: when Mr Page and Mr Brin approached him, he reputedly said, “Why don't I just write you a cheque for $100,000?” He wrote the cheque to “Google Inc.”—a firm which did not yet exist. So Mr Page and Mr Brin were forced to incorporate a business very quickly, and the company was born.

What was still missing, though it was unfashionable to worry about it in the early days of the dotcom boom, was a way of making money. Initially, Google sold targeted banner advertisements and also made money by providing search services to other websites, including Yahoo! and a number of other, smaller portals. But, says John Battelle, a professor at the University of California, Berkeley, who is writing a book about search engines, Google's revenues did not really take off until 2000, when it launched AdWords—a system for automatically selling and displaying advertisements alongside search results.

Advertisers bid for particular search terms, and those who bid the highest for a particular term—“digital cameras”, say—have their text advertisements displayed next to Google's search results when a user searches for that term. Google does not simply put the highest bidder's advertisement at the top of the list, however. It also ranks the advertisements according to their popularity, so that if more people click on an advertisement halfway down the list, it will be moved up, even if other advertisers are paying more. Google's philosophy of ranking results according to their usefulness is thus applied to advertisements too.
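One simple way to combine bids with popularity is to multiply each bid by the advertisement's click-through rate and sort on the result. The article does not give Google's formula, so the scoring rule below is an assumption chosen to reproduce the behaviour described; the function name `rank_ads`, the field names and the figures are likewise invented.

```python
def rank_ads(ads):
    """Order ads by bid multiplied by click-through rate (an assumed rule).

    Each ad is a dict with 'name', 'bid', 'clicks' and 'impressions'.
    """
    def score(ad):
        if ad["impressions"] == 0:
            return 0.0  # no display history yet, so no popularity signal
        click_through_rate = ad["clicks"] / ad["impressions"]
        return ad["bid"] * click_through_rate

    return sorted(ads, key=score, reverse=True)


# Hypothetical advertisers bidding on "digital cameras".
toy_ads = [
    {"name": "big spender", "bid": 2.00, "clicks": 10, "impressions": 1000},
    {"name": "crowd favourite", "bid": 1.00, "clicks": 50, "impressions": 1000},
]
ranked_ads = rank_ads(toy_ads)
```

Here the "crowd favourite" bids half as much but is clicked five times as often, so it is placed above the "big spender", matching the article's point that a popular advertisement can move up even if others are paying more.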

The only fly in the ointment, from Google's point of view, was that Overture, a rival firm, claimed to have patented the idea for AdWords-style sponsored links. Overture filed a lawsuit against Google in 2002: it was settled out of court last month when Google agreed to give Yahoo! (which acquired Overture last year) 2.7m shares, worth around $230m, to resolve the matter. Google was eager to settle the AdWords dispute before its initial public offering, which took place on August 19th.

Google now faces a three-way fight with Yahoo! and Microsoft, which have both vowed to dethrone it as the dominant internet search engine. Yahoo!'s strategy is to interconnect its various online services, from search to dating to maps, in increasingly clever ways, while Microsoft's plan is to integrate desktop and internet searching in a seamless manner, so that search facilities will be embedded in all its software, thus doing away (the company hopes) with the need to use Google. Both firms are also working to improve their basic search technology in order to compete with Google.

Beyond searching?

In response, Google has gradually diversified itself, adding specialist discussion groups, news and shopping-related search services, and a free e-mail service, Gmail, which is currently being tested by thousands of volunteers. It has also developed “toolbar” software that can be permanently installed on a PC, allowing web searches to be performed without having to visit the Google website, and establishing a toe-hold on its users' PCs.

Google's technical credentials are not in doubt. The question is whether it can maintain its position, as search, the activity where it is strongest, moves from centre stage to being just part of a bundle of services. Yet the example of Gmail shows how search can form the foundation of other services: rather than sorting mail into separate folders, Gmail users can simply use Google's lightning-fast search facility to find a specific message. So the technology that made Google great could yet prove to be its greatest asset in the fight ahead. Let battle commence.

Thursday, September 16, 2004

New technological frontiers of information actions

Maria Nélida Gonzalez de Gomez, IBICT/DEP
In Ciência da Informação, Vol. 33, No. 1 (2004)

Abstract

Significant changes in the concepts of, and approaches to, information and communication are taking place at the same time as the relationship between the uses of language and the technological mediation of those uses is changing, although the link is not a causal one. The same would be true of the reshaping of information devices by the growing convergence of cultural and reproduction technologies with digital technologies. In this context, one might say that removing the physical constraints on information-transfer actions brings other problems and possibilities of information retrieval and search into relief. The article therefore deals with the communication of information between different communities of interlocution, of which multicultural communication would be one case and perhaps also a paradigm, and asks about the metaphorical uses of language as new categories for understanding and describing information actions.

The full article is worth reading.

The Code of Ethics for Information Professionals in Portugal

Information professionals in Portugal take on the following responsibilities as their own:

1.1. Facilitate access for users of information services to every kind of information published in any medium.

1.2. Build collections suited to the information needs of the services' users, taking a proactive stance so that those needs are anticipated even before they are expressed.

1.3. Select materials by balancing supply and demand, currency and preservation, diversity of subjects, and balance among differing points of view.

1.4. Process all information in a way that facilitates access to it.

1.5. Provide access to the information held in their service.

1.6. State explicitly, when defining the information policy of the service in their charge, that its primary mission is to make information of every kind, in every medium, available to all users.

1.7. Not allow outside interference that could prevent or hinder access to the information available in their services.

1.8. Not allow their personal opinions to interfere with freedom of access to information.

1.9. Oppose the implementation of any technological solution that could limit or manipulate access to information.

1.10. Draft, take part in drafting, know, support and publicise legislation concerning the right of access to information without any interference.

2.1. Use personal data only for the purpose for which they were collected.

2.2. Treat as private data: reading records, loan records, bibliographic queries and any data that identify the users of their services and their activities.

2.3. Not disclose private data, and observe security requirements so that these data cannot be intercepted.

2.4. Ensure that paper or automated records are not left in places easily accessible to other users.

2.5. Take every care that automated records are handled and accessed only by authorised members of their service.

2.6. Ensure that data on users' reading habits or bibliographic interests are collected for the normal operation of the services and can be used only for research or statistical purposes.

2.7. Not inform any user of their services about the tasks carried out by another user.

2.8. Regard as abusive any request for information intended to violate a user's privacy.

2.9. If, for any reason, they are pressed to supply private information, professionals may do so only with the prior written authorisation of the users who provided it.

3.1. Ensure competent professional performance.

3.2. Regard their sense of duty towards the users of information services as their central duty.

3.3. Increase public awareness of the possibilities inherent in the service they perform and of the services they make available.

3.4. Pursue continuing professional development, supporting colleagues who wish to do the same.

3.5. Support all professional standards whose aim is to foster professional competence.

3.6. Place the information needs of the services' users and of the general public above their own interests and those of the organisation in which they work.

3.7. Inform their employers, service managers, colleagues and users of any conflicts of interest that may arise in the course of their professional activity.

3.8. Contribute to the definition of an information policy.

3.9. Promote, through their conduct, public confidence in the correctness of procedures and in professional efficiency.

3.10. Maintain the confidentiality of information within the organisations in which they work. This obligation continues even after the employment relationship has ended.

3.11. Be aware of the scope of their professional activity, not giving of themselves, or of the organisation to which they belong, an image that goes beyond the limits of their professional specialism.

3.12. Establish fair contracts, both with the users of their services and with suppliers, and in no way allow their personal interests to benefit from those contracts.

3.13. Conduct their relations with the services' users in such a way that their behaviour is objective and impartial.

3.14. Make sure that the information supplied to users is adequate, complete and clearly presented.

3.15. Accept responsibility for the quality of their work and for the consequences of errors made through carelessness.

3.16. Supply the best possible information according to users' needs, or indicate the most appropriate service in which to find it.

3.17. Acquire training that matches the concrete needs of good professional performance.

3.18. Regard staying up to date as an essential part of professional ethics.

3.19. Fill gaps in their training, keeping their knowledge of professional practice current, with an active attitude of seeking out professionally relevant knowledge.

3.20. Contribute to the development of scientific research in the information sciences.

3.21. Exchange professional information through professional associations, by giving information, publishing articles or books, or proposing training initiatives.

3.22. Support participation in courses, seminars, conferences or any other activities that broaden the range of professional knowledge.

3.23. Share knowledge between professionals and users of information services, so as to increase the effectiveness of the profession.

3.24. Inform the public of the professional activities carried out in this field.

